Elasticsearch (11): Implementing Search Suggestions with the ngram Tokenization Mechanism


Reposted from Jianshu. Original article: "Implementing Search Suggestions in Elasticsearch with the ngram Tokenization Mechanism"

1. What is an ngram

Take the English word quick. Its ngrams at each of the 5 possible lengths are:

ngram length=1: q u i c k
ngram length=2: qu ui ic ck
ngram length=3: qui uic ick
ngram length=4: quic uick
ngram length=5: quick
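
The table above can be reproduced with a few lines of Python. This is only a minimal sketch of the idea, not Elasticsearch's own implementation; the helper name ngrams is made up for illustration.

```python
def ngrams(word: str, n: int) -> list[str]:
    """Return every contiguous substring of length n, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Print the ngrams of "quick" for every length from 1 to 5.
for n in range(1, 6):
    print(f"ngram length={n}:", " ".join(ngrams("quick", n)))
```

A length greater than the word itself simply yields an empty list in this sketch.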

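2. What is an edge ngram

For the word quick, an edge ngram is an ngram anchored at the first letter, so every ngram is a prefix of the word: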

q
qu
qui
quic
quick

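With edge ngram, every word is split further into its prefixes at index time, and those prefixes are then used to implement prefix-based search suggestions.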

doc1: hello world
doc2: hello we

h
he
hel
hell
hello    doc1,doc2
w        doc1,doc2
wo
wor
worl
world
we       doc2

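For example, search for hello w.

Both doc1 and doc2 match hello and w, and the positions also line up, so doc1 and doc2 are both returned.

At search time there is no longer any need to take a prefix and scan the whole inverted index for terms that start with it. The prefix is simply looked up in the inverted index as an ordinary term, and if it is found, the job is done.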
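
The lookup described above can be sketched in Python as a toy inverted index. Everything here is an assumption made for illustration: edge_ngrams, build_index and phrase_search are hypothetical helpers, words are split on whitespace only, and the position check is a simplified stand-in for what a phrase query does. It is not how Elasticsearch is implemented internally.

```python
from collections import defaultdict


def edge_ngrams(word: str) -> list[str]:
    """Prefixes of word, from length 1 up to the full word."""
    return [word[:i] for i in range(1, len(word) + 1)]


def build_index(docs: dict[str, str]) -> dict[str, dict[str, set[int]]]:
    """Map term -> doc id -> positions; every prefix keeps its word's position."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            for term in edge_ngrams(word):
                index[term][doc_id].add(position)
    return index


def phrase_search(index, query: str) -> list[str]:
    """Return ids of docs where the query terms occur at consecutive positions."""
    terms = query.lower().split()  # search side: plain split, no ngrams
    if not terms or any(t not in index for t in terms):
        return []
    hits = []
    for doc_id, starts in index[terms[0]].items():
        # The k-th query term must appear k positions after the first one.
        if any(all(start + k in index[t].get(doc_id, ()) for k, t in enumerate(terms))
               for start in starts):
            hits.append(doc_id)
    return sorted(hits)


docs = {"doc1": "hello world", "doc2": "hello we"}
index = build_index(docs)
print(phrase_search(index, "hello w"))
```

Each query term is a direct dictionary lookup, which is the point of the technique: the cost of prefix matching is paid once at index time instead of on every search.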

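3. Minimum and maximum parameters

min ngram = 1
max ngram = 3

These set the shortest and longest ngram to generate (here, at least 1 character and at most 3). In the edge_ngram filter they correspond to the min_gram and max_gram settings.

For example, take the word helloworld.

The generated edge ngrams are:

h
he
hel

Generation stops once the maximum of three characters is reached.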
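
The effect of the two limits can be sketched in Python. The parameter names min_gram and max_gram mirror the filter settings, but the function itself is a hypothetical helper written for illustration, not Elasticsearch code.

```python
def edge_ngrams(word: str, min_gram: int = 1, max_gram: int = 3) -> list[str]:
    """Prefixes of word whose length lies between min_gram and max_gram."""
    longest = min(max_gram, len(word))  # a prefix cannot be longer than the word
    return [word[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams("helloworld", min_gram=1, max_gram=3))
```

With max_gram set to 3, a query term longer than three characters, such as hell, has no matching indexed term, which is presumably why the experiment in the next section uses a much larger max_gram of 20.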

4. Trying out ngram

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  }
}

PUT /my_index/_mapping/my_type
{
  "properties": {
    "title": {
      "type": "string",
      "analyzer": "autocomplete",
      "search_analyzer": "standard"
    }
  }
}

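Note the search_analyzer here: why is it standard rather than autocomplete?

Because at search time there is no need to break the query down letter by letter again. For a search such as hello w, it is enough to split the query into hello and w and look those up directly. There is no point in expanding it into all of the following: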

h
he
hel
hell
hello
w

Expanding the query like this would only make the search less efficient.
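
To make the difference concrete, here is a rough Python simulation of the terms each analyzer would produce for the same query. Both functions are hypothetical stand-ins: lowercasing plus whitespace splitting only approximates the standard tokenizer, and the min_gram and max_gram defaults are copied from the settings above.

```python
def index_terms(text: str, min_gram: int = 1, max_gram: int = 20) -> list[str]:
    """Rough stand-in for the autocomplete analyzer: lowercase, split, edge ngrams."""
    terms = []
    for word in text.lower().split():
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            terms.append(word[:n])
    return terms


def search_terms(text: str) -> list[str]:
    """Rough stand-in for the standard analyzer: lowercase and split on whitespace."""
    return text.lower().split()


query = "hello w"
print(search_terms(query))  # terms looked up when search_analyzer is standard
print(index_terms(query))   # terms that autocomplete would generate for the same query
```

In this sketch the standard-style analysis yields two terms to look up, while the autocomplete-style analysis yields six for the same query.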

Insert 4 documents

PUT /my_index/my_type/1
{
  "title": "hello world"
}

PUT /my_index/my_type/2
{
  "title": "hello we"
}

PUT /my_index/my_type/3
{
  "title": "hello win"
}

PUT /my_index/my_type/4
{
  "title": "hello dog"
}

Run the search

GET /my_index/my_type/_search
{
  "query": {
    "match_phrase": {
      "title": "hello w"
    }
  }
}

Result

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.1983768,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1.1983768,
        "_source": {
          "title": "hello we"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.8271048,
        "_source": {
          "title": "hello world"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 0.797104,
        "_source": {
          "title": "hello win"
        }
      }
    ]
  }
}