Analyzers and Tokenization
Analysis and Analyzer
Analysis - text analysis is the process of converting full text into a stream of words (terms/tokens); it is also called tokenization
Analysis is carried out by an Analyzer
- You can use Elasticsearch's built-in analyzers, or customize one as needed
Besides converting terms when data is written, the same analyzer must also be used to analyze the query string when matching Query clauses
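For example, both sides can be declared on a field in the mapping: analyzer is applied at index time, and search_analyzer (which defaults to analyzer) is applied to query strings. A minimal sketch, where my_index and title are placeholder names:

PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "standard"
      }
    }
  }
}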
Components of an Analyzer
An analyzer is the component dedicated to tokenization; it consists of three parts:
- Character Filters (preprocess the raw text, e.g. strip HTML tags)
- Tokenizer (split the text into terms according to rules)
- Token Filters (post-process the resulting terms: lowercase them, remove stopwords, add synonyms)
Example:
Feeding Mastering Elasticsearch & Elasticsearch in Action through the steps above (here with stemming and stopword removal as well, as the built-in english analyzer does; see the call after this list) produces:
- master
- elasticsearch
- action
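To reproduce this, you can run the english analyzer through the _analyze API: it lowercases, stems Mastering to master, discards the &, and drops the stopword in (the exact token list may vary slightly across versions):

GET /_analyze
{
  "analyzer": "english",
  "text": "Mastering Elasticsearch & Elasticsearch in Action"
}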
The built-in analyzers in ES (a quick comparison sketch follows this list):
- Standard Analyzer - the default; splits on word boundaries, lowercases
- Simple Analyzer - splits on non-letter characters (symbols are filtered out), lowercases
- Stop Analyzer - lowercases and removes stopwords (the, a, is)
- Whitespace Analyzer - splits on whitespace, does not lowercase
- Keyword Analyzer - no tokenization; the input is emitted as a single term
- Pattern Analyzer - splits by regular expression, default \W+ (non-word characters)
- Language Analyzers - analyzers for 30+ common languages
- Custom Analyzer - an analyzer you assemble yourself
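Running one sample text through two of these side by side makes the differences concrete: whitespace should keep Quick-Brown intact with its original case, while standard should yield quick, brown, foxes (the sample text is just an illustration):

GET /_analyze
{
  "analyzer": "whitespace",
  "text": "Quick-Brown FOXES"
}

GET /_analyze
{
  "analyzer": "standard",
  "text": "Quick-Brown FOXES"
}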
Using the _analyze API
Testing by specifying an analyzer directly
Request
GET /_analyze
{
  "analyzer": "standard",
  "text": "Mastering Elasticsearch & Elasticsearch in Action"
}
Response (each token carries its character offsets in the original text and its position in the token stream)
{"tokens" : [{"token" : "mastering","start_offset" : 0,"end_offset" : 9,"type" : "<ALPHANUM>","position" : 0},{"token" : "elasticsearch","start_offset" : 10,"end_offset" : 23,"type" : "<ALPHANUM>","position" : 1},{"token" : "elasticsearch","start_offset" : 25,"end_offset" : 38,"type" : "<ALPHANUM>","position" : 2},{"token" : "in","start_offset" : 39,"end_offset" : 41,"type" : "<ALPHANUM>","position" : 3},{"token" : "action","start_offset" : 42,"end_offset" : 48,"type" : "<ALPHANUM>","position" : 4}]
}
Testing against a field of a specific index
Request
POST test_home/_analyze
{
  "field": "job_name",
  "text": "Mastering Elasticsearch, Elasticsearch in Action"
}
Response
{"tokens" : [{"token" : "mastering","start_offset" : 0,"end_offset" : 9,"type" : "<ALPHANUM>","position" : 0},{"token" : "elasticsearch","start_offset" : 10,"end_offset" : 23,"type" : "<ALPHANUM>","position" : 1},{"token" : "elasticsearch","start_offset" : 25,"end_offset" : 38,"type" : "<ALPHANUM>","position" : 2},{"token" : "in","start_offset" : 39,"end_offset" : 41,"type" : "<ALPHANUM>","position" : 3},{"token" : "action","start_offset" : 42,"end_offset" : 48,"type" : "<ALPHANUM>","position" : 4}]
}
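Field-level testing requires the index and mapping to exist beforehand; whatever analyzer the field is mapped with is the one exercised. A minimal setup for the test_home example might look like this (the choice of the standard analyzer here is an assumption, not from the original notes):

PUT test_home
{
  "mappings": {
    "properties": {
      "job_name": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}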
Testing an ad-hoc combination of tokenizer and token filters
Request
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Mastering Elasticsearch, Elasticsearch in Action"
}
Response
{"tokens" : [{"token" : "mastering","start_offset" : 0,"end_offset" : 9,"type" : "<ALPHANUM>","position" : 0},{"token" : "elasticsearch","start_offset" : 10,"end_offset" : 23,"type" : "<ALPHANUM>","position" : 1},{"token" : "elasticsearch","start_offset" : 25,"end_offset" : 38,"type" : "<ALPHANUM>","position" : 2},{"token" : "in","start_offset" : 39,"end_offset" : 41,"type" : "<ALPHANUM>","position" : 3},{"token" : "action","start_offset" : 42,"end_offset" : 48,"type" : "<ALPHANUM>","position" : 4}]
}
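The same API also accepts character filters, so a full pipeline can be prototyped before it is baked into index settings. A sketch (not from the original notes) that strips HTML, lowercases, and removes English stopwords, expected to yield mastering, elasticsearch, action:

POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "<b>Mastering Elasticsearch</b> in Action"
}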
The difficulties of Chinese tokenization
A Chinese sentence must be split into words, not into individual characters
In English, words are naturally delimited by spaces
The same Chinese sentence can be understood differently in different contexts
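The first point is easy to demonstrate: the standard analyzer has no notion of Chinese words and falls back to emitting one token per character:

GET /_analyze
{
  "analyzer": "standard",
  "text": "他说的确实在理"
}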
Some recommended Chinese analyzers (see the ik example after this list):
- icu (the ICU Analysis plugin)
- ik : https://github.com/medcl/elasticsearch-analysis-ik
- thulac : https://github.com/microbun/elasticsearch-thulac-plugin
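With one of these installed, the same sentence comes back segmented into words. For instance, the ik plugin registers the ik_smart and ik_max_word analyzers (this assumes the plugin has been installed; the sentence is the sample from above):

GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "他说的确实在理"
}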
Source: study notes from the Geek Time (极客时间) Elasticsearch course