Elasticsearch Part 6: ES Analyzers and Tokenization

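Analyzers and Tokenization

Analysis and Analyzer

Analysis: text analysis is the process of converting full text into a series of words (terms/tokens). It is also called tokenization.

Analysis is carried out by an analyzer.

You can use one of Elasticsearch's built-in analyzers or define a custom analyzer as needed.

Terms are not only produced when data is written. When a query is matched, the query string generally has to be analyzed with the same analyzer, so that its terms line up with the terms stored in the index.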
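
Why the query has to go through the same analysis can be seen in a small sketch. The Python below is a toy inverted index, not Elasticsearch code; the analyze and search helpers and the two sample documents are hypothetical and exist only for illustration.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer (an assumption): lowercase, then split into word runs."""
    return re.findall(r"\w+", text.lower())


# Two hypothetical documents, indexed through the toy analyzer.
docs = {1: "Mastering Elasticsearch", 2: "Elasticsearch in Action"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in analyze(text):
        index[term].add(doc_id)


def search(query, analyzed=True):
    """Return ids of documents containing every query term."""
    terms = analyze(query) if analyzed else [query]
    hits = [index.get(term, set()) for term in terms]
    return set.intersection(*hits) if hits else set()


print(search("Elasticsearch"))                  # {1, 2}
print(search("Elasticsearch", analyzed=False))  # set()
```

The second call finds nothing because the index only holds the lowercased term elasticsearch, while the raw query term still starts with a capital E.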

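Components of an Analyzer

An analyzer is the component dedicated to text analysis. It is made up of three parts:

Character Filters (process the raw text, for example by stripping HTML)
Tokenizer (splits the text into words according to its rules)
Token Filters (post-process the resulting words: lowercase them, remove stop words, add synonyms)

Example:

Passing "Mastering Elasticsearch & Elasticsearch in Action" through these steps (with lowercasing, stop-word removal, and stemming among the token filters) yields the following distinct terms: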

master
elasticsearch
action
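
The three stages above can be sketched in plain Python. This is only an illustration under stated assumptions: the stop-word list, the regex-based HTML stripping, and the "strip a trailing ing" stemmer are hypothetical stand-ins, not what Elasticsearch actually runs, and the <b> tags in the sample input are added here just to give the character filter something to do.

```python
import re

# Tiny illustrative stop-word list (an assumption, not Elasticsearch's list).
STOP_WORDS = {"a", "an", "and", "in", "is", "the"}


def char_filter(text):
    """Character filter: replace anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: keep runs of word characters, dropping symbols such as '&'."""
    return re.findall(r"\w+", text)


def token_filters(tokens):
    """Token filters: lowercase, drop stop words, apply a naive stemmer."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS:
            continue
        if tok.endswith("ing") and len(tok) > 5:
            tok = tok[:-3]  # crude suffix stripping; real stemmers do much more
        out.append(tok)
    return out


def analyze(text):
    """Run the three stages in order."""
    return token_filters(tokenize(char_filter(text)))


terms = analyze("<b>Mastering</b> Elasticsearch & Elasticsearch in Action")
print(terms)                       # ['master', 'elasticsearch', 'elasticsearch', 'action']
print(list(dict.fromkeys(terms)))  # ['master', 'elasticsearch', 'action']
```

Note that the pipeline itself emits elasticsearch twice, once per occurrence; the three-item list is what remains after collapsing duplicates.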

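Built-in ES Analyzers

Standard Analyzer - the default analyzer; splits on word boundaries and lowercases
Simple Analyzer - splits on non-letter characters (symbols are dropped) and lowercases
Stop Analyzer - lowercases and removes stop words (the, a, is)
Whitespace Analyzer - splits on whitespace; does not lowercase
Keyword Analyzer - no tokenization; the input is emitted unchanged as a single term
Pattern Analyzer - splits with a regular expression; the default is \W+ (runs of non-word characters)
Language - analyzers for more than 30 common languages
Custom Analyzer - an analyzer you define yourself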
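
To make the differences between these analyzers concrete, here is a rough Python approximation of five of them. The function names are hypothetical, the stop-word set is a small illustrative subset, and the regexes only mimic the behavior described in the list above; real Elasticsearch output can differ, especially for non-ASCII text.

```python
import re

# Small illustrative subset of English stop words (an assumption).
STOP_WORDS = {"a", "an", "and", "in", "is", "the"}


def simple(text):
    """Split on anything that is not a letter, and lowercase."""
    return re.findall(r"[^\W\d_]+", text.lower())


def stop(text):
    """Like simple, then remove stop words."""
    return [t for t in simple(text) if t not in STOP_WORDS]


def whitespace(text):
    """Split on whitespace only; case and symbols are preserved."""
    return text.split()


def keyword(text):
    """Emit the whole input as a single term."""
    return [text]


def pattern(text, regex=r"\W+"):
    """Split on a regular expression (default: non-word runs), and lowercase."""
    return [t for t in re.split(regex, text.lower()) if t]


text = "Mastering Elasticsearch & Elasticsearch in Action"
print(simple(text))      # ['mastering', 'elasticsearch', 'elasticsearch', 'in', 'action']
print(stop(text))        # ['mastering', 'elasticsearch', 'elasticsearch', 'action']
print(whitespace(text))  # ['Mastering', 'Elasticsearch', '&', 'Elasticsearch', 'in', 'Action']
print(keyword(text))     # ['Mastering Elasticsearch & Elasticsearch in Action']

# Digits separate simple from pattern: simple keeps letters only.
print(simple("Wi-Fi 6E"))   # ['wi', 'fi', 'e']
print(pattern("Wi-Fi 6E"))  # ['wi', 'fi', '6e']
```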

Using the _analyze API

Testing with a specified analyzer

Request

GET /_analyze
{
  "analyzer": "standard",
  "text": "Mastering Elasticsearch & Elasticsearch in Action"
}

Response

{
  "tokens" : [
    {"token" : "mastering", "start_offset" : 0, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 0},
    {"token" : "elasticsearch", "start_offset" : 10, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 1},
    {"token" : "elasticsearch", "start_offset" : 26, "end_offset" : 39, "type" : "<ALPHANUM>", "position" : 2},
    {"token" : "in", "start_offset" : 40, "end_offset" : 42, "type" : "<ALPHANUM>", "position" : 3},
    {"token" : "action", "start_offset" : 43, "end_offset" : 49, "type" : "<ALPHANUM>", "position" : 4}
  ]
}

Testing against a field of an index

Request

POST test_home/_analyze
{
  "field": "job_name",
  "text": "Mastering Elasticsearch, Elasticsearch in Action"
}

Response

{
  "tokens" : [
    {"token" : "mastering", "start_offset" : 0, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 0},
    {"token" : "elasticsearch", "start_offset" : 10, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 1},
    {"token" : "elasticsearch", "start_offset" : 25, "end_offset" : 38, "type" : "<ALPHANUM>", "position" : 2},
    {"token" : "in", "start_offset" : 39, "end_offset" : 41, "type" : "<ALPHANUM>", "position" : 3},
    {"token" : "action", "start_offset" : 42, "end_offset" : 48, "type" : "<ALPHANUM>", "position" : 4}
  ]
}

Testing a custom combination of tokenizer and filters

Request

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Mastering Elasticsearch, Elasticsearch in Action"
}

Response

{
  "tokens" : [
    {"token" : "mastering", "start_offset" : 0, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 0},
    {"token" : "elasticsearch", "start_offset" : 10, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 1},
    {"token" : "elasticsearch", "start_offset" : 25, "end_offset" : 38, "type" : "<ALPHANUM>", "position" : 2},
    {"token" : "in", "start_offset" : 39, "end_offset" : 41, "type" : "<ALPHANUM>", "position" : 3},
    {"token" : "action", "start_offset" : 42, "end_offset" : 48, "type" : "<ALPHANUM>", "position" : 4}
  ]
}
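
The same requests can also be issued from a script. The sketch below uses only the Python standard library; the helper names (analyze_request, send, check_offsets) are hypothetical, and the base URL http://localhost:9200 assumes a local cluster without authentication. The sample response is copied from the result above with the type and position fields trimmed, and the offset check assumes plain ASCII text as in this example.

```python
import json
import urllib.request


def analyze_request(base_url, body, index=None):
    """Build (but do not send) a POST request for the _analyze endpoint."""
    path = f"/{index}/_analyze" if index else "/_analyze"
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON reply (needs a reachable cluster)."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


def check_offsets(text, response):
    """Return the tokens, checking that each offset pair slices the source text."""
    tokens = []
    for tok in response["tokens"]:
        surface = text[tok["start_offset"]:tok["end_offset"]]
        assert surface.lower() == tok["token"], (surface, tok)
        tokens.append(tok["token"])
    return tokens


text = "Mastering Elasticsearch, Elasticsearch in Action"
body = {"tokenizer": "standard", "filter": ["lowercase"], "text": text}
req = analyze_request("http://localhost:9200", body)  # assumed local cluster

# Response as shown above, trimmed to the fields used here.
sample = {"tokens": [
    {"token": "mastering", "start_offset": 0, "end_offset": 9},
    {"token": "elasticsearch", "start_offset": 10, "end_offset": 23},
    {"token": "elasticsearch", "start_offset": 25, "end_offset": 38},
    {"token": "in", "start_offset": 39, "end_offset": 41},
    {"token": "action", "start_offset": 42, "end_offset": 48},
]}
print(check_offsets(text, sample))
# ['mastering', 'elasticsearch', 'elasticsearch', 'in', 'action']
```

With a cluster running, check_offsets(text, send(req)) would perform the same check against a live response. Passing index="test_home" to analyze_request targets the index-scoped endpoint used in the field test.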