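Elasticsearch Analyzer: Built-in Analyzers Explained with Examples

桦艺 · Published on 2023-4-19 14:06:12

Background

This article covers how an Elasticsearch analyzer is put together, which analyzers are built into Elasticsearch, and how to use them.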


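Elasticsearch provides the analyze API, which lets you pick an analyzer and run it against a piece of input text. It is a quick way to learn about analyzers and experiment with them.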

POST _analyze
{
"analyzer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
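The bracketed list above is a condensed view of the result. The analyze API actually returns a JSON object with a `tokens` array, where each entry carries the token text along with its offsets, type, and position. As a minimal sketch, the helper below reduces such a response to the short form used in this article. The names `token_texts` and `sample_response` are made up for illustration, and the sample response is hand-written and truncated to two tokens rather than captured from a cluster.

```python
from typing import Any


def token_texts(analyze_response: dict[str, Any]) -> list[str]:
    """Return only the token strings from an _analyze response body."""
    return [entry["token"] for entry in analyze_response.get("tokens", [])]


# Hand-written sample in the shape returned by the analyze API (two tokens only).
sample_response = {
    "tokens": [
        {"token": "the", "start_offset": 0, "end_offset": 3,
         "type": "<ALPHANUM>", "position": 0},
        {"token": "2", "start_offset": 4, "end_offset": 5,
         "type": "<NUM>", "position": 1},
    ]
}

print(token_texts(sample_response))
```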


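1. Analyzer

Analysis is one of the most important concepts in Elasticsearch: full-text search is built on analysis combined with the inverted index. This section looks at what analysis is and how it is done.
An analyzer is the component dedicated to analysis. As in much middleware design, each component has a clearly defined job, following the single responsibility principle, which makes later changes and extensions easier.
An analyzer is made up of three parts: Character Filters, a Tokenizer, and Token Filters.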
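To make the three-part structure concrete, here is a minimal Python sketch of how the parts chain together: character filters rewrite the raw string, a single tokenizer splits it, and token filters post-process the tokens. This is only an illustration of the data flow, not how Elasticsearch implements it. The function `analyze` and the three toy components are invented for this example.

```python
from typing import Callable, Iterable

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], list[str]]
TokenFilter = Callable[[list[str]], list[str]]


def analyze(text: str,
            char_filters: Iterable[CharFilter],
            tokenizer: Tokenizer,
            token_filters: Iterable[TokenFilter]) -> list[str]:
    # 1. Character filters rewrite the raw string before tokenization.
    for char_filter in char_filters:
        text = char_filter(text)
    # 2. Exactly one tokenizer splits the string into tokens.
    tokens = tokenizer(text)
    # 3. Token filters run in order and may change or drop tokens.
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


# Toy components, invented for this illustration.
def ampersand_to_and(text: str) -> str:
    return text.replace("&", " and ")


def whitespace_tokenizer(text: str) -> list[str]:
    return text.split()


def lowercase(tokens: list[str]) -> list[str]:
    return [token.lower() for token in tokens]


print(analyze("Tom & Jerry", [ampersand_to_and], whitespace_tokenizer, [lowercase]))
# ['tom', 'and', 'jerry']
```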

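When analysis happens:

Analysis is applied in two situations: at index time, when a text field of a document is indexed, and at search time, when a full-text query runs against a text field.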

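2. Elasticsearch built-in analyzers

Elasticsearch ships with quite a few built-in analyzers. The sections below cover the Standard, Simple, Stop, Whitespace, Keyword, Pattern, and Language analyzers, followed by custom analyzers.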

3. Standard Analyzer

Standard is the default analyzer in Elasticsearch. It splits text according to the Unicode Text Segmentation algorithm.

POST _analyze
{
"analyzer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]


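3.1 Definition

The standard analyzer combines the standard tokenizer with a lowercase token filter and a stop token filter that removes stop words. The stop filter is disabled by default, which is why "the" still appears in the output above.
Tokenizer
Standard Tokenizer

Token Filters
Lower Case Token Filter
Stop Token Filter (disabled by default)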

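3.2 Configuration

max_token_length	Maximum token length. A token longer than this is split at max_token_length intervals. Defaults to 255.
stopwords	A predefined stop word list such as _english_, or an array of stop words. Defaults to _none_.
stopwords_path	Path to a file containing stop words.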

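3.3 Experiment

// A custom analyzer based on the standard analyzer.
// max_token_length is the maximum token length in characters; longer tokens are split at that interval.
// "stopwords": "_english_" turns on stop word filtering with the built-in English list.
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_english_analyzer": {
"type": "standard",
"max_token_length": 5,
"stopwords": "_english_"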
}
}
}
}
}
GET my_index/_analyze
{
"analyzer": "my_english_analyzer",
"text": "The hellogoodname jack"
}
// Tokens longer than 5 characters are split, and the stop word "the" is removed
["hello", "goodn", "ame", "jack"]
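The way max_token_length cuts up "hellogoodname" can be reproduced with plain string slicing. The sketch below only mimics the splitting step for a single token; it is not the standard tokenizer itself, and `split_by_max_length` is a hypothetical helper name.

```python
def split_by_max_length(token: str, max_token_length: int) -> list[str]:
    """Cut a token into consecutive pieces of at most max_token_length characters."""
    return [token[start:start + max_token_length]
            for start in range(0, len(token), max_token_length)]


print(split_by_max_length("hellogoodname", 5))  # ['hello', 'goodn', 'ame']
print(split_by_max_length("jack", 5))           # ['jack']
```

Tokens that already fit within the limit, such as "jack", come through unchanged.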


4. Simple Analyzer

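The simple analyzer splits the text whenever it meets a character that is not a letter, and lowercases every token. It does this with the lowercase tokenizer.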

POST _analyze
{
"analyzer": "simple",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]


4.1 Definition

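Tokenizer
Lower Case Tokenizer

4.2 Configuration

The simple analyzer has no configuration parameters.

4.3 Experiment

The simple analyzer is equivalent to the following definition, which can serve as a starting point for customization: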

PUT /simple_example
{
"settings": {
"analysis": {
"analyzer": {
"rebuilt_simple": {
"tokenizer": "lowercase",
"filter": [
]
}
}
}
}
}
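For intuition, the behavior of the lowercase tokenizer can be approximated in a few lines of Python: take every run of letters and lowercase it. This is a rough emulation under the assumption that "letter" means what Python's `re` module treats as a word character minus digits and underscore, which may differ from Lucene for unusual Unicode characters. `simple_analyze` is a hypothetical name.

```python
import re

# A run of letters: word characters excluding digits and the underscore.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def simple_analyze(text: str) -> list[str]:
    """Approximate the simple analyzer: split on non-letters, then lowercase."""
    return [run.lower() for run in LETTER_RUN.findall(text)]


print(simple_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['the', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```

The digit 2 disappears and dog's becomes two tokens because neither a digit nor an apostrophe counts as a letter.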


5. Stop Analyzer

The stop analyzer works like the simple analyzer, with one addition: a token filter that removes stop words. By default it uses the English stop word list.

POST _analyze
{
"analyzer": "stop",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
// The text is split on non-letters and lowercased, then the stop words are removed
[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]


5.1 Definition

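Tokenizer
Lower Case Tokenizer

Token filters
Stop Token Filter

5.2 Configuration

stopwords	A predefined stop word list such as _english_, or an array of stop words. Defaults to _english_.
stopwords_path	Path to a file containing stop words.

5.3 Experiment

The following definition rebuilds the stop analyzer: the lowercase tokenizer splits and lowercases the text first, then the stop filter removes the stop words.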

PUT /stop_example
{
"settings": {
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
}
},
"analyzer": {
"rebuilt_stop": {
"tokenizer": "lowercase",
"filter": [
"english_stop"
]
}
}
}
}
}


Set the stopwords parameter to specify the list of stop words to filter out:

PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_stop_analyzer": {
"type": "stop",
"stopwords": ["the", "over"]
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_stop_analyzer",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
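Both stop analyzer results in this section can be reproduced with a small emulation: split on non-letters, lowercase, then drop anything in the stop word set. Treat the sketch below as an approximation. `stop_analyze` is a hypothetical name, and `ENGLISH_STOP_WORDS` is the English list as I understand it from Lucene, so check it against your version before relying on it.

```python
import re

# Assumed contents of the built-in _english_ stop word list.
ENGLISH_STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
})


def stop_analyze(text: str, stopwords=ENGLISH_STOP_WORDS) -> list[str]:
    """Approximate the stop analyzer: letter runs, lowercased, minus stop words."""
    tokens = [run.lower() for run in re.findall(r"[^\W\d_]+", text)]
    return [token for token in tokens if token not in stopwords]


text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(stop_analyze(text))                   # default English list
print(stop_analyze(text, {"the", "over"}))  # custom list, as in my_stop_analyzer
```

With the default list only "the" is dropped from this sentence; with the custom list "over" goes as well.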


6. Whitespace Analyzer

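The whitespace analyzer does what its name says: it splits the text wherever it meets whitespace. It does not lowercase the tokens.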

POST _analyze
{
"analyzer": "whitespace",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]


6.1 Definition

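Tokenizer
Whitespace Tokenizer

6.2 Configuration

The whitespace analyzer has no configuration parameters.

6.3 Experiment

The whitespace analyzer is equivalent to the following definition. Token filters can be added to the filter array as needed: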

PUT /whitespace_example
{
"settings": {
"analysis": {
"analyzer": {
"rebuilt_whitespace": {
"tokenizer": "whitespace",
"filter": [
]
}
}
}
}
}
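As a rough illustration of what adding a filter to the empty filter array changes, the sketch below emulates the whitespace tokenizer with Python's `str.split()` and optionally applies token filters afterwards. Both `rebuilt_whitespace` and `lowercase` are hypothetical names, and `str.split()` is only an approximation of the tokenizer's notion of whitespace.

```python
from typing import Callable, Sequence

TokenFilter = Callable[[list[str]], list[str]]


def lowercase(tokens: list[str]) -> list[str]:
    return [token.lower() for token in tokens]


def rebuilt_whitespace(text: str, filters: Sequence[TokenFilter] = ()) -> list[str]:
    """Split on whitespace, then apply any token filters in order."""
    tokens = text.split()
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(rebuilt_whitespace(text))               # case and punctuation preserved
print(rebuilt_whitespace(text, [lowercase]))  # same tokens, lowercased
```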


7. Keyword Analyzer

The keyword analyzer is a special case: it does not split the text at all. Whatever goes in comes out as a single token.

POST _analyze
{
"analyzer": "keyword",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
// Note that the text is not split; it is returned unchanged as one token
[ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]


7.1 Definition

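Tokenizer
Keyword Tokenizer

7.2 Configuration

The keyword analyzer has no configuration parameters.

7.3 Experiment

The keyword analyzer is equivalent to the following definition: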

PUT /keyword_example
{
"settings": {
"analysis": {
"analyzer": {
"rebuilt_keyword": {
"tokenizer": "keyword",
"filter": [
]
}
}
}
}
}


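8. Pattern Analyzer

The pattern analyzer splits text with a regular expression. Note that the regular expression matches the separators between tokens, not the tokens themselves. The default pattern is \W+, that is, any run of non-word characters.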

POST _analyze
{
"analyzer": "pattern",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
// The default pattern is \W+, so the text is split on runs of non-word characters
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]


8.1 Definition

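Tokenizer
Pattern Tokenizer

Token Filters
Lower Case Token Filter
Stop Token Filter (disabled by default)

8.2 Configuration

pattern	A Java regular expression that matches the token separators. Defaults to \W+.
flags	Java regular expression flags, pipe-separated, for example "CASE_INSENSITIVE|COMMENTS".
lowercase	Whether tokens are lowercased. Defaults to true.
stopwords	A predefined stop word list such as _english_, or an array of stop words. Defaults to _none_, meaning no stop words are filtered.
stopwords_path	Path to a file containing stop words.

8.3 Experiment

The pattern analyzer is equivalent to the following definition: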

PUT /pattern_example
{
"settings": {
"analysis": {
"tokenizer": {
"split_on_non_word": {
"type": "pattern",
"pattern": "\\W+"
}
},
"analyzer": {
"rebuilt_pattern": {
"tokenizer": "split_on_non_word",
"filter": [
"lowercase"
]
}
}
}
}
}
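The point that the pattern describes separators rather than tokens is easy to see with a regex split. The sketch below approximates the rebuilt pattern analyzer using Python's `re.split`. Note the assumptions: Python's regex dialect is not identical to Java's, `pattern_analyze` is a hypothetical name, and empty pieces are dropped by hand.

```python
import re


def pattern_analyze(text: str, pattern: str = r"\W+", lowercase: bool = True) -> list[str]:
    """Split on every match of the separator pattern, then optionally lowercase."""
    pieces = [piece for piece in re.split(pattern, text) if piece]
    return [piece.lower() for piece in pieces] if lowercase else pieces


print(pattern_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['the', '2', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
print(pattern_analyze("a,b,,c", pattern=","))
# ['a', 'b', 'c']
```

Unlike the simple analyzer, the digit 2 survives here, because digits are word characters and so never match \W+.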


9. Language Analyzer

Elasticsearch also provides language-specific analyzers, including the ones listed below. english is among them.
arabic, armenian, basque, bengali, bulgarian, catalan, czech, dutch, english, finnish, french, galician, german, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, portuguese, romanian, russian, sorani, spanish, swedish, turkish.

GET _analyze
{
"analyzer": "english",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
// Besides removing stop words, the english analyzer stems each token and strips the possessive 's
[ 2, quick, brown, fox, jump, over, lazi, dog, bone ]


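10. Custom Analyzer

There is not much to add here: when the built-in analyzers do not meet your needs, you can build your own by combining the three parts, character filters, a tokenizer, and token filters.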

PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"char_filter": [
"emoticons"
],
"tokenizer": "punctuation",
"filter": [
"lowercase",
"english_stop"
]
}
},
"tokenizer": {
"punctuation": {
"type": "pattern",
"pattern": "[ .,!?]"
}
},
"char_filter": {
"emoticons": {
"type": "mapping",
"mappings": [
":) => _happy_",
":( => _sad_"
]
}
},
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_custom_analyzer",
"text": "I'm a :) person, and you?"
}
[ i'm, _happy_, person, you ]
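To trace how my_custom_analyzer arrives at that result, the sketch below walks the same text through the three stages in Python: the emoticon mapping, the split on `[ .,!?]`, then lowercasing and stop word removal. It is a hand-written emulation, not Elasticsearch code. The function name is made up, and `ENGLISH_STOP_WORDS` is the English list as I understand it from Lucene, so treat it as an assumption.

```python
import re

EMOTICON_MAPPINGS = {":)": "_happy_", ":(": "_sad_"}   # char_filter "emoticons"
PUNCTUATION = re.compile(r"[ .,!?]")                    # tokenizer "punctuation"
ENGLISH_STOP_WORDS = frozenset({                        # filter "english_stop" (assumed list)
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
})


def emulate_my_custom_analyzer(text: str) -> list[str]:
    # Stage 1: the character filter replaces emoticons in the raw text.
    for emoticon, replacement in EMOTICON_MAPPINGS.items():
        text = text.replace(emoticon, replacement)
    # Stage 2: the tokenizer splits on each separator character; empty pieces are dropped.
    tokens = [piece for piece in PUNCTUATION.split(text) if piece]
    # Stage 3: token filters run in the configured order: lowercase, then stop words.
    tokens = [token.lower() for token in tokens]
    return [token for token in tokens if token not in ENGLISH_STOP_WORDS]


print(emulate_my_custom_analyzer("I'm a :) person, and you?"))
# ["i'm", '_happy_', 'person', 'you']
```

The apostrophe in I'm is kept because it is not one of the separator characters, while "a" and "and" are removed as stop words.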


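Summary

This article introduced the analyzers built into Elasticsearch. You may not use them all that often, but working through them carefully goes a long way toward understanding how an analyzer works, and in particular what Character Filters, the Tokenizer, and Token Filters each do.
When I get the chance, I will cover some Chinese analyzers such as IKAnalyzer, ICU Analyzer, and Thulac. After all, Chinese analyzers are the ones used more often in day-to-day development.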