Elasticsearch (5) – The ICU Analyzer

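The icu_tokenizer uses the same Unicode Text Segmentation algorithm as the standard tokenizer. To better support Asian languages, it adds dictionary-based word identification for Thai, Lao, Chinese, Japanese, and Korean, and it can use custom rules to break Myanmar and Khmer text into syllables.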

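The ICU analyzer does not ship with the default Elasticsearch distribution. It is provided by the analysis-icu plugin, which must be installed separately.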

Installing the ICU analyzer

1. Go to the Elasticsearch bin directory

cd elasticsearch/bin

2. Install the ICU analysis plugin

./elasticsearch-plugin install analysis-icu

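3. Restart Elasticsearch once the installation completes. In a multi-node cluster, the plugin must be installed on every node, and each node must be restarted.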
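
To confirm that the plugin was actually loaded after the restart, one option is to query the `_cat/plugins` API. The following is a minimal Python sketch, not part of the original steps. It assumes a node reachable at `http://localhost:9200` without authentication, and the helper names `fetch_plugins` and `has_icu_plugin` are made up for illustration.

```python
import json
from urllib.request import urlopen

# Assumption: a local node with security disabled; adjust for your cluster.
ES_URL = "http://localhost:9200"


def fetch_plugins(base_url=ES_URL):
    """Return the rows of GET _cat/plugins?format=json as a list of dicts."""
    with urlopen(f"{base_url}/_cat/plugins?format=json") as resp:
        return json.load(resp)


def has_icu_plugin(plugin_rows):
    """True if at least one node reports the analysis-icu plugin."""
    return any(row.get("component") == "analysis-icu" for row in plugin_rows)
```

Against a running node, `has_icu_plugin(fetch_plugins())` should return `True` once the plugin is installed and the node has been restarted. Note that this only checks that some node reports the plugin, not that every node does.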

Using the ICU analyzer

# ICU analyzer test
GET _analyze
{
   "analyzer": "icu_analyzer",
   "text":"股市投资稳赚不赔必修课:如何做好仓位管理和情绪管理"
}

# Response:
{
  "tokens" : [
    {
      "token" : "股市",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "投资",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "稳赚",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "不",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "赔",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "必修",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "课",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "如何",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    },
    {
      "token" : "做好",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "<IDEOGRAPHIC>",
      "position" : 8
    },
    {
      "token" : "仓",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "<IDEOGRAPHIC>",
      "position" : 9
    },
    {
      "token" : "位",
      "start_offset" : 17,
      "end_offset" : 18,
      "type" : "<IDEOGRAPHIC>",
      "position" : 10
    },
    {
      "token" : "管理",
      "start_offset" : 18,
      "end_offset" : 20,
      "type" : "<IDEOGRAPHIC>",
      "position" : 11
    },
    {
      "token" : "和",
      "start_offset" : 20,
      "end_offset" : 21,
      "type" : "<IDEOGRAPHIC>",
      "position" : 12
    },
    {
      "token" : "情绪",
      "start_offset" : 21,
      "end_offset" : 23,
      "type" : "<IDEOGRAPHIC>",
      "position" : 13
    },
    {
      "token" : "管理",
      "start_offset" : 23,
      "end_offset" : 25,
      "type" : "<IDEOGRAPHIC>",
      "position" : 14
    }
  ]
}
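
The same request can be issued outside of Kibana. The sketch below is one possible way to do it from Python using only the standard library. It assumes an unsecured node at `http://localhost:9200`, and `analyze` and `token_texts` are hypothetical helper names rather than part of any Elasticsearch client.

```python
import json
from urllib.request import Request, urlopen

# Assumption: a local node with security disabled; adjust for your cluster.
ES_URL = "http://localhost:9200"


def analyze(text, analyzer="icu_analyzer", base_url=ES_URL):
    """POST the text to the _analyze API and return the parsed JSON response."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    req = Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req) as resp:
        return json.load(resp)


def token_texts(response):
    """Extract only the token strings from an _analyze response, by position."""
    tokens = sorted(response["tokens"], key=lambda t: t["position"])
    return [t["token"] for t in tokens]
```

Calling `token_texts(analyze(...))` on the sentence above with `analyzer="icu_analyzer"` and again with `analyzer="standard"` is a quick way to compare the two: the standard analyzer emits each Chinese ideograph as its own token, while the ICU analyzer groups dictionary words such as 股市 and 投资.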
