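Lucene 4.x Spellcheck Usage Guide
Spellcheck is a feature available in Lucene 4.x. Before introducing it, we need to be clear about which data sources it supports. SpellChecker itself is constructed from a Directory (where its own index is stored), and the words to index are supplied to indexDictionary through the Dictionary interface: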
package org.apache.lucene.search.spell;

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.IOException;

import org.apache.lucene.search.suggest.InputIterator;

/**
 * A simple interface representing a Dictionary. A Dictionary
 * here is a list of entries, where every entry consists of
 * term, weight and payload.
 *
 */
public interface Dictionary {

    /**
     * Returns an iterator over all the entries
     * @return Iterator
     */
    InputIterator getEntryIterator() throws IOException;
}
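Lucene ships several Dictionary implementations. The two most commonly used are the plain-text one (PlainTextDictionary) and the one built from an existing Lucene index (LuceneDictionary).
Below is the code I used for testing. It covers both building the index and querying it: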
package com.tianditu.com.search;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.util.Version;

public class GlobalSuggest {

    // Directory where the spellcheck index is built
    private final String SPELL_CHECK_FOLDER = "c:\\spellcheck\\";
    // Existing index used as the source
    private final String GLOBAL_PINYIN_SUGGEST = "O:\\searchwork_custom\\data_index\\pinyin2008\\";

    // Build the spellcheck index from the pinyin index
    public void testIndexPinyin2008() throws IOException {
        long start = System.currentTimeMillis();
        // 北京吉威时代软件股份有限公司
        // String indexDir = "O:\\searchwork_custom\\data_index\\GlobalIndex\\";
        Directory direct = new MMapDirectory(new File(GLOBAL_PINYIN_SUGGEST));
        DirectoryReader reader = DirectoryReader.open(direct);
        Directory spd = FSDirectory.open(new File(SPELL_CHECK_FOLDER));
        SpellChecker sc = new SpellChecker(spd);
        try {
            // Use the "name" field of the existing index as the dictionary
            LuceneDictionary ld = new LuceneDictionary(reader, "name");
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_30, null);
            // Write the spellcheck index into the spellcheck directory
            sc.indexDictionary(ld, iwc, true);
        } finally {
            sc.close();
            reader.close();
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing finished, took: " + (end - start) + "ms");
    }

    // Same as above, but reads from the GlobalIndex directory
    public void testIndex() throws IOException {
        long start = System.currentTimeMillis();
        // 北京吉威时代软件股份有限公司
        String indexDir = "O:\\searchwork_custom\\data_index\\GlobalIndex\\";
        Directory direct = new MMapDirectory(new File(indexDir));
        DirectoryReader reader = DirectoryReader.open(direct);
        Directory spd = FSDirectory.open(new File(SPELL_CHECK_FOLDER));
        SpellChecker sc = new SpellChecker(spd);
        try {
            LuceneDictionary ld = new LuceneDictionary(reader, "name");
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_30, null);
            sc.indexDictionary(ld, iwc, true);
        } finally {
            sc.close();
            reader.close();
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing finished, took: " + (end - start) + "ms");
    }

    public void testSearch(String wd) throws IOException {
        // Open the Directory that holds the spellcheck index
        Directory spd = FSDirectory.open(new File(SPELL_CHECK_FOLDER));
        // Instantiate the SpellChecker component
        SpellChecker sc = new SpellChecker(spd);
        try {
            // Get up to 10 suggestions closest to the input keyword. The third argument is
            // the accuracy: the higher it is, the closer the match must be. Tune as needed.
            String[] suggests = sc.suggestSimilar(wd, 10, 0.6f);
            if (suggests != null) {
                for (String word : suggests) {
                    System.out.println("Did you mean: " + word);
                }
            }
        } finally {
            sc.close();
        }
    }

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        GlobalSuggest spellcheck = new GlobalSuggest();
        // spellcheck.testIndexPinyin2008();
        spellcheck.testSearch("beijing京鸭");
        // spellcheck.testSearch("beijng");
    }
}
The index-building part of that code is:
// Build the spellcheck index
public void testIndexPinyin2008() throws IOException {
    long start = System.currentTimeMillis();
    // 北京吉威时代软件股份有限公司
    // String indexDir = "O:\\searchwork_custom\\data_index\\GlobalIndex\\";
    Directory direct = new MMapDirectory(new File(GLOBAL_PINYIN_SUGGEST));
    DirectoryReader reader = DirectoryReader.open(direct);
    Directory spd = FSDirectory.open(new File(SPELL_CHECK_FOLDER));
    SpellChecker sc = new SpellChecker(spd);
    try {
        // Use the "name" field of the existing index as the dictionary
        LuceneDictionary ld = new LuceneDictionary(reader, "name");
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_30, null);
        // Write the spellcheck index into the spellcheck directory
        sc.indexDictionary(ld, iwc, true);
    } finally {
        sc.close();
        reader.close();
    }
    long end = System.currentTimeMillis();
    System.out.println("Indexing finished, took: " + (end - start) + "ms");
}
This code builds the index that Spellcheck needs from an existing index.
The Spellcheck query snippet is as follows:
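// Open the Directory that holds the spellcheck index
Directory spd = FSDirectory.open(new File(SPELL_CHECK_FOLDER));
// Instantiate the SpellChecker component
SpellChecker sc = new SpellChecker(spd);
// Get up to 10 suggestions closest to the input keyword. The third argument is
// the accuracy: the higher it is, the closer the match must be. Tune as needed.
String[] suggests = sc.suggestSimilar(wd, 10, 0.6f);
if (suggests != null) {
    for (String word : suggests) {
        System.out.println("Did you mean: " + word);
    }
}
sc.close();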
Related algorithm: the default string distance is LevensteinDistance.
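To make the accuracy argument (0.6f in the query code) more concrete, here is a minimal plain-Java sketch of an edit-distance similarity score. It does not use Lucene; the class and method names are made up for illustration. It assumes the score is normalized as 1 - distance / length of the longer string, which is how I understand LevensteinDistance to work, so treat the numbers as an approximation of what SpellChecker computes rather than its exact behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only: an edit-distance similarity in [0, 1], independent of Lucene. */
final class EditDistanceSketch {

    /** Classic Levenshtein edit distance, computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed normalization: 1.0 means identical, values near 0 mean very different. */
    static float similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0f;
        }
        return 1.0f - (float) editDistance(a, b) / longest;
    }

    /** Keeps only the candidates whose similarity to the query is at least minScore. */
    static List<String> filter(String query, List<String> candidates, float minScore) {
        List<String> kept = new ArrayList<String>();
        for (String candidate : candidates) {
            if (similarity(query, candidate) >= minScore) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "beijng" is one insertion away from "beijing": score is 1 - 1/7, about 0.857
        System.out.println(similarity("beijng", "beijing"));
        // "nanjing" is four edits away: score is 1 - 4/7, about 0.429, below a 0.6 threshold
        System.out.println(similarity("beijng", "nanjing"));
        System.out.println(filter("beijng", Arrays.asList("beijing", "nanjing"), 0.6f));
    }
}
```

Under this assumption, raising the threshold above 0.6 returns fewer but closer suggestions, and lowering it lets more distant candidates through.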
Sample queries:
1. Querying Chinese characters that contain a typo:
2. Querying pinyin:
3. Pinyin mixed with Chinese characters:
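(Note: this revealed a problem. Input that mixes pinyin and Chinese characters does not work; to support it, the input needs some kind of preprocessing.)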
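One possible form of that preprocessing, sketched here as an assumption rather than something tested against the index above, is to split the input into runs of Chinese characters and runs of everything else, and then send each run to suggestSimilar on its own. The class and method names are hypothetical, and the sample string is the mixed query beijing京鸭 from the test code, written with Unicode escapes.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: splits a mixed query into Han and non-Han runs. */
final class MixedQuerySplitter {

    /** Returns the maximal runs of consecutive Han or consecutive non-Han characters. */
    static List<String> splitByScript(String text) {
        List<String> runs = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean currentIsHan = false;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            boolean isHan = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            // A change of script closes the current run and starts a new one
            if (current.length() > 0 && isHan != currentIsHan) {
                runs.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            currentIsHan = isHan;
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        // The article's mixed pinyin/Chinese sample query, written with Unicode escapes
        String query = "beijing\u4eac\u9e2d";
        for (String run : splitByScript(query)) {
            // Each run could then be passed to SpellChecker.suggestSimilar separately
            System.out.println(run);
        }
    }
}
```

Whether the per-run suggestions are useful depends on what the spellcheck index contains, for example whether pinyin and Chinese terms are both indexed, so this is only a starting point.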
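4. A long string of Chinese characters with a typo in the middle:
Summary: Spellcheck's capabilities are still limited. Using it in practice may require adapting it.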