Building a Full-Text Index and Highlighting Keywords with Lucene 4.7.0 in Java

Date: June 8, 2015

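1. Introduction to Lucene

Lucene provides an excellent solution for full-text indexing. This article focuses on using Lucene in practice, so readers who want a more detailed account of Lucene's history and internals can look those up separately.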

2. Setting up Lucene 4.7.0

1). Download Lucene from http://lucene.apache.org/

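2). Create a new project and add the JAR files. My project builds a full-text index for a given directory and then uses that index to search for a keyword and highlight it, so it only needs three basic JARs: lucene-core-4.7.0.jar, lucene-analyzers-common-4.7.0.jar and lucene-highlighter-4.7.0.jar. Of these:

lucene-core-4.7.0.jar is Lucene's core package and is indispensable.

lucene-analyzers-common-4.7.0.jar contains the analyzers that determine the tokenization strategy when the index is built, so it is also required for generating the index.

lucene-highlighter-4.7.0.jar contains the classes used to highlight keywords in the results of an index search.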

[Figure: my local project structure]

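3. Building a full-text index with Lucene

Before getting into the details, note that in Lucene every item to be indexed is called a Document, and each Document object contains a number of Fields of different types. You can think of a Document as a row in a database table whose columns are, say, title, publication time and article content; each of those columns corresponds to a Field. Once a file has been converted into a Document object, an IndexWriter instance adds it to the index. During this step the Document's content is tokenized by a Lucene Analyzer, and each Analyzer implementation represents one specific tokenization strategy.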
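To make the roles of Document, Field and Analyzer more concrete, here is a toy inverted index in plain Java. It is only a conceptual sketch: ToyInvertedIndex and its methods are hypothetical names, splitting on whitespace stands in for a real Analyzer, and Lucene's actual index structures are far more elaborate.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical toy index: maps each term to the IDs of the documents containing it. */
final class ToyInvertedIndex {
    // term -> sorted set of document IDs (the "postings" for that term)
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();
    // document ID -> stored "path" field
    private final List<String> paths = new ArrayList<String>();

    /** Adds a document with a stored path and tokenized content; returns its ID. */
    int addDocument(String path, String content) {
        int docId = paths.size();
        paths.add(path);
        // Stand-in for an Analyzer: lower-case, then split on whitespace
        for (String term : content.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<Integer>();
                postings.put(term, docs);
            }
            docs.add(docId);
        }
        return docId;
    }

    /** Returns the IDs of all documents whose content contains the exact term. */
    Set<Integer> search(String term) {
        Set<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? Collections.<Integer>emptySet() : docs;
    }

    /** Returns the stored path of a document. */
    String path(int docId) {
        return paths.get(docId);
    }
}
```

Searching is then a lookup by term rather than a scan over every file, which is the main reason for building the index ahead of time.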

The indexing code is as follows:

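    // Imports needed by this method (they belong at the top of the class file):
    // java.io.File, java.io.IOException, java.util.ArrayList,
    // org.apache.lucene.analysis.Analyzer, org.apache.lucene.analysis.cjk.CJKAnalyzer,
    // org.apache.lucene.document.Document, org.apache.lucene.document.Field.Store,
    // org.apache.lucene.document.StringField, org.apache.lucene.document.TextField,
    // org.apache.lucene.index.IndexWriter, org.apache.lucene.index.IndexWriterConfig,
    // org.apache.lucene.store.FSDirectory, org.apache.lucene.util.Version

    // Directories found during traversal are collected in this list
    ArrayList<File> AllFiles = new ArrayList<File>();

    /**
     * Creates an index for the files under the given folder and saves the
     * index files in the given index folder.
     * @param indexFilePath where the index files are saved
     * @param dataFilePath  where the text files are stored
     */
    public void createIndex(String indexFilePath, String dataFilePath) {
        try {
            File dataDir = new File(dataFilePath);
            File indexDir = new File(indexFilePath);
            Analyzer luceneAnalyzer = new CJKAnalyzer(Version.LUCENE_47);
            //Analyzer luceneAnalyzer = new ChineseAnalyzer();
            // The default open mode is CREATE_OR_APPEND, so running this twice
            // on the same index directory adds the same files a second time.
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_47, luceneAnalyzer);
            IndexWriter indexWriter = new IndexWriter(FSDirectory.open(indexDir), iwc);
            long startTime = System.currentTimeMillis();
            ArrayList<File> GetAllFiles = getAllFilesOfDirectory(dataDir);
            // Iterate over all files in each directory that was found
            for (int index = 0; index < GetAllFiles.size(); index++) {
                File[] dataFiles = GetAllFiles.get(index).listFiles();
                if (dataFiles == null) {
                    continue; // listFiles() returns null if the directory cannot be read
                }
                for (int i = 0; i < dataFiles.length; i++) {
                    if (dataFiles[i].isFile() && dataFiles[i].getName().endsWith(".txt")) {
                        System.out.println("Indexing File:" + dataFiles[i].getCanonicalPath());
                        Document doc = new Document();
                        StringBuffer reader = readFile(dataFiles[i]);
                        //Reader reader = new FileReader(dataFiles[i]);
                        // StringField: stored as a single, untokenized term
                        doc.add(new StringField("path", dataFiles[i].getCanonicalPath(), Store.YES));
                        // TextField: tokenized by the analyzer; stored so it can be highlighted later
                        doc.add(new TextField("content", reader.toString(), Store.YES));
                        indexWriter.addDocument(doc);
                    }
                }
            }
            indexWriter.close();
            long endTime = System.currentTimeMillis();
            System.out.println("It takes:" + (endTime - startTime) + " milliseconds to create index for the files in directory:" + dataDir.getPath());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }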
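createIndex loads each file through the readFile helper (listed in the next section), which uses FileReader and therefore the platform's default character encoding. For Chinese text saved as UTF-8 on a machine with a different default, the indexed content can come out garbled. One possible alternative is sketched below. It assumes the caller knows the files' encoding (the article does not state it), and TextFileReaderSketch and readText are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

/** Hypothetical helper that reads a text file with an explicit character encoding. */
final class TextFileReaderSketch {

    /** Reads the whole file, decoding it with the given charset and keeping line breaks. */
    static String readText(File file, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() strips the line terminator, so add one back;
                // otherwise the last word of a line fuses with the first word of the next
                sb.append(line).append('\n');
            }
        } finally {
            reader.close(); // close the file even if reading fails
        }
        return sb.toString();
    }
}
```

A call such as readText(file, Charset.forName("UTF-8")) could then supply the value of the "content" field.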

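4. Searching documents by keyword with the generated index

After running the steps above, you will find some .cfe, .cfs and .si files in the index directory you specified. These are the index files generated by Lucene. The following shows how to use them for full-text search.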

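Searching mainly uses the search(Query query, int n) method of the IndexSearcher class in the org.apache.lucene.search package. Here we pass it a TermQuery, which plays a role similar to the SQL statement in a database query: the (key, value) pair of Field name and keyword is turned into a query that Lucene understands. Note that a TermQuery matches a term exactly as it was indexed, and the keyword is not passed through the analyzer. Each hit carries a document ID, and once the Document has been fetched with that ID, its Field values can be read.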
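One practical consequence: the index in this article is built with CJKAnalyzer, which indexes runs of Chinese characters as overlapping two-character terms (bigrams). A two-character keyword therefore corresponds to an indexed term, while a longer keyword handed directly to a TermQuery does not. The sketch below imitates the bigram splitting in plain Java to make this visible. BigramSketch and bigrams are hypothetical names, and the code is only an approximation of the analyzer (it ignores stop words, width folding and non-CJK text, and assumes characters in the Basic Multilingual Plane).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that imitates overlapping-bigram tokenization of a run of CJK text. */
final class BigramSketch {

    /** Splits the text into overlapping two-character terms. */
    static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<String>();
        if (text.length() == 1) {
            terms.add(text); // a lone character is kept as a one-character term
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            terms.add(text.substring(i, i + 2));
        }
        return terms;
    }
}
```

For example, bigrams("中国政府网") yields 中国, 国政, 政府 and 府网, so an exact lookup of 政府 finds a term, whereas 中国政府 is not among the terms.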

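The basic idea of highlighting is to wrap each matched keyword in the specified HTML tags, for example replacing "message" with "<font color='red'>message</font>", so that message is shown in red when the text is rendered. In Lucene this is done with SimpleHTMLFormatter, Highlighter (configured through its setTextFragmenter method) and TokenStream, among others.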
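The wrapping idea itself can be shown without Lucene. The following is a minimal sketch using plain string search. It is not how Lucene's Highlighter is implemented (that works from the token offsets of a TokenStream and also selects the best fragment), and HighlightSketch and highlight are hypothetical names.

```java
/** Hypothetical helper that illustrates the "wrap the keyword in tags" idea. */
final class HighlightSketch {

    /** Wraps every literal occurrence of keyword in text with preTag and postTag. */
    static String highlight(String text, String keyword, String preTag, String postTag) {
        if (keyword.isEmpty()) {
            return text; // nothing to highlight
        }
        StringBuilder out = new StringBuilder();
        int from = 0;
        int hit;
        while ((hit = text.indexOf(keyword, from)) >= 0) {
            // copy the text before the match, then the tagged match
            out.append(text, from, hit).append(preTag).append(keyword).append(postTag);
            from = hit + keyword.length();
        }
        // copy whatever follows the last match
        return out.append(text.substring(from)).toString();
    }
}
```

Calling highlight("a message here", "message", "<b>", "</b>") returns "a <b>message</b> here".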

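    // Imports needed by this method: java.io.File, java.io.IOException, java.io.StringReader,
    // org.apache.lucene.analysis.Analyzer, org.apache.lucene.analysis.TokenStream,
    // org.apache.lucene.analysis.cjk.CJKAnalyzer, org.apache.lucene.document.Document,
    // org.apache.lucene.index.DirectoryReader, org.apache.lucene.index.IndexReader,
    // org.apache.lucene.index.Term, org.apache.lucene.search.IndexSearcher,
    // org.apache.lucene.search.ScoreDoc, org.apache.lucene.search.TermQuery,
    // org.apache.lucene.search.TopDocs, org.apache.lucene.search.highlight.Highlighter,
    // InvalidTokenOffsetsException, QueryScorer, SimpleFragmenter and SimpleHTMLFormatter
    // (same highlight package), org.apache.lucene.store.Directory,
    // org.apache.lucene.store.FSDirectory, org.apache.lucene.util.Version

    /**
     * Searches the existing index for the given keyword and prints the
     * results with the keyword highlighted.
     * @param indexFilePath path of the index files
     * @param keyWord keyword to search for
     */
    public void searchByIndex(String indexFilePath, String keyWord) {
        try {
            Directory dir = FSDirectory.open(new File(indexFilePath));
            IndexReader reader = DirectoryReader.open(dir);
            IndexSearcher searcher = new IndexSearcher(reader);
            Term term = new Term("content", keyWord);
            TermQuery query = new TermQuery(term);
            // Retrieve the 10 highest-scoring matches
            TopDocs topdocs = searcher.search(query, 10);
            ScoreDoc[] scoredocs = topdocs.scoreDocs;
            System.out.println("Total hits: " + topdocs.totalHits);
            System.out.println("Max score: " + topdocs.getMaxScore());
            // The highlighter and analyzer do not depend on the hit, so create them once
            SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<b><font color='red'>", "</font></b>");
            Highlighter highlighter = new Highlighter(formatter, new QueryScorer(query));
            highlighter.setTextFragmenter(new SimpleFragmenter(400));
            Analyzer luceneAnalyzer = new CJKAnalyzer(Version.LUCENE_47);
            for (int i = 0; i < scoredocs.length; i++) {
                int doc = scoredocs[i].doc;
                Document document = searcher.doc(doc);
                System.out.println("====================File [" + (i + 1) + "]=================");
                System.out.println("Search term: " + term.toString());
                System.out.println("File path: " + document.get("path"));
                System.out.println("File ID: " + scoredocs[i].doc);
                String content = document.get("content");
                /* Begin: keyword highlighting */
                if (content != null) {
                    // The first argument of tokenStream is the field name, not the keyword
                    TokenStream tokenstream = luceneAnalyzer.tokenStream("content", new StringReader(content));
                    try {
                        String fragment = highlighter.getBestFragment(tokenstream, content);
                        if (fragment != null) { // null means no fragment contained a match
                            content = fragment;
                        }
                    } catch (InvalidTokenOffsetsException e) {
                        e.printStackTrace();
                    }
                }
                /* End: keyword highlighting */
                System.out.println("File content: " + content);

                System.out.println("Relevance score: " + scoredocs[i].score);
            }
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }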
    
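    /**
     * Reads the content of a file.
     * Needs java.io.BufferedReader and java.io.FileReader.
     * @param file the File to read
     * @return the file content as a StringBuffer
     */
    public StringBuffer readFile(File file) {
        StringBuffer sb = new StringBuffer();
        try {
            // FileReader decodes with the platform default encoding
            BufferedReader reader = new BufferedReader(new FileReader(file));
            String str;
            while ((str = reader.readLine()) != null) {
                // readLine() drops the line terminator; add it back so that
                // words on adjacent lines are not joined together
                sb.append(str).append('\n');
            }
            reader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return sb;
    }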
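    /**
     * Recursively collects the given root folder and all folders below it.
     * @param RootFile the directory to traverse
     * @return the list of directories (the root and all of its subdirectories)
     */
    public ArrayList<File> getAllFilesOfDirectory(File RootFile) {
        // Include the directory itself so that files directly under it are indexed too
        AllFiles.add(RootFile);
        File[] files = RootFile.listFiles();
        if (files == null) {
            return AllFiles; // not a directory, or it could not be read
        }
        for (File file : files) {
            if (file.isDirectory()) {
                System.out.println("========Start traversing folder======");
                getAllFilesOfDirectory(file);
            }
        }
        return AllFiles;
    }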
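The traversal above stores its results in the shared AllFiles field, so calling it more than once on the same object keeps adding to the same list. One possible variant, given here as a sketch with hypothetical names (TxtFileCollector, collectTxtFiles), keeps the list local to each call and returns the .txt files directly instead of the directories:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that collects all .txt files under a root directory. */
final class TxtFileCollector {

    /** Returns every regular file ending in ".txt" under root, at any depth. */
    static List<File> collectTxtFiles(File root) {
        List<File> result = new ArrayList<File>();
        collect(root, result);
        return result;
    }

    private static void collect(File dir, List<File> result) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // dir does not exist, is not a directory, or cannot be read
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                collect(entry, result); // descend into the subdirectory
            } else if (entry.isFile() && entry.getName().endsWith(".txt")) {
                result.add(entry);
            }
        }
    }
}
```

With such a helper, the indexing loop could iterate over the returned files directly rather than calling listFiles() on each directory.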

5. Test entry point

You can create some folders and documents of your own as test data.

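    /**
     * Main entry point
     * @param args command-line arguments (unused)
     */
    public static void main(String[] args) {
        String dataPath = "E:\\willion";     // where the text files are stored
        String indexPath = "E:\\luceneData"; // where the index files are stored
        String keyWord = "政府";              // search keyword ("government")
        LucenceDao indexdao = new LucenceDao();
        /* Build the index for the given folder (uncomment for the first run) */
        //indexdao.createIndex(indexPath, dataPath);
        /* Full-text search using the index */
        indexdao.searchByIndex(indexPath, keyWord);
    }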