Lucene Basics

Date: June 9, 2015

Lucene is a high-performance full-text search library written in Java.

Documentation: http://lucene.apache.org/core/5_0_0/core/overview-summary.html

[Figure: Lucene architecture overview; image not preserved]

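Reading the diagram from the bottom up, it is easy to see that the index is the core of Lucene.

So what does a Lucene index look like?

Suppose we have 1,000 documents, numbered 1 to 1000. Indexing them produces the following structure: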

[Figure: index structure, terms on the left and document lists on the right; image not preserved]

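The left side is the index of terms, and the right side holds, for each term, a linked list of the documents that contain it.

For example, the first row means that the word lucene appears in documents 2, 3, 10, 35, and 92.

So how does Lucene build the index?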
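
The structure described above can be sketched in a few lines of plain Java. This is only an illustration of the idea, not Lucene's on-disk format: the class name InvertedIndexSketch, its methods, and the sample documents are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Lucene.
class InvertedIndexSketch {
    // term -> ascending list of IDs of the documents that contain the term
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Documents are assumed to be added in increasing ID order,
    // so every document list stays sorted.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> list = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Record each document at most once per term.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(2, "Lucene is a search library");
        index.add(3, "Learning Lucene");
        index.add(5, "A library of books");
        System.out.println(index.lookup("lucene"));  // [2, 3]
        System.out.println(index.lookup("library")); // [2, 5]
    }
}
```

Looking up a term is a single map access, which is why a search does not need to scan every document.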

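First, an analyzer (the analysis step) breaks each document down:

1 Split the document into individual words

2 Remove punctuation

3 Remove words that carry no meaning: a, the, this

4 Lowercase the words

5 Reduce words to their base form: for example, convert past-tense and participle forms to the base form

6 Extract common roots: for teamwork, homework, and housework, the root work can be extracted

The resulting set of terms is what actually gets indexed. Which of these steps run depends on the analyzer you choose.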
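
As a rough illustration of steps 1 to 4, here is a minimal sketch in plain Java. It does not use Lucene's analyzer classes; the class name AnalysisSketch and its short stop-word list are assumptions made for this example, and steps 5 and 6 (reducing words to a base form) are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not part of Lucene.
class AnalysisSketch {
    // A tiny stop-word list chosen for this example only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "the", "this", "is", "to", "be"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Steps 1 and 2: split on anything that is not a letter or digit,
        // which also discards punctuation.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            // Step 4: lowercase first, so that "This" matches the stop word "this".
            String term = token.toLowerCase(Locale.ROOT);
            // Step 3: drop words that carry no meaning.
            if (STOP_WORDS.contains(term)) {
                continue;
            }
            terms.add(term);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("This is the text to be indexed.")); // [text, indexed]
    }
}
```

Only two of the seven words survive, which shows how much analysis can shrink what needs to be indexed.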

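Index data structures

Lucene index files store information using the following primitive types:
Byte: the most basic type, 8 bits long.
UInt32: made up of 4 Bytes.
UInt64: made up of 8 Bytes.
VInt:
A variable-length integer that may span several Bytes. In each Byte, the low 7 bits hold part of the value, and the highest bit says whether another Byte follows: 0 means no, 1 means yes.
Earlier Bytes hold the low-order bits of the value; later Bytes hold the high-order bits.
For example, 130 in binary is 1000 0010, which needs 8 bits. A single VInt Byte carries only 7 value bits, so two Bytes are required. The first Byte holds the low 7 bits, with its highest bit set to 1 to signal that another Byte follows, giving (1) 0000010. The second Byte holds the eighth bit, with its highest bit set to 0 to signal that no more Bytes follow, giving (0) 0000001.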
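
The VInt rules above can be checked with a small sketch. The class VIntSketch and its encode/decode methods are hypothetical helpers written for this post, not Lucene's own I/O classes; they only follow the byte layout described above.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper class; not part of Lucene.
class VIntSketch {

    // Encode an int as a VInt: 7 value bits per byte, low-order group first.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // highest bit 1: another byte follows
            value >>>= 7;
        }
        out.write(value); // highest bit 0: this is the last byte
        return out.toByteArray();
    }

    // Decode a single VInt from the start of the array.
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break; // continuation bit is 0, so the value is complete
            }
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] encoded = encode(130);
        for (byte b : encoded) {
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad to 8 digits so the continuation bit is visible.
            System.out.println(String.format("%8s", bits).replace(' ', '0'));
        }
        System.out.println(decode(encoded));
    }
}
```

Running it prints 10000010, then 00000001, then 130, matching the two Bytes worked out by hand above. Small numbers such as document ID gaps fit in one Byte, which is the point of the format.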

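OK, let's write some code and try it out.

Let's first think through the whole process, then work through it with the official documentation at hand.

Indexing

1 You start with a number of documents or records that need to be stored

2 You specify a directory in which the index will be built

3 You use an analyzer to parse and break down the documents or records to be indexed

4 You build the index from the reduced data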

--------------------------

Searching

1 Parse and break down the search text with the analyzer

2 Run the search

3 Read the returned results
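
To make step 2 a little more concrete: once the query has been analyzed into terms, answering it comes down to combining the document lists of those terms. The sketch below assumes every query term is required and intersects two sorted document lists. It is an illustration only, not how Lucene implements search; the class name is hypothetical, and the list for the second term is made up for the example (the first list is the lucene row from earlier).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
class PostingIntersectionSketch {

    // Walk both ascending lists in step and keep the IDs present in both.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i);
            int y = b.get(j);
            if (x == y) {
                result.add(x);
                i++;
                j++;
            } else if (x < y) {
                i++; // x cannot appear later in b, so skip it
            } else {
                j++; // y cannot appear later in a, so skip it
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> lucene = Arrays.asList(2, 3, 10, 35, 92);
        List<Integer> search = Arrays.asList(3, 8, 35, 60); // made-up list for a second term
        System.out.println(intersect(lucene, search)); // [3, 35]
    }
}
```

Because both lists are sorted, the intersection takes one pass over each list.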

pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>Test</groupId>
    <artifactId>Test</artifactId>
    <packaging>war</packaging>
    <version>0.0.1-SNAPSHOT</version>
    <name>Test Maven Webapp</name>
    <url>http://maven.apache.org</url>
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.4</version>
        </dependency>
        <dependency>
            <groupId>javax</groupId>
            <artifactId>javaee-api</artifactId>
            <version>7.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>5.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>5.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>5.0.0</version>
        </dependency>
    </dependencies>
    
</project>

LuceneDemo.java (official demo)

package com.newtouchone.lucene;

import static org.junit.Assert.assertEquals;


import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class LuceneDemo {

    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();// the analyzer processes the words of each document to shrink the index; it is also applied to query terms
        // Store the index in memory:
        Directory directory = new RAMDirectory();// in-memory index
        // To store an index on disk, use this instead (Lucene 5 takes a java.nio.file.Path):
        // Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));// index on local disk
        IndexWriterConfig config = new IndexWriterConfig(analyzer);// configure the writer with the analyzer
        IndexWriter iwriter = new IndexWriter(directory, config);// IndexWriter writes documents into the index
        
        Document doc = new Document();// create a document
        String text = "This is the text to be indexed.";// the content to index
        doc.add(new Field("body", text, TextField.TYPE_STORED));// a field of the document: body -> the text
        doc.add(new Field("title","first", TextField.TYPE_STORED));// title -> first
        iwriter.addDocument(doc);// add the document to the index
        String text2 = "learning lucene";// second document, same steps as above
        Document doc2 = new Document();
        doc2.add(new Field("body", text2, TextField.TYPE_STORED));
        doc2.add(new Field("title","second", TextField.TYPE_STORED));// the title belongs to doc2, not doc
        iwriter.addDocument(doc2);
        
        iwriter.close();// close the index writer
        

        // Now search the index:
        DirectoryReader ireader = DirectoryReader.open(directory); // open the index
        IndexSearcher isearcher = new IndexSearcher(ireader);// create the searcher
        // Parse a simple query that searches for "indexed":
        QueryParser parser = new QueryParser("body", analyzer);// search the body field, i.e. the document content (these documents have a title and a body)
        Query query = parser.parse("indexed");// find documents containing "indexed"
        ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;// collect the hits

        assertEquals(1, hits.length);// verify that exactly one document matched
        
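        // Iterate through the results:
        for (int i = 0; i < hits.length; i++) {
            Document hitDoc = isearcher.doc(hits[i].doc);// load the stored document for this hit
            for (IndexableField indexableField : hitDoc.getFields()) {// a document can have several fields; these have title and body
                System.out.println(indexableField.stringValue());// print the field's value
            }
        }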
        ireader.close();
        directory.close();
    }

}