The Lucene analysis (tokenization) flow

I spent this past week studying Lucene/Solr, so today I want to write up a proper summary and record the important bits, so that I won't be completely lost when I come back to this later.


This article gives a brief introduction to the analysis flow in Lucene and explains some of the simple principles behind it. If anything here is wrong, corrections from readers are very welcome!


(1) The main analyzers

WhitespaceAnalyzer, StopAnalyzer, SimpleAnalyzer, and KeywordAnalyzer all share the same parent class, Analyzer. The Analyzer class declares an abstract method called tokenStream:

package org.apache.lucene.analysis;


import java.io.Reader;
import java.io.IOException;
import java.io.Closeable;
import java.lang.reflect.Modifier;

import org.apache.lucene.util.CloseableThreadLocal;
import org.apache.lucene.store.AlreadyClosedException;

import org.apache.lucene.document.Fieldable;

/** An Analyzer builds TokenStreams, which analyze text.  It thus represents a
 *  policy for extracting index terms from text.
 *  <p>
 *  Typical implementations first build a Tokenizer, which breaks the stream of
 *  characters from the Reader into raw Tokens.  One or more TokenFilters may
 *  then be applied to the output of the Tokenizer.
 * <p>The {@code Analyzer}-API in Lucene is based on the decorator pattern.
 * Therefore all non-abstract subclasses must be final or their {@link #tokenStream}
 * and {@link #reusableTokenStream} implementations must be final! This is checked
 * when Java assertions are enabled.
 */
public abstract class Analyzer implements Closeable {
  // ..... only the key parts of this class are shown here
 
  /** Creates a TokenStream which tokenizes all the text in the provided
   * Reader.  Must be able to handle null field name for
   * backward compatibility.
   */
  public abstract TokenStream tokenStream(String fieldName, Reader reader);

 }
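As a quick illustration of how a caller uses tokenStream, here is a minimal sketch against the Lucene 3.x API shown in this post (the field name "content" and the sample text are arbitrary choices of mine): it asks an analyzer for a TokenStream and drains it token by token.

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class PrintTokensDemo {

  // Ask the analyzer for a TokenStream over the text and print every token it produces.
  static void printTokens(Analyzer analyzer, String text) throws IOException {
    TokenStream stream = analyzer.tokenStream("content", new StringReader(text));
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {       // advance to the next token
      System.out.print("[" + term.toString() + "] ");
    }
    stream.end();
    stream.close();
    System.out.println();
  }

  public static void main(String[] args) throws IOException {
    printTokens(new WhitespaceAnalyzer(Version.LUCENE_30), "How are you, Thank you");
    // prints: [How] [are] [you,] [Thank] [you]  -- split on whitespace only
  }
}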

TokenStream has two subclasses, Tokenizer and TokenFilter. As the names suggest, the former breaks the text into individual tokens (the Analyzer hands it the Reader stream), and that output then passes through a series of filters until we end up with the complete TokenStream. See the figure below:


[Figure: an Analyzer chains a Tokenizer with one or more TokenFilters to produce the final TokenStream]


The figure below lists some common Tokenizers:

[Figure: class diagram of the common Tokenizers, all of which derive from CharTokenizer]

Let's briefly walk through SimpleAnalyzer's flow. Here is its source code:

public final class SimpleAnalyzer extends ReusableAnalyzerBase {

  private final Version matchVersion;
  
  /**
   * Creates a new {@link SimpleAnalyzer}
   * @param matchVersion Lucene version to match See {@link <a href="#version">above</a>}
   */
  public SimpleAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
  }
  
  /**
   * Creates a new {@link SimpleAnalyzer}
   * @deprecated use {@link #SimpleAnalyzer(Version)} instead 
   */
  @Deprecated  public SimpleAnalyzer() {
    this(Version.LUCENE_30);
  }
  @Override
  protected TokenStreamComponents createComponents(final String fieldName,
      final Reader reader) {
    // Overrides the parent's TokenStream-creation hook and passes in a LowerCaseTokenizer,
    // which is why all letters end up lower-cased.
    return new TokenStreamComponents(new LowerCaseTokenizer(matchVersion, reader));
  }
}
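To see the effect of that LowerCaseTokenizer, we can run SimpleAnalyzer over the same sentence using the hypothetical printTokens helper sketched earlier; everything that is not a letter acts as a delimiter, and the letters are lower-cased:

printTokens(new SimpleAnalyzer(Version.LUCENE_30), "How are you, Thank you");
// prints: [how] [are] [you] [thank] [you]   -- split on non-letters, lower-cased

printTokens(new WhitespaceAnalyzer(Version.LUCENE_30), "How are you, Thank you");
// prints: [How] [are] [you,] [Thank] [you]  -- whitespace only, case preserved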

As the class diagram above shows, these tokenizers all share a parent class called CharTokenizer. What does this class do? Clearly it is the one that splits the character stream. The TokenStream class declares a method called incrementToken:

public abstract class TokenStream extends AttributeSource implements Closeable {

 
  
  /**
   * Consumers (i.e., {@link IndexWriter}) use this method to advance the stream to
   * the next token. Implementing classes must implement this method and update
   * the appropriate {@link AttributeImpl}s with the attributes of the next
   * token.
   * <P>
   * The producer must make no assumptions about the attributes after the method
   * has been returned: the caller may arbitrarily change it. If the producer
   * needs to preserve the state for subsequent calls, it can use
   * {@link #captureState} to create a copy of the current attribute state.
   * <p>
   * This method is called for every token of a document, so an efficient
   * implementation is crucial for good performance. To avoid calls to
   * {@link #addAttribute(Class)} and {@link #getAttribute(Class)},
   * references to all {@link AttributeImpl}s that this stream uses should be
   * retrieved during instantiation.
   * <p>
   * To ensure that filters and consumers know which attributes are available,
   * the attributes must be added during instantiation. Filters and consumers
   * are not required to check for availability of attributes in
   * {@link #incrementToken()}.
   * 
   * @return false for end of stream; true otherwise
   */
  public abstract boolean incrementToken() throws IOException;
  
  
}
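To make this contract concrete before looking at CharTokenizer, here is a minimal, hypothetical TokenStream of my own that simply replays a fixed array of strings; it is illustrative only and not a Lucene class.

import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Replays a fixed array of terms, one per call to incrementToken().
public final class ArrayTokenStream extends TokenStream {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final String[] terms;
  private int index = 0;

  public ArrayTokenStream(String... terms) {
    this.terms = terms;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (index >= terms.length) {
      return false;                            // end of stream
    }
    clearAttributes();                         // reset all attributes before filling them
    termAtt.setEmpty().append(terms[index++]); // set the next token's text
    return true;                               // one more token was produced
  }
}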

The Javadoc spells out the contract: consumers use this method to advance the stream to the next token, and implementing classes must update the appropriate attributes and return a boolean indicating whether another token was found. Below is CharTokenizer's incrementToken method:

 @Override
  public final boolean incrementToken() throws IOException {
    clearAttributes();
    if(useOldAPI) // TODO remove this in LUCENE 4.0
      return incrementTokenOld();
    int length = 0;
    int start = -1; // this variable is always initialized
    char[] buffer = termAtt.buffer();
    while (true) {
      if (bufferIndex >= dataLen) {
        offset += dataLen;
        if(!charUtils.fill(ioBuffer, input)) { // read supplementary char aware with CharacterUtils
          dataLen = 0; // so next offset += dataLen won't decrement offset
          if (length > 0) {
            break;
          } else {
            finalOffset = correctOffset(offset);
            return false;
          }
        }
        dataLen = ioBuffer.getLength();
        bufferIndex = 0;
      }
      // use CharacterUtils here to support < 3.1 UTF-16 code unit behavior if the char based methods are gone
      final int c = charUtils.codePointAt(ioBuffer.getBuffer(), bufferIndex);
      bufferIndex += Character.charCount(c);

      if (isTokenChar(c)) {               // if it's a token char
        if (length == 0) {                // start of token
          assert start == -1;
          start = offset + bufferIndex - 1;
        } else if (length >= buffer.length-1) { // check if a supplementary could run out of bounds
          buffer = termAtt.resizeBuffer(2+length); // make sure a supplementary fits in the buffer
        }
        length += Character.toChars(normalize(c), buffer, length); // buffer it, normalized
        if (length >= MAX_WORD_LEN) // buffer overflow! make sure to check for >= surrogate pair could break == test
          break;
      } else if (length > 0)             // at non-Letter w/ chars
        break;                           // return 'em
    }

    termAtt.setLength(length);
    assert start != -1;
    offsetAtt.setOffset(correctOffset(start), finalOffset = correctOffset(start+length));
    return true;
    
  }
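CharTokenizer leaves only two hooks for its subclasses, isTokenChar() and normalize(), so writing a tokenizer of your own mostly means overriding those two methods. A hedged sketch (a hypothetical class, not part of Lucene) that keeps letters and digits and lower-cases them:

import java.io.Reader;

import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical tokenizer: token characters are letters and digits, normalized to lower case.
public final class LetterOrDigitTokenizer extends CharTokenizer {

  public LetterOrDigitTokenizer(Version matchVersion, Reader in) {
    super(matchVersion, in);
  }

  @Override
  protected boolean isTokenChar(int c) {
    return Character.isLetterOrDigit(c);   // anything else ends the current token
  }

  @Override
  protected int normalize(int c) {
    return Character.toLowerCase(c);       // same idea as LowerCaseTokenizer
  }
}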

So the stream produced by SimpleAnalyzer comes straight from its tokenizer and does not pass through any filter. Compare this with StopAnalyzer's createComponents method:

 @Override
  protected TokenStreamComponents createComponents(String fieldName,
      Reader reader) {
    final Tokenizer source = new LowerCaseTokenizer(matchVersion, reader); // a LowerCaseTokenizer followed by a StopFilter
    return new TokenStreamComponents(source, new StopFilter(matchVersion,
          source, stopwords));
  }


Only after passing through these components does the result finally become a TokenStream. So what are the common filters?

[Figure: class diagram of the common TokenFilters]


These are some of the common filters, such as StopFilter, LowerCaseFilter, and so on. The stream coming out of the Tokenizer only becomes a real TokenStream after passing through these filters.
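A TokenFilter wraps another TokenStream and decides, inside its own incrementToken(), which tokens of the wrapped stream to pass on, drop, or modify. As a hedged sketch (a hypothetical filter of mine, not one of the built-in classes), here is a filter that drops tokens shorter than a minimum length:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical filter: lets through only tokens with at least minLength characters.
public final class MinLengthFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final int minLength;

  public MinLengthFilter(TokenStream input, int minLength) {
    super(input);
    this.minLength = minLength;
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {          // pull tokens from the wrapped stream
      if (termAtt.length() >= minLength) {
        return true;                          // keep this token
      }
      // too short: skip it and try the next one
    }
    return false;                             // the wrapped stream is exhausted
  }
}

It would be chained exactly like the StopFilter above, e.g. new MinLengthFilter(new LowerCaseTokenizer(matchVersion, reader), 3).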


(2) How token information is stored

Three classes have to be mentioned here: CharTermAttribute (stores the token text itself), OffsetAttribute (stores the token's character offsets), and PositionIncrementAttribute (stores the position increment between consecutive tokens).

With these three pieces of information we can pin down a token's exact place in a document. For example, the sentence "how are you thank you" actually looks like this inside Lucene (the offsets in the figure are wrong: the end offset of "how" should be 3, and so on):

[Figure: the tokens of "how are you thank you" with their offsets and position increments]
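Here is a hedged sketch of how those three attributes are read off a TokenStream (same Lucene 3.x API as above; the analyzer, the field name "content", and the text are arbitrary):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.Version;

public class TokenPositionDemo {
  public static void main(String[] args) throws IOException {
    TokenStream ts = new SimpleAnalyzer(Version.LUCENE_30)
        .tokenStream("content", new StringReader("how are you thank you"));

    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
    PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);

    ts.reset();
    int position = 0;
    while (ts.incrementToken()) {
      position += posIncr.getPositionIncrement();   // accumulate increments into an absolute position
      System.out.println(position + ": [" + term + "] "
          + offset.startOffset() + "-" + offset.endOffset());
    }
    ts.end();
    ts.close();
    // prints 1: [how] 0-3, 2: [are] 4-7, 3: [you] 8-11, ... each increment here is 1
  }
}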

All of these attributes are managed by a class called AttributeSource, which is where this information lives. It has a static inner class called State, which stores the current attribute state of the stream. Later on we can capture the current state with the following method:

  /**
   * Captures the state of all Attributes. The return value can be passed to
   * {@link #restoreState} to restore the state of this or another AttributeSource.
   */
  public State captureState() {
    final State state = this.getCurrentState();
    return (state == null) ? null : (State) state.clone();
  }
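One common use of captureState()/restoreState() is injecting a synonym at the same position as the original token. A hedged sketch (a hypothetical filter, not a Lucene class; the word pair "hi"/"hello" is just an example):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

// Hypothetical filter: whenever it sees "hi", it also emits "hello" at the same position.
public final class SimpleSynonymFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
  private AttributeSource.State savedState;          // state captured at the original token

  public SimpleSynonymFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (savedState != null) {                        // a synonym is pending: emit it now
      restoreState(savedState);                      // restores the term, offsets, etc. of "hi"
      savedState = null;
      termAtt.setEmpty().append("hello");            // overwrite the term text with the synonym
      posIncrAtt.setPositionIncrement(0);            // 0 = same position as the original token
      return true;
    }
    if (!input.incrementToken()) {
      return false;                                  // the wrapped stream is exhausted
    }
    if ("hi".equals(termAtt.toString())) {
      savedState = captureState();                   // remember the original token's attributes
    }
    return true;                                     // emit the original token first
  }
}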

Once we can get at the positional information of these tokens, we can do a lot of things: inject synonyms (add a token with the same offsets and a position increment of 0 so it sits at the same position as the original word), strip sensitive words, and so on. That wraps up this first summary.

If you repost this, please credit http://blog.csdn.net/a837199685/article


