The Lucene analysis (tokenization) workflow
I spent this past week studying Lucene/Solr, so today I want to sum things up and write down the important points, so that I won't be completely lost when I come back to this later.
This first article gives a brief introduction to the analysis (tokenization) workflow in Lucene and explains some of the basic principles behind it. If anything here is wrong, I would be grateful for corrections!
(1) The main analyzers
Lucene ships with several built-in analyzers such as WhitespaceAnalyzer, StopAnalyzer, SimpleAnalyzer, and KeywordAnalyzer. They all extend the same parent class, Analyzer, which declares an abstract method named tokenStream:
package org.apache.lucene.analysis;

import java.io.Reader;
import java.io.IOException;
import java.io.Closeable;
import java.lang.reflect.Modifier;
import org.apache.lucene.util.CloseableThreadLocal;
import org.apache.lucene.store.AlreadyClosedException;
import org.apache.lucene.document.Fieldable;

/** An Analyzer builds TokenStreams, which analyze text. It thus represents a
 * policy for extracting index terms from text.
 * <p>
 * Typical implementations first build a Tokenizer, which breaks the stream of
 * characters from the Reader into raw Tokens. One or more TokenFilters may
 * then be applied to the output of the Tokenizer.
 * <p>The {@code Analyzer}-API in Lucene is based on the decorator pattern.
 * Therefore all non-abstract subclasses must be final or their {@link #tokenStream}
 * and {@link #reusableTokenStream} implementations must be final! This is checked
 * when Java assertions are enabled.
 */
public abstract class Analyzer implements Closeable {

  // ..... only the key part of this class is shown here

  /** Creates a TokenStream which tokenizes all the text in the provided
   * Reader. Must be able to handle null field name for
   * backward compatibility. */
  public abstract TokenStream tokenStream(String fieldName, Reader reader);
}
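To make the tokenStream contract concrete, here is a minimal consumer sketch of my own, assuming Lucene 3.x (the version the source code in this post comes from; adjust the Version constant to your release). The field name "content" and the sample text are arbitrary; CharTermAttribute is explained in part (2) below.

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenStreamDemo {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_36);
    // tokenStream() is the abstract method shown above; it returns the analyzed stream
    TokenStream ts = analyzer.tokenStream("content", new StringReader("how are you thank you"));
    // CharTermAttribute (see part (2) below) exposes the text of the current token
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {   // advance token by token until the stream is exhausted
      System.out.println(term.toString());
    }
    ts.end();
    ts.close();
    // prints: how / are / you / thank / you
  }
}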
TokenStream has two subclasses, Tokenizer and TokenFilter. As the names suggest, a Tokenizer turns a piece of text (the Reader handed over by the Analyzer) into individual tokens, and that output is then passed through a chain of TokenFilters until a complete TokenStream is produced. See the class diagram below.
[Figure: class diagram of the common Tokenizer implementations]
Let's walk briefly through SimpleAnalyzer's flow. Here is its source code:
public final class SimpleAnalyzer extends ReusableAnalyzerBase {

  private final Version matchVersion;

  /**
   * Creates a new {@link SimpleAnalyzer}
   * @param matchVersion Lucene version to match See {@link <a href="#version">above</a>}
   */
  public SimpleAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
  }

  /**
   * Creates a new {@link SimpleAnalyzer}
   * @deprecated use {@link #SimpleAnalyzer(Version)} instead
   */
  @Deprecated
  public SimpleAnalyzer() {
    this(Version.LUCENE_30);
  }

  @Override
  protected TokenStreamComponents createComponents(final String fieldName,
      final Reader reader) {
    // Overrides the parent's factory method for building the TokenStream and passes in
    // a LowerCaseTokenizer -- this is why all the letters end up lowercased.
    return new TokenStreamComponents(new LowerCaseTokenizer(matchVersion, reader));
  }
}
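As a quick illustration of that comment, here is a small sketch of my own (again assuming Lucene 3.x) showing LowerCaseTokenizer at work: SimpleAnalyzer keeps only letters and lowercases them, so digits and punctuation disappear from the output. The sample text is made up.

import java.io.StringReader;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class SimpleAnalyzerDemo {
  public static void main(String[] args) throws Exception {
    TokenStream ts = new SimpleAnalyzer(Version.LUCENE_36)
        .tokenStream("content", new StringReader("Hello-World 2016 Lucene"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString());  // hello, world, lucene -- "2016" is dropped
    }
    ts.end();
    ts.close();
  }
}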
As the class diagram above shows, these tokenizers share a common parent called CharTokenizer. What does this class do? As the name suggests, it splits the character stream into tokens. TokenStream itself declares a method called incrementToken:
public abstract class TokenStream extends AttributeSource implements Closeable {

  /**
   * Consumers (i.e., {@link IndexWriter}) use this method to advance the stream to
   * the next token. Implementing classes must implement this method and update
   * the appropriate {@link AttributeImpl}s with the attributes of the next
   * token.
   * <P>
   * The producer must make no assumptions about the attributes after the method
   * has been returned: the caller may arbitrarily change it. If the producer
   * needs to preserve the state for subsequent calls, it can use
   * {@link #captureState} to create a copy of the current attribute state.
   * <p>
   * This method is called for every token of a document, so an efficient
   * implementation is crucial for good performance. To avoid calls to
   * {@link #addAttribute(Class)} and {@link #getAttribute(Class)},
   * references to all {@link AttributeImpl}s that this stream uses should be
   * retrieved during instantiation.
   * <p>
   * To ensure that filters and consumers know which attributes are available,
   * the attributes must be added during instantiation. Filters and consumers
   * are not required to check for availability of attributes in
   * {@link #incrementToken()}.
   *
   * @return false for end of stream; true otherwise
   */
  public abstract boolean incrementToken() throws IOException;
}
The Javadoc already says it clearly: consumers "use this method to advance the stream to the next token", and implementing classes "must implement this method and update the appropriate AttributeImpls". In other words, every subclass has to provide it; it advances to the next token and returns a boolean telling the consumer whether another token exists. Below is CharTokenizer's incrementToken method:
@Override
public final boolean incrementToken() throws IOException {
  clearAttributes();
  if (useOldAPI) // TODO remove this in LUCENE 4.0
    return incrementTokenOld();
  int length = 0;
  int start = -1; // this variable is always initialized
  char[] buffer = termAtt.buffer();
  while (true) {
    if (bufferIndex >= dataLen) {
      offset += dataLen;
      if (!charUtils.fill(ioBuffer, input)) { // read supplementary char aware with CharacterUtils
        dataLen = 0; // so next offset += dataLen won't decrement offset
        if (length > 0) {
          break;
        } else {
          finalOffset = correctOffset(offset);
          return false;
        }
      }
      dataLen = ioBuffer.getLength();
      bufferIndex = 0;
    }
    // use CharacterUtils here to support < 3.1 UTF-16 code unit behavior if the char based methods are gone
    final int c = charUtils.codePointAt(ioBuffer.getBuffer(), bufferIndex);
    bufferIndex += Character.charCount(c);

    if (isTokenChar(c)) {                     // if it's a token char
      if (length == 0) {                      // start of token
        assert start == -1;
        start = offset + bufferIndex - 1;
      } else if (length >= buffer.length - 1) { // check if a supplementary could run out of bounds
        buffer = termAtt.resizeBuffer(2 + length); // make sure a supplementary fits in the buffer
      }
      length += Character.toChars(normalize(c), buffer, length); // buffer it, normalized
      if (length >= MAX_WORD_LEN) // buffer overflow! make sure to check for >= surrogate pair could break == test
        break;
    } else if (length > 0)                    // at non-Letter w/ chars
      break;                                  // return 'em
  }

  termAtt.setLength(length);
  assert start != -1;
  offsetAtt.setOffset(correctOffset(start), finalOffset = correctOffset(start + length));
  return true;
}
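The two hooks that matter in this loop are isTokenChar() (decides whether a character belongs to a token) and normalize() (transforms every kept code point). To sketch how they fit together, here is a hypothetical CharTokenizer subclass of my own, LetterOrDigitTokenizer, that keeps letters and digits and lowercases them; it assumes the Lucene 3.x CharTokenizer constructor seen above.

import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.util.Version;

public final class LetterOrDigitTokenizer extends CharTokenizer {

  public LetterOrDigitTokenizer(Version matchVersion, Reader input) {
    super(matchVersion, input);
  }

  @Override
  protected boolean isTokenChar(int c) {
    // letters and digits belong to a token; anything else ends the current token
    return Character.isLetterOrDigit(c);
  }

  @Override
  protected int normalize(int c) {
    // lowercase every kept code point, just like LowerCaseTokenizer does
    return Character.toLowerCase(c);
  }
}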
So in SimpleAnalyzer the stream comes from a single Tokenizer (LowerCaseTokenizer, which inherits its splitting logic from CharTokenizer) and is not wrapped by any filter. Compare that with StopAnalyzer's createComponents method:
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
  // the stream goes through a LowerCaseTokenizer and then a StopFilter
  final Tokenizer source = new LowerCaseTokenizer(matchVersion, reader);
  return new TokenStreamComponents(source, new StopFilter(matchVersion, source, stopwords));
}
Only after passing through these steps does the text finally become a complete TokenStream. So what are the common filters?
Common filters include StopFilter, LowerCaseFilter, and others; only after the Tokenizer's output has been run through these filters do we get the final TokenStream.
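Here is a minimal sketch of composing such a chain by hand, assuming Lucene 3.x; MyFilteringAnalyzer is just an illustrative name, not a Lucene class. It uses the same decorator pattern as StopAnalyzer above: a WhitespaceTokenizer wrapped first by a LowerCaseFilter and then by a StopFilter.

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public final class MyFilteringAnalyzer extends Analyzer {

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new WhitespaceTokenizer(Version.LUCENE_36, reader); // raw tokens
    result = new LowerCaseFilter(Version.LUCENE_36, result);                 // lowercase them
    result = new StopFilter(Version.LUCENE_36, result,                       // then drop stop words
        StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    return result;
  }
}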
(2) How token information is stored
Three classes are essential here: CharTermAttribute (stores the token text itself), OffsetAttribute (stores the token's start and end character offsets), and PositionIncrementAttribute (stores the position increment between one token and the previous one).
With these three attributes we can pinpoint exactly where each token sits in a document. For example, the sentence "how are you thank you" is actually stored in Lucene as a sequence of tokens, each carrying its text, its offsets, and a position increment (the end offset of "how" is 3, "are" spans 4 to 7, and so on).
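Here is a small sketch, assuming Lucene 3.x and a WhitespaceAnalyzer, that prints all three attributes for that sentence; the expected output in the trailing comments shows the offsets and position increments just described.

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.Version;

public class AttributeDemo {
  public static void main(String[] args) throws Exception {
    TokenStream ts = new WhitespaceAnalyzer(Version.LUCENE_36)
        .tokenStream("content", new StringReader("how are you thank you"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
    PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString() + " [" + offset.startOffset() + "-"
          + offset.endOffset() + "] +" + posIncr.getPositionIncrement());
    }
    ts.end();
    ts.close();
    // Expected output:
    // how [0-3] +1
    // are [4-7] +1
    // you [8-11] +1
    // thank [12-17] +1
    // you [18-21] +1
  }
}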
All of these attributes are managed by a class called AttributeSource, which is where this information lives. It has a static inner class called State that stores the attribute state of the current stream. Later on we can capture the current state with the following method:
/**
 * Captures the state of all Attributes. The return value can be passed to
 * {@link #restoreState} to restore the state of this or another AttributeSource.
 */
public State captureState() {
  final State state = this.getCurrentState();
  return (state == null) ? null : (State) state.clone();
}
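To illustrate captureState()/restoreState() together with PositionIncrementAttribute, here is a minimal synonym-injecting filter of my own (not Lucene's real SynonymFilter), assuming Lucene 3.x: whenever the token "you" appears, it emits the made-up synonym "u" at the same position as the original.

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public final class SimpleSynonymFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
  private AttributeSource.State savedState;   // state captured while a synonym is pending
  private String pendingSynonym;

  public SimpleSynonymFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pendingSynonym != null) {
      restoreState(savedState);               // reuse the original token's offsets
      termAtt.setEmpty().append(pendingSynonym);
      posIncrAtt.setPositionIncrement(0);     // increment 0 -> same position as the original
      pendingSynonym = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;                           // end of stream
    }
    if ("you".equals(termAtt.toString())) {   // made-up synonym rule for this example
      pendingSynonym = "u";
      savedState = captureState();            // remember the current attribute state
    }
    return true;
  }
}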
Once we have each token's position information, we can do a lot of useful things: inject synonyms (add an extra token at the same offsets and the same position as the original, which in practice means a position increment of 0, as sketched above), strip out sensitive words, and so on. That wraps up this first summary.
Please credit http://blog.csdn.net/a837199685/article when reposting.