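Nutch Source Code Study: Page Fetching Data Structures
Today we look at the data structures Nutch uses for page fetching.
The main classes involved are FetchListEntry and Page, with a brief look at FetcherOutput and Content at the end.
Let's start with the FetchListEntry class: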
public final class FetchListEntry implements Writable, Cloneable
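It implements the Writable and Cloneable interfaces, as many Nutch classes do.
Having each class handle its own reading and writing is a sensible design; splitting that logic out into separate classes would feel fragmented.
Here are its member variables: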
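public static final String DIR_NAME = "fetchlist"; // directory name used when writing to disk
private final static byte CUR_VERSION = 2;         // current version number
private boolean fetch;                             // whether to fetch the page so it can be updated later
private Page page;                                 // the page this entry refers to
private String[] anchors;                          // anchors (link text) recorded for this page
Now let's see how the fields are read, that is, the method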
public final void readFields(DataInput in) throws IOException
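It first reads the version byte and throws a version-mismatch exception if the stored version is greater than the current version,
then reads the fetch and page fields.
If the version is greater than 1, the anchors were written to the record, so they are read back; otherwise anchors is set to an empty String array.
The code: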
byte version = in.readByte();              // read version
if (version > CUR_VERSION)                 // check version
  throw new VersionMismatchException(CUR_VERSION, version);

fetch = in.readByte() != 0;                // read fetch flag

page = Page.read(in);                      // read page

if (version > 1) {                         // anchors added in version 2
  anchors = new String[in.readInt()];      // read anchors
  for (int i = 0; i < anchors.length; i++) {
    anchors[i] = UTF8.readString(in);
  }
} else {
  anchors = new String[0];
}
There is also a static method that reads the fields, builds a FetchListEntry object, and returns it:
public static FetchListEntry read(DataInput in) throws IOException {
  FetchListEntry result = new FetchListEntry();
  result.readFields(in);
  return result;
}
The write code is easier to read; it writes each field in turn:
public final void write(DataOutput out) throws IOException {
  out.writeByte(CUR_VERSION);                   // store current version
  out.writeByte((byte)(fetch ? 1 : 0));         // write fetch flag
  page.write(out);                              // write page
  out.writeInt(anchors.length);                 // write anchors
  for (int i = 0; i < anchors.length; i++) {
    UTF8.writeString(out, anchors[i]);
  }
}
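To make the version-first layout concrete, here is a minimal round-trip sketch using only java.io. EntrySketch is a hypothetical stand-in, not the Nutch class: a plain url string replaces the Page field, writeUTF/readUTF replace Nutch's UTF8 helpers, and a plain IOException replaces VersionMismatchException. The toBytes and fromBytes helpers are also made up for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for FetchListEntry; not the Nutch class.
final class EntrySketch {
    private static final byte CUR_VERSION = 2;

    boolean fetch;
    String url = "";                   // assumption: stands in for the Page field
    String[] anchors = new String[0];

    void write(DataOutput out) throws IOException {
        out.writeByte(CUR_VERSION);    // version always goes first
        out.writeByte(fetch ? 1 : 0);  // fetch flag as a single byte
        out.writeUTF(url);             // the "page"
        out.writeInt(anchors.length);  // anchor count, then each anchor
        for (String anchor : anchors) {
            out.writeUTF(anchor);
        }
    }

    void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        if (version > CUR_VERSION) {   // refuse records newer than this code
            throw new IOException("record version " + version
                    + " is newer than " + CUR_VERSION);
        }
        fetch = in.readByte() != 0;
        url = in.readUTF();
        if (version > 1) {             // anchors exist only from version 2 on
            anchors = new String[in.readInt()];
            for (int i = 0; i < anchors.length; i++) {
                anchors[i] = in.readUTF();
            }
        } else {
            anchors = new String[0];
        }
    }

    // Demo helper: serialize an entry into a byte array.
    static byte[] toBytes(EntrySketch entry) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            entry.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Demo helper: rebuild an entry from a byte array.
    static EntrySketch fromBytes(byte[] data) {
        EntrySketch entry = new EntrySketch();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            entry.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entry;
    }

    public static void main(String[] args) {
        EntrySketch entry = new EntrySketch();
        entry.fetch = true;
        entry.url = "http://example.com/";
        entry.anchors = new String[] {"home", "start page"};
        EntrySketch copy = fromBytes(toBytes(entry));
        System.out.println(copy.url + " anchors=" + copy.anchors.length);
    }
}
```

Because the version byte is written before everything else, the reader can decide how to interpret the rest of the record before touching any other field.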
The clone and equals implementations are also easy to follow.
Next, the Page class:
public class Page implements WritableComparable, Cloneable
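Like FetchListEntry, it is Writable and Cloneable (here through WritableComparable, as the declaration shows). The comment in the Nutch source makes the meaning of each field easy to see: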
/*********************************************
 * A row in the Page Database.
 * <pre>
 *   type   name      description
 * ---------------------------------------------------------------
 *   byte   VERSION   - A byte indicating the version of this entry.
 *   String URL       - The url of a page.  This is the primary key.
 *   128bit ID        - The MD5 hash of the contents of the page.
 *   64bit  DATE      - The date this page should be refetched.
 *   byte   RETRIES   - The number of times we've failed to fetch this page.
 *   byte   INTERVAL  - Frequency, in days, this page should be refreshed.
 *   float  SCORE     - Multiplied into the score for hits on this page.
 *   float  NEXTSCORE - Multiplied into the score for hits on this page.
 * </pre>
 *
 * @author Mike Cafarella
 * @author Doug Cutting
 *********************************************/
The fields:
private final static byte CUR_VERSION = 4;
private static final byte DEFAULT_INTERVAL =
  (byte)NutchConf.get().getInt("db.default.fetch.interval", 30);

private UTF8 url;
private MD5Hash md5;
private long nextFetch = System.currentTimeMillis();
private byte retries;
private byte fetchInterval = DEFAULT_INTERVAL;
private int numOutlinks;
private float score = 1.0f;
private float nextScore = 1.0f;
Let's look at how it reads its own fields. With the comments already in the source, the code is easy to follow, so we won't go into detail:
public void readFields(DataInput in) throws IOException {
  byte version = in.readByte();              // read version
  if (version > CUR_VERSION)                 // check version
    throw new VersionMismatchException(CUR_VERSION, version);

  url.readFields(in);
  md5.readFields(in);
  nextFetch = in.readLong();
  retries = in.readByte();
  fetchInterval = in.readByte();
  numOutlinks = (version > 2) ? in.readInt() : 0;    // added in Version 3
  score = (version>1) ? in.readFloat() : 1.0f;       // score added in version 2
  nextScore = (version>3) ? in.readFloat() : 1.0f;   // 2nd score added in V4
}
Writing the fields is just as direct:
public void write(DataOutput out) throws IOException {
  out.writeByte(CUR_VERSION);                // store current version
  url.write(out);
  md5.write(out);
  out.writeLong(nextFetch);
  out.write(retries);
  out.write(fetchInterval);
  out.writeInt(numOutlinks);
  out.writeFloat(score);
  out.writeFloat(nextScore);
}
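The version checks in readFields are what let newer code read records written by older code. The sketch below illustrates this with a hypothetical PageSketch class: it drops the md5 field, uses a plain String with writeUTF/readUTF instead of Nutch's UTF8, and throws IOException instead of VersionMismatchException. The legacyV1Bytes helper is an assumption about what a version-1 writer would have produced, based only on the version comments in readFields.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical, simplified stand-in for Page (no md5 field); not the Nutch class.
final class PageSketch {
    static final byte CUR_VERSION = 4;

    String url = "";
    long nextFetch;
    byte retries;
    byte fetchInterval = 30;
    int numOutlinks;
    float score = 1.0f;
    float nextScore = 1.0f;

    void write(DataOutput out) throws IOException {
        out.writeByte(CUR_VERSION);    // always written in the current format
        out.writeUTF(url);
        out.writeLong(nextFetch);
        out.writeByte(retries);
        out.writeByte(fetchInterval);
        out.writeInt(numOutlinks);
        out.writeFloat(score);
        out.writeFloat(nextScore);
    }

    void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        if (version > CUR_VERSION) {
            throw new IOException("unsupported version " + version);
        }
        url = in.readUTF();
        nextFetch = in.readLong();
        retries = in.readByte();
        fetchInterval = in.readByte();
        // Fields added later fall back to defaults for older records.
        numOutlinks = (version > 2) ? in.readInt() : 0;
        score = (version > 1) ? in.readFloat() : 1.0f;
        nextScore = (version > 3) ? in.readFloat() : 1.0f;
    }

    // Assumed layout of a version-1 record: no outlink count, no scores.
    static byte[] legacyV1Bytes(String url, long nextFetch, int retries, int interval) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeByte(1);
            out.writeUTF(url);
            out.writeLong(nextFetch);
            out.writeByte(retries);
            out.writeByte(interval);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Demo helper: serialize a page in the current format.
    static byte[] toBytes(PageSketch page) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            page.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Demo helper: rebuild a page from bytes of any supported version.
    static PageSketch fromBytes(byte[] data) {
        PageSketch page = new PageSketch();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            page.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return page;
    }

    public static void main(String[] args) {
        PageSketch old = fromBytes(legacyV1Bytes("http://example.com/", 0L, 1, 30));
        System.out.println("outlinks=" + old.numOutlinks + " score=" + old.score);
    }
}
```

Note that write always emits the current version, so an old record is upgraded to the newest layout the next time it is saved.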
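While we are here, a quick word on FetcherOutput, the convenience class for reading and writing fetched results. It delegates reading and writing to the two classes described above, and so supports reading and writing fetched data at several levels of granularity. The code is straightforward, so we won't go into detail.
One more class to cover is Content: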
public final class Content extends VersionedWritable
It extends VersionedWritable, which handles reading and writing the version field.
First, the member variables:
public static final String DIR_NAME = "content";
private final static byte VERSION = 1;
private String url;
private String base;
private byte[] content;
private String contentType;
private Properties metadata;
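DIR_NAME is the directory where Content is stored,
VERSION is the version constant,
url is the URL of the page this Content belongs to,
base is the base URL of that page,
content is the fetched content itself, held as a byte array,
contentType is the content type of that page,
metadata is the meta information for that page.
Now let's see how Content reads and writes its own fields: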
public final void readFields(DataInput in) throws IOException
This method reads each of the object's own fields:
super.readFields(in);                         // check version

url = UTF8.readString(in);                    // read url
base = UTF8.readString(in);                   // read base

content = WritableUtils.readCompressedByteArray(in);

contentType = UTF8.readString(in);            // read contentType

int propertyCount = in.readInt();             // read metadata
metadata = new Properties();
for (int i = 0; i < propertyCount; i++) {
  metadata.put(UTF8.readString(in), UTF8.readString(in));
}
With the comments, the code is fairly clear.
super.readFields(in);
This call has the parent class, VersionedWritable, read and validate the version number.
The write code is also simple:
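The idea of pushing version handling into a base class can be sketched as follows. This is not the VersionedWritable source: VersionedRecord and Header are hypothetical names, the check assumes the base class rejects any stored version that differs from its own, and a plain IOException stands in for VersionMismatchException.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of a version-checking base class; not Nutch code.
final class VersionedSketch {

    // Base class: owns the version byte so subclasses do not repeat the check.
    static abstract class VersionedRecord {
        abstract byte getVersion();

        void write(DataOutput out) throws IOException {
            out.writeByte(getVersion());
        }

        void readFields(DataInput in) throws IOException {
            byte version = in.readByte();
            if (version != getVersion()) {   // assumption: exact match required
                throw new IOException("version mismatch: expected "
                        + getVersion() + ", found " + version);
            }
        }
    }

    // Subclass: calls super first, then handles only its own fields.
    static final class Header extends VersionedRecord {
        private static final byte VERSION = 1;
        String url = "";
        String contentType = "";

        @Override
        byte getVersion() {
            return VERSION;
        }

        @Override
        void write(DataOutput out) throws IOException {
            super.write(out);                // write version
            out.writeUTF(url);
            out.writeUTF(contentType);
        }

        @Override
        void readFields(DataInput in) throws IOException {
            super.readFields(in);            // check version
            url = in.readUTF();
            contentType = in.readUTF();
        }
    }

    // Demo helper: serialize a header into a byte array.
    static byte[] toBytes(Header header) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            header.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Demo helper: rebuild a header, failing if the version byte is wrong.
    static Header fromBytes(byte[] data) {
        Header header = new Header();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            header.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return header;
    }

    public static void main(String[] args) {
        Header header = new Header();
        header.url = "http://example.com/";
        header.contentType = "text/html";
        Header copy = fromBytes(toBytes(header));
        System.out.println(copy.url + " " + copy.contentType);
    }
}
```

Compared with FetchListEntry and Page, which each repeat the version check inline, this arrangement keeps the check in one place at the cost of a less flexible comparison.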
public final void write(DataOutput out) throws IOException {
  super.write(out);                             // write version

  UTF8.writeString(out, url);                   // write url
  UTF8.writeString(out, base);                  // write base

  WritableUtils.writeCompressedByteArray(out, content); // write content

  UTF8.writeString(out, contentType);           // write contentType

  out.writeInt(metadata.size());                // write metadata
  Iterator i = metadata.entrySet().iterator();
  while (i.hasNext()) {
    Map.Entry e = (Map.Entry)i.next();
    UTF8.writeString(out, (String)e.getKey());
    UTF8.writeString(out, (String)e.getValue());
  }
}
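The two less obvious parts of Content's format are the compressed byte array and the count-prefixed metadata. The sketch below shows one way both could work using only the JDK. ContentSketch is hypothetical, and the gzip-plus-length-prefix layout is an assumption made for illustration; it is not taken from the WritableUtils source.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.Set;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for the content and metadata parts of Content.
final class ContentSketch {
    byte[] content = new byte[0];
    Properties metadata = new Properties();

    void write(DataOutput out) throws IOException {
        byte[] packed = gzip(content);         // assumed layout: length, then gzip bytes
        out.writeInt(packed.length);
        out.write(packed);
        Set<String> names = metadata.stringPropertyNames();
        out.writeInt(names.size());            // entry count, then key/value pairs
        for (String name : names) {
            out.writeUTF(name);
            out.writeUTF(metadata.getProperty(name));
        }
    }

    void readFields(DataInput in) throws IOException {
        byte[] packed = new byte[in.readInt()];
        in.readFully(packed);
        content = gunzip(packed);
        int count = in.readInt();
        metadata = new Properties();
        for (int i = 0; i < count; i++) {
            String name = in.readUTF();        // key is read before value
            String value = in.readUTF();
            metadata.setProperty(name, value);
        }
    }

    private static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        }
        return buffer.toByteArray();
    }

    private static byte[] gunzip(byte[] packed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        return buffer.toByteArray();
    }

    // Demo helper: serialize into a byte array.
    static byte[] toBytes(ContentSketch c) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            c.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Demo helper: rebuild from a byte array.
    static ContentSketch fromBytes(byte[] data) {
        ContentSketch c = new ContentSketch();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            c.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return c;
    }

    public static void main(String[] args) {
        ContentSketch c = new ContentSketch();
        c.content = "<html><body>hello</body></html>".getBytes();
        c.metadata.setProperty("Content-Type", "text/html");
        ContentSketch copy = fromBytes(toBytes(c));
        System.out.println(new String(copy.content) + " " + copy.metadata);
    }
}
```

Writing the entry count before the pairs is what lets the reader loop a known number of times without needing an end marker.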
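What really matters in these classes is their fields, and how the domain model is divided among them.
Next time we will look at the parse-html plugin to see how Nutch extracts content from HTML pages.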