Nutch Source Code Study: Web Page Fetching and the Download Plugin
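Today we look at how the protocol-http plugin in the Nutch source code fetches and downloads web pages. protocol-http consists of just two classes, HttpResponse and Http. HttpResponse sends the request to the web server and reads the response, which is how the page gets downloaded. The Http class is very simple: it is essentially a Facade over HttpResponse that sets the configuration and then creates an HttpResponse. Callers apparently only need to deal with the Http class (I have not read all of the code, so this is a guess).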
Let's look at the HttpResponse class.
Reading this class's source is best started from the constructor:
public HttpResponse(HttpBase http, URL url, CrawlDatum datum) throws ProtocolException, IOException
It first checks that the protocol is http:
if (!"http".equals(url.getProtocol()))
  throw new HttpException("Not an HTTP url:" + url);
Next it gets the path: if url.getFile() is empty the path is "/", otherwise it is url.getFile():
String path = "".equals(url.getFile()) ? "/" : url.getFile();
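Then it takes the host name and port from the URL. If the URL has no explicit port (url.getPort() returns -1), the port defaults to 80 and portString is "", so the request address carries no port number. Otherwise it uses the given port and sets portString to ":" followed by the port:
String host = url.getHost();
int port;
String portString;
if (url.getPort() == -1) {
  port= 80;
  portString= "";
} else {
  port= url.getPort();
  portString= ":" + port;
}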
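To see what these getters return in practice, here is a small self-contained sketch using only java.net.URL. The class name UrlParts and the helper of() are made up for this illustration and are not part of Nutch.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical helper (not Nutch code): derives host, port, portString and
// path from a URL the same way the constructor described above does.
final class UrlParts {
    final String host;
    final int port;
    final String portString;
    final String path;

    UrlParts(URL url) {
        host = url.getHost();
        // getFile() is the path plus any query string; it is "" for "http://example.com"
        path = "".equals(url.getFile()) ? "/" : url.getFile();
        if (url.getPort() == -1) {      // no explicit port in the URL
            port = 80;
            portString = "";
        } else {
            port = url.getPort();
            portString = ":" + port;
        }
    }

    // Convenience factory for the demo; wraps the checked exception.
    static UrlParts of(String spec) {
        try {
            return new UrlParts(URI.create(spec).toURL());
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        UrlParts a = UrlParts.of("http://example.com");
        System.out.println(a.host + " " + a.port + " [" + a.portString + "] " + a.path);
        UrlParts b = UrlParts.of("http://example.com:8080/a/b?q=1");
        System.out.println(b.host + " " + b.port + " [" + b.portString + "] " + b.path);
    }
}
```

Note that getFile() keeps the query string, so the path sent in the request line includes it.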
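Then it creates the socket and sets the read timeout (setSoTimeout limits how long a read may block; the connect timeout is passed to connect() further down):
socket = new Socket();                    // create the socket
socket.setSoTimeout(http.getTimeout());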
Depending on whether a proxy is used, it picks sockHost and sockPort:
String sockHost = http.useProxy() ? http.getProxyHost() : host;
int sockPort = http.useProxy() ? http.getProxyPort() : port;
It creates an InetSocketAddress and opens the connection:
InetSocketAddress sockAddr= new InetSocketAddress(sockHost, sockPort);
socket.connect(sockAddr, http.getTimeout());
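The two timeouts involved here are easy to mix up, so the following sketch separates them: the value passed to connect() bounds the connection attempt, while setSoTimeout() bounds each blocking read. It is only an illustration under assumptions: the class SocketTimeoutDemo is hypothetical, and it talks to a throwaway ServerSocket on the loopback interface instead of a real web server.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo (not Nutch code): connect with a connect timeout, then
// show that a read on a silent peer fails after the SO_TIMEOUT elapses.
final class SocketTimeoutDemo {

    /** Returns true if the read timed out, false if it returned normally. */
    static boolean readTimesOut(int timeoutMillis) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket socket = new Socket()) {
            socket.setSoTimeout(timeoutMillis);              // read timeout
            InetSocketAddress addr =
                new InetSocketAddress(server.getInetAddress(), server.getLocalPort());
            socket.connect(addr, timeoutMillis);             // connect timeout
            try {
                socket.getInputStream().read();              // the server never sends data
                return false;
            } catch (SocketTimeoutException e) {
                return true;                                 // read gave up after timeoutMillis
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```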
It gets the output stream used to send the request:
// make request
OutputStream req = socket.getOutputStream();
The following code sends the GET request to the server. When a proxy is used, the request line carries the full URL; otherwise it carries only the path:
StringBuffer reqStr = new StringBuffer("GET ");
if (http.useProxy()) {
  reqStr.append(url.getProtocol()+"://"+host+portString+path);
} else {
  reqStr.append(path);
}

reqStr.append(" HTTP/1.0\r\n");
reqStr.append("Host: ");
reqStr.append(host);
reqStr.append(portString);
reqStr.append("\r\n");
reqStr.append("Accept-Encoding: x-gzip, gzip\r\n");
String userAgent = http.getUserAgent();
if ((userAgent == null) || (userAgent.length() == 0)) {
  if (Http.LOG.isFatalEnabled()) { Http.LOG.fatal("User-agent is not set!"); }
} else {
  reqStr.append("User-Agent: ");
  reqStr.append(userAgent);
  reqStr.append("\r\n");
}
reqStr.append("\r\n");
byte[] reqBytes= reqStr.toString().getBytes();
req.write(reqBytes);
req.flush();
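To make the wire format concrete, the sketch below builds the same request text as a plain string, so it can be printed and inspected without opening a socket. RequestBuilder, buildGet() and the user agent "demo-agent" are names invented for this example; they do not exist in Nutch.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical helper (not Nutch code): assembles the HTTP/1.0 GET request
// described above and returns it as a String.
final class RequestBuilder {

    static String buildGet(URL url, boolean useProxy, String userAgent) {
        String host = url.getHost();
        String portString = url.getPort() == -1 ? "" : ":" + url.getPort();
        String path = "".equals(url.getFile()) ? "/" : url.getFile();

        StringBuilder req = new StringBuilder("GET ");
        if (useProxy) {
            // a proxy needs the absolute URL in the request line
            req.append(url.getProtocol()).append("://").append(host)
               .append(portString).append(path);
        } else {
            req.append(path);
        }
        req.append(" HTTP/1.0\r\n");
        req.append("Host: ").append(host).append(portString).append("\r\n");
        req.append("Accept-Encoding: x-gzip, gzip\r\n");
        if (userAgent != null && !userAgent.isEmpty()) {
            req.append("User-Agent: ").append(userAgent).append("\r\n");
        }
        req.append("\r\n");                 // blank line ends the header section
        return req.toString();
    }

    // Demo wrapper that hides the checked exception.
    static String demo(String spec, boolean useProxy) {
        try {
            return buildGet(URI.create(spec).toURL(), useProxy, "demo-agent");
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo("http://example.com/index.html", false));
        System.out.print(demo("http://example.com:8080/index.html", true));
    }
}
```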
Next comes response handling. The input stream is wrapped in a PushbackInputStream to make it easier to work with:
PushbackInputStream in =                  // process response
  new PushbackInputStream(
    new BufferedInputStream(socket.getInputStream(), Http.BUFFER_SIZE),
    Http.BUFFER_SIZE);
It extracts the status code and the HTTP headers of the response. The loop repeats as long as the server answers with status 100 (Continue):
boolean haveSeenNonContinueStatus= false;
while (!haveSeenNonContinueStatus) {
  // parse status code line
  this.code = parseStatusLine(in, line);
  // parse headers
  parseHeaders(in, line);
  haveSeenNonContinueStatus= code != 100; // 100 is "Continue"
}
Then it reads the content:
readPlainContent(in);
It checks the content encoding and, if the content is gzip-compressed, decompresses it:
String contentEncoding = getHeader(Response.CONTENT_ENCODING);
if ("gzip".equals(contentEncoding) || "x-gzip".equals(contentEncoding)) {
  content = http.processGzipEncoded(content, url);
} else {
  if (Http.LOG.isTraceEnabled()) {
    Http.LOG.trace("fetched " + content.length + " bytes from " + url);
  }
}
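The article does not show what processGzipEncoded does internally, so the following is only a guess at the general idea using java.util.zip from the JDK: if the Content-Encoding header says gzip or x-gzip, the body bytes are inflated, otherwise they are returned unchanged. GzipDemo and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch (not the Nutch implementation): decode a response body
// according to its Content-Encoding header value.
final class GzipDemo {

    /** Compresses raw bytes; only needed here to produce test input. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Inflates gzip-compressed bytes. */
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            for (int n = gz.read(buf); n != -1; n = gz.read(buf)) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Mirrors the branch above: inflate only for gzip / x-gzip. */
    static byte[] decodeBody(String contentEncoding, byte[] body) {
        if ("gzip".equals(contentEncoding) || "x-gzip".equals(contentEncoding)) {
            return gunzip(body);
        }
        return body;
    }

    public static void main(String[] args) {
        byte[] raw = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        byte[] decoded = decodeBody("gzip", gzip(raw));
        System.out.println(new String(decoded, StandardCharsets.UTF_8));
    }
}
```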
That completes the whole process.
Now let's look at how parseStatusLine, parseHeaders, readPlainContent and readChunkedContent work.
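private int parseStatusLine(PushbackInputStream in, StringBuffer line)
throws IOException, HttpException
This function extracts the status of the response, for example the status code in 200 OK.
The status line of a response (for an OK response, say) generally looks like "HTTP/1.1 200" or "HTTP/1.1 200 OK".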
int codeStart = line.indexOf(" ");
int codeEnd = line.indexOf(" ", codeStart+1);
In the first case (no reason phrase after the code):
if (codeEnd == -1)
  codeEnd = line.length();
the status code (200) ends at line.length().
Otherwise the status code ends at line.indexOf(" ", codeStart+1).
The status code is then extracted:
int code;
try {
  code= Integer.parseInt(line.substring(codeStart+1, codeEnd));
} catch (NumberFormatException e) {
  throw new HttpException("bad status line '" + line
                          + "': " + e.getMessage(), e);
}
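The index arithmetic is easier to check on concrete strings. The sketch below applies the same steps to a String and throws IllegalArgumentException instead of Nutch's HttpException, so that it compiles on its own; StatusLine and parseCode() are names invented for this example.

```java
// Hypothetical helper (not Nutch code): extracts the numeric status code
// from a status line such as "HTTP/1.1 200" or "HTTP/1.1 200 OK".
final class StatusLine {

    static int parseCode(String line) {
        int codeStart = line.indexOf(' ');              // space after the HTTP version
        int codeEnd = line.indexOf(' ', codeStart + 1); // space before the reason phrase
        if (codeEnd == -1) {
            codeEnd = line.length();                    // no reason phrase present
        }
        try {
            return Integer.parseInt(line.substring(codeStart + 1, codeEnd));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                "bad status line '" + line + "': " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseCode("HTTP/1.1 200"));
        System.out.println(parseCode("HTTP/1.1 404 Not Found"));
    }
}
```

A line without any space, such as "garbage", ends up in the NumberFormatException branch because the whole line is handed to Integer.parseInt.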
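Next, let's look at
private void parseHeaders(PushbackInputStream in, StringBuffer line)
throws IOException, HttpException
It reads the headers in a loop.
In an HTTP response the header section and the body are normally separated by a blank line. For a blank line readLine returns 0 as the number of characters read; the readLine implementation is examined in detail after this function:
while (readLine(in, line, true) != 0)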
If the blank line is missing, the body follows the headers immediately. The body usually begins with <!DOCTYPE, <HTML or <html, so if a line contains one of these, the header section is over.
// handle HTTP responses with missing blank line after headers
int pos;
if ( ((pos= line.indexOf("<!DOCTYPE")) != -1)
     || ((pos= line.indexOf("<HTML")) != -1)
     || ((pos= line.indexOf("<html")) != -1) ) {
The part that was read too far is pushed back into the stream, and the line is truncated to length pos:
  in.unread(line.substring(pos).getBytes("UTF-8"));
  line.setLength(pos);
Handling of the line is then delegated to processHeaderLine(line):
  try {
    //TODO: (CM) We don't know the header names here
    //since we're just handling them generically. It would
    //be nice to provide some sort of mapping function here
    //for the returned header names to the standard metadata
    //names in the ParseData class
    processHeaderLine(line);
  } catch (Exception e) {
    // fixme:
    e.printStackTrace(LogUtil.getErrorStream(Http.LOG));
  }
  return;
}
processHeaderLine(line);
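The push-back step is the reason a PushbackInputStream is used at all, so here is a minimal sketch of it in isolation. It simulates a line that was read too far ("Server: demo<html>") while the rest of the body is still in the stream. PushbackDemo and its method names are hypothetical, and the pushback buffer size of 64 is an arbitrary value for the demo.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo (not Nutch code): when a header line runs into the body,
// push the body part back so the content reader sees it again.
final class PushbackDemo {

    /** Truncates line at the body marker and unreads the rest into the stream. */
    static void splitHeaderFromBody(PushbackInputStream in, StringBuilder line)
            throws IOException {
        int pos;
        if ((pos = line.indexOf("<!DOCTYPE")) != -1
                || (pos = line.indexOf("<HTML")) != -1
                || (pos = line.indexOf("<html")) != -1) {
            // the pushback buffer must be large enough for these bytes
            in.unread(line.substring(pos).getBytes(StandardCharsets.UTF_8));
            line.setLength(pos);
        }
    }

    /** Returns {header part, everything readable from the stream afterwards}. */
    static String[] demo() {
        try {
            byte[] rest = "<body>hi</body>".getBytes(StandardCharsets.UTF_8);
            PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(rest), 64);
            StringBuilder line = new StringBuilder("Server: demo<html>");
            splitHeaderFromBody(in, line);

            ByteArrayOutputStream body = new ByteArrayOutputStream();
            for (int c = in.read(); c != -1; c = in.read()) {
                body.write(c);
            }
            return new String[] {
                line.toString(),
                new String(body.toByteArray(), StandardCharsets.UTF_8)
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] r = demo();
        System.out.println("header: " + r[0]);
        System.out.println("body:   " + r[1]);
    }
}
```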
Now let's see how a single header line is processed:
private void processHeaderLine(StringBuffer line)
throws IOException, HttpException
Response headers generally look like this:
Cache-Control: private
Date: Fri, 14 Dec 2007 15:32:06 GMT
Content-Length: 7602
Content-Type: text/html
Server: Microsoft-IIS/6.0
With that in mind, the code below is easy to follow:
int colonIndex = line.indexOf(":");       // key is up to colon
If there is no ":" and the line is not blank, an HttpException is thrown (a line of only whitespace is silently ignored):
if (colonIndex == -1) {
  int i;
  for (i= 0; i < line.length(); i++)
    if (!Character.isWhitespace(line.charAt(i)))
      break;
  if (i == line.length())
    return;
  throw new HttpException("No colon in header:" + line);
}
Otherwise the key-value pair can be extracted.
The key is the part from 0 to colonIndex; the value is what follows the colon after skipping leading spaces and tabs.
Finally the pair is stored in headers:
String key = line.substring(0, colonIndex);

int valueStart = colonIndex+1;            // skip whitespace
while (valueStart < line.length()) {
  int c = line.charAt(valueStart);
  if (c != ' ' && c != '\t')
    break;
  valueStart++;
}
String value = line.substring(valueStart);
headers.set(key, value);
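As a quick check of the splitting rule, the sketch below applies it to a String and returns the pair as a Map.Entry. It returns null for a blank line and throws IllegalArgumentException where Nutch throws HttpException; HeaderLine and parse() are hypothetical names.

```java
import java.util.AbstractMap;
import java.util.Map;

// Hypothetical helper (not Nutch code): splits "Key: value" at the first colon.
final class HeaderLine {

    /** Returns the key/value pair, or null if the line is blank. */
    static Map.Entry<String, String> parse(String line) {
        int colonIndex = line.indexOf(':');           // key is up to the first colon
        if (colonIndex == -1) {
            if (line.chars().allMatch(Character::isWhitespace)) {
                return null;                          // blank line: nothing to store
            }
            throw new IllegalArgumentException("No colon in header:" + line);
        }
        String key = line.substring(0, colonIndex);

        int valueStart = colonIndex + 1;              // skip spaces and tabs
        while (valueStart < line.length()) {
            char c = line.charAt(valueStart);
            if (c != ' ' && c != '\t') {
                break;
            }
            valueStart++;
        }
        return new AbstractMap.SimpleImmutableEntry<>(key, line.substring(valueStart));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> e = parse("Date: Fri, 14 Dec 2007 15:32:06 GMT");
        System.out.println(e.getKey() + " -> " + e.getValue());
    }
}
```

Because only the first colon is used as the separator, the colons inside the date value stay part of the value.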
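Next, a helper function that is used a lot:
private static int readLine(PushbackInputStream in, StringBuffer line,
boolean allowContinuedLine) throws IOException
The implementation:
It first sets the length of line to 0 and then keeps reading characters until the end of the stream (c == -1). For each character c:
If c is \r and the next character is \n, the \n is consumed as well, and control falls through to the \n case. On \n, if line.length() > 0 (the line already has content) and continued lines are allowed, it peeks at the next character: a ' ' or \t means the line is continued, so that character is consumed and reading goes on. Otherwise the line is complete and its length is returned. Any other character is simply appended to line. If the stream ends before a line terminator is seen, an EOFException is thrown:
line.setLength(0);
for (int c = in.read(); c != -1; c = in.read()) {
  switch (c) {
    case '\r':
      if (peek(in) == '\n') {
        in.read();
      }
    case '\n':
      if (line.length() > 0) {
        // at EOL -- check for continued line if the current
        // (possibly continued) line wasn't blank
        if (allowContinuedLine)
          switch (peek(in)) {
            case ' ' : case '\t':                   // line is continued
              in.read();
              continue;
          }
      }
      return line.length();      // else complete
    default :
      line.append((char)c);
  }
}
throw new EOFException();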
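To watch the continued-line handling at work, here is a standalone version that can be run against an in-memory stream. The peek() helper is not shown in the article, so the one below is an assumption (read one byte and push it back); LineReader and demo() are likewise names invented for this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone sketch (not Nutch code) of the line reader above.
final class LineReader {

    /** Assumed helper: looks at the next byte without consuming it. */
    static int peek(PushbackInputStream in) throws IOException {
        int value = in.read();
        if (value != -1) {
            in.unread(value);
        }
        return value;
    }

    static int readLine(PushbackInputStream in, StringBuilder line,
                        boolean allowContinuedLine) throws IOException {
        line.setLength(0);
        for (int c = in.read(); c != -1; c = in.read()) {
            switch (c) {
                case '\r':
                    if (peek(in) == '\n') {
                        in.read();                    // swallow the \n of \r\n
                    }
                    // fall through: \r alone also ends the line
                case '\n':
                    if (line.length() > 0 && allowContinuedLine) {
                        int next = peek(in);
                        if (next == ' ' || next == '\t') {
                            in.read();                // continued line: keep reading
                            continue;
                        }
                    }
                    return line.length();             // line complete
                default:
                    line.append((char) c);
            }
        }
        throw new EOFException();
    }

    /** Reads header lines from a fixed sample until the blank line. */
    static List<String> demo() {
        String raw = "X-Long: a\r\n b\r\nHost: h\r\n\r\n";
        PushbackInputStream in = new PushbackInputStream(
            new ByteArrayInputStream(raw.getBytes(StandardCharsets.US_ASCII)));
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        try {
            while (readLine(in, line, true) != 0) {
                lines.add(line.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sample the folded header comes back as "X-Long: ab": the line break and the single leading space of the continuation are both dropped.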
Next, how the content is read, i.e. the implementation of
private void readPlainContent(InputStream in)
throws HttpException, IOException
It first takes the response length from headers (the headers were read into the metadata earlier):
int contentLength = Integer.MAX_VALUE;    // get content length
String contentLengthString = headers.get(Response.CONTENT_LENGTH);
if (contentLengthString != null) {
  contentLengthString = contentLengthString.trim();
  try {
    contentLength = Integer.parseInt(contentLengthString);
  } catch (NumberFormatException e) {
    throw new HttpException("bad content length: "+contentLengthString);
  }
}
If that is greater than http.getMaxContent() (configured through http.content.limit in the configuration file),
the download is cut off once maxContent bytes have been read:
if (http.getMaxContent() >= 0
    && contentLength > http.getMaxContent())   // limit download size
  contentLength = http.getMaxContent();

ByteArrayOutputStream out = new ByteArrayOutputStream(Http.BUFFER_SIZE);
byte[] bytes = new byte[Http.BUFFER_SIZE];
int length = 0;                           // read content
for (int i = in.read(bytes); i != -1; i = in.read(bytes)) {
  out.write(bytes, 0, i);
  length += i;
  if (length >= contentLength)
    break;
}
content = out.toByteArray();
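One detail of this loop is worth trying out: the limit is only checked after a whole buffer has been written, so the result can exceed the limit by up to one buffer. The sketch below reproduces the loop against an in-memory stream; LimitedRead, readUpTo() and demo() are hypothetical names, and the tiny buffer size is chosen only to make the effect visible.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical sketch (not Nutch code) of the size-limited read loop above.
final class LimitedRead {

    static byte[] readUpTo(InputStream in, int contentLength, int maxContent,
                           int bufferSize) throws IOException {
        if (maxContent >= 0 && contentLength > maxContent) {
            contentLength = maxContent;               // limit download size
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(bufferSize);
        byte[] bytes = new byte[bufferSize];
        int length = 0;
        for (int i = in.read(bytes); i != -1; i = in.read(bytes)) {
            out.write(bytes, 0, i);
            length += i;
            if (length >= contentLength) {            // checked after the write
                break;
            }
        }
        return out.toByteArray();
    }

    /** Reads from a stream of `available` bytes with no Content-Length header. */
    static int demo(int available, int maxContent, int bufferSize) {
        try {
            InputStream in = new ByteArrayInputStream(new byte[available]);
            return readUpTo(in, Integer.MAX_VALUE, maxContent, bufferSize).length;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(100, 10, 8));   // limit 10, buffer 8
        System.out.println(demo(5, 10, 8));     // body shorter than the limit
        System.out.println(demo(100, -1, 8));   // negative limit: no truncation
    }
}
```

With 100 bytes available, a limit of 10 and a buffer of 8, two full buffers are written before the check stops the loop, so 16 bytes come back. A caller that needs an exact cap could truncate the array afterwards.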