1. Using htmlparser
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

/**
 * Saves the entire HTML content of a page to a specified file.
 *
 * @author chenguoyong
 */
public class ScrubSelectedWeb {
    private static final String CRLF = System.getProperty("line.separator");

    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            URL ur = new URL("http://10.249.187.199:8083/injs100/");
            // try-with-resources closes both streams even if reading or writing fails
            try (BufferedReader in = new BufferedReader(new InputStreamReader(ur.openStream()));
                 BufferedWriter out = new BufferedWriter(new FileWriter("D:/outPut.txt"))) {
                StringBuilder sb = new StringBuilder();
                String s;
                // Read the page line by line, appending the platform line separator
                while ((s = in.readLine()) != null) {
                    sb.append(s).append(CRLF);
                }
                System.out.println(sb);
                out.write(sb.toString());
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
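This basically works for scraping a web page, but the URL has to be typed into the source by hand, and the code has not been refactored. It is only a simple starting idea.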
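One way to avoid editing the source for every page is to pass the URL and the output file on the command line. The following is a minimal sketch of that idea, not the author's code: the class name PageSaver and the methods fetch and save are hypothetical, and UTF-8 is assumed for both the page and the output file (the original relies on the platform default encoding).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical refactoring of the example above; names are illustrative only.
public class PageSaver {

    /** Reads the resource at url line by line and returns its text. */
    public static String fetch(URL url) throws IOException {
        StringBuilder sb = new StringBuilder();
        // Assumption: the page is UTF-8 encoded
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append(System.lineSeparator());
            }
        }
        return sb.toString();
    }

    /** Fetches url and writes its text to target, replacing any existing file. */
    public static void save(URL url, Path target) throws IOException {
        Files.write(target, fetch(url).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java PageSaver <url> <output-file>");
            return;
        }
        save(URI.create(args[0]).toURL(), Paths.get(args[1]));
    }
}
```

With this split, the page from the original example could be saved by running java PageSaver http://10.249.187.199:8083/injs100/ D:/outPut.txt, and fetch can be reused wherever the text is needed without writing a file.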
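htmlparser is an HTML parsing library written in pure Java. It does not depend on any other Java libraries and is mainly used to transform or extract HTML. In my experience it parses HTML very quickly and copes well with real-world pages, and I would count it among the best tools available for parsing and analyzing HTML. Whether you want to scrape data from web pages or rewrite HTML content, it does the job well. Because htmlparser is cleanly structured, extending it is also straightforward.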