[Incomplete] Fetching web pages step by step with Nutch commands

This article is still incomplete. Whether Nutch can be used to download pages step by step in this way has not yet been verified.

1. Basic operations: setting up the environment

(1) Download the installation package and extract it to /usr/search/apache-nutch-2.2.1/

(2) Build the runtime

 cd /usr/search/apache-nutch-2.2.1/

ant runtime

(3) Verify that Nutch is installed correctly

[root@jediael44 apache-nutch-2.2.1]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
[root@jediael44 bin]# ./nutch 
Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

(4) Edit /usr/search/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml to set the crawler's agent name, which Nutch requires before it will fetch anything:

<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
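
The snippet above is only the property element; it must sit inside the <configuration> root of nutch-site.xml. A minimal complete file would look like this, following the stock template that ships with Nutch:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>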

(5) Create seed.txt

The InjectorJob in step 2 below is invoked as ./bin/nutch inject urls from runtime/local, so the seed file must live in a urls directory there:

 cd /usr/search/apache-nutch-2.2.1/runtime/local/

 mkdir urls

vi urls/seed.txt

http://nutch.apache.org/
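
Equivalently, the seed directory and file can be created in one line without opening an editor:

 mkdir -p urls && echo "http://nutch.apache.org/" > urls/seed.txt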

(6) Edit the URL filter

 vi /usr/search/apache-nutch-2.2.1/conf/regex-urlfilter.txt

Note that ant runtime copies conf/ into runtime/local/conf/, so when running from runtime/local you should either edit runtime/local/conf/regex-urlfilter.txt directly or re-run ant runtime after changing the top-level copy.

# accept anything else
+.

Change it to:

# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/
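
For orientation: regex-urlfilter.txt is evaluated top to bottom, the first matching rule wins, and a leading + accepts a URL while a leading - rejects it. A slightly fuller sketch (the extra patterns here are illustrative, adapted from the defaults shipped with Nutch, and not part of the original change):

# skip image and archive extensions (from the default filter)
-\.(gif|GIF|jpg|JPG|png|PNG|zip|ZIP)$

# skip URLs containing characters that usually mark queries or session ids
-[?*!@=]

# accept only pages under nutch.apache.org
+^http://([a-z0-9]*\.)*nutch.apache.org/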


When a user invokes a crawl in Apache Nutch 1.x, Nutch generates a CrawlDB, which is nothing more than a directory holding the details of the crawl. In Nutch 2.x there is no CrawlDB; instead, Nutch keeps all of its crawl data directly in a database via Apache Gora. In our case that database is Apache HBase, so all crawl data would go into HBase. (Note, however, that the InjectorJob log below reports org.apache.gora.memory.store.MemStore as the Gora storage class, so this particular run actually used the default in-memory store rather than HBase.)
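
For completeness, selecting HBase as the storage backend in Nutch 2.x is a Gora setting; a minimal sketch, assuming the gora-hbase dependency was enabled in ivy/ivy.xml before ant runtime was run:

# conf/gora.properties: make HBase the default Gora data store
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

and in nutch-site.xml, inside <configuration>:

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
</property>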

2. InjectorJob

[root@jediael44 local]# ./bin/nutch inject urls
InjectorJob: starting at 2014-07-07 14:15:21
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 2
Injector: finished at 2014-07-07 14:15:24, elapsed: 00:00:03
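
To sanity-check what was injected, the readdb command (backed by WebTableReader in Nutch 2.x) can summarize or dump the web table; assuming the standard 2.x options:

./bin/nutch readdb -stats      # print summary statistics for the web table
./bin/nutch readdb -dump out   # dump all records into the local directory "out"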

3. GeneratorJob
[root@jediael44 local]# ./bin/nutch generate
Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]
    -topN <N> - number of top URLs to be selected, default is Long.MAX_VALUE
    -crawlId <id> - the id to prefix the schemas to operate on,
                    (default: storage.crawl.id)");
    -noFilter - do not activate the filter plugin to filter the url, default is true
    -noNorm - do not activate the normalizer plugin to normalize the url, default is true
    -adddays - Adds numDays to the current time to facilitate crawling urls already
                     fetched sooner then db.fetch.interval.default. Default value is 0.
    -batchId - the batch id
----------------------
Please set the params.
[root@jediael44 local]# ./bin/nutch generate -topN 3
GeneratorJob: starting at 2014-07-07 14:22:55
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: topN: 3
GeneratorJob: finished at 2014-07-07 14:22:58, time elapsed: 00:00:03
GeneratorJob: generated batch id: 1404714175-1017128204
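
Note the generated batch id on the last line. Per the FetcherJob/ParserJob usage in Nutch 2.x, the later steps can be restricted to that single batch by passing the id instead of -all, e.g. with the id from the run above:

./bin/nutch fetch 1404714175-1017128204   # fetch only this batch
./bin/nutch parse 1404714175-1017128204   # parse only this batch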

4. FetcherJob
The FetcherJob fetches the URLs selected during the GeneratorJob, using the batches that job produced as its input. The following command is used for the FetcherJob:

[root@jediael44 local]# bin/nutch fetch -all
FetcherJob: starting
FetcherJob: batchId: -all
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
-finishing thread FetcherThread1, activeThreads=0
-finishing thread FetcherThread2, activeThreads=0
-finishing thread FetcherThread3, activeThreads=0
-finishing thread FetcherThread4, activeThreads=0
-finishing thread FetcherThread5, activeThreads=0
-finishing thread FetcherThread6, activeThreads=0
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done

Here I passed -all as the input parameter, which means this job fetches all the URLs that were generated by the GeneratorJob. You can pass different parameters according to your needs.
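
For example, the thread count (10 by default, as the log above shows) can be raised with the -threads option of FetcherJob:

bin/nutch fetch -all -threads 20   # fetch all generated batches with 20 threads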

5. ParserJob
After the FetcherJob, the ParserJob parses the pages that the FetcherJob fetched. The following command is used for the ParserJob:

[root@jediael44 local]# bin/nutch parse -all
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: -all
ParserJob: success
[root@jediael44 local]# 

Here I again passed -all, which parses all the URLs fetched by the FetcherJob. You can pass different parameters according to your needs.
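
If a page does not parse as expected, the parsechecker command from the usage listing in step 1(3) can be pointed at a single URL to see what the parser extracts:

bin/nutch parsechecker http://nutch.apache.org/   # print parse status, metadata, and outlinks for one URL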

6. DbUpdaterJob
[root@jediael44 local]# ./bin/nutch updatedb
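
The DbUpdaterJob writes the parse results (newly discovered outlinks, fetch status, scores) back into the web table, so that the next GeneratorJob can select the newly found URLs. A full step-by-step crawl is therefore just this cycle repeated; one more round, with the same commands as above:

# one crawl round: select, fetch, parse, update
./bin/nutch generate -topN 3
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb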


