[Incomplete] Downloading web pages step by step with nutch commands

Date: June 9, 2015

This article is incomplete. Whether pages can be downloaded step by step with nutch is not yet known.

1. Basic operations: building the environment

(1) Download the software package and extract it to /usr/search/apache-nutch-2.2.1/

(2) Build the runtime

cd /usr/search/apache-nutch-2.2.1/

ant runtime

(3) Verify that Nutch is installed

[root@jediael44 apache-nutch-2.2.1]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
[root@jediael44 bin]# ./nutch
Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

(4) vi /usr/search/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml and set the crawler's agent name by adding the following property inside the <configuration> element

<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>

(5) Create seed.txt

cd /usr/search/apache-nutch-2.2.1/runtime/local/

mkdir -p urls

vi urls/seed.txt

http://nutch.apache.org/
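
The seed file can also be created without opening an editor. The following is a minimal sketch, assuming the seed directory is named urls (the directory passed to inject in section 2) and lives under the current directory. SEED_DIR is only an illustrative variable name.

```shell
#!/bin/sh
# Create a seed directory and a one-URL seed file non-interactively.
# SEED_DIR is an illustrative name; "urls" matches the directory that is
# passed to "nutch inject" later in this article.
SEED_DIR="${SEED_DIR:-urls}"

mkdir -p "$SEED_DIR"

# One URL per line.
printf '%s\n' 'http://nutch.apache.org/' > "$SEED_DIR/seed.txt"

# Show what will be injected.
cat "$SEED_DIR/seed.txt"
```

Run it from the directory where the nutch commands will be run, so that the relative path urls resolves to the same place.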

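(6) Modify the URL filter

vi /usr/search/apache-nutch-2.2.1/conf/regex-urlfilter.txt

Note: the runtime was already built in step (2), and the scripts under runtime/local read their configuration from runtime/local/conf/. The same change therefore probably has to be made in runtime/local/conf/regex-urlfilter.txt as well, or the runtime rebuilt with ant runtime.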

Change

# accept anything else
+.

to

# accept only URLs under nutch.apache.org
+^http://([a-z0-9]*\.)*nutch\.apache\.org/
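
The accept rule can be checked roughly offline before crawling. The sketch below uses grep -E as a stand-in for the filter, with the dots in the host name escaped so that they match literally. It assumes that this particular pattern behaves the same in grep's extended syntax as in the Java regular expressions Nutch uses. The helper url_accepted and the sample URLs are illustrative only.

```shell
#!/bin/sh
# Rough offline check of the accept rule with grep -E.
# Assumption: the pattern uses only constructs that grep -E and Java
# regular expressions interpret the same way.
# The leading "+" of the filter line means "accept"; it is not part of
# the regular expression itself.
PATTERN='^http://([a-z0-9]*\.)*nutch\.apache\.org/'

# url_accepted URL: exit status 0 if the URL matches the accept rule.
url_accepted() {
    printf '%s\n' "$1" | grep -Eq "$PATTERN"
}

# Sample URLs, chosen only to exercise the pattern.
for url in 'http://nutch.apache.org/' \
           'http://wiki.nutch.apache.org/faq' \
           'http://example.com/nutch.apache.org/'; do
    if url_accepted "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

The third URL is rejected because the host name, not just the path, has to end in nutch.apache.org.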

When a user invokes a crawl command in Apache Nutch 1.x, Nutch generates a CrawlDB, which is simply a directory containing the details of the crawl. In Apache Nutch 2.x there is no CrawlDB. Instead, Nutch keeps all crawl data directly in a database. In our case we have used Apache HBase, so all crawl data goes into Apache HBase.

2 InjectorJob

[root@jediael44 local]# ./bin/nutch inject urls
InjectorJob: starting at 2014-07-07 14:15:21
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 2
Injector: finished at 2014-07-07 14:15:24, elapsed: 00:00:03

3 GeneratorJob

[root@jediael44 local]# ./bin/nutch generate
Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]
    -topN <N> - number of top URLs to be selected, default is Long.MAX_VALUE
    -crawlId <id> - the id to prefix the schemas to operate on,
        (default: storage.crawl.id)");
    -noFilter - do not activate the filter plugin to filter the url, default is true
    -noNorm - do not activate the normalizer plugin to normalize the url, default is true
    -adddays - Adds numDays to the current time to facilitate crawling urls already
        fetched sooner then db.fetch.interval.default. Default value is 0.
    -batchId - the batch id
----------------------
Please set the params.

[root@jediael44 local]# ./bin/nutch generate -topN 3
GeneratorJob: starting at 2014-07-07 14:22:55
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: topN: 3
GeneratorJob: finished at 2014-07-07 14:22:58, time elapsed: 00:00:03
GeneratorJob: generated batch id: 1404714175-1017128204
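
To work on one specific batch in later steps, the batch id has to be picked out of the generate output. The sketch below assumes the "generated batch id:" line has exactly the format shown in the log above and that it arrives on standard output. extract_batch_id is a hypothetical helper, not part of Nutch.

```shell
#!/bin/sh
# extract_batch_id: read GeneratorJob output on stdin and print only the
# batch id taken from the "generated batch id:" line.
# Assumption: the line format is the one shown in the captured log.
extract_batch_id() {
    sed -n 's/^GeneratorJob: generated batch id: *//p'
}

# Demonstration on the line captured above. In practice the input would be
# the output of "./bin/nutch generate -topN 3".
echo 'GeneratorJob: generated batch id: 1404714175-1017128204' | extract_batch_id
```

In a real session this could be used as BATCH_ID=$(./bin/nutch generate -topN 3 2>&1 | extract_batch_id), with the resulting value passed to the steps that accept a batch id.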

4 FetcherJob

The fetcher's job is to fetch the URLs generated by the GeneratorJob, using the GeneratorJob's output as its input. The following command is used for the FetcherJob:

[root@jediael44 local]# bin/nutch fetch -all

Note: -all must be typed with an ASCII hyphen. The output below was captured from a run in which the option was entered with an en dash (–all), so it was apparently taken as a literal batch id, which may explain why no records were fetched.

FetcherJob: starting
FetcherJob: batchId: –all
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
-finishing thread FetcherThread1, activeThreads=0
-finishing thread FetcherThread2, activeThreads=0
-finishing thread FetcherThread3, activeThreads=0
-finishing thread FetcherThread4, activeThreads=0
-finishing thread FetcherThread5, activeThreads=0
-finishing thread FetcherThread6, activeThreads=0
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done

Here I have provided the input parameter -all, which means that this job will fetch all the URLs generated by the GeneratorJob. You can use different input parameters according to your needs.
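
The fetch log above reports the batch id as –all, written with an en dash, which is easy to introduce when a command is pasted from a book or web page. A small guard can catch this before the argument reaches nutch. The sketch below is only an illustration, and check_batch_arg is a hypothetical helper.

```shell
#!/bin/sh
# check_batch_arg ARG: hypothetical helper that rejects a batch selector
# starting with a typographic dash (en dash U+2013 or em dash U+2014).
# The shell passes such a character through unchanged, so the program
# receives it as ordinary text rather than as an option prefix.
EN_DASH=$(printf '\342\200\223')   # UTF-8 bytes of U+2013
EM_DASH=$(printf '\342\200\224')   # UTF-8 bytes of U+2014

check_batch_arg() {
    case "$1" in
        "$EN_DASH"*|"$EM_DASH"*)
            echo "suspicious dash in '$1'; did you mean -all?" >&2
            return 1 ;;
        *)
            return 0 ;;
    esac
}

# Usage sketch: only call the fetcher when the argument looks sane.
# check_batch_arg "$1" && ./bin/nutch fetch "$1"
```

Both -all and a plain batch id such as 1404714175-1017128204 pass the check. Only an argument that begins with a typographic dash is refused.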

5 ParserJob

After the FetcherJob, the ParserJob parses the URLs that the FetcherJob has fetched. The following command is used for the ParserJob:

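[root@jediael44 local]# bin/nutch parse -all

Note: as with fetch, -all must be typed with an ASCII hyphen. The output below was captured from a run in which the option was entered with an en dash (–all), so it was apparently taken as a literal batch id.

ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: –all
ParserJob: success
[root@jediael44 local]#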

I have used the input parameter -all, which parses all the URLs fetched by the FetcherJob. You can use different input parameters according to your needs.

6 DbUpdaterJob

[root@jediael44 local]# ./bin/nutch updatedb
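
The steps from sections 2 to 6 can be chained into one crawl round. The following is a sketch under stated assumptions, not a tested crawl script: NUTCH, SEED_DIR and crawl_round are illustrative names, and the sub-commands and arguments are simply the ones used earlier in this article.

```shell
#!/bin/sh
# One crawl round using the step commands from sections 2 to 6.
# NUTCH and SEED_DIR are illustrative variables; the sub-commands and the
# -topN / -all arguments are the ones shown in this article.
NUTCH="${NUTCH:-./bin/nutch}"
SEED_DIR="${SEED_DIR:-urls}"

# Run the five steps in order and stop at the first one that fails.
crawl_round() {
    "$NUTCH" inject "$SEED_DIR" &&
    "$NUTCH" generate -topN 3 &&
    "$NUTCH" fetch -all &&
    "$NUTCH" parse -all &&
    "$NUTCH" updatedb
}

# Usage (from runtime/local):
# crawl_round || echo "crawl round failed" >&2
```

The inject log in section 2 reports MemStore as the Gora storage class. An in-memory store is unlikely to keep data between separate nutch invocations, so a round like this probably needs a persistent Gora backend (the quoted text mentions Apache HBase) before later steps can see what earlier steps wrote.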
