I have a web crawl that includes links to some PDF files.
I want Nutch to fetch those links and dump the results as .pdf files.
I am using Apache Nutch 1.6, and I am invoking it from Java as:

ToolRunner.run(NutchConfiguration.create(), new Crawl(), tokenize(crawlArg));
SegmentReader.main(tokenize(dumpArg));
Can someone help me with this?
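For context, here is a minimal sketch of how those two calls could be wired together in a small driver class. The tokenize helper, the "urls" seed directory, the crawl arguments, and the segment path are assumptions for illustration only, not taken from the original code:

import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Crawl;
import org.apache.nutch.segment.SegmentReader;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlAndDump {

    // Hypothetical helper: split a space-separated argument string into an array.
    private static String[] tokenize(String args) {
        return args.trim().split("\\s+");
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical crawl arguments: "urls" is the seed-list directory,
        // "crawl" is the output directory (Nutch 1.x Crawl tool syntax).
        String crawlArg = "urls -dir crawl -depth 3 -topN 50";

        // Hypothetical dump arguments: the segment name under crawl/segments/
        // is generated at crawl time, so the value below is only a placeholder.
        String dumpArg = "-dump crawl/segments/20130101000000 segdump "
                + "-nofetch -nogenerate -noparse -noparsedata -noparsetext";

        ToolRunner.run(NutchConfiguration.create(), new Crawl(), tokenize(crawlArg));
        SegmentReader.main(tokenize(dumpArg));
    }
}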
Best answer: If you want Nutch to crawl and index your PDF documents, you have to enable document crawling and the Tika plugin:
> Document crawling
1.1 Edit regex-urlfilter.txt and remove any occurrence of "pdf":
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
1.2 Edit suffix-urlfilter.txt and remove any occurrence of "pdf":
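As an illustration only (the exact contents of your suffix-urlfilter.txt may differ), the change amounts to deleting or commenting out the pdf entry in the list of filtered suffixes:

# before: pdf is listed among the filtered suffixes
pdf
# after: delete the line or comment it out
# pdf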
1.3 Edit nutch-site.xml and add "parse-tika" and "parse-html" to the plugin.includes section:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
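With those plugins enabled, a crawl and a segment dump can be run roughly as follows from the Nutch 1.x command line; the directory names, depth, and segment name below are placeholders, not values from the question:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50
bin/nutch readseg -dump crawl/segments/<segment> segdump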
> If what you really want is simply to download all the PDF files linked from a page, you can use something like Teleport on Windows or wget on *nix.
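For example, a recursive wget restricted to PDF files might look like this (the URL is a placeholder):

wget -r -l 1 -np -A pdf http://example.com/page-with-pdfs/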