python – Scrapy ignores allowed_domains?

August 4, 2019 · 712 views

Scrapy is ignoring my spider rules and even follows links to domains that are not allowed.

self.start_urls = [ 'http://www.domain.de' ]
self.allowed_domains = [ 'domain.de' ]

Yet it can also work correctly and filter out the disallowed domains; see the log:

DEBUG: Filtered offsite request to 'www.clubsoundz.fm': <GET http://www.clubsoundz.fm/>

I use SgmlLinkExtractor to follow links; here are my rules:

rules = (
    Rule(SgmlLinkExtractor(), callback='get_domain_data', follow=True),
)

Can anyone help?

Best answer

I think this is exactly the problem I am running into:
https://github.com/scrapy/scrapy/issues/184

It sounds like there is no real solution to this problem :(

I guess I will have to filter the URLs myself before the spider goes on to process them.
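
A minimal sketch of that kind of manual filtering, assuming Python 3 and using only the standard library. The helper names is_allowed and filter_urls are hypothetical and not part of Scrapy; the check simply treats a host as allowed when it equals an allowed domain or is a subdomain of one:

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one.

    Hypothetical helper for illustration; not a Scrapy API.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in (d.lower() for d in allowed_domains)
    )


def filter_urls(urls, allowed_domains):
    """Keep only the URLs whose host passes is_allowed, preserving order."""
    return [url for url in urls if is_allowed(url, allowed_domains)]
```

One possible place to call such a check is at the top of the get_domain_data callback, returning early when the response URL is off-site; another option that may fit is a Rule's process_links hook, so that off-site links are dropped before any request is made for them.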