Deep crawling with Scrapy

March 20, 2023. Source: 紫弟

CrawlSpider version
Once testing in scrapy shell is finished, modify the code as follows.

Extract the links that match 'http://hr.tencent.com/position.php?&start=\d+':

page_lx = LinkExtractor(allow=(r'start=\d+',))
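As a rough illustration of which URLs that pattern selects, the sketch below applies the same regular expression with Python's built-in re module. It assumes the pattern is searched anywhere in the URL rather than anchored at its start; is_page_link is a hypothetical helper, and every sample URL except the start page is made up for the example.

```python
import re

# Same pattern as the one passed to LinkExtractor(allow=...).
# The raw string keeps the backslash in \d intact.
PAGE_PATTERN = re.compile(r"start=\d+")


def is_page_link(url):
    """Return True if the URL contains 'start=' followed by at least one digit."""
    return PAGE_PATTERN.search(url) is not None


# Hypothetical sample URLs, for illustration only.
sample_urls = [
    "http://hr.tencent.com/position.php?&start=0#a",
    "http://hr.tencent.com/position.php?&start=10#a",
    "http://hr.tencent.com/position_detail.php?id=123",
    "http://hr.tencent.com/position.php?&start=#a",
]

matched = [url for url in sample_urls if is_page_link(url)]
print(matched)
```

Only the first two URLs are kept: the detail page has no start parameter, and the last URL has start= with no digits after it.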

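rules = [
    # Extract the matching links and process them with the spider's parse method,
    # and keep following links (omitting callback makes follow default to True)
    Rule(page_lx, callback='parse', follow=True)
]
Is this correct?

No! The callback must never be parse. To stress it once more: CrawlSpider uses the parse method to implement its own logic, so if you override parse, the crawl spider will fail to run.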
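To see why, here is a toy model in plain Python. It is not Scrapy's actual implementation: ToyCrawlSpider, the dict-based response, and the (predicate, callback name) rule tuples are all invented for illustration. The only point it makes is that when the base class does its rule dispatching inside parse, a subclass that redefines parse silently disables the rules.

```python
class ToyCrawlSpider:
    """Highly simplified stand-in for CrawlSpider (not Scrapy's real code)."""

    rules = []  # list of (link_predicate, callback_name) pairs

    def parse(self, response):
        # The base class uses parse() as its entry point: it walks the links
        # in the response and hands each matching one to the rule's callback.
        results = []
        for link in response["links"]:
            for matches, callback_name in self.rules:
                if matches(link):
                    results.append(getattr(self, callback_name)(link))
        return results


class GoodSpider(ToyCrawlSpider):
    rules = [(lambda link: "start=" in link, "parse_content")]

    def parse_content(self, link):
        return "item from " + link


class BrokenSpider(ToyCrawlSpider):
    rules = [(lambda link: "start=" in link, "parse_content")]

    def parse(self, response):
        # Overriding parse() replaces the dispatch logic above,
        # so the rules are never consulted.
        return []

    def parse_content(self, link):
        return "item from " + link


response = {"links": ["position.php?&start=10", "about.html"]}
good_items = GoodSpider().parse(response)
broken_items = BrokenSpider().parse(response)
print(good_items)
print(broken_items)
```

In this toy model GoodSpider produces one item for the pagination link, while BrokenSpider produces nothing even though its rules are identical.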

tencent.py

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from mySpider.items import TencentItem


class TencentSpider(CrawlSpider):
    name = "tencent"
    allowed_domains = ["hr.tencent.com"]
    start_urls = [
        "http://hr.tencent.com/position.php?&start=0#a"
    ]

    # Extract pagination links whose URL matches start=<digits>
    page_lx = LinkExtractor(allow=(r"start=\d+",))

    rules = [
        # The callback must not be named "parse"
        Rule(page_lx, callback="parseContent", follow=True)
    ]

    def parseContent(self, response):
        for each in response.xpath('//*[@class="even"]'):
            # .get() returns None instead of raising when a cell has no match
            name = each.xpath('./td[1]/a/text()').get()
            detailLink = each.xpath('./td[1]/a/@href').get()
            positionInfo = each.xpath('./td[2]/text()').get()
            peopleNumber = each.xpath('./td[3]/text()').get()
            workLocation = each.xpath('./td[4]/text()').get()
            publishTime = each.xpath('./td[5]/text()').get()
            # print(name, detailLink, positionInfo, peopleNumber, workLocation, publishTime)

            # Selectors return str under Python 3, so no manual encoding is needed
            item = TencentItem()
            item['name'] = name
            item['detailLink'] = detailLink
            item['positionInfo'] = positionInfo
            item['peopleNumber'] = peopleNumber
            item['workLocation'] = workLocation
            item['publishTime'] = publishTime

            yield item

    # There is no need to write a parse() method
    # def parse(self, response):
    #     pass
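The row-by-row extraction in parseContent can be tried without a running crawl. The sketch below is only an approximation: it swaps Scrapy's selectors for the standard library's xml.etree.ElementTree (which supports a small XPath subset), and the markup, job titles, and extract_rows helper are made up for the example rather than taken from the real page.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup standing in for the listing page.
SAMPLE = """
<table>
  <tr class="even">
    <td><a href="position_detail.php?id=1">Backend Engineer</a></td>
    <td>Technology</td>
    <td>2</td>
    <td>Shenzhen</td>
    <td>2017-11-25</td>
  </tr>
  <tr class="odd">
    <td><a href="position_detail.php?id=2">Product Manager</a></td>
    <td>Product</td>
    <td>1</td>
    <td>Beijing</td>
    <td>2017-11-25</td>
  </tr>
</table>
"""


def extract_rows(markup):
    """Collect one dict per row with class 'even', mirroring parseContent."""
    root = ET.fromstring(markup.strip())
    items = []
    for row in root.findall(".//tr[@class='even']"):
        link = row.find("./td[1]/a")  # first cell holds the title link
        items.append({
            "name": link.text,
            "detailLink": link.get("href"),
            "positionInfo": row.findtext("./td[2]"),
            "peopleNumber": row.findtext("./td[3]"),
            "workLocation": row.findtext("./td[4]"),
            "publishTime": row.findtext("./td[5]"),
        })
    return items


items = extract_rows(SAMPLE)
print(items)
```

As in the spider, only rows whose class is "even" are picked up, so the second sample row is skipped.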

Run: scrapy crawl tencent

Logging
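Scrapy provides logging support, which is available through Python's logging module.

You can edit the settings.py configuration file and add the following two lines anywhere in it; the console output becomes much cleaner.

LOG_FILE = "TencentSpider.log"
LOG_LEVEL = "INFO"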
Log levels
Scrapy provides five logging levels:

CRITICAL - critical errors
ERROR - regular errors
WARNING - warning messages
INFO - informational messages
DEBUG - debugging messages
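The effect of choosing a level can be seen with the standard logging module alone. The sketch below is a minimal demonstration outside Scrapy; messages_at and the "demo." logger names are hypothetical. It logs one message at each of the five levels and reports which ones pass a given threshold.

```python
import io
import logging


def messages_at(level_name):
    """Log one message per level and return the level names that get through."""
    stream = io.StringIO()
    logger = logging.getLogger("demo." + level_name)
    logger.setLevel(getattr(logging, level_name))
    logger.propagate = False  # keep the demo output out of the root logger

    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s"))
    logger.addHandler(handler)

    for name in ("debug", "info", "warning", "error", "critical"):
        getattr(logger, name)("sample %s message", name)

    logger.removeHandler(handler)
    return stream.getvalue().split()


print(messages_at("INFO"))
print(messages_at("DEBUG"))
```

With the threshold at INFO the DEBUG message is dropped and the other four remain; with DEBUG all five are recorded.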
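Logging settings
The following settings in settings.py can be used to configure logging:

LOG_ENABLED Default: True. Enables logging.
LOG_ENCODING Default: 'utf-8'. The encoding used for logging.
LOG_FILE Default: None. The name of the file, created in the current directory, that receives the logging output.
LOG_LEVEL Default: 'DEBUG'. The minimum level to log.
LOG_STDOUT Default: False. If True, all standard output (and error) of the process is redirected to the log. For example, if you run print("hello"), it will show up in the Scrapy log.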
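The idea behind LOG_STDOUT can be sketched with the standard library. This is not how Scrapy implements the setting, only an assumption-laden illustration of redirecting print output into a logger: LoggerWriter and the "stdout_demo" logger name are invented for the example.

```python
import contextlib
import io
import logging


class LoggerWriter:
    """Minimal file-like object that forwards written text to a logger."""

    def __init__(self, logger, level=logging.INFO):
        self.logger = logger
        self.level = level

    def write(self, text):
        # print() may call write() separately for the message and the newline;
        # pieces that are only whitespace are skipped.
        for line in text.splitlines():
            if line.strip():
                self.logger.log(self.level, line)
        return len(text)

    def flush(self):
        pass  # nothing is buffered here


# Send log records to an in-memory buffer so the result can be inspected.
captured = io.StringIO()
logger = logging.getLogger("stdout_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = logging.StreamHandler(captured)
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
logger.addHandler(handler)

# While the redirect is active, print() output ends up in the log.
with contextlib.redirect_stdout(LoggerWriter(logger)):
    print("hello")

log_text = captured.getvalue()
print(log_text)
```

Inside the with block nothing reaches the terminal; the text passed to print arrives in the log as an INFO record instead.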

    Original author: 紫弟
    Original URL: https://www.jianshu.com/p/29dc7eb0d4f8