Scrapy in Practice for Python Web Crawling, Part 1

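Task

Recursively crawl Jianshu user information

Explanation: starting from an initial user URL, we request the page, parse it, and extract further user URLs from that user's following and follower lists. Repeating this over and over, we collect for each user: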

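  • Nickname: nickname

  • Number of users followed: followed

  • Number of followers: following

  • Number of articles: articles

  • Number of characters written: charlength

  • Number of likes: likes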
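The recursive idea above can be tried out without any network access. The following is a minimal sketch, not the spider itself: FOLLOW_GRAPH and the user ids in it are made up, and the seen set is an assumption added here so that users who follow each other are not visited twice.

```python
from collections import deque

# Hypothetical follow graph: user id -> ids found in that user's lists.
FOLLOW_GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}


def crawl(start, graph):
    """Visit every user reachable from start exactly once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        user = queue.popleft()
        order.append(user)  # stand-in for "request and parse this profile"
        for other in graph.get(user, []):
            if other not in seen:  # skip users that are already scheduled
                seen.add(other)
                queue.append(other)
    return order


visited = crawl("a", FOLLOW_GRAPH)
```

Each profile plays the role of one response, and each id pulled from its lists plays the role of one new request.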

Create the Scrapy project

scrapy startproject JianShu

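Generate the spider

Change into the project folder, which here is the JianShu folder:

cd JianShu

Then generate the spider. Note that the spider name must not be the same as the project name.

scrapy genspider <spider name> <domain>

scrapy genspider jianshu www.jianshu.com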

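Build the spider's functional modules

In summary, a Scrapy project breaks down as follows:

  • the spider script under spiders handles the business logic: it issues requests and parses the data.

  • middleware disguises the crawler or adds proxies.

  • item wraps the data parsed by the spider script in a data container

  • and passes it on to the pipeline, which saves it to CSV, TXT, or a database.

  • settings stores the project's parameters.

  • main is the entry script; running it starts the crawl.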

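Disguising the request headers

To pass for a browser more convincingly and avoid being banned, rotate between different User-Agent strings. In Scrapy, a middleware is all this takes.

Scrapy has two kinds of middleware. Spider middleware is a framework of hooks into Scrapy's spider processing mechanism, where you can add code to process the responses sent to spiders and the items and requests that spiders produce. Downloader middleware is the corresponding set of hooks into Scrapy's request/response processing. Because the User-Agent is set on outgoing requests, the middleware below is a downloader middleware.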

Step 1

Create a middleware (HeadersDownloaderMiddleware)

middlewares.py

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
from JianShu.settings import UserAgentList
import random

class HeadersDownloaderMiddleware(UserAgentMiddleware):
    """
    Add a randomly chosen User-Agent header to each request
    """
    def process_request(self, request, spider):
        ua = random.choice(UserAgentList)
        if ua:
            request.headers.setdefault('User-Agent', ua)
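The selection logic in process_request can be tried outside Scrapy. In this sketch a plain dict stands in for request.headers, and add_user_agent and USER_AGENTS are hypothetical names introduced for illustration. It shows that setdefault only fills in a User-Agent when the request does not already carry one.

```python
import random

# Hypothetical pool; the article keeps the real one in settings.UserAgentList.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
]


def add_user_agent(headers, pool, rng=random):
    """Set User-Agent from pool unless the headers already carry one."""
    if pool:  # random.choice raises IndexError on an empty pool
        headers.setdefault("User-Agent", rng.choice(pool))
    return headers


fresh = add_user_agent({}, USER_AGENTS)
preset = add_user_agent({"User-Agent": "custom"}, USER_AGENTS)
```

If every request should be overwritten regardless of what it already carries, plain assignment (headers["User-Agent"] = ...) would be the alternative.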

Step 2

In settings.py, first add several browser User-Agent strings, then uncomment DOWNLOADER_MIDDLEWARES to activate the middleware.

settings.py

DOWNLOADER_MIDDLEWARES = {
   'JianShu.middlewares.HeadersDownloaderMiddleware': 400,
   'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

UserAgentList = ["Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
"Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
]

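The path used in the MIDDLEWARES setting in settings.py should have the form:

     yourproject.middlewares.MiddlewareClassName (where middlewares is the file name)

If both the file name and the class name of your middleware are RandomUserAgentMiddleware, the path should be written as: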

yourproject.RandomUserAgentMiddleware.RandomUserAgentMiddleware
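To see why the path takes the form module.ClassName, it helps to look at how such a dotted string can be turned into a class. The sketch below uses importlib and is only an illustration of the idea, not Scrapy's own loader. resolve_dotted_path is a hypothetical helper, and a standard-library path is used so the code runs outside a Scrapy project.

```python
from importlib import import_module


def resolve_dotted_path(path):
    """Split 'package.module.ClassName' and return the named attribute."""
    module_path, _, attr_name = path.rpartition(".")
    module = import_module(module_path)  # everything before the last dot
    return getattr(module, attr_name)    # the class after the last dot


# Standard-library example standing in for 'yourproject.middlewares.SomeClass'.
cls = resolve_dotted_path("collections.OrderedDict")
```

Everything before the last dot has to be importable as a module, which is why the file name appears in the middle of the path.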

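Item container: organizing the data

Think of an item as a container for data, similar to a dictionary, except that this dictionary has extra features and can be passed around between Scrapy's components.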

from scrapy import Item,Field

class JianshuItem(Item):
    nickname = Field()
    description = Field()
    followed = Field()
    following = Field()
    articles = Field()
    charlength = Field()
    likes = Field()

Pipeline: saving to a CSV file

Once the item has organized the data, the pipeline can save it to CSV.

import csv

class CSVPipeline(object):

    def __init__(self):
        # Initialize the CSV file: write the header row, then close the file
        self.csvf = open('data.csv', 'a+', encoding='gbk', newline='')
        self.writer = csv.writer(self.csvf)
        self.writer.writerow(('nickname', 'description', 'followed', 'following', 'articles', 'charlength', 'likes'))
        self.csvf.close()

    def process_item(self, item, spider):
        # Reopen the file in append mode and add one row per item
        with open('data.csv', 'a+', encoding='gbk', newline='') as f:
            writer = csv.writer(f)
            writer.writerow((item['nickname'], item['description'], item['followed'], item['following'], item['articles'], item['charlength'], item['likes']))
        return item
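The header-then-rows pattern the pipeline follows can also be expressed with csv.DictWriter, which takes the column order from a single field list. This is a sketch under stated assumptions rather than a replacement for the pipeline: it writes to an in-memory buffer instead of data.csv, write_rows is a hypothetical helper, and the sample row is made up.

```python
import csv
import io

FIELDS = ["nickname", "description", "followed", "following",
          "articles", "charlength", "likes"]


def write_rows(stream, rows):
    """Write one header row, then one CSV row per item dict."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Made-up sample item, for illustration only.
sample = {"nickname": "demo", "description": "hello, world", "followed": "12",
          "following": "34", "articles": "5", "charlength": "6789", "likes": "10"}

buffer = io.StringIO(newline="")
write_rows(buffer, [sample])
lines = buffer.getvalue().splitlines()
```

Because the header and the rows share FIELDS, a misspelled column name cannot end up in the header alone.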

Next, open settings.py again and uncomment ITEM_PIPELINES so that the item and the pipeline work together: one organizes the data, the other saves it.

ITEM_PIPELINES = {
    'JianShu.pipelines.CSVPipeline': 300,
}

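Writing the spider

Parsing is done with XPath.

Note that response.xpath() returns a list of selector objects, and selector objects have an extract method, so extract()[0] gives the first match as a string.

All of the parsing below extracts a single user's information: following count, follower count, article count, and so on.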

nickname = response.xpath("//div[@class='main-top']/div[@class='title']/a/text()").extract()[0]

# Returns a list of selector objects, one per li
info_selectors = response.xpath("//div[@class='main-top']/div[@class='info']/ul/li")

followed_url = 'https://www.jianshu.com' + info_selectors[0].xpath("./div/a/@href").extract()[0]
followed = info_selectors[0].xpath("./div/a/p/text()").extract()[0]
following = info_selectors[1].xpath("./div/a/p/text()").extract()[0]
articles = info_selectors[2].xpath("./div/a/p/text()").extract()[0]
charlength = info_selectors[3].xpath("./div/p/text()").extract()[0]
likes = info_selectors[4].xpath("./div/p/text()").extract()[0]

# Join the text nodes of the intro and strip all whitespace
description = re.sub(r'\s', '', ''.join(response.xpath("//div[@class='js-intro']/text()").extract()))
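The same extraction steps can be rehearsed offline. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and is not the Scrapy selector API. The PAGE fragment, its href values, and its numbers are made up to mirror the structure the expressions above expect.

```python
import re
import xml.etree.ElementTree as ET

# Made-up, simplified fragment shaped like the profile header.
PAGE = """
<div class="main-top">
  <div class="title"><a href="/u/demo">demo</a></div>
  <div class="info"><ul>
    <li><div><a href="/users/demo/following"><p>12</p></a></div></li>
    <li><div><a href="/users/demo/followers"><p>34</p></a></div></li>
  </ul></div>
  <div class="js-intro"> hello
   world </div>
</div>
"""

root = ET.fromstring(PAGE)  # root is the div with class main-top
nickname = root.find("./div[@class='title']/a").text
items = root.findall("./div[@class='info']/ul/li")  # one element per li
followed_url = "https://www.jianshu.com" + items[0].find("./div/a").get("href")
followed = items[0].find("./div/a/p").text
following = items[1].find("./div/a/p").text
# Strip every whitespace character from the intro text.
description = re.sub(r"\s", "", root.find("./div[@class='js-intro']").text)
```

The relative paths starting with ./ are evaluated against each li in turn, just as in the spider.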
        

Parsing the following list to recursively collect Jianshu user information in bulk

# Assumes 10 users per list page; rounds up so a partial last page is included
pages = (int(float(followed)) + 9) // 10

for page in range(1, pages + 1):
    userlist_url = followed_url + '?page={page}'.format(page=page)
    yield Request(userlist_url, callback=self.parseuserlist, dont_filter=True)


def parseuserlist(self, response):
    # Collect the profile links on one list page and request each profile
    url_list = response.xpath("//ul[@class='user-list']/li/div[@class='info']/a/@href").extract()
    url_list = ['https://www.jianshu.com' + url for url in url_list]
    for url in url_list:
        yield Request(url, callback=self.parse, dont_filter=True)
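The pagination arithmetic can be checked in isolation. This sketch assumes 10 users per list page, which the division by 10 suggests but the article does not state. userlist_urls is a hypothetical helper and the URL is a made-up example. It rounds the page count up so that a partial last page is included.

```python
def userlist_urls(followed_url, followed, per_page=10):
    """Build one list-page URL per page, rounding the page count up."""
    count = int(float(followed))
    pages = (count + per_page - 1) // per_page  # ceiling division
    return ["{}?page={}".format(followed_url, page)
            for page in range(1, pages + 1)]


urls = userlist_urls("https://www.jianshu.com/users/demo/following", "25")
```

With 25 followed users this yields pages 1 to 3, whereas truncating 25 / 10 would stop at page 2.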

Putting the spider together: jianshu.py

from scrapy import Spider, Request
from JianShu.items import JianshuItem
import re


class JianshuSpider(Spider):
    name = 'jianshu'
    allowed_domains = ['www.jianshu.com']
    start_urls = ['https://www.jianshu.com/u/cf09bc3817a7']


    def start_requests(self):
        start_url = 'https://www.jianshu.com/u/1562c7f16a04'
        yield Request(start_url, callback=self.parse)
        
    def parse(self, response):
        item = JianshuItem()
        
        nickname = response.xpath("//div[@class='main-top']/div[@class='title']/a/text()").extract()[0]
        info_selectors = response.xpath("//div[@class='main-top']/div[@class='info']/ul/li")
        
        followed_url = 'https://www.jianshu.com'+info_selectors[0].xpath("./div/a/@href").extract()[0]
        followed = info_selectors[0].xpath("./div/a/p/text()").extract()[0]


        # Assumes 10 users per list page; rounds up so a partial last page is included
        pages = (int(float(followed)) + 9) // 10


        for page in range(1,pages+1):
            userlist_url = followed_url + '?page={page}'.format(page=page)

            yield Request(userlist_url, callback=self.parseuserlist, dont_filter=True)
            

        
        following_url = 'https://www.jianshu.com' + info_selectors[1].xpath("./div/a/@href").extract()[0]
        following = info_selectors[1].xpath("./div/a/p/text()").extract()[0]
        print(following_url,following)
        
        articles_url = 'https://www.jianshu.com' + info_selectors[2].xpath("./div/a/@href").extract()[0]
        articles = info_selectors[2].xpath("./div/a/p/text()").extract()[0]


        charlength = info_selectors[3].xpath("./div/p/text()").extract()[0]
        likes = info_selectors[4].xpath("./div/p/text()").extract()[0]


        description = re.sub(r'\s', '', ''.join(response.xpath("//div[@class='js-intro']/text()").extract()))


        item['nickname'] = nickname
        item['description'] = description
        item['followed'] = followed
        item['following'] = following
        item['articles'] = articles
        item['charlength'] = charlength
        item['likes'] = likes
        
        yield item
        
    def parseuserlist(self,response):
        url_list = response.xpath("//ul[@class='user-list']/li/div[@class='info']/a/@href").extract()

        url_list = ['https://www.jianshu.com'+url for url in url_list]
        for url in url_list:
            yield Request(url,callback=self.parse,dont_filter=True)

To make debugging easier, we create a main.py file in the project's root directory.

main.py

from scrapy.cmdline import execute
import os,sys

sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# Note: jianshu is the spider name, not the project name

execute(['scrapy','crawl','jianshu'])

Then simply run main.py.
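For main.py to put the project directory on sys.path, the directory has to be derived from the script's full path. The sketch below compares two ways of doing that with os.path. The script location is hypothetical and no file needs to exist for the code to run.

```python
import os

script = os.path.join("project", "JianShu", "main.py")  # hypothetical location

name_only = os.path.basename(script)        # just the file name, 'main.py'
from_basename = os.path.dirname(name_only)  # empty: no directory part is left
project_dir = os.path.dirname(os.path.abspath(script))  # absolute folder path
```

Taking the basename first discards the directory, so os.path.abspath followed by os.path.dirname is the combination that yields the folder containing the script.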

Here is the complete settings.py configuration as well.

settings.py

BOT_NAME = 'JianShu'

SPIDER_MODULES = ['JianShu.spiders']
NEWSPIDER_MODULE = 'JianShu.spiders'
"""
DOWNLOADER_MIDDLEWARES = {
    'JianShu.middlewares.HeadersDownloaderMiddleware': None,
}
"""
ITEM_PIPELINES = {
    'JianShu.pipelines.CSVPipeline': 300,
}

DOWNLOAD_DELAY = 0.1

ROBOTSTXT_OBEY = False


DOWNLOADER_MIDDLEWARES = {
   'JianShu.middlewares.HeadersDownloaderMiddleware': 400,
   'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}


UserAgentList = ["Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
"Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
]
    Original author: 我为峰2014
    Original URL: https://www.jianshu.com/p/22edeecc7ed0