Writing Scrapy Crawler Scripts in PyCharm

June 11, 2019. Source: Forkey

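Scrapy is an application framework for crawling websites and extracting structured data. This article walks through installing, configuring, and using it. (The commands below were run on a Mac.)
Scrapy documentation in Chinese (for version 0.24): https://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/overview.html
The author's scripts: https://github.com/xiaoxiaoimg/scrapy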

I. Installing the framework

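1. Install Scrapy from the command line. (The package is written Scrapy, with a capital S; pip matches package names case-insensitively, so the lowercase spelling also works.)

Command: pip3 install Scrapy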

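2. When the installation finishes, type scrapy on the command line. If it prints the Scrapy version followed by the list of available commands, as in the screenshot, the installation succeeded.

[Screenshot: scrapy.png]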
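
The same check can be made from Python, which is convenient when PyCharm uses a different interpreter from the shell. This is a minimal sketch using only the standard library (Python 3.8 or later); scrapy_status is a hypothetical helper name, not part of Scrapy.

```python
from importlib import metadata
from importlib.util import find_spec


def scrapy_status():
    """Return the installed Scrapy version, or None if it cannot be imported."""
    if find_spec("scrapy") is None:
        return None
    try:
        return metadata.version("Scrapy")
    except metadata.PackageNotFoundError:
        # Importable but without distribution metadata (e.g. a source checkout).
        return "unknown"


if __name__ == "__main__":
    version = scrapy_status()
    print("Scrapy is not installed" if version is None else f"Scrapy {version}")
```

Running it with the interpreter configured in PyCharm shows whether that interpreter can see the package.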

II. Creating the crawler

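1. Create a folder to hold the scripts

Command: mkdir demo1

2. Create a new Scrapy project from the command line

Command: scrapy startproject demo1
startproject creates the demo1 project directory itself (scrapy.cfg plus an inner demo1 package), so if it is run from inside the folder made in step 1 the project ends up one level deeper, at demo1/demo1.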

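After creation, the project directory contains these files:
a. items.py defines the fields the crawler stores
b. middlewares.py holds middlewares, such as downloader middlewares and cookie middlewares
c. pipelines.py processes the scraped data (for example, saving it to CSV or to a database)
d. the spiders folder holds the spider scripts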
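
For reference, the layout described above can be listed programmatically. The sketch below assumes Scrapy's default project template; expected_layout and PACKAGE_FILES are hypothetical names used only for illustration, and the list may differ slightly between Scrapy versions.

```python
from pathlib import PurePosixPath

# Files in Scrapy's default project template, relative to the inner package.
PACKAGE_FILES = [
    "__init__.py",
    "items.py",
    "middlewares.py",
    "pipelines.py",
    "settings.py",
    "spiders/__init__.py",
]


def expected_layout(project):
    """Return the paths `scrapy startproject <project>` is expected to create."""
    root = PurePosixPath(project)
    paths = [root / "scrapy.cfg"]
    paths += [root / project / name for name in PACKAGE_FILES]
    return [str(path) for path in paths]


if __name__ == "__main__":
    for path in expected_layout("demo1"):
        print(path)
```

Comparing this list with what appears in PyCharm's project view is a quick way to confirm the project was generated in the intended folder.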

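3. Create the spider file from the command line. The command does not have to be run inside the spiders folder, but it must be run inside the Scrapy project directory; the new file is placed in the spiders folder automatically.

Command: scrapy genspider demo www.baidu.com
This creates a spider named demo for the site http://www.baidu.com. genspider takes a domain as its second argument (recent Scrapy versions also accept a full URL), so the scheme is left out here; curly quotation marks around the address must not be used, because the shell passes them through as part of the argument.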
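
When the address is copied from a browser, the domain for genspider can be derived with the standard library. This is a small sketch; domain_for_genspider is a hypothetical helper, not a Scrapy function.

```python
from urllib.parse import urlsplit


def domain_for_genspider(url_or_domain):
    """Return the bare host name from a URL or domain, ignoring stray quotes."""
    text = url_or_domain.strip().strip("\"'“”")
    if "://" not in text:
        # urlsplit only recognises the host when a network-location marker is present.
        text = "//" + text
    return urlsplit(text).hostname


if __name__ == "__main__":
    print(domain_for_genspider("http://www.baidu.com"))
    print(domain_for_genspider("www.baidu.com/some/path"))
```

The returned value is also the form that belongs in a spider's allowed_domains list.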

4. Now open the Scrapy project directory created above in PyCharm.

III. Writing the spider (this example crawls the BOSS Zhipin job site)

Define the fields to scrape in items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class Demo1Item(scrapy.Item):
    # define the fields for your item here like:
    # Fields scraped from BOSS Zhipin
    href = scrapy.Field()
    job_title = scrapy.Field()
    salary = scrapy.Field()
    working_place = scrapy.Field()
    company_name = scrapy.Field()
    working_life = scrapy.Field()

Open the spider script in the spiders folder. Here it is a spider for the BOSS Zhipin site:

import scrapy
from demo1.items import Demo1Item

# Spider that scrapes job postings from BOSS Zhipin


class DemoSpider(scrapy.Spider):
    # Spider name; required, it is the argument passed to "scrapy crawl"
    name = 'demo'
    # Domains the spider is allowed to crawl (optional)
    allowed_domains = ['zhipin.com']
    # URLs the crawl starts from (query=测试 searches for "testing" jobs)
    start_urls = ['https://www.zhipin.com/c101280600/h_101280600/?query=测试']

    def parse(self, response):
        # One selector per job card on the results page
        node_list = response.xpath("//div[@class='job-primary']")
        for node in node_list:
            item = Demo1Item()
            # extract() converts the matched selectors into a list of Unicode strings
            href = node.xpath("./div[@class='info-primary']//a/@href").extract()
            job_title = node.xpath("./div[@class='info-primary']//a/div[@class='job-title']/text()").extract()
            salary = node.xpath("./div[@class='info-primary']//a/span/text()").extract()
            working_place = node.xpath("./div[@class='info-primary']/p/text()").extract()
            company_name = node.xpath("./div[@class='info-company']//a/text()").extract()

            # [0] takes the first match; it raises IndexError if the XPath matched nothing
            item['href'] = href[0]
            item['job_title'] = job_title[0]
            item['salary'] = salary[0]
            item['working_place'] = working_place[0]
            item['company_name'] = company_name[0]
            # working_life is declared in items.py but is not populated here

            # Hand each item to the pipeline; execution then resumes here for the next node
            yield item

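Notes on the spider script:
(1) name: the name of the spider
(2) allowed_domains: the domains the spider may crawl. For example, if the spider starts at URL1 and that page links to URL2 on a different domain, URL2 is not crawled when only URL1's domain is allowed
(3) start_urls: the addresses to crawl; this can be a list or a tuple
(4) def parse(self, response): handles the response returned for each requested URL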
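
The effect of allowed_domains in note (2) can be illustrated with a simplified check. This is only a sketch of the idea (a host is accepted when it equals an allowed domain or is a subdomain of one), not Scrapy's actual filtering code; is_allowed is a hypothetical helper.

```python
from urllib.parse import urlsplit


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    host = urlsplit(url).hostname or ""
    return any(host == domain or host.endswith("." + domain)
               for domain in allowed_domains)


if __name__ == "__main__":
    allowed = ["zhipin.com"]
    for url in ("https://www.zhipin.com/job_detail/",
                "https://example.com/",
                "https://notzhipin.com/"):
        print(url, is_allowed(url, allowed))
```

Listing the bare domain (zhipin.com rather than www.zhipin.com) is what lets links on any subdomain of the site pass the check.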

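node_list holds the job cards on the page, one selector per card; response.xpath selects elements from the returned HTML with an XPath expression written against the page's DOM structure.
- for node in node_list: extracts the required fields from each card
  * item = Demo1Item(): instantiates the item object
  * @href returns the value of the href attribute; text() returns the text content of the current tag
  * extract(): node.xpath returns a list of selectors, and extract() converts it into a list of Unicode strings
  * item['href'] = href[0] stores the first extracted value in the item
  * yield item: hands each extracted item to the pipeline, after which execution resumes with the rest of the loop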
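
The relative XPath lookups can be tried outside Scrapy. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors; it supports only a subset of XPath (no text() or @href steps, so .findtext() and .get() are used instead). The markup is made-up sample data that mirrors the class names in the spider, not the real structure of the site, and parse_cards is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup for one job card.
SAMPLE = """
<html><body>
  <div class="job-primary">
    <div class="info-primary">
      <h3><a href="/job_detail/1.html"><div class="job-title">QA Engineer</div><span>10k-15k</span></a></h3>
      <p>Shenzhen</p>
    </div>
    <div class="info-company"><h3><a href="/gongsi/1.html">Example Co</a></h3></div>
  </div>
</body></html>
"""


def parse_cards(markup):
    """Return one dict per job card, using paths relative to each card node."""
    root = ET.fromstring(markup.strip())
    rows = []
    for node in root.findall(".//div[@class='job-primary']"):
        link = node.find("./div[@class='info-primary']//a")
        rows.append({
            "href": link.get("href"),
            "job_title": link.findtext("./div[@class='job-title']"),
            "salary": link.findtext("./span"),
            "working_place": node.findtext("./div[@class='info-primary']/p"),
            "company_name": node.findtext("./div[@class='info-company']//a"),
        })
    return rows


if __name__ == "__main__":
    for row in parse_cards(SAMPLE):
        print(row)
```

The leading "./" in each path is what keeps the search inside the current card, which is the same reason the spider's XPath expressions start with "./".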

In pipelines.py, process the items returned by the spider:

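# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json
import os


class Demo1Pipeline(object):

    def __init__(self):
        # The output path is relative to the directory the crawl is started from
        path = 'demo1/data/demo.json'
        # Create the data folder if it is missing; otherwise open() fails
        os.makedirs(os.path.dirname(path), exist_ok=True)
        # UTF-8 is needed because ensure_ascii=False writes Chinese text as-is
        self.file = open(path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Serialize each item as one JSON object per line, followed by a comma
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.file.write(content)
        # Return the item so that any later pipeline can keep processing it
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()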
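
The serialization step in process_item can be tried without Scrapy. The sketch below uses plain dicts in place of items and a temporary file in place of demo1/data/demo.json (both are stand-ins for illustration). Unlike the pipeline, it writes one JSON object per line without a trailing comma, so that each line can be parsed back with json.loads.

```python
import json
import os
import tempfile


def write_items(path, items):
    """Write each item as one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")


def read_items(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    sample = [{"job_title": "测试工程师", "salary": "10k-15k"}]
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "demo.jsonl")
        write_items(path, sample)
        print(read_items(path) == sample)
```

With ensure_ascii=False the Chinese job titles are stored as readable text rather than \uXXXX escapes, which is why the file has to be opened with an explicit UTF-8 encoding.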

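Notes on the pipeline script:
(1) __init__ opens the file the data is stored in
(2) process_item converts the item to a dict, serializes it to a string with json.dumps, and writes it to the file. Because every line ends with a comma, the result is a sequence of JSON objects rather than one valid JSON document
(3) Finally, remember to close the file: close_spider calls self.file.close()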
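
Scrapy only calls a pipeline that is enabled in the project's settings.py, as the comment at the top of pipelines.py reminds. A minimal sketch of that setting, assuming the project and class names used in this article (demo1 and Demo1Pipeline); the number sets the order in which pipelines run (lower runs first, conventionally in the 0 to 1000 range), and 300 is simply the value suggested in Scrapy's project template.

```python
# settings.py (fragment): enable the pipeline defined in demo1/pipelines.py
ITEM_PIPELINES = {
    "demo1.pipelines.Demo1Pipeline": 300,
}
```

Without this entry the spider still runs and yields items, but nothing is written to demo.json.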

IV. Running the spider (the PyCharm run configuration is adapted from other people's write-ups)

To run the Scrapy spider from PyCharm, create begin.py (it is visible in an earlier screenshot):

from scrapy import cmdline

cmdline.execute('scrapy crawl demo'.split())
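
cmdline.execute receives the same argument list the shell would pass to the scrapy command, which is why the command string is split into words. A small sketch of building that list with the standard library; build_crawl_argv is a hypothetical helper, and -o is Scrapy's option for writing the scraped items to an output file.

```python
import shlex


def build_crawl_argv(spider, output=None):
    """Build the argument list for `scrapy crawl`, optionally with an output file."""
    argv = ["scrapy", "crawl", spider]
    if output is not None:
        argv += ["-o", output]
    return argv


if __name__ == "__main__":
    print(build_crawl_argv("demo"))
    # shlex.split respects quoting, unlike str.split, when an argument contains spaces.
    print(shlex.split("scrapy crawl demo -o 'job list.json'"))
```

For the simple command used in begin.py, 'scrapy crawl demo'.split() and build_crawl_argv('demo') produce the same list.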

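Click the button shown in the screenshot
[Screenshot: image.png]
In the dialog that opens, follow the steps shown in the screenshot
[Screenshot: image.png]
Create a new run configuration and select the corresponding file directory
[Screenshot: image.png]
Save it and click the Run button; the spider script then runs automatically
[Screenshot: image.png]
After the run, the stored files are laid out as follows:
[Screenshot: image.png]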

    Original author: Forkey
    Original article: https://www.jianshu.com/p/07b4d9f48505