Getting well-formed JSON output from a Scrapy spider

July 6, 2023 · Source: 大鱼巨蟹

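When an item pipeline saves data as JSON by writing one item at a time, every line of the file is a separate {} dictionary, so the file as a whole is not a standard JSON document.
To make it one, each line has to end with a ',' and a newline (except the last item, which takes no trailing comma), and the whole file has to be wrapped in [].
The pipeline in pipelines.py can be written in either of the two ways described here.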
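To make the problem concrete, here is a minimal sketch using only the standard library. The two sample items and the helper name `parses_as_one_document` are made up for illustration; the point is that each line parses on its own, but the file as a whole does not until commas and brackets are added.

```python
import json

# Two items written one dictionary per line, with no commas
# and no surrounding brackets (sample data, not real output).
raw = '{"title": "a", "page": 1}\n{"title": "b", "page": 2}\n'


def parses_as_one_document(text):
    """Return True if the whole text is a single valid JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Each line is valid JSON on its own...
per_line = [json.loads(line) for line in raw.splitlines()]

# ...but the file as a whole is not.
whole_file_ok = parses_as_one_document(raw)

# Joining the lines with commas and wrapping them in [] fixes that.
wrapped = '[' + ',\n'.join(raw.splitlines()) + ']'
wrapped_ok = parses_as_one_document(wrapped)

print(whole_file_ok, wrapped_ok)
```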

Method 1:

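import codecs
import json
import os


class XxxxprojectPipeline(object):
    # Constructor: runs once, when Scrapy instantiates the pipeline
    def __init__(self):
        super().__init__()  # run the parent class constructor
        # codecs.open wraps a binary file, which is what allows the
        # end-relative seek in close_spider
        self.fp = codecs.open('scraped_data_utf8.json', 'w', encoding='utf-8')
        self.fp.write('[')
        self.item_count = 0

    # Called once for every item that reaches the pipeline
    def process_item(self, item, spider):
        # Convert the item to a dictionary
        d = dict(item)
        # Serialize the dictionary to a JSON string
        string = json.dumps(d, ensure_ascii=False)
        self.fp.write(string + ',\n')  # comma and newline after each item
        self.item_count += 1
        return item

    def close_spider(self, spider):
        if self.item_count:  # with no items there is no trailing comma to strip
            self.fp.seek(-2, os.SEEK_END)  # move to the second-to-last character, the final comma
            self.fp.truncate()  # drop the final comma and newline
        self.fp.write(']')  # close the array
        self.fp.close()  # close the file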
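The write-then-trim idea can be tried outside Scrapy. The sketch below is not part of the original post: `write_json_array` is a hypothetical helper that repeats the same file operations on a plain list of dictionaries, then loads the result back with `json.load` to confirm it is a valid array. The `if items:` guard covers the case where nothing was scraped, in which case the file holds only `[` and there is no trailing comma to remove.

```python
import codecs
import json
import os
import tempfile


def write_json_array(path, items):
    """Write items as a JSON array, one item per line, by writing a
    trailing ',\\n' after every item and trimming the last one."""
    fp = codecs.open(path, 'w', encoding='utf-8')
    fp.write('[')
    for item in items:
        fp.write(json.dumps(item, ensure_ascii=False) + ',\n')
    if items:
        # The file ends with ',\n' (two bytes); step back onto the comma
        # and cut the file there.
        fp.seek(-2, os.SEEK_END)
        fp.truncate()
    fp.write(']')
    fp.close()


# Sample data, including non-ASCII text kept readable by ensure_ascii=False
items = [{'title': '标题一', 'page': 1}, {'title': '标题二', 'page': 2}]
path = os.path.join(tempfile.mkdtemp(), 'scraped_data_utf8.json')
write_json_array(path, items)

with open(path, encoding='utf-8') as f:
    loaded = json.load(f)

print(loaded == items)
```

One design note: a text file opened with the built-in `open()` rejects a nonzero seek relative to the end of the file, so the approach relies on `codecs.open`, which writes through a binary file underneath.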

Method 2:
Use the JSON exporter that Scrapy provides, JsonItemExporter, to write the JSON file.

from scrapy.exporters import JsonItemExporter


class JsonExporterPipleline(object):
    def __init__(self):
        self.file = open('xxx.json', 'wb')
        self.exporter = JsonItemExporter(self.file, encoding="utf-8", ensure_ascii=False)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

Then add this class to the ITEM_PIPELINES setting in settings.py:

ITEM_PIPELINES = {
    'xxxproject.pipelines.JsonExporterPipleline': 30,  # register the pipeline
}
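The integer assigned to each pipeline sets its order: Scrapy runs enabled pipelines from the lowest value to the highest, and the values are conventionally chosen in the 0-1000 range. As a sketch, enabling the Method 1 pipeline instead could look like this. The package name `xxxproject` is the placeholder used in the entry above, and the value 300 is an arbitrary choice for illustration.

```python
# settings.py (sketch): enable the hand-written pipeline from Method 1.
# 'xxxproject' stands in for the real project package name, and 300 is
# an arbitrary order value within the customary 0-1000 range.
ITEM_PIPELINES = {
    'xxxproject.pipelines.XxxxprojectPipeline': 300,
}
```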

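Finally, run the spider from the project's spiders directory. Both pipelines open their output file with a relative path, so the JSON file is created in the directory the command is run from.
scrapy crawl <spider_name>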

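Both methods produce valid JSON. The difference is that the first writes a newline after each item, while the second by default writes everything on a single line with no line breaks between items.
If pipelines.py contains both pipelines and both are enabled in settings.py, a run produces two JSON files with different names but the same data (provided the two pipelines write to different file names).
In practice one file is enough, so pick either method.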
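Whitespace between array elements is not significant in JSON, so the two layouts load to the same data. A small standard-library check, using made-up sample strings that only approximate the two output styles:

```python
import json

# One item per line, roughly the layout Method 1 produces
one_per_line = '[{"page": 1},\n{"page": 2}]'
# Everything on one line, roughly the default layout of Method 2
single_line = '[{"page": 1},{"page": 2}]'

same_data = json.loads(one_per_line) == json.loads(single_line)
print(same_data)
```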

    Original author: 大鱼巨蟹
    Original URL: https://www.jianshu.com/p/8be37d5fead0