This article was first published on my blog: gongyanli.com

Preface: This article covers data persistence in Scrapy: storing scraped items in a database, in a JSON file, and via the built-in feed exports.
Persistent Storage: JSON

pipelines.py
```python
import json

from scrapy.exceptions import DropItem


class myPipeline(object):
    def __init__(self):
        # Open in text mode: json.dumps returns str, not bytes
        self.file = open('test.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        if item['title']:
            # ensure_ascii=False keeps non-ASCII text readable in the output file
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            self.file.write(line)
            return item
        else:
            raise DropItem("Missing title in %s" % item)

    def close_spider(self, spider):
        self.file.close()
```
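A pipeline only runs once it is enabled in settings.py. A minimal sketch, assuming the project package is named myproject (a placeholder):

```python
# settings.py
# "myproject" is a placeholder for your actual project package name
ITEM_PIPELINES = {
    'myproject.pipelines.myPipeline': 300,  # lower numbers (0-1000) run earlier
}
```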
Persistent Storage: MongoDB
```python
import pymongo
from scrapy.utils.project import get_project_settings

from .items import ChinacwaItem  # adjust to your project's items module

settings = get_project_settings()


class myPipeline(object):
    def __init__(self):
        self.client = pymongo.MongoClient(host=settings['MONGO_HOST'], port=settings['MONGO_PORT'])
        self.db = self.client[settings['MONGO_DB']]
        # self.coll = self.db[settings['MONGO_COLL2']]
        self.chinacwa = self.db['chinacwa']
        self.iot = self.db['iot']
        self.ny135 = self.db['ny135']
        self.productprice = self.db['productprice']
        self.allproductprice = self.db['allproductprice']

    def process_item(self, item, spider):
        if isinstance(item, ChinacwaItem):
            try:
                if item['article_title']:
                    # insert_one replaces the insert method deprecated in pymongo 3
                    self.chinacwa.insert_one(dict(item))
                    print("Insert succeeded")
            except Exception:
                spider.logger.exception("Failed to insert item %s" % item)
        # Always return the item so other item types and later pipelines still see it
        return item
```
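The MONGO_* keys read above must be defined in settings.py. A sketch with example values (host and database name are placeholders to adjust for your environment):

```python
# settings.py
MONGO_HOST = '127.0.0.1'   # placeholder: your MongoDB host
MONGO_PORT = 27017         # MongoDB's default port
MONGO_DB = 'spider_db'     # placeholder: your database name
```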
Persistent Storage: Built-in Feed Exports

The built-in feed exports are configured in settings.py. Each FEED_FORMAT value maps to a built-in exporter class:
| Format | FEED_FORMAT value | Built-in exporter class |
| --- | --- | --- |
| JSON | json | JsonItemExporter |
| JSON lines | jsonlines | JsonLinesItemExporter |
| CSV | csv | CsvItemExporter |
| XML | xml | XmlItemExporter |
| Pickle | pickle | PickleItemExporter |
| Marshal | marshal | MarshalItemExporter |
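A minimal settings.py sketch using the Scrapy 1.x-style feed settings listed above (the output path is a placeholder; newer Scrapy versions replace these keys with the FEEDS dict):

```python
# settings.py — Scrapy 1.x feed export settings
FEED_FORMAT = 'csv'
FEED_URI = 'test.csv'  # placeholder output path
```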
Usage:

```
$ scrapy crawl mySpider -o test.csv
```
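The export format is inferred from the output file's extension, so the same -o flag covers every format in the table above:

```
$ scrapy crawl mySpider -o test.json
$ scrapy crawl mySpider -o test.xml
```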