Recommender System 1: Creating a Simple Crawler with Scrapy

Create the Project

Go to the directory where the project files should live, then create the project by running:

scrapy startproject zhihuscrapy
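
For reference, startproject generates the standard Scrapy skeleton (exact files vary slightly with the Scrapy version):

zhihuscrapy/
    scrapy.cfg            # deploy configuration
    zhihuscrapy/          # the project's Python package
        __init__.py
        items.py          # item class definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider modules live here
            __init__.py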

Create the Spider

Create a file named zhihu_spider.py in the spiders directory, with the following code:

import scrapy

class ZhihuSpider(scrapy.Spider):
    name = "zhihu"
    allowed_domains = ["zhihu.com"]
    start_urls = [
        "https://zhuanlan.zhihu.com/p/38198729",
        "https://zhuanlan.zhihu.com/p/38235624"
    ]

    def parse(self, response):
        for sel in response.xpath('//head'):
            # All three fields extract the same <title> text for now;
            # they are just placeholders
            title = sel.xpath('title/text()').extract()
            link = sel.xpath('title/text()').extract()
            desc = sel.xpath('title/text()').extract()
            print(title, link, desc)
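
Before wiring a selector into the spider, it helps to try it in Scrapy's interactive shell. For example, to test the XPath used above:

scrapy shell "https://zhuanlan.zhihu.com/p/38198729"
>>> response.xpath('//head/title/text()').extract()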

Set the Request Headers

Add the following to settings.py:

# Request header
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
# Don't obey robots.txt
ROBOTSTXT_OBEY = False
# Disable cookie tracking
COOKIES_ENABLED = False
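
If you would rather not change the project-wide defaults, the same options can be scoped to a single spider through Scrapy's custom_settings class attribute; a minimal sketch:

class ZhihuSpider(scrapy.Spider):
    name = "zhihu"
    # Per-spider overrides, using the same keys as settings.py
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'COOKIES_ENABLED': False,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    }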

Start the Crawl

Go back to the project's root directory and run:

scrapy crawl zhihu

Improve the Code

Instead of printing, the spider now follows the author links on each article page and yields structured items.
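
This version yields ZhihuscrapyItem objects. The original post never shows zhihuscrapy/items.py, but given the three fields used below, a minimal version of it would look like:

import scrapy

class ZhihuscrapyItem(scrapy.Item):
    # Fields populated in parse_dir_contents() below
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()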

import scrapy

from zhihuscrapy.items import ZhihuscrapyItem

class ZhihuSpider(scrapy.Spider):
    name = "zhihu"
    allowed_domains = ["zhihu.com"]
    start_urls = [
        "https://zhuanlan.zhihu.com/p/38198729",
        "https://zhuanlan.zhihu.com/p/38235624"
    ]

    def parse(self, response):
        # Follow each author link on the article page; UserLink-link
        # is a CSS class on Zhihu's author anchors
        for href in response.css("a.UserLink-link::attr(href)"):
            # urljoin resolves relative hrefs against response.url,
            # as shown after this listing
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//head'):
            item = ZhihuscrapyItem()
            # As in the first version, all three fields take the page
            # title as placeholder values
            item['title'] = sel.xpath('title/text()').extract()
            item['link'] = sel.xpath('title/text()').extract()
            item['desc'] = sel.xpath('title/text()').extract()
            yield item
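
Note that Scrapy's response.urljoin(href) takes a single argument and is equivalent to urllib's urljoin(response.url, href); the commented-out two-argument call in the original post used urllib's signature. A standalone illustration, with a made-up relative profile path:

from urllib.parse import urljoin

# What response.urljoin("/people/some-author") does under the hood
# ("/people/some-author" is a hypothetical example href):
print(urljoin("https://zhuanlan.zhihu.com/p/38198729", "/people/some-author"))
# https://zhuanlan.zhihu.com/people/some-author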

Run and Export the Items

Run the crawl again, this time writing the scraped items to a JSON file:

scrapy crawl zhihu -o items.json
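
With a .json feed, Scrapy writes the scraped items as a single JSON array, so the file can be loaded straight back into Python for downstream use (each field value is a list, because extract() returns lists):

import json

with open('items.json', encoding='utf-8') as f:
    items = json.load(f)   # a list of dicts with title/link/desc keys

for item in items:
    print(item['title'])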

Reference: Scrapy Crawler (1) - Zhihu
Reference: Scrapy Getting-Started Tutorial

Original author: 崔业康
Original article: https://www.jianshu.com/p/b2491cacd27e