Scrapy: about sessions

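The earlier post on logging in with a requests session explained how to keep cookies under a single session in order to log in. In Scrapy the same job is done mainly with FormRequest and the cookiejar meta key, both of which are described in the Scrapy documentation.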

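The flow is as follows. In start_requests, issue a request that carries a cookiejar. In the returned response, find the formdata parameters that change on every page load, along with the URL of the captcha image, and issue a request to download the image. (Because the first argument of FormRequest.from_response is a response, the login page response itself has to be passed along in meta, together with the changing parameters and the cookiejar.)
Once the captcha image has been downloaded, put the captcha text into formdata and log in through FormRequest.from_response. Remember to carry the cookiejar on every later request for the data you actually want to crawl.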
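
The first half of this flow, pulling the hidden form values out of the login page and packing them into meta next to the cookiejar, can be sketched with the standard library alone. This is only a minimal sketch, not the spider's actual parsing code: the sample HTML, the `execution` field, and the helpers `hidden_inputs` and `build_meta` are assumptions made up for illustration, and a placeholder string stands in for the real Scrapy response.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_inputs(html):
    """Return {name: value} for every hidden input in the page (hypothetical helper)."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


def build_meta(previous_meta, login_response, hidden):
    """Bundle what the captcha callback needs later (hypothetical helper)."""
    return {
        "cookiejar": previous_meta["cookiejar"],  # same jar id, so same session
        "captcha_form": login_response,           # needed by FormRequest.from_response
        "ltvalue": hidden.get("lt"),              # value that changes per page load
    }


# Made-up login page; the real page layout is not known here.
SAMPLE_LOGIN_PAGE = """
<form action="/login" method="post">
  <input type="text" name="username">
  <input type="hidden" name="lt" value="LT-12345-abc">
  <input type="hidden" name="execution" value="e1s1">
</form>
"""

fields = hidden_inputs(SAMPLE_LOGIN_PAGE)
meta = build_meta({"cookiejar": 1}, "<login response placeholder>", fields)
print(fields["lt"])
print(sorted(meta))
```

Looking a field up by name, as in this sketch, tends to be less brittle than picking it by position, though it assumes the hidden input really is named `lt` on the page.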

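About the captcha:
Banggo serves its captcha from a single fixed URL and returns a different image for each cookie, so the image has to be downloaded within the same session. That means the download request must carry meta={'cookiejar': response.meta['cookiejar']}.
Other people's Scrapy logins for sites such as Douban and Zhihu find the captcha image in the page and download it with a plain request. That approach fits sites where each captcha has its own unique URL. Cookies do not matter there, because that URL returns the same image no matter when it is requested.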
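
The session-bound case can be made concrete with a toy model. The sketch below is a simulation under stated assumptions, not Banggo's real behaviour: `SessionCaptchaServer` and its methods are hypothetical names, and plain dicts keyed by jar id stand in for the separate cookie stores that Scrapy keeps for each `cookiejar` meta value.

```python
import itertools


class SessionCaptchaServer:
    """Toy server: one fixed captcha URL whose answer depends on the session cookie."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._captcha_by_session = {}

    def visit_login_page(self, cookies):
        # Hand out a session cookie on first contact and bind a captcha to it.
        if "session" not in cookies:
            cookies["session"] = "s%d" % next(self._ids)
        sid = cookies["session"]
        self._captcha_by_session[sid] = "code-for-" + sid

    def fetch_captcha(self, cookies):
        # Same URL for every client; an unknown client simply gets a new session.
        if cookies.get("session") not in self._captcha_by_session:
            self.visit_login_page(cookies)
        return self._captcha_by_session[cookies["session"]]

    def login(self, cookies, captcha_text):
        # The captcha is checked against the session that submits the form.
        return self._captcha_by_session.get(cookies.get("session")) == captcha_text


server = SessionCaptchaServer()
jars = {1: {}, 2: {}}  # two independent cookie jars, keyed like meta['cookiejar']

server.visit_login_page(jars[1])          # login page loaded with jar 1
same = server.fetch_captcha(jars[1])      # captcha fetched with the same jar
other = server.fetch_captcha(jars[2])     # captcha fetched with a different jar

print(server.login(jars[1], same))
print(server.login(jars[1], other))
```

In this model only the captcha fetched through jar 1 matches the session that submits the login form, which is the reason the image request has to reuse the jar id from the login page request.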
About whether headers need to be sent with the POST:
a Q&A thread on this question reports that sending them solved the problem.

import scrapy
from pyquery import PyQuery as pq


class BangoSpider(scrapy.Spider):
    name = 'bango'
    allowed_domains = ['banggo.com']
    start_urls = ['http://banggo.com/']

    def start_requests(self):
        yield scrapy.Request(callback=self.parse_page_with_captcha, meta = {'cookiejar': 1},
                             url='https://passport.banggo.com/CASServer/login?service=http%3A%2F%2Fbgact.banggo.com%2Flogin.shtml%3Fr_url%3Dhttp%25253A%25252F%25252Fuser.banggo.com%25252Fmember%25252FOrder')

    def parse_page_with_captcha(self, response):
        res = pq(response.body)
        # 'lt' changes on every page load; here it is the fourth <input> from the end.
        lt = res('input').eq(-4).attr('value')
        # Pass the login page response, the lt value, and the cookiejar to the next callback.
        data_for_later = {'captcha_form': response, 'ltvalue': lt, 'cookiejar': response.meta['cookiejar']}
        yield scrapy.Request(url='https://passport.banggo.com/CASServer//custom/loginCode.do',
                             callback=self.parse_captcha_download, meta=data_for_later)


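    def parse_captcha_download(self, response):
        # Save the captcha image so that a human can read it.
        with open("yzm.png", "wb") as output:
            output.write(response.body)

        # The login page response stored earlier, needed by FormRequest.from_response.
        captcha_form = response.meta['captcha_form']
        captcha_text = input('input:')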
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'en,zh-CN;q=0.9,zh;q=0.8,en-US;q=0.7',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Content-Type': 'application/x-www-form-urlencoded',
            'Referer': 'https://passport.banggo.com/CASServer/login?service=http%3A%2F%2Fbgact.banggo.com%2Flogin.shtml%3Fr_url%3Dhttp%25253A%25252F%25252Fuser.banggo.com%25252Fmember%25252FOrder',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
        data = {
            'username': '***',  # elided in the original; use your own account
            'password': '***',  # elided in the original; use your own password
            'vcode': captcha_text,
            'rememberUsername': 'on',
            'lt': response.meta['ltvalue'],
            '_eventId': 'submit',
            'loginType': '1',
            'lastIp': '112.193.157.37'
        }

        return scrapy.FormRequest.from_response(captcha_form, formdata=data, callback=self.login_in,
                                                headers=headers,meta={'cookiejar': response.meta['cookiejar']} )

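    def login_in(self, response):
        res = pq(response.body)
        add = res('.mbshop_userCenterLeftNav a').eq(-4).attr('href')
        # Keep carrying the cookiejar so the follow-up request stays logged in.
        return scrapy.Request(add, callback=self.detail,
                              meta={'cookiejar': response.meta['cookiejar']})

    def detail(self, response):
        res = pq(response.body)
        print(res('td').text())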

References:
https://stackoverflow.com/questions/27948326/scrapy-captcha

    Original author: ddm2014
    Original URL: https://www.jianshu.com/p/72bca2dcac03