Form Submission in Scrapy

Sometimes we need to log in to a website before we can get at certain information. Taking the GitHub login page as an example, below is part of the HTML of GitHub's login form.

 <form action="/session" accept-charset="UTF-8" method="post">
   <input name="utf8" type="hidden" value="✓" />
   <input type="hidden" name="authenticity_token" value="vr2Ebi0MMmJvjeZEQDEToGr96pQ2CK6TraSsU96M86B9PUI9D+59pAtOG99pv7UouYfN19Ptxwo+PaaVxYnWMQ==" /> 
   <div class="auth-form-header p-0"> 
    <h1>Sign in to GitHub</h1> 
   </div> 
   <div id="js-flash-container"> 
   </div> 
   <div class="auth-form-body mt-3"> 
    <label for="login_field"> Username or email address </label> 
    <input type="text" name="login" id="login_field" class="form-control input-block" tabindex="1" autocapitalize="off" autocorrect="off" autofocus="autofocus" /> 
    <label for="password"> Password <a class="label-link" href="/password_reset">Forgot password?</a> </label> 
    <input type="password" name="password" id="password" class="form-control form-control input-block" tabindex="2" /> 
    <input type="submit" name="commit" value="Sign in" tabindex="3" class="btn btn-primary btn-block" data-disable-with="Signing in…" /> 
   </div> 
  </form>

The action attribute of <form> is the URL the form is submitted to;
accept-charset is the character set the server accepts;
method is the HTTP method used to submit the data;
the <input> elements inside the <form> determine the content of the submission;
the name attribute of an <input> gives the field name of the submitted value. How that value is obtained depends on type: some fields take it from the value attribute, such as hidden; others take user input, such as text and password. See the HTML documentation for the details, which this post won't go into; a sketch of the resulting POST body follows below.
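When the Sign in button is clicked, the browser URL-encodes these name/value pairs into the POST body. Here is a minimal sketch of what that body looks like (the token value and credentials are made-up placeholders):

>>> from urllib.parse import urlencode
>>> urlencode({
...     'utf8': '✓',
...     'authenticity_token': 'vr2Ebi0MMmJ',    # hidden CSRF token, different every session
...     'login': 'user@example.com',
...     'password': 'secret',
...     'commit': 'Sign in',
... })
'utf8=%E2%9C%93&authenticity_token=vr2Ebi0MMmJ&login=user%40example.com&password=secret&commit=Sign+in'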

Below is the process and result of submitting this data:

[Figure: the request URL]
[Figure: the submitted form data]

As you can see, the form data is made up of the name/value pairs of the input elements.

[Figure: the response]

In the response, the server returns the cookie for the logged-in user; since a 302 redirect follows, the response also carries the address to redirect to in the Location header.
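Scrapy handles both of these details for us: the built-in CookiesMiddleware stores and resends the session cookie, and RedirectMiddleware follows the Location header. The settings below control this behaviour; the first two are the defaults, and COOKIES_DEBUG is an optional extra that logs every cookie exchanged, which is handy when debugging logins:

# settings.py -- the first two values are Scrapy's defaults
COOKIES_ENABLED = True    # CookiesMiddleware: keep and resend cookies within a session
REDIRECT_ENABLED = True   # RedirectMiddleware: follow the 302 to Location automatically
COOKIES_DEBUG = True      # log all cookies sent in requests and received in responses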

Seen this way, the essence of logging in is sending a request that carries the form data to the target server, usually via POST.
Scrapy provides FormRequest, a subclass of Request, for constructing and submitting form data. On top of Request's constructor arguments, FormRequest adds formdata, which accepts a dict or an iterable of tuples; to make a form request, simply pass formdata at construction time, as sketched below.
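As a quick illustration before the real login below (the URL is the form's action from the HTML above; the credentials are placeholders), both styles of formdata are valid, and FormRequest defaults to the POST method:

import scrapy

# formdata as a dict
req = scrapy.FormRequest('https://github.com/session',
                         formdata={'login': 'user@example.com', 'password': 'secret'})

# formdata as an iterable of (name, value) tuples -- allows a repeated field name
req = scrapy.FormRequest('https://github.com/session',
                         formdata=[('login', 'user@example.com'), ('password', 'secret')])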

Let's implement the GitHub login with FormRequest, and judge whether it succeeded by checking whether the page at www.github.com contains "Signed in as".

# scrapy shell https://github.com/login

>>> input_selector = response.css('input')
>>> fd = dict()
>>> for selector in input_selector:
...     name = selector.css('input::attr(name)').extract_first()
...     value = selector.css('input::attr(value)').extract_first()
...     if value is None:
...             value = ''
...     fd[name] = value
...
>>> fd['login'] = '******@gmail.com'
>>> fd['password'] = '******'
>>> request = scrapy.FormRequest('https://github.com/session', formdata=fd)
>>> fetch(request)
2018-06-03 15:27:50 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://github.com/> from <POST https://github.com/session>
2018-06-03 15:27:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/> (referer: None)
>>> 'Signed in as' in response.text
True
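
The point of the loop above is to capture the hidden fields along with the visible ones, most importantly authenticity_token, GitHub's CSRF token, without which the server would reject the POST. With just the form shown earlier on the page, fd would end up holding all five fields:

>>> sorted(fd)
['authenticity_token', 'commit', 'login', 'password', 'utf8']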

Besides using FormRequest directly, Scrapy provides an even simpler way to submit a form: the class method FormRequest.from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...]). Only the Response object is required as the first argument; supply the account and password in formdata, and the method takes care of the remaining hidden fields for us.

Below we log in to GitHub again, this time using FormRequest.from_response():

# scrapy shell https://github.com/login

>>> fd = dict()
>>> fd['login'] = '******@gmail.com'
>>> fd['password'] = '******'
>>> request = scrapy.FormRequest.from_response(response, formdata=fd)
>>> fetch(request)
2018-06-03 15:35:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://github.com/> from <POST https://github.com/session>
2018-06-03 15:35:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/> (referer: None)
>>> 'Signed in as' in response.text
True
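
One caveat: from_response() picks the first form on the page by default (formnumber=0). When a page contains several forms, the formname, formid, formcss, or formxpath parameters select a specific one. A sketch, using a CSS selector that happens to match the login form shown at the top of this post:

request = scrapy.FormRequest.from_response(
    response,
    formcss='form[action="/session"]',    # target the login form explicitly
    formdata=fd,
)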

Finally, let's see how the login logic looks in a real project:

# profiles.py
# -*- coding: utf-8 -*-
import scrapy

class ProfilesSpider(scrapy.Spider):
    name = 'profiles'
    allowed_domains = ['github.com']
    start_urls = ['http://github.com/']
    login_url = 'https://github.com/login'

    def parse(self, response):                      # default callback for pages fetched during the real crawl
        pass

    def after_login(self, response):
        yield from super().start_requests()         # call Spider's start_requests() to crawl from start_urls

    def start_requests(self):
        yield scrapy.Request(self.login_url, callback=self.parse_login)

    def parse_login(self, response):
        fd = dict()
        post_headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Content-Type": "application/x-www-form-urlencoded",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36",
            "Referer": "https://github.com/",
        }
        fd['login'] = '*******@gmail.com'
        fd['password'] = '******'

        yield scrapy.FormRequest.from_response(response, formdata=fd, callback=self.after_login, headers=post_headers)
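
One refinement worth considering, sketched here rather than taken from the original code: reuse the "Signed in as" check from the shell sessions inside after_login, so the spider stops early instead of crawling anonymously when the login fails:

    def after_login(self, response):
        # sketch: the login page coming back (no 'Signed in as') means the login failed
        if 'Signed in as' not in response.text:
            self.logger.error('Login failed, aborting crawl')
            return
        yield from super().start_requests()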

Summary

This post gave a brief walkthrough of submitting form data with FormRequest; logins that involve a captcha are left for a future post. The next post will look at how to scrape data from dynamic pages.

    Original author: 喵帕斯0_0
    Original article: https://www.jianshu.com/p/e29ab5b868fb
    This article is reposted from the web for knowledge-sharing purposes only; in case of infringement, please contact the blog owner for removal.