Python爬虫建站入门手记——从零开始建立采集站点（三：采集入库）

2019年4月14日 202次阅读来源: python

上回，我已经大概把爬虫写出来了。
我写了一个内容爬虫，一个爬取tag里面内容链接的爬虫
其实还差一个，就是收集一共有哪些tag的爬虫。但是这里先不说这个问题，因为我上次忘了这次又不想弄。。
还有个原因：如果实际采集的话，直接用http://segmentfault.com/questions/newest?page=1这个链接获取所有问题，挨个爬就行。

进入正题

第三部分，采集入库。

3.1 定义数据库（or model or schema）

为了入库，我需要在Django定义一个数据库的结构。（不说nosql和mongodb（也是一个nosql但是很像关系型）的事）
还记得那个名叫web的app么，里面有个叫models.py的文件，我现在就来编辑它。

bashvim ~/python_spider/web/models.py

内容如下:

python# -*- coding: utf-8 -*-
from django.db import models

# Create your models here.


class Tag(models.Model):
    title = models.CharField(max_length=30)

    def __unicode__(self):
        return self.title


class Question(models.Model):
    title = models.CharField(max_length=255)
    content = models.TextField()
    tags = models.ManyToManyField(Tag, related_name='questions')
    sf_id = models.CharField(max_length=16, default='0')　＃　加上这个可以记住问题在sf的位置，方便以后更新或者其他操作
    update_date = models.DateTimeField(auto_now=True)

    def __unicode__(self):
        return self.title


class Answer(models.Model):
    question = models.ForeignKey(Question, related_name='answers')
    content = models.TextField()

    def __unicode__(self):
        return 'To question %s' % self.question.title

都很直白，关于各个field可以看看 Django 的文档。

然后，我需要告诉我的python_spider项目，在运行的时候加载web这个app（项目不会自动加载里面的app）。

bashvim ~/python_spider/python_spider/settings.py

在INSTALLED_APPS里面加入web:

pythonINSTALLED_APPS = (
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'web',
)

下面，就可以用django自动生成数据库schema了

bashcd ~/python_spider
python manage.py makemigrations
python manage.py migrate

现在，我~/python_spider目录就产生了一个db.sqlite3文件，这是我的数据库。
把玩一番我的模型

python>>> from web.models import Answer, Question, Tag
>>> tag = Tag()
>>> tag.title = u'测试标签'
>>> tag.save()
>>> tag
<Tag: 测试标签>
>>> question = Question(title=u'测试提问', content=u'提问内容')
>>> question.save()
>>> question.tags.add(tag)
>>> question.save()
>>> answer = Answer(content=u'回答内容', question=question)
>>> answer.save()
>>> tag.questions.all() # 根据tag找question
[<Question: 测试提问>]
>>> question.tags.all() # 获取question的tags
[<Tag: 测试标签>]
>>> question.answers.all() # 获取问题的答案
[<Answer: To question 测试提问>]

以上操作结果正常，说明定义的models是可用的。

3.2 入库

接下来，我需要把采集的信息入库，说白了，就是把我自己蜘蛛的信息利用django的ORM存到django连接的数据库里面，方便以后再用Django读取用于做站。

入库的方法太多了，这里随便写一种，就是在web app里面建立一个spider.py, 里面定义两个蜘蛛，继承之前自己写的蜘蛛，再添加入库方法。

bashvim ~/python_spider/web/spider.py

代码如下：

python# -*- coding: utf-8 -*-
from sfspider import spider
from web.models import Answer, Question, Tag


class ContentSpider(spider.SegmentfaultQuestionSpider):

    def save(self): # 添加save()方法
        sf_id = self.url.split('/')[-1] # 1
        tags = [Tag.objects.get_or_create(title=tag_title)[0] for tag_title in self.tags]　＃ 2
        question, created = Question.objects.get_or_create(
            sf_id=sf_id,
            defaults={'title':self.title, 'content':self.content}
        ) # 3
        question.tags.add(*tags) # 4
        question.save()
        for answer in self.answers:
            Answer.objects.get_or_create(content=answer, question=question)
        return question, created


class TagSpider(spider.SegmentfaultTagSpider):

    def crawl(self): # 采集当前分页
        sf_ids = [url.split('/')[-1] for url in self.questions]
        for sf_id in sf_ids:
            question, created = ContentSpider(sf_id).save()

    def crawl_all_pages(self):
        while True:
            print u'正在抓取TAG:%s, 分页:%s' % (self.tag_name, self.page) # 5
            self.crawl()
            if not self.has_next_page:
                break
            else:
                self.next_page()

这个地方写得很笨，之前该在SegmentfaultQuestionSpider加上这个属性。
创建或者获取该提问的tags
创建或者获取提问，采用sf_id来避免重复
把tags都添加到提问，这里用*是因为这个方法原本的参数是(tag1, tag2, tag3)。但是我们的tags是个列表
测试的时候方便看看进度

然后，测试下我们的入库脚本

bashpython manage.py shell

python>>> from web.spider import TagSpider
>>> t = TagSpider(u'微信')
>>> t.crawl_all_pages()
正在抓取TAG:微信, 分页:1
正在抓取TAG:微信, 分页:2
正在抓取TAG:微信, 分页:3
KeyboardInterrupt # 用control-c中断运行，测试一下就行:)
>>> from web.models import Tag, Question
>>> Question.objects.all()
[<Question: 测试提问>, <Question: 微信支付获取prepayid，返回签名不匹配，>, <Question: 微信js怎么获取openID的>, <Question: 微信支付时加入attach参数提示签名错误>, <Question: 微信支付JSAPI调用返回fail_invalid_appid>, <Question: 微信消息连接打开  和  扫码打开连接有什么区别>, <Question: django做微信开发后台时无法返回response>, <Question: 微信端内置浏览器对canvas的支持有问题>, <Question: 分享到微信朋友圈的网页为什么点开直接跳至页尾？>, <Question: 微信支付开发：发起微信支付的时候，报错：invalid signature>, <Question: 前端加密代码有什么好办法不被破解>, <Question: 有没有桌面移动一体化网站发布方案,有市场吗?>, <Question: 微信如何获取用户的头像>, <Question: 重新设置微信自定义菜单 手机端没有显示该菜单>, <Question: 如何在用户输入关键字时自动回复图片，一张整体图。>, <Question: 手机图片上传是倒着的>, <Question: 微信内网页上传图片问题>, <Question: 如何转码微信多媒体下载接口的音频文件？>, <Question: 微信开放平台创建应用时，不能上传应用图片>, <Question: 微信页面中，怎么打开已安装的app？>, '...(remaining elements truncated)...']
>>> Question.objects.get(pk=5).tags.all() # 数据库中id=5的question的tags
[<Tag: 微信>, <Tag: 微信公众平台>, <Tag: 微信js-sdk>, <Tag: openid>]

3.3 设置django.contrib.admin来查看和编辑内容

为了更直观的观察我采集的数据，我可以利用django自带的admin
编辑文件

bashvim ~/python_spider/web/admin.py

pythonfrom django.contrib import admin
from web.models import Tag, Question, Answer

admin.site.register(Tag)
admin.site.register(Question)
admin.site.register(Answer)

然后创建超级用户

bashpython manage.py createsuperuser # 根据提示创建

启动测试服务器

bashpython manage.py runserver 0.0.0.0:80 # 我这是在runabove上，本地直接manage.py runserver

然后，我访问http://192.99.71.91/admin/登录刚刚创建的账号，就能对内容进行查看和编辑了
《Python爬虫建站入门手记——从零开始建立采集站点（三：采集入库）》

OK, 今天的内容到此，
下一篇，是编写django的view，套用简单的模板来建站。

    原文作者：python
    原文地址: https://segmentfault.com/a/1190000002549756
    本文转自网络文章，转载此文章仅为分享知识，如有侵权，请联系博主进行删除。