Python Crawler (6): Scraping Qiushibaike

December 11, 2023. Source: python爬虫

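After studying Python for a while, I wanted a small project to practice on. There are plenty of examples online that scrape jokes from Qiushibaike, so I decided to try it myself.

The examples I had seen threw a lot of errors when I downloaded and ran them as-is and needed debugging, but the overall approach was sound. So today I will work through the whole thing again from start to finish.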

1. Workflow analysis

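On the Qiushibaike home page, each post is made up of an image and some text. Handling both text and images would make the crawler more complicated than it needs to be, so let's start with something simpler:

we fetch only the text and leave the images alone.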

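So the entry point for our crawler is this page: http://www.qiushibaike.com/text/

The posts on this page contain only text, which saves us some work.

With the target settled, the next step is to work out what we want to extract from the page.

First, we need the content of each post. Given the content, we also want to know who posted it, that is, the author. We also want the number of likes and the number of comments.

Those are the basic requirements. Beyond that, we don't want to stop at a single page: after reading page one we want page two, so the crawler also has to fetch pages consecutively.

The overall plan is as follows: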

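Fetch the page content -> extract the four fields below from each post -> when the current page is done, move on to the next page.

1. Post author
2. Post content
3. Number of likes
4. Number of comments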

With the plan in place, let's put it into practice.
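
The four fields in the plan can be kept together in one small record. The following is a minimal sketch in Python 3 (the listings in this post are Python 2); the `Story` class and the `format_story` helper are hypothetical names introduced only for illustration and are not part of the original program.

```python
from dataclasses import dataclass


@dataclass
class Story:
    """One scraped post. The field names are illustrative assumptions."""
    author: str
    content: str
    likes: int = 0      # defaults to 0 when the page shows no like count
    comments: int = 0   # defaults to 0 when the page shows no comment count


def format_story(index, story):
    """Render one post as plain text for display."""
    return "#%d  author: %s\n%s\nlikes: %d  comments: %d" % (
        index, story.author, story.content, story.likes, story.comments)


if __name__ == "__main__":
    demo = Story(author="someone", content="a short joke", likes=12, comments=3)
    print(format_story(1, demo))
```

The program in this post stores each post as a plain list instead; a record like this mainly makes the field order explicit.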

2. Fetching the start page

We use the urllib2 library directly to fetch the page content:

#!/usr/bin/python
#coding:utf-8

import urllib2

def getPages():
	url="http://www.qiushibaike.com/text/"
	requests=urllib2.urlopen(url).read().decode('utf-8')
	print requests
getPages()

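These two simple statements should be enough to fetch the start page so that we can go on to analyze it.

But there is a problem: the script does not run successfully. It fails with the following error: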

Traceback (most recent call last):
  File "06.qiushibaike_lianxi (复件).py", line 18, in <module>
    getPages()
  File "06.qiushibaike_lianxi (复件).py", line 15, in getPages
    requests=urllib2.urlopen(url).read().decode('utf-8')
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1187, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1089, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 444, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 408, in _read_status
    raise BadStatusLine(line)
httplib.BadStatusLine: ''

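Why?

Because some sites block this kind of access: they refuse requests that do not look like they come from a browser, such as requests from crawlers.

Adding a header that makes the request look like it comes from a browser is enough to get through. We also handle exceptions while we are at it, so the code becomes: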

#!/usr/bin/python
#coding:utf-8

import urllib2

def getPages():
    try:
        url="http://www.qiushibaike.com/text/"
        # Pretend to be a browser by sending a User-Agent header
        user_agent='Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0'
        headers={'User-Agent':user_agent}
        request=urllib2.Request(url,headers=headers)
        # Open the Request object, not the bare URL, so the header is actually sent
        response=urllib2.urlopen(request).read().decode('utf-8')
        print response
        return response
    except urllib2.URLError,e:
        if hasattr(e,"reason"):
            print u"Failed to connect to Qiushibaike, reason:",e.reason
            return None
getPages()

This gets us the content of the start page.
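
For readers on Python 3, where `urllib2` was split into `urllib.request` and `urllib.error`, the same idea can be sketched roughly as follows. This is an assumption-laden translation, not the author's code: `build_request` and `get_page` are hypothetical helper names, and the timeout value is arbitrary.

```python
import urllib.error
import urllib.request

START_URL = "http://www.qiushibaike.com/text/"
USER_AGENT = ("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) "
              "Gecko/20100101 Firefox/50.0")


def build_request(url=START_URL):
    """Create a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def get_page(url=START_URL, timeout=10):
    """Return the decoded page text, or None if the request fails."""
    try:
        # Pass the Request object (not the bare URL) so the header is sent.
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.URLError as e:
        print("Failed to connect to Qiushibaike, reason:", e.reason)
        return None
```

Calling `get_page()` performs a real network request, so it is left out of the snippet itself.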

3. Extracting the key content

Next we process the fetched page to pull out the content we want.

We use regular expressions to extract it:

#!/usr/bin/python
#coding:utf-8

import urllib2
import re

def getPages():
    try:
        # Start URL
        url="http://www.qiushibaike.com/text/"
        # Set the User-Agent header; without it the page cannot be fetched
        user_agent='Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0'
        headers={'User-Agent':user_agent}
        # Attach the headers to the request
        request=urllib2.Request(url,headers=headers)
        # Fetch the page and decode it as UTF-8
        html=urllib2.urlopen(request).read().decode('utf-8')
        #print html
        return html
    except urllib2.URLError,e:
        if hasattr(e,"reason"):
            print u"Failed to connect to Qiushibaike, reason:",e.reason
            return None
def getPageItem(html):
    pageStories=[]
    # Patterns for author, content, like count and comment count
    pattern_author=re.compile(u'<h2>(.*?)</h2>',re.S)
    pattern_content=re.compile(u'<span>(.*?)</span>',re.S)
    pattern_support=re.compile(u'<i class="number">(\d*)</i>\s*好笑',re.S)
    pattern_comment=re.compile(u'<i class="number">(\d*)</i>\s*评论',re.S)

    find_author=re.findall(pattern_author,html)
    find_content=re.findall(pattern_content,html)
    find_support=re.findall(pattern_support,html)
    find_comment=re.findall(pattern_comment,html)

    if find_author:
        for i in xrange(len(find_author)):
            # Turn <br/> tags into real line breaks
            replaceBR=re.compile("<br/>")
            text=re.sub(replaceBR,"\n",find_content[i])
            # Fall back to "0" when a count is missing
            comment="0"
            if i<len(find_comment):
                comment=find_comment[i].strip()
            support="0"
            if i<len(find_support):
                support=find_support[i].strip()
            # Store and print every post, whether or not the counts were found
            pageStories.append([str(i+1),find_author[i].strip(),text,support,comment])
            print str(i+1),find_author[i].strip(),text,support,comment
    else:
        print "Unexpected page data"
        return None

    return pageStories

html=getPages()
# getPages() returns None on failure, so only parse when we got a page
if html:
    getPageItem(html)

The output now looks like this (the post text is printed as scraped, in Chinese):

1 苍南下山耍流氓，黑衣格哥买红糖 记得有一次我发烧，到小区门口卫生所打针，一个姐姐给我夹上体温计以后，还关心的摸我额头，左手摸完换右手，最后俩手捂着我的脸，，，还对里面一个小护士喊；娟”快出来暖暖手，，，， 3961 98
2 Kiss萝卜 楼主女汉子一枚，打扮中性化。
刚住进一个新的小区没几天，就听说这个小区有一对同性恋。天天同进同出，十分恩爱。
后来，偶然之间才知道说的是我和老公！ 8355 236
3 妹子不见了 今天去蹦极，一看价格200元一次。觉得有点贵，就问售票员能不能便宜点？她头也不抬的说了一句:不要绳 便宜50！。听完我心里这个乐啊！ 3838 156
4 嘻哈妹纸 午休时间，嘴巴里含了个拉丝糖，不知不觉趴桌子上睡着了……
领导来了，你能想象到，我那右半边脸被流出来的糖水粘到桌子上的模样吗……
大写的    囧……啊…… 2587 46
5 许我三日暖 老婆去我一同学开的理发店烫头发。回来告诉我说没要钱。
我打电话过去问同学怎么回事。同学说了:因为今天是过节（三 八），让她高兴高兴。你记得明天来给她交钱…… 4528 76
6 <糗犯监狱>～阿木 儿子上小学二年级了，今天儿子的老师终于把我叫到学校去了。老师把儿子的作业本往桌子上一摔说:“你以为我看不出来吗？你这已经
是第三次帮儿子做作业了！去！面冲墙站着去！”我看着当年对我恩重如山，如今白发苍苍的老师二话没说站了过去！ 3064 64
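
To see what the four patterns do without touching the network, they can be run against a tiny hand-written snippet. The sketch below is Python 3; `SAMPLE_HTML` is an invented fragment shaped only to fit the patterns, not a copy of the real page markup, and `get_page_items` is a hypothetical stand-in for `getPageItem`.

```python
import re

# Invented markup that merely fits the patterns; the real page may differ.
SAMPLE_HTML = """
<div class="article">
<h2> Alice </h2>
<span>first line<br/>second line</span>
<i class="number">12</i> 好笑
<i class="number">3</i> 评论
</div>
<div class="article">
<h2>Bob</h2>
<span>only one line</span>
<i class="number">7</i> 好笑
</div>
"""

AUTHOR = re.compile(r'<h2>(.*?)</h2>', re.S)
CONTENT = re.compile(r'<span>(.*?)</span>', re.S)
SUPPORT = re.compile(r'<i class="number">(\d*)</i>\s*好笑', re.S)
COMMENT = re.compile(r'<i class="number">(\d*)</i>\s*评论', re.S)


def get_page_items(html):
    """Return [index, author, text, likes, comments] for each post found."""
    authors = AUTHOR.findall(html)
    contents = CONTENT.findall(html)
    supports = SUPPORT.findall(html)
    comments = COMMENT.findall(html)
    stories = []
    for i, author in enumerate(authors):
        # Turn <br/> tags into real line breaks
        text = contents[i].replace("<br/>", "\n") if i < len(contents) else ""
        # Fall back to "0" when a count list is shorter than the author list
        support = supports[i].strip() if i < len(supports) else "0"
        comment = comments[i].strip() if i < len(comments) else "0"
        stories.append([str(i + 1), author.strip(), text, support, comment])
    return stories


if __name__ == "__main__":
    for item in get_page_items(SAMPLE_HTML):
        print(item)
```

Because the four result lists are paired up purely by position, a post in the middle of the page that lacks a count would shift the counts of every post after it. Matching one whole post block at a time would avoid that, at the cost of a more complicated pattern.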

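4. Controlling the output

Printing everything at once is not pleasant to read. What we want is to print one post at a time, then print the next one when Enter is pressed after reading it.

Changing the program as follows does the job: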

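# Whether to keep going; set to False when the user types Q
enable=True

def getOneJoke(pageStories):
    # enable is a module-level flag, so it must be declared global to change it here
    global enable
    i =0
    for story in pageStories:
        i +=1
        # Wait for Enter before printing the next post; Q quits
        user_input =raw_input()
        if user_input=="Q":
            enable=False
            return
        else:
            print u"Post %d\tAuthor: %s\t\n%s\nLikes: %s  Comments: %s\n" % (i,story[1],story[2],story[3],story[4])

while enable:
    html=getPages()
    story=getPageItem(html) if html else None
    if story:
        getOneJoke(story)
    else:
        # Nothing to show; stop instead of looping forever
        break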

That achieves what we wanted, but one problem remains: so far we only get the first page. How do we get the pages after it?
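
As an aside on the press-Enter loop from this section: it is easier to try out when the input and output functions are passed in as parameters. The sketch below is Python 3, and `show_one_by_one` with its `read_line` and `write` parameters is a hypothetical variant, not the author's function.

```python
def show_one_by_one(stories, read_line=input, write=print):
    """Show one post per Enter press; return False if the user typed Q."""
    for i, story in enumerate(stories, start=1):
        # Wait for the user before printing the next post
        if read_line() == "Q":
            return False
        write("Post %d\tAuthor: %s\n%s\nLikes: %s  Comments: %s\n"
              % (i, story[1], story[2], story[3], story[4]))
    return True
```

Returning a flag instead of setting a global variable lets the caller decide whether to continue with the next page.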

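5. Fetching consecutive pages

If you page through http://www.qiushibaike.com/text/ you will notice a pattern:

each page's URL is the base URL followed by page/num, where num is the page number.

That is: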

http://www.qiushibaike.com/text/page/2

http://www.qiushibaike.com/text/page/3

So we only need to tweak the first function slightly to fetch consecutive pages.
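
The URL pattern can be captured in a one-line helper. This is a minimal Python 3 sketch; `build_page_url` is a hypothetical name, and the check for page numbers below 1 is an added assumption rather than something the site documents.

```python
BASE_URL = "http://www.qiushibaike.com/text/"


def build_page_url(page_index):
    """Return the URL of the given 1-based page of the text section."""
    if page_index < 1:
        raise ValueError("page_index must be >= 1")
    return "%spage/%d" % (BASE_URL, page_index)


if __name__ == "__main__":
    print(build_page_url(2))
    print(build_page_url(3))
```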

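Final version

After the small steps above and a bit of debugging of our own, we end up with a flexible little program.

Tested and working; it runs on both Windows and Linux.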

#!/usr/bin/python
#coding:utf-8

import urllib2
import re
import time
import sys
import datetime

class MyQiuBai:
    # Initializer: set up the instance variables
    def __init__(self):
        self.pageIndex=1
        self.user_agent='Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0'
        # Request headers
        self.headers={'User-Agent':self.user_agent}
        # Cached posts; each element holds the posts of one page
        self.stories=[]
        # Whether the program should keep running
        self.enable=False
        # Posts that have been read are saved to this local file
        self.filename='qiubai.txt'
        self.filesymbol=open(self.filename,'wb')

    # Fetch the HTML of the page with the given index
    def getPages(self,pageIndex):
        #print "page %d" % (pageIndex)
        try:
            # Build the URL for this page
            url="http://www.qiushibaike.com/text/page/"+str(pageIndex)
            # Build the request
            request=urllib2.Request(url,headers=self.headers)
            # Fetch the page with urlopen
            response=urllib2.urlopen(request)
            # Decode the page as UTF-8
            html=response.read().decode('utf-8')
            return html
        # Catch the exception so the program does not simply die
        except urllib2.URLError,e:
            if hasattr(e,"reason"):
                print u"Failed to connect to Qiushibaike, reason:",e.reason
                return None

    def getPageItem(self,html):
        # List that holds the extracted posts
        pageStories=[]
        # Brute-force regex matching: author, content, like count, comment count
        pattern_author=re.compile(u'<h2>(.*?)</h2>',re.S)
        pattern_content=re.compile(u'<span>(.*?)</span>',re.S)
        pattern_support=re.compile(u'<i class="number">(\d*)</i>\s*好笑',re.S)
        pattern_comment=re.compile(u'<i class="number">(\d*)</i>\s*评论',re.S)

        find_author=re.findall(pattern_author,html)
        find_content=re.findall(pattern_content,html)
        find_support=re.findall(pattern_support,html)
        find_comment=re.findall(pattern_comment,html)
        # A page may come back without any authors, so check first
        if find_author:
            for i in xrange(len(find_author)):
                # Light clean-up of the content: turn <br/> into real line breaks
                replaceBR=re.compile("<br/>")
                text=re.sub(replaceBR,"\n",find_content[i])
                comment="0"
                if i<len(find_comment):
                    comment=find_comment[i].strip()
                support="0"
                if i<len(find_support):
                    support=find_support[i].strip()
                # Store the post; i+1 is its position on this page
                pageStories.append([str(i+1),find_author[i].strip(),text,support,comment])
        else:
            print "Unexpected page data"
            return None
        return pageStories

    # Parse a page and add its posts to the cache
    def loadPage(self,pageCode):
        if self.enable==True:
            # Load another page while fewer than 2 pages are cached
            if len(self.stories)<2:
                pageStories=self.getPageItem(pageCode)
                if pageStories:
                    # Store this page's posts in the shared list
                    self.stories.append(pageStories)

    # Print one post each time Enter is pressed
    def getOneJoke(self,pageStories,page):
        for story in pageStories:
            # Current time
            writetime=datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S ')
            # Build the text once, print it, then write it to the file
            content="Page %d, post %s\tAuthor: %s\t%s\n%s\nLikes: %s  Comments: %s\n" % (page,story[0],story[1],str(writetime),story[2],story[3],story[4])
            print content
            self.filesymbol.write(content)
            self.filesymbol.write('\n')
            user_input=raw_input()
            # "Q" quits the program and closes the file
            if user_input=="Q":
                self.enable=False
                self.filesymbol.close()
                return

    def begin(self):
        print u"Reading Qiushibaike page by page. Press Enter for the next post, Q to quit."
        self.enable= True
        # Let the user choose the starting page
        nowPage=1
        user_input=raw_input('Enter the page to start from (default: page 1): ')
        if user_input=="Q":
            self.enable=False
            self.filesymbol.close()
            return
        try:
            nowPage=int(user_input)
        except Exception,e:
            print "Not a page number: '%s', starting from page 1" % (user_input)

        while self.enable:
            # Fetch the current page
            pageCode=self.getPages(nowPage)
            if not pageCode:
                print("Failed to load the page...")
                return None
            # Parse it and add it to the cache
            self.loadPage(pageCode)
            if len(self.stories)>0:
                # Take one page of posts from the cache
                pageStories=self.stories[0]
                # Remove it from the cache now that it has been taken out
                del self.stories[0]
                # Show the posts of this page one by one
                self.getOneJoke(pageStories,nowPage)
            nowPage +=1


# Python 2 only: make implicit str/unicode conversions use UTF-8,
# which the file writes in getOneJoke rely on
reload(sys)
sys.setdefaultencoding( "utf-8" )
qiubai=MyQiuBai()
qiubai.begin()
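
The page-by-page loop in `begin` can also be expressed as a generator, which separates "which page comes next" from "how a page is shown". The following Python 3 sketch is only an illustration under assumptions: `iter_pages`, `fetch_page`, `parse_page` and `max_pages` are hypothetical names, and the fetching and parsing functions are passed in rather than defined here.

```python
def iter_pages(fetch_page, parse_page, start_page=1, max_pages=None):
    """Yield (page_number, stories) until a page fails to load.

    fetch_page(n) should return the page HTML, or None on failure.
    parse_page(html) should return a list of posts (possibly empty or None).
    """
    page = start_page
    fetched = 0
    while max_pages is None or fetched < max_pages:
        html = fetch_page(page)
        if not html:
            return  # stop at the first page that cannot be loaded
        stories = parse_page(html)
        if stories:
            yield page, stories
        # Pages that parse to nothing are skipped, as in the class above
        page += 1
        fetched += 1
```

A caller would loop over `iter_pages(...)` and stop iterating as soon as the user asks to quit.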

Finally, you can use the py2exe tool to package it into a small application, so it can be used without installing Python.

Reference: http://cuiqingcai.com/990.html

    Original author: python爬虫
    Original URL: https://blog.csdn.net/qiqiyingse/article/details/60583129