The Regular Expressions You Can't Escape

January 22, 2023 · Source: _weber_

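Introduction

Many people get a headache the moment they hear the words "regular expression", and a bigger one when they open the documentation, because it looks unreadable, like nothing a human could be expected to understand.
But if you want to learn web scraping, there is no escaping regular expressions.
This article sets out to show that regular expressions are not scary at all; they only look that way.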

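Resources

30-Minute Introduction to Regular Expressions: http://deerchao.net/tutorials/regex/regex.htm
I know many people find it hard to follow, but read it carefully a few times. It is fine if it does not click right away; after a few passes it starts to make sense.
Python's regular expression module is re, so the re documentation is worth a look too:
http://python.usyiyi.cn/translate/python_278/library/re.html
If you lack the patience or cannot follow it, that is fine too. Carry on with the rest of this article and it will gradually make sense.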
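
As a first taste of the re module, here is a minimal sketch of the one idea this article leans on most: the difference between the greedy .* and the non-greedy .*? inside a capture group. The sample string is made up for illustration and does not come from a real page.

```python
import re

# Hypothetical sample text, made up for illustration only.
text = '<img src="a.jpg"><img src="b.jpg">'

# Greedy: .* grabs as much as it can, so the match runs to the LAST quote.
greedy = re.findall(r'src="(.*)"', text)

# Non-greedy: .*? stops at the first closing quote it can reach.
lazy = re.findall(r'src="(.*?)"', text)

print(greedy)  # ['a.jpg"><img src="b.jpg']
print(lazy)    # ['a.jpg', 'b.jpg']
```

When a pattern contains exactly one group in parentheses, re.findall returns only the text captured by that group, not the whole match.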

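Goal:

This post builds on the previous one (http://www.jianshu.com/p/359ce3c88082 ); if you have not read it, please do so first.
In the previous post we downloaded three pages of a Tieba thread. What we actually want is just the images in them (you know why), so now we work out how to extract every image URL.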

Code for this post

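# m holds the raw bytes of one page, as downloaded in the previous article
picurls = re.findall('http://imgsrc.baidu.com/forum/w%3D580(.*?)"', m.decode(), re.S)
with open('c:/wwb/python/temp/picsite.txt', 'a') as file:
    for each in picurls:
        picurl = 'http://imgsrc.baidu.com/forum/w%3D580' + each
        file.write(picurl + '\n')  # one URL per line, otherwise they run together
'''
Use a regular expression to extract every image URL from the page source.
m.decode() converts bytes to str, because a str pattern can only be matched against a str.
re.S lets . match newline characters as well.
The for loop writes each image URL to the txt file.
'''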
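
The three points in the notes above (decoding, re.S, and the capture group) can be tried without touching the network. The sketch below uses a made-up page fragment, so the bytes in raw and the "note" div are assumptions for illustration only. It also uses re.escape so that the dots in the prefix match literal dots, which the snippet above does not do.

```python
import re

# Hypothetical page fragment (bytes, like the result of urlopen(...).read()).
raw = (b'<img src="http://imgsrc.baidu.com/forum/w%3D580/sign=abc/1.jpg" width="560">\n'
       b'<div class="note">line one\nline two</div>')

# A str pattern cannot be applied to bytes; Python raises TypeError.
try:
    re.findall('img', raw)
    mixed_ok = True
except TypeError:
    mixed_ok = False

html = raw.decode()  # bytes -> str, UTF-8 by default

prefix = 'http://imgsrc.baidu.com/forum/w%3D580'
# re.escape turns the dots in the prefix into literal dots.
tails = re.findall(re.escape(prefix) + r'(.*?)"', html, re.S)
urls = [prefix + tail for tail in tails]

# Without re.S the dot stops at a newline, so the two-line div is missed.
no_flag = re.findall(r'<div class="note">(.*?)</div>', html)
with_flag = re.findall(r'<div class="note">(.*?)</div>', html, re.S)

print(mixed_ok)   # False
print(urls)       # ['http://imgsrc.baidu.com/forum/w%3D580/sign=abc/1.jpg']
print(no_flag)    # []
print(with_flag)  # ['line one\nline two']
```

For the image URLs themselves re.S makes no difference as long as each URL sits on one line; it matters only when the text you want to capture spans a line break.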

You now have a txt file containing every image URL.

Complete code

# -*- coding:utf-8 -*-

import urllib.request as request
import re

def baidu_tieba(url, begin_page, end_page):
    for i in range(begin_page, end_page+1):
        # Save each page under a zero-padded name, e.g. 00001.html
        sName = 'c:/wwb/python/temp/'+str(i).zfill(5)+'.html'
        print('Downloading page '+str(i)+' and saving it as '+sName)
        m = request.urlopen(url+str(i)).read()
        with open(sName, 'wb') as file:
            file.write(m)
        # Extract this page's image URLs and append them to picsite.txt
        picurls = re.findall('http://imgsrc.baidu.com/forum/w%3D580(.*?)"', m.decode(), re.S)
        with open('c:/wwb/python/temp/picsite.txt', 'a') as file:
            for each in picurls:
                picurl = 'http://imgsrc.baidu.com/forum/w%3D580'+each
                file.write(picurl + '\n')  # one URL per line
    print('Done')

url = 'http://tieba.baidu.com/p/4906913050?pn='
begin_page = 1
end_page = 3
baidu_tieba(url, begin_page, end_page)
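
If you want to test the extraction step on its own, one possible variation is to pull it out into a function that takes a str and returns a list. This is only a sketch, not the code above: the names PREFIX, PIC_PATTERN, extract_image_urls and save_urls are made up here, the pattern is compiled once, and it uses [^"]* (everything up to the next double quote) in place of (.*?)" so that the match is the full URL and no prefix needs to be glued back on. It also drops duplicate URLs, which the original does not do.

```python
import re

PREFIX = 'http://imgsrc.baidu.com/forum/w%3D580'
# Escaped prefix followed by every character up to (not including) the next quote.
PIC_PATTERN = re.compile(re.escape(PREFIX) + r'[^"]*')

def extract_image_urls(html):
    """Return matching image URLs in order of first appearance, without duplicates."""
    seen = set()
    urls = []
    for url in PIC_PATTERN.findall(html):
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

def save_urls(urls, path):
    """Append the URLs to a text file, one per line."""
    with open(path, 'a', encoding='utf-8') as f:
        for url in urls:
            f.write(url + '\n')

# Hypothetical page source, made up for illustration only.
sample = ('<img src="http://imgsrc.baidu.com/forum/w%3D580/a.jpg">'
          '<img src="http://imgsrc.baidu.com/forum/w%3D580/a.jpg">'
          '<img src="http://imgsrc.baidu.com/forum/w%3D580/b.jpg">')
found = extract_image_urls(sample)
print(found)
```

Inside baidu_tieba you could then call save_urls(extract_image_urls(m.decode()), 'c:/wwb/python/temp/picsite.txt') in place of the inline loop.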

    Original author: _weber_
    Original URL: https://www.jianshu.com/p/f7ef9bf0fb26