Python Crawler Part 1: Bing Images (Finding Image Links in the Page Source and Downloading Them)

June 1, 2024 · Source: csuzhucong

This post covers the simplest kind of crawler: find the image links in a page's source code, then download them.

Code:

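# coding=utf-8
# Bing image crawler
import os
import re
import urllib.request

url = 'http://cn.bing.com/images/search?q=usb+pen&FORM=HDRSC2'
coding = 'utf-8'
thepath = 'D:\\'


def get():
    # Fetch the page source and decode it to a string.
    try:
        html = urllib.request.urlopen(url).read().decode(coding)
    except Exception:
        print('error')
        print(url)
        return
    # Extract the page title; it is used as the folder name.
    match = re.search("<title>.*</title>", html)
    if match is None:
        print('error')
        print(url)
        return
    # Drop the 7-character "<title>" prefix and the last 20 characters
    # (the 8-character "</title>" plus 12 more, presumably the site suffix).
    title = match.group()[7:-20]
    # Collect strings that start with http:// and end with an image extension.
    # The dot is escaped and the match is non-greedy so that two nearby URLs
    # are not merged into one.
    pic_url = re.findall(r'http://.{1,100}?\.(?:jpeg|jpg|png)', html, re.IGNORECASE)
    pic_url = list(set(pic_url))  # remove duplicates
    path = thepath + title
    try:
        os.mkdir(path)
    except OSError:
        # The folder already exists or cannot be created.
        return
    i = 1
    for each in pic_url:
        try:
            pic = urllib.request.urlopen(each, timeout=10).read()
        except Exception:
            # Skip images that fail to download.
            continue
        # Every file is saved with a .jpg suffix, whatever its real format.
        file = os.path.join(path, title + str(i) + '.jpg')
        with open(file, 'wb') as fp:
            fp.write(pic)
        i = i + 1
    # Remove the folder again if nothing was downloaded.
    if not os.listdir(path):
        os.rmdir(path)
        print('error')
        print(url)


get()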

Explanation:

(1) urlopen opens the web page at url. read() then returns the page source as bytes, and decode() turns it into a string.

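(2) title is the page title, extracted from the source with a regular expression. The slice [7:-20] then strips the <title> tag at the front and the trailing characters at the end.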

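(3) pic_url is the list of image URLs. The approach is simple: take every string that starts with http:// and ends with jpg, with 1 to 100 characters in between. Because the URLs of different images are separated from one another in the page source, this simple regular expression matches URLs ending in jpg with high accuracy. Other extensions such as jpeg are handled with alternation, re1|re2. Image URLs that do not end in an image extension will not be found, and neither will https:// URLs. list(set(pic_url)) removes duplicates.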

(4) path is the folder to create. The folder name is title.

(5) urlopen is used again to open each image URL directly and download the image. The files are named title1, title2, and so on.

(6) Finally, if the folder was created but no image was downloaded, the folder is deleted.
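
Steps (2) and (3) can be tried without any network access. The following is a minimal sketch, not the original author's code: SAMPLE_HTML, the example.com URLs, and the helper names extract_title and extract_image_urls are all hypothetical, made up for illustration. The pattern escapes the dot before the extension and matches non-greedily, so two URLs that sit close together are not merged into one match.

```python
import re

# Hypothetical page source, invented for this illustration.
SAMPLE_HTML = (
    '<html><head><title>usb pen - Example Images</title></head><body>'
    '<img src="http://img.example.com/a/usb1.jpg">'
    '<img src="http://img.example.com/b/usb2.PNG">'
    '<img src="http://img.example.com/a/usb1.jpg">'
    '<img src="http://img.example.com/c/photo?id=42">'
    '</body></html>'
)

# http://, then 1 to 100 characters (as few as possible), then an image suffix.
IMAGE_URL = re.compile(r'http://.{1,100}?\.(?:jpeg|jpg|png)', re.IGNORECASE)


def extract_title(html):
    """Return the text between <title> and </title>, or None if there is none."""
    match = re.search(r'<title>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def extract_image_urls(html):
    """Return the unique image URLs found in html, sorted for a stable order."""
    return sorted(set(IMAGE_URL.findall(html)))


if __name__ == '__main__':
    print(extract_title(SAMPLE_HTML))
    for link in extract_image_urls(SAMPLE_HTML):
        print(link)
```

The duplicate usb1.jpg link is returned only once, and the last link, which has no image extension, is skipped. That is the limitation described in (3).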

    Original author: csuzhucong
    Original URL: https://blog.csdn.net/nameofcsdn/article/details/76010849