Python crawler: scraping novels from Biquge (笔趣阁)

May 9, 2019 · Source: hello_spider

1. Environment

Python 3.6
Python official site: www.python.org
Libraries used: re, time, random, requests
Installing the requests library: https://jingyan.baidu.com/article/86f4a73ea7766e37d7526979.html
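
Of these four libraries, only requests is a third-party package; re, time, and random ship with Python. A minimal sketch for checking that everything is importable before running the crawler (the helper name `missing_modules` is hypothetical, not part of the original script):

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# re, time and random are standard library modules; requests is not.
required = ["re", "time", "random", "requests"]
missing = missing_modules(required)
if missing:
    print("Install first with: pip install " + " ".join(missing))
else:
    print("All required modules are available")
```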

2. Approach

First, visit the site manually, the way a normal reader would.

Open the browser's developer tools (I use Chrome; press F12).

From the search results page, extract the URL of the novel's detail page.

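The detail page lists the URL of every chapter. Requesting a chapter URL returns the page holding that chapter's text, which a regular expression can then extract.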
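
The three steps can be tried offline before touching the network. The sketch below applies the same regular expressions used in the code in section 4 to small HTML snippets. The snippets, the example.com URLs, and the names are made up for illustration and are much simpler than the real pages.

```python
import re

# Hypothetical, heavily simplified stand-ins for the three kinds of page.
SEARCH_PAGE = '''
<tr>
  <td class="odd"><a href="https://example.com/1_1/">Sample Novel</a></td>
  <td class="even"><a href="https://example.com/1_1/100.html">Chapter 2</a></td>
  <td class="odd">Some Author</td>
</tr>
'''
DETAIL_PAGE = '''
<dd><a href="https://example.com/1_1/99.html">Chapter 1</a></dd>
<dd><a href="https://example.com/1_1/100.html">Chapter 2</a></dd>
'''
CHAPTER_PAGE = '<div id="content"><p>First line.</p><p>Second line.</p></div>'

# Step 1: search results -> (novel URL, title, author)
novels = re.findall(
    r'<td class="odd"><a href="(.*?)">(.*?)</a></td>.*?<td class="odd">(.*?)</td>',
    SEARCH_PAGE, re.S)

# Step 2: detail page -> (chapter URL, chapter title)
chapters = re.findall(r'<dd><a href="(.*?)">(.*?)</a></dd>', DETAIL_PAGE)

# Step 3: chapter page -> chapter text with the paragraph tags removed
body = re.findall(r'<div id="content">(.*?)</div>', CHAPTER_PAGE, re.S)[0]
text = body.replace('<p>', '').replace('</p>', '\n')

print(novels)
print(chapters)
print(text)
```

The re.S flag lets `.` match newlines, which matters for steps 1 and 3 because the matched regions span several lines of HTML.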

3. Analyzing the page structure

[Figure: the real URL behind the site's novel search]

[Figure: the URL of a novel]

[Figure: the URL of each chapter of a novel]
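
As a rough illustration of how these URLs fit together, the sketch below builds the search URL from a book name and resolves a chapter link against the novel's URL. The endpoint is the one used in the code in section 4. The helper names are hypothetical, and two things are assumptions rather than facts about the site: that it accepts a UTF-8 percent-encoded search key, and that chapter links might be relative (the original script uses them as-is).

```python
from urllib.parse import urlencode, urljoin

# Search endpoint taken from the script in section 4.
SEARCH_ENDPOINT = "https://www.biquge5200.com/modules/article/search.php"


def build_search_url(book_name):
    """Return the search URL for `book_name`, percent-encoding the key as UTF-8."""
    return SEARCH_ENDPOINT + "?" + urlencode({"searchkey": book_name})


def absolute_chapter_url(novel_url, href):
    """Resolve a chapter link against the novel URL; absolute links pass through unchanged."""
    return urljoin(novel_url, href)


print(build_search_url("abc"))
print(absolute_chapter_url("https://example.com/1_1/", "99.html"))
```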

4. Implementation

import requests
import re
import time
import random


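def download(book_name):
    """Search for a novel by name; return (novel_url, novel_name), or None if not found."""
    search_real_url = 'https://www.biquge5200.com/modules/article/search.php?searchkey=' + book_name
    try:
        novel_source = requests.get(search_real_url).text
    except requests.RequestException as e:
        # Network failure: report it and give up instead of using an undefined result list
        print(e)
        return None
    reg1 = r'<td class="odd"><a href="(.*?)">(.*?)</a></td>.*?<td class="odd">(.*?)</td>'
    # All search results as (novel URL, title, author) tuples
    novel_list = re.findall(reg1, novel_source, re.S)
    for novel_url, novel_name, novel_author in novel_list:
        if novel_name == book_name:
            print('About to download: %s  Author: %s' % (novel_name, novel_author))
            return novel_url, novel_name
    # Either no results at all or no exact title match
    print('The novel you are looking for was not found; please check the name and try again')
    return None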


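def get_chapter(url):
    """Fetch the novel's detail page; return a list of (chapter_url, chapter_name) tuples."""
    try:
        # HTML source of the chapter index page
        chapter_page_source = requests.get(url).text
    except requests.RequestException as e:
        # Return an empty list on failure so callers can still iterate safely
        print(e)
        return []
    reg2 = r'<dd><a href="(.*?)">(.*?)</a></dd>'
    return re.findall(reg2, chapter_page_source)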


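def get_content(chapter_list, novel_name):
    """Download every chapter in order and append it to '<novel_name>.txt'."""
    count = 0
    length = len(chapter_list)
    for chapter_url, chapter_name in chapter_list:
        try:
            # Pause 1 to 2 seconds between requests to avoid hammering the site
            time.sleep(1 + random.random())
            content_source = requests.get(chapter_url).text
            reg = r'<div id="content">(.*?)</div>'
            content = re.findall(reg, content_source, re.S)[0]
            # Strip line-break and paragraph tags, &nbsp; entities, and plain spaces
            content = (content.replace('<br/>', '').replace('&nbsp;', '').replace(' ', '')
                       .replace('<p>', '').replace('</p>', ''))
            count += 1
            with open(novel_name + '.txt', 'a', encoding='utf-8') as f:
                f.write(chapter_name + '\n' * 2 + content + '\n' * 2)
            print('Writing: ' + chapter_name)
            # Multiply by 100 so the number printed really is a percentage
            print('Progress: %0.2f%%' % (count / length * 100))
        except Exception as e:
            # Skip a chapter that fails to download or parse and carry on with the next
            print(e)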


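if __name__ == '__main__':
    book_name = input('Enter the name of the novel to download (make sure the name is exact): ')
    result = download(book_name)
    # download() yields nothing when there is no exact title match, so check before unpacking
    if result:
        novel_url, novel_name = result
        chapter_list = get_chapter(novel_url)
        get_content(chapter_list, novel_name)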
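
The chain of replace() calls in get_content only handles the exact tags it lists. As an alternative (not the method used above), the sketch below cleans the chapter HTML with regular expressions and html.unescape, and formats progress as a true percentage. The function names `clean_content` and `progress` are hypothetical, and the assumption is that the chapter body contains only simple inline markup.

```python
import html
import re


def clean_content(raw_html):
    """Turn the inner HTML of the content <div> into plain text, one paragraph per line."""
    text = re.sub(r'<br\s*/?>|</p>', '\n', raw_html)  # line-ending tags become newlines
    text = re.sub(r'<[^>]+>', '', text)               # drop any remaining tags
    text = html.unescape(text)                        # decode entities such as &nbsp; and &amp;
    text = text.replace('\xa0', ' ')                  # non-breaking spaces become plain spaces
    lines = [line.strip() for line in text.splitlines()]
    return '\n'.join(line for line in lines if line)  # discard empty lines


def progress(done, total):
    """Format done/total as a percentage with two decimal places."""
    return '%.2f%%' % (done / total * 100)


print(clean_content('<p>&nbsp;&nbsp;Hello&amp;bye</p><br/>World<br />'))
print(progress(1, 4))
```

Unlike the replace() chain, this keeps paragraph breaks as newlines, so the saved text file stays readable.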

## The above is for learning purposes only

    Original author: hello_spider
    Original article: https://www.jianshu.com/p/9bca173984a7