Approach:
1. Scrape the HTML tags of the page you want to turn into a PDF.
2. Insert the scraped tags into a body tag to assemble a complete HTML document (I recall there is a library that can do this, but I couldn't find it after a long search; if you remember it, please leave a comment below).
3. Use the pdfkit library to convert the assembled HTML into a PDF document.
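Step 2 doesn't actually need an extra library; plain string formatting is enough. A minimal sketch (the template and function names here are just illustrative):

```python
# Wrap a scraped HTML fragment in a full document skeleton (step 2).
TEMPLATE = '''<!DOCTYPE html>
<html lang="zh-CN">
<head><meta charset="UTF-8"><title>{title}</title></head>
<body>{body}</body>
</html>'''

def wrap_fragment(fragment, title='Untitled'):
    """Embed a scraped tag fragment inside a complete HTML document."""
    return TEMPLATE.format(title=title, body=fragment)

page = wrap_fragment('<div class="blog-content-box">hello</div>', title='Demo')
```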
Installing and using pdfkit
pip install pdfkit
You also need to install the companion tool wkhtmltox (download it from the official site and just click Next through the installer),
then add the bin directory of the wkhtmltox install location to your PATH environment variable.
pdfkit.from_file('path/to/input.html', 'path/to/output.pdf')
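If you'd rather not edit the PATH variable, pdfkit can also be pointed directly at the wkhtmltopdf executable via a configuration object. A minimal configuration sketch, assuming the default Windows install path (adjust to your machine); the `encoding` option helps keep Chinese text from being garbled in the output:

```python
import pdfkit

# Hypothetical install path; change it to wherever wkhtmltopdf.exe lives.
WKHTMLTOPDF_PATH = r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'

config = pdfkit.configuration(wkhtmltopdf=WKHTMLTOPDF_PATH)
pdfkit.from_file('csdn.html', 'csdn.pdf',
                 configuration=config,
                 options={'encoding': 'UTF-8'})  # passed through as --encoding
```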
Here is the full code:
import pdfkit
import requests
from lxml import etree


# Scrape the main article content from a CSDN blog post
def spider(url):
    headers = {
        'authority': 'blog.csdn.net',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'sec-ch-ua': '"Google Chrome";v="95", "Chromium";v="95", ";Not A Brand";v="99"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://blog.csdn.net/CXY00000?spm=1008.2194.3001.5343',
        'accept-language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
        'cookie': '',  # (replace with your own cookie)
    }
    response = requests.get(url=url, headers=headers)
    tree = etree.HTML(response.text)
    # Grab the article body container and serialize it back to an HTML string
    lis = tree.xpath('//div[@class="blog-content-box"]')[0]
    lis_str = etree.tostring(lis, encoding='unicode')
    # Wrap the fragment in a complete HTML document
    html1 = '''<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
{}
</body>
</html>'''.format(lis_str)
    with open('./csdn.html', 'w', encoding='utf8') as fp:
        fp.write(html1)


# Convert the saved HTML file into a PDF
def makepdf(out_name):
    pdfkit.from_file('./csdn.html', './' + str(out_name) + '.pdf')


if __name__ == '__main__':
    url = input('Enter the URL of the article to scrape: ')
    out_name = input('Enter the output PDF file name: ')
    spider(url)
    makepdf(out_name)
Download link for the standalone .exe tool:
csdn文章转pdf(可见即可转).exe
https://jhc001.lanzouw.com/idWn4x0n50b
Password: 7eeh