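Crawling HTML and generating a PDF

I started from the article "Using puppeteer on the front end to crawl and generate a PDF of React.js 小书, then merge it". The final PDF it produces has no bookmarks, which is frustrating, so the main thing added here is bookmark support.

The example site being crawled is React.js 小书; this is for learning and exchange only.

Generating a PDF from a web page

Use puppeteer to crawl a web page and generate a PDF

puppeteer Chinese documentation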

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
  await page.pdf({path: 'hn.pdf', format: 'A4'});

  await browser.close();
})();

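Merging the PDFs

pdf-merge: merges PDFs

It depends on pdftk

How to add bookmarks to a PDF

pdftk: a tool for processing PDFs

  • After installing, add the bin directory to the PATH environment variable

Use update_info_utf8 to add bookmarks to a PDF: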

pdftk 'd:\OpenSource\My\genpfdforrsb\React 小书(无书签).pdf' update_info_utf8 'd:\OpenSource\My\genpfdforrsb\bookmarks.txt' output 'd:\OpenSource\My\genpfdforrsb\React 小书.pdf'

What the bookmarks are

They are the contents of bookmarks.txt

Bookmark format:

BookmarkBegin
BookmarkTitle: PDF Reference (Version 1.5)
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: Contents
BookmarkLevel: 2
BookmarkPageNumber: 3
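
As a rough illustration of this format, the sketch below builds the same text from a list of entries. The function name `formatBookmarks` and the `{ title, level, page }` entry shape are hypothetical, chosen only for this example; the `BookmarkBegin`, `BookmarkTitle`, `BookmarkLevel`, and `BookmarkPageNumber` keys are the ones shown in the sample above.

```javascript
// Hypothetical helper: turn bookmark entries into the bookmark text shown above.
function formatBookmarks(entries) {
  const blocks = entries.map(({ title, level, page }) =>
    [
      'BookmarkBegin',
      `BookmarkTitle: ${title}`,
      `BookmarkLevel: ${level}`,
      `BookmarkPageNumber: ${page}`,
    ].join('\n')
  );
  // One block per bookmark, each line terminated by a newline
  return blocks.join('\n') + '\n';
}

// Rebuild the two-entry sample: a level-1 bookmark with a level-2 child
const sample = formatBookmarks([
  { title: 'PDF Reference (Version 1.5)', level: 1, page: 1 },
  { title: 'Contents', level: 2, page: 3 },
]);
console.log(sample);
```

A level-2 entry that follows a level-1 entry is nested under it, which is how a chapter and section outline could be expressed.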

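Determining the bookmark page numbers

pdfjs-dist: gets the page count of each individual PDF, which is used to set the page numbers in bookmarks.txt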
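
To make the page arithmetic concrete: each chapter's bookmark should point at the page where that chapter starts in the merged file, which is 1 plus the total page count of all earlier chapters. Here is a minimal sketch, assuming the per-file page counts have already been collected (for example from the `numPages` value that pdfjs-dist reports for each document). `startPages` is a hypothetical name used only in this example.

```javascript
// Hypothetical helper: given each PDF's page count, in merge order,
// return the 1-based page on which each PDF starts in the merged file.
function startPages(pageCounts) {
  const starts = [];
  let next = 1; // PDF page numbers are 1-based
  for (const count of pageCounts) {
    starts.push(next);
    next += count; // the following PDF begins right after this one
  }
  return starts;
}

// Three chapters of 3, 5 and 2 pages start on pages 1, 4 and 9
console.log(startPages([3, 5, 2])); // [ 1, 4, 9 ]
```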

Generating the bookmarks

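const fs = require('fs');

// result: assumed to be the per-chapter PDF documents loaded with pdfjs-dist
// titleArr: the chapter titles, in the same order as result
const pageArr = result.map(c => c.numPages);
let pageIndex = 1; // PDF page numbers start at 1
let txt = '';
for (let index = 0; index < pageArr.length; index++) {
    // One level-1 bookmark per chapter, pointing at its first page in the merged PDF
    txt += `BookmarkBegin\r\nBookmarkTitle: ${titleArr[index]}\r\nBookmarkLevel: 1\r\nBookmarkPageNumber: ${pageIndex}\r\n`;
    // The next chapter starts right after this chapter's pages
    pageIndex += pageArr[index];
}
fs.writeFileSync('bookmarks.txt', txt);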

Adding the bookmarks

Following the pdf-merge source code, add runshell.js to run pdftk commands from node

runshell.js is as follows:

'use strict';
const child = require('child_process');
const Promise = require('bluebird');
const exec = Promise.promisify(child.exec);

module.exports = (scripts) => new Promise((resolve, reject) => {
    exec(scripts)
        .then(resolve)
        .catch(reject);
});

Running pdftk update_info_utf8

const nobkname = 'React 小书(无书签).pdf'
const hasbkname = 'React 小书.pdf'
mergepdf(nobkname).then(buffer => {
    console.log('starting add bookmarks!')
    runshell(`pdftk "${__dirname}/${nobkname}" update_info_utf8 "${__dirname}/bookmarks.txt" output "${__dirname}/${hasbkname}"`).then(() => {
        console.log('completed add bookmarks!')
        fs.unlinkSync(`${__dirname}/${nobkname}`);
        fs.unlinkSync(`${__dirname}/bookmarks.txt`);
        console.log('all completed!')
    })
})
  • File paths must be wrapped in double quotes
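
One way to sidestep the quoting issue altogether, offered as an alternative sketch rather than the approach used above, is Node's built-in `child_process.execFile`. It passes each argument to the program directly instead of going through a shell, so paths containing spaces need no quotes. The names `buildPdftkArgs` and `addBookmarks` are hypothetical, and `pdftk` is assumed to be on the PATH.

```javascript
const { execFile } = require('child_process');
const { promisify } = require('util');
const path = require('path');

const execFileAsync = promisify(execFile);

// Hypothetical helper: the argument list for
// `pdftk <input> update_info_utf8 <bookmarks> output <output>`.
function buildPdftkArgs(input, bookmarks, output) {
  return [input, 'update_info_utf8', bookmarks, 'output', output];
}

// Hypothetical wrapper: run pdftk (assumed to be on the PATH) without a shell.
// Resolves when pdftk exits successfully, rejects otherwise.
function addBookmarks(dir, inputName, outputName) {
  const args = buildPdftkArgs(
    path.join(dir, inputName),
    path.join(dir, 'bookmarks.txt'),
    path.join(dir, outputName)
  );
  return execFileAsync('pdftk', args);
}

// Usage (not executed here):
// addBookmarks(__dirname, 'React 小书(无书签).pdf', 'React 小书.pdf')
//   .then(() => console.log('completed add bookmarks!'))
//   .catch(err => console.error('pdftk failed:', err.message));
```

Adding a `.catch` as in the usage comment also surfaces pdftk failures, which would otherwise go unreported.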

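Source code: genpfdforrsb

Known issue

The page numbers printed in the merged PDF are not continuous; each part still carries the page numbers of its original single PDF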

    Original author: 见风仍然是风
    Original URL: https://segmentfault.com/a/1190000017789823