The correct way to edit PDF metadata in Python
There are several ways to edit PDF metadata in Python, but one way is better than the others.
I will start with the approaches that seem right but have side effects. If you are short on time, skip to the end of this article and just use the correct way.
The first option is pdfrw. Its weakness is that the package is no longer maintained.
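from pdfrw import PdfReader, PdfWriter, PdfDict

if __name__ == '__main__':
    # Read the source PDF and update its Info dictionary in place.
    pdf_reader = PdfReader('old.pdf')
    metadata = PdfDict(Author='Someone', Title='PDF in Python')
    pdf_reader.Info.update(metadata)
    # Write the whole document, with the updated metadata, to a new file.
    PdfWriter().write('new.pdf', pdf_reader)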
pdfrw can do this quite easily without losing non-display information such as bookmarks.
PyPDF2 supports more PDF features than pdfrw, including decryption and more types of decompression.
The weakness of the PdfFileWriter approach is that the new PDF does not preserve outlines (bookmarks).
import pprint

from PyPDF2 import PdfFileReader, PdfFileWriter

if __name__ == '__main__':
    file_in = open('old.pdf', 'rb')
    pdf_reader = PdfFileReader(file_in)
    # Print the existing metadata for reference.
    metadata = pdf_reader.getDocumentInfo()
    pprint.pprint(metadata)

    # Copy every page into a new writer, then attach the new metadata.
    pdf_writer = PdfFileWriter()
    pdf_writer.appendPagesFromReader(pdf_reader)
    pdf_writer.addMetadata({
        '/Author': 'Someone',
        '/Title': 'PDF in Python'
    })

    file_out = open('new.pdf', 'wb')
    pdf_writer.write(file_out)
    file_in.close()
    file_out.close()
Using PdfFileWriter, we create a new PDF, copy the old contents with appendPagesFromReader(), then call addMetadata().
It seems that we cannot modify the PDF metadata in place, so we add all pages and the metadata to a new writer, then write the result out to a new file.
The correct way to edit PDF metadata in Python.
import pprint

from PyPDF2 import PdfFileReader, PdfFileMerger

# Note: these class names belong to the legacy PyPDF2 API (before 3.0).
if __name__ == '__main__':
    file_in = open('old.pdf', 'rb')
    pdf_reader = PdfFileReader(file_in)
    # Print the existing metadata for reference.
    metadata = pdf_reader.getDocumentInfo()
    pprint.pprint(metadata)

    # append() copies the pages and, by default, the bookmarks too.
    pdf_merger = PdfFileMerger()
    pdf_merger.append(file_in)
    pdf_merger.addMetadata({
        '/Author': 'Someone',
        '/Title': 'PDF in Python'
    })

    file_out = open('new.pdf', 'wb')
    pdf_merger.write(file_out)
    file_in.close()
    file_out.close()
Using PdfFileMerger, we concatenate pages through append(), which also imports the source document's bookmarks by default.
append(fileobj, bookmark=None, pages=None, import_bookmarks=True)
- import_bookmarks (bool) – You may prevent the source document’s bookmarks from being imported by specifying this as False.
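To make the import_bookmarks switch easier to try, the PdfFileMerger steps can be wrapped in a function. This is only a sketch: edit_pdf_metadata is a hypothetical name of my own, and it assumes a PyPDF2 release older than 3.0 that still provides PdfFileMerger, addMetadata() and the import_bookmarks parameter as quoted above.

```python
def edit_pdf_metadata(src_path, dst_path, metadata, import_bookmarks=True):
    """Copy src_path to dst_path and attach the given metadata.

    metadata is a dict with '/Name'-style keys, e.g. {'/Author': 'Someone'}.
    Pass import_bookmarks=False to leave the source bookmarks out.
    """
    # Imported lazily; assumes the legacy PyPDF2 API (before 3.0).
    from PyPDF2 import PdfFileMerger

    pdf_merger = PdfFileMerger()
    # Keep the input file open until the output has been written.
    with open(src_path, 'rb') as file_in:
        pdf_merger.append(file_in, import_bookmarks=import_bookmarks)
        pdf_merger.addMetadata(metadata)
        with open(dst_path, 'wb') as file_out:
            pdf_merger.write(file_out)
```

A call such as edit_pdf_metadata('old.pdf', 'new.pdf', {'/Author': 'Someone', '/Title': 'PDF in Python'}) would then mirror the script above, with the with statements closing both files even if something fails along the way.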