python - Extract specific data from a .pdf and save it in an Excel file

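Every month I need to extract some data from .pdf files to build an Excel sheet.

I am able to convert the .pdf file to text, but I am not sure how to extract and save the specific information I want. Right now I have this code: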

# Written for Python 3 with the pdfminer.six package.
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(path):
    """Return the text of every page of the PDF at `path` as one string."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0      # 0 means no page limit
    caching = True
    pagenos = set()   # an empty set means all pages
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)

    # Read the buffer once, after the loop, so multi-page text is not duplicated.
    fstr = retstr.getvalue()
    device.close()
    retstr.close()
    return fstr

print(convert_pdf_to_txt("FA20150518.pdf"))

This is the result:

    >>> 
AVILA 72, VALLDOREIX
08197 SANT CUGAT DEL VALLES
(BARCELONA)
TELF: 935441851
NIF: B65512725
EMAIL: buendialogistica@gmail.com

JOSE LUIS MARTINEZ LOPEZ

AVDA. DEL ESLA, 33-D
24240 SANTA MARIA DEL PARAMO
LEON
TELF: 600871170

FECHA
17/06/15

FACTURA
  20150518

CLIENTE
43000335

N.I.F.

71548163 B

PÁG.

1

Nº VIAJE

RUTA

DESTINATARIO / REFERENCIA

KG

BULTOS

IMPORTE

2015064210-08/06/15

CERDANYOLA DEL VALLES -> VINAROS

FERRER ALIMENTACION - VINAROZ

2,000.0

1

         150,00

TOTAL IMP.

%

IMPORTE

BASE

         150,00

         150,00

%
 21,00

IVA

%

REC.

TOTAL FRA.

(€)

          31,50

         181,50

Eur

Forma Pago:
Banco:

CONTADO

Vencimientos:
17/06/15
181,50

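OK, so now I have the text returned by the function convert_pdf_to_txt.

I want to extract this information: customer, bill number, price, due date, and payment method.

The customer name always appears on the line below "EMAIL: buendialogistica@gmail.com".

The bill number always appears on the line below "FACTURA".

The price always appears two lines below "Vencimientos:".

The due date always appears on the line below "Vencimientos:".

The payment method always appears below "Banco:".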
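The five rules above share one shape: find a marker line, then take the line a fixed number of non-empty lines below it. A minimal sketch of that idea follows, assuming blank lines are dropped before counting. The names `RULES` and `extract_fields`, and the shortened `sample` text, are hypothetical and only for illustration.

```python
# Each rule maps a field name to (marker text, number of non-empty lines below it).
# The field names and offsets are assumptions based on the sample invoice.
RULES = {
    "customer": ("EMAIL:", 1),
    "bill_number": ("FACTURA", 1),
    "due_date": ("Vencimientos:", 1),
    "price": ("Vencimientos:", 2),
    "payment_method": ("Banco:", 1),
}


def extract_fields(text, rules=RULES):
    """Return a dict of field -> value found below each marker line."""
    # Drop blank lines and surrounding whitespace so the offsets stay stable.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    result = {}
    for field, (marker, offset) in rules.items():
        for i, line in enumerate(lines):
            if line.startswith(marker) and i + offset < len(lines):
                result[field] = lines[i + offset]
                break  # only the first occurrence of the marker is used
    return result


# Shortened stand-in for the text extracted from the invoice.
sample = """EMAIL: buendialogistica@gmail.com

JOSE LUIS MARTINEZ LOPEZ

FACTURA
  20150518

Forma Pago:
Banco:

CONTADO

Vencimientos:
17/06/15
181,50
"""

fields = extract_fields(sample)
print(fields)
```

Keeping the rules in one table means a changed invoice layout only requires editing `RULES`, not the search loop.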

I am thinking of doing something like this. If I could convert the text into a list, I could do the following:

Search for the customer:

 i = 0
 lengthlist = len(listitem)
 while i < lengthlist:
     if listitem[i] == "EMAIL: buendialogistica@gmail.com":
         Customer = listitem[i + 1]   # the customer name is on the next line
         i = lengthlist               # stop the loop
     else:
         i += 1

Search for the bill number:

 i = 0
 lengthlist = len(listitem)
 while i < lengthlist:
     if listitem[i] == "FACTURA":
         BillNumber = listitem[i + 1]   # the bill number is on the next line
         i = lengthlist                 # stop the loop
     else:
         i += 1

After that, I don't know how to save the data in Excel, but I'm sure I can find examples on the forum. First I need to extract just this data.

Best answer: You have the right idea.

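string = convert_pdf_to_txt("FA20150518.pdf")
# Drop blank lines and strip padding so the "line below" offsets are predictable.
lines = [line.strip() for line in string.split('\n') if line.strip()]
custData = {}
for i in range(len(lines)):
    if 'EMAIL:' in lines[i]:
        custData['Name'] = lines[i+1]          # customer name follows the e-mail line
    elif 'FACTURA' in lines[i]:
        custData['BillNumber'] = lines[i+1]
    elif 'Vencimientos:' in lines[i]:
        custData['dueDate'] = lines[i+1]       # due date is one line below
        custData['price'] = lines[i+2]         # total is two lines below
    elif 'Banco:' in lines[i]:
        custData['paymentType'] = lines[i+1]
print(custData)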
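
The question also asks about getting the values into Excel, which the code above does not cover. One dependency-free option is to append each invoice as a row of a CSV file, which Excel can open. This is a swapped-in technique, not a native .xlsx writer. The sketch below assumes a dictionary shaped like `custData`. The function name `append_invoice_row`, the file name `invoices.csv`, and the `;` delimiter (chosen because the amounts contain commas) are assumptions.

```python
import csv
import os


def append_invoice_row(cust_data, csv_path="invoices.csv"):
    """Append one invoice as a row, writing a header row if the file is new."""
    is_new_file = not os.path.exists(csv_path)
    # Columns follow the dictionary's own keys, in insertion order.
    fieldnames = list(cust_data)
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter=";")
        if is_new_file:
            writer.writeheader()
        writer.writerow(cust_data)


# Example values taken from the sample invoice in the question.
example = {
    "Name": "JOSE LUIS MARTINEZ LOPEZ",
    "BillNumber": "20150518",
    "price": "181,50",
    "paymentType": "CONTADO",
}
append_invoice_row(example)
```

Running this once per PDF each month would build up one row per invoice. If a real .xlsx file is required, a third-party library would be needed instead of the `csv` module.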