python – Problems generating a pandas dataframe column from a regular expression?

I'm working with many .txt files spread across a directory. From all of these files, how should I extract specific words or chunks of text (i.e. sentences, paragraphs, and tokens defined by a regular expression) and place them into a pandas dataframe (i.e. tabular format), keeping a column with the name of each file? So far I've created this function to do the task (I know... it isn't perfect):

In:

import glob, os, re
import pandas as pd
regex = r'\<the regex>\b'
ind = 'path/dir'
out = 'path/dir'
f ='path/redirected/output/'


def foo(ind, reg, out):
    for filename in glob.glob(os.path.join(ind, '*.txt')):
        with open(filename, 'r') as file:
            stuff = re.findall(reg, file.read(), re.M)
            #my_list = [str([j.split()[0] for j in i]) for i in stuff]

            lis = [t[::2] for t in stuff]  # keep every other element of each match
            cont = ' '.join(map(str, lis))
            print(cont)
            with open(out, 'a') as f:
                print(filename.split('/')[-1] + '\t' + cont, file=f)


foo(ind, regex, out)

The output is then redirected to a third file:

Out:

fileName1.txt       
fileName2.txt       stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk
fileName3.txt       stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk
....
fileNameN.txt       stringOrChunk

This is how I create the dataframe from the previous file (yes, I know it's ugly):

import pandas as pd
df = pd.read_csv('/path/of/f/', sep='\t', names=['file_names', 'col1'])
df.to_csv('/pathOfNewCSV.csv', index=False, sep='\t')

Finally:

    file_names  col1
0   fileName1.txt   NaN
1   fileName2.txt   stringOrChunk stringOrChunk stringOrChunk...
2   fileName3.txt   stringOrChunk stringOrChunk stringOrChunk...
3   fileName4.txt   stringOrChunk
.....
N   fileNameN.txt   stringOrChunk

So, any ideas on how to do this in a more pythonic and efficient way?
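For reference, the intermediate redirected file can be skipped entirely by collecting matches into a dict of filename → joined matches and building the DataFrame directly. A minimal sketch (the helper name `extract_to_df` is mine, not from the post, and it assumes one column of joined matches per file):

```python
import glob
import os
import re

import pandas as pd


def extract_to_df(in_dir, pattern):
    """Collect regex matches per .txt file and return a one-column DataFrame."""
    rows = {}
    for path in glob.glob(os.path.join(in_dir, '*.txt')):
        with open(path, 'r') as fh:
            matches = re.findall(pattern, fh.read(), re.M)
        # one row per file: file name -> space-joined matches ('' if none)
        rows[os.path.basename(path)] = ' '.join(matches)
    return pd.DataFrame(list(rows.items()), columns=['file_names', 'col1'])
```

Files with no matches then get an empty string rather than NaN, which avoids the round-trip through `read_csv`.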

UPDATE

I uploaded a .zip with some documents as data, so if we wanted to extract all the adverbs from the documents, we would do this:

a_regex = r"\w+ly"
directory = '/Users/user/Desktop/Docs/'
output_dir = '/Users/user/Desktop/'

foo(directory, a_regex, output_dir)

Then it should create a table with all the adverbs from the documents:

Files            words
doc1.txt    
doc2.txt    
doc3.txt     DIRECTLY PROBABLY EARLY 
doc4.txt    

Any ideas on how to enhance the function above? Also, I don't know whether this is the best way to carry out this information extraction task (i.e. using only regular expressions). What about using a string indexer like the Whoosh project, or nltk?

UPDATE

For example, consider creating a dataframe that extracts all the sentences containing the word JESUITS:

    Files   words1  words2  words3  words4
0   doc1.txt    A GOVERNMENT SPOKESMAN HAS ANNOUNCED THAT WITH...   NaN     NaN     NaN
1   doc2.txt    11/12/98 "THERE WAS NO TORTURE OR MISTREATMENT...   NaN     NaN     NaN
2   doc3.txt    WHAT WE HAD PREDICTED HAS OCCURRED. CRISTIANI ...   SO, THE QUESTION IS: WHO GAVE THE ORDER TO KIL...   THE MASSACRE OF THE JESUITS WAS NOT A PERSONAL...   LET US REMEMBER THAT AFTER THE MASSSACRE OF TH...
3   doc4.txt    IN 11/12/98 OUR VIEW, THE ASSASSINS OF THE JES...   THE ASSASSINATION OF THE JESUITS AGAIN CONFIRM...   NaN     NaN
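One way to get that wide, NaN-padded layout is to collect a list of matching sentences per file and let `DataFrame.from_dict(orient='index')` pad the ragged rows. A sketch under two assumptions of mine: sentences are naively split on `'.'`, and the helper name `sentences_with` is hypothetical:

```python
import glob
import os

import pandas as pd


def sentences_with(in_dir, keyword):
    """Return a wide DataFrame: one row per file, one column per matching sentence."""
    rows = {}
    for path in glob.glob(os.path.join(in_dir, '*.txt')):
        with open(path, 'r') as fh:
            text = fh.read()
        # naive sentence split on '.'; good enough for a sketch
        rows[os.path.basename(path)] = [s.strip() for s in text.split('.')
                                        if keyword in s]
    df = pd.DataFrame.from_dict(rows, orient='index')  # ragged rows -> NaN padding
    df.columns = ['words%d' % (i + 1) for i in range(df.shape[1])]
    df.index.name = 'Files'
    return df.reset_index()
```

Files with fewer matching sentences than the widest row end up with NaN in the trailing columns, matching the table above.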

Best answer: I'm not entirely sure I understand the question, but here is a best-effort snippet that tackles it with nltk.

from glob import glob
from os.path import join, split

import nltk
import pandas as pd

dir_name = '/tmp/stackovflw/Docs'
file_to_adverb_dict = {}
nltk_adverb_tags = {'RB', 'RBR', 'RBS'}  # taken from nltk.help.upenn_tagset()

for full_file_path in glob(join(dir_name, '*.txt')):
    with open(full_file_path, 'r') as f:  # text mode: word_tokenize expects str, not bytes
        _, file_name = split(full_file_path)
        tokens = nltk.word_tokenize(f.read().lower()) # lower -> seems that nltk behaves differently when the text is uppercase - try it...
        adverbs_in_file = [token for token, tag in nltk.pos_tag(tokens) if tag in nltk_adverb_tags]
        # consider using a "set" here to remove duplicates
        file_to_adverb_dict[file_name] = ' '.join(adverbs_in_file).upper()  #converting it back to uppercase (your input is all uppercase)

print(pd.DataFrame(list(file_to_adverb_dict.items()), columns=['file_names', 'col1']))
#   file_names                                               col1
# 0   doc4.txt  PROBABLY ABROAD ALFONSO HOWEVER ALWAYS ALREADY...
# 1   doc1.txt                                                NOT
# 2   doc3.txt  DIRECTLY NOT SO SOLELY NOT PROBABLY NOT EVEN N...
# 3   doc2.txt

One more thing worth noting: if you just want to find the words ending in "ly" in a particular folder, grep is your friend:

$ grep -o -i -E '\w+ly' *.txt
doc3.txt:DIRECTLY
doc3.txt:SOLELY
doc3.txt:PROBABLY
doc3.txt:EARLY
doc4.txt:PROBABLY

-o prints only the match instead of the whole line
-i ignores case
-E enables extended regular expressions

Use awk to reduce by file name:

$ grep -o -i -E '\w+ly' *.txt | awk -F':' '{a[$1]=a[$1] " "  $2}END{for( i in a ) print  i,"," a[i]}'
doc4.txt , PROBABLY
doc3.txt , DIRECTLY SOLELY PROBABLY EARLY