python – 通过Pandas DataFrame搜索子字符串的最有效方法是什么？

2019年8月4日 146次阅读

我有一个包含75k行文本的Pandas Dataframe(每行大约350个字符).我需要搜索该数据帧中45k子串列表的出现.

预期输出是authors_data dict,其中包含作者列表和出现次数.下面的代码假设我有一个dataframe [‘text’]列和一个名为authors_list的子字符串列表.

authors_data = {}
for author in authors_list:
    count = 0
    for i, row in df.iterrows():
         if author in row.text:
             count += 1
authors_data[author] = count
print(author, authors_data[author])

我做了一些初步测试,10位作者花了我大约50秒.完整的表格将花费我几天的时间来运行.所以我正在寻找更有效的方法来运行代码.

df.iterrows()足够快吗？我应该研究一下特定的库吗？

让我知道！

最佳答案我试过这个,它正在做你想要的.你可以测试一下它是否更快.

for author in authors_list:
            authors_data[author] = df['AUTHORCOL'].map(lambda x: author in x).sum()