python - Optimizing a DataFrame filter based on a word list

November 10, 2023

I have a word list: a Python list containing a large number of English words.

I also have a DataFrame that looks like this:

    FileName        PageNo     LineNo   GOODS_DESC  
1   17743633 - 1    TM000002    69      Abuj Cen Le 
31  17743633 - 1    TM000007    126     Mr USD  
33  17743633 - 1    TM000008    22      TABLEAU EMBALLAGE
34  17743633 - 1    TM000008    24      LISA e EMBALV
46  17743633 - 1    TM000008    143     Cen 
47  17743633 - 1    TM000008    146     A Gl
50  17743633 - 1    TM000009    121     Ppvv Tn Ppvv In 
51  17743633 - 1    TM000009    129     SPECIFY
52  17743633 - 1    TM000009    136     Decrp G 
58  17743633 - 1    TM000009    97      Je ugn  
60  17743633 - 1    TM000009    108     De Veel 
61  17743633 - 1    TM000014    44      TYRE CHIPS SHREDDED TYRES   
63  17743633 - 1    TM000014    48      TYRE CHIPS SHREDDED TYRES

I want to keep only those words in the 'GOODS_DESC' column that are present in the word list.

My desired output is:

    FileName        PageNo     LineNo   GOODS_DESC  
1   17743633 - 1    TM000002    69      NaN
31  17743633 - 1    TM000007    126     Mr USD  
33  17743633 - 1    TM000008    22      TABLEAU
34  17743633 - 1    TM000008    24      LISA  
46  17743633 - 1    TM000008    143     NaN 
47  17743633 - 1    TM000008    146     NaN
50  17743633 - 1    TM000009    121     NaN 
51  17743633 - 1    TM000009    129     SPECIFY
52  17743633 - 1    TM000009    136     NaN
58  17743633 - 1    TM000009    97      NaN 
60  17743633 - 1    TM000009    108     NaN
61  17743633 - 1    TM000014    44      TYRE CHIPS SHREDDED TYRES   
63  17743633 - 1    TM000014    48      TYRE CHIPS SHREDDED TYRES

My approach also produces the output, but it relies on lists and is very slow. I would like it to run faster.

for rows in df.itertuples():
    a = []
    flat_list = []
    a.append(rows.GOODS_DESC)
    flat_list = [item.strip() for sublist in a for item in sublist.split(' ') if item.strip()]
    flat_list = list(sorted(set(flat_list), key=flat_list.index))
    flat_list = [i for i in flat_list if i.lower() in word_list]
    if(not flat_list):
        df.drop(rows.Index,inplace=True)
        continue
    s=' '.join(flat_list)
    df.loc[rows.Index,'GOODS_DESC']=s

df['GOODS_DESC'] = df['GOODS_DESC'].str.upper()

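Best answer: Your logic seems overly complicated. You can use a single list comprehension inside pd.Series.apply. As shown below, I suggest using a set for O(1) lookup and str.casefold to match strings regardless of case.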

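import numpy as np
import pandas as pd

s = pd.Series(['Abuj Cen Le', 'Mr USD', 'TABLEAU EMBALLAGE', 'LISA e EMBALV'])

# Case-folded set: membership tests are O(1) on average
word_set = {i.casefold() for i in ['Mr', 'USD', 'TABLEAU', 'LISA']}

def apply_filter(x):
    # Keep only the words whose case-folded form is in word_set
    out = ' '.join([i for i in x.split() if i.casefold() in word_set])
    # Return NaN when no word survives the filter
    return out if out else np.nan

res = s.apply(apply_filter)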

print(res)

0        NaN
1     Mr USD
2    TABLEAU
3       LISA
dtype: object
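
To connect this back to the DataFrame in the question, the same filter can be applied directly to the 'GOODS_DESC' column. The following is only a sketch under stated assumptions: the small `df` and `word_list` are stand-ins for the real data, the helper name `keep_known_words` is made up for illustration, and every value in the column is assumed to be a string.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): the real df and word_list come from the question.
df = pd.DataFrame({
    "LineNo": [69, 126, 22, 24],
    "GOODS_DESC": ["Abuj Cen Le", "Mr USD", "TABLEAU EMBALLAGE", "LISA e EMBALV"],
})
word_list = ["mr", "usd", "tableau", "lisa"]

# Build the set once, outside the per-row function.
word_set = {w.casefold() for w in word_list}

def keep_known_words(text):
    """Keep only the words found in word_set; return NaN if none remain."""
    kept = [w for w in text.split() if w.casefold() in word_set]
    return " ".join(kept) if kept else np.nan

# Overwrite the column in one pass instead of updating row by row with df.loc.
df["GOODS_DESC"] = df["GOODS_DESC"].apply(keep_known_words)

# Optional: mirror the question's loop, which drops rows with no match
# and upper-cases what is left.
filtered = df.dropna(subset=["GOODS_DESC"]).copy()
filtered["GOODS_DESC"] = filtered["GOODS_DESC"].str.upper()
```

Here `df` keeps NaN for rows with no matching words, as in the desired output, while `filtered` mimics the drop-and-upper-case behaviour of the original loop. Unlike that loop, this sketch does not remove repeated words within a row.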