Counting Word Frequencies in TOEFL Essays with Python

May 19, 2019 · Source: 求愚

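With AI getting so much attention, my curiosity was piqued. After reading a few articles I learned that Python is a language well suited to AI programming, so I set out to level up in it.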

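Along the way I found that Python offers a rich set of libraries for data analysis. I also happened to be preparing for the TOEFL writing section for work. So when Python met TOEFL, a happy encounter followed.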

Goal

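After finishing 5 TOEFL essays, I wanted to find out which words I use most often, so that I could learn their synonyms, expand my vocabulary, and swap words freely in my essays.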

Approach

Read the files with Python
Count the word frequencies of each essay
Merge the word frequencies of the 5 essays
Output the 10 most frequent words

Implementation

STEP 1:

[Image: exporting the essays]

I write in Evernote, which can export notes as HTML files. After exporting, I renamed the files to make them easier to read from code.

[Image: renaming the files]

STEP 2:

Inspecting the HTML files showed that the essay text lives inside <body>. A quick search turned up the BeautifulSoup library, which helps with processing HTML.

So:

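from bs4 import BeautifulSoup


def filter_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Take the text of <body> only, so that the <title> (the essay title)
    # does not skew the counts
    text = soup.body.get_text()
    return text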
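
To see why reading only <body> keeps the title out of the counts, here is a minimal, dependency-free sketch of the same kind of extraction using the standard library's html.parser instead of BeautifulSoup. The class BodyTextExtractor, the helper body_text, and the sample HTML are made up for illustration; this is a swapped-in technique, not how BeautifulSoup works internally.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects the text that appears between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Text outside <body>, such as the <title>, is ignored.
        if self.in_body:
            self.chunks.append(data)


def body_text(html):
    # Hypothetical helper: feed the markup and join the collected pieces.
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><head><title>Essay title</title></head>"
    "<body><div>Hello world.</div><div>Hello again.</div></body></html>"
)
print(body_text(sample))
```

The title text never reaches the output because it is seen while in_body is still False.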

STEP 3:
Next, I need to count how many times each word appears in one essay. This mainly relies on re and collections.Counter, both part of the Python standard library.

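import collections


def calculate_words_frequency(file):
    # Read the exported file (assumes the export is UTF-8 encoded)
    with open(file, encoding='utf-8') as f:
        # Strip the HTML markup and keep only the body text
        text = filter_html(f)

    # Lowercase the text and split it into words on whitespace
    line_box = text.strip().lower().split()
    word_box = []

    # Remove punctuation from any token that is not purely alphabetic
    for word in line_box:
        if not word.isalpha():
            word = filter_puctuation(word)
        if word:  # skip tokens that consisted only of punctuation
            word_box.append(word)

    # Count the words, then drop the common ones
    return fileter_simple_words(collections.Counter(word_box))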

A word on filter_puctuation(). When I first printed the frequencies, I noticed that, because of punctuation, many words had a trailing . , or ? attached.

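To keep punctuation from distorting the counts, I used a simple regular expression to strip it. (I am not great with regex; this was enough for my tests, and there is probably a simpler and more thorough way to write it.)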

import re

# Strip a trailing , . ? or " and a leading " from a word
def filter_puctuation(word):
    return re.sub(r'^"|[,.?"]$', '', word)
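
As one possible simpler and more thorough variant, the sketch below strips any run of ASCII punctuation from both ends of a token using str.strip and string.punctuation from the standard library. The name strip_punctuation is hypothetical, and this is an alternative to the regex above rather than a drop-in copy of it: it removes more characters (for example ! and parentheses), and it does not handle non-ASCII punctuation such as curly quotes.

```python
import string


def strip_punctuation(word):
    # Remove any run of ASCII punctuation from both ends of the token,
    # leaving inner characters such as the apostrophe in "don't" alone.
    return word.strip(string.punctuation)


for token in ['"hello,', 'world."', "don't", 'why?!']:
    print(repr(token), '->', repr(strip_punctuation(token)))
```

Because the whole run at each end is stripped, a token such as world." comes out clean in one pass, whereas the regex removes only the last character.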

STEP 4:

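While testing, I found that the top-ranked words were all prepositions, pronouns, conjunctions and other common words, such as he, and, that. These are not what I am after, so the common simple words are removed from the counts before ranking. (I typed a few by hand; there are surely more complete lists online.)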

def fileter_simple_words(words):
    # Words to filter out
    simple_words = ['the', 'a', 'an', 'to', 'is',
                    'am', 'are', 'that', 'which',
                    'i', 'you', 'he', 'she', 'they',
                    'it', 'of', 'for', 'have', 'has',
                    'their', 'my', 'your', 'will', 'all',
                    'but', 'while', 'with', 'only', 'more',
                    'who', 'should', 'there', 'can', 'might',
                    'could', 'may', 'be', 'on', 'at',
                    'after', 'most', 'even', 'and', 'in',
                    'best', 'better', 'as', 'no', 'ever',
                    'me', 'not', 'his', 'her'
                    ]

    # words is a collections.Counter; iterate over a copy of its keys
    # so that entries can be deleted safely during the loop
    for word in list(words):
        if word in simple_words:
            del words[word]

    return words
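
The same filtering can also be written without deleting keys in place. The sketch below builds a new Counter from the entries to keep and uses a set for faster membership tests. The names STOP_WORDS and drop_stop_words are hypothetical, and the short word set is only a stand-in for the full list above.

```python
import collections

# A tiny illustrative stop-word set; the full list would go here.
STOP_WORDS = {'the', 'a', 'and', 'to', 'is'}


def drop_stop_words(counts, stop_words=STOP_WORDS):
    # Build a new Counter containing only the words we want to keep;
    # the Counter passed in is left unchanged.
    return collections.Counter(
        {word: n for word, n in counts.items() if word not in stop_words}
    )


counts = collections.Counter('the cat and the dog chase the ball'.split())
filtered = drop_stop_words(counts)
print(filtered)
```

Returning a new Counter leaves the original counts available, which can help when experimenting with different stop-word lists.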

STEP 5:
Almost done. With the word frequencies of one essay counted, I need to sum the frequencies across the 5 essays. Since Counter objects support addition:

def multiple_file_frequency(files):
    total_counter = collections.Counter()
    for file in files:
        total_counter += calculate_words_frequency(file)
    return total_counter
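
As a small illustration of what adding Counter objects does, the sketch below merges two hand-made counters. The variable names and the numbers are invented for the example.

```python
import collections

essay_1 = collections.Counter({'student': 3, 'campus': 1})
essay_2 = collections.Counter({'student': 2, 'teacher': 4})

# Adding two Counters sums the counts of words they share and keeps
# the words that appear in only one of them.
total = essay_1 + essay_2
print(total['student'])                   # 5
print(total['campus'], total['teacher'])  # 1 4
```

Starting from an empty collections.Counter() and using += in a loop, as in the function above, accumulates the totals in the same way.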

STEP 6:
With the totals in hand, I want to know which 10 words are the most frequent.

def most_common_words(files, number):
    total_counter = multiple_file_frequency(files)
    return total_counter.most_common(number)
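
For reference, most_common(n) returns a list of (word, count) pairs sorted from the most to the least frequent. The sketch below uses invented counts to show the shape of that result and how zip(*...) splits it into separate labels and values.

```python
import collections

counts = collections.Counter({'student': 5, 'teacher': 4, 'campus': 1})

top_two = counts.most_common(2)
print(top_two)  # [('student', 5), ('teacher', 4)]

# zip(*pairs) transposes the list of pairs into two tuples.
labels, values = zip(*top_two)
print(labels)   # ('student', 'teacher')
print(values)   # (5, 4)
```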

STEP 7:
Finally, I use matplotlib, a Python plotting library, to turn the results into a bar chart.

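import matplotlib.pyplot as plt
import numpy as np


def draw_figures(figures):
    # figures is a list of (word, count) pairs, e.g. from most_common()
    labels, values = zip(*figures)
    indexes = np.arange(len(labels))
    width = 0.5
    plt.bar(indexes, values, width)
    plt.xticks(indexes, labels)
    plt.show()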

[Image: final results]