Comparing the Similarity of Two Text Files with Python

February 17, 2023. Source: 郝伟博士 (Dr. Hao Wei)

Preface

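This article uses Python to compare two text files, with the aim of checking student homework for plagiarism. With 70 students in total, comparing every pair of submissions by hand would be far too much work, so I wrote this program. The basic idea is to take each line of the first file, check whether it also appears in the second file, count the lines that do, and divide that count by the larger of the two files' line counts.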
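To make the workload concrete, the number of unordered pairs among 70 submissions can be counted with the standard library. This is only an illustrative sketch; the name `student_ids` is hypothetical and does not appear in the original program.

```python
from itertools import combinations

# Hypothetical student numbers 1..70, used only to count the pairs.
student_ids = range(1, 71)

# Every unordered pair of distinct students is one comparison.
pairs = list(combinations(student_ids, 2))
print(len(pairs))  # 70 * 69 / 2 = 2415
```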

Basic Principle

Let lines1 and lines2 be the lists of all text lines in the two input files.

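count = 0
for line in lines1:                # iterate over every line of file 1
    if lines2.count(line) > 0:     # file 2 also contains this line
        count += 1                 # one more matching line

# The result is the number of matching lines divided by the line count of the longer file.
# It reaches its maximum of 1 when both files have the same length and every line matches.
result = count / max(len(lines1), len(lines2))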
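The measure can be tried on small in-memory lists before pointing it at real files. The following is a minimal sketch; the function name `line_similarity` and the sample lines are made up for illustration. It also shows that the measure is not symmetric in general: because it walks the lines of the first list, repeated lines can make the two directions differ.

```python
def line_similarity(lines1, lines2):
    """Share of lines in lines1 that also occur in lines2, relative to the longer list."""
    longest = max(len(lines1), len(lines2))
    if longest == 0:          # two empty inputs: nothing to compare
        return 0.0
    count = sum(1 for line in lines1 if line in lines2)
    return count / longest

a = ["import os", "x = 1", "print(x)", "print('done')"]
b = ["import os", "y = 1", "print(x)"]

print(line_similarity(a, b))  # 2 matching lines / 4 lines = 0.5
print(line_similarity(b, a))  # 2 matching lines / 4 lines = 0.5

# Repeated lines make the two directions differ:
c = ["pass", "pass", "pass"]
d = ["pass"]
print(line_similarity(c, d))  # 3 / 3 = 1.0
print(line_similarity(d, c))  # 1 / 3
```

Boilerplate lines that every submission shares (imports, blank lines, closing braces) count as matches too, so short files built mostly from such lines can score high without any copying.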

Test Results

Below are the results computed for the students' homework (the homework itself is confidential; readers can test with other text files).

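Similarity between assignments 25 and 27: 73.77%
Similarity between assignments 25 and 33: 77.05%
Similarity between assignments 25 and 34: 88.52%
Similarity between assignments 26 and 42: 98.21%
Similarity between assignments 27 and 33: 89.83%
Similarity between assignments 27 and 34: 76.27%
Similarity between assignments 29 and 30: 73.17%
Similarity between assignments 33 and 34: 80.70%
Similarity between assignments 41 and 45: 97.22%
Similarity between assignments 43 and 45: 72.22%
Similarity between assignments 49 and 50: 85.29%
Similarity between assignments 53 and 59: 91.67%
Similarity between assignments 58 and 69: 92.59%
Similarity between assignments 61 and 63: 72.73%
Similarity between assignments 62 and 68: 80.00%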

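The program only prints pairs whose similarity is above 70%. These assignments are clearly under strong suspicion of plagiarism, and after inspecting the matching content I confirmed that copying had taken place in every case. A little further analysis also reveals how the copied assignments relate to each other: for example, 25, 27, 33, and 34 probably copied from the same person, 29 and 30 share a single source, and so on.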
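The grouping described above can also be derived mechanically by treating each reported pair as an edge and collecting the connected components. The sketch below uses a small union-find structure; the function name `group_pairs` is hypothetical, and the pairs are copied from the result list above. A group only says that its members are linked through a chain of similar pairs; it does not say who copied from whom.

```python
# Pairs reported above as more than 70% similar.
pairs = [(25, 27), (25, 33), (25, 34), (26, 42), (27, 33), (27, 34),
         (29, 30), (33, 34), (41, 45), (43, 45), (49, 50), (53, 59),
         (58, 69), (61, 63), (62, 68)]

def group_pairs(pairs):
    """Merge pairs that share a member into groups (connected components)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a        # join the two groups

    groups = {}
    for member in parent:
        groups.setdefault(find(member), []).append(member)
    return sorted(sorted(g) for g in groups.values())

for group in group_pairs(pairs):
    print(group)
```

With the pairs listed here, 25, 27, 33, and 34 end up in one group, and 41, 43, and 45 in another, even though 41 and 43 were never reported as a pair themselves.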

Source Code

# coding: utf-8
import os

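def readLines(filepath):
    """Read all lines of a text file, trying UTF-8 first and falling back to GBK."""
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            return f.readlines()
    except UnicodeDecodeError:
        # Not valid UTF-8; many Chinese Windows editors save files as GBK.
        with open(filepath, 'r', encoding='gbk') as f:
            return f.readlines()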
    
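def compare(file1, file2):
    """Return the number of lines of file1 found in file2, divided by the longer file's line count."""
    lines1 = readLines(file1)
    lines2 = readLines(file2)

    longest = max(len(lines1), len(lines2))
    if longest == 0:  # both files are empty; avoid dividing by zero
        return 0.0

    count = 0
    for line in lines1:
        if line in lines2:
            count += 1
    return count / longest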
    
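def label(filepath):
    # File names are assumed to start with the student number followed by a space.
    return os.path.basename(filepath).split(" ")[0]

path = "第二次作业"  # input directory ("second assignment"); adjust to your setup
files = [os.path.join(path, name) for name in os.listdir(path)]
error_files = []  # files from pairs whose comparison raised an error

# Compare every unordered pair of files exactly once.
for i in range(len(files)):
    for j in range(i + 1, len(files)):
        try:
            degree = compare(files[i], files[j])
            if degree > 0.7:
                print("Similarity between assignments {} and {}: {:.2%}".format(label(files[i]), label(files[j]), degree))
        except Exception:
            if files[j] not in error_files:
                error_files.append(files[j])
            continue

if error_files:
    print("Could not compare:", ", ".join(error_files))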
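The comparison above checks each line of the first file against the full list of lines from the second file, and it compares the raw text, so a difference in indentation or line endings hides an otherwise identical line. One possible variation, which is not the author's code, strips surrounding whitespace, skips blank lines, and looks lines up in a set. The names `normalize` and `compare_lines_fast` are hypothetical.

```python
def normalize(lines):
    """Strip surrounding whitespace and drop blank lines."""
    return [line.strip() for line in lines if line.strip()]

def compare_lines_fast(lines1, lines2):
    """Ratio of matching lines, computed on normalized lines with a set lookup."""
    lines1 = normalize(lines1)
    lines2 = normalize(lines2)
    longest = max(len(lines1), len(lines2))
    if longest == 0:
        return 0.0
    seen = set(lines2)  # constant-time membership tests on average
    count = sum(1 for line in lines1 if line in seen)
    return count / longest

first = ["a = 1\n", "\n", "  b = 2\r\n", "print(a + b)\n"]
second = ["a = 1\n", "b = 2\n", "print(a - b)\n"]
print("{:.2%}".format(compare_lines_fast(first, second)))  # 2 of 3 lines match: 66.67%
```

Because normalization changes which lines count as equal, scores from this variant are not directly comparable with the results listed earlier, and the 70% threshold may need to be revisited.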

Original author: 郝伟博士 (Dr. Hao Wei)
Original article: https://blog.csdn.net/weixin_43145361/article/details/104758247