Python实现文本过滤去重

2024年1月17日 239次阅读来源: Donald_32e5

背景

爬虫会获取大量的数据，为不浪费资源，过滤重复数据就很有必要了

算法简介

1、余弦相似性
· 通过两个向量的夹角余弦值来评估相似度，应用较广泛
2、欧几里得距离
· 用使用较为广泛的公式来评估相似度
3、简单共有词
· 比较基础的文本过滤算法。使用两篇文档共有的字符，除以较长的那一篇文章来评估相似度
4、编辑距离
· 通过计算两字符串之间，由一个转成另外一个所需编辑的次数来评估它们的相似度，又称Levenshtein距离
5、Simhash+汉明距离
· 通过对比两篇文章hash串的汉明距离来评估相似度，多用于文档层面去重

算法详情

I、余弦相似性

具体的公式就不贴出来了，外面一大堆，主要记录一下Python的实现方式

1、首先进行分词，并列出两个句子的并集
2、计算词频的向量
4、计算余弦值，余弦值越大，证明夹角越小，两个向量越相似
5、代码实现，也就是TF-IDF算法

  def words2vec(words1=None, words2=None):
        """
        通过分词、计算并集、词频，得出词频向量
        :param words1:
        :param words2:
        :return: 词频的向量
        """
        v1 = []
        v2 = []
        tag1 = jieba.analyse.extract_tags(words1, withWeight=True)
        tag2 = jieba.analyse.extract_tags(words2, withWeight=True)
        tag_dict1 = {i[0]: i[1] for i in tag1}
        tag_dict2 = {i[0]: i[1] for i in tag2}
        merged_tag = set(tag_dict1.keys()) | set(tag_dict2.keys())
        for i in merged_tag:
            if i in tag_dict1:
                v1.append(tag_dict1[i])
            else:
                v1.append(0)
            if i in tag_dict2:
                v2.append(tag_dict2[i])
            else:
                v2.append(0)
        return v1, v2

6、使用jieba分词计算出语句的向量
7、根据向量计算余弦值

    def cosine_similarity(vector1, vector2):
        """
        余弦值相似度计算
        :param vector1:
        :param vector2:
        :return:
        """

        dot_product = 0.0
        norm1 = 0.0
        norm2 = 0.0
        for a, b in zip(vector1, vector2):
            dot_product += a * b
            norm1 += a ** 2
            norm2 += b ** 2
        if norm1 == 0.0 or norm2 == 0.0:
            return 0
        else:
            return round(dot_product / ((norm1 ** 0.5) * (norm2 ** 0.5)) * 100, 2)

II、欧几里得距离

1、首先也需要获得比较对象的向量，代码可以和上面的通用
2、根据向量，和欧几里得公式，计算出距离，使用1/(1+距离)算得相似度
3、代码实现

    def euclid_similarity(self, vector1, vector2):
        """
        欧几里得相似度算法
        :return:
        """
        v1, v2 = self.words2vec(vector1, vector2)
        similarity = 0.0
        for a, b in zip(v1, v2):
            d1 = np.sqrt(np.sum(np.square(a - b)))
            similarity = d1 if d1 > similarity else similarity
        print(1/(1+similarity))

III、编辑距离

1、Levenshtein有第三方包，可以通过pip install python-Levenshtein
2、计算相似度，使用1-(距离/较长的语句)计算得出相似度
3、代码实现

import Levenshtein
aa = Levenshtein(a, b)
max_len = len(a) if len(a) > len(b) else len(b)
print(1- aa/max_len)

IV、Simhash+汉明距离

1、谷歌的引擎爬虫每天获取的海量数据，就是用的Simhash来进行过滤的
2、他的基本原理是把两篇文档转换成字节，不过不是用普通的hash来进行转换
3、Simhash是一种局部敏感hash，所谓的局部敏感，就是在两个对比对象在hash后，任能保持之前的相似性
4、被hash之后的hash串的长度是相同的，通过汉明距的比较，就可以判断相似度了
5、代码实现

class Simhash:
    """经过测试和查找资料发现，Simhash更加适合文档去重，文本越短，精度缺失越厉害"""

    def __init__(self, content):
        self.simhash = self.simhash(content)

    def __str__(self):
        return str(self.simhash)

    def simhash(self, content):
        jieba.load_userdict('userdict.text')
        seg = jieba.cut(content)
        jieba.analyse.set_stop_words('testtimi.text')
        key_word = jieba.analyse.extract_tags(
            '|'.join(seg), topK=20, withWeight=True, allowPOS=())  # 在这里对jieba的tfidf.py进行了修改
        # 将tags = sorted(freq.items(), key=itemgetter(1), reverse=True)修改成tags = sorted(freq.items(), key=itemgetter(
        # 1,0), reverse=True) 即先按照权重排序，再按照词排序
        key_list = []
        # print(key_word)
        for feature, weight in key_word:
            weight = int(weight * 20)
            feature = self.string_hash(feature)
            temp = []
            for i in feature:
                if i == '1':
                    temp.append(weight)
                else:
                    temp.append(-weight)
            # print(temp)
            key_list.append(temp)
        list1 = np.sum(np.array(key_list), axis=0)
        # print(list1)
        if not key_list:  # 编码读不出来
            return '00'
        simhash = ''
        for i in list1:
            if i > 0:
                simhash = simhash + '1'
            else:
                simhash = simhash + '0'
        return simhash

    @staticmethod
    def string_hash(source):
        if source == "":
            return 0
        else:
            x = ord(source[0]) << 7
            m = 1000003
            mask = 2 ** 128 - 1
            for c in source:
                x = ((x * m) ^ ord(c)) & mask
            x ^= len(source)
            if x == -1:
                x = -2
            x = bin(x).replace('0b', '').zfill(64)[-64:]
            # print(source, x)

            return str(x)

        # 以下是使用系统自带hash生成，虽然每次相同的会生成的一样，
        # 不过，对于不同的汉子产生的二进制，在计算海明码的距离会不一样，
        # 即每次产生的海明距离不一致
        # 所以不建议使用。

        # x=str(bin(hash(source)).replace('0b','').replace('-','').zfill(64)[-64:])
        # print(source,x,len(x))
        # return x

    def hammingDis(self, com):
        t1 = '0b' + self.simhash
        t2 = '0b' + com.simhash
        n = int(t1, 2) ^ int(t2, 2)
        i = 0
        while n:
            n &= (n - 1)
            i += 1
        return i

6、最后的计算汉明距，也可以使用Levenshtein的hamming(str1,str2)方法
7、如果汉明距小于3，基本就说明文本的相似度很高

总结

1、如果需要比较的是短文本，余弦相似性是最好的选择，可以根据具体的业务，来判断使用哪种算法
2、Simhash在文档比较中表现强劲，但是随着比较字符的越来越短，它的精度就流失的越厉害

    原文作者：Donald_32e5
    原文地址: https://www.jianshu.com/p/fb104dd9dac4
    本文转自网络文章，转载此文章仅为分享知识，如有侵权，请联系博主进行删除。