How to Extract Chinese Strings from Code

May 7, 2024. Source: uncle_gy

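Preface

Ideally, the Chinese strings in a program live in a separate file (JSON, for example) and are read at run time. In practice, though, they are usually written directly into the code. Extracting them then means hunting them down one by one, or relying on the string search that the IDE provides. IDE search works, but pulling the strings out still requires clicking through every match, which is tedious. This article shows how to extract Chinese strings with code instead.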

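Regular Expression

Chinese strings can be extracted with a regular expression.

The pattern is [\u4e00-\u9fa5], which matches a single character in the range U+4E00 to U+9FA5 (the commonly used CJK ideographs).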
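
As a quick illustration, the character class can be followed by `+` to pull out runs of consecutive Chinese characters. The sample string and the names `CHINESE_RUN` and `find_chinese_runs` are made up for this sketch. Note that the range covers ideographs only, so full-width punctuation such as `，` splits a run.

```python
import re

# Hypothetical name: one or more consecutive characters in U+4E00..U+9FA5.
CHINESE_RUN = re.compile(r'[\u4e00-\u9fa5]+')

def find_chinese_runs(text):
    """Return every run of consecutive Chinese characters found in text."""
    return CHINESE_RUN.findall(text)

# Made-up sample line mixing code, a comment, and Chinese text.
sample = 'alert("保存成功"); // done, 请稍候'
print(find_chinese_runs(sample))
```

The pattern on its own does not know about quotes or comments, which is why the script below adds extra steps around it.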

Extracting Chinese Strings with Python

Here is the code:

Environment

Python version: 3.7

IDE: Spyder

Code


# -*- coding: utf-8 -*-
import os
import re
list1=[]
def file_name(file_dir):
    for root,dirs,files in os.walk(file_dir):
        getChineseStrings(root,files)
        
def getChineseStrings(root,files):
    for f in files:
        if f.endswith('.js'):
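            # Build the full path with os.path.join; root+f would omit the path separator
            path=os.path.join(root,f)
            print(path)
            getNoRepeatList(path,list1)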
            with open(os.path.join(root,f),encoding='UTF-8') as lines:
                for line in lines:
                    line=str(line)
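                    # Remove // comments
                    line =re.sub(r'//.*$','',line)
                    # Remove inline /* ... */ comments
                    line =re.sub(r'/\*.*\*/','',line)
                    # Remove an opening /* and everything to its right
                    line =re.sub(r'/\*.*$','',line)
                    # Remove a closing */ and everything to its left
                    line =re.sub(r'.*\*/','',line)
                    # Remove * and everything after it (continuation lines of a block comment)
                    line =re.sub(r'\*.*$','',line)
                    # Find Chinese strings between double quotes
                    findPart(u"\".*[\u4e00-\u9fa5]+.*\"",line)
                    # Remove the double-quoted Chinese strings
                    line =re.sub(u"\".*[\u4e00-\u9fa5]+.*\"",'',line)
                    # Find Chinese strings between single quotes
                    findPart(u"\'.*[\u4e00-\u9fa5]+.*\'",line)
                    # Remove the single-quoted Chinese strings
                    line =re.sub(u"\'.*[\u4e00-\u9fa5]+.*\'",'',line)
                    # Find Chinese strings between > and <
                    findPart(u">.*[\u4e00-\u9fa5]+.*<",line)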
                    
def findPart(regex,text):
    # Split every match on its delimiter and keep the pieces that contain Chinese
    for r in re.findall(regex,text):
        if '"' in r:
            result = r.split('"')
        elif "'" in r:
            result = r.split("'")
        else:
            result = re.split(r">|<", r)
        for i in result:
            if re.search(u'[\u4e00-\u9fa5]',i):
                print(i)
                getNoRepeatList(i,list1)
def getNoRepeatList(i,lists):
    if i not in lists:
        lists.append(i)
if __name__=='__main__':
    file_name('paste the path of the files here')
    print("==========================================================================================this is no repeating list")
    for i in list1:
        print(i)

Code Walkthrough

The first function:
file_name
This function walks through the files in a directory tree.
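
The traversal can also be written as a small generator, sketched below. The name `iter_source_files` and its `suffix` parameter are hypothetical and not part of the original script. The sketch joins paths with `os.path.join`, so the directory and file name are separated correctly.

```python
import os

def iter_source_files(file_dir, suffix='.js'):
    """Yield the full path of every file under file_dir whose name ends with suffix."""
    for root, dirs, files in os.walk(file_dir):
        for name in files:
            if name.endswith(suffix):
                # os.path.join inserts the separator between directory and file name.
                yield os.path.join(root, name)
```

A caller would loop over it, for example `for path in iter_source_files('some/dir'): print(path)`, where `some/dir` stands in for the real project path.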

The second function:
getChineseStrings
This function extracts the Chinese strings and is the core of the whole script.

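Taking JavaScript as an example, comments come in three main forms.

The first form
//This is a comment

The second form

/*This is a comment*/

The third form

/*
 *This is a comment
 */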
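
The third form can span several lines, which per-line regular expressions can only approximate. One alternative, sketched here on the assumption that the whole file is read into memory first, is to strip block comments from the full text before splitting it into lines. The names `BLOCK_COMMENT` and `strip_block_comments` are hypothetical, and the sketch does not recognize comment markers that appear inside string literals.

```python
import re

# Lazy match from /* to the nearest */; DOTALL lets '.' cross newlines.
BLOCK_COMMENT = re.compile(r'/\*.*?\*/', re.DOTALL)

def strip_block_comments(source):
    """Remove /* ... */ comments, keeping newlines so line numbers stay the same."""
    def keep_newlines(match):
        return '\n' * match.group(0).count('\n')
    return BLOCK_COMMENT.sub(keep_newlines, source)

# Made-up JavaScript snippet with an inline and a multi-line block comment.
js = 'var a = "标题"; /* 行内注释 */\n/*\n * 多行注释\n */\nvar b = "内容";\n'
print(strip_block_comments(js))
```

Keeping the newlines means any line number reported later still points at the right place in the original file.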

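To extract the Chinese strings, we first need to remove the strings inside comments. Taking // as an example:
line =re.sub(r'//.*$','',line)
This replaces everything from // to the end of each line with '', so the later extraction steps automatically skip any strings in those comments.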
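
The effect of this step can be seen on a single line. The sketch below strips the // comment and then collects the quoted strings that contain Chinese. The names `CHINESE`, `QUOTED`, and `extract_from_line` are hypothetical. Unlike the script above, it matches each quoted string separately with a negated character class instead of a greedy `.*`.

```python
import re

CHINESE = re.compile(r'[\u4e00-\u9fa5]')
# A negated character class stops at the nearest closing quote,
# so each string on the line is matched separately.
QUOTED = re.compile(r'"([^"]*)"|\'([^\']*)\'')

def extract_from_line(line):
    """Drop a trailing // comment, then return quoted strings that contain Chinese."""
    code = re.sub(r'//.*$', '', line)
    found = []
    for double, single in QUOTED.findall(code):
        text = double or single
        if CHINESE.search(text):
            found.append(text)
    return found

# The Chinese text inside the comment is ignored; only the real string is kept.
print(extract_from_line('alert("保存成功", \'ok\'); // 提示"用户"'))
```

One caveat applies to both this sketch and the original substitution: a // inside a string literal, such as a URL, is treated as the start of a comment, so that string is cut short.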

    Original author: uncle_gy
    Original URL: https://blog.csdn.net/uncle_gy/article/details/104502482