Replacing a pattern with sequentially numbered strings in Python

January 31, 2024

I am trying to implement the following replacement in Python: replace all HTML tags with {n} and create a hash of [tag, {n}].

Original string -> "<h>This is a string.</H><P>This is another part.</P>"

Replaced text -> "{0}This is a string.{1}{2}This is another part.{3}"

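Here is my code. I have started on the replacement, but I am stuck on the substitution logic, because I cannot work out the best way to replace each match with a consecutive placeholder, i.e. {0}, {1} and so on: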

import re
text = "<h> This is a string. </H><p> This is another part. </P>"

num_mat = re.findall(r"(?:<(\/*)[a-zA-Z0-9]+>)",text)
print(str(len(num_mat)))

reg = re.compile(r"(?:<(\/*)[a-zA-Z0-9]+>)",re.VERBOSE)

phctr = 0
#for phctr in num_mat:
#    phtxt = "{" + str(phctr) + "}"
phtxt = "{" + str(phctr) + "}"
newtext = re.sub(reg,phtxt,text)

print(newtext)

Can someone suggest a better way to achieve this? Thanks!

Best answer

import re
import itertools as it

text = "<h> This is a string. </H><p> This is another part. </P>"

cnt = it.count()
print(re.sub(r"</?\w+>", lambda x: '{{{}}}'.format(next(cnt)), text))

This prints:

{0} This is a string. {1}{2} This is another part. {3}

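This only works for simple tags (no attributes or spaces inside the tag). For more complex tags, you will have to adjust the regular expression.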
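As a rough illustration of what "adjusting the regular expression" could look like, the sketch below uses an assumed pattern (named `TAG_RE` here, not part of the original answer) that also accepts attributes and self-closing tags. It is only a sketch: it will still fail on attribute values that contain `>`, and a real HTML parser is more robust for arbitrary markup.

```python
import itertools as it
import re

# Assumed pattern (illustrative only): an optional "/", a tag name,
# optional attributes (whitespace followed by anything except angle
# brackets), an optional self-closing "/", then ">".
TAG_RE = re.compile(r"</?\w+(?:\s[^<>]*)?/?>")

text = '<p class="intro">Hello<br/>world</p>'

cnt = it.count()
# Each match is replaced by the next number wrapped in braces.
result = TAG_RE.sub(lambda m: "{{{}}}".format(next(cnt)), text)
print(result)  # {0}Hello{1}world{2}
```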

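Also, if you do not reinitialize cnt = it.count(), the numbering continues from where the previous substitution left off.

Update, to also get a mapping dictionary: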
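One way to avoid carrying the counter over by accident is to create it inside a function, so every call starts again at 0. The helper name `number_tags` below is hypothetical and only illustrates the idea:

```python
import itertools as it
import re


def number_tags(text):
    """Replace each simple tag with {0}, {1}, ... starting at 0 on every call."""
    cnt = it.count()  # a fresh counter for each call
    return re.sub(r"</?\w+>", lambda m: "{{{}}}".format(next(cnt)), text)


first = number_tags("<h> one </H>")
second = number_tags("<p> two </P>")
print(first)   # {0} one {1}
print(second)  # {0} two {1}
```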

import re
import itertools as it

text = "<h> This is a string. </H><p> This is another part. </P>"

cnt = it.count()
d = {}
def replace(tag, d, cnt):
    if tag not in d:
        d[tag] = '{{{}}}'.format(next(cnt))
    return d[tag]
print(re.sub(r"(</?\w+>)", lambda x: replace(x.group(1), d, cnt), text))
print(d)

This prints (on Python 3.7+, where dictionaries keep insertion order):

{0} This is a string. {1}{2} This is another part. {3}
{'<h>': '{0}', '</H>': '{1}', '<p>': '{2}', '</P>': '{3}'}
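If the mapping is meant to be used for putting the tags back later, it can be inverted and applied with another `re.sub`. This is a minimal sketch under two assumptions: the dictionary has the tag-to-placeholder shape shown above, and the text contains no literal `{n}` sequences of its own. The names `placeholder_to_tag` and `restored` are illustrative.

```python
import re

# Mapping and replaced text as produced by the example above.
d = {'<h>': '{0}', '</H>': '{1}', '<p>': '{2}', '</P>': '{3}'}
replaced = "{0} This is a string. {1}{2} This is another part. {3}"

# Invert tag -> placeholder into placeholder -> tag.
placeholder_to_tag = {ph: tag for tag, ph in d.items()}

# Swap each {n} back for its tag; unknown placeholders are left untouched.
restored = re.sub(
    r"\{\d+\}",
    lambda m: placeholder_to_tag.get(m.group(0), m.group(0)),
    replaced,
)
print(restored)  # <h> This is a string. </H><p> This is another part. </P>
```

Note that the `replace` function above gives a repeated tag the same placeholder each time, so the inverted mapping stays one-to-one.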