This is my sample data:
78|Indonesia|Pamela|Reid|preid25@gravatar.com|147.3.67.193
and the result I want is
Indonesia
At the moment I use split on the string and access that value, but I would like to use a regular expression instead.
A few conditions to note:
The field may be empty.
The field will not contain a pipe (|).
I want to use a regex instead of split because I think a regex would be more efficient, and I want this to be as efficient as possible because the source file is 70 GB.
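For example, the kind of regex-based extraction I have in mind would be roughly like this (just a sketch against the sample line above, not something I have tested on the full file):

import re

# Capture the second pipe-delimited field (the country).
pattern = re.compile(r'^[^|]*\|([^|]*)')

line = "78|Indonesia|Pamela|Reid|preid25@gravatar.com|147.3.67.193"
match = pattern.match(line)
if match:
    print(match.group(1))  # Indonesia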
Edit:
Here is the whole piece of code I will use it in:
def main(argv):
    mylist = set()
    input_file = open("test.txt", 'r')
    for row in input_file:
        rowsplit = row.split("|")
        if rowsplit[1] != '':
            if rowsplit[1] in mylist:
                filename = "bby_" + rowsplit[1] + ".dat"
                existingFile = open(filename, 'a')
                existingFile.write(row)
                existingFile.close()
            else:
                mylist.add(rowsplit[1])
                filename = "bby_" + rowsplit[1] + ".dat"
                newFile = open(filename, 'a')
                newFile.write(row)
                newFile.close()
        else:
            print "Empty"
    print mylist
I'm just confused about which of the answers I should use now :(
I just want this code to be fast. That's all.
Best answer: Splitting and checking the length may still be faster than a regex:
for line in f:
    spl = line.split("|", 2)
    if len(spl) > 2:
        print(spl[1])
        ....
Some timings on matching and non-matching lines:
In [24]: s = "78|Indonesia|Pamela|Reid|preid25@gravatar.com|147.3.67.193"
In [25]: %%timeit
   ....: spl = s.split("|", 2)
   ....: if len(spl) > 2:
   ....:     pass
   ....:
1000000 loops, best of 3: 413 ns per loop
In [26]: r = re.compile(r'(?<=\|)[^|]*')
In [27]: timeit r.search(s)
1000000 loops, best of 3: 452 ns per loop
In [28]: s = "78 Indonesia Pamela Reid preid25@gravatar.com 147.3.67.193"
In [29]: timeit r.search(s)
1000000 loops, best of 3: 1.66 µs per loop
In [30]: %%timeit
   ....: spl = s.split("|", 2)
   ....: if len(spl) > 2:
   ....:     pass
   ....:
1000000 loops, best of 3: 342 ns per loop
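To reproduce timings like these outside IPython, a rough equivalent using the standard timeit module (the exact numbers will of course vary by machine) would be:

import timeit

setup = r'''
import re
s = "78|Indonesia|Pamela|Reid|preid25@gravatar.com|147.3.67.193"
r = re.compile(r"(?<=\|)[^|]*")
'''

# split-and-check, one million runs
print(timeit.timeit('s.split("|", 2)', setup=setup, number=1000000))
# precompiled regex search, one million runs
print(timeit.timeit('r.search(s)', setup=setup, number=1000000))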
You can shave off a little more by creating a local reference to str.split:
_spl = str.split
for line in f:
    spl = _spl(line, "|", 2)
    if len(spl) > 2:
        .....
Since the number of pipes in each line is always the same, you can also let csv.reader do the splitting:
import csv

def main(argv):
    seen = set()  # only use if you actually need a set of all names
    with open("test.txt", 'r') as infile:
        r = csv.reader(infile, delimiter="|")
        for row in r:
            v = row[1]
            if v:
                filename = "bby_" + v + ".dat"
                existingFile = open(filename, 'a')
                existingFile.write("|".join(row) + "\n")  # row is a list, so rebuild the line
                existingFile.close()
                seen.add(v)
            else:
                print "Empty"
The if/else looks redundant: since you are appending to the file either way, you can simply add to the set on every line, and unless you actually need a set of all the names for some other reason, I would remove it from the code completely.
Applying the same logic with split:
def main(argv):
    seen = set()
    with open("test.txt", 'r') as infile:
        _spl = str.split
        for row in infile:
            v = _spl(row, "|", 2)[1]
            if v:
                filename = "bby_" + v + ".dat"
                existingFile = open(filename, 'a')
                existingFile.write(row)
                existingFile.close()
                seen.add(v)
            else:
                print "Empty"
What adds a lot of overhead is the constant opening and writing of files, but unless you can hold all the lines in memory there is no simple way around it.
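For illustration only, one way to cut down on the reopening (a sketch, not benchmarked, and it assumes the number of distinct countries stays well below the OS limit on open file handles) is to cache one open handle per name:

def main_cached(path):
    handles = {}  # country name -> open file handle
    try:
        with open(path, 'r') as infile:
            for row in infile:
                parts = row.split("|", 2)
                if len(parts) < 2 or not parts[1]:
                    continue
                v = parts[1]
                f = handles.get(v)
                if f is None:
                    # open each output file once and reuse the handle
                    f = handles[v] = open("bby_" + v + ".dat", 'a')
                f.write(row)
    finally:
        for f in handles.values():
            f.close()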
As far as the reading goes, on a file with ten million lines just splitting twice beats the csv reader:
In [15]: with open("in.txt") as f:
   ....:     print(sum(1 for _ in f))
   ....:
10000000
In [16]: paste
def main(argv):
    with open(argv, 'r') as infile:
        for row in infile:
            v = row.split("|", 2)[1]
            if v:
                pass
## -- End pasted text --
In [17]: paste
def main_r(argv):
    with open(argv, 'r') as infile:
        r = csv.reader(infile, delimiter="|")
        for row in r:
            if row[1]:
                pass
## -- End pasted text --
In [18]: timeit main("in.txt")
1 loops, best of 3: 3.85 s per loop
In [19]: timeit main_r("in.txt")
1 loops, best of 3: 6.62 s per loop