I'm a Python beginner and have written a few basic scripts. My latest challenge is to take a very large CSV file (10 GB) and split it into many smaller files based on the value of a particular variable in each row.
For example, the file might look like this:
Category,Title,Sales
"Books","Harry Potter",1441556
"Books","Lord of the Rings",14251154
"Series", "Breaking Bad",6246234
"Books","The Alchemist",12562166
"Movie","Inception",1573437
I want to split the file into separate files:
Books.csv, Series.csv, Movie.csv
In practice there will be hundreds of categories, and they won't be sorted. In this case they're in the first column, but in the future they might not be.
I found some solutions online, but none in Python. There's a very simple AWK command that can do this in one line, but I don't have access to AWK at work.
I wrote the following code, but I suspect it's very inefficient. Can anyone suggest how to speed it up?
import csv

# Empty set - used to store the category values that have already been seen
filelist = set()

# Open the large csv file in read mode
with open('//directory/largefile', 'r') as csvfile:
    read_rows = csv.reader(csvfile)
    # Read the first row of the large file and store the whole row as a string
    headerrow = next(read_rows)
    headerstring = ','.join(headerrow)
    for row in read_rows:
        # Store the whole row as a string
        rowstring = ','.join(row)
        # The filename is the first entry in the row - this could be made
        # dynamic so that the user inputs a column name to use
        filename = row[0]
        # If the filename is not in the set, add it and create a new csv
        # file with the header row
        if filename not in filelist:
            filelist.add(filename)
            with open('//directory/subfiles/' + filename + '.csv', 'a') as f:
                f.write(headerstring)
                f.write("\n")
        # Append the current row to that category's csv file
        with open('//directory/subfiles/' + filename + '.csv', 'a') as f:
            f.write(rowstring)
            f.write("\n")
Thanks!
Best answer: A memory-efficient approach, and one that avoids re-opening files here to append to them (as long as you won't generate a huge number of open file handles), is to use a dict mapping each category to a file object. If that category's file isn't open yet, create it and write the header, then always write every row to the corresponding file, e.g.:
import csv

with open('somefile.csv') as fin:
    csvin = csv.DictReader(fin)
    # Category -> open file lookup
    outputs = {}
    for row in csvin:
        cat = row['Category']
        # Open a new file and write the header
        if cat not in outputs:
            fout = open('{}.csv'.format(cat), 'w')
            dw = csv.DictWriter(fout, fieldnames=csvin.fieldnames)
            dw.writeheader()
            outputs[cat] = fout, dw
        # Always write the row
        outputs[cat][1].writerow(row)

# Close all the files
for fout, _ in outputs.values():
    fout.close()
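Since the question notes the category column may not always be the first one, the same pattern can be parameterized on the column name. Here's a minimal sketch along those lines; the `split_csv` helper name and the `out_dir` parameter are my own invention, and the `try`/`finally` ensures files are closed even if a write fails partway through:

```python
import csv
import os

def split_csv(path, column, out_dir='.'):
    """Split a CSV file into one output file per distinct value of `column`."""
    outputs = {}  # column value -> (file object, DictWriter)
    try:
        with open(path, newline='') as fin:
            reader = csv.DictReader(fin)
            for row in reader:
                key = row[column]
                # First time we see this value: open its file and write the header
                if key not in outputs:
                    fout = open(os.path.join(out_dir, '{}.csv'.format(key)),
                                'w', newline='')
                    writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
                    writer.writeheader()
                    outputs[key] = (fout, writer)
                # Always append the row to its value's file
                outputs[key][1].writerow(row)
    finally:
        for fout, _ in outputs.values():
            fout.close()
```

Usage would then be something like `split_csv('//directory/largefile', 'Category', '//directory/subfiles')`. Note that on platforms with a low limit on open file descriptors, hundreds of simultaneously open files may hit that limit, in which case an LRU scheme that closes the least-recently-used handle and re-opens in append mode would be needed.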