Suppressing wc -l in data.table's fread

January 22, 2024 · 221 views

I am reading a large file (~30 GB) in chunks, and I noticed that most of the time is spent on a row count over the entire file.

Read 500000 rows and 49 (of 49) columns from 28.250 GB file in 00:01:09
   4.510s (  7%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
  53.890s ( 79%) Count rows (wc -l)
   0.010s (  0%) Column type detection (first, middle and last 5 rows)
   0.120s (  0%) Allocation of 500000x49 result (xMB) in RAM
   9.780s ( 14%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.060s (  0%) Changing na.strings to NA
  68.370s        Total

Is it possible to tell fread not to do a full row count every time it reads a chunk, or is this a necessary step?

Edit:
This is the exact command I am running:

fread(pfile, skip = 5E6, nrows = 5E5, sep = "\t", colClasses = rpColClasses, na.strings = c("NA", "N/A", "NULL"), head = FALSE, verbose = TRUE)

Best answer

I am not sure whether you can "turn off" the wc -l step in fread. That said, I do have two answers for you.

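Answer 1: Use the Unix command split to break the large data set into chunks before calling fread. I have found that knowing a bit of Unix goes a long way when dealing with big data sets (i.e. data that does not fit in RAM).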

split -b 1m myfile.csv  # breaks your file into 1 MB chunks (by bytes, so a row may be cut at a chunk boundary)
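
A line-based variant of the split step may be easier to feed to fread, because split -l cuts on line boundaries and every chunk then holds whole rows. The following is only a minimal sketch: the file name sample.tsv, the chunk_ prefix, and the chunk size of 4 lines are illustrative assumptions, not part of the original answer.

```shell
#!/bin/sh
# Sketch: split a delimited file on line boundaries so that no row is cut in half.
# sample.tsv, the chunk_ prefix and the chunk size are illustrative assumptions.
set -eu

# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Build a small 10-row, tab-separated sample file (stand-in for the real data).
i=1
while [ "$i" -le 10 ]; do
    printf '%s\tvalue%s\n' "$i" "$i"
    i=$((i + 1))
done > sample.tsv

# -l 4: at most 4 lines per chunk; output files are chunk_aa, chunk_ab, chunk_ac.
split -l 4 sample.tsv chunk_

# Every chunk holds whole rows, so the per-chunk line counts add up to the original.
wc -l chunk_*
```

For a file like the one in the question, something like split -l 500000 would give chunks matching the nrows value used there. Each chunk can then be passed to fread on its own, so any row count should only cover that one chunk rather than the whole 28 GB file.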

Answer 2: Use a connection. Unfortunately, this approach does not work with fread. See my earlier post for what I mean by using a connection: Strategies for reading in CSV files in pieces?