python – Combining/averaging multiple data files

August 4, 2019 · 150 views

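I have a set of data files (e.g., "data####.dat", where #### = 0001, ..., 9999) that all share a common structure: the first column holds the same x values in every file, followed by several columns with different y values.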

data0001.dat:

#A < comment line with unique identifier 'A'
#B1 < this is a comment line that can/should be dropped
1 11 21
2 12 22
3 13 23

data0002.dat:

#A < comment line with unique identifier 'A'
#B2 < this is a comment line that can/should be dropped
1 13 23
2 12 22
3 11 21

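They essentially come from different runs of my program with different seeds, and I now want to combine these partial results into one common histogram, such that the comment line starting with "#A" (which is identical in all files) is retained and the other comment lines are dropped. The first column should be kept, and all other columns should be averaged over all data files: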

dataComb.dat:

#A < comment line with unique identifier 'A'
1 12 22 
2 12 22 
3 12 22

where 12 = (11 + 13)/2 = (12 + 12)/2 = (13 + 11)/2 and 22 = (21 + 23)/2 = (22 + 22)/2 = (23 + 21)/2.

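I already have a bash script (probably horrible code, but I'm not that experienced...) that does this job when run as ./merge.sh data* > dataComb.dat on the command line. It also checks that all data files have the same number of columns and the same values in the first column.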

merge.sh:

#!/bin/bash

if [ $# -lt 2 ]; then
    echo "at least two files please"
    exit 1;
fi

i=1
for file in "$@"; do
    cols[$i]=$(awk '
BEGIN {cols=0}
$1 !~ /^#/ {
  if (cols==0) {cols=NF}
  else {
    if (cols!=NF) {cols=-1}
  }
}
END {print cols}
' "${file}")
    i=$((${i}+1))
done

ncol=${cols[1]}
for i in ${cols[@]}; do
    if [ $i -ne $ncol ]; then
        echo "mismatch in the number of columns"
        exit 1
    fi
done

echo "#combined $# files"
grep "^#A" "$1"

paste "$@" | awk "
\$1 !~ /^#/ && NF>0 {
  flag=0
  x=\$1
  for (c=1; c<${ncol}; c++) { y[c]=0. }
  i=1
  while (i <= NF) {
    if (\$i==x) {
      for (c=1; c<${ncol}; c++) { y[c] += \$(i+c) }
      i+= ${ncol}
    } else { flag=1; i=NF+1; }
  }
  if (flag==0) {
    printf(\"%e \", x)
    for (c=1; c<${ncol}; c++) { printf(\"%e \", y[c]/$#) }
    printf(\"\n\")
  } else { printf(\"# x -coordinate mismatch\n\") }
}"

exit 0

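My problem is that for a large number of data files it quickly becomes slow, and at some point it throws a "too many open files" error. I can see that pasting all data files in one go (paste "$@") is the issue, but doing it in batches and somehow introducing temporary files does not seem like an ideal solution either. I'd appreciate any help making this more scalable while retaining the way the script is called, i.e. with all data files passed as command-line arguments.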

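I decided to post this in the python section as well, since I am often told that Python is very handy for this kind of problem. I have almost no experience with Python, but maybe this is the occasion to finally start learning it ;)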

Best answer

The code appended below works in Python 3.3 and produces the desired output, with a few minor caveats:

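- It grabs the initial comment line from the first file it processes, but doesn't bother to check that all the other files still match (i.e., if you have several files that start with #A and one that starts with #C, it won't reject the #C file, even though it probably should). I mainly wanted to illustrate how the merge could work in Python, and figured that adding this kind of miscellaneous validity check is best left as a "homework" problem.
- It also doesn't bother to check that the number of rows and columns match, and will likely crash if they don't. Consider that another small homework problem.
- It prints all columns to the right of the first one as floating-point values, since in some cases that is what they might be. The first column is treated as a label or row number and is therefore printed as an integer.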

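You can call the code in much the same way as before; for example, if you name the script file merge.py, you can run python merge.py data0001.dat data0002.dat and it will print the merged, averaged result to stdout, just like the bash script. Compared with one of the earlier answers, the code is also more flexible: the way it is written, it should in principle (I haven't actually tested this to make sure) be able to merge files with any number of columns, not just files with exactly three. Another benefit: it doesn't keep files open after it's done with them. The with open(name, 'r') as infile: line is a Python idiom that automatically closes the file once the script has finished reading from it, even though close() is never called explicitly.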
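To make the with-statement behaviour concrete, here is a minimal sketch. The helper name count_data_lines and the temporary sample file are hypothetical and exist only for this demonstration; they are not part of the answer's script.

```python
import os
import tempfile


def count_data_lines(path):
    """Count non-comment, non-blank lines; return (count, file_was_closed)."""
    with open(path, "r") as infile:
        count = sum(1 for line in infile
                    if line.strip() and not line.startswith("#"))
    # Leaving the with-block closed the file, although close() was never called.
    return count, infile.closed


# Write a small sample file shaped like data0001.dat, then read it back.
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as tmp:
    tmp.write("#A comment\n#B1 comment\n1 11 21\n2 12 22\n3 13 23\n")

print(count_data_lines(tmp.name))  # prints (3, True)
os.remove(tmp.name)
```

Because each file is closed before the next one is opened, only one input file is open at any time, regardless of how many names are passed on the command line.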

#!/usr/bin/env python3

import argparse
import re

# Give help description
parser = argparse.ArgumentParser(description='Merge some data files')
# Add to help description
parser.add_argument('fname', metavar='f', nargs='+',
                    help='Names of files to be merged')
# Parse the input arguments!
args = parser.parse_args()
argdct = vars(args)

topcomment=None
output = {}
# Loop over file names
for name in argdct['fname']:
    with open(name, "r") as infile:
        # Loop over lines in each file
        for line in infile:
            # Skip comment lines, except to take note of first one that
            # matches "#A"
            if re.search('^#', line):
                if re.search('^#A', line) is not None and topcomment is None:
                    topcomment = line
                continue
            items = line.split()
            # Skip blank lines, which would otherwise raise an IndexError
            if not items:
                continue
            # If a line matching this one has been encountered in a previous
            # file, add the column values
            currkey = float(items[0])
            if currkey in output:
                for ii in range(len(output[currkey])):
                    output[currkey][ii] += float(items[ii+1])
            # Otherwise, add a new key to the output and create the columns
            else:
                output[currkey] = list(map(float, items[1:]))

# Print the comment line (if one was found)
if topcomment is not None:
    print(topcomment, end='')
# Get total number of files for calculating average
nfile = len(argdct['fname'])
# Sort the output keys
skey = sorted(output.keys())
# Loop through sorted keys and print each averaged column to stdout
for key in skey:
    outline = str(int(key))
    for item in output[key]:
        outline += ' ' + str(item/nfile)
    outline += '\n'
    print(outline, end='')
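
The validity checks left as "homework" in the caveats above could look roughly like the following sketch. It is not the answer's own code: the function name merge_files, the error messages, and the choice to compare first-column values as strings (so "1" and "1.0" would count as a mismatch) are all assumptions made for illustration. Every file is checked against the first one, and only one file is open at a time.

```python
import os
import tempfile


def merge_files(paths):
    """Average the y columns of several data files (hypothetical helper).

    Returns (comment, rows), where rows is a list of (x, [mean_y, ...]).
    Raises ValueError if a file disagrees with the first file.
    """
    if not paths:
        raise ValueError("at least one file please")
    comment, xs, sums = None, None, None
    for index, path in enumerate(paths):
        file_comment, file_xs, file_rows = None, [], []
        with open(path, "r") as infile:
            for line in infile:
                if line.startswith("#"):
                    # Remember only the first "#A" line of this file.
                    if line.startswith("#A") and file_comment is None:
                        file_comment = line.rstrip("\n")
                    continue
                items = line.split()
                if not items:
                    continue  # ignore blank lines
                file_xs.append(items[0])
                file_rows.append([float(v) for v in items[1:]])
        if index == 0:
            # The first file defines the reference comment, x values and sums.
            comment, xs, sums = file_comment, file_xs, file_rows
            continue
        if file_comment != comment:
            raise ValueError(path + ": '#A' comment line differs")
        if file_xs != xs:
            raise ValueError(path + ": first column or row count differs")
        for total, row in zip(sums, file_rows):
            if len(row) != len(total):
                raise ValueError(path + ": column count differs")
            for i, value in enumerate(row):
                total[i] += value
    n = len(paths)
    return comment, [(x, [t / n for t in total]) for x, total in zip(xs, sums)]


# Demonstration with the two sample files from the question.
samples = [
    ("data0001.dat", "#A < comment\n#B1 < dropped\n1 11 21\n2 12 22\n3 13 23\n"),
    ("data0002.dat", "#A < comment\n#B2 < dropped\n1 13 23\n2 12 22\n3 11 21\n"),
]
with tempfile.TemporaryDirectory() as tmpdir:
    paths = []
    for name, text in samples:
        full = os.path.join(tmpdir, name)
        with open(full, "w") as handle:
            handle.write(text)
        paths.append(full)
    comment, rows = merge_files(paths)

print(comment)
for x, means in rows:
    print(x, *means)
```

Calling merge_files(sys.argv[1:]) from a script would keep the original calling convention, with all data files passed as command-line arguments.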