python – Combining hdf5 files into a single dataset

July 22, 2019, 713 views

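I have many hdf5 files, each containing a single dataset. I want to combine them into one dataset where all the data sits in the same volume (each file is an image, and I want one large timelapse image).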

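I wrote a python script that extracts the data as numpy arrays, stores them, and then tries to write them to a new h5 file. However, this approach does not work, because the combined data uses more memory than the 32 GB of RAM I have.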

I also tried the command-line tool h5copy.

h5copy -i file1.h5 -o combined.h5 -s '/dataset' -d '/new_data/t1'
h5copy -i file2.h5 -o combined.h5 -s '/dataset' -d '/new_data/t2'

This works, but it creates many separate datasets in the new file instead of concatenating them into one.
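To see the layout that the h5copy commands produce, the datasets in the output file can be listed with h5py. This is only a small sketch; the helper name `list_datasets` is hypothetical and not part of h5py.

```python
import h5py


def list_datasets(path):
    """Return (name, shape) for every dataset found anywhere in the file."""
    found = []

    def visitor(name, obj):
        # visititems calls this for every group and dataset in the file;
        # returning None tells h5py to keep walking.
        if isinstance(obj, h5py.Dataset):
            found.append((name, obj.shape))

    with h5py.File(path, "r") as f:
        f.visititems(visitor)
    return found
```

After the two h5copy commands, calling `list_datasets('combined.h5')` would be expected to report two separate entries, `new_data/t1` and `new_data/t2`, rather than one combined dataset.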

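Best answer: Although you cannot explicitly append rows to an hdf5 dataset, you can pass the maxshape keyword when creating the dataset so that it can later be resized to accommodate new data. (See
http://docs.h5py.org/en/latest/faq.html#appending-data-to-a-dataset)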
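The maxshape and resize mechanism can be sketched in isolation as follows. The file name, dataset name, and the small integer arrays are made up for illustration only.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "resize_demo.h5")

with h5py.File(path, "w") as f:
    # maxshape=(None, 3): axis 0 may grow without limit, axis 1 stays at 3.
    # Passing maxshape makes h5py use chunked storage, which resizing requires.
    dset = f.create_dataset("demo", shape=(2, 3), maxshape=(None, 3), dtype="i8")
    dset[:, :] = np.arange(6).reshape(2, 3)

    # "Append" two rows: grow axis 0 first, then write into the new region.
    dset.resize(4, axis=0)
    dset[2:4, :] = np.arange(6, 12).reshape(2, 3)

    final_shape = dset.shape
    contents = dset[...]
```

The dataset starts with two rows and ends with four; the rows written before the resize are left untouched.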

Assuming the number of columns in your datasets is always the same, your code will end up looking something like this:

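import h5py

# Input files to combine (replace with your own list of paths)
file_list = ['file1.h5', 'file2.h5']

with h5py.File('your_output_file.h5', 'w') as output_file:
    # keep track of the total number of rows
    total_rows = 0

    for n, f in enumerate(file_list):
        # read one input file at a time so only its data is held in memory;
        # '/dataset' is the dataset path used in the h5copy commands in the question
        with h5py.File(f, 'r') as input_file:
            your_data = input_file['/dataset'][...]
        total_rows = total_rows + your_data.shape[0]
        total_columns = your_data.shape[1]

        if n == 0:
            # first file: create the dataset with no upper limit on its shape
            # (None in maxshape lets that axis grow later); reuse the input
            # dtype so the data is not silently converted to the default float
            create_dataset = output_file.create_dataset(
                "Name", (total_rows, total_columns),
                maxshape=(None, None), dtype=your_data.dtype)
            # fill the first section of the dataset
            create_dataset[:, :] = your_data
            where_to_start_appending = total_rows
        else:
            # resize the dataset to accommodate the new data
            create_dataset.resize(total_rows, axis=0)
            create_dataset[where_to_start_appending:total_rows, :] = your_data
            where_to_start_appending = total_rows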
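The code above concatenates the files along the row axis. The question describes each file as one image that should become one frame of a timelapse volume, so a variant that stacks along a new leading axis may be closer to what is wanted. The sketch below assumes every file holds a 2D dataset at `/dataset` with the same shape and dtype; the function name `stack_images` and the output dataset name `timelapse` are hypothetical. Because the number of files is known up front, the output dataset can be created at its final size and no resize is needed.

```python
import h5py


def stack_images(file_list, output_path, source_name="/dataset", target_name="timelapse"):
    """Stack one 2D image per input file into a (n_files, height, width) dataset."""
    with h5py.File(output_path, "w") as out:
        stack = None
        for n, path in enumerate(file_list):
            with h5py.File(path, "r") as src:
                # Only this one image is held in memory at a time.
                image = src[source_name][...]
            if stack is None:
                # Assumption: all images share the shape and dtype of the first one.
                stack = out.create_dataset(
                    target_name,
                    shape=(len(file_list),) + image.shape,
                    dtype=image.dtype,
                )
            # Write the image as frame n of the volume.
            stack[n] = image
        return None if stack is None else stack.shape
```

A call such as `stack_images(['file1.h5', 'file2.h5'], 'combined.h5')` would then write a single dataset with one frame per input file. Peak memory use stays close to the size of one image, which avoids the out-of-memory problem described in the question.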