python-3.x - Pandas - to_hdf doubles the file size when saving a DataFrame under the same name

June 6, 2023

I am new to the Pandas module. I created a DataFrame and saved it with to_hdf under the name "dirtree":

df.to_hdf("d:/datatree full.h5", "dirtree")

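I then repeated the same call. When I checked the file size afterwards, it had doubled. I assumed my second DataFrame had been appended to the old one, but inspecting the DataFrames in the store and counting rows shows no extra DataFrame and no extra rows. How can that be?

The code I use to inspect the store: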

store = pd.HDFStore('d:/datatree.h5')
print(store)
df = pd.read_hdf('d:/datatree.h5', 'dirtree')
df.text.count() # text is one of the columns in df

Best answer: I can reproduce this problem as follows.

Original sample DF:

In [147]: df
Out[147]:
          a         b           c
0  0.163757 -1.727003    0.641793
1  1.084989 -0.958833    0.552059
2 -0.419273 -1.037440    0.544212
3 -0.197904 -1.106120   -1.117606
4  0.891187  1.094537  100.000000

Let's save it to an HDFStore:

In [149]: df.to_hdf('c:/temp/test_dup.h5', 'x')

File size: 6992 bytes

Let's do it again:

In [149]: df.to_hdf('c:/temp/test_dup.h5', 'x')

File size: 6992 bytes. Note: it has not changed.

Now let's open the HDFStore:

In [150]: store = pd.HDFStore('c:/temp/test_dup.h5')

In [151]: store
Out[151]:
<class 'pandas.io.pytables.HDFStore'>
File path: c:/temp/test_dup.h5
/x            frame        (shape->[5,3])

File size: 6992 bytes. Note: it has not changed.

Let's save the DF to the HDFStore once more, but note that the store is still open:

In [156]: df.to_hdf('c:/temp/test_dup.h5', 'x')

In [157]: store.close()

File size: 12696 bytes #BOOM !!!
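The file sizes quoted above can also be read programmatically rather than from a file manager. The following is only a minimal sketch: `size_change`, `write_demo` and `demo_path` are hypothetical names introduced for illustration, and the write step is a plain binary write so that the snippet runs without pandas or PyTables.

```python
import os
import tempfile


def size_change(path, action):
    """Call action() and return (bytes_before, bytes_after) for the file at path."""
    before = os.path.getsize(path) if os.path.exists(path) else 0
    action()
    return before, os.path.getsize(path)


# Stand-in for the real write step; the path is a throwaway temp file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")


def write_demo():
    # Overwrites the file with exactly 100 bytes each time it is called.
    with open(demo_path, "wb") as fh:
        fh.write(b"x" * 100)


first = size_change(demo_path, write_demo)
second = size_change(demo_path, write_demo)
print(first, second)
```

In the scenario above, the action would presumably be something like `lambda: df.to_hdf('c:/temp/test_dup.h5', key='x')`, called once with the store closed and once with it open.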

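Root cause:

When we do store = pd.HDFStore('c:/temp/test_dup.h5'), the file is opened in the default mode 'a' (append), so the store is ready to be modified. If you then write to the same file without going through this store, a copy is made in order to protect the open store...

How to avoid it:

Use mode='r' when opening the store: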

In [158]: df.to_hdf('c:/temp/test_dup2.h5', 'x')

In [159]: store2 = pd.HDFStore('c:/temp/test_dup2.h5', mode='r')

In [160]: df.to_hdf('c:/temp/test_dup2.h5', 'x')
...
skipped
...
ValueError: The file 'c:/temp/test_dup2.h5' is already opened, but in read-only mode.  Please close it before reopening in append mode.
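One way to make sure an inspection handle never stays open by accident is to combine mode='r' with a `with` block, since `pd.HDFStore` can be used as a context manager. The snippet below is a sketch under assumptions: the sample DataFrame and the temporary path are made up for illustration, and the optional PyTables package must be installed.

```python
import os
import tempfile

import numpy as np
import pandas as pd  # HDF5 support also needs the optional PyTables package

# Hypothetical sample data and path, used only for this illustration.
df = pd.DataFrame(np.random.randn(5, 3), columns=list("abc"))
path = os.path.join(tempfile.mkdtemp(), "test_dup.h5")

df.to_hdf(path, key="x")

# Inspect the file read-only; the handle is closed when the block exits.
with pd.HDFStore(path, mode="r") as store:
    keys = store.keys()
    n_rows = len(store["x"])

# No store is open at this point, so writing again does not collide with it.
size_before = os.path.getsize(path)
df.to_hdf(path, key="x")
size_after = os.path.getsize(path)
print(keys, n_rows, size_before, size_after)
```

The two sizes are printed rather than asserted, because the exact byte counts depend on the pandas, PyTables and HDF5 versions in use.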

Or, a better way to manage HDF files is to use the store directly:

store = pd.HDFStore(filename)  # opens the file in mode 'a' (append) by default
store.append('key_name', df, data_columns=True)  # append() always writes in the 'table' format
...
store.close()  # don't forget to flush changes to disk !!!
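Note that `store.append` adds rows to an existing table, so repeating it with the same DataFrame grows the stored data, whereas `store.put` replaces whatever is stored under the key. The sketch below illustrates the difference; the sample DataFrame, the key name and the temporary path are assumptions made for the example, and PyTables must be installed.

```python
import os
import tempfile

import numpy as np
import pandas as pd  # HDF5 support also needs the optional PyTables package

# Hypothetical sample data and path, used only for this illustration.
df = pd.DataFrame(np.random.randn(5, 3), columns=list("abc"))
path = os.path.join(tempfile.mkdtemp(), "put_vs_append.h5")

with pd.HDFStore(path, mode="w") as store:
    store.put("x", df, format="table")
    store.put("x", df, format="table")   # replaces the object stored under "x"
    rows_after_put = len(store["x"])

    store.append("x", df)                # adds the same 5 rows again
    rows_after_append = len(store["x"])

print(rows_after_put, rows_after_append)
```

If the goal is to overwrite the saved DataFrame on every run, `put` (or `to_hdf` with no store open) is probably the closer match; `append` suits incremental loading.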