Sampling n=2000 from a Dask DataFrame of len 18000 raises the error "Cannot take a larger sample than population when 'replace=False'"

I have a Dask DataFrame created from a CSV file, and len(daskdf) returns 18000, but when I run ddSample = daskdf.sample(2000) I get the error

ValueError: Cannot take a larger sample than population when 'replace=False'

Shouldn't I be able to sample without replacement when the DataFrame is larger than the sample size?

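Best answer: The sample method only supports the frac= keyword argument. See the API documentation. The 2000 you passed is therefore presumably being treated as frac=2000, that is, a request for a sample 2000 times the size of the population.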

The error you get comes from Pandas, not from Dask.

In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1]})
In [3]: df.sample(frac=2000, replace=False)
ValueError: Cannot take a larger sample than population when 'replace=False'

As the Pandas error suggests, consider sampling with replacement:

In [4]: df.sample(frac=2, replace=True)
Out[4]: 
   x
0  1
0  1

In [5]: import dask.dataframe as dd
In [6]: ddf = dd.from_pandas(df, npartitions=1)
In [7]: ddf.sample(frac=2, replace=True).compute()
Out[7]: 
   x
0  1
0  1