I have a Dask dataframe created from a CSV file. len(daskdf) returns 18000, but when I run ddSample = daskdf.sample(2000) I get the error
ValueError: Cannot take a larger sample than population when 'replace=False'
If the dataframe is larger than the sample size, shouldn't I be able to sample without replacement?
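Best answer
The sample method only supports the frac= keyword argument, so the 2000 in daskdf.sample(2000) is treated as a fraction of the rows rather than a row count. See the
API documentation
The error you are getting comes from Pandas, not Dask.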
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1]})
In [3]: df.sample(frac=2000, replace=False)
ValueError: Cannot take a larger sample than population when 'replace=False'
Solution
As the Pandas error suggests, consider sampling with replacement:
In [4]: df.sample(frac=2, replace=True)
Out[4]:
   x
0  1
0  1
In [5]: import dask.dataframe as dd
In [6]: ddf = dd.from_pandas(df, npartitions=1)
In [7]: ddf.sample(frac=2, replace=True).compute()
Out[7]:
   x
0  1
0  1
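If the goal is about 2000 of the 18000 rows without replacement, one option is to convert the row count into a fraction and pass that as frac=. The following is a minimal sketch, not part of the original answer; the helper names fraction_for and sample_about_n are hypothetical, and it assumes the dataframe's sample method accepts frac=, replace= and random_state= keyword arguments.

```python
def fraction_for(n, population):
    """Return the frac value corresponding to roughly n rows out of population."""
    if population <= 0:
        raise ValueError("population must be positive")
    if n > population:
        raise ValueError("cannot sample more rows than exist without replacement")
    return n / population


def sample_about_n(ddf, n, random_state=None):
    """Sample roughly n rows without replacement (hypothetical helper).

    Works with any object that supports len() and
    .sample(frac=..., replace=..., random_state=...).
    """
    # Note: len() on a Dask dataframe triggers a computation to count the rows.
    frac = fraction_for(n, len(ddf))
    return ddf.sample(frac=frac, replace=False, random_state=random_state)
```

With the dataframe from the question this would be called as ddSample = sample_about_n(daskdf, 2000), which uses frac = 2000 / 18000. Because Dask applies the fraction to each partition separately, the result should be expected to contain approximately 2000 rows, not necessarily exactly 2000.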