Trying to balance my dataset through sample_weight in scikit-learn

April 29, 2023

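I am using RandomForest for classification, and I have an unbalanced dataset: 5830 "no" samples and 1006 "yes" samples. I tried to balance it with class_weight and sample_weight, but I could not get it to work.

My code is: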

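# arrX (feature matrix) and y (labels) are defined elsewhere.
X_train, X_test, y_train, y_test = train_test_split(arrX, y, test_size=0.25)
# 'auto' is the older spelling; later scikit-learn releases call this option 'balanced'.
cw = 'auto'
clf = RandomForestClassifier(class_weight=cw)
param_grid = {'n_estimators': [10, 50, 100, 200, 300], 'max_features': ['auto', 'sqrt', 'log2']}
# Weight every sample with a non-zero label 8 times as much as a sample with label 0.
sw = np.array([1 if i == 0 else 8 for i in y_train])
# Older API: later releases expect sample_weight in CV_clf.fit(...) rather than fit_params here.
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10, fit_params={'sample_weight': sw})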

However, using class_weight and sample_weight did not improve my TPR, FPR, or ROC at all.

Why? Am I doing something wrong?

Nevertheless, if I use the function called balanced_subsample, my ratios improve a lot:

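import numpy as np

def balanced_subsample(x, y, subsample_size):
    """Undersample every class to the same number of rows.

    If subsample_size < 1, each class keeps that fraction of the smallest
    class's size; otherwise each class is cut down to the smallest class's size.
    """
    class_xs = []
    min_elems = None

    # Group the rows of x by label and track the size of the smallest class.
    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems * subsample_size)

    xs = []
    ys = []

    for ci, this_xs in class_xs:
        # Shuffle the class's rows so the slice below keeps a random subset.
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)

        x_ = this_xs[:use_elems]
        y_ = np.empty(use_elems)
        y_.fill(ci)

        xs.append(x_)
        ys.append(y_)

    xs = np.concatenate(xs)
    ys = np.concatenate(ys)

    return xs, ys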

My new code is:

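# Subsample the whole dataset first (half the minority-class size per class), then split it.
X_train_subsampled, y_train_subsampled = balanced_subsample(arrX, y, 0.5)
X_train, X_test, y_train, y_test = train_test_split(X_train_subsampled, y_train_subsampled, test_size=0.25)
# 'auto' is the older spelling; later scikit-learn releases call this option 'balanced'.
cw = 'auto'
clf = RandomForestClassifier(class_weight=cw)
param_grid = {'n_estimators': [10, 50, 100, 200, 300], 'max_features': ['auto', 'sqrt', 'log2']}
sw = np.array([1 if i == 0 else 8 for i in y_train])
# Older API: later releases expect sample_weight in CV_clf.fit(...) rather than fit_params here.
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10, fit_params={'sample_weight': sw})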

Thanks

Best answer: This is not a full answer yet, but hopefully it will help get there.

First, some general remarks:

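- To debug this kind of issue, it often helps to make the behavior deterministic. You can pass the random_state parameter to RandomForestClassifier and to other scikit-learn objects with inherent randomness so that every run gives the same result. You will also need: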

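import numpy as np
np.random.seed(0)  # any fixed integer works; seed() with no argument is not reproducible
import random
random.seed(0)  # same here: pass a fixed value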

so that your balanced_subsample function behaves the same way on every run.

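- Don't grid search over n_estimators: in a random forest, more trees are always at least as good, at the cost of extra compute.
- Note that sample_weight and class_weight have a similar objective: the actual sample weights will be sample_weight * the weights inferred from class_weight.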
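
To illustrate the last remark, here is a small sketch of how the two weightings combine, using the class counts from the question (5830 vs. 1006) and its 1-vs-8 sample_weight. It assumes the documented "balanced" heuristic, n_samples / (n_classes * np.bincount(y)), for the class weights; the older 'auto' setting used in the question may compute somewhat different values.

```python
import numpy as np

# Labels with the class counts from the question: 5830 "no" (0) and 1006 "yes" (1).
y = np.array([0] * 5830 + [1] * 1006)

# "balanced" class weights: n_samples / (n_classes * count of each class).
counts = np.bincount(y)
class_weights = len(y) / (len(counts) * counts)

# Per-sample weights from the question: 1 for label 0, 8 for everything else.
sample_weight = np.where(y == 0, 1.0, 8.0)

# The two are multiplied, so each sample ends up with this effective weight.
effective_weight = sample_weight * class_weights[y]

total_no = effective_weight[y == 0].sum()
total_yes = effective_weight[y == 1].sum()
print(class_weights)         # per-class weights
print(total_yes / total_no)  # how much more total weight the "yes" class carries
```

Under these assumptions the class weights alone already give both classes the same total weight, so stacking the 8x sample_weight on top tilts the total weight eight-fold toward the minority class.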

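Could you try:

- Using subsample_size=1 in your balanced_subsample function. Unless there is a particular reason not to, we are better off comparing results on a similar number of samples.
- Using your subsampling strategy with class_weight and sample_weight both set to None.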
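
The two suggested experiments could be set up roughly as in the sketch below. The synthetic data, the undersample_to_minority helper (a compact, hypothetical stand-in for balanced_subsample with subsample_size=1), and the seed values are all assumptions for illustration; only the class counts come from the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)  # fixed seed so every run gives the same result

# Synthetic stand-in for arrX / y with the class counts from the question.
y_all = np.array([0] * 5830 + [1] * 1006)
X_all = rng.normal(size=(len(y_all), 5)) + y_all[:, None] * 0.75

def undersample_to_minority(X, y, rng):
    """Hypothetical helper: keep as many rows per class as the smallest class has."""
    labels, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate(
        [rng.choice(np.flatnonzero(y == label), size=n_keep, replace=False)
         for label in labels]
    )
    return X[keep], y[keep]

X_bal, y_bal = undersample_to_minority(X_all, y_all, rng)
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.25, random_state=0
)

# No class_weight and no sample_weight: the subsampling alone does the balancing.
clf = RandomForestClassifier(n_estimators=100, class_weight=None, random_state=0)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
tpr = np.mean(pred[y_test == 1] == 1)  # true positive rate on the held-out split
fpr = np.mean(pred[y_test == 0] == 1)  # false positive rate on the held-out split
print(tpr, fpr)
```

Keeping the seeds fixed means any change in TPR or FPR between runs comes from the setting being varied rather than from random noise.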

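EDIT: Reading your comment again, I realize your results are not so surprising!
You get a better (higher) TPR but a worse (higher) FPR.
It just means your classifier tries hard to get the samples from class 1 right, and thus produces more false positives (while of course also getting more of the true positives right!).
If you keep increasing the class/sample weights in the same direction, you will see this trend continue.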
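
For reference, the two rates discussed in the edit can be computed from predictions as in this sketch. The label and prediction arrays are made-up toy values, not results from the question; the second set of predictions simply mimics a classifier that leans harder toward class 1.

```python
import numpy as np

def tpr_fpr(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return tp / (tp + fn), fp / (fp + tn)

# Toy data: 6 negatives followed by 4 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
cautious = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
aggressive = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

cautious_rates = tpr_fpr(y_true, cautious)      # TPR 2/4, FPR 1/6
aggressive_rates = tpr_fpr(y_true, aggressive)  # TPR 3/4, FPR 3/6
print(cautious_rates, aggressive_rates)
```

In this toy case, leaning toward class 1 raises both rates at once, which is the trade-off described above.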