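I have been trying to ensure that a classification tree obtained with the CHAID algorithm, as implemented in the CHAID package, has terminal nodes (leaves) with at least minbucket observations. According to the documentation of chaid, this can be done through the chaid_control function: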
chaid_control(alpha2 = 0.05, alpha3 = -1, alpha4 = 0.05,
minsplit = 20, minbucket = 7, minprob = 0.01,
stump = FALSE, maxheight = -1)
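This is similar to how trees are controlled in the rpart package.
However, setting the minbucket parameter does not seem to have any effect on the final shape of the resulting tree. Here is an example: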
library("CHAID")
set.seed(290875)
USvoteS <- USvote[sample(1:nrow(USvote), 1000),]
chaid(vote3 ~ ., data = USvoteS)
Model formula:
vote3 ~ gender + ager + empstat + educr + marstat
Fitted party:
[1] root
| [2] marstat in married
| | [3] educr in <HS, HS, >HS: Gore (n = 311, err = 49.5%)
| | [4] educr in College, Post Coll: Bush (n = 249, err = 35.3%)
| [5] marstat in widowed, divorced, never married
| | [6] gender in male: Gore (n = 159, err = 47.8%)
| | [7] gender in female
| | | [8] ager in 18-24, 25-34, 35-44, 45-54: Gore (n = 127, err = 22.0%)
| | | [9] ager in 55-64, 65+: Gore (n = 115, err = 40.9%)
Number of inner nodes: 4
Number of terminal nodes: 5
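Terminal nodes 3, 4, 6, 8 and 9 contain 311, 249, 159, 127 and 115 observations, respectively. Normally, to impose a minimum number of observations per terminal node, one would proceed as follows: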
ctrl <- chaid_control(minbucket = 200)
However, calling
chaid(vote3 ~ ., data = USvoteS, control = ctrl)
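produces the same tree as before (rather than a tree whose terminal nodes each contain at least 200 observations).
I am not sure whether I am making a mistake or missing something about how chaid is run...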
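Best answer: The minimum number of observations in each terminal node is controlled by both minbucket and minprob. The former gives an absolute number of observations, the latter a relative frequency (relative to the sample size of the current node). Internally, the minimum of the two quantities is used in each node. With the default minprob = 0.01, a large minbucket such as 200 is therefore never the binding limit. This is counterintuitive to me as well, since I would have expected the maximum to be used, but I have not checked whether the original CHAID algorithm is described this way.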
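To make the arithmetic concrete, here is a minimal sketch of the rule described above. The helper effective_min_size is hypothetical and not part of the CHAID package; it only assumes that the smaller of minbucket and minprob times the node's sample size is applied, and it ignores any rounding the package may do internally.

```r
# Hypothetical helper (NOT part of the CHAID package): the effective minimum
# node size if the smaller of the two limits is the one that applies.
# Defaults mirror those of chaid_control(); rounding is ignored here.
effective_min_size <- function(n_node, minbucket = 7, minprob = 0.01) {
  min(minbucket, minprob * n_node)
}

# Root node of the 1000-observation sample, with the default minprob = 0.01:
effective_min_size(1000, minbucket = 200)               # 10

# With minprob = 1 the relative limit can no longer undercut minbucket:
effective_min_size(1000, minbucket = 200, minprob = 1)  # 200
```

Under this assumption, minbucket = 200 on its own leaves the effective limit at roughly 10 observations in the root node, which would explain why the tree in the question did not change.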
If you want to ensure that minbucket alone controls the minimum node size, set minbucket = 200 and minprob = 1.
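Applied to the example from the question, this might look as follows. It is a sketch rather than verified output: it assumes the CHAID package is installed, reuses the seed and subsample from the question, and the object name fit is introduced here only for illustration. The table of node ids is just one way to inspect the terminal node sizes.

```r
# Runs only if the CHAID package is installed; otherwise the block is skipped.
if (requireNamespace("CHAID", quietly = TRUE)) {
  library("CHAID")
  set.seed(290875)
  USvoteS <- USvote[sample(1:nrow(USvote), 1000), ]

  # minprob = 1 so that minbucket is always the smaller (binding) limit
  ctrl <- chaid_control(minbucket = 200, minprob = 1)
  fit <- chaid(vote3 ~ ., data = USvoteS, control = ctrl)

  print(fit)
  # Number of observations falling into each terminal node
  print(table(predict(fit, type = "node")))
}
```

If the answer's explanation holds, every count in the printed table should be at least 200.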