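There are two related questions, but mine is not a duplicate of either: the first has a solution specific to its dataset, and the second concerns glm failing when start is supplied together with an offset.
I have the following dataset: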
library(data.table)
df <- data.frame(names = factor(1:10))
set.seed(0)
df$probs <- c(0, 0, runif(8, 0, 1))
df$response = lapply(df$probs, function(i){
rbinom(50, 1, i)
})
dt <- data.table(df)
dt <- dt[, list(response = unlist(response)), by = c('names', 'probs')]
so that dt is:
> dt
names probs response
1: 1 0.0000000 0
2: 1 0.0000000 0
3: 1 0.0000000 0
4: 1 0.0000000 0
5: 1 0.0000000 0
---
496: 10 0.9446753 0
497: 10 0.9446753 1
498: 10 0.9446753 1
499: 10 0.9446753 1
500: 10 0.9446753 1
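As a quick sanity check on the data, the observed proportion of ones in each group can be compared with probs. This is only a sketch: it rebuilds dt with the code above, and `group_summary` and its column names `n` and `observed` are introduced here for illustration.

```r
library(data.table)

# Rebuild the dataset exactly as in the question.
df <- data.frame(names = factor(1:10))
set.seed(0)
df$probs <- c(0, 0, runif(8, 0, 1))
df$response <- lapply(df$probs, function(p) rbinom(50, 1, p))
dt <- data.table(df)
dt <- dt[, list(response = unlist(response)), by = c("names", "probs")]

# One row per group: group size and observed share of ones.
group_summary <- dt[, list(n = .N, observed = mean(response)),
                    by = c("names", "probs")]
print(group_summary)
```

The two groups with probs = 0 contain only zeros, since rbinom with probability 0 can never return a 1.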
I am trying to fit a logistic regression model with an identity link using lm2 <- glm(data = dt, formula = response ~ probs, family = binomial(link = 'identity')). This gives an error:
Error: no valid set of coefficients has been found: please supply starting values
I tried to fix it by supplying a start argument, but then I got a different error.
> lm2 <- glm(data = dt, formula = response ~ probs, family = binomial(link='identity'), start = c(0, 1))
Error: cannot find valid starting values: please specify some
At this point the errors make no sense to me, and I don't know what to do.
Edit: @iraserd has shed some more light on the problem. Using start = c(0.5, 0.5), I get:
> lm2 <- glm(data = dt, formula = response ~ probs, family = binomial(link='identity'), start = c(0.5, 0.5))
There were 25 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: step size truncated: out of bounds
2: step size truncated: out of bounds
3: step size truncated: out of bounds
4: step size truncated: out of bounds
5: step size truncated: out of bounds
6: step size truncated: out of bounds
7: step size truncated: out of bounds
8: step size truncated: out of bounds
9: step size truncated: out of bounds
10: step size truncated: out of bounds
11: step size truncated: out of bounds
12: step size truncated: out of bounds
13: step size truncated: out of bounds
14: step size truncated: out of bounds
15: step size truncated: out of bounds
16: step size truncated: out of bounds
17: step size truncated: out of bounds
18: step size truncated: out of bounds
19: step size truncated: out of bounds
20: step size truncated: out of bounds
21: step size truncated: out of bounds
22: step size truncated: out of bounds
23: step size truncated: out of bounds
24: step size truncated: out of bounds
25: glm.fit: algorithm stopped at boundary value
and
> summary(lm2)
Call:
glm(formula = response ~ probs, family = binomial(link = "identity"),
data = dt, start = c(0.5, 0.5))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4023 -0.6710 0.3389 0.4641 1.7897
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.486e-08 1.752e-06 0.008 0.993
probs 9.995e-01 2.068e-03 483.372 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 69312 on 49999 degrees of freedom
Residual deviance: 35984 on 49998 degrees of freedom
AIC: 35988
Number of Fisher Scoring iterations: 24
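I strongly suspect this is related to the fact that some of the responses were generated with a true probability of zero, which causes problems because the coefficient of probs is close to 1.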
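Best answer: The glm.fit code has two places where it terminates with the error "no valid set of coefficients has been found: please supply starting values". One is reached when the computed deviance becomes infinite; the other appears to be reached when invalid etastart and mustart options are supplied.
See also this answer, which elaborates on the issue: How do I use a custom link function in glm?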
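The second error in the question, "cannot find valid starting values", can be reproduced by hand. A minimal sketch, assuming the same seed as the question: glm.fit turns start into a linear predictor, maps it to means with the link inverse, and rejects the start if the family's validmu test fails. For the binomial family, validmu requires every mean to lie strictly between 0 and 1. The helper `start_means` and the variables `x`, `ok_01` and `ok_55` are names made up for this illustration.

```r
# Same probs as in the question; each value is repeated for 50 observations.
set.seed(0)
probs <- c(0, 0, runif(8, 0, 1))
x <- rep(probs, each = 50)

fam <- binomial(link = "identity")

# Means implied by a start vector c(intercept, slope) under the identity link.
start_means <- function(start) fam$linkinv(start[1] + start[2] * x)

# start = c(0, 1) gives mu = probs, which contains exact zeros.
ok_01 <- fam$validmu(start_means(c(0, 1)))

# start = c(0.5, 0.5) gives mu in [0.5, 1), strictly inside (0, 1).
ok_55 <- fam$validmu(start_means(c(0.5, 0.5)))

print(c(start_0_1 = ok_01, start_0.5_0.5 = ok_55))
```

Under these assumptions the first start fails the check because of the rows with probs = 0, while the second one passes.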
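Since you are regressing on probabilities (values between 0 and 1), my guess is that you need to specify starting values that are not equal to 0 or 1: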
lm2 <- glm(data = dt, formula = response ~ probs, family = binomial(link='identity'), start=c(0.5,0.5))
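This throws a lot of warnings and stops at a boundary value, probably because of the artificial nature of the example.
Changing the formula to use the logit link (since, according to your question, you want a logistic regression) eliminates the warnings and makes the start argument unnecessary: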
lm2 <- glm(data = dt, formula = response ~ probs, family = binomial(link='logit'))
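To check the logit fit, one can look at the convergence flag and the range of the fitted probabilities. This is a sketch that rebuilds the data in base R with the same seed; `d` and `fit_logit` are illustrative names. Note that the logit link is a different model from the identity link: its coefficients are on the log-odds scale, so they are not directly comparable to the identity-link estimates shown in the question.

```r
# Rebuild the question's data without data.table (same seed, same draw order).
set.seed(0)
probs <- c(0, 0, runif(8, 0, 1))
response <- unlist(lapply(probs, function(p) rbinom(50, 1, p)))
d <- data.frame(probs = rep(probs, each = 50), response = response)

# Logistic regression with the default-style logit link; no start needed.
fit_logit <- glm(response ~ probs, family = binomial(link = "logit"), data = d)

# Did the IRLS iterations converge, and where do the fitted values lie?
print(fit_logit$converged)
print(range(fitted(fit_logit)))
```

With the logit link the fitted probabilities always lie strictly inside (0, 1), so the fit cannot run into the boundary that the identity link hits for the groups with probs = 0.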