In Tibshirani's "The Elements of Statistical Learning", when comparing least squares/linear models and kNN, two scenarios are considered:
Scenario 1: The training data in each class were generated from bivariate Gaussian distributions with uncorrelated components and different means.
Scenario 2: The training data in each class came from a mixture of 10 low-variance Gaussian distributions, with individual means themselves distributed as Gaussian.
The idea is that the first scenario suits the least squares/linear model, while the second suits higher-variance models such as kNN, since kNN looks only at the nearest points rather than all of them.
How can I simulate data for both scenarios in R?
The end goal is to reproduce both scenarios in order to show that the linear model explains the first one better than the second.
Thanks!
Best answer: This could be Scenario 1:
library(mvtnorm)

# sample sizes and number of classes
N1 = 50
N2 = 50
K = 2

# class means
mu1 = c(-1, 3)
mu2 = c(2, 0)

# uncorrelated components: zero covariance, variance 2 per coordinate
cov1 = 0
v11 = 2
v12 = 2
Sigma1 = matrix(c(v11, cov1, cov1, v12), nrow = 2)
cov2 = 0
v21 = 2
v22 = 2
Sigma2 = matrix(c(v21, cov2, cov2, v22), nrow = 2)

# draw each class from its bivariate Gaussian
x1 = rmvnorm(N1, mu1, Sigma1)
x2 = rmvnorm(N2, mu2, Sigma2)
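To connect this to the stated goal, here is a minimal sketch (my own addition, not part of the original answer) that stacks the two classes, fits a least-squares classifier, and draws the decision boundary where the fitted value crosses 0.5:

y <- c(rep(0, N1), rep(1, N2))   # class labels
X <- rbind(x1, x2)
fit <- lm(y ~ X)
plot(X, col = y + 1, pch = 19, xlab = "x1", ylab = "x2")
# boundary: b0 + b1*x1 + b2*x2 = 0.5
b <- coef(fit)
abline(a = (0.5 - b[1]) / b[3], b = -b[2] / b[3])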
And this could be a candidate for simulating from the Gaussian mixture:
BartSimpson <- function(x, n = 100){
  # 10 component means, themselves drawn from a standard Gaussian
  means <- as.matrix(sort(rnorm(10)))
  # equal-weight mixture density of 10 low-variance (sd = .1) Gaussians
  dens <- .1 * rowSums(apply(means, 1, dnorm, x = x, sd = .1))
  # n/10 draws from each component
  rBartSimpson <- c(apply(means, 1, rnorm, n = n/10, sd = .1))
  return(list("thedensity" = dens, "draws" = rBartSimpson))
}
x <- seq(-5,5,by=.01)
plot(x, BartSimpson(x)$thedensity, type = "l", lwd = 4, col = "yellow2",
     xlim = c(-4, 4), ylim = c(0, 0.6))
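The function above is univariate; for the bivariate setup the question actually asks about, a sketch along the lines of the book's recipe (10 mixture means per class drawn from a bivariate Gaussian around the class mean, then each observation picking one of those means at random and adding low-variance Gaussian noise) might look like this. The helper simMixtureClass, its defaults, and the noise variance of 1/5 are my own naming and assumptions:

library(mvtnorm)
simMixtureClass <- function(n, muClass, nMeans = 10, sd = sqrt(1/5)) {
  # nMeans component means drawn around the class mean
  means <- rmvnorm(nMeans, muClass, diag(2))
  # each observation picks a component at random, then adds low-variance noise
  picks <- sample(nMeans, n, replace = TRUE)
  means[picks, ] + rmvnorm(n, sigma = diag(sd^2, 2))
}
x1 <- simMixtureClass(100, c(1, 0))   # class 1
x2 <- simMixtureClass(100, c(0, 1))   # class 2
plot(rbind(x1, x2), col = rep(c(1, 2), each = 100), pch = 19,
     xlab = "x1", ylab = "x2")

With these two classes stacked the same way as in Scenario 1, you can run the same linear-model fit on both scenarios and compare how well it separates the classes.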