如何使用FactoMineR包以编程方式确定主要组件的列索引？

2023年5月3日 349次阅读

给定包含混合变量(即分类和连续)的数据框,如,

digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}

df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
                 studLoc=sample(createRandString(10)),
                 finalmark=sample(c(0:100),10),
                 subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
                 )

我使用FactoMineR包执行无监督的功能选择

df.princomp <- FactoMineR::FAMD(df, graph = FALSE)

变量df.princomp是一个列表.

此后,可视化我使用的主要组件
fviz_screeplot()和fviz_contrib()之类的,

#library(factoextra)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
                           barfill = "gray", barcolor = "black",
                           ylim = c(0, 50), xlab = "Principal Component", 
                           ylab = "Percentage of explained variance",
                           main = "Principal Component (PC) for mixed variables")

factoextra::fviz_contrib(df.princomp, choice = "var", 
                         axes = 1, top = 10, sort.val = c("desc"))

给出了下面的图1

和图2

图1的说明：图1是碎石图. Scree Plot是一个简单的线段图,显示了每个主成分(PC)解释或表示的数据总方差的分数.因此我们可以看到前三个PC共占总方差的43.8％.现在问题自然而然地出现了,“这些变量是什么？”.我在图2中显示了这一点.

图2的说明：该图显示了主成分分析(PCA)结果中行/列的贡献.从这里我可以看到变量,名称,studLoc和finalMark是可用于进一步分析的最重要的变量.

进一步的分析 – 我坚持的地方：得出上述变量名称,studLoc,finalMark的贡献.我使用主成分变量df.princomp(见上文),如df.princomp $quanti.var $contrib [,4]和df.princomp $quali.var $contrib [,2：3].

我要手动指定列索引[,2：3]和[,4].

我想要的是：我想知道如何进行动态列索引分配,这样我就不必手动编码列表df.princomp中的列索引[,2：3]了？

我已经查看了以下类似的问题1,2,3和4,但找不到我的解决方案？任何有助于解决此问题的帮助或建议都会有所帮助.

最佳答案不确定我对你的问题的解释是否正确,如果不是,请道歉.根据我的收集,您使用PCA作为初始工具,向您展示哪些变量在解释数据集时最重要.然后,您需要返回原始数据,快速选择这些变量,无需每次手动编码,并将其用于其他分析.

如果这是正确的,那么我已经从贡献图中保存了数据,过滤掉了贡献最大的变量,并使用该结果创建了仅包含这些变量的新数据框.

digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}

df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
                 studLoc=sample(createRandString(10)),
                 finalmark=sample(c(0:100),10),
                 subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)

df.princomp <- FactoMineR::FAMD(df, graph = FALSE)

factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
                           barfill = "gray", barcolor = "black",
                           ylim = c(0, 50), xlab = "Principal Component", 
                           ylab = "Percentage of explained variance",
                           main = "Principal Component (PC) for mixed variables")

#find the top contributing variables to the overall variation in the dataset
#here I am choosing the top 10 variables (although we only have 6 in our df).
#note you can specify which axes you want to look at with axes=, you can even do axes=c(1,2)

f<-factoextra::fviz_contrib(df.princomp, choice = "var", 
                         axes = c(1), top = 10, sort.val = c("desc"))

#save data from contribution plot
dat<-f$data

#filter out ID's that are higher than, say, 20

r<-rownames(dat[dat$contrib>20,])

#extract these from your original data frame into a new data frame for further analysis

new<-df[r]

new

#finalmark name    studLoc
#1         53    b POTYQ0002N
#2         73    i LWMTW1195I
#3         95    d VTUGO1685F
#4         39    f YCGGS5755N
#5         97    c GOSWE3283C
#6         58    g APBQD6181U
#7         67    a VUJOG1460V
#8         64    h YXOGP1897F
#9         15    j NFUOB6042V
#10        81    e QYTHG0783G

根据你的评论,你说你想要在Dim.1和Dim.2中找到值大于5的变量并将这些变量保存到一个新的数据框’,我会这样做：

#top contributors to both Dim 1 and 2

f<-factoextra::fviz_contrib(df.princomp, choice = "var", 
                         axes = c(1,2), top = 10, sort.val = c("desc"))

#save data from contribution plot
dat<-f$data

#filter out ID's that are higher than 5

r<-rownames(dat[dat$contrib>5,])

#extract these from your original data frame into a new data frame for further analysis

new<-df[r]

new

(这会将所有原始变量保留在我们的新数据框中,因为它们对总方差的贡献超过5％)