r – How do I add a column that counts duplicates sequentially?

March 8, 2023, 392 views

I want to add a column to a data frame (integrates2) that counts duplicated entries sequentially. Here is what the data look like:

name    program  date of contact   helper column
John     ffp        10/11/2014          2
John     TP         10/27/2014          2
Carlos   TP         11/19/2015          3
Carlos   ffp        12/1/2015           3
Carlos   wfd        12/31/2015          3
Jen      ffp        9/9/2014            2
Jen      TP         9/30/2014           2

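This is a list of people who attended certain programs on certain dates. I added a helper column that counts the duplicates, and I sorted the rows by date of contact. I want to count the program combinations that occur (e.g. ffp-tp, tp-ffp-wfd).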

To do that, I want to use the following code to transpose the ordered combinations with the help of a new column called "program2":

 #transpose the programs
 library(reshape2)
 dcast(integrates2, name ~ program2, value.var = "program")

Then I plan to use the following code to convert the result to a table and a data frame and count the frequencies:

 res = table(integrates2)
 resdf = as.data.frame(res)

I saw this approach at the following link:
Count number of time combination of events appear in dataframe columns ext

The "program2" column I need looks like this:

  Name    program  date of contact   helper column   program2
  John     ffp        10/11/2014          2             1
  John     TP         10/27/2014          2             2
  Carlos   TP         11/19/2015          3             1
  Carlos   ffp        12/1/2015           3             2
  Carlos   wfd        12/31/2015          3             3

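That way I can use "program2" to spread the programs into separate columns and then count the combinations. The final result should look like this: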

    program  pro1   pro2   freq      
     ffp     tp             2   
     TP      ffp    wfd     1

I'm sure there is an easier way to do this, but this is as far as I have gotten. Thanks for the help!

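Best answer

After thinking about this question, I believe the following approach works. If you don't mind combining all the program names into a single string, you can do the following. This is probably much simpler.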

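library(data.table)
# Collapse each person's programs into one string, then count each distinct string
setDT(mydf)[, list(type = paste(program, collapse = "-")), by = name][,
           list(total = .N), by = type]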

#         type total
#1:     ffp-TP     2
#2: TP-ffp-wfd     1
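
The collapse above follows the current row order, so it only yields chronological sequences if the rows are already sorted by date within each person. A minimal sketch of sorting first, assuming the dates are month/day/year strings as in the Data section (the `contact` column and the `dt` and `combos` names are introduced here for illustration only):

```r
library(data.table)

# Same example records as in the Data section, with rows deliberately shuffled
mydf <- data.frame(
  name = c("John", "Carlos", "John", "Carlos", "Jen", "Carlos", "Jen"),
  program = c("TP", "wfd", "ffp", "TP", "TP", "ffp", "ffp"),
  dateOfContact = c("10/27/2014", "12/31/2015", "10/11/2014", "11/19/2015",
                    "9/30/2014", "12/1/2015", "9/9/2014"),
  stringsAsFactors = FALSE
)

dt <- as.data.table(mydf)
# Parse the strings as dates so that sorting is chronological, not alphabetical
dt[, contact := as.Date(dateOfContact, format = "%m/%d/%Y")]
# Sort by person, then by date of contact
setorder(dt, name, contact)

# Collapse and count exactly as in the answer
combos <- dt[, .(type = paste(program, collapse = "-")), by = name][
  , .(total = .N), by = type]
print(combos)
```

With the shuffled input, this should give the same two sequences and counts as the output above, although the rows may appear in a different order.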

If you want the program names in separate columns, you can split them with cSplit() from the splitstackshape package.

setDT(mydf)[, list(type = paste(program, collapse = "-")), by = name][,
              list(total = .N), by = type] -> temp

library(splitstackshape)
cSplit(temp, splitCols = "type", sep = "-")

#   total type_1 type_2 type_3
#1:     2    ffp     TP     NA
#2:     1     TP    ffp    wfd
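
For completeness, the "program2" counter described in the question can also be built directly. The following is a rough sketch, not part of the original answer: it uses base R's `ave()` for a running count within each name and assumes the rows are already sorted by date of contact. The `wide` name is illustrative, and `dcast()` here is the data.table version rather than the reshape2 one mentioned in the question.

```r
library(data.table)

# Example records from the Data section (only the columns needed here)
mydf <- data.frame(
  name = c("John", "John", "Carlos", "Carlos", "Carlos", "Jen", "Jen"),
  program = c("ffp", "TP", "TP", "ffp", "wfd", "ffp", "TP"),
  stringsAsFactors = FALSE
)

# Running counter within each name, following the current row order
mydf$program2 <- ave(seq_along(mydf$name), mydf$name, FUN = seq_along)
print(mydf)

# Spread the programs into one column per position; positions a person
# never reached are filled with NA
wide <- dcast(as.data.table(mydf), name ~ program2, value.var = "program")
print(wide)
```

The wide table has one row per person, so counting identical rows would give the combination frequencies, which is what the collapsed-string approach above does in one step.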

The equivalent dplyr code is:

library(dplyr)

group_by(mydf, name) %>%
  summarise(type = paste(program, collapse = "-")) %>%
  count(type)

#        type     n
#       (chr) (int)
#1     ffp-TP     2
#2 TP-ffp-wfd     1
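
If no extra packages are available, the same count can be sketched in base R. This is an illustrative alternative, not part of the original answer; `per_person` and `freq` are names made up for the example, and it again assumes the rows are in date order within each person.

```r
# Example records from the Data section (only the columns needed here)
mydf <- data.frame(
  name = c("John", "John", "Carlos", "Carlos", "Carlos", "Jen", "Jen"),
  program = c("ffp", "TP", "TP", "ffp", "wfd", "ffp", "TP"),
  stringsAsFactors = FALSE
)

# One collapsed string per person, following the current row order
per_person <- tapply(mydf$program, mydf$name, paste, collapse = "-")

# Frequency of each distinct sequence, as a data frame with columns type and Freq
freq <- as.data.frame(table(type = per_person), stringsAsFactors = FALSE)
print(freq)
```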

Data

mydf <- structure(list(name = c("John", "John", "Carlos", "Carlos", "Carlos", 
"Jen", "Jen"), program = c("ffp", "TP", "TP", "ffp", "wfd", "ffp", 
"TP"), dateOfContact = c("10/11/2014", "10/27/2014", "11/19/2015", 
"12/1/2015", "12/31/2015", "9/9/2014", "9/30/2014"), helperColumn = c(2L, 
2L, 3L, 3L, 3L, 2L, 2L)), .Names = c("name", "program", "dateOfContact", 
"helperColumn"), class = "data.frame", row.names = c(NA, -7L))