I want to add a column to a data frame (integrates2) that numbers the repeated entries sequentially. Here is what the data looks like:
name    program  date of contact  helper column
John    ffp      10/11/2014       2
John    TP       10/27/2014       2
Carlos  TP       11/19/2015       3
Carlos  ffp      12/1/2015        3
Carlos  wfd      12/31/2015       3
Jen     ffp      9/9/2014         2
Jen     TP       9/30/2014        2
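This is a list of people who attended certain programs on certain dates. I added a helper column that counts the repeated names, and I sorted the rows by date of contact. I want to count the program combinations that exist (e.g. ffp-TP, TP-ffp-wfd).
To do that, I would like to use the following code to transpose the ordered combinations, with the help of a new column called "program2":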
# transpose the programs into one column per position
library(reshape2)
dcast(integrates2, name ~ program2, value.var = "program")
Then I plan to use the following code to turn the result into a table and a data frame and count the frequencies:
res = table(integrates2)
resdf = as.data.frame(res)
I saw this approach in the following question:
Count number of time combination of events appear in dataframe columns ext
The "program2" column I need would look like this:
Name    program  date of contact  helper column  program2
John    ffp      10/11/2014       2              1
John    TP       10/27/2014       2              2
Carlos  TP       11/19/2015       3              1
Carlos  ffp      12/1/2015        3              2
Carlos  wfd      12/31/2015       3              3
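That way, I can use "program2" to spread the programs into separate columns and then count the combinations. The final result should look like this: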
program  pro1  pro2  freq
ffp      tp          2
TP       ffp   wfd   1
I'm sure there are easier ways to do this, but this is as far as my knowledge takes me. Thanks for the help!
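Best answer
After thinking about this question, I believe the following is a workable approach. If you do not mind combining all the program names into a single string per person, you can do the following, which is probably much simpler.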
library(data.table)
# collapse each person's programs into one string, then count each string
setDT(mydf)[, list(type = paste(program, collapse = "-")), by = name][,
    list(total = .N), by = type]
# type total
#1: ffp-TP 2
#2: TP-ffp-wfd 1
If you want the program names in separate columns, you can split them with cSplit() from the splitstackshape package.
library(data.table)
library(splitstackshape)
temp <- setDT(mydf)[, list(type = paste(program, collapse = "-")), by = name][,
    list(total = .N), by = type]
cSplit(temp, splitCols = "type", sep = "-")
# total type_1 type_2 type_3
#1: 2 ffp TP NA
#2: 1 TP ffp wfd
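The question also asked for a "program2" counter and a wide layout. As a minimal sketch in base R, assuming the rows are already in date order within each name, ave() can number the rows per person and reshape() can spread them into columns. The data frame `df` below is a hypothetical copy of the name and program columns from the example data, not part of the original answer.

```r
# Hypothetical copy of the example data (name and program columns only);
# rows are assumed to be sorted by date of contact within each name.
df <- data.frame(
  name    = c("John", "John", "Carlos", "Carlos", "Carlos", "Jen", "Jen"),
  program = c("ffp", "TP", "TP", "ffp", "wfd", "ffp", "TP"),
  stringsAsFactors = FALSE
)

# Number each person's rows 1, 2, 3, ... in their current order
df$program2 <- ave(seq_along(df$name), df$name, FUN = seq_along)

# Spread the programs into one column per position (program.1, program.2, ...)
wide <- reshape(df, idvar = "name", timevar = "program2", direction = "wide")
```

People with fewer programs than the longest sequence get NA in the trailing columns, much like the cSplit() output.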
The equivalent dplyr code is:
library(dplyr)
group_by(mydf, name) %>%
    summarise(type = paste(program, collapse = "-")) %>%
    count(type)
# type n
# (chr) (int)
#1 ffp-TP 2
#2 TP-ffp-wfd 1
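All of the variants above rely on the rows already being in chronological order within each name. If that is not guaranteed, one possible sketch in base R (no extra packages) is to parse the dates, sort explicitly, and then count the combinations. The names `df`, `combos`, and `freq` are illustrative only, and the month/day/year format is an assumption based on the sample data.

```r
# Hypothetical copy of the example data
df <- data.frame(
  name = c("John", "John", "Carlos", "Carlos", "Carlos", "Jen", "Jen"),
  program = c("ffp", "TP", "TP", "ffp", "wfd", "ffp", "TP"),
  dateOfContact = c("10/11/2014", "10/27/2014", "11/19/2015",
                    "12/1/2015", "12/31/2015", "9/9/2014", "9/30/2014"),
  stringsAsFactors = FALSE
)

# Parse the date strings (assumed month/day/year) so sorting is chronological
df$date <- as.Date(df$dateOfContact, format = "%m/%d/%Y")
df <- df[order(df$name, df$date), ]

# One combination string per person, in date order
combos <- tapply(df$program, df$name, paste, collapse = "-")

# Count how often each combination occurs
freq <- as.data.frame(table(type = as.vector(combos)),
                      stringsAsFactors = FALSE)
```

The resulting `freq` has one row per distinct combination, with the count in the `Freq` column; the row order depends on the locale's sort order.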
Data
mydf <- structure(list(name = c("John", "John", "Carlos", "Carlos", "Carlos",
"Jen", "Jen"), program = c("ffp", "TP", "TP", "ffp", "wfd", "ffp",
"TP"), dateOfContact = c("10/11/2014", "10/27/2014", "11/19/2015",
"12/1/2015", "12/31/2015", "9/9/2014", "9/30/2014"), helperColumn = c(2L,
2L, 3L, 3L, 3L, 2L, 2L)), .Names = c("name", "program", "dateOfContact",
"helperColumn"), class = "data.frame", row.names = c(NA, -7L))