Aggregating Data in R

September 28, 2021. Source: 「已注销」

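R provides many powerful methods for aggregating and reshaping data.

When aggregating data, you typically replace groups of observations with descriptive statistics computed from those observations.

When reshaping data, you change the structure of the data (its rows and columns) to determine how the data is organized.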

Manipulating data with SQL statements (*)

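R has many good functions for summarizing data frames, such as aggregate and daply (the latter from the plyr package). SQL is powerful as well: besides cleaning, summarizing, and computing on data, it also covers data storage, control, definition, and invocation. The examples below use the sqldf package, which runs SQL statements against R data frames.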

Example:

#  Install the sqldf package
install.packages("sqldf")
#  Output:
#  WARNING: Rtools is required to build R packages but is not currently installed. Please
#  download and install the appropriate version of Rtools before proceeding:
#  
#  https://cran.rstudio.com/bin/windows/Rtools/
#  Installing package into ‘C:/Users/Admin/Documents/R/win-library/3.6’
#  (as ‘lib’ is unspecified)
#  also installing the dependencies ‘ellipsis’, ‘glue’, ‘bit’, ‘rlang’, ‘vctrs’, ‘digest’, ‘bit64’, ‘blob’, ‘memoise’, ‘pkgconfig’, ‘Rcpp’, ‘BH’, ‘plogr’, ‘gsubfn’, ‘proto’, ‘RSQLite’, ‘DBI’, ‘chron’
#  
#  (download progress for the 18 dependencies listed above omitted)
#  
#  trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/sqldf_0.4-11.zip'
#  Content type 'application/zip' length 78408 bytes (76 KB)
#  downloaded 76 KB
#  
#  package ‘ellipsis’ successfully unpacked and MD5 sums checked
#  package ‘glue’ successfully unpacked and MD5 sums checked
#  package ‘bit’ successfully unpacked and MD5 sums checked
#  package ‘rlang’ successfully unpacked and MD5 sums checked
#  package ‘vctrs’ successfully unpacked and MD5 sums checked
#  package ‘digest’ successfully unpacked and MD5 sums checked
#  package ‘bit64’ successfully unpacked and MD5 sums checked
#  package ‘blob’ successfully unpacked and MD5 sums checked
#  package ‘memoise’ successfully unpacked and MD5 sums checked
#  package ‘pkgconfig’ successfully unpacked and MD5 sums checked
#  package ‘Rcpp’ successfully unpacked and MD5 sums checked
#  package ‘BH’ successfully unpacked and MD5 sums checked
#  package ‘plogr’ successfully unpacked and MD5 sums checked
#  package ‘gsubfn’ successfully unpacked and MD5 sums checked
#  package ‘proto’ successfully unpacked and MD5 sums checked
#  package ‘RSQLite’ successfully unpacked and MD5 sums checked
#  package ‘DBI’ successfully unpacked and MD5 sums checked
#  package ‘chron’ successfully unpacked and MD5 sums checked
#  package ‘sqldf’ successfully unpacked and MD5 sums checked
#  
#  The downloaded binary packages are in
#   	   C:\Users\Admin\AppData\Local\Temp\RtmpUHJCna\downloaded_packages
library(sqldf)

#  Placeholder student names (张三, 李四); the subjects are math (数学), Chinese (语文), and English (英语)
name <- c(rep("张三", 3), rep("李四", 3))
subject <- c("数学","语文","英语","数学","语文","英语")
score <- c(89, 80, 70, 90, 70, 80)
stuid <- c(1, 1, 1, 2, 2, 2)
stuscore <- data.frame(name, subject, score, stuid)
stuscore
#  Output:
#    name subject score stuid
#  1 张三    数学    89     1
#  2 张三    语文    80     1
#  3 张三    英语    70     1
#  4 李四    数学    90     2
#  5 李四    语文    70     2
#  6 李四    英语    80     2
sqldf("select name, sum(score) as allscore from stuscore group by name order by allscore")
#  Output:
#    name allscore
#  1 张三      239
#  2 李四      240
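#  stuid is added to the group by clause so that every selected column is either grouped or aggregated
sqldf("select name, stuid, sum(score) as allscore from stuscore group by name, stuid order by allscore")
#  Output: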
#    name stuid allscore
#  1 张三     1      239
#  2 李四     2      240
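#  Note: with max(), SQLite (the default sqldf backend) takes the non-aggregated columns (name, subject)
#  from the row that holds the maximum; other databases may reject this query
sqldf("select stuid, name, subject, max(score) as maxscore from stuscore group by stuid order by maxscore")
#  Output: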
#    stuid name subject maxscore
#  1     1 张三    数学       89
#  2     2 李四    数学       90
#  subject is left out here: an average does not belong to any single subject row
sqldf("select stuid, name, avg(score) as avgscore from stuscore group by stuid, name order by avgscore")
#  Output:
#    stuid name avgscore
#  1     1 张三 79.66667
#  2     2 李四 80.00000
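
For comparison, the first query above (total score per student) can be sketched in base R without sqldf. This is only an illustration: the data frame is rebuilt here so the snippet runs on its own, and the intermediate name allscore is chosen to mirror the SQL alias.

```r
# Rebuild the stuscore data frame from the example above
stuscore <- data.frame(
  name    = c(rep("张三", 3), rep("李四", 3)),
  subject = rep(c("数学", "语文", "英语"), 2),
  score   = c(89, 80, 70, 90, 70, 80),
  stuid   = c(1, 1, 1, 2, 2, 2)
)

# select name, sum(score) as allscore from stuscore group by name
allscore <- aggregate(score ~ name, data = stuscore, FUN = sum)
names(allscore)[2] <- "allscore"

# order by allscore
allscore <- allscore[order(allscore$allscore), ]
allscore
```

The SQL form is often easier to read for people who already know SQL, while the base R form avoids the extra package dependency.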

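Summarizing data

Summary statistics are computed with aggregate().
It first splits the data into groups (by row), then applies a function to each group, and finally combines the results into a single table.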

aggregate(x,by,FUN)

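where:

x is the data object to be summarized
by is a list of grouping variables; each distinct combination of their values defines one group, and therefore one row of the result
FUN is the scalar function used to compute the summary statistic; it is applied to each group to produce the values of the new observations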

Example:

score <- data.frame(ID = c(101, 102, 103, 104, 105, 106, 107, 108, 109, 110),
				score1 = c(92,  86,  85,  74,  82,  88,  96,  91,  84,  72),
				score2 = c(73,  69,  82,  93,  80,  94,  71,  87,  86,  91),
				gender = c("male", "male", "female", "female", "female", "female", "female", "male", "male", "male"))
score
#  Output:
#      ID score1 score2 gender
#  1  101     92     73   male
#  2  102     86     69   male
#  3  103     85     82 female
#  4  104     74     93 female
#  5  105     82     80 female
#  6  106     88     94 female
#  7  107     96     71 female
#  8  108     91     87   male
#  9  109     84     86   male
#  10 110     72     91   male
aggregate(score[,c(2,3)],by=list(score[,4]),FUN=mean)
#  Output:
#    Group.1 score1 score2
#  1  female     85   84.0
#  2    male     85   81.2
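
The Group.1 header in the result above appears because the list passed to by is unnamed. As a small sketch (the names by_gender and by_gender_f are illustrative, and the data frame is repeated so the snippet is self-contained), naming the list element or using the formula interface of aggregate() gives a labeled grouping column:

```r
score <- data.frame(
  ID     = 101:110,
  score1 = c(92, 86, 85, 74, 82, 88, 96, 91, 84, 72),
  score2 = c(73, 69, 82, 93, 80, 94, 71, 87, 86, 91),
  gender = c("male", "male", "female", "female", "female",
             "female", "female", "male", "male", "male")
)

# Naming the list element gives the grouping column a readable header
by_gender <- aggregate(score[, c("score1", "score2")],
                       by = list(gender = score$gender),
                       FUN = mean)
by_gender

# The formula interface expresses the same grouping
by_gender_f <- aggregate(cbind(score1, score2) ~ gender, data = score, FUN = mean)
by_gender_f
```

Both calls compute the same group means as the example above; only the column header changes.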

mtcars
#  Output:
#                       mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#  Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#  Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#  Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#  Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#  Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#  Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#  Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#  Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#  Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#  Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#  Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#  Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#  Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#  Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#  Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#  Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#  Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#  Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#  Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#  Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#  Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#  Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#  AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#  Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#  Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#  Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#  Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#  Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#  Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#  Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#  Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#  Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
colnames(mtcars)
#  Output:
#   [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
mtcars$cyl
#  Output:
#   [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
attach(mtcars)   #  attach the data set so its columns can be referenced directly by name
cyl
#  Output:
#   [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
aggregate(mtcars[,c(1,3)],by=list(cyl,gear),FUN=mean)
#  Output:
#    Group.1 Group.2    mpg     disp
#  1       4       3 21.500 120.1000
#  2       6       3 19.750 241.5000
#  3       8       3 15.050 357.6167
#  4       4       4 26.925 102.6250
#  5       6       4 19.750 163.8000
#  6       4       5 28.200 107.7000
#  7       6       5 19.700 145.0000
#  8       8       5 15.400 326.0000
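
The call above relies on attach(mtcars) to find cyl and gear. Attached columns can be masked by other objects with the same name, so here is a sketch of the same grouping without attach(). The names res and res_f are illustrative; mtcars ships with R.

```r
# Group by cylinder count and gear count without attach();
# naming the list elements replaces the Group.1 / Group.2 headers
res <- aggregate(mtcars[, c("mpg", "disp")],
                 by = list(cyl = mtcars$cyl, gear = mtcars$gear),
                 FUN = mean)
res

# Equivalent formula form: the grouping variables go on the right of ~
res_f <- aggregate(cbind(mpg, disp) ~ cyl + gear, data = mtcars, FUN = mean)
res_f
```

If attach() has already been called, detach(mtcars) removes the data set from the search path again.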

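Reshaping data

Data can be reshaped with the merge and melt functions. merge joins two data frames (data sets) side by side, and melt (provided by packages such as reshape2) converts wide-format data into long format, with one row per identifier and measured variable.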
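
Since melt is only mentioned in passing, the following is a minimal sketch of what it does, assuming the reshape2 package is installed (the article does not say which package it has in mind). The data frame wide and its column names are made up for illustration.

```r
# A small wide-format table: one row per student, one column per score
wide <- data.frame(ID     = c(101, 102),
                   score1 = c(92, 86),
                   score2 = c(73, 69))

if (requireNamespace("reshape2", quietly = TRUE)) {
  # Long format: one row per (ID, variable) pair;
  # the measured values end up in the `value` column
  long <- reshape2::melt(wide, id.vars = "ID",
                         measure.vars = c("score1", "score2"))
  print(long)
}
```

The long form is often the more convenient input for grouped summaries such as aggregate(value ~ variable, data = long, FUN = mean).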

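merge function

Joining data structures: in R, two data sets can be merged with the dedicated function merge().

merge matches rows by common columns or row names and joins two data frames (or objects that can be coerced to data frames, such as lists). Its call signature is:
        merge(x, y, by = intersect(names(x), names(y)),
                by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
                sort = TRUE, suffixes = c(".x", ".y"), no.dups = TRUE,
                incomparables = NULL, ...)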

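Parameter	Meaning
x, y	The data sets to merge
by	The columns to merge on (by default, the columns common to both data sets)
by.x, by.y	The names of the join columns in the first and the second data frame, respectively
all, all.x, all.y	Logical values; all defaults to FALSE, and all.x and all.y default to the value of all. With all.x = TRUE, rows of x that have no match in y are kept and the columns coming from y are filled with NA; with FALSE, only rows present in both x and y are output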
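
To make the parameters concrete, here is a small sketch with two made-up data frames, students and totals, whose key columns have different names (stuid and id). It is an illustration of by.x, by.y, all.x, and all, not an example taken from the original article.

```r
students <- data.frame(stuid = c(1, 2, 3),
                       name  = c("A", "B", "C"))
totals   <- data.frame(id       = c(1, 2, 4),
                       allscore = c(239, 240, 250))

# Default (all = FALSE): only keys present in both data frames are kept
inner <- merge(students, totals, by.x = "stuid", by.y = "id")

# all.x = TRUE: every row of students is kept; allscore is NA for stuid 3
left <- merge(students, totals, by.x = "stuid", by.y = "id", all.x = TRUE)

# all = TRUE: rows from both sides are kept; name is NA for key 4
full <- merge(students, totals, by.x = "stuid", by.y = "id", all = TRUE)

inner
left
full
```

When by.x and by.y differ, the key column in the result takes its name from by.x, so all three results have the columns stuid, name, and allscore.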

    Original author: 「已注销」
    Original URL: https://blog.csdn.net/qq_43133192/article/details/105482104