小洁详解《R数据科学》--第18章模型构建

2019年10月20日 404次阅读来源: 小洁忘了怎么分身

1.准备工作

library(tidyverse)
#> ── Attaching packages ───────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
#> ✔ tibble  1.4.2     ✔ dplyr   0.7.6
#> ✔ tidyr   0.8.1     ✔ stringr 1.3.1
#> ✔ readr   1.1.1     ✔ forcats 0.3.0
#> ── Conflicts ──────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
library(modelr)
options(na.action = na.warn)
library(nycflights13)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date

2. 示例数据diamonds

价格与质量、颜色、纯净度的关系都是不符合预期的，是因为一个重要的混淆变量：重量（carat），是确定钻石价格的单一因素中最重要的一个，而质量差的钻石往往更重一些。

通过拟合一个模型来分离carat变量的作用

#数据调整
#取＜2.5克拉以下的数据，进行对数转换
diamonds2 <- diamonds %>%
filter(carat <= 2.5) %>%
mutate(lprice = log2(price), lcarat = log2(carat))

#对数转换后的模式可视化-从对数转换为线性
ggplot(diamonds2, aes(lcarat, lprice)) +
geom_hex(bins = 50)

《小洁详解《R数据科学》--第18章模型构建》

#线性拟合
mod_diamond <- lm(lprice ~ lcarat, data = diamonds2)

#验证
grid <- diamonds2 %>%
data_grid(carat = seq_range(carat, 20)) %>% 
mutate(lcarat = log2(carat)) %>%
add_predictions(mod_diamond, "lprice") %>%
mutate(price = 2 ^ lprice) #此时的price就是预测值
#覆盖在原始数据上
ggplot(diamonds2, aes(carat, price)) +
geom_hex(bins = 50) +
geom_line(data = grid, color = "red", size = 1)

《小洁详解《R数据科学》--第18章模型构建》

这里出现了一个警告信息：

Warning message:
Computation failed in stat_binhex():
Package hexbin required for stat_binhex.
Please install and try again.

按照提示安装hexbin即可解决#install.packages(“hexbin”)

检验残差

diamonds2 <- diamonds2 %>%
add_residuals(mod_diamond, "lresid")
ggplot(diamonds2, aes(lcarat, lresid)) +
geom_hex(bins = 50)

《小洁详解《R数据科学》--第18章模型构建》

用残差代替price绘图，即可以得到期望中的价格与切割质量、颜色、纯度的关系，也就是移除了重量对价格的影响，而替换为随机噪声。

残差为x，表示实际值是预测值的2^x倍。

更复杂的模型–考虑到质量、颜色、纯度

#创建模型
mod_diamond2 <- lm(
lprice ~ lcarat + color + cut + clarity,
data = diamonds2
)

复习一下：
data_grid的第一个参数是数据框，第二个参数及以后是列名。
add_prediction参数是数据框和模型。

用data_grid的.model参数简化输入

grid <- diamonds2 %>%
data_grid(cut, .model = mod_diamond2) %>%
add_predictions(mod_diamond2)
grid
#> # A tibble: 5 x 5
#>   cut       lcarat color clarity  pred
#>   <ord>      <dbl> <chr> <chr>   <dbl>
#> 1 Fair      -0.515 G     VS2      11.2
#> 2 Good      -0.515 G     VS2      11.3
#> 3 Very Good -0.515 G     VS2      11.4
#> 4 Premium   -0.515 G     VS2      11.4
#> 5 Ideal     -0.515 G     VS2      11.4

#残差检验
diamonds2 <- diamonds2 %>%
add_residuals(mod_diamond2, "lresid2")
ggplot(diamonds2, aes(lcarat, lresid2)) +
geom_hex(bins = 50)

《小洁详解《R数据科学》--第18章模型构建》

存在一些较大残差的钻石，可能的原因：异常值、数据问题、模型有问题。

3.示例数据flights

书上忘写library了。

library(nycflights13)
#统计每日航班数
daily <- flights %>%
mutate(date = make_date(year, month, day)) %>%
group_by(date) %>%
summarize(n = n())
daily
#> # A tibble: 365 x 2
#>    date           n
#>    <date>     <int>
#>  1 2013-01-01   842
#>  2 2013-01-02   943
#>  3 2013-01-03   914
#>  4 2013-01-04   915
#>  5 2013-01-05   720
#>  6 2013-01-06   832
#>  7 2013-01-07   933
#>  8 2013-01-08   899
#>  9 2013-01-09   902
#> 10 2013-01-10   932
#> # ... with 355 more rows
#可视化
ggplot(daily, aes(date, n)) +
geom_line()

《小洁详解《R数据科学》--第18章模型构建》

（1）一周

daily <- daily %>%
mutate(wday = wday(date, label = TRUE))
ggplot(daily, aes(wday, n)) +
geom_boxplot()

《小洁详解《R数据科学》--第18章模型构建》

查看wday()函数的帮助文档，它是把日期转换为周几的函数，lable的说明：

logical.
Only available for wday.
TRUE will display the day of the week as an ordered factor of character strings, such as “Sunday.”
FALSE will display the day of the week as a number.

拟合模型并覆盖到原始数据上，用红点表示预测值

mod <- lm(n ~ wday, data = daily)
grid <- daily %>%
data_grid(wday) %>%
add_predictions(mod, "n")
ggplot(daily, aes(wday, n)) +
geom_boxplot() +
geom_point(data = grid, color = "red", size = 4)

《小洁详解《R数据科学》--第18章模型构建》

计算残差

daily <- daily %>%
add_residuals(mod)
daily %>%
ggplot(aes(date, resid)) +
geom_ref_line(h = 0) + #划了一条水平参考线
geom_line()

《小洁详解《R数据科学》--第18章模型构建》

六月以后残差浮动很大，有的日期的航班数比预测少很多。

用filter筛选出这些特殊日期–发现是一些节日：

daily %>%
filter(resid < -100)
#> # A tibble: 11 x 4
#>    date           n wday  resid
#>    <date>     <int> <ord> <dbl>
#>  1 2013-01-01   842 Tue   -109.
#>  2 2013-01-20   786 Sun   -105.
#>  3 2013-05-26   729 Sun   -162.
#>  4 2013-07-04   737 Thu   -229.
#>  5 2013-07-05   822 Fri   -145.
#>  6 2013-09-01   718 Sun   -173.
#>  7 2013-11-28   634 Thu   -332.
#>  8 2013-11-29   661 Fri   -306.
#>  9 2013-12-24   761 Tue   -190.
#> 10 2013-12-25   719 Wed   -244.
#> 11 2013-12-31   776 Tue   -175.

添加平滑的趋势线

daily %>%
ggplot(aes(date, resid)) +
geom_ref_line(h = 0) +
geom_line(color = "grey50") +
geom_smooth(se = FALSE, span = 0.20)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

《小洁详解《R数据科学》--第18章模型构建》

季节性星期六效应

daily %>%
filter(wday == "周六") %>% #我查看了我的数据和预期不太一样，没有显示sat而是周六，是语言的原因，先不管了改过来再说
ggplot(aes(date, n)) +
geom_point() +
geom_line() +
scale_x_date(
NULL,
date_breaks = "1 month",
date_labels = "%b"
)

《小洁详解《R数据科学》--第18章模型构建》

大概看了一下下面还是涉及到日期以及更具体的问题。暂时跳过。
值得注意的是如果模型效果不理想，要重视离群点，削弱离群点对模型的影响使用函数是MASS::rlm()

用rlm()拟合的曲线

library(splines)
mod <- MASS::rlm(n ~ wday * ns(date, 5), data = daily) #我有一个疑问为什么这里是5
daily %>%
data_grid(wday, date = seq_range(date, n = 13)) %>%
add_predictions(mod) %>%
ggplot(aes(date, pred, color = wday)) +
geom_line() +
geom_point()

《小洁详解《R数据科学》--第18章模型构建》

–这个用color映射进行了隐性分组，将一周的每一天汇总为一条曲线。

《小洁详解《R数据科学》--第18章模型构建》微信公众号生信星球同步更新我的文章

友情链接：
生信技能树公益视频合辑：学习顺序是linux，r，软件安装，geo，小技巧，ngs组学！
B站链接：https://m.bilibili.com/space/338686099
YouTube链接：https://m.youtube.com/channel/UC67sImqK7V8tSWHMG8azIVA/playlists
生信工程师入门最佳指南：https://mp.weixin.qq.com/s/vaX4ttaLIa19MefD86WfUA
学徒培养：https://mp.weixin.qq.com/s/3jw3_PgZXYd7FomxEMxFmw
资料大全：https://mp.weixin.qq.com/s/QcES9u1vYh-l6LMXPgJIlA

    原文作者：小洁忘了怎么分身
    原文地址: https://www.jianshu.com/p/b3456e449bf3
    本文转自网络文章，转载此文章仅为分享知识，如有侵权，请联系博主进行删除。