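Author: Huang Tianyuan (黄天元), a PhD student at Fudan University who loves data science and R and is keen to promote the use of R in industry. Email: huang.tian-yuan@qq.com. Feel free to get in touch!
This post covers feature engineering for categorical variables. We often run into data such as gender, shopping mall name, or book category. These are all called categorical variables. They are features that carry information, and handling them well helps with knowledge discovery and machine learning in the modeling or analysis that follows.
Python
# Load packages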
import pandas as pd
import numpy as np
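Transforming unordered variables
Put simply, take the students of a school that is divided into 12 classes. Every student has a class name: "Class 1", "Class 2", and so on. We label the students of Class 1 as "1", those of Class 2 as "2", and so forth. This only simplifies the representation and extracts no new features, but it still counts as a feature transformation method.
# Load and inspect the data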
file_path = "G:/Py/practical-machine-learning-with-python-master/notebooks/Ch04_Feature_Engineering_and_Selection/"
vg_df = pd.read_csv(file_path + 'datasets/vgsales.csv', encoding='utf-8')
vg_df[['Name', 'Platform', 'Year', 'Genre', 'Publisher']].iloc[1:7]
   Name                      Platform  Year    Genre         Publisher
1  Super Mario Bros.         NES       1985.0  Platform      Nintendo
2  Mario Kart Wii            Wii       2008.0  Racing        Nintendo
3  Wii Sports Resort         Wii       2009.0  Sports        Nintendo
4  Pokemon Red/Pokemon Blue  GB        1996.0  Role-Playing  Nintendo
5  Tetris                    GB        1989.0  Puzzle        Nintendo
6  New Super Mario Bros.     DS        2006.0  Platform      Nintendo
# Look at the categories in the Genre column
genres = np.unique(vg_df['Genre'])
genres
array(['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle',
'Racing', 'Role-Playing', 'Shooter', 'Simulation', 'Sports',
'Strategy'], dtype=object)
# Use LabelEncoder to assign a number to each category in the Genre column
from sklearn.preprocessing import LabelEncoder
gle = LabelEncoder() # create the label encoder
genre_labels = gle.fit_transform(vg_df['Genre']) # fit the encoder and encode the column
genre_mappings = {index: label for index, label in enumerate(gle.classes_)} # extract the code-to-label pairs
genre_mappings # show the mapping
{0: 'Action',
1: 'Adventure',
2: 'Fighting',
3: 'Misc',
4: 'Platform',
5: 'Puzzle',
6: 'Racing',
7: 'Role-Playing',
8: 'Shooter',
9: 'Simulation',
10: 'Sports',
11: 'Strategy'}
# Add the labels to the original data frame as a new column
vg_df['GenreLabel'] = genre_labels
vg_df[['Name', 'Platform', 'Year', 'Genre', 'GenreLabel']].iloc[1:7]
   Name                      Platform  Year    Genre         GenreLabel
1  Super Mario Bros.         NES       1985.0  Platform      4
2  Mario Kart Wii            Wii       2008.0  Racing        6
3  Wii Sports Resort         Wii       2009.0  Sports        10
4  Pokemon Red/Pokemon Blue  GB        1996.0  Role-Playing  7
5  Tetris                    GB        1989.0  Puzzle        5
6  New Super Mario Bros.     DS        2006.0  Platform      4
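The same kind of labeling can also be done with pandas alone. The sketch below is a minimal illustration on made-up genre values (the toy series is an assumption, not the vgsales data): category codes number the values in sorted order, as LabelEncoder does, while `pd.factorize` numbers them in order of first appearance.

```python
import pandas as pd

# Toy data standing in for the Genre column (values are illustrative).
genres = pd.Series(["Sports", "Platform", "Racing", "Sports", "Puzzle"])

# Category codes follow the sorted order of the unique values,
# which is also how LabelEncoder numbers its classes.
cat = genres.astype("category")
codes_sorted = cat.cat.codes.tolist()
mapping = dict(enumerate(cat.cat.categories))

# factorize numbers the values by order of first appearance instead.
codes_seen, uniques = pd.factorize(genres)

print(codes_sorted)
print(mapping)
print(codes_seen.tolist(), list(uniques))
```

Either variant gives one integer per category; which numbering is preferable depends on whether the codes need to be reproducible from the sorted category list.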
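Transforming ordered variables
Sometimes a categorical variable expresses a rank of its own, that is, its values can be ordered by magnitude. Take ratings of excellent, good, fair, and poor: the ranking should be excellent > good > fair > poor. Transforming an ordered variable means mapping excellent/good/fair/poor to the values 4/3/2/1. Let's look at the example below.
# Load the data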
poke_df = pd.read_csv(file_path + 'datasets/Pokemon.csv', encoding='utf-8')
poke_df = poke_df.sample(random_state=1, frac=1).reset_index(drop=True)
np.unique(poke_df['Generation']) # inspect the unique values
array(['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6'], dtype=object)
# Define the mapping
gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3,
'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}
# Apply the mapping
poke_df['GenerationLabel'] = poke_df['Generation'].map(gen_ord_map)
poke_df[['Name', 'Generation', 'GenerationLabel']].iloc[4:10] # look at a sample
   Name                 Generation  GenerationLabel
4  Octillery            Gen 2       2
5  Helioptile           Gen 6       6
6  Dialga               Gen 4       4
7  DeoxysDefense Forme  Gen 3       3
8  Rapidash             Gen 1       1
9  Swanna               Gen 5       5
Here the value states which generation a Pokémon belongs to, and the label is simply that number: a first-generation Pokémon gets the label 1.
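An ordered pandas Categorical is another way to express such a ranking. The sketch below uses the four-level rating example from the text; the English level names and the toy series are assumptions made for illustration.

```python
import pandas as pd

ratings = pd.Series(["good", "poor", "excellent", "fair", "good"])

# Levels listed from lowest to highest (an assumed four-level rating scale).
levels = ["poor", "fair", "good", "excellent"]
ordered = pd.Categorical(ratings, categories=levels, ordered=True)

# Codes start at 0, so add 1 to get the 1..4 scale described in the text.
rating_label = pd.Series(ordered.codes, index=ratings.index) + 1

# A value outside `levels` gets code -1 rather than raising an error.
unseen = pd.Categorical(["great"], categories=levels, ordered=True).codes[0]

print(rating_label.tolist())
print(unseen)
```

Listing the levels explicitly keeps the order under our control, whereas sorting the strings alphabetically would not reflect the ranking.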
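Encoding categorical variables
In many settings a categorical variable merely stands for a scenario, and the differences between scenarios are hard to quantify. A student in Class 2 is not twice a student in Class 1, and a student in Class 12 is certainly not twelve Class 1 students. Categorical variables therefore need special encoding methods, which are introduced below.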
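To make this concrete, the sketch below compares distances between three classes under integer labels and under indicator vectors with one position per class. It is only a toy illustration with NumPy; the three classes chosen are arbitrary.

```python
import numpy as np

# Integer labels for Class 1, Class 2 and Class 12 (toy example).
labels = np.array([1, 2, 12], dtype=float)

# Indicator vectors for the same three classes out of 12.
onehot = np.eye(12)[[0, 1, 11]]

# With integer labels, Class 12 looks 11 times farther from Class 1 than Class 2 does.
label_dist = np.abs(labels[:, None] - labels[None, :])

# With indicator vectors, every pair of distinct classes is equally far apart.
onehot_dist = np.linalg.norm(onehot[:, None, :] - onehot[None, :, :], axis=-1)

print(label_dist)
print(np.round(onehot_dist, 3))
```

Any model that relies on distances or linear weights would pick up the artificial spacing of the integer labels, which is why a different encoding is needed for unordered categories.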
One-hot
An example makes this easiest to understand:
# Show the original data
poke_df[['Name', 'Generation', 'Legendary']].iloc[4:10]
   Name                 Generation  Legendary
4  Octillery            Gen 2       False
5  Helioptile           Gen 6       False
6  Dialga               Gen 4       True
7  DeoxysDefense Forme  Gen 3       True
8  Rapidash             Gen 1       False
9  Swanna               Gen 5       False
# Load the modules
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# Start with the label encoding used above for unordered categorical variables
# transform and map pokemon generations
gen_le = LabelEncoder()
gen_labels = gen_le.fit_transform(poke_df['Generation'])
poke_df['Gen_Label'] = gen_labels
# transform and map pokemon legendary status
leg_le = LabelEncoder()
leg_labels = leg_le.fit_transform(poke_df['Legendary'])
poke_df['Lgnd_Label'] = leg_labels
poke_df_sub = poke_df[['Name', 'Generation', 'Gen_Label', 'Legendary', 'Lgnd_Label']]
poke_df_sub.iloc[4:10]
   Name                 Generation  Gen_Label  Legendary  Lgnd_Label
4  Octillery            Gen 2       1          False      0
5  Helioptile           Gen 6       5          False      0
6  Dialga               Gen 4       3          True       1
7  DeoxysDefense Forme  Gen 3       2          True       1
8  Rapidash             Gen 1       0          False      0
9  Swanna               Gen 5       4          False      0
# Apply one-hot encoding
# encode generation labels using one-hot encoding scheme
gen_ohe = OneHotEncoder() # build the encoder
gen_feature_arr = gen_ohe.fit_transform(poke_df[['Gen_Label']]).toarray() # encode the labels and convert the result to an array
gen_feature_labels = list(gen_le.classes_) # build the column names
gen_features = pd.DataFrame(gen_feature_arr, columns=gen_feature_labels) # build the data frame of encoded features
# encode legendary status labels using one-hot encoding scheme
leg_ohe = OneHotEncoder()
leg_feature_arr = leg_ohe.fit_transform(poke_df[['Lgnd_Label']]).toarray()
leg_feature_labels = ['Legendary_'+str(cls_label) for cls_label in leg_le.classes_]
leg_features = pd.DataFrame(leg_feature_arr, columns=leg_feature_labels)
F:\Anaconda3\envs\R\lib\site-packages\sklearn\preprocessing\_encoders.py:363: FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.
If you want the future behaviour and silence this warning, you can specify "categories='auto'".
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
warnings.warn(msg, FutureWarning)
# Show the encoded result
poke_df_ohe = pd.concat([poke_df_sub, gen_features, leg_features], axis=1) # merge the data frames
columns = sum([['Name', 'Generation', 'Gen_Label'],gen_feature_labels,
['Legendary', 'Lgnd_Label'],leg_feature_labels], [])
poke_df_ohe[columns].iloc[4:10]
   Name                 Generation  Gen_Label  Gen 1  Gen 2  Gen 3  Gen 4  Gen 5  Gen 6  Legendary  Lgnd_Label  Legendary_False  Legendary_True
4  Octillery            Gen 2       1          0.0    1.0    0.0    0.0    0.0    0.0    False      0           1.0              0.0
5  Helioptile           Gen 6       5          0.0    0.0    0.0    0.0    0.0    1.0    False      0           1.0              0.0
6  Dialga               Gen 4       3          0.0    0.0    0.0    1.0    0.0    0.0    True       1           0.0              1.0
7  DeoxysDefense Forme  Gen 3       2          0.0    0.0    1.0    0.0    0.0    0.0    True       1           0.0              1.0
8  Rapidash             Gen 1       0          1.0    0.0    0.0    0.0    0.0    0.0    False      0           1.0              0.0
9  Swanna               Gen 5       4          0.0    0.0    0.0    0.0    1.0    0.0    False      0           1.0              0.0
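As we can see, one-hot encoding creates one column per unique value. A row gets 1 in the column of the value it takes and 0 in all the others. One thing to understand is that the encoder, once fitted, converts the corresponding column into one-hot format automatically, so it can be reused to encode new data. (This is extremely useful for applying the same preprocessing to the training and test sets.)
# Construct new data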
new_poke_df = pd.DataFrame([['PikaZoom', 'Gen 3', True],
['CharMyToast', 'Gen 4', False]],
columns=['Name', 'Generation', 'Legendary'])
new_poke_df
   Name         Generation  Legendary
0  PikaZoom     Gen 3       True
1  CharMyToast  Gen 4       False
# Encode the data with the label encoders fitted earlier
new_gen_labels = gen_le.transform(new_poke_df['Generation'])
new_poke_df['Gen_Label'] = new_gen_labels
new_leg_labels = leg_le.transform(new_poke_df['Legendary'])
new_poke_df['Lgnd_Label'] = new_leg_labels
new_poke_df[['Name', 'Generation', 'Gen_Label', 'Legendary', 'Lgnd_Label']]
   Name         Generation  Gen_Label  Legendary  Lgnd_Label
0  PikaZoom     Gen 3       2          True       1
1  CharMyToast  Gen 4       3          False      0
# Reuse the one-hot encoders
new_gen_feature_arr = gen_ohe.transform(new_poke_df[['Gen_Label']]).toarray()
new_gen_features = pd.DataFrame(new_gen_feature_arr, columns=gen_feature_labels)
new_leg_feature_arr = leg_ohe.transform(new_poke_df[['Lgnd_Label']]).toarray()
new_leg_features = pd.DataFrame(new_leg_feature_arr, columns=leg_feature_labels)
new_poke_ohe = pd.concat([new_poke_df, new_gen_features, new_leg_features], axis=1)
columns = sum([['Name', 'Generation', 'Gen_Label'], gen_feature_labels,
['Legendary', 'Lgnd_Label'], leg_feature_labels], [])
new_poke_ohe[columns]
   Name         Generation  Gen_Label  Gen 1  Gen 2  Gen 3  Gen 4  Gen 5  Gen 6  Legendary  Lgnd_Label  Legendary_False  Legendary_True
0  PikaZoom     Gen 3       2          0.0    0.0    1.0    0.0    0.0    0.0    True       1           0.0              1.0
1  CharMyToast  Gen 4       3          0.0    0.0    0.0    1.0    0.0    0.0    False      0           1.0              0.0
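As the scikit-learn warning shown earlier hints, OneHotEncoder can also be applied to the string column directly, without a LabelEncoder step. The sketch below assumes a scikit-learn version of 0.20 or later and uses a small made-up frame rather than the Pokemon data; `handle_unknown="ignore"` is one possible policy for categories that were not present when the encoder was fitted.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training and new data; "Gen 9" never appears in the training frame.
train = pd.DataFrame({"Generation": ["Gen 1", "Gen 2", "Gen 3", "Gen 4"]})
new = pd.DataFrame({"Generation": ["Gen 3", "Gen 9"]})

# Fit on the raw strings; unknown categories will be encoded as all zeros.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["Generation"]])

# Reuse the fitted encoder on the new data.
new_arr = enc.transform(new[["Generation"]]).toarray()
new_features = pd.DataFrame(new_arr, columns=enc.categories_[0])
print(new_features)
```

The unseen "Gen 9" row comes out as all zeros instead of raising an error, which may or may not be the desired behavior for a given model.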
That is far too much code, though. Life is short, and it should not be this hard. Let's try the get_dummies function built into pandas.
gen_onehot_features = pd.get_dummies(poke_df['Generation'])
pd.concat([poke_df[['Name', 'Generation']], gen_onehot_features], axis=1).iloc[4:10]
   Name                 Generation  Gen 1  Gen 2  Gen 3  Gen 4  Gen 5  Gen 6
4  Octillery            Gen 2       0      1      0      0      0      0
5  Helioptile           Gen 6       0      0      0      0      0      1
6  Dialga               Gen 4       0      0      0      1      0      0
7  DeoxysDefense Forme  Gen 3       0      0      1      0      0      0
8  Rapidash             Gen 1       1      0      0      0      0      0
9  Swanna               Gen 5       0      0      0      0      1      0
Done in one step. Nice!
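One caveat worth keeping in mind: unlike a fitted encoder, get_dummies has no memory of the categories it saw before, so new data may produce fewer columns. The sketch below, on made-up frames, shows one way to align new data to the training columns with `reindex`; the frame names are hypothetical.

```python
import pandas as pd

# Toy training data and new data that contains only some of the categories.
train = pd.DataFrame({"Generation": ["Gen 1", "Gen 2", "Gen 3", "Gen 4"]})
new = pd.DataFrame({"Generation": ["Gen 3", "Gen 4"]})

train_dummies = pd.get_dummies(train["Generation"], dtype=int)

# On its own, get_dummies only creates columns for the values it sees.
new_raw = pd.get_dummies(new["Generation"], dtype=int)

# Reindexing to the training columns restores the missing ones, filled with 0.
new_aligned = new_raw.reindex(columns=train_dummies.columns, fill_value=0)

print(list(new_raw.columns))
print(new_aligned)
```

With the columns aligned this way, the new rows have the same layout as the training features and can be passed to the same model.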
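Other encoding methods
First we should be clear about the purpose of encoding. Some see it as a way to simplify the representation, but more often we want the machine to make sense of the odd assortment of categories found in the real world. Machines only recognize numbers, so we have to find ways to let the computer "understand" the categorical variables we feed it. Some of the methods below are variants of one-hot encoding adjusted to save computing resources, while others are motivated by statistical principles and suit different situations. Only the code is given here; interested readers can look up the keywords to learn more.
#Dummy Coding Scheme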
gen_dummy_features = pd.get_dummies(poke_df['Generation'], drop_first=True)
pd.concat([poke_df[['Name', 'Generation']], gen_dummy_features], axis=1).iloc[4:10]
   Name                 Generation  Gen 2  Gen 3  Gen 4  Gen 5  Gen 6
4  Octillery            Gen 2       1      0      0      0      0
5  Helioptile           Gen 6       0      0      0      0      1
6  Dialga               Gen 4       0      0      1      0      0
7  DeoxysDefense Forme  Gen 3       0      1      0      0      0
8  Rapidash             Gen 1       0      0      0      0      0
9  Swanna               Gen 5       0      0      0      1      0
#Effect Coding Scheme
gen_onehot_features = pd.get_dummies(poke_df['Generation'])
# astype returns a new float frame, so the assignment below does not write into a slice of gen_onehot_features
gen_effect_features = gen_onehot_features.iloc[:,:-1].astype(float)
gen_effect_features.loc[np.all(gen_effect_features == 0, axis=1)] = -1.
pd.concat([poke_df[['Name', 'Generation']], gen_effect_features], axis=1).iloc[4:10]
   Name                 Generation  Gen 1  Gen 2  Gen 3  Gen 4  Gen 5
4  Octillery            Gen 2        0.0    1.0    0.0    0.0    0.0
5  Helioptile           Gen 6       -1.0   -1.0   -1.0   -1.0   -1.0
6  Dialga               Gen 4        0.0    0.0    0.0    1.0    0.0
7  DeoxysDefense Forme  Gen 3        0.0    0.0    1.0    0.0    0.0
8  Rapidash             Gen 1        1.0    0.0    0.0    0.0    0.0
9  Swanna               Gen 5        0.0    0.0    0.0    0.0    1.0
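For readers who want to see what effect coding does to each level, here is a small sketch on a made-up three-level series (not the Pokemon data). It drops the last level and marks its rows with -1 in every remaining column, so that, across the distinct levels, each column sums to zero. Roughly speaking, this is what lets a linear model's coefficients be read as deviations from an overall mean rather than from a reference level.

```python
import pandas as pd

# Toy series with three levels; "Gen 3" will act as the dropped level.
gens = pd.Series(["Gen 1", "Gen 2", "Gen 3", "Gen 1"])

onehot = pd.get_dummies(gens, dtype=float)
effect = onehot.iloc[:, :-1].copy()        # drop the last level's column
is_reference = onehot.iloc[:, -1] == 1     # rows that belong to the dropped level
effect.loc[is_reference, :] = -1.0         # code those rows as -1 everywhere

# One row per distinct level: the codes in each column sum to zero.
per_level = effect.groupby(gens).first()
print(per_level)
print(per_level.sum().tolist())
```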
#Feature Hashing scheme
unique_genres = np.unique(vg_df[['Genre']])
print("Total game genres:", len(unique_genres))
print(unique_genres)
Total game genres: 12
['Action' 'Adventure' 'Fighting' 'Misc' 'Platform' 'Puzzle' 'Racing'
'Role-Playing' 'Shooter' 'Simulation' 'Sports' 'Strategy']
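from sklearn.feature_extraction import FeatureHasher
fh = FeatureHasher(n_features=6, input_type='string')
# Note: each genre is passed as a bare string, so its characters are hashed one by one;
# recent scikit-learn releases may reject this input and expect an iterable of strings per sample
hashed_features = fh.fit_transform(vg_df['Genre'])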
hashed_features = hashed_features.toarray()
pd.concat([vg_df[['Name', 'Genre']], pd.DataFrame(hashed_features)], axis=1).iloc[1:7]
   Name                      Genre          0     1     2     3     4     5
1  Super Mario Bros.         Platform       0.0   2.0   2.0  -1.0   1.0   0.0
2  Mario Kart Wii            Racing        -1.0   0.0   0.0   0.0   0.0  -1.0
3  Wii Sports Resort         Sports        -2.0   2.0   0.0  -2.0   0.0   0.0
4  Pokemon Red/Pokemon Blue  Role-Playing  -1.0   1.0   2.0   0.0   1.0  -1.0
5  Tetris                    Puzzle         0.0   1.0   1.0  -2.0   1.0  -1.0
6  New Super Mario Bros.     Platform       0.0   2.0   2.0  -1.0   1.0   0.0
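Put simply, dummy coding is one-hot encoding with one column fewer, which ties into the statistical notion of degrees of freedom. Intuitively it is the "if not male, then female" strategy: gender does not need separate male and female columns, because a single "is female" column already expresses it. Effect coding rests on statistical principles that readers can study on their own. Feature hashing is typically used to save space, allowing feature extraction within a limited memory budget, but applying it in practice requires a deeper understanding.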
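The idea behind feature hashing can be sketched in a few lines. The helper below is a hypothetical illustration, not how scikit-learn's FeatureHasher is implemented: it uses MD5 from the standard library purely to get a deterministic hash, takes the hash modulo the number of columns to pick a position, and uses one bit of the hash to pick a sign.

```python
import hashlib
import numpy as np

def hash_encode(value: str, n_features: int = 6) -> np.ndarray:
    """Map one category string to a fixed-length signed vector (illustrative helper)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "big")
    index = h % n_features                       # column the category lands in
    sign = 1.0 if (h >> 63) & 1 == 0 else -1.0   # one bit of the hash picks the sign
    vec = np.zeros(n_features)
    vec[index] = sign
    return vec

genres = ["Action", "Puzzle", "Sports", "Action"]
encoded = np.vstack([hash_encode(g) for g in genres])
print(encoded.shape)
```

The output width stays at `n_features` no matter how many distinct categories appear, which is where the space saving comes from. The price is that different categories can collide in the same column.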
R
How do we carry out the steps above in R? The code is given here.
# Load packages
library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
√ ggplot2 3.0.0 √ purrr 0.2.5
√ tibble 1.4.2 √ dplyr 0.7.6
√ tidyr 0.8.1 √ stringr 1.3.1
√ readr 1.1.1 √ forcats 0.3.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
# Transforming unordered variables
# Read the data
file_path = "G:/Py/practical-machine-learning-with-python-master/notebooks/Ch04_Feature_Engineering_and_Selection/"
read_csv(paste0(file_path,'datasets/vgsales.csv')) -> vg_df
# Being lazy, I do it all in one go
vg_df %>%
select(Genre) %>%
distinct %>%
mutate(GenreLabel = 1:n()) %>%
right_join(vg_df) %>%
select('Name', 'Platform', 'Year', 'Genre', 'GenreLabel') %>%
slice(1:7)
Parsed with column specification:
cols(
Rank = col_integer(),
Name = col_character(),
Platform = col_character(),
Year = col_character(),
Genre = col_character(),
Publisher = col_character(),
NA_Sales = col_double(),
EU_Sales = col_double(),
JP_Sales = col_double(),
Other_Sales = col_double(),
Global_Sales = col_double()
)
Joining, by = "Genre"
Name                      Platform  Year  Genre         GenreLabel
Wii Sports                Wii       2006  Sports        1
Super Mario Bros.         NES       1985  Platform      2
Mario Kart Wii            Wii       2008  Racing        3
Wii Sports Resort         Wii       2009  Sports        1
Pokemon Red/Pokemon Blue  GB        1996  Role-Playing  4
Tetris                    GB        1989  Puzzle        5
New Super Mario Bros.     DS        2006  Platform      2
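# Transforming ordered variables
read_csv(paste0(file_path,'datasets/Pokemon.csv')) -> poke_df
# Staying lazy all the way
poke_df %>%
mutate(GenerationLabel=as.integer(str_sub(Generation,start=-1,end=-1))) %>% # take the last character and convert it to an integer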
select('Name', 'Generation', 'GenerationLabel') %>%
slice(4:10)
Parsed with column specification:
cols(
`#` = col_integer(),
Name = col_character(),
`Type 1` = col_character(),
`Type 2` = col_character(),
Total = col_integer(),
HP = col_integer(),
Attack = col_integer(),
Defense = col_integer(),
`Sp. Atk` = col_integer(),
`Sp. Def` = col_integer(),
Speed = col_integer(),
Generation = col_character(),
Legendary = col_logical()
)
Name                       Generation  GenerationLabel
VenusaurMega Venusaur      Gen 1       1
Charmander                 Gen 1       1
Charmeleon                 Gen 1       1
Charizard                  Gen 1       1
CharizardMega Charizard X  Gen 1       1
CharizardMega Charizard Y  Gen 1       1
Squirtle                   Gen 1       1
#one-hot
pacman::p_load(onehot)
iris %>% onehot -> encoder
predict(encoder,iris) %>% head
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species=setosa  Species=versicolor  Species=virginica
5.1           3.5          1.4           0.2          1               0                   0
4.9           3.0          1.4           0.2          1               0                   0
4.7           3.2          1.3           0.2          1               0                   0
4.6           3.1          1.5           0.2          1               0                   0
5.0           3.6          1.4           0.2          1               0                   0
5.4           3.9          1.7           0.4          1               0                   0
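#dummy (with `- 1` the intercept is dropped, so all three levels keep their own column)
model.matrix(~iris$Species - 1) %>% head
   iris$Speciessetosa  iris$Speciesversicolor  iris$Speciesvirginica
1  1                   0                       0
2  1                   0                       0
3  1                   0                       0
4  1                   0                       0
5  1                   0                       0
6  1                   0                       0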
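As for effect coding and feature hashing in R, I have not yet found a mature approach. Feature hashing is widely used, however, so there is already a project on GitHub for reference: https://github.com/wush978/FeatureHashing
Analysis
Feature engineering is in fact a very big topic. Someone who does it well could even work as a dedicated feature engineer; it is a highly specialized field. This article only introduces some basic methods. Which method to use when, and how, is left for readers to explore further. I hope it inspires you, at least a little, to keep digging into feature engineering!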