1. Constructing the data
In [1]:
import pandas as pd

data = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                     'data': [4, 1, 2, 2, 3, 5, 3, 5, 5]})
data
Out[1]:
| | group | data |
---|---|---|
0 | a | 4 |
1 | a | 1 |
2 | a | 2 |
3 | b | 2 |
4 | b | 3 |
5 | b | 5 |
6 | c | 3 |
7 | c | 5 |
8 | c | 5 |
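2. Sorting
In [2]:
# by lists the sort keys in priority order; ascending takes one bool per key
# (False = descending for 'group', True = ascending for 'data');
# inplace=True modifies the original DataFrame instead of returning a new one
data.sort_values(by=['group', 'data'], ascending=[False, True], inplace=True)
data
Out[2]:
| | group | data |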
---|---|---|
6 | c | 3 |
7 | c | 5 |
8 | c | 5 |
3 | b | 2 |
4 | b | 3 |
5 | b | 5 |
1 | a | 1 |
2 | a | 2 |
0 | a | 4 |
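The sorted output above keeps the original index labels (6, 7, 8, ...). As a minimal sketch, assuming you would rather leave the original frame untouched and renumber the rows, the same sort can be written without `inplace=True`; the names `sorted_data` and `renumbered` are illustrative only.

```python
import pandas as pd

data = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                     'data': [4, 1, 2, 2, 3, 5, 3, 5, 5]})

# Without inplace=True, sort_values returns a new DataFrame and leaves `data` as it was.
sorted_data = data.sort_values(by=['group', 'data'], ascending=[False, True])

# The index labels travel with the rows; reset_index(drop=True) renumbers them from 0.
renumbered = sorted_data.reset_index(drop=True)
print(renumbered)
```

Returning a new frame makes it easier to chain further operations and avoids surprises when the unsorted data is still needed later.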
3. Sorting by a specified key
In [3]:
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [3, 2, 1, 3, 3, 3, 4]})
data
Out[3]:
| | k1 | k2 |
---|---|---|
0 | one | 3 |
1 | one | 2 |
2 | one | 1 |
3 | two | 3 |
4 | two | 3 |
5 | two | 3 |
6 | two | 4 |
In [4]:
data.sort_values(by='k2')
Out[4]:
| | k1 | k2 |
---|---|---|
2 | one | 1 |
1 | one | 2 |
0 | one | 3 |
3 | two | 3 |
4 | two | 3 |
5 | two | 3 |
6 | two | 4 |
5. Removing duplicate rows
In [5]:
data.drop_duplicates()  # drop rows whose (k1, k2) pair repeats an earlier row; the first occurrence is kept
Out[5]:
| | k1 | k2 |
---|---|---|
0 | one | 3 |
1 | one | 2 |
2 | one | 1 |
3 | two | 3 |
6 | two | 4 |
In [6]:
data.drop_duplicates('k1')  # compare on k1 only: keep the first row for each distinct k1 value
Out[6]:
| | k1 | k2 |
---|---|---|
0 | one | 3 |
3 | two | 3 |
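Both calls above keep the first occurrence of each duplicate. As a small sketch of the other options, the `keep` parameter of `drop_duplicates` also accepts `'last'` and `False`, and `duplicated()` returns the underlying boolean mask; the variable names below are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [3, 2, 1, 3, 3, 3, 4]})

# keep='last' keeps the last row seen for each k1 value instead of the first.
last_per_k1 = data.drop_duplicates(subset='k1', keep='last')

# keep=False drops every row that has a duplicate, including the first occurrence.
no_repeats = data.drop_duplicates(keep=False)

# duplicated() marks rows that repeat an earlier row (what drop_duplicates() removes by default).
dup_mask = data.duplicated()

print(last_per_k1)
print(no_repeats)
print(dup_mask)
```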
6. Mapping values to new values
In [7]:
data1 = pd.DataFrame({'food': ['A1', 'A2', 'B1', 'B2', 'C1', 'C2', 'C3'],
                      'data': [1, 2, 3, 4, 5, 6, 7]})
data1
Out[7]:
| | food | data |
---|---|---|
0 | A1 | 1 |
1 | A2 | 2 |
2 | B1 | 3 |
3 | B2 | 4 |
4 | C1 | 5 |
5 | C2 | 6 |
6 | C3 | 7 |
6-1 Mapping with apply
In [8]:
def food_map(series):
    # series is one row of data1 (axis='columns' passes each row to the function)
    if series['food'] in ('A1', 'A2'):
        return 'A'
    elif series['food'] in ('B1', 'B2'):
        return 'B'
    elif series['food'] in ('C1', 'C2', 'C3'):
        return 'C'

data1['food_map'] = data1.apply(food_map, axis='columns')  # apply food_map row by row
data1
Out[8]:
| | food | data | food_map |
---|---|---|---|
0 | A1 | 1 | A |
1 | A2 | 2 | A |
2 | B1 | 3 | B |
3 | B2 | 4 | B |
4 | C1 | 5 | C |
5 | C2 | 6 | C |
6 | C3 | 7 | C |
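6-2 Mapping with map
In [9]:
# dictionary that maps each food code to its category
food2Upper = {
    'A1': 'A',
    'A2': 'A',
    'B1': 'B',
    'B2': 'B',
    'C1': 'C',
    'C2': 'C',
    'C3': 'C',
}
data1['upper'] = data1['food'].map(food2Upper)  # look up each value of the column in the dictionary
data1
Out[9]:
| | food | data | food_map | upper |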
---|---|---|---|---|
0 | A1 | 1 | A | A |
1 | A2 | 2 | A | A |
2 | B1 | 3 | B | B |
3 | B2 | 4 | B | B |
4 | C1 | 5 | C | C |
5 | C2 | 6 | C | C |
6 | C3 | 7 | C | C |
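Two details of `map` are worth a quick sketch. First, a value that is missing from the dictionary becomes NaN rather than raising an error. Second, because the category in this example happens to be the first character of the food code, the string accessor `.str[0]` gives the same result without a dictionary. The `partial` dictionary below is a hypothetical, deliberately incomplete mapping.

```python
import pandas as pd

data1 = pd.DataFrame({'food': ['A1', 'A2', 'B1', 'B2', 'C1', 'C2', 'C3'],
                      'data': [1, 2, 3, 4, 5, 6, 7]})

# Hypothetical incomplete mapping: only the 'A' foods are listed.
partial = {'A1': 'A', 'A2': 'A'}
mapped = data1['food'].map(partial)   # unmapped values become NaN
print(mapped)

# Shortcut that works only because the category is the first character of the code.
first_letter = data1['food'].str[0]
print(first_letter)
```

Checking `mapped.isna().sum()` after a dictionary-based `map` is a cheap way to catch keys that were forgotten.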
7. Adding a new column with assign
In [10]:
import numpy as np

df = pd.DataFrame({'data1': np.random.random(5),
                   'data2': np.random.random(5)})
df2 = df.assign(rantion=df['data1'] / df['data2'])  # returns a copy of df with a new column holding data1 / data2
df2
Out[10]:
| | data1 | data2 | rantion |
---|---|---|---|
0 | 0.002526 | 0.336918 | 0.007498 |
1 | 0.530793 | 0.549558 | 0.965854 |
2 | 0.527832 | 0.229412 | 2.300803 |
3 | 0.902357 | 0.826746 | 1.091456 |
4 | 0.984355 | 0.372997 | 2.639041 |
In [11]:
df2.drop('rantion', axis='columns', inplace=True)  # delete the specified column
df2
Out[11]:
| | data1 | data2 |
---|---|---|
0 | 0.002526 | 0.336918 |
1 | 0.530793 | 0.549558 |
2 | 0.527832 | 0.229412 |
3 | 0.902357 | 0.826746 |
4 | 0.984355 | 0.372997 |
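As a minimal sketch of two related conveniences: `assign` accepts several keyword arguments at once, and each may be a callable that receives the frame; `drop` also takes a `columns=` keyword as an alternative to `axis='columns'`. The column names `ratio` and `total` and the seeded generator are assumptions made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the example is reproducible
df = pd.DataFrame({'data1': rng.random(5), 'data2': rng.random(5)})

# Each keyword becomes a new column; a callable receives the DataFrame being assigned to.
df3 = df.assign(ratio=lambda d: d['data1'] / d['data2'],
                total=lambda d: d['data1'] + d['data2'])

# Without inplace=True, drop returns a new DataFrame and leaves df3 unchanged.
df4 = df3.drop(columns=['ratio', 'total'])

print(df3)
print(df4)
```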
8. Replacing values with replace
In [12]:
data = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
data
Out[12]:
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
dtype: int64
In [13]:
data.replace(9, np.nan, inplace=True)
data
Out[13]:
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
7    8.0
8    NaN
dtype: float64
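Note that the dtype changed from int64 to float64 in the output above, because NaN is a floating-point value. As a short sketch, `replace` also accepts a list of values to replace with one value, or a dictionary of old-to-new pairs; the replacement values chosen here are arbitrary.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Replace several values with the same new value.
several = data.replace([8, 9], np.nan)

# Replace each key with its own new value.
pairs = data.replace({1: 100, 9: -1})

print(several)
print(pairs)
```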
9. Discretization: grouping data into ranges with pd.cut
In [14]:
ages = [15, 20, 18, 25, 46, 89, 66, 80]
bins = [10, 40, 90]
bins_res = pd.cut(ages, bins)  # discretize into two bins: (10, 40] and (40, 90]
bins_res
Out[14]:
[(10, 40], (10, 40], (10, 40], (10, 40], (40, 90], (40, 90], (40, 90], (40, 90]]
Categories (2, interval[int64]): [(10, 40] < (40, 90]]
In [15]:
bins_res.labels  # no such attribute
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-15-886c7b02dfc3> in <module>()
----> 1 bins_res.labels  # no such attribute

AttributeError: 'Categorical' object has no attribute 'labels'
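The `Categorical` returned by `pd.cut` does not expose a `labels` attribute in this pandas version. A minimal sketch of the attributes that do exist: `codes` holds the integer bin index of each value, and `categories` holds the bins themselves.

```python
import pandas as pd

ages = [15, 20, 18, 25, 46, 89, 66, 80]
bins_res = pd.cut(ages, [10, 40, 90])

print(bins_res.codes)       # integer position of the bin each age falls into
print(bins_res.categories)  # the bins, as an IntervalIndex
```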
In [16]:
pd.value_counts(bins_res)  # show each bin and the number of values that fall into it
Out[16]:
(40, 90]    4
(10, 40]    4
dtype: int64
In [17]:
pd.cut(ages, [10, 30, 50, 90])  # pass the bin edges [10, 30, 50, 90] directly instead of a bins variable
Out[17]:
[(10, 30], (10, 30], (10, 30], (10, 30], (30, 50], (50, 90], (50, 90], (50, 90]]
Categories (3, interval[int64]): [(10, 30] < (30, 50] < (50, 90]]
In [18]:
group_names = ['Youth', 'Middle', 'Old']
pd.value_counts(pd.cut(ages, [10, 30, 50, 90], labels=group_names))
Out[18]:
Youth     4
Old       3
Middle    1
dtype: int64
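Two behaviors of `pd.cut` are easy to miss, so here is a hedged sketch. Bins are closed on the right by default, so a value equal to an edge falls into the lower bin unless `right=False` is passed, and a value outside all bins becomes NaN. The sketch also counts bins with the `Series.value_counts` method, an alternative to the top-level `pd.value_counts`, which newer pandas releases deprecate. The label names and the test values 30 and 5 are illustrative.

```python
import pandas as pd

edges = [10, 30, 50, 90]

# Default right=True: bins are (10, 30], (30, 50], (50, 90], so 30 lands in the first bin.
closed_right = pd.cut([30], edges)
# right=False: bins are [10, 30), [30, 50), [50, 90), so 30 lands in the second bin.
closed_left = pd.cut([30], edges, right=False)
# A value outside every bin is returned as NaN (code -1).
outside = pd.cut([5], edges)

ages = [15, 20, 18, 25, 46, 89, 66, 80]
labels = ['Youth', 'Middle', 'Old']  # illustrative label names
counts = pd.Series(pd.cut(ages, edges, labels=labels)).value_counts()

print(closed_right, closed_left, outside, counts, sep='\n')
```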
10. Inspecting missing values
In [19]:
df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3)])  # build a frame that contains some missing values
df
Out[19]:
| | 0 | 1 | 2 |
---|---|---|---|
0 | 0 | 1.0 | 2.0 |
1 | 0 | NaN | 0.0 |
2 | 0 | 0.0 | NaN |
3 | 0 | 1.0 | 2.0 |
In [20]:
df.isnull()  # locate missing values: True marks a missing entry
Out[20]:
| | 0 | 1 | 2 |
---|---|---|---|
0 | False | False | False |
1 | False | True | False |
2 | False | False | True |
3 | False | False | False |
In [21]:
df.isnull().any()  # by default, reports for each column whether it contains any missing value
Out[21]:
0    False
1     True
2     True
dtype: bool
In [22]:
df.isnull().any(axis=1)  # axis=1 reports for each row instead
Out[22]:
0    False
1     True
2     True
3    False
dtype: bool
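`any()` only says whether something is missing. As a small sketch, replacing it with `sum()` counts the missing entries, since True is counted as 1; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3)])

per_column = df.isnull().sum()        # number of missing values in each column
per_row = df.isnull().sum(axis=1)     # number of missing values in each row
total = int(df.isnull().sum().sum())  # number of missing values in the whole frame

print(per_column)
print(per_row)
print(total)
```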
11. Filling missing values
In [23]:
df.fillna(5)  # fill every missing value with 5
Out[23]:
| | 0 | 1 | 2 |
---|---|---|---|
0 | 0 | 1.0 | 2.0 |
1 | 0 | 5.0 | 0.0 |
2 | 0 | 0.0 | 5.0 |
3 | 0 | 1.0 | 2.0 |
In [24]:
df[df.isnull().any(axis=1)]  # select the rows that contain at least one missing value
Out[24]:
| | 0 | 1 | 2 |
---|---|---|---|
1 | 0 | NaN | 0.0 |
2 | 0 | 0.0 | NaN |
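A constant such as 5 is rarely the best fill value. As a sketch of two common alternatives, assuming the column mean is a sensible substitute for your data: `fillna` accepts a Series keyed by column label, so `df.mean()` fills each column with its own mean, and `dropna()` discards the incomplete rows instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3)])

# df.mean() skips NaN by default; each column's gaps are filled with that column's mean.
filled = df.fillna(df.mean())

# dropna() removes every row that contains at least one missing value.
complete_rows = df.dropna()

print(filled)
print(complete_rows)
```

Neither call modifies `df`; both return a new DataFrame.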