3-10 Common Pandas Operations

 

1. Constructing data

In [1]:

import pandas as pd
data=pd.DataFrame({'group':['a','a','a','b','b','b','c','c','c'],
                 'data':[4,1,2,2,3,5,3,5,5]})
data

Out[1]:  

  group  data
0     a     4
1     a     1
2     a     2
3     b     2
4     b     3
5     b     5
6     c     3
7     c     5
8     c     5

 

2. Sorting

In [2]:

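# by: the columns to sort on; ascending=[False, True]: sort 'group' descending, then 'data' ascending
# inplace=True: modify data itself instead of returning a new DataFrame
data.sort_values(by=['group','data'],ascending=[False,True],inplace=True)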
data

Out[2]:  

  group  data
6     c     3
7     c     5
8     c     5
3     b     2
4     b     3
5     b     5
1     a     1
2     a     2
0     a     4
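
The sort above changes data in place. As a minimal sketch of the alternative, the cell below leaves inplace at its default so that sort_values returns a new DataFrame, then renumbers the rows. The names sorted_data and renumbered are chosen here for illustration only.

```python
import pandas as pd

data = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                     'data': [4, 1, 2, 2, 3, 5, 3, 5, 5]})

# Without inplace=True, sort_values returns a sorted copy and leaves data untouched
sorted_data = data.sort_values(by=['group', 'data'], ascending=[False, True])

# The sorted copy keeps the original row labels; reset_index(drop=True) renumbers them from 0
renumbered = sorted_data.reset_index(drop=True)
print(renumbered)
```

Keeping the original frame unchanged makes it easier to rerun a cell without the result depending on earlier in-place edits.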

 

3. Sorting by a specified key:

In [3]:

data=pd.DataFrame({'k1':['one']*3+['two']*4,'k2':[3,2,1,3,3,3,4]})
data

Out[3]:  

    k1  k2
0  one   3
1  one   2
2  one   1
3  two   3
4  two   3
5  two   3
6  two   4

In [4]:

data.sort_values(by='k2')

Out[4]:  

    k1  k2
2  one   1
1  one   2
0  one   3
3  two   3
4  two   3
5  two   3
6  two   4
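
Several rows share k2 == 3, so the order among them depends on the sorting algorithm. The sketch below sorts in descending order and requests a stable algorithm through the kind parameter, which is intended to keep tied rows in their original relative order. The variable name by_k2_desc is an assumption made for this example.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [3, 2, 1, 3, 3, 3, 4]})

# ascending=False sorts from largest to smallest;
# kind='mergesort' selects a stable sort for the single-column case
by_k2_desc = data.sort_values(by='k2', ascending=False, kind='mergesort')
print(by_k2_desc)
```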

 

5. Removing duplicate rows

In [5]:

data.drop_duplicates()  # drop rows whose k1 and k2 both repeat an earlier row (the first occurrence is kept)

Out[5]:  

    k1  k2
0  one   3
1  one   2
2  one   1
3  two   3
6  two   4

In [6]:

data.drop_duplicates('k1')  # drop rows whose k1 repeats an earlier row (the first occurrence is kept)

Out[6]:  

    k1  k2
0  one   3
3  two   3
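
Before dropping anything, it can help to see which rows count as duplicates. The sketch below uses duplicated() to build a boolean mask and keep='last' to retain the last occurrence instead of the first. The names dup_mask and last_per_k1 are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [3, 2, 1, 3, 3, 3, 4]})

# True marks a row that repeats an earlier row in every column
dup_mask = data.duplicated()
print(dup_mask.tolist())  # [False, False, False, False, True, True, False]

# Deduplicate on k1 only, keeping the last row of each k1 value
last_per_k1 = data.drop_duplicates(subset='k1', keep='last')
print(last_per_k1)
```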

 

6. Mapping values to new values

In [7]:

data1=pd.DataFrame({'food':['A1','A2','B1','B2','C1','C2','C3'],'data':[1,2,3,4,5,6,7]})
data1

Out[7]:  

  food  data
0   A1     1
1   A2     2
2   B1     3
3   B2     4
4   C1     5
5   C2     6
6   C3     7

 

6-1 Mapping with apply

In [8]:

def food_map(series):
    if series['food']=='A1':
        return 'A'
    elif series['food']=='A2':
        return 'A'
    elif series['food']=='B1':
        return 'B'
    elif series['food']=='B2':
        return 'B'
    elif series['food']=='C1':
        return 'C'
    elif series['food']=='C2':
        return 'C'
    elif series['food']=='C3':
        return 'C'
# axis='columns' passes each row to food_map as a Series
data1['food_map']=data1.apply(food_map,axis='columns')
data1

Out[8]:  

  food  data food_map
0   A1     1        A
1   A2     2        A
2   B1     3        B
3   B2     4        B
4   C1     5        C
5   C2     6        C
6   C3     7        C
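
Since food_map only reads the food column, the same result can be sketched by applying a function to that one column rather than to whole rows. The helper food_group and its first-letter rule are assumptions for this illustration. The rule works here only because every code starts with its group letter.

```python
import pandas as pd

data1 = pd.DataFrame({'food': ['A1', 'A2', 'B1', 'B2', 'C1', 'C2', 'C3'],
                      'data': [1, 2, 3, 4, 5, 6, 7]})

def food_group(code):
    """Return the group letter of a food code such as 'A1' (hypothetical helper)."""
    return code[0]

# Series.apply calls food_group once per value in the column
data1['food_map'] = data1['food'].apply(food_group)
print(data1)
```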

 

6-2 Mapping with map

In [9]:

food2Upper={
    'A1':'A',
    'A2':'A',
    'B1':'B',
    'B2':'B',
    'C1':'C',
    'C2':'C',
    'C3':'C'}  # dictionary from food code to group letter
data1['upper']=data1['food'].map(food2Upper)  # map looks up each value in the dictionary
data1

Out[9]:  

  food  data food_map upper
0   A1     1        A     A
1   A2     2        A     A
2   B1     3        B     B
3   B2     4        B     B
4   C1     5        C     C
5   C2     6        C     C
6   C3     7        C     C
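
One thing to watch with a dictionary passed to map: values with no matching key become NaN instead of raising an error. The sketch below uses a deliberately incomplete dictionary (partial_map, invented for this example) and fills the gaps with a placeholder string.

```python
import pandas as pd

data1 = pd.DataFrame({'food': ['A1', 'A2', 'B1', 'B2', 'C1', 'C2', 'C3']})

# Deliberately incomplete mapping, to show what happens to unmapped values
partial_map = {'A1': 'A', 'A2': 'A', 'B1': 'B'}

mapped = data1['food'].map(partial_map)
print(mapped.isna().sum())  # 4 values had no key in partial_map

# Replace the missing results with an explicit placeholder
with_default = mapped.fillna('unknown')
print(with_default.tolist())
```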

 

7. Adding a new column with assign

In [10]:

import numpy as np
df=pd.DataFrame({'data1':np.random.random(5),
                 'data2':np.random.random(5)})
df2=df.assign(ratio=df['data1']/df['data2'])
df2

Out[10]:  

      data1     data2     ratio
0  0.002526  0.336918  0.007498
1  0.530793  0.549558  0.965854
2  0.527832  0.229412  2.300803
3  0.902357  0.826746  1.091456
4  0.984355  0.372997  2.639041

In [11]:

df2.drop('ratio',axis='columns',inplace=True)  # delete the specified column
df2

Out[11]:  

      data1     data2
0  0.002526  0.336918
1  0.530793  0.549558
2  0.527832  0.229412
3  0.902357  0.826746
4  0.984355  0.372997
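
assign also accepts a callable, which receives the DataFrame being assigned to. This is handy in method chains. The sketch below is one possible variant, assuming a seeded NumPy generator for repeatable numbers. The column name ratio and the variables rng, df2 and df3 are chosen for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the numbers repeat between runs
df = pd.DataFrame({'data1': rng.random(5), 'data2': rng.random(5)})

# The lambda is called with the frame itself, so no outer variable is needed
df2 = df.assign(ratio=lambda d: d['data1'] / d['data2'])

# drop returns a new frame when inplace is not set; df2 keeps its ratio column
df3 = df2.drop(columns='ratio')
print(df2.columns.tolist(), df3.columns.tolist())
```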

 

8. Replacing values with replace

In [12]:

data=pd.Series([1,2,3,4,5,6,7,8,9])
data

Out[12]:

0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
dtype: int64

In [13]:

data.replace(9,np.nan,inplace=True)
data

Out[13]:

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
7    8.0
8    NaN
dtype: float64
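
replace can also handle several values in one call. The sketch below shows a dictionary of old-to-new pairs and a list of values that all receive the same replacement. The names replaced and zeroed are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# A dictionary maps each old value to its own replacement
replaced = data.replace({1: 100, 9: np.nan})

# A list of old values can share one replacement
zeroed = data.replace([7, 8], 0)

print(replaced.tolist())
print(zeroed.tolist())  # [1, 2, 3, 4, 5, 6, 0, 0, 9]
```

As in Out[13], introducing NaN into an integer Series converts it to a floating-point dtype.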

 

9. Discretization: binning data into ranges with pd.cut

In [14]:

ages=[15,20,18,25,46,89,66,80]
bins=[10,40,90]
bins_res=pd.cut(ages,bins)  # discretize into two bins: (10, 40] and (40, 90]
bins_res

Out[14]:

[(10, 40], (10, 40], (10, 40], (10, 40], (40, 90], (40, 90], (40, 90], (40, 90]]
Categories (2, interval[int64]): [(10, 40] < (40, 90]]

In [15]:

bins_res.labels  # fails: a Categorical has no 'labels' attribute

 

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-15-886c7b02dfc3> in <module>()
----> 1 bins_res.labels  # fails: a Categorical has no 'labels' attribute

AttributeError: 'Categorical' object has no attribute 'labels'
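
The integer bin index that this cell was after is available through the codes attribute of the Categorical, and the bin intervals through categories. A minimal sketch, rebuilding bins_res so the cell runs on its own:

```python
import pandas as pd

ages = [15, 20, 18, 25, 46, 89, 66, 80]
bins_res = pd.cut(ages, [10, 40, 90])

# codes holds, for each age, the position of its bin within categories
print(bins_res.codes.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1]

# categories holds the bin intervals themselves
print(bins_res.categories)
```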

In [16]:

pd.value_counts(bins_res)  # list each bin and the number of values that fall in it

Out[16]:

(40, 90]    4
(10, 40]    4
dtype: int64

In [17]:

pd.cut(ages,[10,30,50,90])  # pass the bin edges [10, 30, 50, 90] directly instead of a bins variable

Out[17]:

[(10, 30], (10, 30], (10, 30], (10, 30], (30, 50], (50, 90], (50, 90], (50, 90]]
Categories (3, interval[int64]): [(10, 30] < (30, 50] < (50, 90]]

In [18]:

group_names=['Youth','Middle','Old']
pd.value_counts(pd.cut(ages,[10,30,50,90],labels=group_names))

Out[18]:

Youth     4
Old       3
Middle    1
dtype: int64
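
In recent pandas releases the top-level pd.value_counts is deprecated in favor of the Series method, so a sketch that is more likely to keep working wraps the result of pd.cut in a Series first. Passing sort=False lists the bins in category order rather than by count. The label spellings and the names binned and counts are assumptions for this example.

```python
import pandas as pd

ages = [15, 20, 18, 25, 46, 89, 66, 80]
labels = ['Youth', 'Middle', 'Old']  # illustrative labels, one per bin

# Wrap the Categorical in a Series to use the Series.value_counts method
binned = pd.Series(pd.cut(ages, [10, 30, 50, 90], labels=labels))

# sort=False keeps the bins in category order instead of ordering by frequency
counts = binned.value_counts(sort=False)
print(counts)
```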

 

10. Inspecting missing values

In [19]:

df=pd.DataFrame([range(3),[0,np.nan,0],[0,0,np.nan],range(3)])  # build a frame that contains some missing values
df

Out[19]:  

   0    1    2
0  0  1.0  2.0
1  0  NaN  0.0
2  0  0.0  NaN
3  0  1.0  2.0

In [20]:

df.isnull()  # locate missing values: True marks a missing value

Out[20]:  

       0      1      2
0  False  False  False
1  False   True  False
2  False  False   True
3  False  False  False

In [21]:

df.isnull().any()  # by default, reports for each column whether it contains a missing value

Out[21]:

0    False
1     True
2     True
dtype: bool

In [22]:

df.isnull().any(axis=1)  # axis=1 reports for each row instead

Out[22]:

0    False
1     True
2     True
3    False
dtype: bool
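
any() only says whether something is missing. To count how many values are missing, the boolean mask can be summed, since True counts as 1. A minimal sketch on the same frame, with illustrative variable names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3)])

per_column = df.isnull().sum()        # missing values in each column
per_row = df.isnull().sum(axis=1)     # missing values in each row
total = int(df.isnull().sum().sum())  # missing values in the whole frame

print(per_column.tolist())  # [0, 1, 1]
print(per_row.tolist())     # [0, 1, 1, 0]
print(total)                # 2
```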

 

11. Filling missing values

In [23]:

df.fillna(5)  # fill every missing value with 5

Out[23]:  

   0    1    2
0  0  1.0  2.0
1  0  5.0  0.0
2  0  0.0  5.0
3  0  1.0  2.0

In [24]:

df[df.isnull().any(axis=1)]  # select the rows that contain at least one missing value

Out[24]:  

   0    1    2
1  0  NaN  0.0
2  0  0.0  NaN
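
A single fill value is not always appropriate. As a sketch of two common alternatives, fillna accepts a dictionary with one fill value per column, and dropna discards incomplete rows. The choice of the column mean for column 1 and of 0 for column 2 is an assumption made for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3)])

# Keys are column labels; each column gets its own fill value
filled = df.fillna({1: df[1].mean(), 2: 0})

# dropna removes every row that has at least one missing value
complete = df.dropna()

print(filled)
print(complete.index.tolist())  # [0, 3]
```

Both calls return new frames, so df itself still contains its missing values afterwards.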
    Original author: karina512
    Original URL: https://www.cnblogs.com/AI-robort/p/11654976.html