3-11 Extended groupby operations

 

In [1]:

import pandas as pd
import numpy as np
df=pd.DataFrame({'A':['foo','bar','foo','bar',
                     'foo','bar','foo','foo'],
                 'B':['one','one','two','three',
                     'two','two','one','three'],
                 'C':np.random.randn(8),
                 'D':np.random.randn(8)})  # randn draws samples from the standard normal distribution
df

Out[1]:  

     A      B         C         D
0  foo    one  1.126165 -0.676814
1  bar    one -1.429697 -0.464149
2  foo    two -0.383661 -0.309679
3  bar  three  0.945099  1.375307
4  foo    two -0.296882 -0.630503
5  bar    two  2.526570 -1.142886
6  foo    one -0.848323 -0.310705
7  foo  three -1.683177 -1.371868

In [2]:

grouped=df.groupby('A')
grouped

Out[2]:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000000052092B0>

In [3]:

grouped.count()  # count the non-null values in each group

Out[3]:  

     B  C  D
A
bar  3  3  3
foo  5  5  5

In [4]:

grouped=df.groupby(['A','B'])  # group by several keys at once
grouped.count()

Out[4]:  

           C  D
A   B
bar one    1  1
    three  1  1
    two    1  1
foo one    2  2
    three  1  1
    two    2  2

In [5]:

def get_letter_type(letter):
    if letter.lower() in 'aeiou':
        return 'a'
    else:
        return 'b'
grouped=df.groupby(get_letter_type,axis=1)  # axis=1 groups the columns; the function receives each column name
grouped.count().iloc[0]

Out[5]:

a    1
b    3
Name: 0, dtype: int64
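As far as I know, recent pandas releases (2.1 and later) deprecate axis=1 in groupby, so the cell above may warn or fail there. A minimal sketch of the same column grouping via a transpose, assuming a small hand-made frame (demo is a hypothetical stand-in for df, not data from the original):

```python
import pandas as pd

def get_letter_type(letter):
    # 'a' for vowel column names, 'b' for everything else
    return 'a' if letter.lower() in 'aeiou' else 'b'

# hypothetical frame with the same column names as df
demo = pd.DataFrame({'A': ['foo', 'bar'], 'B': ['one', 'two'],
                     'C': [1.0, 2.0], 'D': [3.0, 4.0]})

# transpose so the column names become the row index, then group on that index
by_letter = demo.T.groupby(get_letter_type).count()
first_row = by_letter[0]  # counts for original row 0, indexed by 'a' / 'b'
print(first_row)
```

Only column A has a vowel name, so group 'a' holds one column and group 'b' holds the other three, matching Out[5].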

In [6]:

s=pd.Series([1,2,3,1,2,3],[8,7,6,8,7,6])
s

Out[6]:

8    1
7    2
6    3
8    1
7    2
6    3
dtype: int64

 

1. Grouping by an index level and applying several operations

In [7]:

grouped=s.groupby(level=0)  # index levels are numbered from 0
grouped.first()  # first value in each group, so repeated index labels collapse; last() works the same way

Out[7]:

6    3
7    2
8    1
dtype: int64

In [8]:

grouped.sum()

Out[8]:

6    6
7    4
8    2
dtype: int64

In [9]:

grouped=s.groupby(level=0,sort=False)  # sort=False keeps the groups in order of first appearance
grouped.first()

Out[9]:

8    1
7    2
6    3
dtype: int64
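The separate first() and sum() calls above can also be requested in one pass with agg. A small sketch, assuming the same values as s (s_demo and summary are names introduced here for illustration):

```python
import pandas as pd

s_demo = pd.Series([1, 2, 3, 1, 2, 3], index=[8, 7, 6, 8, 7, 6])

# one column per requested operation; sort=False keeps the order 8, 7, 6
summary = s_demo.groupby(level=0, sort=False).agg(['first', 'last', 'sum'])
print(summary)
```

Each row of summary is one index label, and each column is one of the named operations.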

 

2. Selecting a single group, and grouping on a MultiIndex

In [10]:

df2=pd.DataFrame({'X':['A','B','A','B'],'Y':[1,2,3,4]})
df2

Out[10]:  

   X  Y
0  A  1
1  B  2
2  A  3
3  B  4

 

2-1 Multi-level indexing, method 1

In [11]:

df2.groupby(['X']).get_group('A')  # pull out the rows for one specific key value

Out[11]:  

   X  Y
0  A  1
2  A  3
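To see which keys are available before calling get_group, the groups attribute can be inspected. The sketch below also shows a lookup with two grouping keys, where the group name is a tuple. The column Z and the frame df_demo are hypothetical additions for illustration only:

```python
import pandas as pd

df_demo = pd.DataFrame({'X': ['A', 'B', 'A', 'B'],
                        'Z': ['u', 'u', 'v', 'v'],  # hypothetical second key column
                        'Y': [1, 2, 3, 4]})

g = df_demo.groupby('X')
print(g.groups)            # mapping of key -> row labels in that group
rows_a = g.get_group('A')  # rows where X == 'A'

# with several grouping keys, the group name is a tuple of key values
rows_bv = df_demo.groupby(['X', 'Z']).get_group(('B', 'v'))
print(rows_bv)
```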

 

2-2 Multi-level indexing, method 2

In [12]:

arrays=[['foo','bar','foo','bar', 'foo','bar','foo','foo'],
         ['one','one','two','three','two','two','one','three']]
index=pd.MultiIndex.from_arrays(arrays,names=['first','second'])  # name the two index levels
s=pd.Series(np.random.randn(8),index=index)  # attach a value to each index key
s

Out[12]:

first  second
foo    one      -0.518263
bar    one       0.583992
foo    two       1.338273
bar    three    -0.671916
foo    two       0.633448
bar    two       0.144302
foo    one       0.828419
       three    -0.834918
dtype: float64

In [13]:

grouped=s.groupby(level=0)  # group by the first index level
grouped.sum()

Out[13]:

first
bar    0.056377
foo    1.446958
dtype: float64

In [14]:

grouped=s.groupby(level='second')  # group by the second level; a level can also be given by name
grouped.sum()

Out[14]:

second
one      0.894148
three   -1.506835
two      2.116022
dtype: float64

 

3 aggregate: summing with A and B as the keys

In [15]:

grouped=df.groupby(['A','B'])
grouped.aggregate(np.sum)

Out[15]:  

                  C         D
A   B
bar one   -1.429697 -0.464149
    three  0.945099  1.375307
    two    2.526570 -1.142886
foo one    0.277842 -0.987519
    three -1.683177 -1.371868
    two   -0.680543 -0.940182

In [16]:

grouped=df.groupby(['A','B'],as_index=False)  # as_index=False: keep the group keys as ordinary columns instead of using them as the index
grouped.aggregate(np.sum)

Out[16]:  

     A      B         C         D
0  bar    one -1.429697 -0.464149
1  bar  three  0.945099  1.375307
2  bar    two  2.526570 -1.142886
3  foo    one  0.277842 -0.987519
4  foo  three -1.683177 -1.371868
5  foo    two -0.680543 -0.940182

In [17]:

grouped=df.groupby(['A','B']).sum().reset_index()  # reset_index() moves the group keys back into columns
grouped

Out[17]:  

     A      B         C         D
0  bar    one -1.429697 -0.464149
1  bar  three  0.945099  1.375307
2  bar    two  2.526570 -1.142886
3  foo    one  0.277842 -0.987519
4  foo  three -1.683177 -1.371868
5  foo    two -0.680543 -0.940182
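Out[16] and Out[17] show the same table, which suggests the two spellings are interchangeable for a plain sum. A quick check on a small hand-made frame (df_demo, flat_a and flat_b are hypothetical names used only here):

```python
import pandas as pd

df_demo = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                        'B': ['one', 'one', 'one', 'two'],
                        'C': [1.0, 2.0, 3.0, 4.0]})

flat_a = df_demo.groupby(['A', 'B'], as_index=False).sum()   # keys stay as columns
flat_b = df_demo.groupby(['A', 'B']).sum().reset_index()     # keys moved back out of the index
same = flat_a.equals(flat_b)
print(same)
```

as_index=False avoids building the MultiIndex in the first place, while reset_index() is handy when the grouped result already exists.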

In [18]:

grouped=df.groupby(['A','B'])
grouped.size()  # number of rows in each group

Out[18]:

A    B    
bar  one      1
     three    1
     two      1
foo  one      2
     three    1
     two      2
dtype: int64
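size() and the count() used earlier agree here only because df has no missing values. A small sketch of the difference, assuming a hypothetical frame df_demo with one NaN:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': ['foo', 'foo', 'bar'],
                        'C': [1.0, np.nan, 3.0]})
g = df_demo.groupby('A')

sizes = g.size()          # rows per group, NaN rows included
counts = g['C'].count()   # non-null values of C per group
print(sizes)
print(counts)
```

For group foo, size() reports two rows while count() reports one, because one of the two C values is missing.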

 

4. Summary statistics

In [19]:

grouped.describe().head()

Out[19]:  

               C
           count      mean       std       min       25%       50%       75%       max
A   B
bar one      1.0 -1.429697       NaN -1.429697 -1.429697 -1.429697 -1.429697 -1.429697
    three    1.0  0.945099       NaN  0.945099  0.945099  0.945099  0.945099  0.945099
    two      1.0  2.526570       NaN  2.526570  2.526570  2.526570  2.526570  2.526570
foo one      2.0  0.138921  1.396174 -0.848323 -0.354701  0.138921  0.632543  1.126165
    three    1.0 -1.683177       NaN -1.683177 -1.683177 -1.683177 -1.683177 -1.683177

               D
           count      mean       std       min       25%       50%       75%       max
A   B
bar one      1.0 -0.464149       NaN -0.464149 -0.464149 -0.464149 -0.464149 -0.464149
    three    1.0  1.375307       NaN  1.375307  1.375307  1.375307  1.375307  1.375307
    two      1.0 -1.142886       NaN -1.142886 -1.142886 -1.142886 -1.142886 -1.142886
foo one      2.0 -0.493760  0.258878 -0.676814 -0.585287 -0.493760 -0.402232 -0.310705
    three    1.0 -1.371868       NaN -1.371868 -1.371868 -1.371868 -1.371868 -1.371868
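The NaN entries in the std column come from groups that contain a single row: the sample standard deviation divides by n-1, which is zero there. A minimal sketch on hand-made data (df_demo and the result names are assumptions for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({'A': ['foo', 'foo', 'bar'],
                        'C': [1.0, 3.0, 5.0]})
g = df_demo.groupby('A')['C']

stats = g.describe()             # std is NaN for the single-row group 'bar'
std_sample = g.std()             # default ddof=1, the same estimator describe() reports
std_population = g.std(ddof=0)   # ddof=0 divides by n, so a single-row group gives 0.0
print(stats[['count', 'mean', 'std']])
```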

 

5. Computing selected statistics with agg

In [20]:

grouped=df.groupby('A')
grouped['C'].agg([np.sum,np.mean,np.std])

Out[20]:  

          sum      mean       std
A
bar  2.041972  0.680657  1.991346
foo -2.085878 -0.417176  1.023003

In [21]:

grouped['C'].agg({'sum1':np.sum,'mean1':np.mean,'std1':np.std})  # rename the output columns

 

E:\software\Anaconda3 5.2.0\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version. Use                 named aggregation instead.

    >>> grouper.agg(name_1=func_1, name_2=func_2)

  """Entry point for launching an IPython kernel.

Out[21]:  

         sum1     mean1      std1
A
bar  2.041972  0.680657  1.991346
foo -2.085878 -0.417176  1.023003
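The FutureWarning above recommends named aggregation, and to my knowledge later pandas releases reject the dict form on a Series outright. A sketch of the keyword form the warning suggests, on a small hypothetical frame (df_demo and renamed are illustrative names):

```python
import pandas as pd

df_demo = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                        'C': [1.0, 3.0, 2.0, 6.0]})

# each keyword becomes an output column name; the value names the aggregation
renamed = df_demo.groupby('A')['C'].agg(sum1='sum', mean1='mean', std1='std')
print(renamed)
```

Passing the aggregations as strings rather than np.sum and friends is a choice made here to sidestep warnings that some pandas versions emit for NumPy callables; the column names sum1, mean1 and std1 are the ones used in the original cell.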
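    Original author: karina512
    Original URL: https://www.cnblogs.com/AI-robort/p/11678956.html
    This article is reposted from the web for the purpose of sharing knowledge only. If it infringes your rights, please contact the blog owner to have it removed.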