Pandas Library Basics: Creating and Accessing Data

August 3, 2023. Source: 元宵大师

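Introduction

Pandas is one of the best-known data analysis packages in the Python ecosystem. It is built on NumPy and adds higher-level data structures and tools. Pandas revolves around two core data structures, Series and DataFrame. This article focuses on the basic ways to create and access these two structures.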

Series

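A Series is a one-dimensional, array-like object made up of a set of data (a one-dimensional ndarray) and an associated set of data labels (the index).
Note: NumPy (Numerical Python) provides Python's multi-dimensional array object, ndarray, which supports vectorized operations and is fast and memory-efficient.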

(1) The pandas docstring describes Series as follows:

""" One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray have been overridden to automatically exclude
missing data (currently represented as NaN).

Operations between Series (+, -, /, *, **) align values based on their
associated index values-- they need not be the same length. The result
index will be the sorted union of the two indexes.

Parameters
----------
data : array-like, dict, or scalar value
    Contains data stored in Series
index : array-like or Index (1d)
    Values must be hashable and have the same length as `data`.
    Non-unique index values are allowed. Will default to
    RangeIndex(len(data)) if not provided. If both a dict and index
    sequence are used, the index will override the keys found in the
    dict.
dtype : numpy.dtype or None
    If None, dtype will be inferred
copy : boolean, default False
    Copy input data
"""
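
The index alignment mentioned in the docstring can be illustrated with a minimal sketch. The Series names and values below are made up for illustration and are not part of the original article.

```python
import pandas as pd

# Two Series with different lengths and differently ordered labels.
s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0], index=['c', 'a'])

# Addition pairs values by label, not by position.
# 'b' exists only in s1, so the result holds NaN for it.
total = s1 + s2
print(total)
```

The result index is the union of both indexes, and labels that appear on only one side yield NaN.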

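(2) The basic way to create a Series is shown below. The data can be array-like (list, ndarray), a dict, or a scalar value. The snippets in this article assume that import numpy as np and import pandas as pd have been run.

s = pd.Series(data, index=index)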


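s = pd.Series([-1.55666192,-0.75414753,0.47251231,-1.37775038,-1.64899442], index=['a', 'b', 'c', 'd', 'e']).astype('int8')
a   -1
b    0
c    0
d   -1
e   -1
dtype: int8

s = pd.Series(['a',-0.75414753,123,66666,-1.64899442], index=['a', 'b', 'c', 'd', 'e'])
a           a
b   -0.754148
c         123
d       66666
e    -1.64899
dtype: object

Note: A Series supports the NumPy dtypes, including integers, floats, complex numbers, booleans and strings. As with ndarray creation, when no dtype is given pandas tries to infer a suitable one; in the example that mixes numbers and strings the inferred dtype is object. Converting to int8 truncates the fractional part, and the values are then shown as int8. The conversion is done with astype('int8') because recent pandas versions raise a ValueError when floats with a fractional part are passed to the constructor together with dtype='int8'.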
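
A small sketch of dtype inference and explicit conversion, using shortened versions of the values above. The variable names are chosen here for illustration only.

```python
import pandas as pd

# Only floats: pandas infers float64.
floats = pd.Series([-1.55666192, -0.75414753, 0.47251231])
print(floats.dtype)

# Numbers mixed with a string: pandas falls back to object.
mixed = pd.Series(['a', -0.75414753, 123])
print(mixed.dtype)

# Explicit conversion to int8 truncates toward zero.
as_int8 = floats.astype('int8')
print(as_int8.tolist())
```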

s = pd.Series(np.random.randn(5))
0    0.485468
1   -0.912130
2    0.771970
3   -1.058117
4    0.926649
dtype: float64

s.index
RangeIndex(start=0, stop=5, step=1)

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
a    0.485468
b   -0.912130
c    0.771970
d   -1.058117
e    0.926649
dtype: float64

Note: when no index is specified, a Series automatically gets an integer index (a RangeIndex starting at 0).


s = pd.Series({'a' : 0., 'b' : 1., 'c' : 2.})
a    0.0
b    1.0
c    2.0
dtype: float64

s = pd.Series({'a' : 0., 'b' : 1., 'c' : 2.}, index=['b', 'c', 'd', 'a'])
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

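Note: a Series created from a Python dict can be viewed as a fixed-length, ordered dict. If only a dict is passed, its keys become the index of the Series. If an index is passed as well, the values whose keys match the index labels are placed at the corresponding positions, and labels without a matching key get NaN.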
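
As a minimal sketch of this matching behaviour, the labels that found no key in the dict can be listed with isna(). The names prices, missing and filled are illustrative assumptions, not from the original article.

```python
import pandas as pd

prices = {'a': 0.0, 'b': 1.0, 'c': 2.0}

# The index decides both the order and the set of labels;
# 'd' is not a key of the dict, so its value is NaN.
s = pd.Series(prices, index=['b', 'c', 'd', 'a'])

# Labels that had no matching key in the dict.
missing = s[s.isna()].index.tolist()
print(missing)

# One possible way to replace the gaps afterwards.
filled = s.fillna(0.0)
print(filled)
```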


s = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

Note: a scalar value is repeated to match the length of the index.

(3) Accessing the elements and index of a Series


s = pd.Series({'a' : 0., 'b' : 1., 'c' : 2.}, index=['b', 'c', 'd', 'a'])
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

s.values
[  1.   2.  nan   0.]

s.index
Index(['b', 'c', 'd', 'a'], dtype='object')

Note: the values and index attributes return the array representation of the Series and its index object.


s['a']
0.0

s[['a','b']]
a    0.0
b    1.0
dtype: float64

s[['a','b','c']]
a    0.0
b    1.0
c    2.0
dtype: float64

s[:2] 
b    1.0
c    2.0
dtype: float64

Note: indexing selects a single value or a group of values from a Series.
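
The same selections can be written with the explicit loc (label) and iloc (position) indexers, which avoids any ambiguity between labels and positions. This is a sketch with made-up values, not code from the original article.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 0.0], index=['b', 'c', 'd', 'a'])

by_label = s.loc['a']             # single value by label
by_labels = s.loc[['a', 'b']]     # several values, in the order requested
first_two = s.iloc[:2]            # positional slice, end excluded
label_slice = s.loc['b':'d']      # label slice, end included

print(by_label)
print(by_labels)
print(first_two)
print(label_slice)
```

Note that a label slice includes its end label, whereas a positional slice excludes its end position.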

DataFrame

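A DataFrame is a tabular (two-dimensional) data structure. It contains an ordered collection of columns, and each column can hold a different value type (numeric, string, boolean and so on). A DataFrame has both a row index and a column index, and it can be thought of as a dict of Series that share the same index.

(1) The pandas docstring describes DataFrame as follows: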

""" Two-dimensional size-mutable, potentially heterogeneous tabular
data structure with labeled axes (rows and columns). Arithmetic
operations align on both row and column labels. Can be thought of as a
dict-like container for Series objects. The primary pandas data
structure

Parameters
----------
data : numpy ndarray (structured or homogeneous), dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects
index : Index or array-like
    Index to use for resulting frame. Will default to np.arange(n) if
    no indexing information part of input data and no index provided
columns : Index or array-like
    Column labels to use for resulting frame. Will default to
    np.arange(n) if no column labels are provided
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer
copy : boolean, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input
"""

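(2) The basic way to create a DataFrame is shown below. The data can be a dict of lists, one-dimensional ndarrays or Series (plain lists and arrays must all have the same length), a two-dimensional ndarray, a dict of dicts, and so on.

df = pd.DataFrame(data, index=index)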


df = pd.DataFrame({'one': [1., 2., 3., 5], 'two': [1., 2., 3., 4.]})
   one  two
0  1.0  1.0
1  2.0  2.0
2  3.0  3.0
3  5.0  4.0

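Note: when a DataFrame is created from a dict of lists, each sequence becomes one column. Passing the lists without keys, as in df = pd.DataFrame({[1., 2., 3., 5], [1., 2., 3., 4.]}), does not work: the braces then form a set literal, and a list is unhashable, so Python raises a TypeError.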
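
The failure described above happens in plain Python before pandas is even called, which the following sketch demonstrates. The names error_message and rows are illustrative only.

```python
import pandas as pd

# Dict of lists: the keys become the column names.
df = pd.DataFrame({'one': [1.0, 2.0, 3.0, 5.0], 'two': [1.0, 2.0, 3.0, 4.0]})
print(df)

# Braces without keys build a set, and lists cannot be set members.
try:
    bad = {[1.0, 2.0], [3.0, 4.0]}
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Unnamed data can instead be passed as a list of rows plus column names.
rows = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=['one', 'two'])
print(rows)
```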


df = pd.DataFrame([[1., 2., 3., 5],[1., 2., 3., 4.]],index=['a', 'b'],columns=['one','two','three','four'])
   one  two  three  four
a  1.0  2.0    3.0   5.0
b  1.0  2.0    3.0   4.0

Note: a nested list creates a table with 2 rows and 4 columns here; the index and columns parameters set the row labels and the column names.


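data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'U10')])  # 'U10' is a 10-character string; the older 'a10' code yields bytes on Python 3
[(0,  0., '') (0,  0., '')]

Note: zeros(shape, dtype=float, order='C') returns a new array of the given shape and type, filled with zeros.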


data[:] = [(1,2.,'Hello'), (2,3.,"World")]        
df = pd.DataFrame(data)
   A    B      C
0  1  2.0  Hello
1  2  3.0  World

df = pd.DataFrame(data, index=['first', 'second'])
        A    B      C
first   1  2.0  Hello
second  2  3.0  World

df = pd.DataFrame(data, columns=['C', 'A', 'B'])
       C  A    B
0  Hello  1  2.0
1  World  2  3.0

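Note: as with a Series, a DataFrame automatically gets an integer index when none is specified. When columns is specified, the columns are arranged in the given order.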


data = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
        'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

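Note: when a DataFrame is created from a dict of Series, each Series becomes one column. If no index is given explicitly, the indexes of the Series are merged (as a union) into the row index of the result. NaN stands in for the missing column data.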


df = pd.DataFrame(data,index=['d', 'b', 'a'])
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0

df = pd.DataFrame(data,index=['d', 'b', 'a'], columns=['two', 'three'])
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN

data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data2)
   a   b     c
0  1   2   NaN
1  5  10  20.0

Note: when a DataFrame is created from a list of dicts, each dict becomes one row, and the union of the dict keys becomes the column labels.


df = pd.DataFrame(data2, index=['first', 'second'])
        a   b     c
first   1   2   NaN
second  5  10  20.0

df = pd.DataFrame(data2, columns=['a', 'b'])
   a   b
0  1   2
1  5  10

df = pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
                 ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
                 ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6}, 
                 ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},  
                 ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
       a              b
       a    b    c    a     b
A B  4.0  1.0  5.0  8.0  10.0
  C  3.0  2.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0

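Note: when a DataFrame is created from a dict of dicts, the outer keys are merged into the column index, each inner dict becomes one column, and the inner keys are merged into the row index. With tuple keys, as here, both indexes become hierarchical (MultiIndex).

(3) Accessing the elements and index of a DataFrame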
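
A reduced sketch of the tuple-key construction, showing how values can be read back from the resulting hierarchical indexes. It uses only three of the columns above; the names value and sub are illustrative.

```python
import pandas as pd

df = pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
                   ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
                   ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8}})

# Tuple keys produce a MultiIndex on the columns and on the rows.
print(type(df.columns).__name__)
print(type(df.index).__name__)

# A full row tuple plus a full column tuple addresses one cell.
value = df.loc[('A', 'B'), ('a', 'b')]
print(value)

# Selecting an outer column key returns the columns nested under it.
sub = df['a']
print(sub)
```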


data = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
        'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

df['one']  # or df.one
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

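Note: a DataFrame column can be retrieved as a Series either with dict-style notation or as an attribute. The returned Series has the same index as the original DataFrame, and its name attribute is set to the column name.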
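
Attribute access is a convenience and does not work for every column name. The sketch below, with a made-up column called count, shows one case where only the bracket form reaches the column.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0], 'count': [3, 4]}, index=['a', 'b'])

# Both forms return the same Series for an ordinary column name.
col = df['one']
same = df.one
print(col.equals(same))

# 'count' is also a DataFrame method, so df.count is the method,
# while df['count'] is the column.
print(callable(df.count))
print(df['count'].tolist())
```

Bracket notation is therefore the safer choice when column names come from external data.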


df[0:1]
   one  two
a  1.0  1.0

Note: slicing with [] selects rows by position, so df[0:1] returns the first row, not columns.
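
To make the row/column distinction concrete, here is a short sketch: a slice inside [] works on rows, while a range of columns needs iloc (or loc). The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0, None], 'two': [1.0, 2.0, 3.0, 4.0]},
                  index=['a', 'b', 'c', 'd'])

# A slice in [] is applied to rows by position.
first_row = df[0:1]
print(first_row)

# The explicit positional form gives the same result.
print(first_row.equals(df.iloc[0:1]))

# Column ranges are selected through the second axis of iloc.
first_two_cols = df.iloc[:, 0:2]
print(first_two_cols)
```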


df.loc['a']
one    1.0
two    1.0
Name: a, dtype: float64

df.loc[:,['one','two'] ]
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

df.loc[['a',],['one','two']]
   one  two
a  1.0  1.0

df.loc['a','one']
1.0

Note: loc selects data by label.


df.iloc[0:2,0:1]  
   one
a  1.0
b  2.0

df.iloc[0:2]  
   one  two
a  1.0  1.0
b  2.0  2.0

df.iloc[[0,2],[0,1]]  # pick arbitrary row positions and column positions
   one  two
a  1.0  1.0
c  3.0  3.0

Note: iloc selects data by integer position.
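
One practical difference between the two indexers is how slices treat their end point. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0, None], 'two': [1.0, 2.0, 3.0, 4.0]},
                  index=['a', 'b', 'c', 'd'])

# Label slice: both end labels are included, giving rows a, b and c.
by_label = df.loc['a':'c']
print(by_label)

# Positional slice: the end position is excluded, giving rows a and b.
by_pos = df.iloc[0:2]
print(by_pos)
```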


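df.loc['a']
one    1.0
two    1.0
Name: a, dtype: float64

df.loc['a',['one','two']]
one    1.0
two    1.0
Name: a, dtype: float64

df.loc['a',df.columns[[0,1]]]
one    1.0
two    1.0
Name: a, dtype: float64

df.loc[['a','b'],df.columns[[0,1]]]
   one  two
a  1.0  1.0
b  2.0  2.0

df.iloc[1,[0,1]]
one    2.0
two    2.0
Name: b, dtype: float64

df.iloc[[0,1],[0,1]]
   one  two
a  1.0  1.0
b  2.0  2.0

Note: older pandas versions offered an ix indexer, for example df.ix['a',[0,1]], that accepted labels and integer positions together. ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the calls above use loc for labels and iloc for positions. To mix the two, convert positions to labels first, for example with df.columns[[0,1]].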


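df.loc[df.one>1,df.columns[:1]]
   one
b  2.0
c  3.0

Note: selection by condition, here the rows where column one is greater than 1 together with the first column. The ix indexer that older tutorials use for this was removed in pandas 1.0; loc with a boolean mask is the equivalent.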
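
The condition itself is an ordinary boolean Series, which the sketch below makes explicit. The names mask and selected are illustrative, and loc is used because ix is unavailable in pandas 1.0 and later.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0, None], 'two': [1.0, 2.0, 3.0, 4.0]},
                  index=['a', 'b', 'c', 'd'])

# Comparing a column yields one boolean per row; NaN compares as False.
mask = df['one'] > 1
print(mask.tolist())

# loc keeps the rows where the mask is True and the listed columns.
selected = df.loc[mask, ['one']]
print(selected)
```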


df['one']=16.8
    one  two
a  16.8  1.0
b  16.8  2.0
c  16.8  3.0
d  16.8  4.0

val = pd.Series([2,2,2],index=['b', 'c', 'd'])
df['one']=val
   one  two
a  NaN  1.0
b  2.0  2.0
c  2.0  3.0
d  2.0  4.0

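Note: a column can be modified by assignment. When a list or array is assigned to a column, its length must match the length of the DataFrame. When a Series is assigned, its values are matched exactly against the index of the DataFrame, and any gaps are filled with NaN.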
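
The two assignment rules can be checked side by side in a short sketch. The flag length_error is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0, 4.0]}, index=['a', 'b', 'c', 'd'])

# A Series is aligned on the index: 'a' has no match and becomes NaN.
df['one'] = pd.Series([2, 2, 2], index=['b', 'c', 'd'])
print(df)

# A plain list carries no labels, so its length must equal len(df).
try:
    df['one'] = [1, 2, 3]
    length_error = False
except ValueError:
    length_error = True
print(length_error)
```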


df['four']=[3,3,3,3]
   one  two  four
a  NaN  1.0     3
b  2.0  2.0     3
c  2.0  3.0     3
d  2.0  4.0     3

Note: assigning to a column that does not exist creates a new column.


df.index.get_loc('a')
0

df.index.get_loc('b')
1

df.columns.get_loc('one')
0

Note: get_loc on the row or column index returns the integer position of a label.
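
One use for these positions, sketched under the assumption that a label needs to be combined with positional access, is to feed them into iloc. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0, None], 'two': [1.0, 2.0, 3.0, 4.0]},
                  index=['a', 'b', 'c', 'd'])

# Translate labels into integer positions.
row_pos = df.index.get_loc('b')
col_pos = df.columns.get_loc('two')
print(row_pos, col_pos)

# The positions address the same cell through iloc as the labels do through loc.
by_position = df.iloc[row_pos, col_pos]
by_label = df.loc['b', 'two']
print(by_position == by_label)
```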

    Original author: 元宵大师
    Original URL: https://segmentfault.com/a/1190000013304713