Introduction to pandas Data Structures

Series

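A Series is a one-dimensional array-like object that holds a sequence of values together with an associated array of data labels, called its index.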

import pandas as pd

obj = pd.Series([4, 7, -5, 3])
print(obj)
# 0    4
# 1    7
# 2   -5
# 3    3
# dtype: int64

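Unlike the ndarray seen earlier, the printed Series has a column of index labels on the left. When no index is specified, a default one running from 0 to N-1 is created. The values and index attributes return the object's values and its index, respectively.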

obj = pd.Series([4, 7, -5, 3])
print(obj.values)
# [ 4  7 -5  3]
print(obj.index)
# RangeIndex(start=0, stop=4, step=1)

You can pass an array of labels as index to specify the index you want, and later use those labels to look up the corresponding values.

obj = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj)
# d    4
# b    7
# a   -5
# c    3
# dtype: int64
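As a minimal sketch of looking values up through these labels, the snippet below selects one value, selects several values with a list of labels, and assigns through a label. The variable names single and subset are illustrative only.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Look up a single value by its label
single = obj['a']

# Look up several values with a list of labels; the result is a new Series
subset = obj[['c', 'a', 'd']]

# Assign a new value through a label
obj['d'] = 6

print(single)
print(subset)
print(obj)
```

Note that a list of labels returns the values in the order requested, not in the order they are stored.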

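NumPy-style operations, such as filtering with a boolean array, preserve the link between each index label and its value. For this reason a Series can also be thought of as a fixed-length, ordered dict, and a Series can be created directly from a dict.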

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj = pd.Series(sdata)
print(obj)
# Ohio      35000
# Texas     71000
# Oregon    16000
# Utah       5000
# dtype: int64
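The earlier points about NumPy-style operations and dict-like behavior can be sketched as follows. This is only an illustration: the names positive, doubled, and round_trip are made up for the example, and the data is reused from the blocks above.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Boolean filtering keeps each surviving value attached to its label
positive = obj[obj > 0]

# Element-wise arithmetic and NumPy functions also keep the labels
doubled = obj * 2
exponentials = np.exp(obj)

# Dict-like membership test against the index labels
print('b' in obj)
print('e' in obj)

# A Series built from a dict can be converted back with to_dict
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
round_trip = pd.Series(sdata).to_dict()

print(positive)
print(doubled)
```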

When creating a Series from a dict, you can also pass an index to specify which keys to include and in what order.

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj = pd.Series(sdata, index=states)
print(obj)
# California        NaN
# Ohio          35000.0
# Oregon        16000.0
# Texas         71000.0
# dtype: float64

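In the example above, no value for California is found in the dict, so its entry is missing (NaN), while Utah is excluded because it does not appear in states. pandas provides the isnull and notnull functions to detect missing data; both are also available as Series instance methods.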

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj = pd.Series(sdata, index=states)
print(pd.isnull(obj))
# California     True
# Ohio          False
# Oregon        False
# Texas         False
# dtype: bool
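The instance-method form mentioned above can be sketched like this. The variable names missing and present are illustrative; the last line uses the boolean result as a filter to keep only the entries that have data.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj = pd.Series(sdata, index=states)

# Same checks as pd.isnull / pd.notnull, called as methods on the Series
missing = obj.isnull()
present = obj.notnull()

print(missing)

# The boolean Series can be used directly as a filter
print(obj[present])
```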

Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality.

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj = pd.Series(sdata, index=states)
obj.name = 'population'
obj.index.name = 'state'
print(obj)
# state
# California        NaN
# Ohio          35000.0
# Oregon        16000.0
# Texas         71000.0
# Name: population, dtype: float64

Note: after creation, a Series's index can still be altered in place by assigning a new sequence of labels to index.
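A minimal sketch of replacing the index by assignment is shown below. The labels are made up for illustration. The second assignment is included only to show that a sequence of the wrong length is rejected.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# Replace the default 0..N-1 index with a new sequence of labels
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
print(obj)

# The new sequence must have the same length as the Series
try:
    obj.index = ['x', 'y']
except ValueError as exc:
    print('rejected:', exc)
```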

DataFrame

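A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can have a different value type. A DataFrame has both a row index and a column index. Internally, the data is stored as one or more two-dimensional blocks rather than as a list, dict, or other collection of one-dimensional arrays.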

There are many ways to construct a DataFrame; one of the most common is from a dict of equal-length lists or NumPy arrays.

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
print(frame)
#     state  year  pop
# 0    Ohio  2000  1.5
# 1    Ohio  2001  1.7
# 2    Ohio  2002  3.6
# 3  Nevada  2001  2.4
# 4  Nevada  2002  2.9
# 5  Nevada  2003  3.2

The resulting DataFrame is assigned an index automatically, just as a Series is.

For large DataFrames, the head method selects only the first five rows.
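As a small illustration of head, the sketch below reuses the six-row data from above. Passing an integer to head to request a different number of rows is standard pandas behavior, shown here as an aside rather than something the original text covers.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

# With no argument, head returns the first five rows
first_five = frame.head()

# An integer argument selects a different number of leading rows
first_two = frame.head(2)

print(first_five)
print(first_two)
```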

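Passing columns arranges the columns in the specified order, and passing index sets the row labels. As with Series, a column name that is not found in the data appears in the result filled with missing values.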

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                     index=['one', 'two', 'three', 'four', 'five', 'six'])
print(frame)
#        year   state  pop debt
# one    2000    Ohio  1.5  NaN
# two    2001    Ohio  1.7  NaN
# three  2002    Ohio  3.6  NaN
# four   2001  Nevada  2.4  NaN
# five   2002  Nevada  2.9  NaN
# six    2003  Nevada  3.2  NaN

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute.

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                     index=['one', 'two', 'three', 'four', 'five', 'six'])
# The column name must be a quoted string; frame.state is the attribute form
print(frame['state'])
# one        Ohio
# two        Ohio
# three      Ohio
# four     Nevada
# five     Nevada
# six      Nevada
# Name: state, dtype: object

Rows can also be selected by label with the loc attribute (covered later).
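A brief sketch of selecting a row by label with loc, reusing the frame from above. The row comes back as a Series whose index is the DataFrame's column names; the variable name row is illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                     index=['one', 'two', 'three', 'four', 'five', 'six'])

# Select the row labelled 'three'; the result is a Series
row = frame.loc['three']

print(row)
print(row['state'])
```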

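Columns can be modified by assignment. For example, the empty debt column can be assigned a scalar value or an array of values; when assigning an array or list, its length must match the length of the DataFrame.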

import numpy as np

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                     index=['one', 'two', 'three', 'four', 'five', 'six'])
# Assign a scalar: every row gets the same value
frame['debt'] = 16.5
print(frame)
#        year   state  pop  debt
# one    2000    Ohio  1.5  16.5
# two    2001    Ohio  1.7  16.5
# three  2002    Ohio  3.6  16.5
# four   2001  Nevada  2.4  16.5
# five   2002  Nevada  2.9  16.5
# six    2003  Nevada  3.2  16.5
# Assign an array whose length matches the number of rows
frame['debt'] = np.arange(6.)
print(frame)
#        year   state  pop  debt
# one    2000    Ohio  1.5   0.0
# two    2001    Ohio  1.7   1.0
# three  2002    Ohio  3.6   2.0
# four   2001  Nevada  2.4   3.0
# five   2002  Nevada  2.9   4.0
# six    2003  Nevada  3.2   5.0

When a Series is assigned to a column, its labels are realigned to the DataFrame's index, and missing values are inserted wherever a label has no match.

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                     index=['one', 'two', 'three', 'four', 'five', 'six'])
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame['debt'] = val
print(frame)
#        year   state  pop  debt
# one    2000    Ohio  1.5   NaN
# two    2001    Ohio  1.7  -1.2
# three  2002    Ohio  3.6   NaN
# four   2001  Nevada  2.4  -1.5
# five   2002  Nevada  2.9  -1.7
# six    2003  Nevada  3.2   NaN

Assigning to a column that does not exist creates a new column. The del keyword deletes a column.
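Creating and then deleting a column can be sketched as follows. The column name eastern and the two helper variables are hypothetical, chosen only for this example.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

# Assigning to a column name that does not exist yet creates the column
frame['eastern'] = frame['state'] == 'Ohio'
columns_with_new = list(frame.columns)
print(frame)

# del removes the column again
del frame['eastern']
columns_after_del = list(frame.columns)
print(columns_after_del)
```

New columns have to be created with the bracket form; the attribute form (frame.eastern = ...) does not add a column.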

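Note: a column selected from a DataFrame can be a view on the underlying data rather than a copy, so in-place modifications to that Series may be reflected in the DataFrame (the exact behavior depends on the pandas version and on whether Copy-on-Write is enabled). If you need an independent copy, call the Series's copy method explicitly.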
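A minimal sketch of taking an explicit copy, under the assumption that the goal is to modify the selected column without touching the original frame. The variable name pop_copy is illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

# copy() returns a Series that owns its own data
pop_copy = frame['pop'].copy()

# Changing the copy leaves the DataFrame untouched
pop_copy.iloc[0] = 99.0

print(pop_copy.iloc[0])
print(frame['pop'].iloc[0])
```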

If a nested dict of dicts is passed, pandas interprets the outer dict keys as the columns and the inner keys as the row index.

pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = pd.DataFrame(pop)
print(frame)
#       Nevada  Ohio
# 2000     NaN   1.5
# 2001     2.4   1.7
# 2002     2.9   3.6

A DataFrame can be transposed (rows and columns swapped) with syntax similar to that of a NumPy array.
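The transpose can be sketched with the T attribute, which mirrors the ndarray attribute of the same name. The variable name transposed is illustrative.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = pd.DataFrame(pop)

# Swap rows and columns: states become the rows, years become the columns
transposed = frame.T

print(transposed)
print(transposed.loc['Ohio', 2000])
```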

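In the output above, the keys of the inner dicts are combined and sorted to form the index of the result. If an index is specified explicitly, the inner keys are not sorted again and the given index is used as is.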

pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = pd.DataFrame(pop, index=[2001, 2002, 2003])
print(frame)
#       Nevada  Ohio
# 2001     2.4   1.7
# 2002     2.9   3.6
# 2003     NaN   NaN

If a DataFrame's index and columns have their name attributes set, these are displayed as well.

pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = pd.DataFrame(pop)
frame.index.name = 'year'
frame.columns.name = 'state'
print(frame)
# state  Nevada  Ohio
# year
# 2000      NaN   1.5
# 2001      2.4   1.7
# 2002      2.9   3.6

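The values attribute returns the data contained in a DataFrame as a two-dimensional ndarray. If the columns have different dtypes, the dtype of the array is chosen to accommodate all of the columns.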
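The dtype behavior of values can be sketched with two frames built from the data used earlier: one with only floating-point columns and one that mixes strings, integers, and floats. The variable names are illustrative.

```python
import pandas as pd

# All columns are floating point, so the array is float64
pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
numeric_values = pd.DataFrame(pop).values

# Strings, integers, and floats together fall back to the object dtype
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
mixed_values = pd.DataFrame(data).values

print(numeric_values.dtype, numeric_values.shape)
print(mixed_values.dtype, mixed_values.shape)
```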

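Some of the valid inputs to the DataFrame constructor are listed below.

[Table: valid inputs to the DataFrame constructor; image failed to load]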
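As a partial illustration, and not a reconstruction of the original table, the sketch below shows three kinds of input that the DataFrame constructor is known to accept in addition to the dict forms used earlier: a two-dimensional ndarray, a dict of Series, and a list of dicts. All names and values here are made up for the example.

```python
import numpy as np
import pandas as pd

# 2D ndarray: row and column labels are optional
from_ndarray = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])

# Dict of Series: each Series becomes a column, and the row labels are
# the union of the individual indexes
from_series = pd.DataFrame({
    'a': pd.Series([1, 2], index=['x', 'y']),
    'b': pd.Series([3, 4], index=['y', 'z']),
})

# List of dicts: each dict becomes a row, and keys become column labels
from_records = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3}])

print(from_ndarray)
print(from_series)
print(from_records)
```

In the last two cases, positions with no corresponding data are filled with missing values.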

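Index Objects

Index objects hold the axis labels and other metadata, such as the axis name or names. Any array or other sequence of labels used when constructing a Series or DataFrame is converted internally to an Index.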

Index objects are immutable, so they cannot be modified by the user.
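Immutability can be sketched by attempting an item assignment on an Index, which raises a TypeError. The variable names below are illustrative.

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index

error = None
try:
    # Item assignment on an Index is not allowed
    index[1] = 'd'
except TypeError as exc:
    error = exc

print(type(error).__name__)
print(list(index))
```

To change labels, assign a whole new sequence to the index attribute of the Series or DataFrame instead of editing the Index in place.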

Some Index methods and properties are listed below.

[Table: Index methods and properties; image failed to load]
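As a partial illustration, and not a reconstruction of the original table, the sketch below exercises a few Index methods and properties that exist in pandas: union, intersection, difference, isin, append, and is_unique. The labels are made up for the example.

```python
import pandas as pd

a = pd.Index(['Ohio', 'Texas', 'Oregon'])
b = pd.Index(['Texas', 'Utah'])

# Set-like operations return new Index objects
both = a.union(b)
common = a.intersection(b)
only_a = a.difference(b)

# Membership tests
mask = a.isin(['Ohio', 'Utah'])
has_ohio = 'Ohio' in a

# Unlike a Python set, an Index may contain duplicate labels
appended = a.append(b)

print(both)
print(common)
print(only_a)
print(mask)
print(appended, appended.is_unique)
```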

Author: YihangBao
Link: https://roarboil.github.io/2019/09/05/pandasintro/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.