Essential Functionality

This section introduces the basic mechanics of interacting with the data held in a Series or DataFrame.

Reindexing

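The reindex method creates a new object whose data conforms to a new index. Calling reindex on a Series rearranges the data according to the new index; any index value that was not already present gets a missing value (NaN).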

import numpy as np
import pandas as pd

obj1 = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d','b','a','c'])
print(obj1) ==>
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
obj2 = obj1.reindex(['a','b','c','d','e'])
print(obj2) ==>
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64

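For ordered data such as time series, you may want to interpolate or fill values when reindexing. The optional method argument does this; for example, method='ffill' forward-fills, carrying the last valid value forward into the new positions.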

obj1 = pd.Series(['Blue','purple','yellow'], index=[0,2,4])
print(obj1) ==>
0 Blue
2 purple
4 yellow
dtype: object
obj2 = obj1.reindex(range(6), method='ffill')
print(obj2) ==>
0 Blue
1 Blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object

With a DataFrame, reindex can change the row index, the column index, or both. When passed only a sequence, it reindexes the rows of the result.

frame = pd.DataFrame(np.arange(9).reshape((3,3)),index = ['a','c','d'],columns=['Ohio','Texas','California'])
print(frame) ==>
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
frame1 = frame.reindex(['a','b','c','d'])
print(frame1) ==>
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0

Similarly, to reindex the columns, use the columns keyword.
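
As a minimal sketch of the columns keyword: the states list below is an assumption chosen for illustration (it is not part of the original post), and it includes one column name that does not exist in frame.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Hypothetical new column order; 'Utah' does not exist in frame
states = ['Texas', 'Utah', 'California']

# Reindex the columns only: the missing column is filled with NaN
by_columns = frame.reindex(columns=states)

# Reindex rows and columns in a single call
both = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)

print(by_columns)
print(both)
```

Passing index= and columns= together is one way to change both axes at once; the new row 'b' and the new column 'Utah' both come back as NaN.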

Dropping Entries from an Axis

The drop method returns a new object with the indicated value or values removed from an axis.

obj = pd.Series(np.arange(5), index =['a','b','c','d','e'])
print(obj) ==>
a 0
b 1
c 2
d 3
e 4
dtype: int64
obj1 = obj.drop('c')
print(obj1) ==>
a 0
b 1
d 3
e 4
dtype: int64

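With a DataFrame, calling drop with a sequence of labels removes values by row label (axis 0). Pass axis='columns' (or axis=1) to remove values from the columns instead.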

data = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['Ohio','Colorado','Utah','New York'],
columns=['one','two','three','four'])
print(data) ==>
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
print(data.drop(['Colorado','Ohio'])) ==>
one two three four
Utah 8 9 10 11
New York 12 13 14 15
print(data.drop('two',axis='columns')) ==>
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15

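All of the calls above return a new object and leave the original unchanged. Passing inplace=True instead modifies the original object in place and returns None, so use it with care: the dropped data is destroyed.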

data = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['Ohio','Colorado','Utah','New York'],
columns=['one','two','three','four'])
data.drop('Utah',inplace=True)
print(data) ==>
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 12 13 14 15

Indexing, Selection, and Filtering

Series indexing (obj[...]) works much like NumPy array indexing, except that the index values need not be integers.

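Ordinary Python slices exclude the endpoint, but slicing a Series with labels is different: the end label is included. Assigning to such a selection modifies the corresponding part of the Series.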

obj =pd.Series(np.arange(4.),index=['a','b','c','d'])
print(obj) ==>
a 0.0
b 1.0
c 2.0
d 3.0
dtype: float64
print(obj['b':'c']) ==>
b 1.0
c 2.0
dtype: float64
obj['b':'c']=5
print(obj) ==>
a 0.0
b 5.0
c 5.0
d 3.0
dtype: float64
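
The different ways of indexing a Series mentioned in this section can be sketched side by side. This is only an illustrative example; the variable names are assumptions made here, not names from the original post.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj['b']                 # single label -> scalar
by_labels = obj[['b', 'a', 'd']]    # list of labels -> Series in that order
by_mask = obj[obj < 2]              # boolean mask, as with NumPy arrays

# Label slices include the end label; positional slices do not
label_slice = obj.loc['b':'c']      # entries 'b' and 'c'
position_slice = obj.iloc[1:2]      # only position 1, i.e. 'b'

print(label_slice)
print(position_slice)
```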

Indexing a DataFrame with a single value or a sequence retrieves one or more columns.

data = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['Ohio','Colorado','Utah','New York'],
columns=['one','two','three','four'])
print(data[['three','one']]) ==>
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12

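Rows, on the other hand, can be selected with slice syntax (data[:2] selects the first two rows), and NumPy-style boolean indexing works as well.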

data = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['Ohio','Colorado','Utah','New York'],
columns=['one','two','three','four'])
print(data) ==>
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
print(data[:2]) ==>
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
data[data<5]=0
print(data) ==>
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

For label-based indexing on the rows of a DataFrame, use the special indexing operators loc (axis labels) and iloc (integer positions).

data = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['Ohio','Colorado','Utah','New York'],
columns=['one','two','three','four'])
print(data.loc['Colorado',['two','three']]) ==>
two 5
three 6
Name: Colorado, dtype: int64
print(data.iloc[2,[3,0,1]]) ==>
four 11
one 8
two 9
Name: Utah, dtype: int64

Besides single labels and lists of labels, loc and iloc also accept slices.

data = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['Ohio','Colorado','Utah','New York'],
columns=['one','two','three','four'])
print(data.loc[:'Utah','two']) ==>
Ohio 1
Colorado 5
Utah 9
Name: two, dtype: int64
print(data.iloc[:,:3][data.three>5]) ==>
one two three
Colorado 4 5 6
Utah 8 9 10
New York 12 13 14

Integer Indexes

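Indexing with integers is often ambiguous, because an integer can be read either as a label or as a position. For data selection it is better to be explicit: use loc for labels or iloc for integer positions.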

ser = pd.Series(np.arange(3.))
print(ser) ==>
0 0.0
1 1.0
2 2.0
dtype: float64
print(ser.loc[:1]) ==>
0 0.0
1 1.0
dtype: float64
print(ser.iloc[:1]) ==>
0 0.0
dtype: float64
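
To make the ambiguity concrete, here is a small sketch (the variable names are illustrative assumptions). With an integer index, plain bracket indexing treats -1 as a label rather than as "the last position", so the lookup fails, while iloc and loc state the intent explicitly.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))      # default integer index 0, 1, 2

# ser[-1] is a label lookup here, and there is no label -1,
# so pandas raises KeyError instead of returning the last element
try:
    ser[-1]
    raised = False
except KeyError:
    raised = True

last_by_position = ser.iloc[-1]     # unambiguous: last element by position
first_two_by_label = ser.loc[0:1]   # unambiguous: labels 0 through 1, inclusive

print(raised, last_by_position)
print(first_two_by_label)
```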

Arithmetic and Data Alignment

When two objects are added, the index of the result is the union of the two indexes.

s1 = pd.Series([7.3,-2.5,3.4,1.5], index=['a','c','d','e'])
s2 = pd.Series([-2.1,3.6,-1.5,4,3.1], index=['a','c','e','f','g'])
print(s1) ==>
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
print(s2) ==>
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
print(s1+s2) ==>
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64

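As the example shows, for labels present in both objects the result is the sum of the two values, whereas the internal data alignment introduces missing values at labels that do not overlap.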

With a DataFrame, alignment is performed on both the rows and the columns.

df1 = pd.DataFrame(np.arange(9.).reshape((3,3)),columns=list('bcd'),index=['Ohio','Texas','Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
print(df1) ==>
b c d
Ohio 0.0 1.0 2.0
Texas 3.0 4.0 5.0
Colorado 6.0 7.0 8.0
print(df2) ==>
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
print(df1+df2) ==>
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN

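As this example shows, the index and columns of the result are the unions of the index and columns of each DataFrame, and missing values appear wherever a label is not present in both.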

Arithmetic Methods with Fill Values

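The operations above produced missing values in some places. To supply a fill value for an axis label that exists in one object but not in the other, use the add method and pass the fill_value argument.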

df1 = pd.DataFrame(np.arange(9.).reshape((3,3)),columns=list('bcd'),index=['Ohio','Texas','Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
print(df1) ==>
b c d
Ohio 0.0 1.0 2.0
Texas 3.0 4.0 5.0
Colorado 6.0 7.0 8.0
print(df2) ==>
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
print(df1.add(df2,fill_value=0)) ==>
b c d e
Colorado 6.0 7.0 8.0 NaN
Ohio 3.0 1.0 6.0 5.0
Oregon 9.0 NaN 10.0 11.0
Texas 9.0 4.0 12.0 8.0
Utah 0.0 NaN 1.0 2.0

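Note: the fill value stands in for an entry that is missing from one of the objects and then takes part in the addition; it does not replace NaN in the result. Positions that are missing from both objects, such as Colorado/e above, remain NaN.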

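Commonly used arithmetic methods include add, sub, mul, div, floordiv, and pow. Each has a reversed counterpart prefixed with r (radd, rsub, rmul, rdiv, rfloordiv, rpow) that swaps the two operands.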
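
A short sketch of how an arithmetic method relates to its reversed (r-prefixed) counterpart. The DataFrame below is made up for illustration and starts at 1 to avoid dividing by zero.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1., 13.).reshape((3, 4)), columns=list('abcd'))

# rdiv swaps the operands: df.rdiv(1) computes 1 / df
reciprocal = df.rdiv(1)
same = (1 / df).equals(reciprocal)

# The other methods follow the same pattern, e.g. sub and rsub
diff = df.sub(1)      # df - 1
rdiff = df.rsub(1)    # 1 - df

print(same)
print(rdiff)
```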

Operations Between DataFrame and Series

Operations between a DataFrame and a Series rely on broadcasting.

df = pd.DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
ser = df.iloc[0]
print(df) ==>
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
print(ser) ==>
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
print(df-ser) ==>
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0

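Here, because of broadcasting, the subtraction is performed once for each row. By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows.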
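
When the index of the Series only partly overlaps the columns of the DataFrame, the same alignment rules as before apply. The sketch below uses a hypothetical Series, ser2, that is not in the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                  columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Hypothetical Series whose index only partly overlaps df's columns
ser2 = pd.Series([0., 1., 2.], index=['b', 'e', 'f'])

# The result's columns are the union b, d, e, f;
# 'd' (missing in ser2) and 'f' (missing in df) are entirely NaN
result = df + ser2
print(result)
```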

To instead match on the rows and broadcast across the columns, you have to use one of the arithmetic methods:

df = pd.DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
ser = df['d']
print(df) ==>
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
print(ser) ==>
Utah 1.0
Ohio 4.0
Texas 7.0
Oregon 10.0
Name: d, dtype: float64
print(df.sub(ser,axis='index')) ==>
b d e
Utah -1.0 0.0 1.0
Ohio -1.0 0.0 1.0
Texas -1.0 0.0 1.0
Oregon -1.0 0.0 1.0

The axis argument specifies the axis to match on; the operation is then broadcast along the other axis.

Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work with pandas objects:

frame = pd.DataFrame(np.random.randn(4,3), columns=list('bde'), index=['Utah','Ohio','Texas','Oregon'])
print(frame) ==>
b d e
Utah 0.823234 -1.230791 0.679275
Ohio 0.820003 -0.280677 -0.684320
Texas -0.373359 -1.338919 -0.629548
Oregon -0.059928 -2.414072 0.476505
print(np.abs(frame)) ==>
b d e
Utah 0.823234 1.230791 0.679275
Ohio 0.820003 0.280677 0.684320
Texas 0.373359 1.338919 0.629548
Oregon 0.059928 2.414072 0.476505

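Another frequent operation is applying a user-defined function to the one-dimensional array formed by each row or column. The DataFrame apply method does exactly this.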

frame = pd.DataFrame(np.random.randn(4,3), columns=list('bde'), index=['Utah','Ohio','Texas','Oregon'])
print(frame) ==>
b d e
Utah 0.371684 -0.428336 0.366052
Ohio 0.508354 -1.108785 -1.001388
Texas -0.751085 0.879442 -1.595899
Oregon -0.270006 -0.010182 -0.677792
f = lambda x: x.max()-x.min()
print(frame.apply(f)) ==>
b 1.259439
d 1.988227
e 1.961951
dtype: float64

Here f computes the difference between the maximum and minimum of each column of frame. The result is a Series indexed by the columns of frame.

To apply the function once per row instead (that is, across the columns), pass axis='columns'.
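
A minimal sketch of the two directions, using fixed values rather than random ones so the result is reproducible. The function name spread is an assumption made for this illustration.

```python
import numpy as np
import pandas as pd

# Squares of 0..11 arranged as 4 rows by 3 columns
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)) ** 2,
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def spread(x):
    # x is one column (default) or one row (axis='columns'), passed as a Series
    return x.max() - x.min()

per_column = frame.apply(spread)                 # indexed by 'b', 'd', 'e'
per_row = frame.apply(spread, axis='columns')    # indexed by the state names

print(per_column)
print(per_row)
```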

The function passed to apply need not return a scalar; it can also return a Series with multiple values.

def f(x):
    return pd.Series([x.min(),x.max()],index=['min','max'])
frame = pd.DataFrame(np.random.randn(4,3), columns=list('bde'), index=['Utah','Ohio','Texas','Oregon'])
print(frame) ==>
b d e
Utah 0.901981 1.239383 1.412554
Ohio 1.584249 0.120927 -0.849564
Texas -0.720863 0.607950 -0.481181
Oregon 0.641740 1.217635 -0.589586
print(frame.apply(f)) ==>
b d e
min -0.720863 0.120927 -0.849564
max 1.584249 1.239383 1.412554

Element-wise Python functions can be used too. To compute a formatted string from each floating-point value in frame, use the applymap method.

frame = pd.DataFrame(np.random.randn(4,3), columns=list('bde'), index=['Utah','Ohio','Texas','Oregon'])
print(frame) ==>
b d e
Utah -0.762464 -0.076285 -0.177307
Ohio 0.755204 -2.211331 -0.259672
Texas 0.233797 0.539000 1.620073
Oregon 0.738784 -0.342989 -0.006291
# named fmt to avoid shadowing the built-in format()
fmt = lambda x:'%.2f' % x
print(frame.applymap(fmt)) ==>
b d e
Utah -0.76 -0.08 -0.18
Ohio 0.76 -2.21 -0.26
Texas 0.23 0.54 1.62
Oregon 0.74 -0.34 -0.01
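
The same element-wise idea applies to a single column through the Series map method. The sketch below uses a small hand-written frame and a helper named two_decimals, both assumptions made for illustration. Note also that recent pandas releases (2.1 and later) deprecate DataFrame.applymap in favor of DataFrame.map, so the call above may emit a warning depending on the installed version.

```python
import pandas as pd

frame = pd.DataFrame({'b': [-0.7625, 0.7552], 'e': [-0.1773, -0.2597]},
                     index=['Utah', 'Ohio'])

def two_decimals(x):
    # Format a single float with two decimal places
    return '%.2f' % x

# Series.map applies the function to each element of one column
formatted = frame['e'].map(two_decimals)
print(formatted)
```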

Sorting and Ranking

The sort_index method sorts by row or column index and returns a new, sorted object.

obj = pd.Series(range(4),index=['d','a','b','c'])
print(obj) ==>
d 0
a 1
b 2
c 3
dtype: int64
print(obj.sort_index()) ==>
a 1
b 2
c 3
d 0
dtype: int64

By default the data is sorted by index in ascending order.

To sort by the column labels, pass axis=1; to sort in descending order, pass ascending=False.

frame = pd.DataFrame(np.arange(8).reshape((2,4)),index=['three','one'],columns=['d','a','b','c'])
print(frame) ==>
d a b c
three 0 1 2 3
one 4 5 6 7
print(frame.sort_index()) ==>
d a b c
one 4 5 6 7
three 0 1 2 3
print(frame.sort_index(axis=1)) ==>
a b c d
three 1 2 3 0
one 5 6 7 4
print(frame.sort_index(axis=1,ascending=False)) ==>
d c b a
three 0 3 2 1
one 4 7 6 5

To sort by values, use the sort_values() method. By default, all missing values (NaN) are sorted to the end.

obj = pd.Series([4, np.nan,7, np.nan,-3,2])
print(obj) ==>
0 4.0
1 NaN
2 7.0
3 NaN
4 -3.0
5 2.0
dtype: float64
print(obj.sort_values()) ==>
4 -3.0
5 2.0
0 4.0
2 7.0
1 NaN
3 NaN
dtype: float64
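
The placement of missing values can be adjusted. As a brief sketch, sort_values also accepts ascending=False and na_position='first'; the variable names below are illustrative assumptions.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# Descending order: NaN values still go to the end
descending = obj.sort_values(ascending=False)

# na_position='first' moves the NaN values to the top instead
nan_first = obj.sort_values(na_position='first')

print(descending)
print(nan_first)
```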

When sorting a DataFrame, you can use one or more columns as the sort keys by passing them to the by argument.

frame = pd.DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})
print(frame) ==>
b a
0 4 0
1 7 1
2 -3 0
3 2 1
print(frame.sort_values(by='b')) ==>
b a
2 -3 0
3 2 1
0 4 0
1 7 1
print(frame.sort_values(by=['a','b'])) ==>
b a
2 -3 0
0 4 0
3 2 1
1 7 1

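Ranking assigns ranks from 1 up to the number of valid data points. By default, rank breaks ties by assigning each tied group the mean rank, which is why values such as 6.5 and 4.5 appear below. To rank tied values by the order in which they are observed in the data instead, pass method='first'.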

obj = pd.Series([7,-5,7,4,2,0,4])
print(obj) ==>
0 7
1 -5
2 7
3 4
4 2
5 0
6 4
dtype: int64
print(obj.rank()) ==>
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
print(obj.rank(method='first')) ==>
0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64
print(obj.rank(ascending=False, method='max')) ==>
0 2.0
1 7.0
2 2.0
3 4.0
4 5.0
5 6.0
6 4.0
dtype: float64

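The method options for breaking ties are 'average' (the default; the mean rank of the tied group), 'min' (the lowest rank in the group), 'max' (the highest rank in the group), 'first' (ranks assigned in order of appearance in the data), and 'dense' (like 'min', but ranks always increase by 1 from one group to the next).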
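
To compare the tie-breaking methods side by side, one possible sketch is to build a small table with one column per method. The layout is an assumption made here for illustration, not something from the original post.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method, rows aligned with obj
methods = ['average', 'min', 'max', 'first', 'dense']
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})

print(ranks)
```

The two 7s and the two 4s are the tied groups, so those are the rows where the columns differ.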

A DataFrame can compute ranks as well, over either the rows or the columns.

frame = pd.DataFrame({'b':[4.3,7,-3,2],'a':[0,1,0,1],'c':[-2,5,8,-2.5]})
print(frame) ==>
b a c
0 4.3 0 -2.0
1 7.0 1 5.0
2 -3.0 0 8.0
3 2.0 1 -2.5
print(frame.rank(axis='columns')) ==>
b a c
0 3.0 2.0 1.0
1 3.0 1.0 2.0
2 1.0 2.0 3.0
3 3.0 2.0 1.0

Axis Indexes with Duplicate Labels

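Duplicate labels can make code considerably more complicated, because the type of an indexing result depends on whether the label is repeated. Indexing a label that matches multiple entries returns a Series, whereas a label with a single entry returns a scalar.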
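
A small sketch of this behavior, using a made-up Series with repeated labels. The is_unique property of the index reports whether any label is duplicated.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=['a', 'a', 'b', 'b', 'c'])

unique = obj.index.is_unique    # False: 'a' and 'b' each appear twice

many = obj['a']                 # duplicated label -> Series with two entries
one = obj['c']                  # unique label -> scalar

print(unique)
print(many)
print(one)
```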

Author: YihangBao
Link: https://roarboil.github.io/2019/09/05/basicfun/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.