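Handling Missing Values

For numeric data, pandas uses the floating-point value NaN to represent missing data. Missing values are referred to as NA (not available): the data either does not exist, or exists but was not observed. Several functions are available for detecting and handling them, including isnull, dropna, and fillna: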

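import numpy as np
import pandas as pd

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
print(string_data)
# 0     aardvark
# 1    artichoke
# 2          NaN
# 3      avocado
# dtype: object

# isnull() returns a boolean Series that is True where a value is missing
print(string_data.isnull())
# 0    False
# 1    False
# 2     True
# 3    False
# dtype: bool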
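
As a small complementary sketch (not from the original post): Python's built-in None is also treated as missing, and notnull is the negation of isnull. The variable names missing_mask and present_mask are illustrative only.

```python
import numpy as np
import pandas as pd

# Both None and np.nan count as missing values
string_data = pd.Series(['aardvark', None, np.nan, 'avocado'])

missing_mask = string_data.isnull()    # True where a value is missing
present_mask = string_data.notnull()   # the negation of isnull()

print(missing_mask.tolist())
print(present_mask.tolist())
print(int(missing_mask.sum()))         # count of missing values
```

Summing the boolean mask is a quick way to count how many values are missing.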

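Filtering Out Missing Values

dropna is very useful for filtering out missing data. On a Series, it returns a new Series containing only the non-missing values together with their original index labels.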

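from numpy import nan as NA  # NA is used from here on as shorthand for NaN

data = pd.Series([1, NA, 3.5, NA, 7])
print(data)
# 0    1.0
# 1    NaN
# 2    3.5
# 3    NaN
# 4    7.0
# dtype: float64

# dropna() keeps only the non-missing values (index labels are preserved)
print(data.dropna())
# 0    1.0
# 2    3.5
# 4    7.0
# dtype: float64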
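
For a Series, dropna gives the same result as boolean indexing with notnull. The sketch below checks that equivalence; the names dropped and masked are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

dropped = data.dropna()           # drop missing values
masked = data[data.notnull()]     # same result via a boolean mask

print(dropped.equals(masked))
print(list(dropped.index))        # the original labels survive
```

Because the original index labels are kept, the result has gaps in its index; calling reset_index(drop=True) on the result renumbers it from zero if that is needed.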

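For a DataFrame, dropna by default drops every row that contains at least one missing value. To drop only the rows in which all values are NA, pass how='all'. To operate on columns instead of rows, pass axis=1. To keep only the rows that have at least a given number of non-NA values, pass thresh=n.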

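data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
print(data)
#      0    1    2
# 0  1.0  6.5  3.0
# 1  1.0  NaN  NaN
# 2  NaN  NaN  NaN
# 3  NaN  6.5  3.0

# Default: drop every row that contains any NA
print(data.dropna())
#      0    1    2
# 0  1.0  6.5  3.0

# how='all': drop only rows in which every value is NA
print(data.dropna(how='all'))
#      0    1    2
# 0  1.0  6.5  3.0
# 1  1.0  NaN  NaN
# 3  NaN  6.5  3.0

# axis=1: apply the same rule to columns; no column here is entirely NA,
# so nothing is dropped
print(data.dropna(how='all', axis=1))
#      0    1    2
# 0  1.0  6.5  3.0
# 1  1.0  NaN  NaN
# 2  NaN  NaN  NaN
# 3  NaN  6.5  3.0

# thresh=1: keep rows that have at least one non-NA value
print(data.dropna(thresh=1))
#      0    1    2
# 0  1.0  6.5  3.0
# 1  1.0  NaN  NaN
# 3  NaN  6.5  3.0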
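
To make the column-wise and thresh behavior easier to see, here is a minimal sketch that adds a hypothetical all-NA column (label 3, not in the original example) and uses a stricter threshold. The names cols_kept and rows_kept are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data[3] = np.nan  # hypothetical extra column that is entirely NA

# Drop columns in which every value is NA: only column 3 goes away
cols_kept = data.dropna(axis=1, how='all')

# Keep rows with at least two non-NA values: rows 0 and 3 qualify
rows_kept = data.dropna(thresh=2)

print(list(cols_kept.columns))
print(list(rows_kept.index))
```

Note that thresh counts the non-NA values a row must have in order to be kept, not the number of rows to discard.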

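Filling In Missing Values

Use the fillna function to fill in missing values.

You can fill with a single constant, or pass a dict to use a different fill value for each column. By default fillna returns a new object, but you can modify the existing object in place with the inplace parameter.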

# The values are random, so your numbers will differ
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA   # first four rows of column 1
df.iloc[:2, 2] = NA   # first two rows of column 2
print(df)
#           0         1         2
# 0 -1.192710       NaN       NaN
# 1  0.507454       NaN       NaN
# 2 -0.573454       NaN  0.235207
# 3 -0.308547       NaN  1.769382
# 4 -1.696863  1.103388 -0.433570
# 5  0.304338  0.176991 -0.282484
# 6 -0.134053 -0.213646  0.800938

# Fill every missing value with the same constant
print(df.fillna(0))
#           0         1         2
# 0 -1.192710  0.000000  0.000000
# 1  0.507454  0.000000  0.000000
# 2 -0.573454  0.000000  0.235207
# 3 -0.308547  0.000000  1.769382
# 4 -1.696863  1.103388 -0.433570
# 5  0.304338  0.176991 -0.282484
# 6 -0.134053 -0.213646  0.800938

# Fill per column: 0.5 for column 1, 0 for column 2
print(df.fillna({1: 0.5, 2: 0}))
#           0         1         2
# 0 -1.192710  0.500000  0.000000
# 1  0.507454  0.500000  0.000000
# 2 -0.573454  0.500000  0.235207
# 3 -0.308547  0.500000  1.769382
# 4 -1.696863  1.103388 -0.433570
# 5  0.304338  0.176991 -0.282484
# 6 -0.134053 -0.213646  0.800938
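
The difference between the default behavior and inplace=True can be sketched as follows. The column names 'a' and 'b' and the fill values are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 5.0, np.nan]})

# Default: a new DataFrame is returned and df itself is left untouched
filled = df.fillna({'a': 0.0, 'b': -1.0})
missing_before = int(df.isnull().sum().sum())

# inplace=True: df is modified directly and the call returns None
df.fillna(0.0, inplace=True)
missing_after = int(df.isnull().sum().sum())

print(missing_before, missing_after)
```

Returning a new object is usually easier to reason about; in-place modification discards the information about which values were originally missing.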

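You can also fill using the same interpolation methods that were available for reindexing, such as forward fill (method='ffill'), optionally with the limit parameter to cap how many consecutive missing values are filled. In recent pandas versions fillna(method='ffill') is deprecated in favor of the equivalent ffill() method.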

# The values are random, so your numbers will differ
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA   # rows 2 onward of column 1
df.iloc[4:, 2] = NA   # rows 4 onward of column 2
print(df)
#           0         1         2
# 0 -0.571684 -1.130371 -0.565300
# 1  1.243802 -1.536666  0.361830
# 2  1.291224       NaN  0.514455
# 3 -1.232330       NaN  2.057907
# 4  0.364088       NaN       NaN
# 5 -0.598980       NaN       NaN

# Forward fill: equivalent to the older df.fillna(method='ffill'),
# which is deprecated in recent pandas versions
print(df.ffill())
#           0         1         2
# 0 -0.571684 -1.130371 -0.565300
# 1  1.243802 -1.536666  0.361830
# 2  1.291224 -1.536666  0.514455
# 3 -1.232330 -1.536666  2.057907
# 4  0.364088 -1.536666  2.057907
# 5 -0.598980 -1.536666  2.057907

# limit=2: fill at most two consecutive missing values per column
print(df.ffill(limit=2))
#           0         1         2
# 0 -0.571684 -1.130371 -0.565300
# 1  1.243802 -1.536666  0.361830
# 2  1.291224 -1.536666  0.514455
# 3 -1.232330 -1.536666  2.057907
# 4  0.364088       NaN  2.057907
# 5 -0.598980       NaN  2.057907
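
Backward fill works the same way in the opposite direction. A minimal sketch on a small Series with made-up values, comparing ffill and bfill under limit=1:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill carries the last valid value down, at most one step
forward = s.ffill(limit=1)

# Backward fill carries the next valid value up, at most one step
backward = s.bfill(limit=1)

print(forward.isnull().tolist())
print(backward.isnull().tolist())
```

A leading NaN cannot be forward filled because there is no earlier value to carry, and likewise a trailing NaN cannot be backward filled.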

Depending on the problem, you can also get more creative with fillna, for example by filling with the mean of the Series:

data = pd.Series([1., NA, 3.5, NA, 7])
print(data)
# 0    1.0
# 1    NaN
# 2    3.5
# 3    NaN
# 4    7.0
# dtype: float64

# mean() skips NA values: (1.0 + 3.5 + 7.0) / 3 = 3.833333
print(data.fillna(data.mean()))
# 0    1.000000
# 1    3.833333
# 2    3.500000
# 3    3.833333
# 4    7.000000
# dtype: float64
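
The same idea extends to a DataFrame: passing a Series of per-column statistics to fillna fills each column with its own value. This is a sketch with made-up column names 'x' and 'y'.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 4.0, 8.0]})

# df.mean() is a Series indexed by column name; NA values are skipped
col_means = df.mean()

# Each column is filled with the mean of that column
filled = df.fillna(col_means)

print(col_means.to_dict())
print(filled)
```

Replacing df.mean() with df.median() gives a fill value that is less sensitive to outliers.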

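For reference, the main fillna parameters are:

value: a scalar or dict-like object used to fill missing values
method: the fill method, such as 'ffill' or 'bfill' (deprecated in recent pandas versions in favor of the ffill() and bfill() methods)
axis: the axis to fill along (default axis=0)
inplace: modify the calling object instead of returning a new one
limit: the maximum number of consecutive missing values to fill when forward or backward filling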

Author: YihangBao
Link: https://roarboil.github.io/2019/09/21/nanp/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.