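For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. Missing values are referred to as NA (not available): the data either does not exist, or exists but was not observed. Several functions are available for handling them: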
    import numpy as np
    import pandas as pd

    string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
    print(string_data)
    # ==>
    # 0     aardvark
    # 1    artichoke
    # 2          NaN
    # 3      avocado
    # dtype: object

    # isnull() returns a boolean mask that is True where a value is missing
    print(string_data.isnull())
    # ==>
    # 0    False
    # 1    False
    # 2     True
    # 3    False
    # dtype: bool
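As a small supplementary sketch (not part of the original example), Python's built-in `None` is also treated as NA, and `notnull` gives the inverse of the `isnull` mask. The variable names `s` and `mask` are only illustrative.

```python
import numpy as np
import pandas as pd

# Both None and np.nan count as missing values
s = pd.Series(['aardvark', None, np.nan, 'avocado'])

mask = s.isnull()              # True where the value is missing
present = s.notnull().sum()    # number of non-missing values

print(mask.tolist())
print(present)
```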
Filtering out missing values

dropna is very useful for filtering out missing data.
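    import pandas as pd
    from numpy import nan as NA  # NA is just a short alias for np.nan

    data = pd.Series([1, NA, 3.5, NA, 7])
    print(data)
    # ==>
    # 0    1.0
    # 1    NaN
    # 2    3.5
    # 3    NaN
    # 4    7.0
    # dtype: float64

    # dropna() returns only the non-missing values, keeping their index labels
    print(data.dropna())
    # ==>
    # 0    1.0
    # 2    3.5
    # 4    7.0
    # dtype: float64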
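For a Series, `dropna` gives the same result as boolean indexing with `notnull`. The following is a minimal sketch of that equivalence; the names `via_dropna` and `via_mask` are illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

via_dropna = data.dropna()
via_mask = data[data.notnull()]   # keep rows where the mask is True

# Both approaches keep the same values and the same index labels
print(via_dropna.equals(via_mask))
```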
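For DataFrame objects, dropna by default drops every row that contains any missing value. To drop only the rows in which all values are NA, pass how='all'. To operate on columns instead of rows, pass axis=1. To keep only the rows that contain at least a given number of non-NA values, pass thresh= (note that thresh is the minimum number of non-NA values required to keep a row, not the number of rows to drop).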
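    import pandas as pd
    from numpy import nan as NA  # NA is just a short alias for np.nan

    data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                         [NA, NA, NA], [NA, 6.5, 3.]])
    print(data)
    # ==>
    #      0    1    2
    # 0  1.0  6.5  3.0
    # 1  1.0  NaN  NaN
    # 2  NaN  NaN  NaN
    # 3  NaN  6.5  3.0

    # Default: drop every row that contains at least one NA
    print(data.dropna())
    # ==>
    #      0    1    2
    # 0  1.0  6.5  3.0

    # Drop only the rows in which every value is NA
    print(data.dropna(how='all'))
    # ==>
    #      0    1    2
    # 0  1.0  6.5  3.0
    # 1  1.0  NaN  NaN
    # 3  NaN  6.5  3.0

    # Drop only the columns in which every value is NA (none here)
    print(data.dropna(how='all', axis=1))
    # ==>
    #      0    1    2
    # 0  1.0  6.5  3.0
    # 1  1.0  NaN  NaN
    # 2  NaN  NaN  NaN
    # 3  NaN  6.5  3.0

    # Keep the rows that have at least 1 non-NA value
    print(data.dropna(thresh=1))
    # ==>
    #      0    1    2
    # 0  1.0  6.5  3.0
    # 1  1.0  NaN  NaN
    # 3  NaN  6.5  3.0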
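The meaning of `thresh` is easier to see when it is compared with the per-row count of non-NA values. This is a small sketch on the same data; `non_na_counts`, `kept_2`, and `kept_3` are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# Number of non-NA values in each row: 3, 1, 0, 2
non_na_counts = data.notnull().sum(axis=1)

kept_2 = data.dropna(thresh=2)   # rows with at least 2 non-NA values
kept_3 = data.dropna(thresh=3)   # rows with at least 3 non-NA values

print(non_na_counts.tolist())
print(list(kept_2.index))
print(list(kept_3.index))
```

A row survives exactly when its non-NA count is greater than or equal to `thresh`.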
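Filling in missing values

The fillna function can be used to fill in missing values.
You can fill with a constant, or pass a dict to use a different fill value for each column. By default fillna returns a new object, but the inplace parameter can be used to modify the existing object instead.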
    import numpy as np
    import pandas as pd
    from numpy import nan as NA  # NA is just a short alias for np.nan

    # Random data: the actual numbers differ from run to run
    df = pd.DataFrame(np.random.randn(7, 3))
    df.iloc[:4, 1] = NA
    df.iloc[:2, 2] = NA
    print(df)
    # ==>
    #           0         1         2
    # 0 -1.192710       NaN       NaN
    # 1  0.507454       NaN       NaN
    # 2 -0.573454       NaN  0.235207
    # 3 -0.308547       NaN  1.769382
    # 4 -1.696863  1.103388 -0.433570
    # 5  0.304338  0.176991 -0.282484
    # 6 -0.134053 -0.213646  0.800938

    # Fill every missing value with the constant 0
    print(df.fillna(0))
    # ==>
    #           0         1         2
    # 0 -1.192710  0.000000  0.000000
    # 1  0.507454  0.000000  0.000000
    # 2 -0.573454  0.000000  0.235207
    # 3 -0.308547  0.000000  1.769382
    # 4 -1.696863  1.103388 -0.433570
    # 5  0.304338  0.176991 -0.282484
    # 6 -0.134053 -0.213646  0.800938

    # Fill column 1 with 0.5 and column 2 with 0
    print(df.fillna({1: 0.5, 2: 0}))
    # ==>
    #           0         1         2
    # 0 -1.192710  0.500000  0.000000
    # 1  0.507454  0.500000  0.000000
    # 2 -0.573454  0.500000  0.235207
    # 3 -0.308547  0.500000  1.769382
    # 4 -1.696863  1.103388 -0.433570
    # 5  0.304338  0.176991 -0.282484
    # 6 -0.134053 -0.213646  0.800938
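The listing above does not demonstrate inplace, so here is a minimal sketch of the difference between the default behavior and `inplace=True`. The small frame and the names `filled` and `still_has_na` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 4.0]})

# Default: a new object is returned and df itself is left untouched
filled = df.fillna(0)
still_has_na = bool(df.isnull().values.any())

# inplace=True modifies df directly
df.fillna(0, inplace=True)

print(still_has_na)
print(df.isnull().values.any())
```

Assigning the returned object (`df = df.fillna(0)`) achieves the same end result and is often easier to follow than mutating in place.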
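The same interpolation methods used earlier for reindexing are also available here, for example forward fill via method='ffill'. The limit parameter can be added to cap how many consecutive missing values are filled. In pandas 2.1 and later, fillna(method=...) is deprecated; the equivalent calls are df.ffill() and df.bfill(), which accept the same limit parameter.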
    import numpy as np
    import pandas as pd
    from numpy import nan as NA  # NA is just a short alias for np.nan

    # Random data: the actual numbers differ from run to run
    df = pd.DataFrame(np.random.randn(6, 3))
    df.iloc[2:, 1] = NA
    df.iloc[4:, 2] = NA
    print(df)
    # ==>
    #           0         1         2
    # 0 -0.571684 -1.130371 -0.565300
    # 1  1.243802 -1.536666  0.361830
    # 2  1.291224       NaN  0.514455
    # 3 -1.232330       NaN  2.057907
    # 4  0.364088       NaN       NaN
    # 5 -0.598980       NaN       NaN

    # Forward fill: propagate the last valid value downward.
    # df.ffill() is equivalent to the older df.fillna(method='ffill'),
    # which is deprecated in pandas 2.1 and later.
    print(df.ffill())
    # ==>
    #           0         1         2
    # 0 -0.571684 -1.130371 -0.565300
    # 1  1.243802 -1.536666  0.361830
    # 2  1.291224 -1.536666  0.514455
    # 3 -1.232330 -1.536666  2.057907
    # 4  0.364088 -1.536666  2.057907
    # 5 -0.598980 -1.536666  2.057907

    # limit=2: fill at most 2 consecutive missing values per gap
    print(df.ffill(limit=2))
    # ==>
    #           0         1         2
    # 0 -0.571684 -1.130371 -0.565300
    # 1  1.243802 -1.536666  0.361830
    # 2  1.291224 -1.536666  0.514455
    # 3 -1.232330 -1.536666  2.057907
    # 4  0.364088       NaN  2.057907
    # 5 -0.598980       NaN  2.057907
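Because the data above is random, the effect of limit can be hard to check by eye. The following sketch uses a small fixed Series (an assumption made for illustration) and also shows backward fill with `bfill`, which the original text does not cover.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()           # carry the last valid value forward
limited = s.ffill(limit=2)    # fill at most 2 consecutive gaps
backward = s.bfill()          # carry the next valid value backward

print(forward.tolist())
print(limited.isnull().tolist())
print(backward.tolist())
```

With `limit=2`, the third missing value in the run stays NaN because only the first two positions after the last valid value are filled.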
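With a little creativity, fillna can be adapted to the problem at hand, for example by filling with the mean of the Series: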
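    import pandas as pd
    from numpy import nan as NA  # NA is just a short alias for np.nan

    data = pd.Series([1., NA, 3.5, NA, 7])
    print(data)
    # ==>
    # 0    1.0
    # 1    NaN
    # 2    3.5
    # 3    NaN
    # 4    7.0
    # dtype: float64

    # mean() skips NA values, so the fill value is (1 + 3.5 + 7) / 3
    print(data.fillna(data.mean()))
    # ==>
    # 0    1.000000
    # 1    3.833333
    # 2    3.500000
    # 3    3.833333
    # 4    7.000000
    # dtype: float64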
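The same idea extends to a DataFrame: passing the Series returned by `df.mean()` fills each column with its own mean. This is a sketch on a small made-up frame; the column names `x` and `y` are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                   'y': [np.nan, 4.0, 8.0]})

# df.mean() is a Series indexed by column name, so each column
# is filled with its own mean (x: 2.0, y: 6.0)
filled = df.fillna(df.mean())

print(filled)
```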
Below is a reference for some of the parameters: