Data Transformation

Removing Duplicates

When a DataFrame contains duplicate rows, they can be handled as follows.

data = pd.DataFrame({'k1':['one','two']*3 + ['two'], 'k2':[1,1,2,3,3,4,4]})
print(data) ==>
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4

#duplicated returns a boolean Series: False means the row has not appeared before, True means it duplicates an earlier row
print(data.duplicated()) ==>
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool

#drop_duplicates drops the rows marked True above, i.e. the rows that duplicate earlier ones
print(data.drop_duplicates()) ==>
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4

data['v1'] = range(7)
print(data) ==>
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
5 two 4 5
6 two 4 6

#Passing a subset of columns makes drop_duplicates deduplicate based on those columns only; keep='last' keeps the last of each group of duplicates instead of the first
print(data.drop_duplicates(['k1','k2'], keep='last')) ==>
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
6 two 4 6

Transforming Data Using a Function or Mapping

Suppose we have the following requirement: given a DataFrame made up of many items, we want to add a column indicating the category each item belongs to. How should we go about it?

We first create a DataFrame and a mapping (dict) that describes the category correspondence.

data = pd.DataFrame({'food':['bacon','pulled pork','bacon','pastrami','corned beef','bacon','pastrami','honey ham','nova lox'],
                     'ounces':[4,3,12,6,7.5,8,3,5,6]})
print(data) ==>
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 pastrami 6.0
4 corned beef 7.5
5 bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}

We can then call map on one of the DataFrame's Series columns to obtain the new corresponding column.

data['animal'] = data['food'].map(meat_to_animal)
print(data) ==>
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 pastrami 6.0 cow
4 corned beef 7.5 cow
5 bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon

Using map is a convenient way to perform element-wise transformations and other data-cleaning operations.
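
map also accepts a function instead of a dict. A minimal sketch, reusing the data and meat_to_animal objects defined above (the lowercasing is only needed if the food names use mixed case):

data['animal'] = data['food'].map(lambda x: meat_to_animal[x.lower()])
print(data)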

Replacing Values

The fillna method covered earlier only fills in missing (NA) values; the replace method introduced here can substitute any specified values.

replace accepts a single value, a list of values, or a dict mapping old values to replacements.

data = pd.Series([1., -999,2.,-999.,-1000.,3.])
print(data) ==>
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
print(data.replace(-999, np.nan)) ==>
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
print(data.replace([-999,-1000],np.nan)) ==>
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
print(data.replace({-999:np.nan, -1000:0})) ==>
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64

Renaming Axis Indexes

Like a Series, an axis index has a map method that can be used to transform the index labels.

data = pd.DataFrame(np.arange(12).reshape((3,4)),
                    index=['Ohio', 'Colorado', 'NewYork'],
                    columns=['one', 'two', 'three', 'four'])
print(data) ==>
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
NewYork 8 9 10 11
transform = lambda x: x[:4].upper()
data.index = data.index.map(transform)
print(data) ==>
one two three four
OHIO 0 1 2 3
COLO 4 5 6 7
NEWY 8 9 10 11

The drawback here is that assigning to data.index modifies the DataFrame in place rather than returning a new, modified object. To create a transformed copy instead, use rename; rename can also be combined with dict-like objects to relabel only a subset of the labels. (Note: the outputs below assume data has been re-created with its original index labels.)

print(data.rename(index=str.title, columns=str.upper))	==>
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colorado 4 5 6 7
Newyork 8 9 10 11
print(data.rename(index={'Ohio':'INDIANA'},
                  columns={'three':'peekaboo'})) ==>
one two peekaboo four
INDIANA 0 1 2 3
Colorado 4 5 6 7
NewYork 8 9 10 11
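
If you want to modify a dataset in place rather than create a copy, rename also accepts inplace=True. A small sketch, assuming 'Ohio' is still among the index labels:

data.rename(index={'Ohio': 'INDIANA'}, inplace=True)
print(data.index)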

Discretization and Binning

Binning means dividing continuous values into discrete groups. The cut function takes the bin edges and performs the binning.

ages =[20,22,25,27,21,23,37,31,61,45,41,32]
bins = [18,25,35,60,100]
cats = pd.cut(ages,bins)
print(cats) ==>
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

#codes is an array indicating which bin each element falls into
print(cats.codes) ==>
[0 0 0 1 0 0 2 1 3 2 2 1]

#categories lists the bins
print(cats.categories) ==>
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
closed='right',
dtype='interval[int64]')

#value_counts gives the number of elements in each bin
print(pd.value_counts(cats)) ==>
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64

By default the intervals are open on the left and closed on the right; to switch which side is closed, pass right=False.
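
For example, a quick sketch of the left-closed variant, reusing the ages and bins lists from the example above:

print(pd.cut(ages, bins, right=False))
#the intervals now look like [18, 25) instead of (18, 25]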

You can pass a list or array to the labels option to supply your own bin names.

ages =[20,22,25,27,21,23,37,31,61,45,41,32]
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
bins = [18,25,35,60,100]
cats = pd.cut(ages,bins, labels=group_names)
print(cats) ==>
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]

To limit the number of decimal places used for the computed bin edges, use the precision= option.
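
For example, a small sketch that cuts random data into four equal-width bins with the bin edges rounded to two decimal places (pd and np imported as above):

values = np.random.rand(20)
print(pd.cut(values, 4, precision=2))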

qcut is a closely related function that bins the data based on sample quantiles. The example below cuts normally distributed numbers into four groups of equal size.

data = np.random.randn(1000)
cats = pd.qcut(data,4)
print(cats) ==>
[(-3.158, -0.723], (-0.723, -0.0431], (0.642, 3.168], (-0.0431, 0.642], (-0.723, -0.0431], ..., (-0.723, -0.0431], (0.642, 3.168], (0.642, 3.168], (-3.158, -0.723], (-3.158, -0.723]]
Length: 1000
Categories (4, interval[float64]): [(-3.158, -0.723] < (-0.723, -0.0431] < (-0.0431, 0.642] < (0.642, 3.168]]

Detecting and Filtering Outliers

To find all rows containing a value greater than 3 or less than -3, use the any method.

data = pd.DataFrame(np.random.randn(1000,4))
print(data.describe()) ==>
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.059778 -0.032303 -0.008715 0.006456
std 0.986740 1.001720 0.995813 0.994907
min -3.012142 -3.288286 -2.937343 -2.918211
25% -0.588659 -0.657177 -0.684822 -0.681862
50% 0.061355 -0.034852 -0.007025 0.012280
75% 0.701610 0.612375 0.689996 0.716508
max 3.490700 3.580543 2.949492 2.870731
print(data[(np.abs(data)>3).any(axis=1)]) ==>
0 1 2 3
56 3.490700 -0.501911 -0.306923 -0.104226
64 0.680653 3.580543 0.660139 0.029404
572 3.406488 -1.507746 1.440605 0.906701
657 -0.513807 3.529303 -0.627621 -1.722967
714 -3.012142 2.199210 -0.396182 -0.461979
839 0.151665 -3.119611 1.878828 -0.690340
861 -1.369058 -3.288286 -1.141665 -0.705685
930 0.076559 -3.193390 -0.024365 -0.919133

np.sign(data) produces 1 or -1 depending on whether each value in data is positive or negative.
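
A common use is to cap outliers to an interval. A sketch that clips values outside [-3, 3] to exactly -3 or 3, operating on the data DataFrame from the example above:

data[np.abs(data) > 3] = np.sign(data) * 3
print(data.describe())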

Permutation and Random Sampling

To permute (randomly reorder) the rows of a DataFrame, generate an ordering with numpy.random.permutation and pass it to the DataFrame's take method.

df = pd.DataFrame(np.arange(5*4).reshape((5,4)))
print(df) ==>
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
sampler = np.random.permutation(5)
print(sampler) ==>
[0 3 1 2 4]
print(df.take(sampler)) ==>
0 1 2 3
0 0 1 2 3
3 12 13 14 15
1 4 5 6 7
2 8 9 10 11
4 16 17 18 19

If we need to draw a random sample from a dataset, we can use the sample method available on Series and DataFrame.

Here we select a random subset without replacement.

df = pd.DataFrame(np.arange(5*4).reshape((5,4)))
print(df) ==>
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
print(df.sample(n=3)) ==>
0 1 2 3
3 12 13 14 15
4 16 17 18 19
2 8 9 10 11

To generate a sample with replacement (i.e. allowing repeated selections), pass replace=True to sample.

choices = pd.Series([5,7,-1,6,4])
draws = choices.sample(n=10,replace=True)
print(draws) ==>
2 -1
0 5
4 4
2 -1
3 6
0 5
4 4
1 7
3 6
4 4
dtype: int64

Computing Indicator/Dummy Variables

The get_dummies function converts a column into a matrix of indicator (dummy) variables.

df = pd.DataFrame({'key':['b','b','a','c','a','b'],
                   'data1':range(6)})
print(df) ==>
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
print(pd.get_dummies(df['key'])) ==>
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0

Here the a column is 1 for rows whose value in the original key column was 'a' and 0 otherwise, and likewise for the b and c columns.
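
You may also want to prefix the dummy columns and join them back onto the original DataFrame. A sketch assuming the df above, using get_dummies' prefix argument and DataFrame.join:

dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
print(df_with_dummy)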

Combining get_dummies with a discretization function such as cut is a useful recipe for statistical summaries.

np.random.seed(12345)
values = np.random.rand(10)
print(values) ==>
[0.92961609 0.31637555 0.18391881 0.20456028 0.56772503 0.5955447
0.96451452 0.6531771 0.74890664 0.65356987]
bins = [0,0.2,0.4,0.6,0.8,1]
print(pd.get_dummies(pd.cut(values,bins))) ==>
(0.0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1.0]
0 0 0 0 0 1
1 0 1 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 0 1 0 0
5 0 0 1 0 0
6 0 0 0 0 1
7 0 0 0 1 0
8 0 0 0 1 0
9 0 0 0 1 0

This gives a clear picture of which interval each value falls into.

Author: YihangBao
Link: https://roarboil.github.io/2019/09/22/datachange/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.