obj1 = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
print(obj1)
==>
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

obj2 = obj1.reindex(['a', 'b', 'c', 'd', 'e'])
print(obj2)
==>
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
print(frame)
==>
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

frame1 = frame.reindex(['a', 'b', 'c', 'd'])
print(frame1)
==>
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0
Similarly, to reindex the columns instead of the rows, pass the columns keyword.
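As a minimal sketch of the columns keyword, the snippet below reindexes the columns of the same frame. The list name states and the extra 'Utah' column are illustrative choices, not part of the original example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Reindex the columns: 'Utah' is new, so it is filled with NaN;
# 'Ohio' is not listed, so it is left out of the result.
states = ['Texas', 'Utah', 'California']  # illustrative column list
frame2 = frame.reindex(columns=states)
print(frame2)
```

Like reindexing rows, this returns a new object and leaves frame unchanged.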
Dropping entries from an axis
The drop method returns a new object with the indicated value or values removed from the given axis.
obj = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
print(obj)
==>
a    0
b    1
c    2
d    3
e    4
dtype: int64

obj1 = obj.drop('c')
print(obj1)
==>
a    0
b    1
d    3
e    4
dtype: int64
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
print(data)
==>
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

print(data.drop(['Colorado', 'Ohio']))
==>
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

print(data.drop('two', axis='columns'))
==>
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data.drop('Utah', inplace=True)
print(data)
==>
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York   12   13     14    15
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print(obj)
==>
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

print(obj['b':'c'])
==>
b    1.0
c    2.0
dtype: float64

obj['b':'c'] = 5
print(obj)
==>
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
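The label slice above includes its end point, unlike ordinary Python slicing. The short sketch below contrasts it with a positional slice through iloc; the names by_label and by_position are only for illustration.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj['b':'c']      # label slice: both 'b' and 'c' are included
by_position = obj.iloc[1:2]  # positional slice: position 2 is excluded

print(by_label)
print(by_position)
```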
Indexing a DataFrame with a single value or a sequence retrieves one or more columns:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
print(data[['three', 'one']])
==>
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
print(data)
==>
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

print(data[:2])
==>
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7

data[data < 5] = 0
print(data)
==>
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
To select rows of a DataFrame, use loc with axis labels or iloc with integer positions.
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
print(data.loc['Colorado', ['two', 'three']])
==>
two      5
three    6
Name: Colorado, dtype: int64

print(data.iloc[2, [3, 0, 1]])
==>
four    11
one      8
two      9
Name: Utah, dtype: int64
Besides single labels, loc and iloc also accept slices.
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
print(data.loc[:'Utah', 'two'])
==>
Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int64

print(data.iloc[:, :3][data.three > 5])
==>
          one  two  three
Colorado    4    5      6
Utah        8    9     10
New York   12   13     14
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
print(s1)
==>
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

print(s2)
==>
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

print(s1 + s2)
==>
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(df1)
==>
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0

print(df2)
==>
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

print(df1 + df2)
==>
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(df1)
==>
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0

print(df2)
==>
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

print(df1.add(df2, fill_value=0))
==>
            b    c     d     e
Colorado  6.0  7.0   8.0   NaN
Ohio      3.0  1.0   6.0   5.0
Oregon    9.0  NaN  10.0  11.0
Texas     9.0  4.0  12.0   8.0
Utah      0.0  NaN   1.0   2.0
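Note: the fill value stands in for an entry that is missing from one of the two operands, and the addition is then carried out with it. It does not replace NaN values in the result, so a position missing from both operands is still NaN.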
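To make the distinction concrete, here is a small sketch comparing fill_value with calling fillna on the finished result. The names filled_before and filled_after are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

filled_before = df1.add(df2, fill_value=0)  # fill missing operands, then add
filled_after = (df1 + df2).fillna(0)        # add first, then overwrite NaN

# ('Ohio', 'c') exists only in df1: 1.0 + 0 with fill_value,
# but NaN replaced by 0 with fillna.
print(filled_before.loc['Ohio', 'c'])
print(filled_after.loc['Ohio', 'c'])

# ('Colorado', 'e') is missing from both frames, so fill_value leaves it NaN.
print(filled_before.loc['Colorado', 'e'])
```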
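Commonly used arithmetic methods include add, sub, mul, div, floordiv, and pow, along with their reversed counterparts (radd, rsub, and so on).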
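As a brief sketch of how a reversed arithmetic method behaves, rdiv swaps the operands, so df.rdiv(1) is equivalent to 1 / df. The small frame here is made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data: values 1.0 through 6.0, so there is no division by zero
df = pd.DataFrame(np.arange(1., 7.).reshape((2, 3)), columns=list('abc'))

# rdiv swaps the operands: this computes 1 / df
reciprocal = df.rdiv(1)
print(reciprocal.equals(1 / df))
```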
Operations between DataFrame and Series
Operations between a DataFrame and a Series rely on broadcasting.
df = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
ser = df.iloc[0]
print(df)
==>
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

print(ser)
==>
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

print(df - ser)
==>
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0
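In the subtraction above, the Series index is matched against the DataFrame's columns and the operation is broadcast down the rows. To broadcast across the columns instead, matching on the row index, one option is an arithmetic method with axis='index'. The sketch below assumes the same df; the names col and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])

col = df['d']  # a Series indexed by the row labels

# Match col against the row index and subtract it from every column
result = df.sub(col, axis='index')
print(result)
```

Each row has its own 'd' value subtracted, so column 'd' becomes all zeros.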
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(frame)
==>
               b         d         e
Utah    0.371684 -0.428336  0.366052
Ohio    0.508354 -1.108785 -1.001388
Texas  -0.751085  0.879442 -1.595899
Oregon -0.270006 -0.010182 -0.677792

f = lambda x: x.max() - x.min()
print(frame.apply(f))
==>
b    1.259439
d    1.988227
e    1.961951
dtype: float64
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(frame)
==>
               b         d         e
Utah    0.901981  1.239383  1.412554
Ohio    1.584249  0.120927 -0.849564
Texas  -0.720863  0.607950 -0.481181
Oregon  0.641740  1.217635 -0.589586

print(frame.apply(f))
==>
            b         d         e
min -0.720863  0.120927 -0.849564
max  1.584249  1.239383  1.412554
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(frame)
==>
               b         d         e
Utah   -0.762464 -0.076285 -0.177307
Ohio    0.755204 -2.211331 -0.259672
Texas   0.233797  0.539000  1.620073
Oregon  0.738784 -0.342989 -0.006291

format = lambda x: '%.2f' % x
print(frame.applymap(format))
==>
            b      d      e
Utah    -0.76  -0.08  -0.18
Ohio     0.76  -2.21  -0.26
Texas    0.23   0.54   1.62
Oregon   0.74  -0.34  -0.01
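The same element-wise formatting can be applied to a single column with Series.map. The sketch below uses made-up values and an illustrative name fmt (chosen to avoid shadowing the built-in format). Note that recent pandas releases (2.1 and later) deprecate DataFrame.applymap in favor of DataFrame.map, so the whole-frame call may emit a warning there.

```python
import pandas as pd

# Made-up values for illustration
frame = pd.DataFrame([[1.234, -0.5], [2.0, 3.14159]], columns=['b', 'd'])

fmt = lambda x: '%.2f' % x  # format one scalar to two decimal places

# Series.map applies the function to each element of one column
formatted = frame['b'].map(fmt)
print(formatted)
```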
Sorting and ranking
The sort_index method sorts by row or column index and returns a new object.
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
print(obj)
==>
d    0
a    1
b    2
c    3
dtype: int64

print(obj.sort_index())
==>
a    1
b    2
c    3
d    0
dtype: int64
By default, the data is sorted in ascending order along the axis.
To sort by column labels, pass axis=1; to sort in descending order, pass ascending=False.
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
print(frame)
==>
       d  a  b  c
three  0  1  2  3
one    4  5  6  7

print(frame.sort_index())
==>
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

print(frame.sort_index(axis=1))
==>
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

print(frame.sort_index(axis=1, ascending=False))
==>
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
print(obj)
==>
0    4.0
1    NaN
2    7.0
3    NaN
4   -3.0
5    2.0
dtype: float64

print(obj.sort_values())
==>
4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
When sorting a DataFrame, you can use one or more columns as the sort keys by passing them to the by parameter.
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
print(frame)
==>
   b  a
0  4  0
1  7  1
2 -3  0
3  2  1

print(frame.sort_values(by='b'))
==>
   b  a
2 -3  0
3  2  1
0  4  0
1  7  1

print(frame.sort_values(by=['a', 'b']))
==>
   b  a
2 -3  0
0  4  0
3  2  1
1  7  1
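When sorting by several keys, each key can have its own direction by passing a list to ascending. A minimal sketch on the same frame follows; the name result is illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' ascending, then by 'b' descending within each value of 'a'
result = frame.sort_values(by=['a', 'b'], ascending=[True, False])
print(result)
```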
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})
print(frame)
==>
     b  a    c
0  4.3  0 -2.0
1  7.0  1  5.0
2 -3.0  0  8.0
3  2.0  1 -2.5

print(frame.rank(axis='columns'))
==>
     b    a    c
0  3.0  2.0  1.0
1  3.0  1.0  2.0
2  1.0  2.0  3.0
3  3.0  2.0  1.0
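The frame above has no ties within a row, so it does not show how rank breaks them. As a small sketch on a made-up Series: by default, tied values share the mean of the ranks they span, while method='first' ranks ties in the order they appear.

```python
import pandas as pd

# Made-up values containing ties (7 appears twice, 4 appears twice)
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

avg = obj.rank()                  # default: ties get the mean rank
first = obj.rank(method='first')  # ties ranked by order of appearance

print(avg)
print(first)
```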