This section introduces the basic mechanics of interacting with the data in a Series or DataFrame.

Reindexing

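The reindex method creates a new object that conforms to a new index. Calling reindex on a Series rearranges the data according to the new index, and any index label that was not present before introduces a missing value (NaN).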

import numpy as np
import pandas as pd

obj1 = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d','b','a','c'])
print(obj1)  ==>
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
obj2 = obj1.reindex(['a','b','c','d','e'])
print(obj2)		==>
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

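For ordered data such as time series, you may want to interpolate or fill values when reindexing. The optional method parameter does this: method='ffill' forward-fills, carrying the last valid value forward into the new labels.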

obj1 = pd.Series(['Blue','purple','yellow'], index=[0,2,4])
print(obj1)	==>
0      Blue
2    purple
4    yellow
dtype: object
obj2 = obj1.reindex(range(6), method='ffill')
print(obj2)	==>
0      Blue
1      Blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

On a DataFrame, reindex can change the row index, the column index, or both. When only a single sequence is passed, it reindexes the rows.

frame = pd.DataFrame(np.arange(9).reshape((3,3)),index = ['a','c','d'],columns=['Ohio','Texas','California'])
print(frame)	==>
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
frame1 = frame.reindex(['a','b','c','d'])
print(frame1) ==>
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0

Similarly, to reindex the columns, use the columns keyword.
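
The columns keyword can be sketched as follows, reusing the same frame as above. The label 'Utah' is an illustrative assumption: it does not exist in frame, so its column comes back filled with NaN.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# 'Utah' is a hypothetical label that is not in frame.columns,
# so reindexing introduces it as an all-NaN column.
states = ['Texas', 'Utah', 'California']
frame2 = frame.reindex(columns=states)
print(frame2)
```

Columns that are not listed (here 'Ohio') are dropped from the result, and the original frame is left unchanged.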

Dropping Entries from an Axis

The drop method returns a new object with the specified label or labels removed from an axis.

obj = pd.Series(np.arange(5), index =['a','b','c','d','e'])
print(obj) ==>
a    0
b    1
c    2
d    3
e    4
dtype: int64
obj1 = obj.drop('c')
print(obj1)	==>
a    0
b    1
d    3
e    4
dtype: int64

On a DataFrame, calling drop with a sequence of labels removes rows by row label (axis 0). To drop columns instead, pass axis='columns' (or axis=1).

data = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['Ohio','Colorado','Utah','New York'],
                    columns=['one','two','three','four'])
print(data)	==>
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
print(data.drop(['Colorado','Ohio']))	==>
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
print(data.drop('two',axis='columns'))	==>
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

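All of the calls above return a new object and leave the original unchanged. Passing inplace=True modifies the original object directly and returns None instead of a new object.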

data = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['Ohio','Colorado','Utah','New York'],
                    columns=['one','two','three','four'])
data.drop('Utah',inplace=True)
print(data)	==>
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York   12   13     14    15

Indexing, Selection, and Filtering

Series indexing (obj[...]) works much like NumPy array indexing, except that the index values of a Series need not be integers.

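Ordinary Python slices exclude the end point, but slicing a Series by label includes it. Assigning to such a selection modifies the corresponding part of the Series.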

obj =pd.Series(np.arange(4.),index=['a','b','c','d'])
print(obj)	==>
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64
print(obj['b':'c'])	==>
b    1.0
c    2.0
dtype: float64
obj['b':'c']=5
print(obj)	==>
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
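
Besides label slices, a Series can also be indexed with a single label, a list of labels, or a boolean mask. The following is a minimal sketch of those three forms on the same kind of Series; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

single = obj['b']                # one label -> scalar value
several = obj[['b', 'a', 'd']]   # list of labels -> Series in that order
small = obj[obj < 2]             # boolean mask -> only the matching entries

print(single)
print(several)
print(small)
```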

Indexing a DataFrame with a single value or a sequence retrieves one or more columns:

data = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['Ohio','Colorado','Utah','New York'],
                    columns=['one','two','three','four'])
print(data[['three','one']])	==>
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12

Slice syntax on a DataFrame (data[:2]) selects rows, not columns, and NumPy-style boolean indexing works as well.

data = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['Ohio','Colorado','Utah','New York'],
                    columns=['one','two','three','four'])
print(data)	==>
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
print(data[:2])	==>
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
data[data<5]=0
print(data)	==>
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

For row selection on a DataFrame, use loc with axis labels or iloc with integer positions.

data = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['Ohio','Colorado','Utah','New York'],
                    columns=['one','two','three','four'])
print(data.loc['Colorado',['two','three']])	==>
two      5
three    6
Name: Colorado, dtype: int64
print(data.iloc[2,[3,0,1]])	==>
four    11
one      8
two      9
Name: Utah, dtype: int64

Besides single labels, loc and iloc also accept slices.

data = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['Ohio','Colorado','Utah','New York'],
                    columns=['one','two','three','four'])
print(data.loc[:'Utah','two'])	==>
Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int64
print(data.iloc[:,:3][data.three>5])	==>
          one  two  three
Colorado    4    5      6
Utah        8    9     10
New York   12   13     14

Integer Indexes

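Because indexing with integers is often ambiguous (an integer could mean a label or a position), it is best to be explicit when selecting data: use loc for labels and iloc for integer positions.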

ser = pd.Series(np.arange(3.))
print(ser)	==>
0    0.0
1    1.0
2    2.0
dtype: float64
print(ser.loc[:1])	==>
0    0.0
1    1.0
dtype: float64
print(ser.iloc[:1])	==>
0    0.0
dtype: float64
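
The ambiguity shows up most clearly with negative integers. As a small sketch on the same Series: iloc treats -1 as a position (the last element), while loc treats it as a label, which does not exist here and raises KeyError.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))

# Position-based: -1 means "last element".
last = ser.iloc[-1]

# Label-based: there is no label -1 in the index 0, 1, 2.
try:
    ser.loc[-1]
    label_found = True
except KeyError:
    label_found = False

print(last, label_found)
```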

Arithmetic and Data Alignment

When two objects are added, the index of the result is the union of the two indexes.

s1 = pd.Series([7.3,-2.5,3.4,1.5], index=['a','c','d','e'])
s2 = pd.Series([-2.1,3.6,-1.5,4,3.1], index=['a','c','e','f','g'])
print(s1)	==>
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
print(s2)	==>
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64
print(s1+s2)	==>
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

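As the example shows, labels present in both objects yield the sum of the two values, while labels that do not overlap produce missing values through internal data alignment.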

For DataFrames, alignment is performed on both rows and columns.

df1 = pd.DataFrame(np.arange(9.).reshape((3,3)),columns=list('bcd'),index=['Ohio','Texas','Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
print(df1)	==>
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
print(df2)	==>
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
print(df1+df2)	==>
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN

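In the result, the index and the columns are each the union of those of the two DataFrames, and NaN appears wherever a row or column label exists in only one of them.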

Arithmetic Methods with Fill Values

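The operations above produced missing values in several places. To supply a fill value when an axis label exists in one object but not the other, use the add method and pass the fill_value argument.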

df1 = pd.DataFrame(np.arange(9.).reshape((3,3)),columns=list('bcd'),index=['Ohio','Texas','Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
print(df1)	==>
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
print(df2)	==>
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
print(df1.add(df2,fill_value=0))	==>
            b    c     d     e
Colorado  6.0  7.0   8.0   NaN
Ohio      3.0  1.0   6.0   5.0
Oregon    9.0  NaN  10.0  11.0
Texas     9.0  4.0  12.0   8.0
Utah      0.0  NaN   1.0   2.0

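Note: the fill value stands in for an entry that is missing from one object before the addition is performed; it does not replace NaN in the result. Where an entry is missing from both objects, the result is still NaN.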
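
To make the distinction concrete, the sketch below compares add with fill_value=0 against adding first and then calling fillna(0) on the result. The variable names filled_before and filled_after are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Fill the missing operand with 0, then add.
filled_before = df1.add(df2, fill_value=0)

# Add first (producing NaN), then overwrite NaN in the result with 0.
filled_after = (df1 + df2).fillna(0)

print(filled_before.loc['Ohio', 'c'])   # df1 value survives: 1.0
print(filled_after.loc['Ohio', 'c'])    # NaN was replaced: 0.0
```

The two results also differ at ('Colorado', 'e'), which is missing from both inputs: it stays NaN in filled_before but becomes 0.0 in filled_after.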

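Commonly used arithmetic methods are add, sub, mul, div, floordiv, and pow; each also has a reversed counterpart prefixed with r, such as radd and rsub.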
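
A minimal sketch of how a reversed arithmetic method relates to its operator form, assuming a small illustrative frame df that is not part of the examples above: df.rdiv(1) computes 1 / df, and df.rsub(1) computes 1 - df.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with no zeros, so the division is well defined.
df = pd.DataFrame(np.arange(1., 5.).reshape((2, 2)), columns=list('ab'))

recip_op = 1 / df            # operator form
recip_method = df.rdiv(1)    # reversed method: 1 divided by df
diff_method = df.rsub(1)     # reversed method: 1 minus df

print(recip_method)
print(diff_method)
```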

Operations Between DataFrame and Series

Operations between a DataFrame and a Series rely on broadcasting.

df = pd.DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
ser = df.iloc[0]
print(df)	==>
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
print(ser)	==>
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64
print(df-ser)	==>
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0

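Here the subtraction is performed once for each row because of broadcasting. By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows.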
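
When the index of the Series does not match the columns of the DataFrame exactly, the two are aligned on the union of the labels, just as with ordinary alignment. The sketch below assumes a hypothetical Series ser2 whose labels only partly overlap the columns of df.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Hypothetical Series: 'b' and 'e' match columns of df, 'f' does not,
# and df's column 'd' has no counterpart in ser2.
ser2 = pd.Series(range(3), index=['b', 'e', 'f'])

result = df + ser2
print(result)
```

Columns 'd' and 'f' exist on only one side, so they are entirely NaN in the result.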

To broadcast over the columns instead, matching on the rows, you must use one of the arithmetic methods:

df = pd.DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
ser = df['d']
print(df)	==>
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
print(ser)	==>
Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64
print(df.sub(ser,axis='index'))	==>
          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0

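The axis argument names the axis to match on; here axis='index' matches on the row index and broadcasts across the columns.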

Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work on pandas objects:

frame = pd.DataFrame(np.random.randn(4,3), columns=list('bde'), index=['Utah','Ohio','Texas','Oregon'])
print(frame)	==>
               b         d         e
Utah    0.823234 -1.230791  0.679275
Ohio    0.820003 -0.280677 -0.684320
Texas  -0.373359 -1.338919 -0.629548
Oregon -0.059928 -2.414072  0.476505
print(np.abs(frame))	==>
               b         d         e
Utah    0.823234  1.230791  0.679275
Ohio    0.820003  0.280677  0.684320
Texas   0.373359  1.338919  0.629548
Oregon  0.059928  2.414072  0.476505

Another common operation is applying a custom function to the one-dimensional arrays from each column or row. The apply method of DataFrame does this.

frame = pd.DataFrame(np.random.randn(4,3), columns=list('bde'), index=['Utah','Ohio','Texas','Oregon'])
print(frame)	==>
               b         d         e
Utah    0.371684 -0.428336  0.366052
Ohio    0.508354 -1.108785 -1.001388
Texas  -0.751085  0.879442 -1.595899
Oregon -0.270006 -0.010182 -0.677792
f = lambda x: x.max()-x.min()
print(frame.apply(f))	==>
b    1.259439
d    1.988227
e    1.961951
dtype: float64

Here f computes the difference between the maximum and minimum of each column of frame; the result is a Series indexed by the columns of frame.

To call the function once per row instead (that is, across the columns), pass axis='columns'.
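
A short sketch of the axis='columns' behavior: the function receives one row at a time, so the result is indexed by the row labels. A deterministic frame is assumed here in place of the random one so that the values are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame used in the text.
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

f = lambda x: x.max() - x.min()

# With axis='columns', f is called once per row (spanning the columns).
row_range = frame.apply(f, axis='columns')
print(row_range)
```

Each row holds three consecutive numbers, so every entry of row_range is 2.0.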

The function passed to apply does not have to return a scalar; it can also return a Series with multiple values.

def f(x):
    return pd.Series([x.min(),x.max()],index=['min','max'])
frame = pd.DataFrame(np.random.randn(4,3), columns=list('bde'), index=['Utah','Ohio','Texas','Oregon'])
print(frame)	==>
               b         d         e
Utah    0.901981  1.239383  1.412554
Ohio    1.584249  0.120927 -0.849564
Texas  -0.720863  0.607950 -0.481181
Oregon  0.641740  1.217635 -0.589586
print(frame.apply(f))	==>
            b         d         e
min -0.720863  0.120927 -0.849564
max  1.584249  1.239383  1.412554

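Element-wise Python functions can be used too. To compute a formatted string from each floating-point value in frame, use the applymap method (deprecated since pandas 2.1 in favor of DataFrame.map).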

frame = pd.DataFrame(np.random.randn(4,3), columns=list('bde'), index=['Utah','Ohio','Texas','Oregon'])
print(frame)	==>
               b         d         e
Utah   -0.762464 -0.076285 -0.177307
Ohio    0.755204 -2.211331 -0.259672
Texas   0.233797  0.539000  1.620073
Oregon  0.738784 -0.342989 -0.006291
fmt = lambda x: '%.2f' % x  # named fmt to avoid shadowing the built-in format
print(frame.applymap(fmt))	==>
            b      d      e
Utah    -0.76  -0.08  -0.18
Ohio     0.76  -2.21  -0.26
Texas    0.23   0.54   1.62
Oregon   0.74  -0.34  -0.01
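
For a single column, Series has its own element-wise method, map. The sketch below applies the same kind of formatting function to one column; the two-row frame and its values are illustrative assumptions chosen so the output is deterministic.

```python
import pandas as pd

# Small illustrative frame with fixed values instead of random ones.
frame = pd.DataFrame({'b': [0.823234, 0.820003],
                      'e': [0.679275, -0.684320]},
                     index=['Utah', 'Ohio'])

fmt = lambda x: '%.2f' % x

# Series.map applies fmt to each element of column 'e'.
formatted = frame['e'].map(fmt)
print(formatted)
```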

Sorting and Ranking

The sort_index method sorts by row or column index and returns a new object.

obj = pd.Series(range(4),index=['d','a','b','c'])
print(obj)	==>
d    0
a    1
b    2
c    3
dtype: int64
print(obj.sort_index())	==>
a    1
b    2
c    3
d    0
dtype: int64

Data is sorted in ascending order along the axis by default.

To sort by column labels, pass axis=1; to sort in descending order, pass ascending=False.

frame = pd.DataFrame(np.arange(8).reshape((2,4)),index=['three','one'],columns=['d','a','b','c'])
print(frame)	==>
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
print(frame.sort_index())	==>
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
print(frame.sort_index(axis=1))	==>
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
print(frame.sort_index(axis=1,ascending=False))	==>
       d  c  b  a
three  0  3  2  1
one    4  7  6  5

To sort by values, use the sort_values() method. By default, all missing values (NaN) are sorted to the end.

obj = pd.Series([4, np.nan,7, np.nan,-3,2])
print(obj)	==>
0    4.0
1    NaN
2    7.0
3    NaN
4   -3.0
5    2.0
dtype: float64
print(obj.sort_values())	==>
4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
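
If the missing values should come first instead, sort_values accepts a na_position argument. This is a small sketch on the same Series; na_position is not mentioned in the text above and is shown here only as a related option.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# Place NaN entries at the start; the remaining values stay in ascending order.
nan_first = obj.sort_values(na_position='first')
print(nan_first)
```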

When sorting a DataFrame, you can use one or more columns as the sort keys by passing them to the by parameter.

frame = pd.DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})
print(frame)	==>
   b  a
0  4  0
1  7  1
2 -3  0
3  2  1
print(frame.sort_values(by='b'))	==>
   b  a
2 -3  0
3  2  1
0  4  0
1  7  1
print(frame.sort_values(by=['a','b']))	==>
   b  a
2 -3  0
0  4  0
3  2  1
1  7  1

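Ranking assigns ranks from 1 up to the number of valid data points. By default, rank breaks ties by giving each tied value the average rank of its group, so two values tied for second and third place both receive 2.5. To rank tied values by the order in which they appear in the data, pass method='first'.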

obj = pd.Series([7,-5,7,4,2,0,4])
print(obj)	==>
0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64
print(obj.rank())	==>
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
print(obj.rank(method='first'))	==>
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
print(obj.rank(ascending=False, method='max'))	==>
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

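The method parameter accepts the following tie-breaking options: 'average' (the default) assigns the mean rank of the tied group; 'min' assigns the lowest rank of the group; 'max' assigns the highest rank of the group; 'first' assigns ranks in the order the values appear in the data; 'dense' is like 'min', but ranks always increase by 1 between groups rather than by the number of tied elements.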
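
As a sketch of how tie-breaking options differ, the example below ranks the same Series with method='min' and method='dense', two values of the method parameter that the rank examples above do not show.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# 'min': every tied value gets the lowest rank of its group.
rank_min = obj.rank(method='min')

# 'dense': like 'min', but the next group is always exactly 1 higher.
rank_dense = obj.rank(method='dense')

print(rank_min.tolist())    # [6.0, 1.0, 6.0, 4.0, 3.0, 2.0, 4.0]
print(rank_dense.tolist())  # [5.0, 1.0, 5.0, 4.0, 3.0, 2.0, 4.0]
```

With 'min' the two 7s share rank 6 because ranks 4 and 5 were both consumed by the tied 4s; with 'dense' they share rank 5.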

The same can be done on a DataFrame, which can compute ranks over the rows or the columns.

frame = pd.DataFrame({'b':[4.3,7,-3,2],'a':[0,1,0,1],'c':[-2,5,8,-2.5]})
print(frame)	==>
     b  a    c
0  4.3  0 -2.0
1  7.0  1  5.0
2 -3.0  0  8.0
3  2.0  1 -2.5
print(frame.rank(axis='columns'))	==>
     b    a    c
0  3.0  2.0  1.0
1  3.0  1.0  2.0
2  1.0  2.0  3.0
3  3.0  2.0  1.0

Axis Indexes with Duplicate Labels

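Duplicate labels can make code considerably more complicated, because the type of the result of indexing depends on whether the label is repeated. Indexing a label that matches several entries returns a Series, while a label that matches a single entry returns a scalar.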
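
A minimal sketch of this behavior, assuming an illustrative Series whose index repeats the labels 'a' and 'b'. The is_unique property of the index reports whether any label is duplicated.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=['a', 'a', 'b', 'b', 'c'])

print(obj.index.is_unique)   # False: 'a' and 'b' are repeated

dup = obj['a']      # label matches two entries -> Series
single = obj['c']   # label matches one entry -> scalar

print(type(dup).__name__, len(dup))
print(single)
```

Code that indexes by label therefore has to handle both return types unless the index is known to be unique.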