第三章 索引


本章數據:

import numpy as np
import pandas as pd

一、索引器

1. 表的列索引

列索引是最常見的索引形式,一般通過[]來實現。通過[列名]可以從DataFrame中取出相應的列,返回值為Series,例如從表中取出姓名一列:

df = pd.read_csv('../data/learn_pandas.csv', usecols = ['School', 'Grade', 'Name', 'Gender', 'Weight', 'Transfer'])
df['Name'].head()
0      Gaopeng Yang
1    Changqiang You
2           Mei Sun
3      Xiaojuan Sun
4       Gaojuan You
Name: Name, dtype: object

如果要取出多個列,則可以通過[列名組成的列表],其返回值為一個DataFrame,例如從表中取出性別和姓名兩列:

df[['Gender', 'Name']].head()



Gender

Name

0

Female

Gaopeng Yang

1

Male

Changqiang You

2

Male

Mei Sun

3

Female

Xiaojuan Sun

4

Male

Gaojuan You

此外,若要取出單列,且列名中不包含空格,則可以用.列名取出,這和[列名]是等價的:

df.Name.head()
0      Gaopeng Yang
1    Changqiang You
2           Mei Sun
3      Xiaojuan Sun
4       Gaojuan You
Name: Name, dtype: object

2. 序列的行索引

【a】以字符串為索引的Series

如果取出單個索引的對應元素,則可以使用[item],若Series只有單個值對應,則返回這個標量值,如果有多個值對應,則返回一個Series

s = pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'a', 'a', 'a', 'c'])
s['a']
a    1
a    3
a    4
a    5
dtype: int64
s['b']
2

如果取出多個索引的對應元素,則可以使用[items的列表]

s[['c', 'b']]
c    6
b    2
dtype: int64

如果想要取出某兩個索引之間的元素,並且這兩個索引是在整個索引中唯一出現,則可以使用切片,,同時需要注意這裏的切片會包含兩個端點:

s['c': 'b' :-2]
c    6
a    4
b    2
dtype: int64

如果前後端點的值重複出現,那麼需要經過排序才能使用切片:

try:
    s['a': 'b']
except Exception as e:
    Err_Msg = e
Err_Msg
KeyError("Cannot get left slice bound for non-unique label: 'a'")
s.sort_index()['a': 'b']
a    1
a    3
a    4
a    5
b    2
dtype: int64

【b】以整數為索引的Series

在使用數據的讀入函數時,如果不特別指定所對應的列作為索引,那麼會生成從0開始的整數索引作為默認索引。當然,任意一組符合長度要求的整數都可以作為索引。

和字符串一樣,如果使用[int][int_list],則可以取出對應索引元素的值:

s = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'], index=[1, 3, 1, 2, 5, 4])
s[1]
1    a
1    c
dtype: object
s[[2,3]]
2    d
3    b
dtype: object

如果使用整數切片,則會取出對應索引位置的值,注意這裏的整數切片同Python中的切片一樣不包含右端點:

s[1:-1]
3    b
1    c
2    d
5    e
dtype: object
【WARNING】關於索引類型的説明

如果不想陷入麻煩,那麼請不要把純浮點以及任何混合類型(字符串、整數、浮點類型等的混合)作為索引,否則可能會在具體的操作時報錯或者返回非預期的結果,並且在實際的數據分析中也不存在這樣做的動機。

【END】

3. loc索引器

前面講到了對DataFrame的列進行選取,下面要討論其行的選取。對於表而言,有兩種索引器,一種是基於元素loc索引器,另一種是基於位置iloc索引器。

loc索引器的一般形式是loc[*, *],其中第一個*代表行的選擇,第二個*代表列的選擇,如果省略第二個位置寫作loc[*],這個*是指行的篩選。其中,*的位置一共有五類合法對象,分別是:單個元素、元素列表、元素切片、布爾列表以及函數,下面將依次説明。

為了演示相應操作,先利用set_index方法把Name列設為索引,關於該函數的其他用法將在多級索引一章介紹。

df_demo = df.set_index('Name')
df_demo.head()



School

Grade

Gender

Weight

Transfer

Name

Gaopeng Yang

Shanghai Jiao Tong University

Freshman

Female

46.0

N

Changqiang You

Peking University

Freshman

Male

70.0

N

Mei Sun

Shanghai Jiao Tong University

Senior

Male

89.0

N

Xiaojuan Sun

Fudan University

Sophomore

Female

41.0

N

Gaojuan You

Fudan University

Sophomore

Male

74.0

N

【a】*為單個元素

此時,直接取出相應的行或列,如果該元素在索引中重複則結果為DataFrame,否則為Series

df_demo.loc['Qiang Sun'] # 多個人叫此名字



School

Grade

Gender

Weight

Transfer

Name

Qiang Sun

Tsinghua University

Junior

Female

53.0

N

Qiang Sun

Tsinghua University

Sophomore

Female

40.0

N

Qiang Sun

Shanghai Jiao Tong University

Junior

Female

NaN

N

df_demo.loc['Quan Zhao'] # 名字唯一
School      Shanghai Jiao Tong University
Grade                              Junior
Gender                             Female
Weight                                 53
Transfer                                N
Name: Quan Zhao, dtype: object

也可以同時選擇行和列:

df_demo.loc['Qiang Sun', 'School'] # 返回Series
Name
Qiang Sun              Tsinghua University
Qiang Sun              Tsinghua University
Qiang Sun    Shanghai Jiao Tong University
Name: School, dtype: object
df_demo.loc['Quan Zhao', 'School'] # 返回單個元素
'Shanghai Jiao Tong University'

【b】*為元素列表

此時,取出列表中所有元素值對應的行或列:

df_demo.loc[['Qiang Sun','Quan Zhao'], ['School','Gender']]



School

Gender

Name

Qiang Sun

Tsinghua University

Female

Qiang Sun

Tsinghua University

Female

Qiang Sun

Shanghai Jiao Tong University

Female

Quan Zhao

Shanghai Jiao Tong University

Female

【c】*為切片

之前的Series使用字符串索引時提到,如果是唯一值的起點和終點字符,那麼就可以使用切片,並且包含兩個端點,如果不唯一則報錯:

df_demo.loc['Gaojuan You':'Gaoqiang Qian', 'School':'Gender']



School

Grade

Gender

Name

Gaojuan You

Fudan University

Sophomore

Male

Xiaoli Qian

Tsinghua University

Freshman

Female

Qiang Chu

Shanghai Jiao Tong University

Freshman

Female

Gaoqiang Qian

Tsinghua University

Junior

Female

需要注意的是,如果DataFrame使用整數索引,其使用整數切片的時候和上面字符串索引的要求一致,都是元素切片,包含端點且起點、終點不允許有重複值。

df_loc_slice_demo = df_demo.copy()
df_loc_slice_demo.index = range(df_demo.shape[0],0,-1)
df_loc_slice_demo.loc[5:3]
# df_loc_slice_demo



School

Grade

Gender

Weight

Transfer

5

Fudan University

Junior

Female

46.0

N

4

Tsinghua University

Senior

Female

50.0

N

3

Shanghai Jiao Tong University

Senior

Female

45.0

N

df_loc_slice_demo.loc[3:5] # 沒有返回,説明不是整數位置切片



School

Grade

Gender

Weight

Transfer

【d】*為布爾列表

在實際的數據處理中,根據條件來篩選行是極其常見的,此處傳入loc的布爾列表與DataFrame長度相同,且列表為True的位置所對應的行會被選中,False則會被剔除。

例如,選出體重超過70kg的學生:

df_demo.loc[df_demo.Weight>70].head()



School

Grade

Gender

Weight

Transfer

Name

Mei Sun

Shanghai Jiao Tong University

Senior

Male

89.0

N

Gaojuan You

Fudan University

Sophomore

Male

74.0

N

Xiaopeng Zhou

Shanghai Jiao Tong University

Freshman

Male

74.0

N

Xiaofeng Sun

Tsinghua University

Senior

Male

71.0

N

Qiang Zheng

Shanghai Jiao Tong University

Senior

Male

87.0

N

前面所提到的傳入元素列表,也可以通過isin方法返回的布爾列表等價寫出,例如選出所有大一和大四的同學信息:

df_demo.loc[df_demo.Grade.isin(['Freshman', 'Senior'])].head()



School

Grade

Gender

Weight

Transfer

Name

Gaopeng Yang

Shanghai Jiao Tong University

Freshman

Female

46.0

N

Changqiang You

Peking University

Freshman

Male

70.0

N

Mei Sun

Shanghai Jiao Tong University

Senior

Male

89.0

N

Xiaoli Qian

Tsinghua University

Freshman

Female

51.0

N

Qiang Chu

Shanghai Jiao Tong University

Freshman

Female

52.0

N

對於複合條件而言,可以用|(或), &(且), ~(取反)的組合來實現,例如選出復旦大學中體重超過70kg的大四學生,或者北大男生中體重超過80kg的非大四的學生:

condition_1_1 = df_demo.School == 'Fudan University'
condition_1_2 = df_demo.Grade == 'Senior'
condition_1_3 = df_demo.Weight > 70
condition_1 = condition_1_1 & condition_1_2 & condition_1_3
condition_2_1 = df_demo.School == 'Peking University'
condition_2_2 = df_demo.Grade == 'Senior'
condition_2_3 = df_demo.Weight > 80
condition_2 = condition_2_1 & (~condition_2_2) & condition_2_3
df_demo.loc[condition_1 | condition_2]



School

Grade

Gender

Weight

Transfer

Name

Qiang Han

Peking University

Freshman

Male

87.0

N

Chengpeng Zhou

Fudan University

Senior

Male

81.0

N

Changpeng Zhao

Peking University

Freshman

Male

83.0

N

Chengpeng Qian

Fudan University

Senior

Male

73.0

Y

【練一練】

select_dtypes是一個實用函數,它能夠從表中選出相應類型的列,若要選出所有數值型的列,只需使用.select_dtypes('number'),請利用布爾列表選擇的方法結合DataFramedtypes屬性在learn_pandas數據集上實現這個功能。

learn_pandas = pd.read_csv('../data/learn_pandas.csv')
learn_pandas.select_dtypes('number').head()



Height

Weight

Test_Number

0

158.9

46.0

1

1

166.5

70.0

1

2

188.9

89.0

2

3

NaN

41.0

2

4

174.0

74.0

2

【END】

【e】*為函數

這裏的函數,必須以前面的四種合法形式之一為返回值,並且函數的輸入值為DataFrame本身。假設仍然是上述複合條件篩選的例子,可以把邏輯寫入一個函數中再返回,需要注意的是函數的形式參數x本質上即為df_demo

def condition(x):
    condition_1_1 = x.School == 'Fudan University'
    condition_1_2 = x.Grade == 'Senior'
    condition_1_3 = x.Weight > 70
    condition_1 = condition_1_1 & condition_1_2 & condition_1_3
    condition_2_1 = x.School == 'Peking University'
    condition_2_2 = x.Grade == 'Senior'
    condition_2_3 = x.Weight > 80
    condition_2 = condition_2_1 & (~condition_2_2) & condition_2_3
    result = condition_1 | condition_2
    return result
df_demo.loc[condition]



School

Grade

Gender

Weight

Transfer

Name

Qiang Han

Peking University

Freshman

Male

87.0

N

Chengpeng Zhou

Fudan University

Senior

Male

81.0

N

Changpeng Zhao

Peking University

Freshman

Male

83.0

N

Chengpeng Qian

Fudan University

Senior

Male

73.0

Y

此外,還支持使用lambda表達式,其返回值也同樣必須是先前提到的四種形式之一:

df_demo.loc[lambda x:'Quan Zhao', lambda x:'Gender']
'Female'

由於函數無法返回如start: end: step的切片形式,故返回切片時要用slice對象進行包裝:

df_demo.loc[lambda x: slice('Gaojuan You', 'Gaoqiang Qian')]



School

Grade

Gender

Weight

Transfer

Name

Gaojuan You

Fudan University

Sophomore

Male

74.0

N

Xiaoli Qian

Tsinghua University

Freshman

Female

51.0

N

Qiang Chu

Shanghai Jiao Tong University

Freshman

Female

52.0

N

Gaoqiang Qian

Tsinghua University

Junior

Female

50.0

N

最後需要指出的是,對於Series也可以使用loc索引,其遵循的原則與DataFrame中用於行篩選的loc[*]完全一致,此處不再贅述。

【WARNING】不要使用鏈式賦值

在對錶或者序列賦值時,應當在使用一層索引器後直接進行賦值操作,這樣做是由於進行多次索引後賦值是賦在臨時返回的copy副本上的,而沒有真正修改元素從而報出SettingWithCopyWarning警告。例如,下面給出的例子:

df_chain = pd.DataFrame([[0,0],[1,0],[-1,0]], columns=list('AB'))
df_chain



A

B

0

0

0

1

1

0

2

-1

0

import warnings
with warnings.catch_warnings():
    warnings.filterwarnings('error')
    try:
        df_chain[df_chain.A!=0].B = 1 # 使用方括號列索引後,再使用點的列索引
    except Warning as w:
        Warning_Msg = w
print(Warning_Msg)
df_chain
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A

B

0

0

0

1

1

0

2

-1

0

df_chain.loc[df_chain.A!=0,'B'] = 1
df_chain
【END】

4. iloc索引器

iloc的使用與loc完全類似,只不過是針對位置進行篩選,在相應的*位置處一共也有五類合法對象,分別是:整數、整數列表、整數切片、布爾列表以及函數,函數的返回值必須是前面的四類合法對象中的一個,其輸入同樣也為DataFrame本身。

df_demo.iloc[1, 1] # 第二行第二列
'Freshman'
df_demo.iloc[[0, 1], [0, 1]] # 前兩行前兩列



School

Grade

Name

Gaopeng Yang

Shanghai Jiao Tong University

Freshman

Changqiang You

Peking University

Freshman

df_demo.iloc[1: 4, 2:4] # 切片不包含結束端點



Gender

Weight

Name

Changqiang You

Male

70.0

Mei Sun

Male

89.0

Xiaojuan Sun

Female

41.0

df_demo.iloc[lambda x: slice(1, 4)] # 傳入切片為返回值的函數



School

Grade

Gender

Weight

Transfer

Name

Changqiang You

Peking University

Freshman

Male

70.0

N

Mei Sun

Shanghai Jiao Tong University

Senior

Male

89.0

N

Xiaojuan Sun

Fudan University

Sophomore

Female

41.0

N

在使用布爾列表的時候要特別注意,不能傳入Series而必須傳入序列的values,否則會報錯。因此,在使用布爾篩選的時候還是應當優先考慮loc的方式。

例如,選出體重超過80kg的學生:

(df_demo.Weight>80).values[0:5]
array([False, False,  True, False, False])
df_demo.iloc[(df_demo.Weight>80).values].head()



School

Grade

Gender

Weight

Transfer

Name

Mei Sun

Shanghai Jiao Tong University

Senior

Male

89.0

N

Qiang Zheng

Shanghai Jiao Tong University

Senior

Male

87.0

N

Qiang Han

Peking University

Freshman

Male

87.0

N

Chengpeng Zhou

Fudan University

Senior

Male

81.0

N

Feng Han

Shanghai Jiao Tong University

Sophomore

Male

82.0

N

Series而言同樣也可以通過iloc返回相應位置的值或子序列:

df_demo.School.iloc[1]
'Peking University'
df_demo.School.iloc[1:5:2]
Name
Changqiang You    Peking University
Xiaojuan Sun       Fudan University
Name: School, dtype: object

5. query方法

pandas中,支持把字符串形式的查詢表達式傳入query方法來查詢數據,其表達式的執行結果必須返回布爾列表。在進行復雜索引時,由於這種檢索方式無需像普通方法一樣重複使用DataFrame的名字來引用列名,一般而言會使代碼長度在不降低可讀性的前提下有所減少。

例如,將loc一節中的複合條件查詢例子可以如下改寫:

df.query('((School == "Fudan University")&'
         ' (Grade == "Senior")&'
         ' (Weight > 70))|'
         '((School == "Peking University")&'
         ' (Grade != "Senior")&'
         ' (Weight > 80))')



School

Grade

Name

Gender

Weight

Transfer

38

Peking University

Freshman

Qiang Han

Male

87.0

N

66

Fudan University

Senior

Chengpeng Zhou

Male

81.0

N

99

Peking University

Freshman

Changpeng Zhao

Male

83.0

N

131

Fudan University

Senior

Chengpeng Qian

Male

73.0

Y

df.query('((School == "Fudan University")& (Grade == "Senior")& (Weight > 70))| ((School == "Peking University")&(Grade != "Senior")&(Weight > 80))')



School

Grade

Name

Gender

Weight

Transfer

38

Peking University

Freshman

Qiang Han

Male

87.0

N

66

Fudan University

Senior

Chengpeng Zhou

Male

81.0

N

99

Peking University

Freshman

Changpeng Zhao

Male

83.0

N

131

Fudan University

Senior

Chengpeng Qian

Male

73.0

Y

query表達式中,幫用户註冊了所有來自DataFrame的列名,所有屬於該Series的方法都可以被調用,和正常的函數調用並沒有區別,例如查詢體重超過均值的學生:

df.query('Weight > Weight.mean()').head()



School

Grade

Name

Gender

Weight

Transfer

1

Peking University

Freshman

Changqiang You

Male

70.0

N

2

Shanghai Jiao Tong University

Senior

Mei Sun

Male

89.0

N

4

Fudan University

Sophomore

Gaojuan You

Male

74.0

N

10

Shanghai Jiao Tong University

Freshman

Xiaopeng Zhou

Male

74.0

N

14

Tsinghua University

Senior

Xiaomei Zhou

Female

57.0

N

【NOTE】query中引用帶空格的列名

對於含有空格的列名,需要使用`col name`的方式進行引用。

【END】

同時,在query中還註冊了若干英語的字面用法,幫助提高可讀性,例如:or, and, or, is in, not in。例如,篩選出男生中不是大一大二的學生:

df.query('(Grade not in ["Freshman", "Sophomore"]) and (Gender == "Male")').head()



School

Grade

Name

Gender

Weight

Transfer

2

Shanghai Jiao Tong University

Senior

Mei Sun

Male

89.0

N

16

Tsinghua University

Junior

Xiaoqiang Qin

Male

68.0

N

17

Tsinghua University

Junior

Peng Wang

Male

65.0

N

18

Tsinghua University

Senior

Xiaofeng Sun

Male

71.0

N

21

Shanghai Jiao Tong University

Senior

Xiaopeng Shen

Male

62.0

NaN

此外,在字符串中出現與列表的比較時,==!=分別表示元素出現在列表和沒有出現在列表,等價於is innot in,例如查詢所有大三和大四的學生:

df.query('Grade == ["Junior", "Senior"]').head()



School

Grade

Name

Gender

Weight

Transfer

2

Shanghai Jiao Tong University

Senior

Mei Sun

Male

89.0

N

7

Tsinghua University

Junior

Gaoqiang Qian

Female

50.0

N

9

Peking University

Junior

Juan Xu

Female

NaN

N

11

Tsinghua University

Junior

Xiaoquan Lv

Female

43.0

N

12

Shanghai Jiao Tong University

Senior

Peng You

Female

48.0

NaN

對於query中的字符串,如果要引用外部變量,只需在變量名前加@符號。例如,取出體重位於70kg到80kg之間的學生:

low, high =70, 80
df.query('Weight.between(@low, @high)').head()



School

Grade

Name

Gender

Weight

Transfer

1

Peking University

Freshman

Changqiang You

Male

70.0

N

4

Fudan University

Sophomore

Gaojuan You

Male

74.0

N

10

Shanghai Jiao Tong University

Freshman

Xiaopeng Zhou

Male

74.0

N

18

Tsinghua University

Senior

Xiaofeng Sun

Male

71.0

N

35

Peking University

Freshman

Gaoli Zhao

Male

78.0

N

6. 隨機抽樣

如果把DataFrame的每一行看作一個樣本,或把每一列看作一個特徵,再把整個DataFrame看作總體,想要對樣本或特徵進行隨機抽樣就可以用sample函數。有時在拿到大型數據集後,想要對統計特徵進行計算來了解數據的大致分佈,但是這很費時間。同時,由於許多統計特徵在等概率不放回的簡單隨機抽樣條件下,是總體統計特徵的無偏估計,比如樣本均值和總體均值,那麼就可以先從整張表中抽出一部分來做近似估計。

sample函數中的主要參數為n, axis, frac, replace, weights,前三個分別是指抽樣數量、抽樣的方向(0為行、1為列)和抽樣比例(0.3則為從總體中抽出30%的樣本)。

replaceweights分別是指是否放回和每個樣本的抽樣相對概率,當replace = True則表示有放回抽樣。例如,對下面構造的df_samplevalue值的相對大小為抽樣概率進行有放回抽樣,抽樣數量為3。

df_sample = pd.DataFrame({'id': list('abcde'), 'value': [1, 2, 3, 4, 91]})
df_sample



id

value

0

a

1

1

b

2

2

c

3

3

d

4

4

e

91

df_sample.sample(3, replace = True, weights = df_sample.value)



id

value

3

d

4

4

e

91

4

e

91

二、多級索引

1. 多級索引及其表的結構

為了更加清晰地説明具有多級索引的DataFrame結構,下面新構造一張表,讀者可以忽略這裏的構造方法,它們將會在第4小節被更詳細地講解。

np.random.seed(0)
multi_index = pd.MultiIndex.from_product([list('ABCD'), df.Gender.unique()], names=('School', 'Gender'))
multi_column = pd.MultiIndex.from_product([['Height', 'Weight'], df.Grade.unique()], names=('Indicator', 'Grade'))
df_multi = pd.DataFrame(np.c_[(np.random.randn(8,4)*5 + 163).tolist(), (np.random.randn(8,4)*5 + 65).tolist()],
                        index = multi_index, columns = multi_column).round(1)
df_multi



Indicator

Height

Weight

Grade

Freshman

Senior

Sophomore

Junior

Freshman

Senior

Sophomore

Junior

School

Gender

A

Female

171.8

165.0

167.9

174.2

60.6

55.1

63.3

65.8

Male

172.3

158.1

167.8

162.2

71.2

71.0

63.1

63.5

B

Female

162.5

165.1

163.7

170.3

59.8

57.9

56.5

74.8

Male

166.8

163.6

165.2

164.7

62.5

62.8

58.7

68.9

C

Female

170.5

162.0

164.6

158.7

56.9

63.9

60.5

66.9

Male

150.2

166.3

167.3

159.3

62.4

59.1

64.9

67.1

D

Female

174.3

155.7

163.2

162.1

65.3

66.5

61.8

63.2

Male

170.7

170.3

163.8

164.9

61.6

63.2

60.9

56.4

下圖通過顏色區分,標記了DataFrame的結構。與單層索引的表一樣,具備元素值、行索引和列索引三個部分。其中,這裏的行索引和列索引都是MultiIndex類型,只不過索引中的一個元素是元組而不是單層索引中的標量。例如,行索引的第四個元素為("B", "Male"),列索引的第二個元素為("Height", "Senior"),這裏需要注意,外層連續出現相同的值時,第一次之後出現的會被隱藏顯示,使結果的可讀性增強。


與單層索引類似,MultiIndex也具有名字屬性,圖中的SchoolGender分別對應了表的第一層和第二層行索引的名字,IndicatorGrade分別對應了第一層和第二層列索引的名字。

索引的名字和值屬性分別可以通過namesvalues獲得:

df_multi.index.names
FrozenList(['School', 'Gender'])
df_multi.columns.names
FrozenList(['Indicator', 'Grade'])
df_multi.index.values
array([('A', 'Female'), ('A', 'Male'), ('B', 'Female'), ('B', 'Male'),
       ('C', 'Female'), ('C', 'Male'), ('D', 'Female'), ('D', 'Male')],
      dtype=object)
df_multi.columns.values
array([('Height', 'Freshman'), ('Height', 'Senior'),
       ('Height', 'Sophomore'), ('Height', 'Junior'),
       ('Weight', 'Freshman'), ('Weight', 'Senior'),
       ('Weight', 'Sophomore'), ('Weight', 'Junior')], dtype=object)

如果想要得到某一層的索引,則需要通過get_level_values獲得:

df_multi.index.get_level_values(1)
Index(['Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male'], dtype='object', name='Gender')

但對於索引而言,無論是單層還是多層,用户都無法通過index_obj[0] = item的方式來修改元素,也不能通過index_name[0] = new_name的方式來修改名字,關於如何修改這些屬性的話題將在第三節被討論。

2. 多級索引中的loc索引器

熟悉了結構後,現在回到原表,將學校和年級設為索引,此時的行為多級索引,列為單級索引,由於默認狀態的列索引不含名字,因此對應於剛剛圖中IndicatorGrade的索引名位置是空缺的。

df_multi = df.set_index(['School', 'Grade'])
df_multi.head()



Name

Gender

Weight

Transfer

School

Grade

Shanghai Jiao Tong University

Freshman

Gaopeng Yang

Female

46.0

N

Peking University

Freshman

Changqiang You

Male

70.0

N

Shanghai Jiao Tong University

Senior

Mei Sun

Male

89.0

N

Fudan University

Sophomore

Xiaojuan Sun

Female

41.0

N

Sophomore

Gaojuan You

Male

74.0

N

由於多級索引中的單個元素以元組為單位,因此之前在第一節介紹的 lociloc 方法完全可以照搬,只需把標量的位置替換成對應的元組。

當傳入元組列表或單個元組或返回前二者的函數時,需要先進行索引排序以避免性能警告:

with warnings.catch_warnings():
    warnings.filterwarnings('error')
    try:
        df_multi.loc[('Fudan University', 'Junior')].head()
    except Warning as w:
        Warning_Msg = w
Warning_Msg
pandas.errors.PerformanceWarning('indexing past lexsort depth may impact performance.')
df_sorted = df_multi.sort_index()
df_sorted.loc[('Fudan University', 'Junior')].head()



Name

Gender

Weight

Transfer

School

Grade

Fudan University

Junior

Yanli You

Female

48.0

N

Junior

Chunqiang Chu

Male

72.0

N

Junior

Changfeng Lv

Male

76.0

N

Junior

Yanjuan Lv

Female

49.0

NaN

Junior

Gaoqiang Zhou

Female

43.0

N

df_sorted.loc[[('Fudan University', 'Senior'), ('Shanghai Jiao Tong University', 'Freshman')]]



Name

Gender

Weight

Transfer

School

Grade

Fudan University

Senior

Chengpeng Zheng

Female

38.0

N

Senior

Feng Zhou

Female

47.0

N

Senior

Gaomei Lv

Female

34.0

N

Senior

Chunli Lv

Female

56.0

N

Senior

Chengpeng Zhou

Male

81.0

N

Senior

Gaopeng Qin

Female

52.0

N

Senior

Chunjuan Xu

Female

47.0

N

Senior

Juan Zhang

Female

47.0

N

Senior

Chengpeng Qian

Male

73.0

Y

Senior

Xiaojuan Qian

Female

50.0

N

Senior

Quan Xu

Female

44.0

N

Shanghai Jiao Tong University

Freshman

Gaopeng Yang

Female

46.0

N

Freshman

Qiang Chu

Female

52.0

N

Freshman

Xiaopeng Zhou

Male

74.0

N

Freshman

Yanpeng Lv

Male

65.0

N

Freshman

Xiaopeng Zhao

Female

53.0

N

Freshman

Chunli Zhao

Male

83.0

N

Freshman

Peng Zhang

Female

NaN

N

Freshman

Xiaoquan Sun

Female

40.0

N

Freshman

Chunmei Shi

Female

52.0

N

Freshman

Xiaomei Yang

Female

49.0

N

Freshman

Xiaofeng Qian

Female

49.0

N

Freshman

Changmei Lv

Male

75.0

N

Freshman

Qiang Feng

Male

80.0

N

df_sorted.loc[df_sorted.Weight > 70].head() # 布爾列表也是可用的



Name

Gender

Weight

Transfer

School

Grade

Fudan University

Freshman

Feng Wang

Male

74.0

N

Junior

Chunqiang Chu

Male

72.0

N

Junior

Changfeng Lv

Male

76.0

N

Senior

Chengpeng Zhou

Male

81.0

N

Senior

Chengpeng Qian

Male

73.0

Y

df_sorted.loc[lambda x:('Fudan University','Junior')].head()



Name

Gender

Weight

Transfer

School

Grade

Fudan University

Junior

Yanli You

Female

48.0

N

Junior

Chunqiang Chu

Male

72.0

N

Junior

Changfeng Lv

Male

76.0

N

Junior

Yanjuan Lv

Female

49.0

NaN

Junior

Gaoqiang Zhou

Female

43.0

N

當使用切片時需要注意,在單級索引中只要切片端點元素是唯一的,那麼就可以進行切片,但在多級索引中,無論元組在索引中是否重複出現,都必須經過排序才能使用切片,否則報錯:

try:
    df_multi.loc[('Fudan University', 'Senior'):].head()
except Exception as e:
    Err_Msg = e
Err_Msg
pandas.errors.UnsortedIndexError('Key length (2) was greater than MultiIndex lexsort depth (0)')
df_sorted.loc[('Fudan University', 'Senior'):].head()



Name

Gender

Weight

Transfer

School

Grade

Fudan University

Senior

Chengpeng Zheng

Female

38.0

N

Senior

Feng Zhou

Female

47.0

N

Senior

Gaomei Lv

Female

34.0

N

Senior

Chunli Lv

Female

56.0

N

Senior

Chengpeng Zhou

Male

81.0

N

df_unique = df.drop_duplicates(subset=['School','Grade']).set_index(['School', 'Grade'])
df_unique.head()



Name

Gender

Weight

Transfer

School

Grade

Shanghai Jiao Tong University

Freshman

Gaopeng Yang

Female

46.0

N

Peking University

Freshman

Changqiang You

Male

70.0

N

Shanghai Jiao Tong University

Senior

Mei Sun

Male

89.0

N

Fudan University

Sophomore

Xiaojuan Sun

Female

41.0

N

Tsinghua University

Freshman

Xiaoli Qian

Female

51.0

N

try:
    df_unique.loc[('Fudan University', 'Senior'):].head()
except Exception as e:
    Err_Msg = e
Err_Msg
pandas.errors.UnsortedIndexError('Key length (2) was greater than MultiIndex lexsort depth (0)')
df_unique.sort_index().loc[('Fudan University', 'Senior'):].head()



Name

Gender

Weight

Transfer

School

Grade

Fudan University

Senior

Chengpeng Zheng

Female

38.0

N

Sophomore

Xiaojuan Sun

Female

41.0

N

Peking University

Freshman

Changqiang You

Male

70.0

N

Junior

Juan Xu

Female

NaN

N

Senior

Changli Lv

Female

41.0

N

此外,在多級索引中的元組有一種特殊的用法,可以對多層的元素進行交叉組合後索引,但同時需要指定loc的列,全選則用:表示。其中,每一層需要選中的元素用列表存放,傳入loc的形式為[(level_0_list, level_1_list), cols]。例如,想要得到所有北大和復旦的大二大三學生,可以如下寫出:

res = df_multi.loc[(['Peking University', 'Fudan University'], ['Sophomore', 'Junior']), :]
res



Name

Gender

Weight

Transfer

School

Grade

Peking University

Sophomore

Changmei Xu

Female

43.0

N

Sophomore

Xiaopeng Qin

Male

NaN

N

Sophomore

Mei Xu

Female

39.0

N

Sophomore

Xiaoli Zhou

Female

55.0

N

Sophomore

Peng Han

Female

34.0

NaN

Junior

Juan Xu

Female

NaN

N

Junior

Changjuan You

Female

47.0

N

Junior

Gaoli Xu

Female

48.0

N

Junior

Gaoquan Zhou

Male

70.0

N

Junior

Qiang You

Female

56.0

N

Junior

Chengli Zhao

Male

NaN

NaN

Junior

Chengpeng Zhao

Female

44.0

N

Junior

Xiaofeng Zhao

Female

46.0

N

Fudan University

Sophomore

Xiaojuan Sun

Female

41.0

N

Sophomore

Gaojuan You

Male

74.0

N

Sophomore

Xiaoquan Zhang

Female

45.0

N

Sophomore

Mei Xu

Male

79.0

N

Sophomore

Chengqiang Lv

Female

53.0

N

Sophomore

Xiaojuan Chu

Male

68.0

N

Sophomore

Changqiang Qian

Male

64.0

N

Sophomore

Li Sun

Female

57.0

N

Junior

Yanli You

Female

48.0

N

Junior

Chunqiang Chu

Male

72.0

N

Junior

Changfeng Lv

Male

76.0

N

Junior

Yanjuan Lv

Female

49.0

NaN

Junior

Gaoqiang Zhou

Female

43.0

N

Junior

Xiaojuan Zhao

Female

49.0

N

Junior

Gaoquan Chu

Female

51.0

N

Junior

Chengli Sun

Male

62.0

N

Junior

Gaojuan Qian

Female

44.0

N

Junior

Xiaojuan Qian

Female

51.0

N

Junior

Chunjuan Zhang

Female

47.0

N

Junior

Xiaojuan Sun

Female

46.0

N

res.shape
(33, 4)

下面的語句和上面類似,但仍然傳入的是元素(這裏為元組)的列表,它們的意義是不同的,表示的是選出北大的大三學生和復旦的大二學生:

res = df_multi.loc[[('Peking University', 'Junior'), ('Fudan University', 'Sophomore')]]
res



Name

Gender

Weight

Transfer

School

Grade

Peking University

Junior

Juan Xu

Female

NaN

N

Junior

Changjuan You

Female

47.0

N

Junior

Gaoli Xu

Female

48.0

N

Junior

Gaoquan Zhou

Male

70.0

N

Junior

Qiang You

Female

56.0

N

Junior

Chengli Zhao

Male

NaN

NaN

Junior

Chengpeng Zhao

Female

44.0

N

Junior

Xiaofeng Zhao

Female

46.0

N

Fudan University

Sophomore

Xiaojuan Sun

Female

41.0

N

Sophomore

Gaojuan You

Male

74.0

N

Sophomore

Xiaoquan Zhang

Female

45.0

N

Sophomore

Mei Xu

Male

79.0

N

Sophomore

Chengqiang Lv

Female

53.0

N

Sophomore

Xiaojuan Chu

Male

68.0

N

Sophomore

Changqiang Qian

Male

64.0

N

Sophomore

Li Sun

Female

57.0

N

res.shape
(16, 4)

3. IndexSlice對象

前面介紹的方法,即使在索引不重複的時候,也只能對元組整體進行切片,而不能對每層進行切片,也不允許將切片和布爾列表混合使用,引入IndexSlice對象就能解決這個問題。Slice對象一共有兩種形式,第一種為loc[idx[*,*]]型,第二種為loc[idx[*,*],idx[*,*]]型,下面將進行介紹。為了方便演示,下面構造一個索引不重複的DataFrame

np.random.seed(0)
L1,L2 = ['A','B','C'],['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
L3,L4 = ['D','E','F'],['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(9,9)), index=mul_index1, columns=mul_index2)
df_ex



Big

D

E

F

Small

d

e

f

d

e

f

d

e

f

Upper

Lower

A

a

3

6

-9

-6

-6

-2

0

9

-5

b

-3

3

-8

-3

-2

5

8

-4

4

c

-1

0

7

-4

6

6

-9

9

-6

B

a

8

5

-2

-9

-8

0

-9

1

-6

b

2

9

-7

-9

-9

-5

-4

-3

-1

c

8

6

-5

0

1

-8

-8

-2

0

C

a

-6

-3

2

5

9

-9

5

-6

3

b

1

2

-5

-3

-5

6

-6

3

-5

c

-1

5

6

-6

6

4

7

8

-4

為了使用silce對象,先要進行定義:

idx = pd.IndexSlice

【a】loc[idx[*,*]]

這種情況並不能進行多層分別切片,前一個*表示行的選擇,後一個*表示列的選擇,與單純的loc是類似的:

df_ex.loc[idx['C':, ('D', 'f'):]]



Big

D

E

F

Small

f

d

e

f

d

e

f

Upper

Lower

C

a

2

5

9

-9

5

-6

3

b

-5

-3

-5

6

-6

3

-5

c

6

-6

6

4

7

8

-4

另外,也支持布爾序列的索引:

df_ex.sum()
Big  Small
D    d        11
     e        33
     f       -21
E    d       -35
     e        -8
     f        -3
F    d       -16
     e        15
     f       -20
dtype: int64
df_ex.loc[idx[:'A', lambda x:x.sum()>0]] # 列所有和大於0



Big

D

F

Small

d

e

e

Upper

Lower

A

a

3

6

9

b

-3

3

-4

c

-1

0

9

【b】loc[idx[*,*],idx[*,*]]

這種情況能夠分層進行切片,前一個idx指代的是行索引,後一個是列索引。

df_ex.loc[idx[:'A', 'b':], idx['E':, 'e':]]



Big

E

F

Small

e

f

e

f

Upper

Lower

A

b

-2

5

-4

4

c

6

6

9

-6

但需要注意的是,此時不支持使用函數:

try:
    df_ex.loc[idx[:'A', lambda x: 'b'], idx['E':, 'e':]]
except Exception as e:
    Err_Msg = e
Err_Msg
KeyError(<function __main__.<lambda>(x)>)

4. 多級索引的構造

前面提到了多級索引表的結構和切片,那麼除了使用set_index之外,如何自己構造多級索引呢?常用的有from_tuples, from_arrays, from_product三種方法,它們都是pd.MultiIndex對象下的函數。

from_tuples指根據傳入由元組組成的列表進行構造:

my_tuple = [('a','cat'),('a','dog'),('b','cat'),('b','dog')]
pd.MultiIndex.from_tuples(my_tuple, names=['First','Second'])
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])

from_arrays指根據傳入列表中,對應層的列表進行構造:

my_array = [list('aabb'), ['cat', 'dog']*2]
pd.MultiIndex.from_arrays(my_array, names=['First','Second'])
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])

from_product指根據給定多個列表的笛卡爾積進行構造:

my_list1 = ['a','b']
my_list2 = ['cat','dog']
pd.MultiIndex.from_product([my_list1, my_list2], names=['First','Second'])
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])

三、索引的常用方法

1. 索引層的交換和刪除

為了方便理解交換的過程,這裏構造一個三級索引的例子:

np.random.seed(0)
L1,L2,L3 = ['A','B'],['a','b'],['alpha','beta']
mul_index1 = pd.MultiIndex.from_product([L1,L2,L3], names=('Upper', 'Lower','Extra'))
L4,L5,L6 = ['C','D'],['c','d'],['cat','dog']
mul_index2 = pd.MultiIndex.from_product([L4,L5,L6], names=('Big', 'Small', 'Other'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(8,8)), index=mul_index1,  columns=mul_index2)
df_ex



Big

C

D

Small

c

d

c

d

Other

cat

dog

cat

dog

cat

dog

cat

dog

Upper

Lower

Extra

A

a

alpha

3

6

-9

-6

-6

-2

0

9

beta

-5

-3

3

-8

-3

-2

5

8

b

alpha

-4

4

-1

0

7

-4

6

6

beta

-9

9

-6

8

5

-2

-9

-8

B

a

alpha

0

-9

1

-6

2

9

-7

-9

beta

-9

-5

-4

-3

-1

8

6

-5

b

alpha

0

1

-8

-8

-2

0

-6

-3

beta

2

5

9

-9

5

-6

3

1

索引層的交換由swaplevelreorder_levels完成,前者只能交換兩個層,而後者可以交換任意層,兩者都可以指定交換的是軸是哪一個,即行索引或列索引:

df_ex.swaplevel(0,2,axis=1) # 列索引的第一層和第三層交換



Other

cat

dog

cat

dog

cat

dog

cat

dog

Small

c

c

d

d

c

c

d

d

Big

C

C

C

C

D

D

D

D

Upper

Lower

Extra

A

a

alpha

3

6

-9

-6

-6

-2

0

9

beta

-5

-3

3

-8

-3

-2

5

8

b

alpha

-4

4

-1

0

7

-4

6

6

beta

-9

9

-6

8

5

-2

-9

-8

B

a

alpha

0

-9

1

-6

2

9

-7

-9

beta

-9

-5

-4

-3

-1

8

6

-5

b

alpha

0

1

-8

-8

-2

0

-6

-3

beta

2

5

9

-9

5

-6

3

1

df_ex.reorder_levels([2,0,1],axis=0).head() # 列表數字指代原來索引中的層



Big

C

D

Small

c

d

c

d

Other

cat

dog

cat

dog

cat

dog

cat

dog

Extra

Upper

Lower

alpha

A

a

3

6

-9

-6

-6

-2

0

9

beta

A

a

-5

-3

3

-8

-3

-2

5

8

alpha

A

b

-4

4

-1

0

7

-4

6

6

beta

A

b

-9

9

-6

8

5

-2

-9

-8

alpha

B

a

0

-9

1

-6

2

9

-7

-9

【NOTE】軸之間的索引交換

這裏只涉及行或列索引內部的交換,不同方向索引之間的交換將在第五章中被討論。

【END】

若想要刪除某一層的索引,可以使用droplevel方法:

df_ex.droplevel(1,axis=1)



Big

C

D

Other

cat

dog

cat

dog

cat

dog

cat

dog

Upper

Lower

Extra

A

a

alpha

3

6

-9

-6

-6

-2

0

9

beta

-5

-3

3

-8

-3

-2

5

8

b

alpha

-4

4

-1

0

7

-4

6

6

beta

-9

9

-6

8

5

-2

-9

-8

B

a

alpha

0

-9

1

-6

2

9

-7

-9

beta

-9

-5

-4

-3

-1

8

6

-5

b

alpha

0

1

-8

-8

-2

0

-6

-3

beta

2

5

9

-9

5

-6

3

1

df_ex.droplevel([0,1],axis=0)



Big

C

D

Small

c

d

c

d

Other

cat

dog

cat

dog

cat

dog

cat

dog

Extra

alpha

3

6

-9

-6

-6

-2

0

9

beta

-5

-3

3

-8

-3

-2

5

8

alpha

-4

4

-1

0

7

-4

6

6

beta

-9

9

-6

8

5

-2

-9

-8

alpha

0

-9

1

-6

2

9

-7

-9

beta

-9

-5

-4

-3

-1

8

6

-5

alpha

0

1

-8

-8

-2

0

-6

-3

beta

2

5

9

-9

5

-6

3

1

2. 索引屬性的修改

通過rename_axis可以對索引層的名字進行修改,常用的修改方式是傳入字典的映射:

df_ex.rename_axis(index={'Upper':'Changed_row'}, columns={'Other':'Changed_Col'}).head()



Big

C

D

Small

c

d

c

d

Changed_Col

cat

dog

cat

dog

cat

dog

cat

dog

Changed_row

Lower

Extra

A

a

alpha

3

6

-9

-6

-6

-2

0

9

beta

-5

-3

3

-8

-3

-2

5

8

b

alpha

-4

4

-1

0

7

-4

6

6

beta

-9

9

-6

8

5

-2

-9

-8

B

a

alpha

0

-9

1

-6

2

9

-7

-9

通過rename可以對索引的值進行修改,如果是多級索引需要指定修改的層號level

df_ex.rename(columns={'cat':'not_cat'}, level=2).head()



Big

C

D

Small

c

d

c

d

Other

not_cat

dog

not_cat

dog

not_cat

dog

not_cat

dog

Upper

Lower

Extra

A

a

alpha

3

6

-9

-6

-6

-2

0

9

beta

-5

-3

3

-8

-3

-2

5

8

b

alpha

-4

4

-1

0

7

-4

6

6

beta

-9

9

-6

8

5

-2

-9

-8

B

a

alpha

0

-9

1

-6

2

9

-7

-9

傳入參數也可以是函數,其輸入值就是索引元素:

df_ex.rename(index=lambda x:str.upper(x), level=2).head()



Big

C

D

Small

c

d

c

d

Other

cat

dog

cat

dog

cat

dog

cat

dog

Upper

Lower

Extra

A

a

ALPHA

3

6

-9

-6

-6

-2

0

9

BETA

-5

-3

3

-8

-3

-2

5

8

b

ALPHA

-4

4

-1

0

7

-4

6

6

BETA

-9

9

-6

8

5

-2

-9

-8

B

a

ALPHA

0

-9

1

-6

2

9

-7

-9

對於整個索引的元素替換,可以利用迭代器實現:

new_values = iter(list('abcdefgh'))
df_ex.rename(index=lambda x:next(new_values), level=2)



Big

C

D

Small

c

d

c

d

Other

cat

dog

cat

dog

cat

dog

cat

dog

Upper

Lower

Extra

A

a

a

3

6

-9

-6

-6

-2

0

9

b

-5

-3

3

-8

-3

-2

5

8

b

c

-4

4

-1

0

7

-4

6

6

d

-9

9

-6

8

5

-2

-9

-8

B

a

e

0

-9

1

-6

2

9

-7

-9

f

-9

-5

-4

-3

-1

8

6

-5

b

g

0

1

-8

-8

-2

0

-6

-3

h

2

5

9

-9

5

-6

3

1

若想要對某個位置的元素進行修改,在單層索引時容易實現,即先取出索引的values屬性,再給對得到的列表進行修改,最後再對index對象重新賦值。但是如果是多級索引的話就有些麻煩,一個解決的方案是先把某一層索引臨時轉為表的元素,然後再進行修改,最後重新設定為索引,下面一節將介紹這些操作。

另外一個需要介紹的函數是map,它是定義在Index上的方法,與前面rename方法中層的函數式用法是類似的,只不過它傳入的不是層的標量值,而是直接傳入索引的元組,這為用户進行跨層的修改提供了遍歷。例如,可以等價地寫出上面的字符串轉大寫的操作:

df_temp = df_ex.copy()
new_idx = df_temp.index.map(lambda x: (x[0], x[1], str.upper(x[2])))
df_temp.index= new_idx
df_temp.head()



Big

C

D

Small

c

d

c

d

Other

cat

dog

cat

dog

cat

dog

cat

dog

Upper

Lower

Extra

A

a

ALPHA

3

6

-9

-6

-6

-2

0

9

BETA

-5

-3

3

-8

-3

-2

5

8

b

ALPHA

-4

4

-1

0

7

-4

6

6

BETA

-9

9

-6

8

5

-2

-9

-8

B

a

ALPHA

0

-9

1

-6

2

9

-7

-9

關於map的另一個使用方法是對多級索引的壓縮,這在第四章和第五章的一些操作中是有用的:

df_temp = df_ex.copy()
new_idx = df_temp.index.map(lambda x: (x[0]+'-'+x[1]+'-'+x[2]))
df_temp.index = new_idx
df_temp.head() # 單層索引



Big

C

D

Small

c

d

c

d

Other

cat

dog

cat

dog

cat

dog

cat

dog

A-a-alpha

3

6

-9

-6

-6

-2

0

9

A-a-beta

-5

-3

3

-8

-3

-2

5

8

A-b-alpha

-4

4

-1

0

7

-4

6

6

A-b-beta

-9

9

-6

8

5

-2

-9

-8

B-a-alpha

0

-9

1

-6

2

9

-7

-9

同時,也可以反向地展開:

new_idx = df_temp.index.map(lambda x:tuple(x.split('-')))
df_temp.index = new_idx
df_temp.head() # 三層索引



Big

C

D

Small

c

d

c

d

Other

cat

dog

cat

dog

cat

dog

cat

dog

A

a

alpha

3

6

-9

-6

-6

-2

0

9

beta

-5

-3

3

-8

-3

-2

5

8

b

alpha

-4

4

-1

0

7

-4

6

6

beta

-9

9

-6

8

5

-2

-9

-8

B

a

alpha

0

-9

1

-6

2

9

-7

-9

3. 索引的設置與重置

為了説明本節的函數,下面構造一個新表:

df_new = pd.DataFrame({'A':list('aacd'), 'B':list('PQRT'), 'C':[1,2,3,4]})
df_new



A

B

C

0

a

P

1

1

a

Q

2

2

c

R

3

3

d

T

4

索引的設置可以使用set_index完成,這裏的主要參數是append,表示是否來保留原來的索引,直接把新設定的添加到原索引的內層:

df_new.set_index('A')



B

C

A

a

P

1

a

Q

2

c

R

3

d

T

4

df_new.set_index('A', append=True)



B

C

A

0

a

P

1

1

a

Q

2

2

c

R

3

3

d

T

4

可以同時指定多個列作為索引:

df_new.set_index(['A', 'B'])



C

A

B

a

P

1

Q

2

c

R

3

d

T

4

如果想要添加索引的列沒有出現在其中,那麼可以直接在參數中傳入相應的Series

my_index = pd.Series(list('WXYZ'), name='D')
df_new = df_new.set_index(['A', my_index])
df_new



B

C

A

D

a

W

P

1

X

Q

2

c

Y

R

3

d

Z

T

4

reset_indexset_index的逆函數,其主要參數是drop,表示是否要把去掉的索引層丟棄,而不是添加到列中:

df_new.reset_index(['D'])



D

B

C

A

a

W

P

1

a

X

Q

2

c

Y

R

3

d

Z

T

4

df_new.reset_index(['D'], drop=True)



B

C

A

a

P

1

a

Q

2

c

R

3

d

T

4

如果重置了所有的索引,那麼pandas會直接重新生成一個默認索引:

df_new.reset_index()



A

D

B

C

0

a

W

P

1

1

a

X

Q

2

2

c

Y

R

3

3

d

Z

T

4

4. 索引的變形

在某些場合下,需要對索引做一些擴充或者剔除,更具體地要求是給定一個新的索引,把原表中相應的索引對應元素填充到新索引構成的表中。例如,下面的表中給出了員工信息,需要重新制作一張新的表,要求增加一名員工的同時去掉身高列並增加性別列:

df_reindex = pd.DataFrame({"Weight":[60,70,80], "Height":[176,180,179]}, index=['1001','1003','1002'])
df_reindex



Weight

Height

1001

60

176

1003

70

180

1002

80

179

df_reindex.reindex(index=['1001','1002','1003','1004'], columns=['Weight','Gender'])



Weight

Gender

1001

60.0

NaN

1002

80.0

NaN

1003

70.0

NaN

1004

NaN

NaN

這種需求常出現在時間序列索引的時間點填充以及ID編號的擴充。另外,需要注意的是原來表中的數據和新表中會根據索引自動對齊,例如原先的1002號位置在1003號之後,而新表中相反,那麼reindex中會根據元素對齊,與位置無關。

還有一個與reindex功能類似的函數是reindex_like,其功能是仿照傳入的表索引來進行被調用表索引的變形。例如,現在已經存在一張表具備了目標索引的條件,那麼上述功能可採用下述代碼得到:

df_existed = pd.DataFrame(index=['1001','1002','1003','1004'], columns=['Weight','Gender'])
df_reindex.reindex_like(df_existed)



Weight

Gender

1001

60.0

NaN

1002

80.0

NaN

1003

70.0

NaN

1004

NaN

NaN

四、索引運算

1. 集合的運算法則

經常會有一種利用集合運算來取出符合條件行的需求,例如有兩張表AB,它們的索引都是員工編號,現在需要篩選出兩表索引交集的所有員工信息,此時通過Index上的運算操作就很容易實現。

不過在此之前,不妨先複習一下常見的四種集合運算:

df 根據索引其中之一篩選 根據索引名得到索引號_機器學習
df 根據索引其中之一篩選 根據索引名得到索引號_機器學習_02
df 根據索引其中之一篩選 根據索引名得到索引號_python_03
df 根據索引其中之一篩選 根據索引名得到索引號_pandas_04

2. 一般的索引運算

由於集合的元素是互異的,但是索引中可能有相同的元素,先用unique去重後再進行運算。下面構造兩張最為簡單的示例表進行演示:

df_set_1 = pd.DataFrame([[0,1],[1,2],[3,4]], index = pd.Index(['a','b','a'],name='id1'))
df_set_2 = pd.DataFrame([[4,5],[2,6],[7,1]], index = pd.Index(['b','b','c'],name='id2'))
id1, id2 = df_set_1.index.unique(), df_set_2.index.unique()
id1.intersection(id2)
Index(['b'], dtype='object')
id1.union(id2)
Index(['a', 'b', 'c'], dtype='object')
id1.difference(id2)
Index(['a'], dtype='object')
id1.symmetric_difference(id2)
Index(['a', 'c'], dtype='object')

上述的四類運算可以用等價的符號表示代替:

id1 & id2 , id1 | id2 , (id1 ^ id2) & id1 , id1 ^ id2
(Index(['b'], dtype='object'),
 Index(['a', 'b', 'c'], dtype='object'),
 Index(['a'], dtype='object'),
 Index(['a', 'c'], dtype='object'))

若兩張表需要做集合運算的列並沒有被設置索引,一種辦法是先轉成索引,運算後再恢復,另一種方法是利用isin函數,例如在重置索引的第一張表中選出id列交集的所在行:

df_set_in_col_1 = df_set_1.reset_index()
df_set_in_col_2 = df_set_2.reset_index()
df_set_in_col_1



id1

0

1

0

a

0

1

1

b

1

2

2

a

3

4

df_set_in_col_2



id2

0

1

0

b

4

5

1

b

2

6

2

c

7

1

df_set_in_col_2[df_set_in_col_2.id2.isin(df_set_in_col_1.id1)]



id2

0

1

0

b

4

5

1

b

2

6

五、練習

Ex1:公司員工數據集

現有一份公司員工數據集:

df = pd.read_csv('../data/company.csv')
df.head(3)



EmployeeID

birthdate_key

age

city_name

department

job_title

gender

0

1318

1/3/1954

61

Vancouver

Executive

CEO

M

1

1319

1/3/1957

58

Vancouver

Executive

VP Stores

F

2

1320

1/2/1955

60

Vancouver

Executive

Legal Counsel

F

  1. 分別只使用queryloc選出年齡不超過四十歲且工作部門為DairyBakery的男性。
#使用query選出年齡不超過四十歲且工作部門為Dairy或Bakery的男性。
dpt = ['Dairy', 'Bakery']
df.query("(age<=40) & (department==@dpt) & (gender=='M')").head()



EmployeeID

birthdate_key

age

city_name

department

job_title

gender

3611

5791

1/14/1975

40

Kelowna

Dairy

Dairy Person

M

3613

5793

1/22/1975

40

Richmond

Bakery

Baker

M

3615

5795

1/30/1975

40

Nanaimo

Dairy

Dairy Person

M

3617

5797

2/3/1975

40

Nanaimo

Dairy

Dairy Person

M

3618

5798

2/4/1975

40

Surrey

Dairy

Dairy Person

M

#使用loc選出年齡不超過四十歲且工作部門為Dairy或Bakery的男性。
df.loc[(df.age<=40)&df.department.isin(dpt)&(df.gender=='M')].head()



EmployeeID

birthdate_key

age

city_name

department

job_title

gender

3611

5791

1/14/1975

40

Kelowna

Dairy

Dairy Person

M

3613

5793

1/22/1975

40

Richmond

Bakery

Baker

M

3615

5795

1/30/1975

40

Nanaimo

Dairy

Dairy Person

M

3617

5797

2/3/1975

40

Nanaimo

Dairy

Dairy Person

M

3618

5798

2/4/1975

40

Surrey

Dairy

Dairy Person

M

  1. 選出員工ID號 為奇數所在行的第1、第3和倒數第2列。
df.iloc[(df.EmployeeID%2==1).values,[0,2,-2]].head()



EmployeeID

age

job_title

1

1319

58

VP Stores

3

1321

56

VP Human Resources

5

1323

53

Exec Assistant, VP Stores

6

1325

51

Exec Assistant, Legal Counsel

8

1329

48

Store Manager

  1. 按照以下步驟進行索引操作:
  • 把後三列設為索引後交換內外兩層
  • 恢復中間層索引
  • 修改外層索引名為Gender
  • 用下劃線合併兩層行索引
  • 把行索引拆分為原狀態
  • 修改索引名為原表名稱
  • 恢復默認索引並將列保持為原表的相對位置
df_op = df.copy()
#把後三列設為索引後交換內外兩層
df_op = df_op.set_index(df_op.columns[-3:].tolist()).swaplevel(0,2,axis=0)
df_op



EmployeeID

birthdate_key

age

city_name

gender

job_title

department

M

CEO

Executive

1318

1/3/1954

61

Vancouver

F

VP Stores

Executive

1319

1/3/1957

58

Vancouver

Legal Counsel

Executive

1320

1/2/1955

60

Vancouver

M

VP Human Resources

Executive

1321

1/2/1959

56

Vancouver

VP Finance

Executive

1322

1/9/1958

57

Vancouver

...

...

...

...

...

...

...

F

Cashier

Customer Service

8036

8/9/1992

23

New Westminister

M

Cashier

Customer Service

8181

9/26/1993

22

Prince George

Customer Service

8223

2/11/1994

21

Trail

F

Cashier

Customer Service

8226

2/16/1994

21

Victoria

Customer Service

8264

6/13/1994

21

Vancouver

6284 rows × 4 columns

#恢復中間層索引
df_op = df_op.reset_index(level=1)
df_op.head()



job_title

EmployeeID

birthdate_key

age

city_name

gender

department

M

Executive

CEO

1318

1/3/1954

61

Vancouver

F

Executive

VP Stores

1319

1/3/1957

58

Vancouver

Executive

Legal Counsel

1320

1/2/1955

60

Vancouver

M

Executive

VP Human Resources

1321

1/2/1959

56

Vancouver

Executive

VP Finance

1322

1/9/1958

57

Vancouver

#修改外層索引名為Gender
df_op = df_op.rename_axis(index={'gender':'Gender'})
df_op.head()



job_title

EmployeeID

birthdate_key

age

city_name

Gender

department

M

Executive

CEO

1318

1/3/1954

61

Vancouver

F

Executive

VP Stores

1319

1/3/1957

58

Vancouver

Executive

Legal Counsel

1320

1/2/1955

60

Vancouver

M

Executive

VP Human Resources

1321

1/2/1959

56

Vancouver

Executive

VP Finance

1322

1/9/1958

57

Vancouver

#用下劃線合併兩層行索引
df_op.index = df_op.index.map(lambda x: '_'.join(x))
df_op.head()



job_title

EmployeeID

birthdate_key

age

city_name

M_Executive

CEO

1318

1/3/1954

61

Vancouver

F_Executive

VP Stores

1319

1/3/1957

58

Vancouver

F_Executive

Legal Counsel

1320

1/2/1955

60

Vancouver

M_Executive

VP Human Resources

1321

1/2/1959

56

Vancouver

M_Executive

VP Finance

1322

1/9/1958

57

Vancouver

#把行索引拆分為原狀態
df_op.index = df_op.index.map(lambda x: tuple(x.split('_')))
df_op.head()



job_title

EmployeeID

birthdate_key

age

city_name

M

Executive

CEO

1318

1/3/1954

61

Vancouver

F

Executive

VP Stores

1319

1/3/1957

58

Vancouver

Executive

Legal Counsel

1320

1/2/1955

60

Vancouver

M

Executive

VP Human Resources

1321

1/2/1959

56

Vancouver

Executive

VP Finance

1322

1/9/1958

57

Vancouver

#修改索引名為原表名稱
df_op = df_op.rename_axis(index=['gender', 'department'])
df_op



job_title

EmployeeID

birthdate_key

age

city_name

gender

department

M

Executive

CEO

1318

1/3/1954

61

Vancouver

F

Executive

VP Stores

1319

1/3/1957

58

Vancouver

Executive

Legal Counsel

1320

1/2/1955

60

Vancouver

M

Executive

VP Human Resources

1321

1/2/1959

56

Vancouver

Executive

VP Finance

1322

1/9/1958

57

Vancouver

...

...

...

...

...

...

...

F

Customer Service

Cashier

8036

8/9/1992

23

New Westminister

M

Customer Service

Cashier

8181

9/26/1993

22

Prince George

Customer Service

Cashier

8223

2/11/1994

21

Trail

F

Customer Service

Cashier

8226

2/16/1994

21

Victoria

Customer Service

Cashier

8264

6/13/1994

21

Vancouver

6284 rows × 5 columns

#恢復默認索引並將列保持為原表的相對位置
df_op = df_op.reset_index().reindex(df.columns, axis=1)
df_op.equals(df)
True

Ex2:巧克力數據集

現有一份關於巧克力評價的數據集:

df = pd.read_csv('../data/chocolate.csv')
df.head(3)



Company

Review\nDate

Cocoa\nPercent

Company\nLocation

Rating

0

A. Morin

2016

63%

France

3.75

1

A. Morin

2015

70%

France

2.75

2

A. Morin

2015

70%

France

3.00

  1. 把列索引名中的\n替換為空格。
df.columns = [' '.join(i.split('\n')) for i in df.columns]
df.head()



Company

Review Date

Cocoa Percent

Company Location

Rating

0

A. Morin

2016

63%

France

3.75

1

A. Morin

2015

70%

France

2.75

2

A. Morin

2015

70%

France

3.00

3

A. Morin

2015

70%

France

3.50

4

A. Morin

2015

70%

France

3.50

  1. 巧克力Rating評分為1至5,每0.25分一檔,請選出2.75分及以下且可可含量Cocoa Percent高於中位數的樣本。
df['Cocoa Percent'] = df['Cocoa Percent'].apply(lambda x:float(x[:-1])/100)
df.query('(Rating<=2.75)&(`Cocoa Percent`>`Cocoa Percent`.median())').head()



Company

Review Date

Cocoa Percent

Company Location

Rating

33

Akesson's (Pralus)

2010

0.75

Switzerland

2.75

34

Akesson's (Pralus)

2010

0.75

Switzerland

2.75

36

Alain Ducasse

2014

0.75

France

2.75

38

Alain Ducasse

2013

0.75

France

2.50

39

Alain Ducasse

2013

0.75

France

2.50

  1. Review DateCompany Location設為索引後,選出Review Date在2012年之後且Company Location不屬於France, Canada, Amsterdam, Belgium的樣本。
idx = pd.IndexSlice
exclude = ['France', 'Canada', 'Amsterdam', 'Belgium']
res = df.set_index(['Review Date', 'Company Location']).sort_index(level=0)
res.loc[idx[2012:,~res.index.get_level_values(1).isin(exclude)],:].head()



Company

Cocoa Percent

Rating

Review Date

Company Location

2012

Australia

Bahen & Co.

0.70

3.00

Australia

Bahen & Co.

0.70

2.50

Australia

Bahen & Co.

0.70

2.50

Australia

Cravve

0.75

3.25

Australia

Cravve

0.65

3.25