第三章 索引
本章數據:
import numpy as np
import pandas as pd
一、索引器
1. 表的列索引
列索引是最常見的索引形式,一般通過[]來實現。通過[列名]可以從DataFrame中取出相應的列,返回值為Series,例如從表中取出姓名一列:
df = pd.read_csv('../data/learn_pandas.csv', usecols = ['School', 'Grade', 'Name', 'Gender', 'Weight', 'Transfer'])
df['Name'].head()
0 Gaopeng Yang
1 Changqiang You
2 Mei Sun
3 Xiaojuan Sun
4 Gaojuan You
Name: Name, dtype: object
如果要取出多個列,則可以通過[列名組成的列表],其返回值為一個DataFrame,例如從表中取出性別和姓名兩列:
df[['Gender', 'Name']].head()
|
Gender
|
Name
|
|
|
0
|
Female
|
Gaopeng Yang
|
|
1
|
Male
|
Changqiang You
|
|
2
|
Male
|
Mei Sun
|
|
3
|
Female
|
Xiaojuan Sun
|
|
4
|
Male
|
Gaojuan You
|
此外,若要取出單列,且列名中不包含空格,則可以用.列名取出,這和[列名]是等價的:
df.Name.head()
0 Gaopeng Yang
1 Changqiang You
2 Mei Sun
3 Xiaojuan Sun
4 Gaojuan You
Name: Name, dtype: object
2. 序列的行索引
【a】以字符串為索引的Series
如果取出單個索引的對應元素,則可以使用[item],若Series只有單個值對應,則返回這個標量值,如果有多個值對應,則返回一個Series:
s = pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'a', 'a', 'a', 'c'])
s['a']
a 1
a 3
a 4
a 5
dtype: int64
s['b']
2
如果取出多個索引的對應元素,則可以使用[items的列表]:
s[['c', 'b']]
c 6
b 2
dtype: int64
如果想要取出某兩個索引之間的元素,並且這兩個索引是在整個索引中唯一出現,則可以使用切片,,同時需要注意這裏的切片會包含兩個端點:
s['c': 'b' :-2]
c 6
a 4
b 2
dtype: int64
如果前後端點的值重複出現,那麼需要經過排序才能使用切片:
try:
s['a': 'b']
except Exception as e:
Err_Msg = e
Err_Msg
KeyError("Cannot get left slice bound for non-unique label: 'a'")
s.sort_index()['a': 'b']
a 1
a 3
a 4
a 5
b 2
dtype: int64
【b】以整數為索引的Series
在使用數據的讀入函數時,如果不特別指定所對應的列作為索引,那麼會生成從0開始的整數索引作為默認索引。當然,任意一組符合長度要求的整數都可以作為索引。
和字符串一樣,如果使用[int]或[int_list],則可以取出對應索引元素的值:
s = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'], index=[1, 3, 1, 2, 5, 4])
s[1]
1 a
1 c
dtype: object
s[[2,3]]
2 d
3 b
dtype: object
如果使用整數切片,則會取出對應索引位置的值,注意這裏的整數切片同Python中的切片一樣不包含右端點:
s[1:-1]
3 b
1 c
2 d
5 e
dtype: object
【WARNING】關於索引類型的説明
如果不想陷入麻煩,那麼請不要把純浮點以及任何混合類型(字符串、整數、浮點類型等的混合)作為索引,否則可能會在具體的操作時報錯或者返回非預期的結果,並且在實際的數據分析中也不存在這樣做的動機。
【END】
3. loc索引器
前面講到了對DataFrame的列進行選取,下面要討論其行的選取。對於表而言,有兩種索引器,一種是基於元素的loc索引器,另一種是基於位置的iloc索引器。
loc索引器的一般形式是loc[*, *],其中第一個*代表行的選擇,第二個*代表列的選擇,如果省略第二個位置寫作loc[*],這個*是指行的篩選。其中,*的位置一共有五類合法對象,分別是:單個元素、元素列表、元素切片、布爾列表以及函數,下面將依次説明。
為了演示相應操作,先利用set_index方法把Name列設為索引,關於該函數的其他用法將在多級索引一章介紹。
df_demo = df.set_index('Name')
df_demo.head()
|
School
|
Grade
|
Gender
|
Weight
|
Transfer
|
|
|
Name
|
|||||
|
Gaopeng Yang
|
Shanghai Jiao Tong University
|
Freshman
|
Female
|
46.0
|
N
|
|
Changqiang You
|
Peking University
|
Freshman
|
Male
|
70.0
|
N
|
|
Mei Sun
|
Shanghai Jiao Tong University
|
Senior
|
Male
|
89.0
|
N
|
|
Xiaojuan Sun
|
Fudan University
|
Sophomore
|
Female
|
41.0
|
N
|
|
Gaojuan You
|
Fudan University
|
Sophomore
|
Male
|
74.0
|
N
|
【a】*為單個元素
此時,直接取出相應的行或列,如果該元素在索引中重複則結果為DataFrame,否則為Series:
df_demo.loc['Qiang Sun'] # 多個人叫此名字
|
School
|
Grade
|
Gender
|
Weight
|
Transfer
|
|
|
Name
|
|||||
|
Qiang Sun
|
Tsinghua University
|
Junior
|
Female
|
53.0
|
N
|
|
Qiang Sun
|
Tsinghua University
|
Sophomore
|
Female
|
40.0
|
N
|
|
Qiang Sun
|
Shanghai Jiao Tong University
|
Junior
|
Female
|
NaN
|
N
|
df_demo.loc['Quan Zhao'] # 名字唯一
School Shanghai Jiao Tong University
Grade Junior
Gender Female
Weight 53
Transfer N
Name: Quan Zhao, dtype: object
也可以同時選擇行和列:
df_demo.loc['Qiang Sun', 'School'] # 返回Series
Name
Qiang Sun Tsinghua University
Qiang Sun Tsinghua University
Qiang Sun Shanghai Jiao Tong University
Name: School, dtype: object
df_demo.loc['Quan Zhao', 'School'] # 返回單個元素
'Shanghai Jiao Tong University'
【b】*為元素列表
此時,取出列表中所有元素值對應的行或列:
df_demo.loc[['Qiang Sun','Quan Zhao'], ['School','Gender']]
|
School
|
Gender
|
|
|
Name
|
||
|
Qiang Sun
|
Tsinghua University
|
Female
|
|
Qiang Sun
|
Tsinghua University
|
Female
|
|
Qiang Sun
|
Shanghai Jiao Tong University
|
Female
|
|
Quan Zhao
|
Shanghai Jiao Tong University
|
Female
|
【c】*為切片
之前的Series使用字符串索引時提到,如果是唯一值的起點和終點字符,那麼就可以使用切片,並且包含兩個端點,如果不唯一則報錯:
df_demo.loc['Gaojuan You':'Gaoqiang Qian', 'School':'Gender']
|
School
|
Grade
|
Gender
|
|
|
Name
|
|||
|
Gaojuan You
|
Fudan University
|
Sophomore
|
Male
|
|
Xiaoli Qian
|
Tsinghua University
|
Freshman
|
Female
|
|
Qiang Chu
|
Shanghai Jiao Tong University
|
Freshman
|
Female
|
|
Gaoqiang Qian
|
Tsinghua University
|
Junior
|
Female
|
需要注意的是,如果DataFrame使用整數索引,其使用整數切片的時候和上面字符串索引的要求一致,都是元素切片,包含端點且起點、終點不允許有重複值。
df_loc_slice_demo = df_demo.copy()
df_loc_slice_demo.index = range(df_demo.shape[0],0,-1)
df_loc_slice_demo.loc[5:3]
# df_loc_slice_demo
|
School
|
Grade
|
Gender
|
Weight
|
Transfer
|
|
|
5
|
Fudan University
|
Junior
|
Female
|
46.0
|
N
|
|
4
|
Tsinghua University
|
Senior
|
Female
|
50.0
|
N
|
|
3
|
Shanghai Jiao Tong University
|
Senior
|
Female
|
45.0
|
N
|
df_loc_slice_demo.loc[3:5] # 沒有返回,説明不是整數位置切片
|
School
|
Grade
|
Gender
|
Weight
|
Transfer
|
【d】*為布爾列表
在實際的數據處理中,根據條件來篩選行是極其常見的,此處傳入loc的布爾列表與DataFrame長度相同,且列表為True的位置所對應的行會被選中,False則會被剔除。
例如,選出體重超過70kg的學生:
df_demo.loc[df_demo.Weight>70].head()
|
School
|
Grade
|
Gender
|
Weight
|
Transfer
|
|
|
Name
|
|||||
|
Mei Sun
|
Shanghai Jiao Tong University
|
Senior
|
Male
|
89.0
|
N
|
|
Gaojuan You
|
Fudan University
|
Sophomore
|
Male
|
74.0
|
N
|
|
Xiaopeng Zhou
|
Shanghai Jiao Tong University
|
Freshman
|
Male
|
74.0
|
N
|
|
Xiaofeng Sun
|
Tsinghua University
|
Senior
|
Male
|
71.0
|
N
|
|
Qiang Zheng
|
Shanghai Jiao Tong University
|
Senior
|
Male
|
87.0
|
N
|
前面所提到的傳入元素列表,也可以通過isin方法返回的布爾列表等價寫出,例如選出所有大一和大四的同學信息:
df_demo.loc[df_demo.Grade.isin(['Freshman', 'Senior'])].head()
|
School
|
Grade
|
Gender
|
Weight
|
Transfer
|
|
|
Name
|
|||||
|
Gaopeng Yang
|
Shanghai Jiao Tong University
|
Freshman
|
Female
|
46.0
|
N
|
|
Changqiang You
|
Peking University
|
Freshman
|
Male
|
70.0
|
N
|
|
Mei Sun
|
Shanghai Jiao Tong University
|
Senior
|
Male
|
89.0
|
N
|
|
Xiaoli Qian
|
Tsinghua University
|
Freshman
|
Female
|
51.0
|
N
|
|
Qiang Chu
|
Shanghai Jiao Tong University
|
Freshman
|
Female
|
52.0
|
N
|
對於複合條件而言,可以用|(或), &(且), ~(取反)的組合來實現,例如選出復旦大學中體重超過70kg的大四學生,或者北大男生中體重超過80kg的非大四的學生:
condition_1_1 = df_demo.School == 'Fudan University'
condition_1_2 = df_demo.Grade == 'Senior'
condition_1_3 = df_demo.Weight > 70
condition_1 = condition_1_1 & condition_1_2 & condition_1_3
condition_2_1 = df_demo.School == 'Peking University'
condition_2_2 = df_demo.Grade == 'Senior'
condition_2_3 = df_demo.Weight > 80
condition_2 = condition_2_1 & (~condition_2_2) & condition_2_3
df_demo.loc[condition_1 | condition_2]
|
School
|
Grade
|
Gender
|
Weight
|
Transfer
|
|
|
Name
|
|||||
|
Qiang Han
|
Peking University
|
Freshman
|
Male
|
87.0
|
N
|
|
Chengpeng Zhou
|
Fudan University
|
Senior
|
Male
|
81.0
|
N
|
|
Changpeng Zhao
|
Peking University
|
Freshman
|
Male
|
83.0
|
N
|
|
Chengpeng Qian
|
Fudan University
|
Senior
|
Male
|
73.0
|
Y
|
【練一練】
select_dtypes是一個實用函數,它能夠從表中選出相應類型的列,若要選出所有數值型的列,只需使用.select_dtypes('number'),請利用布爾列表選擇的方法結合DataFrame的dtypes屬性在learn_pandas數據集上實現這個功能。
learn_pandas = pd.read_csv('../data/learn_pandas.csv')
learn_pandas.select_dtypes('number').head()
|
Height
|
Weight
|
Test_Number
|
|
|
0
|
158.9
|
46.0
|
1
|
|
1
|
166.5
|
70.0
|
1
|
|
2
|
188.9
|
89.0
|
2
|
|
3
|
NaN
|
41.0
|
2
|
|
4
|
174.0
|
74.0
|
2
|
【END】
【e】*為函數
這裏的函數,必須以前面的四種合法形式之一為返回值,並且函數的輸入值為DataFrame本身。假設仍然是上述複合條件篩選的例子,可以把邏輯寫入一個函數中再返回,需要注意的是函數的形式參數x本質上即為df_demo:
def condition(x):
condition_1_1 = x.School == 'Fudan University'
condition_1_2 = x.Grade == 'Senior'
condition_1_3 = x.Weight > 70
condition_1 = condition_1_1 & condition_1_2 & condition_1_3
condition_2_1 = x.School == 'Peking University'
condition_2_2 = x.Grade == 'Senior'
condition_2_3 = x.Weight > 80
condition_2 = condition_2_1 & (~condition_2_2) & condition_2_3
result = condition_1 | condition_2
return result
df_demo.loc[condition]
|
School
|
Grade
|
Gender
|
Weight
|
Transfer
|
|
|
Name
|
|||||
|
Qiang Han
|
Peking University
|
Freshman
|
Male
|
87.0
|
N
|
|
Chengpeng Zhou
|
Fudan University
|
Senior
|
Male
|
81.0
|
N
|
|
Changpeng Zhao
|
Peking University
|
Freshman
|
Male
|
83.0
|
N
|
|
Chengpeng Qian
|
Fudan University
|
Senior
|
Male
|
73.0
|
Y
|
此外,還支持使用lambda表達式,其返回值也同樣必須是先前提到的四種形式之一:
df_demo.loc[lambda x:'Quan Zhao', lambda x:'Gender']
'Female'
由於函數無法返回如start: end: step的切片形式,故返回切片時要用slice對象進行包裝:
df_demo.loc[lambda x: slice('Gaojuan You', 'Gaoqiang Qian')]
|
School
|
Grade
|
Gender
|
Weight
|
Transfer
|
|
|
Name
|
|||||
|
Gaojuan You
|
Fudan University
|
Sophomore
|
Male
|
74.0
|
N
|
|
Xiaoli Qian
|
Tsinghua University
|
Freshman
|
Female
|
51.0
|
N
|
|
Qiang Chu
|
Shanghai Jiao Tong University
|
Freshman
|
Female
|
52.0
|
N
|
|
Gaoqiang Qian
|
Tsinghua University
|
Junior
|
Female
|
50.0
|
N
|
最後需要指出的是,對於Series也可以使用loc索引,其遵循的原則與DataFrame中用於行篩選的loc[*]完全一致,此處不再贅述。
【WARNING】不要使用鏈式賦值
在對錶或者序列賦值時,應當在使用一層索引器後直接進行賦值操作,這樣做是由於進行多次索引後賦值是賦在臨時返回的copy副本上的,而沒有真正修改元素從而報出SettingWithCopyWarning警告。例如,下面給出的例子:
df_chain = pd.DataFrame([[0,0],[1,0],[-1,0]], columns=list('AB'))
df_chain
|
A
|
B
|
|
|
0
|
0
|
0
|
|
1
|
1
|
0
|
|
2
|
-1
|
0
|
import warnings
with warnings.catch_warnings():
warnings.filterwarnings('error')
try:
df_chain[df_chain.A!=0].B = 1 # 使用方括號列索引後,再使用點的列索引
except Warning as w:
Warning_Msg = w
print(Warning_Msg)
df_chain
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
|
A
|
B
|
|
|
0
|
0
|
0
|
|
1
|
1
|
0
|
|
2
|
-1
|
0
|
df_chain.loc[df_chain.A!=0,'B'] = 1
df_chain
【END】
4. iloc索引器
iloc的使用與loc完全類似,只不過是針對位置進行篩選,在相應的*位置處一共也有五類合法對象,分別是:整數、整數列表、整數切片、布爾列表以及函數,函數的返回值必須是前面的四類合法對象中的一個,其輸入同樣也為DataFrame本身。
df_demo.iloc[1, 1] # 第二行第二列
'Freshman'
df_demo.iloc[[0, 1], [0, 1]] # 前兩行前兩列
|
School
|
Grade
|
|
|
Name
|
||
|
Gaopeng Yang
|
Shanghai Jiao Tong University
|
Freshman
|
|
Changqiang You
|
Peking University
|
Freshman
|
df_demo.iloc[1: 4, 2:4] # 切片不包含結束端點
|
Gender
|
Weight
|
|
|
Name
|
||
|
Changqiang You
|
Male
|
70.0
|
|
Mei Sun
|
Male
|
89.0
|
|
Xiaojuan Sun
|
Female
|
41.0
|
df_demo.iloc[lambda x: slice(1, 4)] # 傳入切片為返回值的函數
|
School
|
Grade
|
Gender
|
Weight
|
Transfer
|
|
|
Name
|
|||||
|
Changqiang You
|
Peking University
|
Freshman
|
Male
|
70.0
|
N
|
|
Mei Sun
|
Shanghai Jiao Tong University
|
Senior
|
Male
|
89.0
|
N
|
|
Xiaojuan Sun
|
Fudan University
|
Sophomore
|
Female
|
41.0
|
N
|
在使用布爾列表的時候要特別注意,不能傳入Series而必須傳入序列的values,否則會報錯。因此,在使用布爾篩選的時候還是應當優先考慮loc的方式。
例如,選出體重超過80kg的學生:
(df_demo.Weight>80).values[0:5]
array([False, False, True, False, False])
df_demo.iloc[(df_demo.Weight>80).values].head()
|
School
|
Grade
|
Gender
|
Weight
|
Transfer
|
|
|
Name
|
|||||
|
Mei Sun
|
Shanghai Jiao Tong University
|
Senior
|
Male
|
89.0
|
N
|
|
Qiang Zheng
|
Shanghai Jiao Tong University
|
Senior
|
Male
|
87.0
|
N
|
|
Qiang Han
|
Peking University
|
Freshman
|
Male
|
87.0
|
N
|
|
Chengpeng Zhou
|
Fudan University
|
Senior
|
Male
|
81.0
|
N
|
|
Feng Han
|
Shanghai Jiao Tong University
|
Sophomore
|
Male
|
82.0
|
N
|
對Series而言同樣也可以通過iloc返回相應位置的值或子序列:
df_demo.School.iloc[1]
'Peking University'
df_demo.School.iloc[1:5:2]
Name
Changqiang You Peking University
Xiaojuan Sun Fudan University
Name: School, dtype: object
5. query方法
在pandas中,支持把字符串形式的查詢表達式傳入query方法來查詢數據,其表達式的執行結果必須返回布爾列表。在進行復雜索引時,由於這種檢索方式無需像普通方法一樣重複使用DataFrame的名字來引用列名,一般而言會使代碼長度在不降低可讀性的前提下有所減少。
例如,將loc一節中的複合條件查詢例子可以如下改寫:
df.query('((School == "Fudan University")&'
' (Grade == "Senior")&'
' (Weight > 70))|'
'((School == "Peking University")&'
' (Grade != "Senior")&'
' (Weight > 80))')
|
School
|
Grade
|
Name
|
Gender
|
Weight
|
Transfer
|
|
|
38
|
Peking University
|
Freshman
|
Qiang Han
|
Male
|
87.0
|
N
|
|
66
|
Fudan University
|
Senior
|
Chengpeng Zhou
|
Male
|
81.0
|
N
|
|
99
|
Peking University
|
Freshman
|
Changpeng Zhao
|
Male
|
83.0
|
N
|
|
131
|
Fudan University
|
Senior
|
Chengpeng Qian
|
Male
|
73.0
|
Y
|
df.query('((School == "Fudan University")& (Grade == "Senior")& (Weight > 70))| ((School == "Peking University")&(Grade != "Senior")&(Weight > 80))')
|
School
|
Grade
|
Name
|
Gender
|
Weight
|
Transfer
|
|
|
38
|
Peking University
|
Freshman
|
Qiang Han
|
Male
|
87.0
|
N
|
|
66
|
Fudan University
|
Senior
|
Chengpeng Zhou
|
Male
|
81.0
|
N
|
|
99
|
Peking University
|
Freshman
|
Changpeng Zhao
|
Male
|
83.0
|
N
|
|
131
|
Fudan University
|
Senior
|
Chengpeng Qian
|
Male
|
73.0
|
Y
|
在query表達式中,幫用户註冊了所有來自DataFrame的列名,所有屬於該Series的方法都可以被調用,和正常的函數調用並沒有區別,例如查詢體重超過均值的學生:
df.query('Weight > Weight.mean()').head()
|
School
|
Grade
|
Name
|
Gender
|
Weight
|
Transfer
|
|
|
1
|
Peking University
|
Freshman
|
Changqiang You
|
Male
|
70.0
|
N
|
|
2
|
Shanghai Jiao Tong University
|
Senior
|
Mei Sun
|
Male
|
89.0
|
N
|
|
4
|
Fudan University
|
Sophomore
|
Gaojuan You
|
Male
|
74.0
|
N
|
|
10
|
Shanghai Jiao Tong University
|
Freshman
|
Xiaopeng Zhou
|
Male
|
74.0
|
N
|
|
14
|
Tsinghua University
|
Senior
|
Xiaomei Zhou
|
Female
|
57.0
|
N
|
【NOTE】query中引用帶空格的列名
對於含有空格的列名,需要使用`col name`的方式進行引用。
【END】
同時,在query中還註冊了若干英語的字面用法,幫助提高可讀性,例如:or, and, or, is in, not in。例如,篩選出男生中不是大一大二的學生:
df.query('(Grade not in ["Freshman", "Sophomore"]) and (Gender == "Male")').head()
|
School
|
Grade
|
Name
|
Gender
|
Weight
|
Transfer
|
|
|
2
|
Shanghai Jiao Tong University
|
Senior
|
Mei Sun
|
Male
|
89.0
|
N
|
|
16
|
Tsinghua University
|
Junior
|
Xiaoqiang Qin
|
Male
|
68.0
|
N
|
|
17
|
Tsinghua University
|
Junior
|
Peng Wang
|
Male
|
65.0
|
N
|
|
18
|
Tsinghua University
|
Senior
|
Xiaofeng Sun
|
Male
|
71.0
|
N
|
|
21
|
Shanghai Jiao Tong University
|
Senior
|
Xiaopeng Shen
|
Male
|
62.0
|
NaN
|
此外,在字符串中出現與列表的比較時,==和!=分別表示元素出現在列表和沒有出現在列表,等價於is in和not in,例如查詢所有大三和大四的學生:
df.query('Grade == ["Junior", "Senior"]').head()
|
School
|
Grade
|
Name
|
Gender
|
Weight
|
Transfer
|
|
|
2
|
Shanghai Jiao Tong University
|
Senior
|
Mei Sun
|
Male
|
89.0
|
N
|
|
7
|
Tsinghua University
|
Junior
|
Gaoqiang Qian
|
Female
|
50.0
|
N
|
|
9
|
Peking University
|
Junior
|
Juan Xu
|
Female
|
NaN
|
N
|
|
11
|
Tsinghua University
|
Junior
|
Xiaoquan Lv
|
Female
|
43.0
|
N
|
|
12
|
Shanghai Jiao Tong University
|
Senior
|
Peng You
|
Female
|
48.0
|
NaN
|
對於query中的字符串,如果要引用外部變量,只需在變量名前加@符號。例如,取出體重位於70kg到80kg之間的學生:
low, high =70, 80
df.query('Weight.between(@low, @high)').head()
|
School
|
Grade
|
Name
|
Gender
|
Weight
|
Transfer
|
|
|
1
|
Peking University
|
Freshman
|
Changqiang You
|
Male
|
70.0
|
N
|
|
4
|
Fudan University
|
Sophomore
|
Gaojuan You
|
Male
|
74.0
|
N
|
|
10
|
Shanghai Jiao Tong University
|
Freshman
|
Xiaopeng Zhou
|
Male
|
74.0
|
N
|
|
18
|
Tsinghua University
|
Senior
|
Xiaofeng Sun
|
Male
|
71.0
|
N
|
|
35
|
Peking University
|
Freshman
|
Gaoli Zhao
|
Male
|
78.0
|
N
|
6. 隨機抽樣
如果把DataFrame的每一行看作一個樣本,或把每一列看作一個特徵,再把整個DataFrame看作總體,想要對樣本或特徵進行隨機抽樣就可以用sample函數。有時在拿到大型數據集後,想要對統計特徵進行計算來了解數據的大致分佈,但是這很費時間。同時,由於許多統計特徵在等概率不放回的簡單隨機抽樣條件下,是總體統計特徵的無偏估計,比如樣本均值和總體均值,那麼就可以先從整張表中抽出一部分來做近似估計。
sample函數中的主要參數為n, axis, frac, replace, weights,前三個分別是指抽樣數量、抽樣的方向(0為行、1為列)和抽樣比例(0.3則為從總體中抽出30%的樣本)。
replace和weights分別是指是否放回和每個樣本的抽樣相對概率,當replace = True則表示有放回抽樣。例如,對下面構造的df_sample以value值的相對大小為抽樣概率進行有放回抽樣,抽樣數量為3。
df_sample = pd.DataFrame({'id': list('abcde'), 'value': [1, 2, 3, 4, 91]})
df_sample
|
id
|
value
|
|
|
0
|
a
|
1
|
|
1
|
b
|
2
|
|
2
|
c
|
3
|
|
3
|
d
|
4
|
|
4
|
e
|
91
|
df_sample.sample(3, replace = True, weights = df_sample.value)
|
id
|
value
|
|
|
3
|
d
|
4
|
|
4
|
e
|
91
|
|
4
|
e
|
91
|
二、多級索引
1. 多級索引及其表的結構
為了更加清晰地説明具有多級索引的DataFrame結構,下面新構造一張表,讀者可以忽略這裏的構造方法,它們將會在第4小節被更詳細地講解。
np.random.seed(0)
multi_index = pd.MultiIndex.from_product([list('ABCD'), df.Gender.unique()], names=('School', 'Gender'))
multi_column = pd.MultiIndex.from_product([['Height', 'Weight'], df.Grade.unique()], names=('Indicator', 'Grade'))
df_multi = pd.DataFrame(np.c_[(np.random.randn(8,4)*5 + 163).tolist(), (np.random.randn(8,4)*5 + 65).tolist()],
index = multi_index, columns = multi_column).round(1)
df_multi
|
Indicator
|
Height
|
Weight
|
|||||||
|
Grade
|
Freshman
|
Senior
|
Sophomore
|
Junior
|
Freshman
|
Senior
|
Sophomore
|
Junior
|
|
|
School
|
Gender
|
||||||||
|
A
|
Female
|
171.8
|
165.0
|
167.9
|
174.2
|
60.6
|
55.1
|
63.3
|
65.8
|
|
Male
|
172.3
|
158.1
|
167.8
|
162.2
|
71.2
|
71.0
|
63.1
|
63.5
|
|
|
B
|
Female
|
162.5
|
165.1
|
163.7
|
170.3
|
59.8
|
57.9
|
56.5
|
74.8
|
|
Male
|
166.8
|
163.6
|
165.2
|
164.7
|
62.5
|
62.8
|
58.7
|
68.9
|
|
|
C
|
Female
|
170.5
|
162.0
|
164.6
|
158.7
|
56.9
|
63.9
|
60.5
|
66.9
|
|
Male
|
150.2
|
166.3
|
167.3
|
159.3
|
62.4
|
59.1
|
64.9
|
67.1
|
|
|
D
|
Female
|
174.3
|
155.7
|
163.2
|
162.1
|
65.3
|
66.5
|
61.8
|
63.2
|
|
Male
|
170.7
|
170.3
|
163.8
|
164.9
|
61.6
|
63.2
|
60.9
|
56.4
|
下圖通過顏色區分,標記了DataFrame的結構。與單層索引的表一樣,具備元素值、行索引和列索引三個部分。其中,這裏的行索引和列索引都是MultiIndex類型,只不過索引中的一個元素是元組而不是單層索引中的標量。例如,行索引的第四個元素為("B", "Male"),列索引的第二個元素為("Height", "Senior"),這裏需要注意,外層連續出現相同的值時,第一次之後出現的會被隱藏顯示,使結果的可讀性增強。
與單層索引類似,MultiIndex也具有名字屬性,圖中的School和Gender分別對應了表的第一層和第二層行索引的名字,Indicator和Grade分別對應了第一層和第二層列索引的名字。
索引的名字和值屬性分別可以通過names和values獲得:
df_multi.index.names
FrozenList(['School', 'Gender'])
df_multi.columns.names
FrozenList(['Indicator', 'Grade'])
df_multi.index.values
array([('A', 'Female'), ('A', 'Male'), ('B', 'Female'), ('B', 'Male'),
('C', 'Female'), ('C', 'Male'), ('D', 'Female'), ('D', 'Male')],
dtype=object)
df_multi.columns.values
array([('Height', 'Freshman'), ('Height', 'Senior'),
('Height', 'Sophomore'), ('Height', 'Junior'),
('Weight', 'Freshman'), ('Weight', 'Senior'),
('Weight', 'Sophomore'), ('Weight', 'Junior')], dtype=object)
如果想要得到某一層的索引,則需要通過get_level_values獲得:
df_multi.index.get_level_values(1)
Index(['Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male'], dtype='object', name='Gender')
但對於索引而言,無論是單層還是多層,用户都無法通過index_obj[0] = item的方式來修改元素,也不能通過index_name[0] = new_name的方式來修改名字,關於如何修改這些屬性的話題將在第三節被討論。
2. 多級索引中的loc索引器
熟悉了結構後,現在回到原表,將學校和年級設為索引,此時的行為多級索引,列為單級索引,由於默認狀態的列索引不含名字,因此對應於剛剛圖中Indicator和Grade的索引名位置是空缺的。
df_multi = df.set_index(['School', 'Grade'])
df_multi.head()
|
Name
|
Gender
|
Weight
|
Transfer
|
||
|
School
|
Grade
|
||||
|
Shanghai Jiao Tong University
|
Freshman
|
Gaopeng Yang
|
Female
|
46.0
|
N
|
|
Peking University
|
Freshman
|
Changqiang You
|
Male
|
70.0
|
N
|
|
Shanghai Jiao Tong University
|
Senior
|
Mei Sun
|
Male
|
89.0
|
N
|
|
Fudan University
|
Sophomore
|
Xiaojuan Sun
|
Female
|
41.0
|
N
|
|
Sophomore
|
Gaojuan You
|
Male
|
74.0
|
N
|
由於多級索引中的單個元素以元組為單位,因此之前在第一節介紹的 loc 和 iloc 方法完全可以照搬,只需把標量的位置替換成對應的元組。
當傳入元組列表或單個元組或返回前二者的函數時,需要先進行索引排序以避免性能警告:
with warnings.catch_warnings():
warnings.filterwarnings('error')
try:
df_multi.loc[('Fudan University', 'Junior')].head()
except Warning as w:
Warning_Msg = w
Warning_Msg
pandas.errors.PerformanceWarning('indexing past lexsort depth may impact performance.')
df_sorted = df_multi.sort_index()
df_sorted.loc[('Fudan University', 'Junior')].head()
|
Name
|
Gender
|
Weight
|
Transfer
|
||
|
School
|
Grade
|
||||
|
Fudan University
|
Junior
|
Yanli You
|
Female
|
48.0
|
N
|
|
Junior
|
Chunqiang Chu
|
Male
|
72.0
|
N
|
|
|
Junior
|
Changfeng Lv
|
Male
|
76.0
|
N
|
|
|
Junior
|
Yanjuan Lv
|
Female
|
49.0
|
NaN
|
|
|
Junior
|
Gaoqiang Zhou
|
Female
|
43.0
|
N
|
df_sorted.loc[[('Fudan University', 'Senior'), ('Shanghai Jiao Tong University', 'Freshman')]]
|
Name
|
Gender
|
Weight
|
Transfer
|
||
|
School
|
Grade
|
||||
|
Fudan University
|
Senior
|
Chengpeng Zheng
|
Female
|
38.0
|
N
|
|
Senior
|
Feng Zhou
|
Female
|
47.0
|
N
|
|
|
Senior
|
Gaomei Lv
|
Female
|
34.0
|
N
|
|
|
Senior
|
Chunli Lv
|
Female
|
56.0
|
N
|
|
|
Senior
|
Chengpeng Zhou
|
Male
|
81.0
|
N
|
|
|
Senior
|
Gaopeng Qin
|
Female
|
52.0
|
N
|
|
|
Senior
|
Chunjuan Xu
|
Female
|
47.0
|
N
|
|
|
Senior
|
Juan Zhang
|
Female
|
47.0
|
N
|
|
|
Senior
|
Chengpeng Qian
|
Male
|
73.0
|
Y
|
|
|
Senior
|
Xiaojuan Qian
|
Female
|
50.0
|
N
|
|
|
Senior
|
Quan Xu
|
Female
|
44.0
|
N
|
|
|
Shanghai Jiao Tong University
|
Freshman
|
Gaopeng Yang
|
Female
|
46.0
|
N
|
|
Freshman
|
Qiang Chu
|
Female
|
52.0
|
N
|
|
|
Freshman
|
Xiaopeng Zhou
|
Male
|
74.0
|
N
|
|
|
Freshman
|
Yanpeng Lv
|
Male
|
65.0
|
N
|
|
|
Freshman
|
Xiaopeng Zhao
|
Female
|
53.0
|
N
|
|
|
Freshman
|
Chunli Zhao
|
Male
|
83.0
|
N
|
|
|
Freshman
|
Peng Zhang
|
Female
|
NaN
|
N
|
|
|
Freshman
|
Xiaoquan Sun
|
Female
|
40.0
|
N
|
|
|
Freshman
|
Chunmei Shi
|
Female
|
52.0
|
N
|
|
|
Freshman
|
Xiaomei Yang
|
Female
|
49.0
|
N
|
|
|
Freshman
|
Xiaofeng Qian
|
Female
|
49.0
|
N
|
|
|
Freshman
|
Changmei Lv
|
Male
|
75.0
|
N
|
|
|
Freshman
|
Qiang Feng
|
Male
|
80.0
|
N
|
df_sorted.loc[df_sorted.Weight > 70].head() # 布爾列表也是可用的
|
Name
|
Gender
|
Weight
|
Transfer
|
||
|
School
|
Grade
|
||||
|
Fudan University
|
Freshman
|
Feng Wang
|
Male
|
74.0
|
N
|
|
Junior
|
Chunqiang Chu
|
Male
|
72.0
|
N
|
|
|
Junior
|
Changfeng Lv
|
Male
|
76.0
|
N
|
|
|
Senior
|
Chengpeng Zhou
|
Male
|
81.0
|
N
|
|
|
Senior
|
Chengpeng Qian
|
Male
|
73.0
|
Y
|
df_sorted.loc[lambda x:('Fudan University','Junior')].head()
|
Name
|
Gender
|
Weight
|
Transfer
|
||
|
School
|
Grade
|
||||
|
Fudan University
|
Junior
|
Yanli You
|
Female
|
48.0
|
N
|
|
Junior
|
Chunqiang Chu
|
Male
|
72.0
|
N
|
|
|
Junior
|
Changfeng Lv
|
Male
|
76.0
|
N
|
|
|
Junior
|
Yanjuan Lv
|
Female
|
49.0
|
NaN
|
|
|
Junior
|
Gaoqiang Zhou
|
Female
|
43.0
|
N
|
當使用切片時需要注意,在單級索引中只要切片端點元素是唯一的,那麼就可以進行切片,但在多級索引中,無論元組在索引中是否重複出現,都必須經過排序才能使用切片,否則報錯:
try:
df_multi.loc[('Fudan University', 'Senior'):].head()
except Exception as e:
Err_Msg = e
Err_Msg
pandas.errors.UnsortedIndexError('Key length (2) was greater than MultiIndex lexsort depth (0)')
df_sorted.loc[('Fudan University', 'Senior'):].head()
|
Name
|
Gender
|
Weight
|
Transfer
|
||
|
School
|
Grade
|
||||
|
Fudan University
|
Senior
|
Chengpeng Zheng
|
Female
|
38.0
|
N
|
|
Senior
|
Feng Zhou
|
Female
|
47.0
|
N
|
|
|
Senior
|
Gaomei Lv
|
Female
|
34.0
|
N
|
|
|
Senior
|
Chunli Lv
|
Female
|
56.0
|
N
|
|
|
Senior
|
Chengpeng Zhou
|
Male
|
81.0
|
N
|
df_unique = df.drop_duplicates(subset=['School','Grade']).set_index(['School', 'Grade'])
df_unique.head()
|
Name
|
Gender
|
Weight
|
Transfer
|
||
|
School
|
Grade
|
||||
|
Shanghai Jiao Tong University
|
Freshman
|
Gaopeng Yang
|
Female
|
46.0
|
N
|
|
Peking University
|
Freshman
|
Changqiang You
|
Male
|
70.0
|
N
|
|
Shanghai Jiao Tong University
|
Senior
|
Mei Sun
|
Male
|
89.0
|
N
|
|
Fudan University
|
Sophomore
|
Xiaojuan Sun
|
Female
|
41.0
|
N
|
|
Tsinghua University
|
Freshman
|
Xiaoli Qian
|
Female
|
51.0
|
N
|
try:
df_unique.loc[('Fudan University', 'Senior'):].head()
except Exception as e:
Err_Msg = e
Err_Msg
pandas.errors.UnsortedIndexError('Key length (2) was greater than MultiIndex lexsort depth (0)')
df_unique.sort_index().loc[('Fudan University', 'Senior'):].head()
|
Name
|
Gender
|
Weight
|
Transfer
|
||
|
School
|
Grade
|
||||
|
Fudan University
|
Senior
|
Chengpeng Zheng
|
Female
|
38.0
|
N
|
|
Sophomore
|
Xiaojuan Sun
|
Female
|
41.0
|
N
|
|
|
Peking University
|
Freshman
|
Changqiang You
|
Male
|
70.0
|
N
|
|
Junior
|
Juan Xu
|
Female
|
NaN
|
N
|
|
|
Senior
|
Changli Lv
|
Female
|
41.0
|
N
|
此外,在多級索引中的元組有一種特殊的用法,可以對多層的元素進行交叉組合後索引,但同時需要指定loc的列,全選則用:表示。其中,每一層需要選中的元素用列表存放,傳入loc的形式為[(level_0_list, level_1_list), cols]。例如,想要得到所有北大和復旦的大二大三學生,可以如下寫出:
res = df_multi.loc[(['Peking University', 'Fudan University'], ['Sophomore', 'Junior']), :]
res
|
Name
|
Gender
|
Weight
|
Transfer
|
||
|
School
|
Grade
|
||||
|
Peking University
|
Sophomore
|
Changmei Xu
|
Female
|
43.0
|
N
|
|
Sophomore
|
Xiaopeng Qin
|
Male
|
NaN
|
N
|
|
|
Sophomore
|
Mei Xu
|
Female
|
39.0
|
N
|
|
|
Sophomore
|
Xiaoli Zhou
|
Female
|
55.0
|
N
|
|
|
Sophomore
|
Peng Han
|
Female
|
34.0
|
NaN
|
|
|
Junior
|
Juan Xu
|
Female
|
NaN
|
N
|
|
|
Junior
|
Changjuan You
|
Female
|
47.0
|
N
|
|
|
Junior
|
Gaoli Xu
|
Female
|
48.0
|
N
|
|
|
Junior
|
Gaoquan Zhou
|
Male
|
70.0
|
N
|
|
|
Junior
|
Qiang You
|
Female
|
56.0
|
N
|
|
|
Junior
|
Chengli Zhao
|
Male
|
NaN
|
NaN
|
|
|
Junior
|
Chengpeng Zhao
|
Female
|
44.0
|
N
|
|
|
Junior
|
Xiaofeng Zhao
|
Female
|
46.0
|
N
|
|
|
Fudan University
|
Sophomore
|
Xiaojuan Sun
|
Female
|
41.0
|
N
|
|
Sophomore
|
Gaojuan You
|
Male
|
74.0
|
N
|
|
|
Sophomore
|
Xiaoquan Zhang
|
Female
|
45.0
|
N
|
|
|
Sophomore
|
Mei Xu
|
Male
|
79.0
|
N
|
|
|
Sophomore
|
Chengqiang Lv
|
Female
|
53.0
|
N
|
|
|
Sophomore
|
Xiaojuan Chu
|
Male
|
68.0
|
N
|
|
|
Sophomore
|
Changqiang Qian
|
Male
|
64.0
|
N
|
|
|
Sophomore
|
Li Sun
|
Female
|
57.0
|
N
|
|
|
Junior
|
Yanli You
|
Female
|
48.0
|
N
|
|
|
Junior
|
Chunqiang Chu
|
Male
|
72.0
|
N
|
|
|
Junior
|
Changfeng Lv
|
Male
|
76.0
|
N
|
|
|
Junior
|
Yanjuan Lv
|
Female
|
49.0
|
NaN
|
|
|
Junior
|
Gaoqiang Zhou
|
Female
|
43.0
|
N
|
|
|
Junior
|
Xiaojuan Zhao
|
Female
|
49.0
|
N
|
|
|
Junior
|
Gaoquan Chu
|
Female
|
51.0
|
N
|
|
|
Junior
|
Chengli Sun
|
Male
|
62.0
|
N
|
|
|
Junior
|
Gaojuan Qian
|
Female
|
44.0
|
N
|
|
|
Junior
|
Xiaojuan Qian
|
Female
|
51.0
|
N
|
|
|
Junior
|
Chunjuan Zhang
|
Female
|
47.0
|
N
|
|
|
Junior
|
Xiaojuan Sun
|
Female
|
46.0
|
N
|
res.shape
(33, 4)
下面的語句和上面類似,但仍然傳入的是元素(這裏為元組)的列表,它們的意義是不同的,表示的是選出北大的大三學生和復旦的大二學生:
res = df_multi.loc[[('Peking University', 'Junior'), ('Fudan University', 'Sophomore')]]
res
|
Name
|
Gender
|
Weight
|
Transfer
|
||
|
School
|
Grade
|
||||
|
Peking University
|
Junior
|
Juan Xu
|
Female
|
NaN
|
N
|
|
Junior
|
Changjuan You
|
Female
|
47.0
|
N
|
|
|
Junior
|
Gaoli Xu
|
Female
|
48.0
|
N
|
|
|
Junior
|
Gaoquan Zhou
|
Male
|
70.0
|
N
|
|
|
Junior
|
Qiang You
|
Female
|
56.0
|
N
|
|
|
Junior
|
Chengli Zhao
|
Male
|
NaN
|
NaN
|
|
|
Junior
|
Chengpeng Zhao
|
Female
|
44.0
|
N
|
|
|
Junior
|
Xiaofeng Zhao
|
Female
|
46.0
|
N
|
|
|
Fudan University
|
Sophomore
|
Xiaojuan Sun
|
Female
|
41.0
|
N
|
|
Sophomore
|
Gaojuan You
|
Male
|
74.0
|
N
|
|
|
Sophomore
|
Xiaoquan Zhang
|
Female
|
45.0
|
N
|
|
|
Sophomore
|
Mei Xu
|
Male
|
79.0
|
N
|
|
|
Sophomore
|
Chengqiang Lv
|
Female
|
53.0
|
N
|
|
|
Sophomore
|
Xiaojuan Chu
|
Male
|
68.0
|
N
|
|
|
Sophomore
|
Changqiang Qian
|
Male
|
64.0
|
N
|
|
|
Sophomore
|
Li Sun
|
Female
|
57.0
|
N
|
res.shape
(16, 4)
3. IndexSlice對象
前面介紹的方法,即使在索引不重複的時候,也只能對元組整體進行切片,而不能對每層進行切片,也不允許將切片和布爾列表混合使用,引入IndexSlice對象就能解決這個問題。Slice對象一共有兩種形式,第一種為loc[idx[*,*]]型,第二種為loc[idx[*,*],idx[*,*]]型,下面將進行介紹。為了方便演示,下面構造一個索引不重複的DataFrame:
np.random.seed(0)
L1,L2 = ['A','B','C'],['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
L3,L4 = ['D','E','F'],['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(9,9)), index=mul_index1, columns=mul_index2)
df_ex
|
Big
|
D
|
E
|
F
|
|||||||
|
Small
|
d
|
e
|
f
|
d
|
e
|
f
|
d
|
e
|
f
|
|
|
Upper
|
Lower
|
|||||||||
|
A
|
a
|
3
|
6
|
-9
|
-6
|
-6
|
-2
|
0
|
9
|
-5
|
|
b
|
-3
|
3
|
-8
|
-3
|
-2
|
5
|
8
|
-4
|
4
|
|
|
c
|
-1
|
0
|
7
|
-4
|
6
|
6
|
-9
|
9
|
-6
|
|
|
B
|
a
|
8
|
5
|
-2
|
-9
|
-8
|
0
|
-9
|
1
|
-6
|
|
b
|
2
|
9
|
-7
|
-9
|
-9
|
-5
|
-4
|
-3
|
-1
|
|
|
c
|
8
|
6
|
-5
|
0
|
1
|
-8
|
-8
|
-2
|
0
|
|
|
C
|
a
|
-6
|
-3
|
2
|
5
|
9
|
-9
|
5
|
-6
|
3
|
|
b
|
1
|
2
|
-5
|
-3
|
-5
|
6
|
-6
|
3
|
-5
|
|
|
c
|
-1
|
5
|
6
|
-6
|
6
|
4
|
7
|
8
|
-4
|
為了使用silce對象,先要進行定義:
idx = pd.IndexSlice
【a】loc[idx[*,*]]型
這種情況並不能進行多層分別切片,前一個*表示行的選擇,後一個*表示列的選擇,與單純的loc是類似的:
df_ex.loc[idx['C':, ('D', 'f'):]]
|
Big
|
D
|
E
|
F
|
|||||
|
Small
|
f
|
d
|
e
|
f
|
d
|
e
|
f
|
|
|
Upper
|
Lower
|
|||||||
|
C
|
a
|
2
|
5
|
9
|
-9
|
5
|
-6
|
3
|
|
b
|
-5
|
-3
|
-5
|
6
|
-6
|
3
|
-5
|
|
|
c
|
6
|
-6
|
6
|
4
|
7
|
8
|
-4
|
另外,也支持布爾序列的索引:
df_ex.sum()
Big Small
D d 11
e 33
f -21
E d -35
e -8
f -3
F d -16
e 15
f -20
dtype: int64
df_ex.loc[idx[:'A', lambda x:x.sum()>0]] # 列所有和大於0
|
Big
|
D
|
F
|
||
|
Small
|
d
|
e
|
e
|
|
|
Upper
|
Lower
|
|||
|
A
|
a
|
3
|
6
|
9
|
|
b
|
-3
|
3
|
-4
|
|
|
c
|
-1
|
0
|
9
|
【b】loc[idx[*,*],idx[*,*]]型
這種情況能夠分層進行切片,前一個idx指代的是行索引,後一個是列索引。
df_ex.loc[idx[:'A', 'b':], idx['E':, 'e':]]
|
Big
|
E
|
F
|
|||
|
Small
|
e
|
f
|
e
|
f
|
|
|
Upper
|
Lower
|
||||
|
A
|
b
|
-2
|
5
|
-4
|
4
|
|
c
|
6
|
6
|
9
|
-6
|
但需要注意的是,此時不支持使用函數:
try:
df_ex.loc[idx[:'A', lambda x: 'b'], idx['E':, 'e':]]
except Exception as e:
Err_Msg = e
Err_Msg
KeyError(<function __main__.<lambda>(x)>)
4. 多級索引的構造
前面提到了多級索引表的結構和切片,那麼除了使用set_index之外,如何自己構造多級索引呢?常用的有from_tuples, from_arrays, from_product三種方法,它們都是pd.MultiIndex對象下的函數。
from_tuples指根據傳入由元組組成的列表進行構造:
my_tuple = [('a','cat'),('a','dog'),('b','cat'),('b','dog')]
pd.MultiIndex.from_tuples(my_tuple, names=['First','Second'])
MultiIndex([('a', 'cat'),
('a', 'dog'),
('b', 'cat'),
('b', 'dog')],
names=['First', 'Second'])
from_arrays指根據傳入列表中,對應層的列表進行構造:
my_array = [list('aabb'), ['cat', 'dog']*2]
pd.MultiIndex.from_arrays(my_array, names=['First','Second'])
MultiIndex([('a', 'cat'),
('a', 'dog'),
('b', 'cat'),
('b', 'dog')],
names=['First', 'Second'])
from_product指根據給定多個列表的笛卡爾積進行構造:
my_list1 = ['a','b']
my_list2 = ['cat','dog']
pd.MultiIndex.from_product([my_list1, my_list2], names=['First','Second'])
MultiIndex([('a', 'cat'),
('a', 'dog'),
('b', 'cat'),
('b', 'dog')],
names=['First', 'Second'])
三、索引的常用方法
1. 索引層的交換和刪除
為了方便理解交換的過程,這裏構造一個三級索引的例子:
np.random.seed(0)
L1,L2,L3 = ['A','B'],['a','b'],['alpha','beta']
mul_index1 = pd.MultiIndex.from_product([L1,L2,L3], names=('Upper', 'Lower','Extra'))
L4,L5,L6 = ['C','D'],['c','d'],['cat','dog']
mul_index2 = pd.MultiIndex.from_product([L4,L5,L6], names=('Big', 'Small', 'Other'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(8,8)), index=mul_index1, columns=mul_index2)
df_ex
|
Big
|
C
|
D
|
||||||||
|
Small
|
c
|
d
|
c
|
d
|
||||||
|
Other
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
||
|
Upper
|
Lower
|
Extra
|
||||||||
|
A
|
a
|
alpha
|
3
|
6
|
-9
|
-6
|
-6
|
-2
|
0
|
9
|
|
beta
|
-5
|
-3
|
3
|
-8
|
-3
|
-2
|
5
|
8
|
||
|
b
|
alpha
|
-4
|
4
|
-1
|
0
|
7
|
-4
|
6
|
6
|
|
|
beta
|
-9
|
9
|
-6
|
8
|
5
|
-2
|
-9
|
-8
|
||
|
B
|
a
|
alpha
|
0
|
-9
|
1
|
-6
|
2
|
9
|
-7
|
-9
|
|
beta
|
-9
|
-5
|
-4
|
-3
|
-1
|
8
|
6
|
-5
|
||
|
b
|
alpha
|
0
|
1
|
-8
|
-8
|
-2
|
0
|
-6
|
-3
|
|
|
beta
|
2
|
5
|
9
|
-9
|
5
|
-6
|
3
|
1
|
索引層的交換由swaplevel和reorder_levels完成,前者只能交換兩個層,而後者可以交換任意層,兩者都可以指定交換的是軸是哪一個,即行索引或列索引:
df_ex.swaplevel(0,2,axis=1) # 列索引的第一層和第三層交換
|
Other
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
||
|
Small
|
c
|
c
|
d
|
d
|
c
|
c
|
d
|
d
|
||
|
Big
|
C
|
C
|
C
|
C
|
D
|
D
|
D
|
D
|
||
|
Upper
|
Lower
|
Extra
|
||||||||
|
A
|
a
|
alpha
|
3
|
6
|
-9
|
-6
|
-6
|
-2
|
0
|
9
|
|
beta
|
-5
|
-3
|
3
|
-8
|
-3
|
-2
|
5
|
8
|
||
|
b
|
alpha
|
-4
|
4
|
-1
|
0
|
7
|
-4
|
6
|
6
|
|
|
beta
|
-9
|
9
|
-6
|
8
|
5
|
-2
|
-9
|
-8
|
||
|
B
|
a
|
alpha
|
0
|
-9
|
1
|
-6
|
2
|
9
|
-7
|
-9
|
|
beta
|
-9
|
-5
|
-4
|
-3
|
-1
|
8
|
6
|
-5
|
||
|
b
|
alpha
|
0
|
1
|
-8
|
-8
|
-2
|
0
|
-6
|
-3
|
|
|
beta
|
2
|
5
|
9
|
-9
|
5
|
-6
|
3
|
1
|
df_ex.reorder_levels([2,0,1],axis=0).head() # 列表數字指代原來索引中的層
|
Big
|
C
|
D
|
||||||||
|
Small
|
c
|
d
|
c
|
d
|
||||||
|
Other
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
||
|
Extra
|
Upper
|
Lower
|
||||||||
|
alpha
|
A
|
a
|
3
|
6
|
-9
|
-6
|
-6
|
-2
|
0
|
9
|
|
beta
|
A
|
a
|
-5
|
-3
|
3
|
-8
|
-3
|
-2
|
5
|
8
|
|
alpha
|
A
|
b
|
-4
|
4
|
-1
|
0
|
7
|
-4
|
6
|
6
|
|
beta
|
A
|
b
|
-9
|
9
|
-6
|
8
|
5
|
-2
|
-9
|
-8
|
|
alpha
|
B
|
a
|
0
|
-9
|
1
|
-6
|
2
|
9
|
-7
|
-9
|
【NOTE】軸之間的索引交換
這裏只涉及行或列索引內部的交換,不同方向索引之間的交換將在第五章中被討論。
【END】
若想要刪除某一層的索引,可以使用droplevel方法:
df_ex.droplevel(1,axis=1)
|
Big
|
C
|
D
|
||||||||
|
Other
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
||
|
Upper
|
Lower
|
Extra
|
||||||||
|
A
|
a
|
alpha
|
3
|
6
|
-9
|
-6
|
-6
|
-2
|
0
|
9
|
|
beta
|
-5
|
-3
|
3
|
-8
|
-3
|
-2
|
5
|
8
|
||
|
b
|
alpha
|
-4
|
4
|
-1
|
0
|
7
|
-4
|
6
|
6
|
|
|
beta
|
-9
|
9
|
-6
|
8
|
5
|
-2
|
-9
|
-8
|
||
|
B
|
a
|
alpha
|
0
|
-9
|
1
|
-6
|
2
|
9
|
-7
|
-9
|
|
beta
|
-9
|
-5
|
-4
|
-3
|
-1
|
8
|
6
|
-5
|
||
|
b
|
alpha
|
0
|
1
|
-8
|
-8
|
-2
|
0
|
-6
|
-3
|
|
|
beta
|
2
|
5
|
9
|
-9
|
5
|
-6
|
3
|
1
|
df_ex.droplevel([0,1],axis=0)
|
Big
|
C
|
D
|
||||||
|
Small
|
c
|
d
|
c
|
d
|
||||
|
Other
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
|
Extra
|
||||||||
|
alpha
|
3
|
6
|
-9
|
-6
|
-6
|
-2
|
0
|
9
|
|
beta
|
-5
|
-3
|
3
|
-8
|
-3
|
-2
|
5
|
8
|
|
alpha
|
-4
|
4
|
-1
|
0
|
7
|
-4
|
6
|
6
|
|
beta
|
-9
|
9
|
-6
|
8
|
5
|
-2
|
-9
|
-8
|
|
alpha
|
0
|
-9
|
1
|
-6
|
2
|
9
|
-7
|
-9
|
|
beta
|
-9
|
-5
|
-4
|
-3
|
-1
|
8
|
6
|
-5
|
|
alpha
|
0
|
1
|
-8
|
-8
|
-2
|
0
|
-6
|
-3
|
|
beta
|
2
|
5
|
9
|
-9
|
5
|
-6
|
3
|
1
|
2. 索引屬性的修改
通過rename_axis可以對索引層的名字進行修改,常用的修改方式是傳入字典的映射:
df_ex.rename_axis(index={'Upper':'Changed_row'}, columns={'Other':'Changed_Col'}).head()
|
Big
|
C
|
D
|
||||||||
|
Small
|
c
|
d
|
c
|
d
|
||||||
|
Changed_Col
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
||
|
Changed_row
|
Lower
|
Extra
|
||||||||
|
A
|
a
|
alpha
|
3
|
6
|
-9
|
-6
|
-6
|
-2
|
0
|
9
|
|
beta
|
-5
|
-3
|
3
|
-8
|
-3
|
-2
|
5
|
8
|
||
|
b
|
alpha
|
-4
|
4
|
-1
|
0
|
7
|
-4
|
6
|
6
|
|
|
beta
|
-9
|
9
|
-6
|
8
|
5
|
-2
|
-9
|
-8
|
||
|
B
|
a
|
alpha
|
0
|
-9
|
1
|
-6
|
2
|
9
|
-7
|
-9
|
通過rename可以對索引的值進行修改,如果是多級索引需要指定修改的層號level:
df_ex.rename(columns={'cat':'not_cat'}, level=2).head()
|
Big
|
C
|
D
|
||||||||
|
Small
|
c
|
d
|
c
|
d
|
||||||
|
Other
|
not_cat
|
dog
|
not_cat
|
dog
|
not_cat
|
dog
|
not_cat
|
dog
|
||
|
Upper
|
Lower
|
Extra
|
||||||||
|
A
|
a
|
alpha
|
3
|
6
|
-9
|
-6
|
-6
|
-2
|
0
|
9
|
|
beta
|
-5
|
-3
|
3
|
-8
|
-3
|
-2
|
5
|
8
|
||
|
b
|
alpha
|
-4
|
4
|
-1
|
0
|
7
|
-4
|
6
|
6
|
|
|
beta
|
-9
|
9
|
-6
|
8
|
5
|
-2
|
-9
|
-8
|
||
|
B
|
a
|
alpha
|
0
|
-9
|
1
|
-6
|
2
|
9
|
-7
|
-9
|
傳入參數也可以是函數,其輸入值就是索引元素:
df_ex.rename(index=lambda x:str.upper(x), level=2).head()
|
Big
|
C
|
D
|
||||||||
|
Small
|
c
|
d
|
c
|
d
|
||||||
|
Other
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
||
|
Upper
|
Lower
|
Extra
|
||||||||
|
A
|
a
|
ALPHA
|
3
|
6
|
-9
|
-6
|
-6
|
-2
|
0
|
9
|
|
BETA
|
-5
|
-3
|
3
|
-8
|
-3
|
-2
|
5
|
8
|
||
|
b
|
ALPHA
|
-4
|
4
|
-1
|
0
|
7
|
-4
|
6
|
6
|
|
|
BETA
|
-9
|
9
|
-6
|
8
|
5
|
-2
|
-9
|
-8
|
||
|
B
|
a
|
ALPHA
|
0
|
-9
|
1
|
-6
|
2
|
9
|
-7
|
-9
|
對於整個索引的元素替換,可以利用迭代器實現:
new_values = iter(list('abcdefgh'))
df_ex.rename(index=lambda x:next(new_values), level=2)
|
Big
|
C
|
D
|
||||||||
|
Small
|
c
|
d
|
c
|
d
|
||||||
|
Other
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
||
|
Upper
|
Lower
|
Extra
|
||||||||
|
A
|
a
|
a
|
3
|
6
|
-9
|
-6
|
-6
|
-2
|
0
|
9
|
|
b
|
-5
|
-3
|
3
|
-8
|
-3
|
-2
|
5
|
8
|
||
|
b
|
c
|
-4
|
4
|
-1
|
0
|
7
|
-4
|
6
|
6
|
|
|
d
|
-9
|
9
|
-6
|
8
|
5
|
-2
|
-9
|
-8
|
||
|
B
|
a
|
e
|
0
|
-9
|
1
|
-6
|
2
|
9
|
-7
|
-9
|
|
f
|
-9
|
-5
|
-4
|
-3
|
-1
|
8
|
6
|
-5
|
||
|
b
|
g
|
0
|
1
|
-8
|
-8
|
-2
|
0
|
-6
|
-3
|
|
|
h
|
2
|
5
|
9
|
-9
|
5
|
-6
|
3
|
1
|
若想要對某個位置的元素進行修改,在單層索引時容易實現,即先取出索引的values屬性,再給對得到的列表進行修改,最後再對index對象重新賦值。但是如果是多級索引的話就有些麻煩,一個解決的方案是先把某一層索引臨時轉為表的元素,然後再進行修改,最後重新設定為索引,下面一節將介紹這些操作。
另外一個需要介紹的函數是map,它是定義在Index上的方法,與前面rename方法中層的函數式用法是類似的,只不過它傳入的不是層的標量值,而是直接傳入索引的元組,這為用户進行跨層的修改提供了遍歷。例如,可以等價地寫出上面的字符串轉大寫的操作:
df_temp = df_ex.copy()
new_idx = df_temp.index.map(lambda x: (x[0], x[1], str.upper(x[2])))
df_temp.index= new_idx
df_temp.head()
|
Big
|
C
|
D
|
||||||||
|
Small
|
c
|
d
|
c
|
d
|
||||||
|
Other
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
||
|
Upper
|
Lower
|
Extra
|
||||||||
|
A
|
a
|
ALPHA
|
3
|
6
|
-9
|
-6
|
-6
|
-2
|
0
|
9
|
|
BETA
|
-5
|
-3
|
3
|
-8
|
-3
|
-2
|
5
|
8
|
||
|
b
|
ALPHA
|
-4
|
4
|
-1
|
0
|
7
|
-4
|
6
|
6
|
|
|
BETA
|
-9
|
9
|
-6
|
8
|
5
|
-2
|
-9
|
-8
|
||
|
B
|
a
|
ALPHA
|
0
|
-9
|
1
|
-6
|
2
|
9
|
-7
|
-9
|
關於map的另一個使用方法是對多級索引的壓縮,這在第四章和第五章的一些操作中是有用的:
df_temp = df_ex.copy()
new_idx = df_temp.index.map(lambda x: (x[0]+'-'+x[1]+'-'+x[2]))
df_temp.index = new_idx
df_temp.head() # 單層索引
|
Big
|
C
|
D
|
||||||
|
Small
|
c
|
d
|
c
|
d
|
||||
|
Other
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
|
A-a-alpha
|
3
|
6
|
-9
|
-6
|
-6
|
-2
|
0
|
9
|
|
A-a-beta
|
-5
|
-3
|
3
|
-8
|
-3
|
-2
|
5
|
8
|
|
A-b-alpha
|
-4
|
4
|
-1
|
0
|
7
|
-4
|
6
|
6
|
|
A-b-beta
|
-9
|
9
|
-6
|
8
|
5
|
-2
|
-9
|
-8
|
|
B-a-alpha
|
0
|
-9
|
1
|
-6
|
2
|
9
|
-7
|
-9
|
同時,也可以反向地展開:
new_idx = df_temp.index.map(lambda x:tuple(x.split('-')))
df_temp.index = new_idx
df_temp.head() # 三層索引
|
Big
|
C
|
D
|
||||||||
|
Small
|
c
|
d
|
c
|
d
|
||||||
|
Other
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
cat
|
dog
|
||
|
A
|
a
|
alpha
|
3
|
6
|
-9
|
-6
|
-6
|
-2
|
0
|
9
|
|
beta
|
-5
|
-3
|
3
|
-8
|
-3
|
-2
|
5
|
8
|
||
|
b
|
alpha
|
-4
|
4
|
-1
|
0
|
7
|
-4
|
6
|
6
|
|
|
beta
|
-9
|
9
|
-6
|
8
|
5
|
-2
|
-9
|
-8
|
||
|
B
|
a
|
alpha
|
0
|
-9
|
1
|
-6
|
2
|
9
|
-7
|
-9
|
3. 索引的設置與重置
為了説明本節的函數,下面構造一個新表:
df_new = pd.DataFrame({'A':list('aacd'), 'B':list('PQRT'), 'C':[1,2,3,4]})
df_new
|
A
|
B
|
C
|
|
|
0
|
a
|
P
|
1
|
|
1
|
a
|
Q
|
2
|
|
2
|
c
|
R
|
3
|
|
3
|
d
|
T
|
4
|
索引的設置可以使用set_index完成,這裏的主要參數是append,表示是否來保留原來的索引,直接把新設定的添加到原索引的內層:
df_new.set_index('A')
|
B
|
C
|
|
|
A
|
||
|
a
|
P
|
1
|
|
a
|
Q
|
2
|
|
c
|
R
|
3
|
|
d
|
T
|
4
|
df_new.set_index('A', append=True)
|
B
|
C
|
||
|
A
|
|||
|
0
|
a
|
P
|
1
|
|
1
|
a
|
Q
|
2
|
|
2
|
c
|
R
|
3
|
|
3
|
d
|
T
|
4
|
可以同時指定多個列作為索引:
df_new.set_index(['A', 'B'])
|
C
|
||
|
A
|
B
|
|
|
a
|
P
|
1
|
|
Q
|
2
|
|
|
c
|
R
|
3
|
|
d
|
T
|
4
|
如果想要添加索引的列沒有出現在其中,那麼可以直接在參數中傳入相應的Series:
my_index = pd.Series(list('WXYZ'), name='D')
df_new = df_new.set_index(['A', my_index])
df_new
|
B
|
C
|
||
|
A
|
D
|
||
|
a
|
W
|
P
|
1
|
|
X
|
Q
|
2
|
|
|
c
|
Y
|
R
|
3
|
|
d
|
Z
|
T
|
4
|
reset_index是set_index的逆函數,其主要參數是drop,表示是否要把去掉的索引層丟棄,而不是添加到列中:
df_new.reset_index(['D'])
|
D
|
B
|
C
|
|
|
A
|
|||
|
a
|
W
|
P
|
1
|
|
a
|
X
|
Q
|
2
|
|
c
|
Y
|
R
|
3
|
|
d
|
Z
|
T
|
4
|
df_new.reset_index(['D'], drop=True)
|
B
|
C
|
|
|
A
|
||
|
a
|
P
|
1
|
|
a
|
Q
|
2
|
|
c
|
R
|
3
|
|
d
|
T
|
4
|
如果重置了所有的索引,那麼pandas會直接重新生成一個默認索引:
df_new.reset_index()
|
A
|
D
|
B
|
C
|
|
|
0
|
a
|
W
|
P
|
1
|
|
1
|
a
|
X
|
Q
|
2
|
|
2
|
c
|
Y
|
R
|
3
|
|
3
|
d
|
Z
|
T
|
4
|
4. 索引的變形
在某些場合下,需要對索引做一些擴充或者剔除,更具體地要求是給定一個新的索引,把原表中相應的索引對應元素填充到新索引構成的表中。例如,下面的表中給出了員工信息,需要重新制作一張新的表,要求增加一名員工的同時去掉身高列並增加性別列:
df_reindex = pd.DataFrame({"Weight":[60,70,80], "Height":[176,180,179]}, index=['1001','1003','1002'])
df_reindex
|
Weight
|
Height
|
|
|
1001
|
60
|
176
|
|
1003
|
70
|
180
|
|
1002
|
80
|
179
|
df_reindex.reindex(index=['1001','1002','1003','1004'], columns=['Weight','Gender'])
|
Weight
|
Gender
|
|
|
1001
|
60.0
|
NaN
|
|
1002
|
80.0
|
NaN
|
|
1003
|
70.0
|
NaN
|
|
1004
|
NaN
|
NaN
|
這種需求常出現在時間序列索引的時間點填充以及ID編號的擴充。另外,需要注意的是原來表中的數據和新表中會根據索引自動對齊,例如原先的1002號位置在1003號之後,而新表中相反,那麼reindex中會根據元素對齊,與位置無關。
還有一個與reindex功能類似的函數是reindex_like,其功能是仿照傳入的表索引來進行被調用表索引的變形。例如,現在已經存在一張表具備了目標索引的條件,那麼上述功能可採用下述代碼得到:
df_existed = pd.DataFrame(index=['1001','1002','1003','1004'], columns=['Weight','Gender'])
df_reindex.reindex_like(df_existed)
|
Weight
|
Gender
|
|
|
1001
|
60.0
|
NaN
|
|
1002
|
80.0
|
NaN
|
|
1003
|
70.0
|
NaN
|
|
1004
|
NaN
|
NaN
|
四、索引運算
1. 集合的運算法則
經常會有一種利用集合運算來取出符合條件行的需求,例如有兩張表A和B,它們的索引都是員工編號,現在需要篩選出兩表索引交集的所有員工信息,此時通過Index上的運算操作就很容易實現。
不過在此之前,不妨先複習一下常見的四種集合運算:
2. 一般的索引運算
由於集合的元素是互異的,但是索引中可能有相同的元素,先用unique去重後再進行運算。下面構造兩張最為簡單的示例表進行演示:
df_set_1 = pd.DataFrame([[0,1],[1,2],[3,4]], index = pd.Index(['a','b','a'],name='id1'))
df_set_2 = pd.DataFrame([[4,5],[2,6],[7,1]], index = pd.Index(['b','b','c'],name='id2'))
id1, id2 = df_set_1.index.unique(), df_set_2.index.unique()
id1.intersection(id2)
Index(['b'], dtype='object')
id1.union(id2)
Index(['a', 'b', 'c'], dtype='object')
id1.difference(id2)
Index(['a'], dtype='object')
id1.symmetric_difference(id2)
Index(['a', 'c'], dtype='object')
上述的四類運算可以用等價的符號表示代替:
id1 & id2 , id1 | id2 , (id1 ^ id2) & id1 , id1 ^ id2
(Index(['b'], dtype='object'),
Index(['a', 'b', 'c'], dtype='object'),
Index(['a'], dtype='object'),
Index(['a', 'c'], dtype='object'))
若兩張表需要做集合運算的列並沒有被設置索引,一種辦法是先轉成索引,運算後再恢復,另一種方法是利用isin函數,例如在重置索引的第一張表中選出id列交集的所在行:
df_set_in_col_1 = df_set_1.reset_index()
df_set_in_col_2 = df_set_2.reset_index()
df_set_in_col_1
|
id1
|
0
|
1
|
|
|
0
|
a
|
0
|
1
|
|
1
|
b
|
1
|
2
|
|
2
|
a
|
3
|
4
|
df_set_in_col_2
|
id2
|
0
|
1
|
|
|
0
|
b
|
4
|
5
|
|
1
|
b
|
2
|
6
|
|
2
|
c
|
7
|
1
|
df_set_in_col_2[df_set_in_col_2.id2.isin(df_set_in_col_1.id1)]
|
id2
|
0
|
1
|
|
|
0
|
b
|
4
|
5
|
|
1
|
b
|
2
|
6
|
五、練習
Ex1:公司員工數據集
現有一份公司員工數據集:
df = pd.read_csv('../data/company.csv')
df.head(3)
|
EmployeeID
|
birthdate_key
|
age
|
city_name
|
department
|
job_title
|
gender
|
|
|
0
|
1318
|
1/3/1954
|
61
|
Vancouver
|
Executive
|
CEO
|
M
|
|
1
|
1319
|
1/3/1957
|
58
|
Vancouver
|
Executive
|
VP Stores
|
F
|
|
2
|
1320
|
1/2/1955
|
60
|
Vancouver
|
Executive
|
Legal Counsel
|
F
|
- 分別只使用
query和loc選出年齡不超過四十歲且工作部門為Dairy或Bakery的男性。
#使用query選出年齡不超過四十歲且工作部門為Dairy或Bakery的男性。
dpt = ['Dairy', 'Bakery']
df.query("(age<=40) & (department==@dpt) & (gender=='M')").head()
|
EmployeeID
|
birthdate_key
|
age
|
city_name
|
department
|
job_title
|
gender
|
|
|
3611
|
5791
|
1/14/1975
|
40
|
Kelowna
|
Dairy
|
Dairy Person
|
M
|
|
3613
|
5793
|
1/22/1975
|
40
|
Richmond
|
Bakery
|
Baker
|
M
|
|
3615
|
5795
|
1/30/1975
|
40
|
Nanaimo
|
Dairy
|
Dairy Person
|
M
|
|
3617
|
5797
|
2/3/1975
|
40
|
Nanaimo
|
Dairy
|
Dairy Person
|
M
|
|
3618
|
5798
|
2/4/1975
|
40
|
Surrey
|
Dairy
|
Dairy Person
|
M
|
#使用loc選出年齡不超過四十歲且工作部門為Dairy或Bakery的男性。
df.loc[(df.age<=40)&df.department.isin(dpt)&(df.gender=='M')].head()
|
EmployeeID
|
birthdate_key
|
age
|
city_name
|
department
|
job_title
|
gender
|
|
|
3611
|
5791
|
1/14/1975
|
40
|
Kelowna
|
Dairy
|
Dairy Person
|
M
|
|
3613
|
5793
|
1/22/1975
|
40
|
Richmond
|
Bakery
|
Baker
|
M
|
|
3615
|
5795
|
1/30/1975
|
40
|
Nanaimo
|
Dairy
|
Dairy Person
|
M
|
|
3617
|
5797
|
2/3/1975
|
40
|
Nanaimo
|
Dairy
|
Dairy Person
|
M
|
|
3618
|
5798
|
2/4/1975
|
40
|
Surrey
|
Dairy
|
Dairy Person
|
M
|
- 選出員工
ID號 為奇數所在行的第1、第3和倒數第2列。
df.iloc[(df.EmployeeID%2==1).values,[0,2,-2]].head()
|
EmployeeID
|
age
|
job_title
|
|
|
1
|
1319
|
58
|
VP Stores
|
|
3
|
1321
|
56
|
VP Human Resources
|
|
5
|
1323
|
53
|
Exec Assistant, VP Stores
|
|
6
|
1325
|
51
|
Exec Assistant, Legal Counsel
|
|
8
|
1329
|
48
|
Store Manager
|
- 按照以下步驟進行索引操作:
- 把後三列設為索引後交換內外兩層
- 恢復中間層索引
- 修改外層索引名為
Gender - 用下劃線合併兩層行索引
- 把行索引拆分為原狀態
- 修改索引名為原表名稱
- 恢復默認索引並將列保持為原表的相對位置
df_op = df.copy()
#把後三列設為索引後交換內外兩層
df_op = df_op.set_index(df_op.columns[-3:].tolist()).swaplevel(0,2,axis=0)
df_op
|
EmployeeID
|
birthdate_key
|
age
|
city_name
|
|||
|
gender
|
job_title
|
department
|
||||
|
M
|
CEO
|
Executive
|
1318
|
1/3/1954
|
61
|
Vancouver
|
|
F
|
VP Stores
|
Executive
|
1319
|
1/3/1957
|
58
|
Vancouver
|
|
Legal Counsel
|
Executive
|
1320
|
1/2/1955
|
60
|
Vancouver
|
|
|
M
|
VP Human Resources
|
Executive
|
1321
|
1/2/1959
|
56
|
Vancouver
|
|
VP Finance
|
Executive
|
1322
|
1/9/1958
|
57
|
Vancouver
|
|
|
...
|
...
|
...
|
...
|
...
|
...
|
...
|
|
F
|
Cashier
|
Customer Service
|
8036
|
8/9/1992
|
23
|
New Westminister
|
|
M
|
Cashier
|
Customer Service
|
8181
|
9/26/1993
|
22
|
Prince George
|
|
Customer Service
|
8223
|
2/11/1994
|
21
|
Trail
|
||
|
F
|
Cashier
|
Customer Service
|
8226
|
2/16/1994
|
21
|
Victoria
|
|
Customer Service
|
8264
|
6/13/1994
|
21
|
Vancouver
|
6284 rows × 4 columns
#恢復中間層索引
df_op = df_op.reset_index(level=1)
df_op.head()
|
job_title
|
EmployeeID
|
birthdate_key
|
age
|
city_name
|
||
|
gender
|
department
|
|||||
|
M
|
Executive
|
CEO
|
1318
|
1/3/1954
|
61
|
Vancouver
|
|
F
|
Executive
|
VP Stores
|
1319
|
1/3/1957
|
58
|
Vancouver
|
|
Executive
|
Legal Counsel
|
1320
|
1/2/1955
|
60
|
Vancouver
|
|
|
M
|
Executive
|
VP Human Resources
|
1321
|
1/2/1959
|
56
|
Vancouver
|
|
Executive
|
VP Finance
|
1322
|
1/9/1958
|
57
|
Vancouver
|
#修改外層索引名為Gender
df_op = df_op.rename_axis(index={'gender':'Gender'})
df_op.head()
|
job_title
|
EmployeeID
|
birthdate_key
|
age
|
city_name
|
||
|
Gender
|
department
|
|||||
|
M
|
Executive
|
CEO
|
1318
|
1/3/1954
|
61
|
Vancouver
|
|
F
|
Executive
|
VP Stores
|
1319
|
1/3/1957
|
58
|
Vancouver
|
|
Executive
|
Legal Counsel
|
1320
|
1/2/1955
|
60
|
Vancouver
|
|
|
M
|
Executive
|
VP Human Resources
|
1321
|
1/2/1959
|
56
|
Vancouver
|
|
Executive
|
VP Finance
|
1322
|
1/9/1958
|
57
|
Vancouver
|
#用下劃線合併兩層行索引
df_op.index = df_op.index.map(lambda x: '_'.join(x))
df_op.head()
|
job_title
|
EmployeeID
|
birthdate_key
|
age
|
city_name
|
|
|
M_Executive
|
CEO
|
1318
|
1/3/1954
|
61
|
Vancouver
|
|
F_Executive
|
VP Stores
|
1319
|
1/3/1957
|
58
|
Vancouver
|
|
F_Executive
|
Legal Counsel
|
1320
|
1/2/1955
|
60
|
Vancouver
|
|
M_Executive
|
VP Human Resources
|
1321
|
1/2/1959
|
56
|
Vancouver
|
|
M_Executive
|
VP Finance
|
1322
|
1/9/1958
|
57
|
Vancouver
|
#把行索引拆分為原狀態
df_op.index = df_op.index.map(lambda x: tuple(x.split('_')))
df_op.head()
|
job_title
|
EmployeeID
|
birthdate_key
|
age
|
city_name
|
||
|
M
|
Executive
|
CEO
|
1318
|
1/3/1954
|
61
|
Vancouver
|
|
F
|
Executive
|
VP Stores
|
1319
|
1/3/1957
|
58
|
Vancouver
|
|
Executive
|
Legal Counsel
|
1320
|
1/2/1955
|
60
|
Vancouver
|
|
|
M
|
Executive
|
VP Human Resources
|
1321
|
1/2/1959
|
56
|
Vancouver
|
|
Executive
|
VP Finance
|
1322
|
1/9/1958
|
57
|
Vancouver
|
#修改索引名為原表名稱
df_op = df_op.rename_axis(index=['gender', 'department'])
df_op
|
job_title
|
EmployeeID
|
birthdate_key
|
age
|
city_name
|
||
|
gender
|
department
|
|||||
|
M
|
Executive
|
CEO
|
1318
|
1/3/1954
|
61
|
Vancouver
|
|
F
|
Executive
|
VP Stores
|
1319
|
1/3/1957
|
58
|
Vancouver
|
|
Executive
|
Legal Counsel
|
1320
|
1/2/1955
|
60
|
Vancouver
|
|
|
M
|
Executive
|
VP Human Resources
|
1321
|
1/2/1959
|
56
|
Vancouver
|
|
Executive
|
VP Finance
|
1322
|
1/9/1958
|
57
|
Vancouver
|
|
|
...
|
...
|
...
|
...
|
...
|
...
|
...
|
|
F
|
Customer Service
|
Cashier
|
8036
|
8/9/1992
|
23
|
New Westminister
|
|
M
|
Customer Service
|
Cashier
|
8181
|
9/26/1993
|
22
|
Prince George
|
|
Customer Service
|
Cashier
|
8223
|
2/11/1994
|
21
|
Trail
|
|
|
F
|
Customer Service
|
Cashier
|
8226
|
2/16/1994
|
21
|
Victoria
|
|
Customer Service
|
Cashier
|
8264
|
6/13/1994
|
21
|
Vancouver
|
6284 rows × 5 columns
#恢復默認索引並將列保持為原表的相對位置
df_op = df_op.reset_index().reindex(df.columns, axis=1)
df_op.equals(df)
True
Ex2:巧克力數據集
現有一份關於巧克力評價的數據集:
df = pd.read_csv('../data/chocolate.csv')
df.head(3)
|
Company
|
Review\nDate
|
Cocoa\nPercent
|
Company\nLocation
|
Rating
|
|
|
0
|
A. Morin
|
2016
|
63%
|
France
|
3.75
|
|
1
|
A. Morin
|
2015
|
70%
|
France
|
2.75
|
|
2
|
A. Morin
|
2015
|
70%
|
France
|
3.00
|
- 把列索引名中的
\n替換為空格。
df.columns = [' '.join(i.split('\n')) for i in df.columns]
df.head()
|
Company
|
Review Date
|
Cocoa Percent
|
Company Location
|
Rating
|
|
|
0
|
A. Morin
|
2016
|
63%
|
France
|
3.75
|
|
1
|
A. Morin
|
2015
|
70%
|
France
|
2.75
|
|
2
|
A. Morin
|
2015
|
70%
|
France
|
3.00
|
|
3
|
A. Morin
|
2015
|
70%
|
France
|
3.50
|
|
4
|
A. Morin
|
2015
|
70%
|
France
|
3.50
|
- 巧克力
Rating評分為1至5,每0.25分一檔,請選出2.75分及以下且可可含量Cocoa Percent高於中位數的樣本。
df['Cocoa Percent'] = df['Cocoa Percent'].apply(lambda x:float(x[:-1])/100)
df.query('(Rating<=2.75)&(`Cocoa Percent`>`Cocoa Percent`.median())').head()
|
Company
|
Review Date
|
Cocoa Percent
|
Company Location
|
Rating
|
|
|
33
|
Akesson's (Pralus)
|
2010
|
0.75
|
Switzerland
|
2.75
|
|
34
|
Akesson's (Pralus)
|
2010
|
0.75
|
Switzerland
|
2.75
|
|
36
|
Alain Ducasse
|
2014
|
0.75
|
France
|
2.75
|
|
38
|
Alain Ducasse
|
2013
|
0.75
|
France
|
2.50
|
|
39
|
Alain Ducasse
|
2013
|
0.75
|
France
|
2.50
|
- 將
Review Date和Company Location設為索引後,選出Review Date在2012年之後且Company Location不屬於France, Canada, Amsterdam, Belgium的樣本。
idx = pd.IndexSlice
exclude = ['France', 'Canada', 'Amsterdam', 'Belgium']
res = df.set_index(['Review Date', 'Company Location']).sort_index(level=0)
res.loc[idx[2012:,~res.index.get_level_values(1).isin(exclude)],:].head()
|
Company
|
Cocoa Percent
|
Rating
|
||
|
Review Date
|
Company Location
|
|||
|
2012
|
Australia
|
Bahen & Co.
|
0.70
|
3.00
|
|
Australia
|
Bahen & Co.
|
0.70
|
2.50
|
|
|
Australia
|
Bahen & Co.
|
0.70
|
2.50
|
|
|
Australia
|
Cravve
|
0.75
|
3.25
|
|
|
Australia
|
Cravve
|
0.65
|
3.25
|