Pandas Study Notes: Indexing#
Version: 0.11
Created: April 15, 2024
Modified: May 21, 2024
Data sources:
movies.csv http://boxofficemojo.com/daily/
titanic.csv https://github.com/dsaber/py-viz-blog
tips.csv https://github.com/pandas-dev/pandas/blob/master/doc/data/tips.csv
Setup#
[1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
# Helper function: a 6x4 DataFrame of random values with a daily DatetimeIndex
def get_random_df():
    return pd.DataFrame(
        np.random.randn(6, 4),
        index=pd.date_range('20200101', periods=6),
        columns=list('ABCD'))
Make the row index start at 1#
[2]:
df = get_random_df()
df.index = range(1, len(df) + 1)
df.head()
[2]:
| | A | B | C | D |
|---|---|---|---|---|
| 1 | 0.118502 | -1.808740 | 0.997507 | 0.392562 |
| 2 | -0.457779 | 0.160701 | -0.454315 | -0.794106 |
| 3 | -0.064728 | 0.107220 | -0.173583 | -0.855021 |
| 4 | -0.114218 | 0.033859 | -1.801286 | -0.278239 |
| 5 | 0.898737 | -0.317048 | 0.381936 | -1.160173 |
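The same relabeling can be written in a couple of other ways. The sketch below is illustrative only: it uses a small integer-valued frame rather than `get_random_df()`, and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))

# Option 1: assign an explicit RangeIndex that starts at 1.
df.index = pd.RangeIndex(start=1, stop=len(df) + 1)

# Option 2: shift an existing 0-based integer index by one.
df2 = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))
df2.index = df2.index + 1

print(df.index.tolist())
print(df2.index.tolist())
```

Option 2 only makes sense when the current index is already the default 0-based integer index; for a date index like the one produced by the helper function, assigning a fresh range is the simpler choice.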
Set a column as the index#
By column name#
[3]:
# Build a DataFrame
df = get_random_df()
df
[3]:
| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-01 | 1.111450 | 0.182791 | -0.937013 | 0.956929 |
| 2020-01-02 | -0.304885 | 0.697800 | -0.900396 | -1.753596 |
| 2020-01-03 | -0.848140 | 0.318127 | -2.012993 | -1.005838 |
| 2020-01-04 | 0.597240 | 1.817809 | 0.786580 | 0.232008 |
| 2020-01-05 | 0.655043 | 0.762338 | -0.698655 | 1.151653 |
| 2020-01-06 | 0.184392 | -1.333492 | 0.349873 | 0.151685 |
[4]:
df.set_index(['A'])
[4]:
| A (index) | B | C | D |
|---|---|---|---|
| 1.111450 | 0.182791 | -0.937013 | 0.956929 |
| -0.304885 | 0.697800 | -0.900396 | -1.753596 |
| -0.848140 | 0.318127 | -2.012993 | -1.005838 |
| 0.597240 | 1.817809 | 0.786580 | 0.232008 |
| 0.655043 | 0.762338 | -0.698655 | 1.151653 |
| 0.184392 | -1.333492 | 0.349873 | 0.151685 |
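Two details of `set_index` are easy to miss: it returns a new DataFrame rather than changing `df`, and by default it removes the column from the data. A minimal sketch with a tiny made-up frame (the values are illustrative, not taken from the output above):

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [1.0, 2.0, 3.0]})

indexed = df.set_index("A")            # returns a new DataFrame
kept = df.set_index("A", drop=False)   # keep "A" as a regular column as well

print(list(df.columns))        # the original frame is unchanged
print(list(indexed.columns))
print(list(kept.columns))
```

To keep the result, assign it back (`df = df.set_index("A")`).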
By column position#
[5]:
# Use the third column as the index
df = get_random_df()
df.set_index(df.columns[2])
[5]:
| C (index) | A | B | D |
|---|---|---|---|
| -1.351150 | 1.378955 | 0.152973 | 0.919439 |
| 0.027135 | -0.470867 | 0.009272 | 1.162112 |
| 1.812734 | -0.528501 | 0.248176 | -1.263733 |
| 0.891118 | 0.873387 | -1.498628 | -1.941507 |
| 0.531397 | 1.573426 | -1.207346 | 0.016876 |
| 1.327886 | -0.035997 | -1.003411 | 1.246743 |
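The positional lookup extends to several columns, and `reset_index` reverses the operation. The following is a sketch on a small made-up frame; the names `by_position` and `restored` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6], "D": [7, 8]})

# Use the first and third columns (positions 0 and 2) as a two-level index.
by_position = df.set_index(list(df.columns[[0, 2]]))

# reset_index moves the index levels back into ordinary columns.
restored = by_position.reset_index()

print(list(by_position.index.names))
print(list(restored.columns))
```

Note that the restored columns come back at the front of the frame, so the original column order is not preserved.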
Build a MultiIndex manually#
[6]:
m_index = pd.MultiIndex.from_arrays(
    [['level-one'] * 2, ['level-two-one', 'level-two-two']])
m_index
[6]:
MultiIndex([('level-one', 'level-two-one'),
            ('level-one', 'level-two-two')],
           )
[7]:
m_index = pd.MultiIndex.from_product(
    [['level-one'], ['level-two-one', 'level-two-two']])
m_index
[7]:
MultiIndex([('level-one', 'level-two-one'),
            ('level-one', 'level-two-two')],
           )
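A third constructor, `MultiIndex.from_tuples`, builds the same index from explicit label pairs. The sketch below also attaches the index to a small frame as its columns to show how selection behaves; the data values and label strings are illustrative assumptions.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["level-one"], ["level-two-one", "level-two-two"]])

# The same index, spelled out pair by pair.
from_tuples_index = pd.MultiIndex.from_tuples(
    [("level-one", "level-two-one"), ("level-one", "level-two-two")])

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=columns)

# Selecting the outer label returns the columns beneath it.
sub = df["level-one"]
# A full tuple selects a single column as a Series.
col = df[("level-one", "level-two-two")]

print(from_tuples_index.equals(columns))
print(list(sub.columns))
```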
Rename columns#
Rename a single column#
[8]:
# Build a DataFrame
df = get_random_df()
df
[8]:
| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-01 | 0.994847 | -0.628494 | 0.199592 | -0.102407 |
| 2020-01-02 | 0.290140 | 0.351781 | -0.300229 | -0.962547 |
| 2020-01-03 | 0.291478 | 0.539801 | 1.129685 | -0.380217 |
| 2020-01-04 | -1.898179 | 0.699758 | 1.908884 | -0.194987 |
| 2020-01-05 | -0.234319 | -1.815653 | 0.055340 | -1.425894 |
| 2020-01-06 | -0.726428 | -1.344369 | 0.799225 | 0.853736 |
[9]:
df.rename(columns={'A': 'AA'}, inplace=True)
df
[9]:
| | AA | B | C | D |
|---|---|---|---|---|
| 2020-01-01 | 0.994847 | -0.628494 | 0.199592 | -0.102407 |
| 2020-01-02 | 0.290140 | 0.351781 | -0.300229 | -0.962547 |
| 2020-01-03 | 0.291478 | 0.539801 | 1.129685 | -0.380217 |
| 2020-01-04 | -1.898179 | 0.699758 | 1.908884 | -0.194987 |
| 2020-01-05 | -0.234319 | -1.815653 | 0.055340 | -1.425894 |
| 2020-01-06 | -0.726428 | -1.344369 | 0.799225 | 0.853736 |
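One pitfall with dictionary-based renaming: a key that matches no column is skipped without any warning, so a typo in the old name goes unnoticed. A minimal sketch, assuming a pandas version that supports the `errors` parameter of `rename` (the frame and the label "Z" are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# A label that does not exist is silently skipped by default.
unchanged = df.rename(columns={"Z": "ZZ"})

# errors="raise" turns the same call into a KeyError.
try:
    df.rename(columns={"Z": "ZZ"}, errors="raise")
    raised = False
except KeyError:
    raised = True

print(list(unchanged.columns), raised)
```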
Rename all columns#
Source: https://stackoverflow.com/questions/11346283/renaming-columns-in-pandas
Pandas 0.21+ Answer
There have been some significant updates to column renaming in version 0.21.
The rename method has added the axis parameter which may be set to columns or 1. This update makes this method match the rest of the pandas API. It still has the index and columns parameters but you are no longer forced to use them.
The set_axis method with the inplace set to False enables you to rename all the index or column labels with a list.
Examples for Pandas 0.21+
[10]:
# Build a DataFrame
df = get_random_df()
df
[10]:
| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-01 | -1.367012 | 2.318288 | -0.325283 | 0.055034 |
| 2020-01-02 | -0.049454 | 0.008833 | -2.938136 | -0.400900 |
| 2020-01-03 | 1.299508 | -0.088676 | 0.829149 | 0.212462 |
| 2020-01-04 | -1.018557 | -0.186619 | 0.493114 | -0.328296 |
| 2020-01-05 | -0.067945 | 0.546336 | -0.799995 | -0.722019 |
| 2020-01-06 | -0.958219 | 1.424339 | 1.399772 | -0.041341 |
Method 1: use rename with axis='columns' or axis=1#
[11]:
df.rename({'A':'a', 'B':'b', 'C':'c', 'D':'d'}, axis='columns')
[11]:
| | a | b | c | d |
|---|---|---|---|---|
| 2020-01-01 | -1.367012 | 2.318288 | -0.325283 | 0.055034 |
| 2020-01-02 | -0.049454 | 0.008833 | -2.938136 | -0.400900 |
| 2020-01-03 | 1.299508 | -0.088676 | 0.829149 | 0.212462 |
| 2020-01-04 | -1.018557 | -0.186619 | 0.493114 | -0.328296 |
| 2020-01-05 | -0.067945 | 0.546336 | -0.799995 | -0.722019 |
| 2020-01-06 | -0.958219 | 1.424339 | 1.399772 | -0.041341 |
[12]:
# Same result as the previous statement
df.rename({'A':'a', 'B':'b', 'C':'c', 'D':'d'}, axis=1)
[12]:
| | a | b | c | d |
|---|---|---|---|---|
| 2020-01-01 | -1.367012 | 2.318288 | -0.325283 | 0.055034 |
| 2020-01-02 | -0.049454 | 0.008833 | -2.938136 | -0.400900 |
| 2020-01-03 | 1.299508 | -0.088676 | 0.829149 | 0.212462 |
| 2020-01-04 | -1.018557 | -0.186619 | 0.493114 | -0.328296 |
| 2020-01-05 | -0.067945 | 0.546336 | -0.799995 | -0.722019 |
| 2020-01-06 | -0.958219 | 1.424339 | 1.399772 | -0.041341 |
[13]:
# The older approach; same result
df.rename(columns={'A':'a', 'B':'b', 'C':'c', 'D':'d'})
[13]:
| | a | b | c | d |
|---|---|---|---|---|
| 2020-01-01 | -1.367012 | 2.318288 | -0.325283 | 0.055034 |
| 2020-01-02 | -0.049454 | 0.008833 | -2.938136 | -0.400900 |
| 2020-01-03 | 1.299508 | -0.088676 | 0.829149 | 0.212462 |
| 2020-01-04 | -1.018557 | -0.186619 | 0.493114 | -0.328296 |
| 2020-01-05 | -0.067945 | 0.546336 | -0.799995 | -0.722019 |
| 2020-01-06 | -0.958219 | 1.424339 | 1.399772 | -0.041341 |
[14]:
# rename also accepts a function, which is applied to every column name.
df = get_random_df()
df.rename(lambda x: x.lower(), axis='columns')
[14]:
| | a | b | c | d |
|---|---|---|---|---|
| 2020-01-01 | -0.389596 | -0.643607 | -1.245727 | -0.079882 |
| 2020-01-02 | 0.389280 | -0.453595 | -0.290197 | -0.413748 |
| 2020-01-03 | -3.399940 | 1.016301 | -0.574126 | 1.070502 |
| 2020-01-04 | -0.069537 | -0.543435 | 0.125908 | 0.344096 |
| 2020-01-05 | 0.022832 | -0.032724 | 0.792888 | 1.206140 |
| 2020-01-06 | 0.599666 | 1.700370 | -1.169337 | -0.899192 |
[15]:
df = get_random_df()
df.rename(lambda x: x.lower(), axis=1)
[15]:
| | a | b | c | d |
|---|---|---|---|---|
| 2020-01-01 | -0.548590 | 0.479504 | -0.638681 | -0.729953 |
| 2020-01-02 | -0.453389 | -0.091043 | -1.277966 | 1.226509 |
| 2020-01-03 | -1.192221 | 0.756808 | 1.386562 | -2.212451 |
| 2020-01-04 | 2.477589 | -0.605837 | 1.069088 | 0.002167 |
| 2020-01-05 | 0.446367 | -0.644324 | -0.905393 | 0.130968 |
| 2020-01-06 | -0.313847 | 1.163701 | -0.240772 | 1.162398 |
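Because `rename` accepts any callable, a built-in string method can be passed directly instead of wrapping it in a lambda, and a lambda is still handy when several steps are combined. The column names below are invented for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"First Name": ["Ann"], "Last Name": ["Lee"]})

# Any callable works; built-in string methods can be passed directly.
lowered = df.rename(columns=str.lower)

# A lambda can combine several steps, here lower-casing and snake-casing.
snake = df.rename(columns=lambda name: name.lower().replace(" ", "_"))

print(list(lowered.columns))
print(list(snake.columns))
```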
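Method 2: use set_axis with a list of column names#
The list must have the same length as the number of columns (or index labels). The original note was written against pandas 0.24.2, where inplace effectively defaulted to True, so the call had to pass inplace=False to get a new DataFrame back. The inplace parameter was later deprecated and then removed in pandas 2.0, so set_axis now always returns a new object; the cells below pass copy=False instead.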
[16]:
df.set_axis(['a', 'b', 'c', 'd'], axis='columns', copy=False)
[16]:
| | a | b | c | d |
|---|---|---|---|---|
| 2020-01-01 | -0.548590 | 0.479504 | -0.638681 | -0.729953 |
| 2020-01-02 | -0.453389 | -0.091043 | -1.277966 | 1.226509 |
| 2020-01-03 | -1.192221 | 0.756808 | 1.386562 | -2.212451 |
| 2020-01-04 | 2.477589 | -0.605837 | 1.069088 | 0.002167 |
| 2020-01-05 | 0.446367 | -0.644324 | -0.905393 | 0.130968 |
| 2020-01-06 | -0.313847 | 1.163701 | -0.240772 | 1.162398 |
[17]:
df.set_axis(['a', 'b', 'c', 'd'], axis=1, copy=False)
[17]:
| | a | b | c | d |
|---|---|---|---|---|
| 2020-01-01 | -0.548590 | 0.479504 | -0.638681 | -0.729953 |
| 2020-01-02 | -0.453389 | -0.091043 | -1.277966 | 1.226509 |
| 2020-01-03 | -1.192221 | 0.756808 | 1.386562 | -2.212451 |
| 2020-01-04 | 2.477589 | -0.605837 | 1.069088 | 0.002167 |
| 2020-01-05 | 0.446367 | -0.644324 | -0.905393 | 0.130968 |
| 2020-01-06 | -0.313847 | 1.163701 | -0.240772 | 1.162398 |
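The length requirement is enforced at call time. The sketch below assumes pandas 1.0 or newer, where `set_axis` returns a new object by default, and uses a small made-up frame.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Returns a new DataFrame; df keeps its original labels.
renamed = df.set_axis(["a", "b"], axis="columns")

# Three labels for two columns: pandas rejects the call.
try:
    df.set_axis(["a", "b", "c"], axis="columns")
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(list(renamed.columns), list(df.columns))
print(type(mismatch_error).__name__)
```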
Method 3: assign to the columns attribute#
[18]:
df.columns = ['a', 'b', 'c', 'd']
df
[18]:
| | a | b | c | d |
|---|---|---|---|---|
| 2020-01-01 | -0.548590 | 0.479504 | -0.638681 | -0.729953 |
| 2020-01-02 | -0.453389 | -0.091043 | -1.277966 | 1.226509 |
| 2020-01-03 | -1.192221 | 0.756808 | 1.386562 | -2.212451 |
| 2020-01-04 | 2.477589 | -0.605837 | 1.069088 | 0.002167 |
| 2020-01-05 | 0.446367 | -0.644324 | -0.905393 | 0.130968 |
| 2020-01-06 | -0.313847 | 1.163701 | -0.240772 | 1.162398 |
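Unlike `rename` and `set_axis`, assigning to `columns` changes the existing object, so every other reference to the same DataFrame sees the new labels. A small sketch (the name `same_object` is just an alias created for the demonstration); it also shows that the new labels can be computed from the old ones with the `.str` accessor:

```python
import pandas as pd

df = pd.DataFrame({"A": [1], "B": [2]})
same_object = df  # a second reference to the same DataFrame

# Derive the new labels from the old ones and assign them in place.
df.columns = df.columns.str.lower()

print(list(same_object.columns))
```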
Why not use df.columns = ['a', 'b', 'c', 'd', 'e']?
There is nothing wrong with assigning columns directly like this. It is a perfectly good solution.
The advantage of using set_axis is that it can be used as part of a method chain and that it returns a new copy of the DataFrame. Without it, you would have to store your intermediate steps of the chain to another variable before reassigning the columns.
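# new for pandas 0.21+ (some_method1/2/3 are placeholders, not real pandas methods)
(df.some_method1()
   .some_method2()
   .set_axis(columns, axis=1)
   .some_method3())

# old way: the chain has to be interrupted to reassign the columns
df1 = (df.some_method1()
         .some_method2())
df1.columns = columns
df1.some_method3()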
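For a version of the chained form that actually runs, the placeholder methods can be swapped for real ones. The choice of `sort_values`, `reset_index` and `assign` below is an assumption made purely for illustration, as is the data; pandas 1.0 or newer is assumed so that `set_axis` returns a new object.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 1, 2], "B": [30, 10, 20]})

result = (
    df.sort_values("A")                        # stands in for some_method1
      .reset_index(drop=True)                  # stands in for some_method2
      .set_axis(["a", "b"], axis="columns")    # rename mid-chain
      .assign(total=lambda d: d["a"] + d["b"]) # later steps see the new names
)

print(result)
```

The step after `set_axis` can already refer to the new column names, which is what the direct `columns` assignment cannot offer without breaking the chain.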