Pandas Study Notes: Indexing#

Version: 0.11
Created: April 15, 2024
Last modified: May 21, 2024
Data sources:
movies.csv http://boxofficemojo.com/daily/
iris.csv https://github.com/dsaber/py-viz-blog
titanic.csv https://github.com/dsaber/py-viz-blog
ts.csv https://github.com/dsaber/py-viz-blog
tips.csv https://github.com/pandas-dev/pandas/blob/master/doc/data/tips.csv

Setup#

[1]:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Helper: build a 6x4 DataFrame of random values with a daily DatetimeIndex
def get_random_df():
    return pd.DataFrame(
        np.random.randn(6, 4),
        index=pd.date_range('20200101', periods=6),
        columns=list('ABCD'))

Making the row index start at 1#

[2]:

df = get_random_df()
df.index = range(1,len(df) + 1)
df.head()

[2]:

	A	B	C	D
1	0.118502	-1.808740	0.997507	0.392562
2	-0.457779	0.160701	-0.454315	-0.794106
3	-0.064728	0.107220	-0.173583	-0.855021
4	-0.114218	0.033859	-1.801286	-0.278239
5	0.898737	-0.317048	0.381936	-1.160173
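
A small sketch of what the 1-based index changes in practice, assuming the same 6x4 shape as get_random_df. The fixed seed and the variable names are only for illustration. Label-based lookups with .loc now use 1..6, while position-based lookups with .iloc still start at 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, illustration only
df = pd.DataFrame(rng.standard_normal((6, 4)), columns=list("ABCD"))

# Replace the default 0-based RangeIndex with labels 1..len(df)
df.index = pd.RangeIndex(start=1, stop=len(df) + 1)

first_by_label = df.loc[1]      # label 1 is now the first row
first_by_position = df.iloc[0]  # positions are unaffected by the relabeling
```

Both lookups return the same row, which is an easy way to check that the relabeling did what was intended.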

Setting a column as the index#

By column name#

[3]:

# Build the DataFrame
df = get_random_df();df

[3]:

	A	B	C	D
2020-01-01	1.111450	0.182791	-0.937013	0.956929
2020-01-02	-0.304885	0.697800	-0.900396	-1.753596
2020-01-03	-0.848140	0.318127	-2.012993	-1.005838
2020-01-04	0.597240	1.817809	0.786580	0.232008
2020-01-05	0.655043	0.762338	-0.698655	1.151653
2020-01-06	0.184392	-1.333492	0.349873	0.151685

[4]:

df.set_index(['A'])

[4]:

	B	C	D
A
1.111450	0.182791	-0.937013	0.956929
-0.304885	0.697800	-0.900396	-1.753596
-0.848140	0.318127	-2.012993	-1.005838
0.597240	1.817809	0.786580	0.232008
0.655043	0.762338	-0.698655	1.151653
0.184392	-1.333492	0.349873	0.151685
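
set_index returns a new DataFrame and leaves the original alone, which is why the cell above displays a result without changing df. The sketch below uses a small hand-made frame (not the notebook's data) to show that, plus two related options: drop=False to keep the column, and reset_index to undo the change.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [1.5, 2.5, 3.5]})

indexed = df.set_index("A")           # new DataFrame; df is unchanged
kept = df.set_index("A", drop=False)  # "A" is the index and still a column
restored = indexed.reset_index()      # move the index back into a column
```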

By column position#

[5]:

# Use the third column (position 2) as the index
df = get_random_df()
df.set_index(df.columns[2])

[5]:

	A	B	D
C
-1.351150	1.378955	0.152973	0.919439
0.027135	-0.470867	0.009272	1.162112
1.812734	-0.528501	0.248176	-1.263733
0.891118	0.873387	-1.498628	-1.941507
0.531397	1.573426	-1.207346	0.016876
1.327886	-0.035997	-1.003411	1.246743
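
The same idea extends to several positions at once. This is a minimal sketch on a made-up frame: df.columns[[0, 2]] picks the labels at positions 0 and 2, and wrapping them in a list makes set_index treat them as column labels, which gives a two-level index.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6], "D": [7, 8]})

# Positions are 0-based, so the third column is df.columns[2]
by_third = df.set_index(df.columns[2])

# Several positions: convert the selected labels to a plain list first
by_first_and_third = df.set_index(list(df.columns[[0, 2]]))
```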

Building a MultiIndex by hand#

[6]:

m_index = pd.MultiIndex.from_arrays(
    [['level-one']*2, ['level-two-one', 'level-two-two']])
m_index

[6]:

MultiIndex([('level-one', 'level-two-one'),
            ('level-one', 'level-two-two')],
           )

[7]:

m_index = pd.MultiIndex.from_product(
    [['level-one'], ['level-two-one', 'level-two-two']])
m_index

[7]:

MultiIndex([('level-one', 'level-two-one'),
            ('level-one', 'level-two-two')],
           )
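
A third constructor, from_tuples, builds the same kind of index from explicit pairs. The sketch below also attaches it to a DataFrame as column labels to show how selection works. The level names "outer" and "inner" and the integer data are made up for illustration.

```python
import numpy as np
import pandas as pd

m_index = pd.MultiIndex.from_tuples(
    [("level-one", "level-two-one"), ("level-one", "level-two-two")],
    names=["outer", "inner"],  # hypothetical level names
)

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=m_index)

# Selecting the outer label returns the columns beneath it
sub = df["level-one"]

# A full tuple selects a single column as a Series
col = df[("level-one", "level-two-two")]
```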

Renaming columns#

Renaming a single column#

[8]:

# Build the DataFrame
df = get_random_df();df

[8]:

	A	B	C	D
2020-01-01	0.994847	-0.628494	0.199592	-0.102407
2020-01-02	0.290140	0.351781	-0.300229	-0.962547
2020-01-03	0.291478	0.539801	1.129685	-0.380217
2020-01-04	-1.898179	0.699758	1.908884	-0.194987
2020-01-05	-0.234319	-1.815653	0.055340	-1.425894
2020-01-06	-0.726428	-1.344369	0.799225	0.853736

[9]:

df.rename(columns={'A':'AA'}, inplace=True);df

[9]:

	AA	B	C	D
2020-01-01	0.994847	-0.628494	0.199592	-0.102407
2020-01-02	0.290140	0.351781	-0.300229	-0.962547
2020-01-03	0.291478	0.539801	1.129685	-0.380217
2020-01-04	-1.898179	0.699758	1.908884	-0.194987
2020-01-05	-0.234319	-1.815653	0.055340	-1.425894
2020-01-06	-0.726428	-1.344369	0.799225	0.853736
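
Without inplace=True, rename returns a new DataFrame instead of modifying df. One thing worth knowing is that a key missing from the columns is skipped silently by default. The sketch below, on a made-up frame, shows both behaviors and the errors="raise" option (available in reasonably recent pandas versions) that reports the missing label instead.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

renamed = df.rename(columns={"A": "AA"})  # df itself keeps "A"

# A label that is not present is ignored by default
unchanged = df.rename(columns={"Z": "ZZ"})

# errors="raise" turns the silent no-op into a KeyError
try:
    df.rename(columns={"Z": "ZZ"}, errors="raise")
    raised = False
except KeyError:
    raised = True
```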

Renaming all columns#

Source: https://stackoverflow.com/questions/11346283/renaming-columns-in-pandas

Pandas 0.21+ Answer

There have been some significant updates to column renaming in version 0.21.

The rename method gained an axis parameter, which may be set to 'columns' or 1. This update makes the method match the rest of the pandas API. It still has the index and columns parameters, but you are no longer forced to use them.

The set_axis method, with inplace set to False, enables you to rename all the index or column labels with a list.

Examples for Pandas 0.21+

[10]:

# 构建 DataFrame
df = get_random_df();df

[10]:

	A	B	C	D
2020-01-01	-1.367012	2.318288	-0.325283	0.055034
2020-01-02	-0.049454	0.008833	-2.938136	-0.400900
2020-01-03	1.299508	-0.088676	0.829149	0.212462
2020-01-04	-1.018557	-0.186619	0.493114	-0.328296
2020-01-05	-0.067945	0.546336	-0.799995	-0.722019
2020-01-06	-0.958219	1.424339	1.399772	-0.041341

Method 1: use rename with axis='columns' or axis=1#

[11]:

df.rename({'A':'a', 'B':'b', 'C':'c', 'D':'d'}, axis='columns')

[11]:

	a	b	c	d
2020-01-01	-1.367012	2.318288	-0.325283	0.055034
2020-01-02	-0.049454	0.008833	-2.938136	-0.400900
2020-01-03	1.299508	-0.088676	0.829149	0.212462
2020-01-04	-1.018557	-0.186619	0.493114	-0.328296
2020-01-05	-0.067945	0.546336	-0.799995	-0.722019
2020-01-06	-0.958219	1.424339	1.399772	-0.041341

[12]:

# Same result as the previous cell
df.rename({'A':'a', 'B':'b', 'C':'c', 'D':'d'}, axis=1)

[12]:

	a	b	c	d
2020-01-01	-1.367012	2.318288	-0.325283	0.055034
2020-01-02	-0.049454	0.008833	-2.938136	-0.400900
2020-01-03	1.299508	-0.088676	0.829149	0.212462
2020-01-04	-1.018557	-0.186619	0.493114	-0.328296
2020-01-05	-0.067945	0.546336	-0.799995	-0.722019
2020-01-06	-0.958219	1.424339	1.399772	-0.041341

[13]:

# The older columns= keyword gives the same result
df.rename(columns={'A':'a', 'B':'b', 'C':'c', 'D':'d'})

[13]:

	a	b	c	d
2020-01-01	-1.367012	2.318288	-0.325283	0.055034
2020-01-02	-0.049454	0.008833	-2.938136	-0.400900
2020-01-03	1.299508	-0.088676	0.829149	0.212462
2020-01-04	-1.018557	-0.186619	0.493114	-0.328296
2020-01-05	-0.067945	0.546336	-0.799995	-0.722019
2020-01-06	-0.958219	1.424339	1.399772	-0.041341

[14]:

# rename also accepts a function, which is applied to every column name.
df = get_random_df()
df.rename(lambda x: x.lower(), axis='columns')

[14]:

	a	b	c	d
2020-01-01	-0.389596	-0.643607	-1.245727	-0.079882
2020-01-02	0.389280	-0.453595	-0.290197	-0.413748
2020-01-03	-3.399940	1.016301	-0.574126	1.070502
2020-01-04	-0.069537	-0.543435	0.125908	0.344096
2020-01-05	0.022832	-0.032724	0.792888	1.206140
2020-01-06	0.599666	1.700370	-1.169337	-0.899192

[15]:

df = get_random_df()
df.rename(lambda x: x.lower(), axis=1)

[15]:

	a	b	c	d
2020-01-01	-0.548590	0.479504	-0.638681	-0.729953
2020-01-02	-0.453389	-0.091043	-1.277966	1.226509
2020-01-03	-1.192221	0.756808	1.386562	-2.212451
2020-01-04	2.477589	-0.605837	1.069088	0.002167
2020-01-05	0.446367	-0.644324	-0.905393	0.130968
2020-01-06	-0.313847	1.163701	-0.240772	1.162398
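
Any callable that maps one label to another works here, so a built-in string method can stand in for the lambda. A brief sketch on a made-up frame; the "_raw" suffix is only an example.

```python
import pandas as pd

df = pd.DataFrame({"A": [1], "B": [2], "C": [3]})

# str.lower is called once per column name, same as lambda x: x.lower()
lowered = df.rename(columns=str.lower)

# A lambda is still handy when the new name needs extra text
suffixed = df.rename(columns=lambda name: f"{name}_raw")
```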

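Method 2: use set_axis with a list of column names (inplace=False on older versions)#

The list must have the same length as the number of columns (or index labels). In the version current when this note was first written (0.24.2), inplace defaulted to True, and a later change to False was expected. Newer pandas releases return a new object by default and no longer take inplace, so the cells below pass copy=False instead.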

[16]:

df.set_axis(['a', 'b', 'c', 'd'], axis='columns', copy=False)

[16]:

	a	b	c	d
2020-01-01	-0.548590	0.479504	-0.638681	-0.729953
2020-01-02	-0.453389	-0.091043	-1.277966	1.226509
2020-01-03	-1.192221	0.756808	1.386562	-2.212451
2020-01-04	2.477589	-0.605837	1.069088	0.002167
2020-01-05	0.446367	-0.644324	-0.905393	0.130968
2020-01-06	-0.313847	1.163701	-0.240772	1.162398

[17]:

df.set_axis(['a', 'b', 'c', 'd'], axis=1, copy=False)

[17]:

	a	b	c	d
2020-01-01	-0.548590	0.479504	-0.638681	-0.729953
2020-01-02	-0.453389	-0.091043	-1.277966	1.226509
2020-01-03	-1.192221	0.756808	1.386562	-2.212451
2020-01-04	2.477589	-0.605837	1.069088	0.002167
2020-01-05	0.446367	-0.644324	-0.905393	0.130968
2020-01-06	-0.313847	1.163701	-0.240772	1.162398
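
A short sketch of the two properties noted above, assuming pandas 1.0 or later (where set_axis returns a new object by default): the original frame keeps its labels, and a list of the wrong length is rejected. The frame and names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Returns a relabeled DataFrame; df keeps its original column names
relabeled = df.set_axis(["a", "b"], axis="columns")

try:
    df.set_axis(["a", "b", "c"], axis="columns")  # three labels, two columns
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc  # pandas reports the length mismatch
```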

Method 3: assign to the columns attribute#

[18]:

df.columns = ['a', 'b', 'c', 'd']
df

[18]:

	a	b	c	d
2020-01-01	-0.548590	0.479504	-0.638681	-0.729953
2020-01-02	-0.453389	-0.091043	-1.277966	1.226509
2020-01-03	-1.192221	0.756808	1.386562	-2.212451
2020-01-04	2.477589	-0.605837	1.069088	0.002167
2020-01-05	0.446367	-0.644324	-0.905393	0.130968
2020-01-06	-0.313847	1.163701	-0.240772	1.162398

Why not use df.columns = ['a', 'b', 'c', 'd']?

There is nothing wrong with assigning columns directly like this. It is a perfectly good solution.

The advantage of using set_axis is that it can be used as part of a method chain and that it returns a new copy of the DataFrame. Without it, you would have to store your intermediate steps of the chain to another variable before reassigning the columns.

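# new for pandas 0.21+ (some_method1..3 and columns are placeholders)
(df.some_method1()
   .some_method2()
   .set_axis(columns, axis=1)
   .some_method3())

# old way: the chain has to stop so the columns can be reassigned
df1 = (df.some_method1()
         .some_method2())
df1.columns = columns
df1.some_method3()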
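
To make the placeholder chain concrete, here is one possible version with real methods standing in for some_method1..3. The choice of sort_values, reset_index and assign, and the names "key", "value" and "double", are assumptions for illustration only. Wrapping the chain in parentheses is what lets each call start on its own line.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 1, 2], "B": [30.0, 10.0, 20.0]})

result = (
    df.sort_values("A")
      .reset_index(drop=True)
      .set_axis(["key", "value"], axis="columns")      # rename mid-chain
      .assign(double=lambda d: d["value"] * 2)         # uses the new names
)
```

The assign step can already refer to "value" because set_axis handed it a relabeled frame, and df is never modified along the way.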


Bean