Pandas 学习笔记#

版本号： 0.4
创建时间： 2015年02月07日星期六 10:26:33 CST
修改时间： 2024年04月16日
数据来源：
movies.csv http://boxofficemojo.com/daily/
iris.csv https://github.com/dsaber/py-viz-blog
titanic.csv https://github.com/dsaber/py-viz-blog
ts.csv https://github.com/dsaber/py-viz-blog

一些准备工作#

[163]:

import numpy as np
import pandas as pd
import seaborn as sns
from datetime import datetime
from pandas import Series, DataFrame
from pathlib import Path

# 辅助函数

def get_titanic_df():
    return pd.read_csv('datas/titanic.csv')

def get_iris_df():
    return pd.read_csv('datas/iris.csv')

def get_tips_df():
    return pd.read_csv('datas/tips.csv')

def get_random_df():
    return pd.DataFrame(
        np.random.randn(6, 4),
        index=pd.date_range('20200101', periods=6),
        columns=list('ABCD'))

安装使用#

2020年1月更新:

jupyter 已经进化出 jupyter-lab 了。

2018年更新:

ipython notebook 已更名为 jupyter notebook 。安装使用方法如下：

安装 anaconda ，anaconda 是一个 Python 发行版。装好后就已经包含 pandas 和 jupyter notebook 。
运行 jupyter notebook 命令就可以使用 jupyter notebook 。
要在 notebook 即时显示图表，可以使用在 notebook 中使用 %matplotlib inline 命令。

下面是以前的安装使用方法：

安装 pandas

sudo apt-get install build-essential python-dev
sudo apt-get install python-pandas python-tk
sudo apt-get install python-scipy python-matplotlib python-tables
sudo apt-get install python-numexpr python-xlrd python-statsmodels
sudo apt-get install python-openpyxl python-xlwt python-bs4

if use virtualenv before install matplotlib should install libpng-dev, libjpeg8-dev, libfreetype6-dev

安装 ipython-notebook

sudo pip install "ipython[notebook]"
sudo pip install pygments

使用ipython notebook运行 ipython-notebook 。如果使用matplotlib内嵌进网页中,那么需要运行:ipython notebook --matplotlib inline ；或者在已经打开的 notebook 中运行 %matplotlib inline 命令。

[164]:

# 查看 pandas 的版本
# pd.__version__
# 查看当前环境涉及软件的版本
# pd.show_versions()

DataFrame 速览#

创建#

[165]:

data = {
    'fruit': ['Apple', 'Apple', 'Apple', 'Grape', 'Grape', 'Grape'],
    'year': [2017, 2018, 2019, 2017, 2018, 2019],
    'price': [20.5, 21.3, 25, 10.1, 10.9, 9.98]
}
frame = DataFrame(data)
frame

[165]:

	fruit	year	price
0	Apple	2017	20.50
1	Apple	2018	21.30
2	Apple	2019	25.00
3	Grape	2017	10.10
4	Grape	2018	10.90
5	Grape	2019	9.98

生成时指定 columns 和 index ，index 的长度要与行数相匹配

[166]:

frame2 = DataFrame(
    data,
    columns = ['year', 'month', 'fruit', 'price'],
    index = ['one', 'two', 'three', 'four', 'five', 'six']
)
frame2

[166]:

	year	month	fruit	price
one	2017	NaN	Apple	20.50
two	2018	NaN	Apple	21.30
three	2019	NaN	Apple	25.00
four	2017	NaN	Grape	10.10
five	2018	NaN	Grape	10.90
six	2019	NaN	Grape	9.98

查看数据#

通过类似字典标记的方式或属性的方式,可以将 DataFrame 的列获取为一个 Series 。 IPython 提供了类似属性的访问(即frame2.year)和tab补全。 frame2[column] 适用于任何列的名,但是 frame2.column 只有在列名是一个合理的 Python 变量名时才适用。注意,返回的 Series 拥有原 DataFrame 相同的索引,且其 name 属性也已经被相应地设置好了。

[167]:

frame2['fruit']

[167]:

one      Apple
two      Apple
three    Apple
four     Grape
five     Grape
six      Grape
Name: fruit, dtype: object

[168]:

frame2.year

[168]:

one      2017
two      2018
three    2019
four     2017
five     2018
six      2019
Name: year, dtype: int64

[169]:

frame2.loc['two']

[169]:

year      2018
month      NaN
fruit    Apple
price     21.3
Name: two, dtype: object

修改数据#

[170]:

frame2.month = 11
frame2

[170]:

	year	month	fruit	price
one	2017	11	Apple	20.50
two	2018	11	Apple	21.30
three	2019	11	Apple	25.00
four	2017	11	Grape	10.10
five	2018	11	Grape	10.90
six	2019	11	Grape	9.98

[171]:

frame2.month = np.arange(6)
frame2

[171]:

	year	month	fruit	price
one	2017	0	Apple	20.50
two	2018	1	Apple	21.30
three	2019	2	Apple	25.00
four	2017	3	Grape	10.10
five	2018	4	Grape	10.90
six	2019	5	Grape	9.98

DataFrame 的切片是 DataFrame 的视图，不是副本。对切片赋值会改变 DataFrame 本身。

[172]:

c_month = frame2['month']
c_month

[172]:

one      0
two      1
three    2
four     3
five     4
six      5
Name: month, dtype: int64

[173]:

month = Series([1,3,5], index=['one', 'three', 'five'])
frame2['month'] = month
frame2

[173]:

	year	month	fruit	price
one	2017	1.0	Apple	20.50
two	2018	NaN	Apple	21.30
three	2019	3.0	Apple	25.00
four	2017	NaN	Grape	10.10
five	2018	5.0	Grape	10.90
six	2019	NaN	Grape	9.98

[174]:

# c_month['two'] = 2.2
# 这种方式也能改写值，但是因为性能的问题不推荐使用，详见：
# https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
frame2.loc['two', 'month'] = 2.2
frame2

[174]:

	year	month	fruit	price
one	2017	1.0	Apple	20.50
two	2018	2.2	Apple	21.30
three	2019	3.0	Apple	25.00
four	2017	NaN	Grape	10.10
five	2018	5.0	Grape	10.90
six	2019	NaN	Grape	9.98

删除数据#

[175]:

del frame2['month'] #  删除 column
frame2

[175]:

	year	fruit	price
one	2017	Apple	20.50
two	2018	Apple	21.30
three	2019	Apple	25.00
four	2017	Grape	10.10
five	2018	Grape	10.90
six	2019	Grape	9.98

pandas 的索引对象#

[176]:

nums = Series(range(3), index=['one', 'two', 'three'])

[177]:

num_index = nums.index

[178]:

num_index

[178]:

Index(['one', 'two', 'three'], dtype='object')

[179]:

num_index[1:]

[179]:

Index(['two', 'three'], dtype='object')

索引对象是不可变的，不能对其赋值

[180]:

num_index2 = pd.Index(['one', 'two', 'three'])

[181]:

num_index is num_index2

[181]:

False

[182]:

num_index == num_index2

[182]:

array([ True,  True,  True])

[183]:

nums2= Series(range(3), index = num_index2)

[184]:

nums2

[184]:

one      0
two      1
three    2
dtype: int64

[185]:

nums2.index is num_index2

[185]:

True

[186]:

nums2.index == num_index2

[186]:

array([ True,  True,  True])

[187]:

frame3 = get_random_df()

[188]:

frame3.columns

[188]:

Index(['A', 'B', 'C', 'D'], dtype='object')

[189]:

frame3.index

[189]:

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06'],
              dtype='datetime64[ns]', freq='D')

reindex#

Series reindex#

[190]:

obj = Series(range(4), index=['a', 'c', 'd', 'b'])
obj

[190]:

a    0
c    1
d    2
b    3
dtype: int64

[191]:

obj2 = obj.reindex(['a', 'b', 'c', 'd'])

[192]:

obj2

[192]:

a    0
b    3
c    1
d    2
dtype: int64

自动填充#

[193]:

obj3 = Series(['red', 'blue', 'yellow'], index=[1,3,5])

[194]:

obj3

[194]:

1       red
3      blue
5    yellow
dtype: object

[195]:

obj3.reindex(range(6), method='ffill')

[195]:

0       NaN
1       red
2       red
3      blue
4      blue
5    yellow
dtype: object

Frame reindex#

[196]:

range(5)

[196]:

range(0, 5)

[197]:

df = pd.DataFrame(
        np.random.randn(6, 4),
        index=range(6),
        columns=list('ABCD'))
df

[197]:

	A	B	C	D
0	-1.870282	1.847236	1.060884	-0.962755
1	0.639969	1.395878	-1.609093	-0.000310
2	1.262852	0.930063	-0.288271	-1.647747
3	1.844337	-0.448772	1.115280	-0.033701
4	0.095214	-0.154165	-0.941797	-0.515162
5	-0.642843	1.988245	0.806943	-0.834752

[198]:

df.reindex(range(5,-1,-1), columns=['A', 'C', 'DDD'])

[198]:

	A	C	DDD
5	-0.642843	0.806943	NaN
4	0.095214	-0.941797	NaN
3	1.844337	1.115280	NaN
2	1.262852	-0.288271	NaN
1	0.639969	-1.609093	NaN
0	-1.870282	1.060884	NaN

drop#

[199]:

df.drop('D', axis='columns') #  不修改对象

[199]:

	A	B	C
0	-1.870282	1.847236	1.060884
1	0.639969	1.395878	-1.609093
2	1.262852	0.930063	-0.288271
3	1.844337	-0.448772	1.115280
4	0.095214	-0.154165	-0.941797
5	-0.642843	1.988245	0.806943

[200]:

df

[200]:

	A	B	C	D
0	-1.870282	1.847236	1.060884	-0.962755
1	0.639969	1.395878	-1.609093	-0.000310
2	1.262852	0.930063	-0.288271	-1.647747
3	1.844337	-0.448772	1.115280	-0.033701
4	0.095214	-0.154165	-0.941797	-0.515162
5	-0.642843	1.988245	0.806943	-0.834752

[201]:

df.drop('D', axis='columns', inplace=True) #  修改对象

[202]:

df

[202]:

	A	B	C
0	-1.870282	1.847236	1.060884
1	0.639969	1.395878	-1.609093
2	1.262852	0.930063	-0.288271
3	1.844337	-0.448772	1.115280
4	0.095214	-0.154165	-0.941797
5	-0.642843	1.988245	0.806943

DataFrames 创建#

[203]:

dates = pd.date_range('20180101', periods=6);dates

[203]:

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06'],
              dtype='datetime64[ns]', freq='D')

[204]:

df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'));df

[204]:

	A	B	C	D
2018-01-01	1.863653	0.053990	-1.256914	1.474340
2018-01-02	-0.164327	0.211501	0.323222	-0.002627
2018-01-03	0.254398	-0.622240	0.329015	0.245014
2018-01-04	-0.496644	0.955061	1.403011	-0.370782
2018-01-05	-0.065887	0.570067	1.524579	-0.854202
2018-01-06	1.132634	0.529680	0.117496	0.880622

[205]:

# 可以使用字典来创建 DataFrame 。
# 如果字典的 Value 是单一值，那么会自动扩展。
# 如果字典的 Value 是列表或者 Series ，那么长度要保持一致。
# 如果字典中只有一个值有 Index ，那么会使用这个 Index 作为整个 DataFrame 的 Index 。
# 如果字典有多个 Index ，那么必须保持一致，否则会报错。
df2 = pd.DataFrame(
    { 'A' : 1.,
      'B' : pd.Timestamp('20130102'),
      'C' : pd.Series(1,index=list(range(2,6)),dtype='float32'),
      'D' : np.array([3] * 4,dtype='int32'),
      'E' : pd.Categorical(["test","train","test","train"]),
      'F' : 'foo' }
)
df2

[205]:

	A	B	C	D	E	F
2	1.0	2013-01-02	1.0	3	test	foo
3	1.0	2013-01-02	1.0	3	train	foo
4	1.0	2013-01-02	1.0	3	test	foo
5	1.0	2013-01-02	1.0	3	train	foo

[206]:

data = {
    'fruit': ['Apple', 'Apple', 'Apple', 'Grape', 'Grape', 'Grape'],
    'year': [2017, 2018, 2019, 2017, 2018, 2019],
    'price': [20.5, 21.3, 25, 10.1, 10.9, 9.98]
}
frame = DataFrame(data)

DataFrame 全局操作#

[207]:

df = get_random_df()

[208]:

df.dtypes

[208]:

A    float64
B    float64
C    float64
D    float64
dtype: object

[209]:

df.index

[209]:

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06'],
              dtype='datetime64[ns]', freq='D')

[210]:

df.columns

[210]:

Index(['A', 'B', 'C', 'D'], dtype='object')

[211]:

df.values

[211]:

array([[-0.8430307 ,  0.16365112,  0.34819952,  0.1816034 ],
       [-0.2527771 ,  0.35288587, -1.31084768,  0.96593095],
       [-0.98246094, -0.24233779, -1.70510314, -0.07605717],
       [-0.34124125,  1.70597412, -0.46188374,  0.49084553],
       [-0.17822249,  0.24639153,  0.16107507,  0.22037991],
       [ 2.14574608,  1.18792989,  0.75582046,  0.1645418 ]])

[212]:

df.describe()

[212]:

	A	B	C	D
count	6.000000	6.000000	6.000000	6.000000
mean	-0.075331	0.569082	-0.368790	0.324541
std	1.136505	0.727893	0.973802	0.362325
min	-0.982461	-0.242338	-1.705103	-0.076057
25%	-0.717583	0.184336	-1.098607	0.168807
50%	-0.297009	0.299639	-0.150404	0.200992
75%	-0.196861	0.979169	0.301418	0.423229
max	2.145746	1.705974	0.755820	0.965931

[213]:

df.T

[213]:

	2020-01-01	2020-01-02	2020-01-03	2020-01-04	2020-01-05	2020-01-06
A	-0.843031	-0.252777	-0.982461	-0.341241	-0.178222	2.145746
B	0.163651	0.352886	-0.242338	1.705974	0.246392	1.187930
C	0.348200	-1.310848	-1.705103	-0.461884	0.161075	0.755820
D	0.181603	0.965931	-0.076057	0.490846	0.220380	0.164542

读入数据#

从 CSV 文件读入数据#

[214]:

# 读入 CSV 格式数据
df_movies = pd.read_csv('datas/movies.csv', sep='\t', encoding='utf-8')
df_movies.head()

[214]:

	Row	Date	Day	Day#	Top 10 Gross	#1 Movie	Gross
0	1	Jan. 30	Fri	30	$26,168,351	American Sniper	$9,905,616
1	2	Jan. 31	Sat	31	$41,633,588	American Sniper	$16,510,536
2	3	Feb. 1	Sun	32	$12,515,579	American Sniper	$4,244,376
3	4	Feb. 2	Mon	33	$6,475,068	American Sniper	$2,645,109
4	5	Feb. 3	Tue	34	$7,825,091	American Sniper	$2,923,141

[215]:

df_movies = pd.read_csv('datas/movies.csv', sep='\t', encoding='utf-8',thousands=',',escapechar='$')
df_movies.head()

[215]:

	Row	Date	Day	Day#	Top 10 Gross	#1 Movie	Gross
0	1	Jan. 30	Fri	30	26168351	American Sniper	9905616
1	2	Jan. 31	Sat	31	41633588	American Sniper	16510536
2	3	Feb. 1	Sun	32	12515579	American Sniper	4244376
3	4	Feb. 2	Mon	33	6475068	American Sniper	2645109
4	5	Feb. 3	Tue	34	7825091	American Sniper	2923141

读取目录下所有CSV，合并为1个DataFrame，去重#

[216]:

source_path = Path('./datas')
csv_files = source_path.glob('*.csv')
dfs = [pd.read_csv(item, encoding='utf8', dtype="string") for item in csv_files]
df = pd.concat(dfs).drop_duplicates()
#df.head(3)

从数据库读入数据#

参见：http://stackoverflow.com/questions/10065051/python-pandas-and-databases-like-mysql

从 Excel 文件读入数据#

原型#

pandas.read_excel(io, sheet_name=0, header=0, skiprows=None, skip_footer=0,
                  index_col=None, names=None, usecols=None, parse_dates=False,
                  date_parser=None, na_values=None, thousands=None,
                  convert_float=True, converters=None, dtype=None,
                  true_values=None, false_values=None, engine=None,
                  squeeze=False, **kwds)

Read an Excel table into a pandas DataFrame

示例：

# 读入 Excel 文件，所有字段都是 str
df = pd.read_excel(source_file, dtype="string")

选择数据#

行，列 –> df[]
区域 –> df.loc[], df.iloc[], df.ix[]
单元格 –> df.at[], df.iat[]

选择单个数据#

[217]:

df=get_random_df();df

[217]:

	A	B	C	D
2020-01-01	-0.127414	0.402208	-1.621377	-0.706600
2020-01-02	0.427348	-0.576915	-0.719638	-1.058799
2020-01-03	-0.466531	1.123756	-0.884415	2.203494
2020-01-04	-0.249747	-0.367119	-1.423965	0.971523
2020-01-05	-0.879580	1.904772	0.470134	1.533346
2020-01-06	3.046705	-1.288350	-1.219871	-0.800774

[218]:

df.at['2020-01-01', 'B']

[218]:

0.40220819245805306

[219]:

df.iat[1,1]

[219]:

-0.5769150159100832

选择某个区域#

[220]:

#只显示指定的行和列
df.iloc[[1,3,5],[0,1,2,3]]

[220]:

	A	B	C	D
2020-01-02	0.427348	-0.576915	-0.719638	-1.058799
2020-01-04	-0.249747	-0.367119	-1.423965	0.971523
2020-01-06	3.046705	-1.288350	-1.219871	-0.800774

[221]:

df.loc[['2020-01-01','2020-01-04','2020-01-06'],['A', 'C']]

[221]:

	A	C
2020-01-01	-0.127414	-1.621377
2020-01-04	-0.249747	-1.423965
2020-01-06	3.046705	-1.219871

选择行#

[222]:

df = get_titanic_df()
df = df[['survived', 'age', 'deck', 'class']]
df[1:4] # 第2-4行

[222]:

	survived	age	deck	class
1	1	38.0	C	First
2	1	26.0	NaN	Third
3	1	35.0	C	First

[223]:

# 查看头部数据
df.head()

[223]:

	survived	age	deck	class
0	0	22.0	NaN	Third
1	1	38.0	C	First
2	1	26.0	NaN	Third
3	1	35.0	C	First
4	0	35.0	NaN	Third

[224]:

# 查看尾部数据
df.tail(2)
#head 和 tail 接受一个整数参数，缺省值为 5 。

[224]:

	survived	age	deck	class
889	1	26.0	C	First
890	0	32.0	NaN	Third

[225]:

# 选择 deck 为空值的行
df[df['deck'].isnull()].head()

[225]:

	survived	age	deck	class
0	0	22.0	NaN	Third
2	1	26.0	NaN	Third
4	0	35.0	NaN	Third
5	0	NaN	NaN	Third
7	0	2.0	NaN	Third

[226]:

df[df.deck.isna()].head()

[226]:

	survived	age	deck	class
0	0	22.0	NaN	Third
2	1	26.0	NaN	Third
4	0	35.0	NaN	Third
5	0	NaN	NaN	Third
7	0	2.0	NaN	Third

[227]:

# 选择 deck 为非空值的行
df[df.deck.notnull()].head()

[227]:

	survived	age	deck	class
1	1	38.0	C	First
3	1	35.0	C	First
6	0	54.0	E	First
10	1	4.0	G	Third
11	1	58.0	C	First

[228]:

# 根据条件过滤
df[(df['class'] == 'First') | (df.deck == 'E') ].head()

[228]:

	survived	age	deck	class
1	1	38.0	C	First
3	1	35.0	C	First
6	0	54.0	E	First
11	1	58.0	C	First
23	1	28.0	A	First

[229]:

df[ (df['class'] == 'First') & (df.deck == 'D') & (df.age < 30)].head()

[229]:

	survived	age	deck	class
97	1	23.0	D	First
102	0	21.0	D	First
136	1	19.0	D	First
393	1	23.0	D	First
627	1	21.0	D	First

选择列#

[230]:

df = get_titanic_df()
df[['survived', 'age', 'deck', 'class']].head()

[230]:

	survived	age	deck	class
0	0	22.0	NaN	Third
1	1	38.0	C	First
2	1	26.0	NaN	Third
3	1	35.0	C	First
4	0	35.0	NaN	Third

[231]:

df.iloc[:,[1, 2, 6]].head()

[231]:

	pclass	sex	fare
0	3	male	7.2500
1	1	female	71.2833
2	3	female	7.9250
3	1	female	53.1000
4	3	male	8.0500

[232]:

# 只包含指定字符的列
df.filter(regex='s', axis=1).head()

[232]:

	survived	pclass	sex	sibsp	class
0	0	3	male	1	Third
1	1	1	female	1	First
2	1	3	female	0	Third
3	1	1	female	1	First
4	0	3	male	0	Third

[233]:

# 根据列类型过滤
# 只选择字符串型的列
df.loc[:, (df.dtypes == np.dtype('O')).values].head()

[233]:

	sex	embarked	class	who	deck	embark_town	alive
0	male	S	Third	man	NaN	Southampton	no
1	female	C	First	woman	C	Cherbourg	yes
2	female	S	Third	woman	NaN	Southampton	yes
3	female	S	First	woman	C	Southampton	yes
4	male	S	Third	man	NaN	Southampton	no

操作单元#

单个单元赋值#

[234]:

df = get_random_df()
df

[234]:

	A	B	C	D
2020-01-01	1.148532	0.317324	1.159205	0.767581
2020-01-02	0.006078	-0.933768	1.729512	-0.186436
2020-01-03	0.781555	-0.465289	0.490352	1.365542
2020-01-04	0.010499	0.523784	0.204040	0.082275
2020-01-05	0.221785	-2.691217	0.151030	-1.445744
2020-01-06	-0.405333	-0.838198	-0.493799	-2.455432

[235]:

# 单元格赋值
# 单个单元格赋值
df.iloc[0, 3] = 0.131211
df.loc['2020-01-02', 'B']= -0.131211
df.head(3)

[235]:

	A	B	C	D
2020-01-01	1.148532	0.317324	1.159205	0.131211
2020-01-02	0.006078	-0.131211	1.729512	-0.186436
2020-01-03	0.781555	-0.465289	0.490352	1.365542

多个单元格赋值#

[236]:

df.loc[df.index[0:2], 'C'] = [100, 200]
df.head(3)

[236]:

	A	B	C	D
2020-01-01	1.148532	0.317324	100.000000	0.131211
2020-01-02	0.006078	-0.131211	200.000000	-0.186436
2020-01-03	0.781555	-0.465289	0.490352	1.365542

操作列#

调整列的顺序#

[237]:

# 构建 DataFrame
df = pd.DataFrame({'a':[1,2], 'b': [3,4], 'c':[5,6], 'd':[7,8], 'e':[9,10]});df

[237]:

	a	b	c	d	e
0	1	3	5	7	9
1	2	4	6	8	10

方法一#

[238]:

col_order = ['c', 'a', 'b', 'd']
df = df[col_order]
df

[238]:

	c	a	b	d
0	5	1	3	7
1	6	2	4	8

方法二#

[239]:

new_df = df.drop('c', axis=1)
new_df.insert(2, 'c', df['c'])
new_df

[239]:

	a	b	c	d
0	1	3	5	7
1	2	4	6	8

打印列类型#

[240]:

df = df_movies.copy()
df.columns.to_series().groupby(df.dtypes).groups

[240]:

{int64: ['Row', 'Day#', 'Top 10 Gross', 'Gross'], object: ['Date', 'Day', '#1 Movie']}

[241]:

# 打印列类型(清晰打印中文)
types = df.columns.to_series().groupby(df.dtypes).groups
for key, value in types.items():
    print(key,':\t', ','.join(value))

int64 :  Row,Day#,Top 10 Gross,Gross
object :         Date,Day,#1 Movie

插入列#

[242]:

df = get_random_df()
# 方式一：在末尾添加
df['memo'] = pd.Series('', index=df.index)
df.head(3)

[242]:

	A	B	C	D
2020-01-01	-0.460849	-2.111140	0.526750	-1.723461
2020-01-02	-2.075447	0.771671	0.074949	0.419958
2020-01-03	-1.044952	-0.515388	-1.244958	-0.207565

[243]:

# 方式二：在中间插入
df = get_random_df()
df.insert(loc=1, column=u'year', value=u'2015')
df.head(3)

[243]:

	A	year	B	C	D
2020-01-01	-0.909163	2015	-0.917322	-0.506002	1.088749
2020-01-02	0.237910	2015	-0.330139	-0.185761	-0.282543
2020-01-03	0.644266	2015	-0.552541	0.013399	1.092278

[244]:

# 根据现有值生成一个新的列
df = get_random_df()
df.insert(loc = 2 , column='A-B', value=df['A'] - df['B'])
df.head(3)

[244]:

	A	B	A-B	C	D
2020-01-01	0.448350	-0.290054	0.738404	-0.406046	0.308425
2020-01-02	-0.580417	-1.769612	1.189195	1.239757	-0.759034
2020-01-03	-0.043785	0.072310	-0.116096	0.264677	-0.395628

[245]:

# 根据现有值生成多个新的列
df = df_movies.copy()
def process_date_col(text):
    #根据日期生成月份和日两个新的列
    if pd.isnull(text):
        month = day = np.nan
    else:
        month, day = text.split('.')
    return pd.Series([month, day])

df[[u'month', u'day']] = df.Date.apply(process_date_col)
df.head()

[245]:

	Row	Date	Day	Day#	Top 10 Gross	#1 Movie	Gross	month	day
0	1	Jan. 30	Fri	30	26168351	American Sniper	9905616	Jan	30
1	2	Jan. 31	Sat	31	41633588	American Sniper	16510536	Jan	31
2	3	Feb. 1	Sun	32	12515579	American Sniper	4244376	Feb	1
3	4	Feb. 2	Mon	33	6475068	American Sniper	2645109	Feb	2
4	5	Feb. 3	Tue	34	7825091	American Sniper	2923141	Feb	3

[246]:

# 根据现有值生成多个新的列
df = df_movies.copy()
df.head()

def process_date_col(se):
    #根据日期生成月份和日两个新的列
    if pd.isnull(se['Date']):
        se['month'] = se['day'] = np.nan
    else:
        se['month'], se['day'] = se['Date'].split('.')
    return se
df['month'] = df['day'] = np.nan
df_new = df.apply(process_date_col, axis=1)
df_new.head()

[246]:

	Row	Date	Day	Day#	Top 10 Gross	#1 Movie	Gross	month	day
0	1	Jan. 30	Fri	30	26168351	American Sniper	9905616	Jan	30
1	2	Jan. 31	Sat	31	41633588	American Sniper	16510536	Jan	31
2	3	Feb. 1	Sun	32	12515579	American Sniper	4244376	Feb	1
3	4	Feb. 2	Mon	33	6475068	American Sniper	2645109	Feb	2
4	5	Feb. 3	Tue	34	7825091	American Sniper	2923141	Feb	3

改变列值#

[247]:

df = get_random_df()
df

[247]:

	A	B	C	D
2020-01-01	0.455907	0.440170	-0.956345	0.766485
2020-01-02	-0.400386	-0.218492	-0.737315	-0.841963
2020-01-03	-1.036559	-0.701914	1.589558	0.057975
2020-01-04	1.590878	-0.045251	-0.515069	-1.433473
2020-01-05	-0.071407	0.250663	1.714977	0.262407
2020-01-06	0.029585	-0.255554	-1.276931	-1.298129

根据一列的值改变本列的值#

[248]:

# 方法一
# 把 C 列小于 0 的数据设置为 0
df.loc[df.C < 0, 'C'] = 0
df

[248]:

	A	B	C	D
2020-01-01	0.455907	0.440170	0.000000	0.766485
2020-01-02	-0.400386	-0.218492	0.000000	-0.841963
2020-01-03	-1.036559	-0.701914	1.589558	0.057975
2020-01-04	1.590878	-0.045251	0.000000	-1.433473
2020-01-05	-0.071407	0.250663	1.714977	0.262407
2020-01-06	0.029585	-0.255554	0.000000	-1.298129

[249]:

# 根据一列的值改变本列的值
# 方法二
# 把 C 列小于 0 的数据设置为 0
df['D'] = df['D'].apply(lambda x : 0 if x<0 else x)
df

[249]:

	A	B	C	D
2020-01-01	0.455907	0.440170	0.000000	0.766485
2020-01-02	-0.400386	-0.218492	0.000000	0.000000
2020-01-03	-1.036559	-0.701914	1.589558	0.057975
2020-01-04	1.590878	-0.045251	0.000000	0.000000
2020-01-05	-0.071407	0.250663	1.714977	0.262407
2020-01-06	0.029585	-0.255554	0.000000	0.000000

同时改变多个列的值#

[250]:

# 把 A B 两列数据均乘以 100
cols = ['A', 'B']
df[cols] = df[cols].applymap(lambda x: x*100)
df

[250]:

	A	B	C	D
2020-01-01	45.590705	44.016988	0.000000	0.766485
2020-01-02	-40.038550	-21.849155	0.000000	0.000000
2020-01-03	-103.655878	-70.191427	1.589558	0.057975
2020-01-04	159.087823	-4.525098	0.000000	0.000000
2020-01-05	-7.140684	25.066323	1.714977	0.262407
2020-01-06	2.958459	-25.555402	0.000000	0.000000

按条件过滤行，批量修改字段#

[251]:

# A、B 列均小于零的，C 列改为空值
mask = (df['A'] < 0) & (df['B'] < 0)
df.loc[mask, 'C'] = np.nan
df

[251]:

	A	B	C	D
2020-01-01	45.590705	44.016988	0.000000	0.766485
2020-01-02	-40.038550	-21.849155	NaN	0.000000
2020-01-03	-103.655878	-70.191427	NaN	0.057975
2020-01-04	159.087823	-4.525098	0.000000	0.000000
2020-01-05	-7.140684	25.066323	1.714977	0.262407
2020-01-06	2.958459	-25.555402	0.000000	0.000000

使用字典生成新的列#

[252]:

dict_str = """320281 江阴
320282 宜兴
320213 梁溪"""
# 生成一个字典，key是行政区划代码，value是行政区划名称
dict_rows = [row.split() for row in dict_str.split('\n')]
xzqh_dm_mc = dict([(item[0], item[1]) for item in dict_rows])
xzqh_dm_mc
df['行政区划代码'] = ['320281', '320282', '320283',
                '320281', '320282', '320213']

df['行政区划'] = df['行政区划代码'].map(xzqh_dm_mc)
df

[252]:

	A	B	C	D	行政区划代码	行政区划
2020-01-01	45.590705	44.016988	0.000000	0.766485	320281	江阴
2020-01-02	-40.038550	-21.849155	NaN	0.000000	320282	宜兴
2020-01-03	-103.655878	-70.191427	NaN	0.057975	320283	NaN
2020-01-04	159.087823	-4.525098	0.000000	0.000000	320281	江阴
2020-01-05	-7.140684	25.066323	1.714977	0.262407	320282	宜兴
2020-01-06	2.958459	-25.555402	0.000000	0.000000	320213	梁溪

删除列#

[253]:

df = get_random_df()
df.head()

[253]:

	A	B	C	D
2020-01-01	-1.587218	0.069797	0.714401	0.381977
2020-01-02	0.671095	-1.255362	-1.335127	0.220457
2020-01-03	-0.551792	0.174548	1.120456	-0.709983
2020-01-04	-0.171354	0.851806	-0.319238	-0.310731
2020-01-05	0.254203	0.074290	-1.541995	0.765277

[254]:

df.drop(df[['D', 'C']], axis=1, inplace=True)
df.head()

[254]:

	A	B
2020-01-01	-1.587218	0.069797
2020-01-02	0.671095	-1.255362
2020-01-03	-0.551792	0.174548
2020-01-04	-0.171354	0.851806
2020-01-05	0.254203	0.074290

[255]:

del df['A']
df.head()

[255]:

	B
2020-01-01	0.069797
2020-01-02	-1.255362
2020-01-03	0.174548
2020-01-04	0.851806
2020-01-05	0.074290

[256]:

df = get_random_df()
df.drop(columns=['A', 'C'], axis=1).head()

[256]:

	B	D
2020-01-01	1.267116	1.102446
2020-01-02	-2.601673	-0.973597
2020-01-03	0.330555	0.101620
2020-01-04	-1.300182	-0.316561
2020-01-05	-0.856983	-0.837295

[257]:

# 只保留前两列，删除第三列开始的所有列
df = get_random_df()
df.drop(df.columns[2:], axis=1)

[257]:

	A	B
2020-01-01	1.054027	-1.710003
2020-01-02	-0.895784	0.699684
2020-01-03	-0.906341	1.058156
2020-01-04	-0.534992	-1.327501
2020-01-05	0.460481	-0.613649
2020-01-06	-0.418965	-1.288246

操作行#

添加一个空行#

[258]:

df = get_iris_df()
#df = get_random_df()
new_line = pd.Series([np.nan]*len(df.columns), # Fill cells with NaNs
                     index=df.columns)
# 方法一
df.loc[len(df)]=new_line

# 方法二
new_df = pd.concat([df, new_line.to_frame().T], axis=0, ignore_index=True)

# 方法三
# append 将被废弃
# new_df = df.append(new_line, ignore_index=True)

new_df.tail()

[258]:

	petalLength	petalWidth	sepalLength	sepalWidth	species
147	5.2	2.0	6.5	3.0	virginica
148	5.4	2.3	6.2	3.4	virginica
149	5.1	1.8	5.9	3.0	virginica
150	NaN	NaN	NaN	NaN	NaN
151	NaN	NaN	NaN	NaN	NaN

删除行#

[259]:

# 删除 petalLength 为空的行
new_df[~(new_df['petalLength'].isnull())].tail()

[259]:

	petalLength	petalWidth	sepalLength	sepalWidth	species
145	5.2	2.3	6.7	3.0	virginica
146	5.0	1.9	6.3	2.5	virginica
147	5.2	2.0	6.5	3.0	virginica
148	5.4	2.3	6.2	3.4	virginica
149	5.1	1.8	5.9	3.0	virginica

[260]:

# 删除索引为 1 和 5 的行
df = df.drop([1,5])
df.head()

[260]:

	petalLength	petalWidth	sepalLength	sepalWidth	species
0	1.4	0.2	5.1	3.5	setosa
2	1.3	0.2	4.7	3.2	setosa
3	1.5	0.2	4.6	3.1	setosa
4	1.4	0.2	5.0	3.6	setosa
6	1.4	0.3	4.6	3.4	setosa

apply, applymap, map 区别#

apply：应用在DataFrame的行或列中；
applymap：应用在DataFrame的每个元素中；
map：应用在单独一列（Series）的每个元素中。

[261]:

df = pd.DataFrame(np.random.randn(2, 2), columns=list('AB'));df

[261]:

	A	B
0	-1.793254	-2.011889
1	-0.904923	-0.594622

[262]:

# 行的和
df.apply(np.sum)

[262]:

A   -2.698177
B   -2.606511
dtype: float64

[263]:

# 列的和
df.apply(np.sum, axis=1)

[263]:

0   -3.805143
1   -1.499545
dtype: float64

[264]:

# 保留两位小数
df.applymap(lambda x: '%.2f'%x)

[264]:

	A	B
0	-1.79	-2.01
1	-0.90	-0.59

[265]:

# 保留两位小数，只处理某一列，返回一个 DataFrame
df[['A']].applymap(lambda x: '%.2f'%x)

[265]:

	A
0	-1.79
1	-0.90

[266]:

# 保留两位小数，只处理某一列，返回一个 Series
df['A'].map(lambda x: '%.2f'%x)

[266]:

0    -1.79
1    -0.90
Name: A, dtype: object

处理字符串#

把一列字符串拆分为两列#

[267]:

df = pd.DataFrame({"下发数量": ["下发数(6),未下发数（3)", "下发数(16),未下发数（7)"],
                   "下发时间": ["20220118007", "20220120009"]
                  })
df[['下发数', '去重数']] = df['下发数量'].str.extract(r'下发数\((\d+)\),未下发数（(\d+)\)').astype('int')
df

[267]:

	下发数量	下发时间	下发数	去重数
0	下发数(6),未下发数（3)	20220118007	6	3
1	下发数(16),未下发数（7)	20220120009	16	7

[268]:

firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})
firstlast["First_Name"] = firstlast["String"].str.split(" ", expand=True)[0]
firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[1]
firstlast[["First", "Last"]] = firstlast["String"].str.split(" ", expand=True)
firstlast

[268]:

	String	First_Name	Last_Name	First	Last
0	John Smith	John	Smith	John	Smith
1	Jane Cook	Jane	Cook	Jane	Cook

合并字符串#

[269]:

df = pd.DataFrame({
    "First": ["John", "Jane"],
    "Last": ["Smith", np.nan],
    "Age":[12, np.nan]
})
df

[269]:

	First	Last	Age
0	John	Smith	12.0
1	Jane	NaN	NaN

[270]:

df['First'].str.cat(df['Last'], sep='.', na_rep='NO_DATA')

[270]:

0      John.Smith
1    Jane.NO_DATA
Name: First, dtype: object

[271]:

df['First'].str.cat(df['Age'].astype(str), sep=' - ', na_rep='NO_DATA')

[271]:

0    John - 12.0
1     Jane - nan
Name: First, dtype: object

[272]:

df['First'].str.cat([df['Last'], df['Age'].astype(str)], sep=' - ', na_rep='NO_DATA')

[272]:

0     John - Smith - 12.0
1    Jane - NO_DATA - nan
Name: First, dtype: object

对一列字符串进行处理，生成新的列#

[273]:

firstlast = pd.DataFrame({"string": ["John Smith", "Jane Cook"]})
firstlast["upper"] = firstlast["string"].str.upper()
firstlast["lower"] = firstlast["string"].str.lower()
firstlast["title"] = firstlast["string"].str.title()
firstlast

[273]:

	string	upper	lower	title
0	John Smith	JOHN SMITH	john smith	John Smith
1	Jane Cook	JANE COOK	jane cook	Jane Cook

截取字符串#

[274]:

firstlast['title'].str[5:12]

[274]:

0    Smith
1     Cook
Name: title, dtype: object

替换字符串#

[275]:

firstlast['title'].str.replace(
    r'o{2,}',
    'XXXX',
    regex=True
)

[275]:

0     John Smith
1    Jane CXXXXk
Name: title, dtype: object

空值处理（NaN）#

[276]:

# 计数有空值的行
nans = df.shape[0] - df.dropna().shape[0]
print(u'一共有 %d 行出现空值' % nans)

# 填充空值为`无`
df.fillna(value=u'无', inplace=True)
df.tail()

一共有 1 行出现空值

[276]:

	First	Last	Age
0	John	Smith	12.0
1	Jane	无	无

排序#

DataFrame 按列排序#

[277]:

df_ti = get_titanic_df()
df = df_ti[['pclass', 'sex', 'age']]
# 按年龄排序
# df.sort_values('age', ascending=True, inplace=True)
df.sort_values(by='age', ascending=True).head()

[277]:

	pclass	sex	age
803	3	male	0.42
755	2	male	0.67
644	3	female	0.75
469	3	female	0.75
78	2	male	0.83

[278]:

df.sort_values(by=['age', 'pclass'], ascending=True)[5:15]

[278]:

	pclass	sex	age
831	2	male	0.83
305	1	male	0.92
183	2	male	1.00
827	2	male	1.00
164	3	male	1.00
172	3	female	1.00
381	3	female	1.00
386	3	male	1.00
788	3	male	1.00
297	1	female	2.00

[279]:

# 排序后重新编制索引
df.index = range(1,len(df.index)+1)
df.head()

[279]:

	pclass	sex	age
1	3	male	22.0
2	1	female	38.0
3	3	female	26.0
4	1	female	35.0
5	3	male	35.0

Series 排序#

按指定顺序排序#

[280]:

dic = {'101': '香蕉', '102': '苹果', '181': '葡萄'}
s = pd.Series(dic)
my_order = ['101', '181', '102']
s.reindex(my_order)

[280]:

101    香蕉
181    葡萄
102    苹果
dtype: object

按 index 排序#

[281]:

s.sort_index()

[281]:

101    香蕉
102    苹果
181    葡萄
dtype: object

按值排序#

[282]:

s.sort_values()

[282]:

102    苹果
181    葡萄
101    香蕉
dtype: object

[283]:

s.sort_values(ascending=False)

[283]:

101    香蕉
181    葡萄
102    苹果
dtype: object

##　去重

[284]:

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie',pd.NA,pd.NA],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack',pd.NA,pd.NA],
    'rating': [4, 4, 3.5, 15, 5,pd.NA,pd.NA]
})
df

[284]:

	brand	style	rating
0	Yum Yum	cup	4
1	Yum Yum	cup	4
2	Indomie	cup	3.5
3	Indomie	pack	15
4	Indomie	pack	5
5	<NA>	<NA>	<NA>
6	<NA>	<NA>	<NA>

[285]:

df.drop_duplicates()  # 所有列相同的去重

[285]:

	brand	style	rating
0	Yum Yum	cup	4
2	Indomie	cup	3.5
3	Indomie	pack	15
4	Indomie	pack	5
5	<NA>	<NA>	<NA>

[286]:

df.drop_duplicates(subset=['brand'])  # 按某一列去重

[286]:

	brand	style	rating
0	Yum Yum	cup	4
2	Indomie	cup	3.5
5	<NA>	<NA>	<NA>

[287]:

# 按两列去重，保留最后一个
df.drop_duplicates(subset=['brand', 'style'], keep='last')

[287]:

	brand	style	rating
1	Yum Yum	cup	4
2	Indomie	cup	3.5
4	Indomie	pack	5
6	<NA>	<NA>	<NA>

排名#

[288]:

df = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog',
                                   'spider', 'snake'],
                        'Number_legs': [4, 2, 4, 8, np.nan]})
df

[288]:

	Animal	Number_legs
0	cat	4.0
1	penguin	2.0
2	dog	4.0
3	spider	8.0
4	snake	NaN

[289]:

df['default_rank'] = df['Number_legs'].rank()
df['max_rank'] = df['Number_legs'].rank(method='max')
df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
df['pct_rank'] = df['Number_legs'].rank(pct=True)
df['max_asc_rank'] = df['Number_legs'].rank(method='min', ascending=False)
df

[289]:

	Animal	Number_legs	default_rank	max_rank	NA_bottom	pct_rank	max_asc_rank
0	cat	4.0	2.5	3.0	2.5	0.625	2.0
1	penguin	2.0	1.0	1.0	1.0	0.250	4.0
2	dog	4.0	2.5	3.0	2.5	0.625	2.0
3	spider	8.0	4.0	4.0	4.0	1.000	1.0
4	snake	NaN	NaN	NaN	5.0	NaN	NaN

按年龄分组#

[290]:

datas = [15, 16, 17, 18, 91, 56, 44]
bins = [0, 16, 45, 80, 100]
labels = ['儿童', '青年', '中年', '老年']
age_cut = pd.cut(datas, bins, True,labels)
age_cut

[290]:

['儿童', '儿童', '青年', '青年', '老年', '中年', '青年']
Categories (4, object): ['儿童' < '青年' < '中年' < '老年']

[291]:

age_cut.value_counts()

[291]:

儿童    2
青年    3
中年    1
老年    1
dtype: int64

输出示例#

[292]:

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})
df

[292]:

	brand	style	rating
0	Yum Yum	cup	4.0
1	Yum Yum	cup	4.0
2	Indomie	cup	3.5
3	Indomie	pack	15.0
4	Indomie	pack	5.0

[293]:

df.to_dict()

[293]:

{'brand': {0: 'Yum Yum',
  1: 'Yum Yum',
  2: 'Indomie',
  3: 'Indomie',
  4: 'Indomie'},
 'style': {0: 'cup', 1: 'cup', 2: 'cup', 3: 'pack', 4: 'pack'},
 'rating': {0: 4.0, 1: 4.0, 2: 3.5, 3: 15.0, 4: 5.0}}

[294]:

df.to_records()

[294]:

rec.array([(0, 'Yum Yum', 'cup',  4. ), (1, 'Yum Yum', 'cup',  4. ),
           (2, 'Indomie', 'cup',  3.5), (3, 'Indomie', 'pack', 15. ),
           (4, 'Indomie', 'pack',  5. )],
          dtype=[('index', '<i8'), ('brand', 'O'), ('style', 'O'), ('rating', '<f8')])

[295]:

df.to_csv()

[295]:

',brand,style,rating\n0,Yum Yum,cup,4.0\n1,Yum Yum,cup,4.0\n2,Indomie,cup,3.5\n3,Indomie,pack,15.0\n4,Indomie,pack,5.0\n'

[296]:

print(df.to_csv())

,brand,style,rating
0,Yum Yum,cup,4.0
1,Yum Yum,cup,4.0
2,Indomie,cup,3.5
3,Indomie,pack,15.0
4,Indomie,pack,5.0

[297]:

for i in df.iterrows():
    print(i[0])
    print(type(i[1]))
    print(i[1])

0
<class 'pandas.core.series.Series'>
brand     Yum Yum
style         cup
rating        4.0
Name: 0, dtype: object
1
<class 'pandas.core.series.Series'>
brand     Yum Yum
style         cup
rating        4.0
Name: 1, dtype: object
2
<class 'pandas.core.series.Series'>
brand     Indomie
style         cup
rating        3.5
Name: 2, dtype: object
3
<class 'pandas.core.series.Series'>
brand     Indomie
style        pack
rating       15.0
Name: 3, dtype: object
4
<class 'pandas.core.series.Series'>
brand     Indomie
style        pack
rating        5.0
Name: 4, dtype: object

[298]:

df.items()

[298]:

<generator object DataFrame.items at 0x7fe75ca78c80>

[299]:

for i in df.items():
    print(i)
    print(type(i))

('brand', 0    Yum Yum
1    Yum Yum
2    Indomie
3    Indomie
4    Indomie
Name: brand, dtype: object)
<class 'tuple'>
('style', 0     cup
1     cup
2     cup
3    pack
4    pack
Name: style, dtype: object)
<class 'tuple'>
('rating', 0     4.0
1     4.0
2     3.5
3    15.0
4     5.0
Name: rating, dtype: float64)
<class 'tuple'>

图形化#

[300]:

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()

[300]:

<Axes: >

[301]:

df = get_random_df()
df.plot(y=['A', 'B'])

[301]:

<Axes: >

[302]:

df_iris = get_iris_df()
df_iris.sample(5)

[302]:

	petalLength	petalWidth	sepalLength	sepalWidth	species
127	4.9	1.8	6.1	3.0	virginica
134	5.6	1.4	6.1	2.6	virginica
28	1.4	0.2	5.2	3.4	setosa
130	6.1	1.9	7.4	2.8	virginica
51	4.5	1.5	6.4	3.2	versicolor

[303]:

# 线性回归
sns.regplot(x='petalLength', y='petalWidth',data=df_iris)

[303]:

<Axes: xlabel='petalLength', ylabel='petalWidth'>

复制数据#

复制 Frame 或者 Series 的数据使用 copy 方法。

Signature: df.copy(deep: 'bool_t' = True) -> 'FrameOrSeries'

参数 deep 默认为 True 。

[304]:

df = get_random_df()
df_deep = df.copy()
df_shallow = df.copy(deep=False)
df.iloc[1,1] = 0

[305]:

df

[305]:

	A	B	C	D
2020-01-01	1.334778	0.590552	-0.191088	-0.570773
2020-01-02	0.563330	0.000000	-0.388838	1.082667
2020-01-03	-0.395530	0.022657	2.374852	-0.936310
2020-01-04	0.095474	-0.013410	-0.949512	-1.027845
2020-01-05	0.268630	1.993076	0.120371	-0.165745
2020-01-06	-0.026002	0.474532	0.353869	-0.868767

[306]:

df_deep

[306]:

	A	B	C	D
2020-01-01	1.334778	0.590552	-0.191088	-0.570773
2020-01-02	0.563330	1.888045	-0.388838	1.082667
2020-01-03	-0.395530	0.022657	2.374852	-0.936310
2020-01-04	0.095474	-0.013410	-0.949512	-1.027845
2020-01-05	0.268630	1.993076	0.120371	-0.165745
2020-01-06	-0.026002	0.474532	0.353869	-0.868767

[307]:

df_shallow

[307]:

	A	B	C	D
2020-01-01	1.334778	0.590552	-0.191088	-0.570773
2020-01-02	0.563330	0.000000	-0.388838	1.082667
2020-01-03	-0.395530	0.022657	2.374852	-0.936310
2020-01-04	0.095474	-0.013410	-0.949512	-1.027845
2020-01-05	0.268630	1.993076	0.120371	-0.165745
2020-01-06	-0.026002	0.474532	0.353869	-0.868767

注意：当值为嵌套的结构时，深复制不是递归的。下层原数据变更时也会影响复制后的数据。

[308]:

s = pd.Series([[1, 2], [3, 4]])
deep = s.copy()
s[0][0] = 10  # 这个会影响复制后的数据
s[1] = [5, 6] # 这个不会影响复制后的数据
s

[308]:

0    [10, 2]
1     [5, 6]
dtype: object

[309]:

deep

[309]:

0    [10, 2]
1     [3, 4]
dtype: object

导出数据#

EXCEL#

[310]:

# df.to_excel(output_path)
# 多个 sheet
# with pd.ExcelWriter(output_path) as w:
#     df1.to_excel(w, sheet_name='first_sheet')
#     df2.to_excel(w, sheet_name='second_sheet')

CSV#

[311]:

# 导出周六的数据，格式为 CSV
# df[ (df['Day'] == 'Sat') ].to_csv('test_tmp.csv', mode='w', encoding='utf-8', index=False)

#在前面的文件中追加周日的数据
# df[ (df['Day'] == 'Sun') ].to_csv('test_output.csv', mode='a', header=False, encoding='utf-8', index=False)

Dict#

[312]:

# 输出为 dict 格式
# DataFrame.to_dict可以接受 ‘dict’, ‘list’, ‘series’, ‘split’, ‘records’, ‘index’
df = pd.DataFrame({'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df

[312]:

	AAA	BBB	CCC
0	4	10	100
1	5	20	50
2	6	30	-30
3	7	40	-50

[313]:

import pprint
for p in ['dict', 'list', 'series', 'split', 'records', 'index']:
    print(f'Parameters:{p}')
    pprint.pprint(df.to_dict(p))
    print('-----------------------------------------------------')

Parameters:dict
{'AAA': {0: 4, 1: 5, 2: 6, 3: 7},
 'BBB': {0: 10, 1: 20, 2: 30, 3: 40},
 'CCC': {0: 100, 1: 50, 2: -30, 3: -50}}
-----------------------------------------------------
Parameters:list
{'AAA': [4, 5, 6, 7], 'BBB': [10, 20, 30, 40], 'CCC': [100, 50, -30, -50]}
-----------------------------------------------------
Parameters:series
{'AAA': 0    4
1    5
2    6
3    7
Name: AAA, dtype: int64,
 'BBB': 0    10
1    20
2    30
3    40
Name: BBB, dtype: int64,
 'CCC': 0    100
1     50
2    -30
3    -50
Name: CCC, dtype: int64}
-----------------------------------------------------
Parameters:split
{'columns': ['AAA', 'BBB', 'CCC'],
 'data': [[4, 10, 100], [5, 20, 50], [6, 30, -30], [7, 40, -50]],
 'index': [0, 1, 2, 3]}
-----------------------------------------------------
Parameters:records
[{'AAA': 4, 'BBB': 10, 'CCC': 100},
 {'AAA': 5, 'BBB': 20, 'CCC': 50},
 {'AAA': 6, 'BBB': 30, 'CCC': -30},
 {'AAA': 7, 'BBB': 40, 'CCC': -50}]
-----------------------------------------------------
Parameters:index
{0: {'AAA': 4, 'BBB': 10, 'CCC': 100},
 1: {'AAA': 5, 'BBB': 20, 'CCC': 50},
 2: {'AAA': 6, 'BBB': 30, 'CCC': -30},
 3: {'AAA': 7, 'BBB': 40, 'CCC': -50}}
-----------------------------------------------------

Cheat Sheet#

英文：https://www.dataquest.io/blog/pandas-cheat-sheet/

中文翻译：http://blog.csdn.net/qq_33399185/article/details/60872853