Pandas 学习笔记#

一些准备工作#

[163]:
import numpy as np
import pandas as pd
import seaborn as sns
from datetime import datetime
from pandas import Series, DataFrame
from pathlib import Path

# 辅助函数

def get_titanic_df():
    return pd.read_csv('datas/titanic.csv')

def get_iris_df():
    return pd.read_csv('datas/iris.csv')

def get_tips_df():
    return pd.read_csv('datas/tips.csv')

def get_random_df():
    return pd.DataFrame(
        np.random.randn(6, 4),
        index=pd.date_range('20200101', periods=6),
        columns=list('ABCD'))

安装使用#

2020年1月更新:

  • jupyter 已经进化出 jupyter-lab 了。

2018年更新:

ipython notebook 已更名为 jupyter notebook 。安装使用方法如下:

  • 安装 anaconda ,anaconda 是一个 Python 发行版。装好后就已经包含 pandasjupyter notebook

  • 运行 jupyter notebook 命令就可以使用 jupyter notebook 。

  • 要在 notebook 即时显示图表,可以使用在 notebook 中使用 %matplotlib inline 命令。

下面是以前的安装使用方法:

  • 安装 pandas

sudo apt-get install build-essential python-dev
sudo apt-get install python-pandas python-tk
sudo apt-get install python-scipy python-matplotlib python-tables
sudo apt-get install python-numexpr python-xlrd python-statsmodels
sudo apt-get install python-openpyxl python-xlwt python-bs4

if use virtualenv before install matplotlib should install libpng-dev, libjpeg8-dev, libfreetype6-dev

  • 安装 ipython-notebook

sudo pip install "ipython[notebook]"
sudo pip install pygments
  • 使用ipython notebook运行 ipython-notebook 。如果使用matplotlib内嵌进网页中,那么需要运行:ipython notebook --matplotlib inline ;或者在已经打开的 notebook 中运行 %matplotlib inline 命令。

[164]:
# 查看 pandas 的版本
# pd.__version__
# 查看当前环境涉及软件的版本
# pd.show_versions()

DataFrame 速览#

创建#

[165]:
data = {
    'fruit': ['Apple', 'Apple', 'Apple', 'Grape', 'Grape', 'Grape'],
    'year': [2017, 2018, 2019, 2017, 2018, 2019],
    'price': [20.5, 21.3, 25, 10.1, 10.9, 9.98]
}
frame = DataFrame(data)
frame
[165]:
fruit year price
0 Apple 2017 20.50
1 Apple 2018 21.30
2 Apple 2019 25.00
3 Grape 2017 10.10
4 Grape 2018 10.90
5 Grape 2019 9.98

生成时指定 columns 和 index ,index 的长度要与行数相匹配

[166]:
frame2 = DataFrame(
    data,
    columns = ['year', 'month', 'fruit', 'price'],
    index = ['one', 'two', 'three', 'four', 'five', 'six']
)
frame2
[166]:
year month fruit price
one 2017 NaN Apple 20.50
two 2018 NaN Apple 21.30
three 2019 NaN Apple 25.00
four 2017 NaN Grape 10.10
five 2018 NaN Grape 10.90
six 2019 NaN Grape 9.98

查看数据#

通过类似字典标记的方式或属性的方式,可以将 DataFrame 的列获取为一个 Series 。 IPython 提供了类似属性的访问(即frame2.year)和tab补全。 frame2[column] 适用于任何列的名,但是 frame2.column 只有在列名是一个合理 的 Python 变量名时才适用。 注意,返回的 Series 拥有原 DataFrame 相同的索引,且其 name 属性也已经被相应地 设置好了。

[167]:
frame2['fruit']
[167]:
one      Apple
two      Apple
three    Apple
four     Grape
five     Grape
six      Grape
Name: fruit, dtype: object
[168]:
frame2.year
[168]:
one      2017
two      2018
three    2019
four     2017
five     2018
six      2019
Name: year, dtype: int64
[169]:
frame2.loc['two']
[169]:
year      2018
month      NaN
fruit    Apple
price     21.3
Name: two, dtype: object

修改数据#

[170]:
frame2.month = 11
frame2
[170]:
year month fruit price
one 2017 11 Apple 20.50
two 2018 11 Apple 21.30
three 2019 11 Apple 25.00
four 2017 11 Grape 10.10
five 2018 11 Grape 10.90
six 2019 11 Grape 9.98
[171]:
frame2.month = np.arange(6)
frame2
[171]:
year month fruit price
one 2017 0 Apple 20.50
two 2018 1 Apple 21.30
three 2019 2 Apple 25.00
four 2017 3 Grape 10.10
five 2018 4 Grape 10.90
six 2019 5 Grape 9.98

DataFrame 的切片是 DataFrame 的视图,不是副本。对切片赋值会改变 DataFrame 本身。

[172]:
c_month = frame2['month']
c_month
[172]:
one      0
two      1
three    2
four     3
five     4
six      5
Name: month, dtype: int64
[173]:
month = Series([1,3,5], index=['one', 'three', 'five'])
frame2['month'] = month
frame2
[173]:
year month fruit price
one 2017 1.0 Apple 20.50
two 2018 NaN Apple 21.30
three 2019 3.0 Apple 25.00
four 2017 NaN Grape 10.10
five 2018 5.0 Grape 10.90
six 2019 NaN Grape 9.98
[174]:
# c_month['two'] = 2.2
# 这种方式也能改写值,但是因为性能的问题不推荐使用,详见:
# https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
frame2.loc['two', 'month'] = 2.2
frame2
[174]:
year month fruit price
one 2017 1.0 Apple 20.50
two 2018 2.2 Apple 21.30
three 2019 3.0 Apple 25.00
four 2017 NaN Grape 10.10
five 2018 5.0 Grape 10.90
six 2019 NaN Grape 9.98

删除数据#

[175]:
del frame2['month'] #  删除 column
frame2
[175]:
year fruit price
one 2017 Apple 20.50
two 2018 Apple 21.30
three 2019 Apple 25.00
four 2017 Grape 10.10
five 2018 Grape 10.90
six 2019 Grape 9.98

pandas 的索引对象#

[176]:
nums = Series(range(3), index=['one', 'two', 'three'])
[177]:
num_index = nums.index
[178]:
num_index
[178]:
Index(['one', 'two', 'three'], dtype='object')
[179]:
num_index[1:]
[179]:
Index(['two', 'three'], dtype='object')

索引对象是不可变的,不能对其赋值

[180]:
num_index2 = pd.Index(['one', 'two', 'three'])
[181]:
num_index is num_index2
[181]:
False
[182]:
num_index == num_index2
[182]:
array([ True,  True,  True])
[183]:
nums2= Series(range(3), index = num_index2)
[184]:
nums2
[184]:
one      0
two      1
three    2
dtype: int64
[185]:
nums2.index is num_index2
[185]:
True
[186]:
nums2.index == num_index2
[186]:
array([ True,  True,  True])
[187]:
frame3 = get_random_df()
[188]:
frame3.columns
[188]:
Index(['A', 'B', 'C', 'D'], dtype='object')
[189]:
frame3.index
[189]:
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06'],
              dtype='datetime64[ns]', freq='D')

reindex#

Series reindex#

[190]:
obj = Series(range(4), index=['a', 'c', 'd', 'b'])
obj
[190]:
a    0
c    1
d    2
b    3
dtype: int64
[191]:
obj2 = obj.reindex(['a', 'b', 'c', 'd'])
[192]:
obj2
[192]:
a    0
b    3
c    1
d    2
dtype: int64

自动填充#

[193]:
obj3 = Series(['red', 'blue', 'yellow'], index=[1,3,5])
[194]:
obj3
[194]:
1       red
3      blue
5    yellow
dtype: object
[195]:
obj3.reindex(range(6), method='ffill')
[195]:
0       NaN
1       red
2       red
3      blue
4      blue
5    yellow
dtype: object

Frame reindex#

[196]:
range(5)
[196]:
range(0, 5)
[197]:
df = pd.DataFrame(
        np.random.randn(6, 4),
        index=range(6),
        columns=list('ABCD'))
df
[197]:
A B C D
0 -1.870282 1.847236 1.060884 -0.962755
1 0.639969 1.395878 -1.609093 -0.000310
2 1.262852 0.930063 -0.288271 -1.647747
3 1.844337 -0.448772 1.115280 -0.033701
4 0.095214 -0.154165 -0.941797 -0.515162
5 -0.642843 1.988245 0.806943 -0.834752
[198]:
df.reindex(range(5,-1,-1), columns=['A', 'C', 'DDD'])
[198]:
A C DDD
5 -0.642843 0.806943 NaN
4 0.095214 -0.941797 NaN
3 1.844337 1.115280 NaN
2 1.262852 -0.288271 NaN
1 0.639969 -1.609093 NaN
0 -1.870282 1.060884 NaN

drop#

[199]:
df.drop('D', axis='columns') #  不修改对象
[199]:
A B C
0 -1.870282 1.847236 1.060884
1 0.639969 1.395878 -1.609093
2 1.262852 0.930063 -0.288271
3 1.844337 -0.448772 1.115280
4 0.095214 -0.154165 -0.941797
5 -0.642843 1.988245 0.806943
[200]:
df
[200]:
A B C D
0 -1.870282 1.847236 1.060884 -0.962755
1 0.639969 1.395878 -1.609093 -0.000310
2 1.262852 0.930063 -0.288271 -1.647747
3 1.844337 -0.448772 1.115280 -0.033701
4 0.095214 -0.154165 -0.941797 -0.515162
5 -0.642843 1.988245 0.806943 -0.834752
[201]:
df.drop('D', axis='columns', inplace=True) #  修改对象
[202]:
df
[202]:
A B C
0 -1.870282 1.847236 1.060884
1 0.639969 1.395878 -1.609093
2 1.262852 0.930063 -0.288271
3 1.844337 -0.448772 1.115280
4 0.095214 -0.154165 -0.941797
5 -0.642843 1.988245 0.806943

DataFrames 创建#

[203]:
dates = pd.date_range('20180101', periods=6);dates
[203]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06'],
              dtype='datetime64[ns]', freq='D')
[204]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'));df
[204]:
A B C D
2018-01-01 1.863653 0.053990 -1.256914 1.474340
2018-01-02 -0.164327 0.211501 0.323222 -0.002627
2018-01-03 0.254398 -0.622240 0.329015 0.245014
2018-01-04 -0.496644 0.955061 1.403011 -0.370782
2018-01-05 -0.065887 0.570067 1.524579 -0.854202
2018-01-06 1.132634 0.529680 0.117496 0.880622
[205]:
# 可以使用字典来创建 DataFrame 。
# 如果字典的 Value 是单一值,那么会自动扩展。
# 如果字典的 Value 是列表或者 Series ,那么长度要保持一致。
# 如果字典中只有一个值有 Index ,那么会使用这个 Index 作为整个 DataFrame 的 Index 。
# 如果字典有多个 Index ,那么必须保持一致,否则会报错。
df2 = pd.DataFrame(
    { 'A' : 1.,
      'B' : pd.Timestamp('20130102'),
      'C' : pd.Series(1,index=list(range(2,6)),dtype='float32'),
      'D' : np.array([3] * 4,dtype='int32'),
      'E' : pd.Categorical(["test","train","test","train"]),
      'F' : 'foo' }
)
df2
[205]:
A B C D E F
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
4 1.0 2013-01-02 1.0 3 test foo
5 1.0 2013-01-02 1.0 3 train foo
[206]:
data = {
    'fruit': ['Apple', 'Apple', 'Apple', 'Grape', 'Grape', 'Grape'],
    'year': [2017, 2018, 2019, 2017, 2018, 2019],
    'price': [20.5, 21.3, 25, 10.1, 10.9, 9.98]
}
frame = DataFrame(data)

DataFrame 全局操作#

[207]:
df = get_random_df()
[208]:
df.dtypes
[208]:
A    float64
B    float64
C    float64
D    float64
dtype: object
[209]:
df.index
[209]:
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06'],
              dtype='datetime64[ns]', freq='D')
[210]:
df.columns
[210]:
Index(['A', 'B', 'C', 'D'], dtype='object')
[211]:
df.values
[211]:
array([[-0.8430307 ,  0.16365112,  0.34819952,  0.1816034 ],
       [-0.2527771 ,  0.35288587, -1.31084768,  0.96593095],
       [-0.98246094, -0.24233779, -1.70510314, -0.07605717],
       [-0.34124125,  1.70597412, -0.46188374,  0.49084553],
       [-0.17822249,  0.24639153,  0.16107507,  0.22037991],
       [ 2.14574608,  1.18792989,  0.75582046,  0.1645418 ]])
[212]:
df.describe()
[212]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.075331 0.569082 -0.368790 0.324541
std 1.136505 0.727893 0.973802 0.362325
min -0.982461 -0.242338 -1.705103 -0.076057
25% -0.717583 0.184336 -1.098607 0.168807
50% -0.297009 0.299639 -0.150404 0.200992
75% -0.196861 0.979169 0.301418 0.423229
max 2.145746 1.705974 0.755820 0.965931
[213]:
df.T
[213]:
2020-01-01 2020-01-02 2020-01-03 2020-01-04 2020-01-05 2020-01-06
A -0.843031 -0.252777 -0.982461 -0.341241 -0.178222 2.145746
B 0.163651 0.352886 -0.242338 1.705974 0.246392 1.187930
C 0.348200 -1.310848 -1.705103 -0.461884 0.161075 0.755820
D 0.181603 0.965931 -0.076057 0.490846 0.220380 0.164542

读入数据#

从 CSV 文件读入数据#

[214]:
# 读入 CSV 格式数据
df_movies = pd.read_csv('datas/movies.csv', sep='\t', encoding='utf-8')
df_movies.head()
[214]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
0 1 Jan. 30 Fri 30 $26,168,351 American Sniper $9,905,616
1 2 Jan. 31 Sat 31 $41,633,588 American Sniper $16,510,536
2 3 Feb. 1 Sun 32 $12,515,579 American Sniper $4,244,376
3 4 Feb. 2 Mon 33 $6,475,068 American Sniper $2,645,109
4 5 Feb. 3 Tue 34 $7,825,091 American Sniper $2,923,141
[215]:
df_movies = pd.read_csv('datas/movies.csv', sep='\t', encoding='utf-8',thousands=',',escapechar='$')
df_movies.head()
[215]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
0 1 Jan. 30 Fri 30 26168351 American Sniper 9905616
1 2 Jan. 31 Sat 31 41633588 American Sniper 16510536
2 3 Feb. 1 Sun 32 12515579 American Sniper 4244376
3 4 Feb. 2 Mon 33 6475068 American Sniper 2645109
4 5 Feb. 3 Tue 34 7825091 American Sniper 2923141

读取目录下所有CSV,合并为1个DataFrame,去重#

[216]:
source_path = Path('./datas')
csv_files = source_path.glob('*.csv')
dfs = [pd.read_csv(item, encoding='utf8', dtype="string") for item in csv_files]
df = pd.concat(dfs).drop_duplicates()
#df.head(3)

从数据库读入数据#

参见:http://stackoverflow.com/questions/10065051/python-pandas-and-databases-like-mysql

从 Excel 文件读入数据#

原型#

pandas.read_excel(io, sheet_name=0, header=0, skiprows=None, skip_footer=0,
                  index_col=None, names=None, usecols=None, parse_dates=False,
                  date_parser=None, na_values=None, thousands=None,
                  convert_float=True, converters=None, dtype=None,
                  true_values=None, false_values=None, engine=None,
                  squeeze=False, **kwds)

Read an Excel table into a pandas DataFrame

示例:

# 读入 Excel 文件,所有字段都是 str
df = pd.read_excel(source_file, dtype="string")

选择数据#

  1. 行,列 –> df[]

  2. 区域 –> df.loc[], df.iloc[], df.ix[]

  3. 单元格 –> df.at[], df.iat[]

选择单个数据#

[217]:
df=get_random_df();df
[217]:
A B C D
2020-01-01 -0.127414 0.402208 -1.621377 -0.706600
2020-01-02 0.427348 -0.576915 -0.719638 -1.058799
2020-01-03 -0.466531 1.123756 -0.884415 2.203494
2020-01-04 -0.249747 -0.367119 -1.423965 0.971523
2020-01-05 -0.879580 1.904772 0.470134 1.533346
2020-01-06 3.046705 -1.288350 -1.219871 -0.800774
[218]:
df.at['2020-01-01', 'B']
[218]:
0.40220819245805306
[219]:
df.iat[1,1]
[219]:
-0.5769150159100832

选择某个区域#

[220]:
#只显示指定的行和列
df.iloc[[1,3,5],[0,1,2,3]]
[220]:
A B C D
2020-01-02 0.427348 -0.576915 -0.719638 -1.058799
2020-01-04 -0.249747 -0.367119 -1.423965 0.971523
2020-01-06 3.046705 -1.288350 -1.219871 -0.800774
[221]:
df.loc[['2020-01-01','2020-01-04','2020-01-06'],['A', 'C']]
[221]:
A C
2020-01-01 -0.127414 -1.621377
2020-01-04 -0.249747 -1.423965
2020-01-06 3.046705 -1.219871

选择行#

[222]:
df = get_titanic_df()
df = df[['survived', 'age', 'deck', 'class']]
df[1:4] # 第2-4行
[222]:
survived age deck class
1 1 38.0 C First
2 1 26.0 NaN Third
3 1 35.0 C First
[223]:
# 查看头部数据
df.head()
[223]:
survived age deck class
0 0 22.0 NaN Third
1 1 38.0 C First
2 1 26.0 NaN Third
3 1 35.0 C First
4 0 35.0 NaN Third
[224]:
# 查看尾部数据
df.tail(2)
#head 和 tail 接受一个整数参数,缺省值为 5 。
[224]:
survived age deck class
889 1 26.0 C First
890 0 32.0 NaN Third
[225]:
# 选择 deck 为空值的行
df[df['deck'].isnull()].head()
[225]:
survived age deck class
0 0 22.0 NaN Third
2 1 26.0 NaN Third
4 0 35.0 NaN Third
5 0 NaN NaN Third
7 0 2.0 NaN Third
[226]:
df[df.deck.isna()].head()
[226]:
survived age deck class
0 0 22.0 NaN Third
2 1 26.0 NaN Third
4 0 35.0 NaN Third
5 0 NaN NaN Third
7 0 2.0 NaN Third
[227]:
# 选择 deck 为非空值的行
df[df.deck.notnull()].head()
[227]:
survived age deck class
1 1 38.0 C First
3 1 35.0 C First
6 0 54.0 E First
10 1 4.0 G Third
11 1 58.0 C First
[228]:
# 根据条件过滤
df[(df['class'] == 'First') | (df.deck == 'E') ].head()
[228]:
survived age deck class
1 1 38.0 C First
3 1 35.0 C First
6 0 54.0 E First
11 1 58.0 C First
23 1 28.0 A First
[229]:
df[ (df['class'] == 'First') & (df.deck == 'D') & (df.age < 30)].head()
[229]:
survived age deck class
97 1 23.0 D First
102 0 21.0 D First
136 1 19.0 D First
393 1 23.0 D First
627 1 21.0 D First

选择列#

[230]:
df = get_titanic_df()
df[['survived', 'age', 'deck', 'class']].head()
[230]:
survived age deck class
0 0 22.0 NaN Third
1 1 38.0 C First
2 1 26.0 NaN Third
3 1 35.0 C First
4 0 35.0 NaN Third
[231]:
df.iloc[:,[1, 2, 6]].head()
[231]:
pclass sex fare
0 3 male 7.2500
1 1 female 71.2833
2 3 female 7.9250
3 1 female 53.1000
4 3 male 8.0500
[232]:
# 只包含指定字符的列
df.filter(regex='s', axis=1).head()
[232]:
survived pclass sex sibsp class
0 0 3 male 1 Third
1 1 1 female 1 First
2 1 3 female 0 Third
3 1 1 female 1 First
4 0 3 male 0 Third
[233]:
# 根据列类型过滤
# 只选择字符串型的列
df.loc[:, (df.dtypes == np.dtype('O')).values].head()
[233]:
sex embarked class who deck embark_town alive
0 male S Third man NaN Southampton no
1 female C First woman C Cherbourg yes
2 female S Third woman NaN Southampton yes
3 female S First woman C Southampton yes
4 male S Third man NaN Southampton no

操作单元#

单个单元赋值#

[234]:
df = get_random_df()
df
[234]:
A B C D
2020-01-01 1.148532 0.317324 1.159205 0.767581
2020-01-02 0.006078 -0.933768 1.729512 -0.186436
2020-01-03 0.781555 -0.465289 0.490352 1.365542
2020-01-04 0.010499 0.523784 0.204040 0.082275
2020-01-05 0.221785 -2.691217 0.151030 -1.445744
2020-01-06 -0.405333 -0.838198 -0.493799 -2.455432
[235]:
# 单元格赋值
# 单个单元格赋值
df.iloc[0, 3] = 0.131211
df.loc['2020-01-02', 'B']= -0.131211
df.head(3)
[235]:
A B C D
2020-01-01 1.148532 0.317324 1.159205 0.131211
2020-01-02 0.006078 -0.131211 1.729512 -0.186436
2020-01-03 0.781555 -0.465289 0.490352 1.365542

多个单元格赋值#

[236]:
df.loc[df.index[0:2], 'C'] = [100, 200]
df.head(3)
[236]:
A B C D
2020-01-01 1.148532 0.317324 100.000000 0.131211
2020-01-02 0.006078 -0.131211 200.000000 -0.186436
2020-01-03 0.781555 -0.465289 0.490352 1.365542

操作列#

调整列的顺序#

[237]:
# 构建 DataFrame
df = pd.DataFrame({'a':[1,2], 'b': [3,4], 'c':[5,6], 'd':[7,8], 'e':[9,10]});df
[237]:
a b c d e
0 1 3 5 7 9
1 2 4 6 8 10

方法一#

[238]:
col_order = ['c', 'a', 'b', 'd']
df = df[col_order]
df
[238]:
c a b d
0 5 1 3 7
1 6 2 4 8

方法二#

[239]:
new_df = df.drop('c', axis=1)
new_df.insert(2, 'c', df['c'])
new_df
[239]:
a b c d
0 1 3 5 7
1 2 4 6 8

打印列类型#

[240]:
df = df_movies.copy()
df.columns.to_series().groupby(df.dtypes).groups
[240]:
{int64: ['Row', 'Day#', 'Top 10 Gross', 'Gross'], object: ['Date', 'Day', '#1 Movie']}
[241]:
# 打印列类型(清晰打印中文)
types = df.columns.to_series().groupby(df.dtypes).groups
for key, value in types.items():
    print(key,':\t', ','.join(value))
int64 :  Row,Day#,Top 10 Gross,Gross
object :         Date,Day,#1 Movie

插入列#

[242]:
df = get_random_df()
# 方式一:在末尾添加
df['memo'] = pd.Series('', index=df.index)
df.head(3)
[242]:
A B C D memo
2020-01-01 -0.460849 -2.111140 0.526750 -1.723461
2020-01-02 -2.075447 0.771671 0.074949 0.419958
2020-01-03 -1.044952 -0.515388 -1.244958 -0.207565
[243]:
# 方式二:在中间插入
df = get_random_df()
df.insert(loc=1, column=u'year', value=u'2015')
df.head(3)
[243]:
A year B C D
2020-01-01 -0.909163 2015 -0.917322 -0.506002 1.088749
2020-01-02 0.237910 2015 -0.330139 -0.185761 -0.282543
2020-01-03 0.644266 2015 -0.552541 0.013399 1.092278
[244]:
# 根据现有值生成一个新的列
df = get_random_df()
df.insert(loc = 2 , column='A-B', value=df['A'] - df['B'])
df.head(3)
[244]:
A B A-B C D
2020-01-01 0.448350 -0.290054 0.738404 -0.406046 0.308425
2020-01-02 -0.580417 -1.769612 1.189195 1.239757 -0.759034
2020-01-03 -0.043785 0.072310 -0.116096 0.264677 -0.395628
[245]:
# 根据现有值生成多个新的列
df = df_movies.copy()
def process_date_col(text):
    #根据日期生成月份和日两个新的列
    if pd.isnull(text):
        month = day = np.nan
    else:
        month, day = text.split('.')
    return pd.Series([month, day])

df[[u'month', u'day']] = df.Date.apply(process_date_col)
df.head()
[245]:
Row Date Day Day# Top 10 Gross #1 Movie Gross month day
0 1 Jan. 30 Fri 30 26168351 American Sniper 9905616 Jan 30
1 2 Jan. 31 Sat 31 41633588 American Sniper 16510536 Jan 31
2 3 Feb. 1 Sun 32 12515579 American Sniper 4244376 Feb 1
3 4 Feb. 2 Mon 33 6475068 American Sniper 2645109 Feb 2
4 5 Feb. 3 Tue 34 7825091 American Sniper 2923141 Feb 3
[246]:
# 根据现有值生成多个新的列
df = df_movies.copy()
df.head()

def process_date_col(se):
    #根据日期生成月份和日两个新的列
    if pd.isnull(se['Date']):
        se['month'] = se['day'] = np.nan
    else:
        se['month'], se['day'] = se['Date'].split('.')
    return se
df['month'] = df['day'] = np.nan
df_new = df.apply(process_date_col, axis=1)
df_new.head()
[246]:
Row Date Day Day# Top 10 Gross #1 Movie Gross month day
0 1 Jan. 30 Fri 30 26168351 American Sniper 9905616 Jan 30
1 2 Jan. 31 Sat 31 41633588 American Sniper 16510536 Jan 31
2 3 Feb. 1 Sun 32 12515579 American Sniper 4244376 Feb 1
3 4 Feb. 2 Mon 33 6475068 American Sniper 2645109 Feb 2
4 5 Feb. 3 Tue 34 7825091 American Sniper 2923141 Feb 3

改变列值#

[247]:
df = get_random_df()
df
[247]:
A B C D
2020-01-01 0.455907 0.440170 -0.956345 0.766485
2020-01-02 -0.400386 -0.218492 -0.737315 -0.841963
2020-01-03 -1.036559 -0.701914 1.589558 0.057975
2020-01-04 1.590878 -0.045251 -0.515069 -1.433473
2020-01-05 -0.071407 0.250663 1.714977 0.262407
2020-01-06 0.029585 -0.255554 -1.276931 -1.298129

根据一列的值改变本列的值#

[248]:
# 方法一
# 把 C 列小于 0 的数据设置为 0
df.loc[df.C < 0, 'C'] = 0
df
[248]:
A B C D
2020-01-01 0.455907 0.440170 0.000000 0.766485
2020-01-02 -0.400386 -0.218492 0.000000 -0.841963
2020-01-03 -1.036559 -0.701914 1.589558 0.057975
2020-01-04 1.590878 -0.045251 0.000000 -1.433473
2020-01-05 -0.071407 0.250663 1.714977 0.262407
2020-01-06 0.029585 -0.255554 0.000000 -1.298129
[249]:
# 根据一列的值改变本列的值
# 方法二
# 把 C 列小于 0 的数据设置为 0
df['D'] = df['D'].apply(lambda x : 0 if x<0 else x)
df
[249]:
A B C D
2020-01-01 0.455907 0.440170 0.000000 0.766485
2020-01-02 -0.400386 -0.218492 0.000000 0.000000
2020-01-03 -1.036559 -0.701914 1.589558 0.057975
2020-01-04 1.590878 -0.045251 0.000000 0.000000
2020-01-05 -0.071407 0.250663 1.714977 0.262407
2020-01-06 0.029585 -0.255554 0.000000 0.000000

同时改变多个列的值#

[250]:
# 把 A B 两列数据均乘以 100
cols = ['A', 'B']
df[cols] = df[cols].applymap(lambda x: x*100)
df
[250]:
A B C D
2020-01-01 45.590705 44.016988 0.000000 0.766485
2020-01-02 -40.038550 -21.849155 0.000000 0.000000
2020-01-03 -103.655878 -70.191427 1.589558 0.057975
2020-01-04 159.087823 -4.525098 0.000000 0.000000
2020-01-05 -7.140684 25.066323 1.714977 0.262407
2020-01-06 2.958459 -25.555402 0.000000 0.000000

按条件过滤行,批量修改字段#

[251]:
# A、B 列均小于零的,C 列改为空值
mask = (df['A'] < 0) & (df['B'] < 0)
df.loc[mask, 'C'] = np.nan
df
[251]:
A B C D
2020-01-01 45.590705 44.016988 0.000000 0.766485
2020-01-02 -40.038550 -21.849155 NaN 0.000000
2020-01-03 -103.655878 -70.191427 NaN 0.057975
2020-01-04 159.087823 -4.525098 0.000000 0.000000
2020-01-05 -7.140684 25.066323 1.714977 0.262407
2020-01-06 2.958459 -25.555402 0.000000 0.000000

使用字典生成新的列#

[252]:
dict_str = """320281 江阴
320282 宜兴
320213 梁溪"""
# 生成一个字典,key是行政区划代码,value是行政区划名称
dict_rows = [row.split() for row in dict_str.split('\n')]
xzqh_dm_mc = dict([(item[0], item[1]) for item in dict_rows])
xzqh_dm_mc
df['行政区划代码'] = ['320281', '320282', '320283',
                '320281', '320282', '320213']

df['行政区划'] = df['行政区划代码'].map(xzqh_dm_mc)
df
[252]:
A B C D 行政区划代码 行政区划
2020-01-01 45.590705 44.016988 0.000000 0.766485 320281 江阴
2020-01-02 -40.038550 -21.849155 NaN 0.000000 320282 宜兴
2020-01-03 -103.655878 -70.191427 NaN 0.057975 320283 NaN
2020-01-04 159.087823 -4.525098 0.000000 0.000000 320281 江阴
2020-01-05 -7.140684 25.066323 1.714977 0.262407 320282 宜兴
2020-01-06 2.958459 -25.555402 0.000000 0.000000 320213 梁溪

删除列#

[253]:
df = get_random_df()
df.head()
[253]:
A B C D
2020-01-01 -1.587218 0.069797 0.714401 0.381977
2020-01-02 0.671095 -1.255362 -1.335127 0.220457
2020-01-03 -0.551792 0.174548 1.120456 -0.709983
2020-01-04 -0.171354 0.851806 -0.319238 -0.310731
2020-01-05 0.254203 0.074290 -1.541995 0.765277
[254]:
df.drop(df[['D', 'C']], axis=1, inplace=True)
df.head()
[254]:
A B
2020-01-01 -1.587218 0.069797
2020-01-02 0.671095 -1.255362
2020-01-03 -0.551792 0.174548
2020-01-04 -0.171354 0.851806
2020-01-05 0.254203 0.074290
[255]:
del df['A']
df.head()
[255]:
B
2020-01-01 0.069797
2020-01-02 -1.255362
2020-01-03 0.174548
2020-01-04 0.851806
2020-01-05 0.074290
[256]:
df = get_random_df()
df.drop(columns=['A', 'C'], axis=1).head()
[256]:
B D
2020-01-01 1.267116 1.102446
2020-01-02 -2.601673 -0.973597
2020-01-03 0.330555 0.101620
2020-01-04 -1.300182 -0.316561
2020-01-05 -0.856983 -0.837295
[257]:
# 只保留前两列,删除第三列开始的所有列
df = get_random_df()
df.drop(df.columns[2:], axis=1)
[257]:
A B
2020-01-01 1.054027 -1.710003
2020-01-02 -0.895784 0.699684
2020-01-03 -0.906341 1.058156
2020-01-04 -0.534992 -1.327501
2020-01-05 0.460481 -0.613649
2020-01-06 -0.418965 -1.288246

操作行#

添加一个空行#

[258]:
df = get_iris_df()
#df = get_random_df()
new_line = pd.Series([np.nan]*len(df.columns), # Fill cells with NaNs
                     index=df.columns)
# 方法一
df.loc[len(df)]=new_line

# 方法二
new_df = pd.concat([df, new_line.to_frame().T], axis=0, ignore_index=True)

# 方法三
# append 将被废弃
# new_df = df.append(new_line, ignore_index=True)

new_df.tail()
[258]:
petalLength petalWidth sepalLength sepalWidth species
147 5.2 2.0 6.5 3.0 virginica
148 5.4 2.3 6.2 3.4 virginica
149 5.1 1.8 5.9 3.0 virginica
150 NaN NaN NaN NaN NaN
151 NaN NaN NaN NaN NaN

删除行#

[259]:
# 删除 petalLength 为空的行
new_df[~(new_df['petalLength'].isnull())].tail()
[259]:
petalLength petalWidth sepalLength sepalWidth species
145 5.2 2.3 6.7 3.0 virginica
146 5.0 1.9 6.3 2.5 virginica
147 5.2 2.0 6.5 3.0 virginica
148 5.4 2.3 6.2 3.4 virginica
149 5.1 1.8 5.9 3.0 virginica
[260]:
# 删除索引为 1 和 5 的行
df = df.drop([1,5])
df.head()
[260]:
petalLength petalWidth sepalLength sepalWidth species
0 1.4 0.2 5.1 3.5 setosa
2 1.3 0.2 4.7 3.2 setosa
3 1.5 0.2 4.6 3.1 setosa
4 1.4 0.2 5.0 3.6 setosa
6 1.4 0.3 4.6 3.4 setosa

apply, applymap, map 区别#

  • apply:应用在DataFrame的行或列中;

  • applymap:应用在DataFrame的每个元素中;

  • map:应用在单独一列(Series)的每个元素中。

[261]:
df = pd.DataFrame(np.random.randn(2, 2), columns=list('AB'));df
[261]:
A B
0 -1.793254 -2.011889
1 -0.904923 -0.594622
[262]:
# 行的和
df.apply(np.sum)
[262]:
A   -2.698177
B   -2.606511
dtype: float64
[263]:
# 列的和
df.apply(np.sum, axis=1)
[263]:
0   -3.805143
1   -1.499545
dtype: float64
[264]:
# 保留两位小数
df.applymap(lambda x: '%.2f'%x)
[264]:
A B
0 -1.79 -2.01
1 -0.90 -0.59
[265]:
# 保留两位小数,只处理某一列,返回一个 DataFrame
df[['A']].applymap(lambda x: '%.2f'%x)
[265]:
A
0 -1.79
1 -0.90
[266]:
# 保留两位小数,只处理某一列,返回一个 Series
df['A'].map(lambda x: '%.2f'%x)
[266]:
0    -1.79
1    -0.90
Name: A, dtype: object

处理字符串#

把一列字符串拆分为两列#

[267]:
df = pd.DataFrame({"下发数量": ["下发数(6),未下发数(3)", "下发数(16),未下发数(7)"],
                   "下发时间": ["20220118007", "20220120009"]
                  })
df[['下发数', '去重数']] = df['下发数量'].str.extract(r'下发数\((\d+)\),未下发数((\d+)\)').astype('int')
df
[267]:
下发数量 下发时间 下发数 去重数
0 下发数(6),未下发数(3) 20220118007 6 3
1 下发数(16),未下发数(7) 20220120009 16 7
[268]:
firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})
firstlast["First_Name"] = firstlast["String"].str.split(" ", expand=True)[0]
firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[1]
firstlast[["First", "Last"]] = firstlast["String"].str.split(" ", expand=True)
firstlast
[268]:
String First_Name Last_Name First Last
0 John Smith John Smith John Smith
1 Jane Cook Jane Cook Jane Cook

合并字符串#

[269]:
df = pd.DataFrame({
    "First": ["John", "Jane"],
    "Last": ["Smith", np.nan],
    "Age":[12, np.nan]
})
df
[269]:
First Last Age
0 John Smith 12.0
1 Jane NaN NaN
[270]:
df['First'].str.cat(df['Last'], sep='.', na_rep='NO_DATA')
[270]:
0      John.Smith
1    Jane.NO_DATA
Name: First, dtype: object
[271]:
df['First'].str.cat(df['Age'].astype(str), sep=' - ', na_rep='NO_DATA')
[271]:
0    John - 12.0
1     Jane - nan
Name: First, dtype: object
[272]:
df['First'].str.cat([df['Last'], df['Age'].astype(str)], sep=' - ', na_rep='NO_DATA')
[272]:
0     John - Smith - 12.0
1    Jane - NO_DATA - nan
Name: First, dtype: object

对一列字符串进行处理,生成新的列#

[273]:
firstlast = pd.DataFrame({"string": ["John Smith", "Jane Cook"]})
firstlast["upper"] = firstlast["string"].str.upper()
firstlast["lower"] = firstlast["string"].str.lower()
firstlast["title"] = firstlast["string"].str.title()
firstlast
[273]:
string upper lower title
0 John Smith JOHN SMITH john smith John Smith
1 Jane Cook JANE COOK jane cook Jane Cook

截取字符串#

[274]:
firstlast['title'].str[5:12]
[274]:
0    Smith
1     Cook
Name: title, dtype: object

替换字符串#

[275]:
firstlast['title'].str.replace(
    r'o{2,}',
    'XXXX',
    regex=True
)
[275]:
0     John Smith
1    Jane CXXXXk
Name: title, dtype: object

空值处理(NaN)#

[276]:
# 计数有空值的行
nans = df.shape[0] - df.dropna().shape[0]
print(u'一共有 %d 行出现空值' % nans)

# 填充空值为`无`
df.fillna(value=u'无', inplace=True)
df.tail()
一共有 1 行出现空值
[276]:
First Last Age
0 John Smith 12.0
1 Jane

排序#

DataFrame 按列排序#

[277]:
df_ti = get_titanic_df()
df = df_ti[['pclass', 'sex', 'age']]
# 按年龄排序
# df.sort_values('age', ascending=True, inplace=True)
df.sort_values(by='age', ascending=True).head()
[277]:
pclass sex age
803 3 male 0.42
755 2 male 0.67
644 3 female 0.75
469 3 female 0.75
78 2 male 0.83
[278]:
df.sort_values(by=['age', 'pclass'], ascending=True)[5:15]
[278]:
pclass sex age
831 2 male 0.83
305 1 male 0.92
183 2 male 1.00
827 2 male 1.00
164 3 male 1.00
172 3 female 1.00
381 3 female 1.00
386 3 male 1.00
788 3 male 1.00
297 1 female 2.00
[279]:
# 排序后重新编制索引
df.index = range(1,len(df.index)+1)
df.head()
[279]:
pclass sex age
1 3 male 22.0
2 1 female 38.0
3 3 female 26.0
4 1 female 35.0
5 3 male 35.0

Series 排序#

按指定顺序排序#

[280]:
dic = {'101': '香蕉', '102': '苹果', '181': '葡萄'}
s = pd.Series(dic)
my_order = ['101', '181', '102']
s.reindex(my_order)
[280]:
101    香蕉
181    葡萄
102    苹果
dtype: object

按 index 排序#

[281]:
s.sort_index()
[281]:
101    香蕉
102    苹果
181    葡萄
dtype: object

按值排序#

[282]:
s.sort_values()
[282]:
102    苹果
181    葡萄
101    香蕉
dtype: object
[283]:
s.sort_values(ascending=False)
[283]:
101    香蕉
181    葡萄
102    苹果
dtype: object

## 去重

[284]:
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie',pd.NA,pd.NA],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack',pd.NA,pd.NA],
    'rating': [4, 4, 3.5, 15, 5,pd.NA,pd.NA]
})
df
[284]:
brand style rating
0 Yum Yum cup 4
1 Yum Yum cup 4
2 Indomie cup 3.5
3 Indomie pack 15
4 Indomie pack 5
5 <NA> <NA> <NA>
6 <NA> <NA> <NA>
[285]:
df.drop_duplicates()  # 所有列相同的去重
[285]:
brand style rating
0 Yum Yum cup 4
2 Indomie cup 3.5
3 Indomie pack 15
4 Indomie pack 5
5 <NA> <NA> <NA>
[286]:
df.drop_duplicates(subset=['brand'])  # 按某一列去重
[286]:
brand style rating
0 Yum Yum cup 4
2 Indomie cup 3.5
5 <NA> <NA> <NA>
[287]:
# 按两列去重,保留最后一个
df.drop_duplicates(subset=['brand', 'style'], keep='last')
[287]:
brand style rating
1 Yum Yum cup 4
2 Indomie cup 3.5
4 Indomie pack 5
6 <NA> <NA> <NA>

排名#

[288]:
df = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog',
                                   'spider', 'snake'],
                        'Number_legs': [4, 2, 4, 8, np.nan]})
df
[288]:
Animal Number_legs
0 cat 4.0
1 penguin 2.0
2 dog 4.0
3 spider 8.0
4 snake NaN
[289]:
df['default_rank'] = df['Number_legs'].rank()
df['max_rank'] = df['Number_legs'].rank(method='max')
df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
df['pct_rank'] = df['Number_legs'].rank(pct=True)
df['max_asc_rank'] = df['Number_legs'].rank(method='min', ascending=False)
df
[289]:
Animal Number_legs default_rank max_rank NA_bottom pct_rank max_asc_rank
0 cat 4.0 2.5 3.0 2.5 0.625 2.0
1 penguin 2.0 1.0 1.0 1.0 0.250 4.0
2 dog 4.0 2.5 3.0 2.5 0.625 2.0
3 spider 8.0 4.0 4.0 4.0 1.000 1.0
4 snake NaN NaN NaN 5.0 NaN NaN

按年龄分组#

[290]:
datas = [15, 16, 17, 18, 91, 56, 44]
bins = [0, 16, 45, 80, 100]
labels = ['儿童', '青年', '中年', '老年']
age_cut = pd.cut(datas, bins, True,labels)
age_cut
[290]:
['儿童', '儿童', '青年', '青年', '老年', '中年', '青年']
Categories (4, object): ['儿童' < '青年' < '中年' < '老年']
[291]:
age_cut.value_counts()
[291]:
儿童    2
青年    3
中年    1
老年    1
dtype: int64

输出示例#

[292]:
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})
df
[292]:
brand style rating
0 Yum Yum cup 4.0
1 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
[293]:
df.to_dict()
[293]:
{'brand': {0: 'Yum Yum',
  1: 'Yum Yum',
  2: 'Indomie',
  3: 'Indomie',
  4: 'Indomie'},
 'style': {0: 'cup', 1: 'cup', 2: 'cup', 3: 'pack', 4: 'pack'},
 'rating': {0: 4.0, 1: 4.0, 2: 3.5, 3: 15.0, 4: 5.0}}
[294]:
df.to_records()
[294]:
rec.array([(0, 'Yum Yum', 'cup',  4. ), (1, 'Yum Yum', 'cup',  4. ),
           (2, 'Indomie', 'cup',  3.5), (3, 'Indomie', 'pack', 15. ),
           (4, 'Indomie', 'pack',  5. )],
          dtype=[('index', '<i8'), ('brand', 'O'), ('style', 'O'), ('rating', '<f8')])
[295]:
df.to_csv()
[295]:
',brand,style,rating\n0,Yum Yum,cup,4.0\n1,Yum Yum,cup,4.0\n2,Indomie,cup,3.5\n3,Indomie,pack,15.0\n4,Indomie,pack,5.0\n'
[296]:
print(df.to_csv())
,brand,style,rating
0,Yum Yum,cup,4.0
1,Yum Yum,cup,4.0
2,Indomie,cup,3.5
3,Indomie,pack,15.0
4,Indomie,pack,5.0

[297]:
for i in df.iterrows():
    print(i[0])
    print(type(i[1]))
    print(i[1])
0
<class 'pandas.core.series.Series'>
brand     Yum Yum
style         cup
rating        4.0
Name: 0, dtype: object
1
<class 'pandas.core.series.Series'>
brand     Yum Yum
style         cup
rating        4.0
Name: 1, dtype: object
2
<class 'pandas.core.series.Series'>
brand     Indomie
style         cup
rating        3.5
Name: 2, dtype: object
3
<class 'pandas.core.series.Series'>
brand     Indomie
style        pack
rating       15.0
Name: 3, dtype: object
4
<class 'pandas.core.series.Series'>
brand     Indomie
style        pack
rating        5.0
Name: 4, dtype: object
[298]:
df.items()
[298]:
<generator object DataFrame.items at 0x7fe75ca78c80>
[299]:
for i in df.items():
    print(i)
    print(type(i))
('brand', 0    Yum Yum
1    Yum Yum
2    Indomie
3    Indomie
4    Indomie
Name: brand, dtype: object)
<class 'tuple'>
('style', 0     cup
1     cup
2     cup
3    pack
4    pack
Name: style, dtype: object)
<class 'tuple'>
('rating', 0     4.0
1     4.0
2     3.5
3    15.0
4     5.0
Name: rating, dtype: float64)
<class 'tuple'>

图形化#

[300]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
[300]:
<Axes: >
../_images/pandas_pandas_note_206_1.png
[301]:
df = get_random_df()
df.plot(y=['A', 'B'])
[301]:
<Axes: >
../_images/pandas_pandas_note_207_1.png
[302]:
df_iris = get_iris_df()
df_iris.sample(5)
[302]:
petalLength petalWidth sepalLength sepalWidth species
127 4.9 1.8 6.1 3.0 virginica
134 5.6 1.4 6.1 2.6 virginica
28 1.4 0.2 5.2 3.4 setosa
130 6.1 1.9 7.4 2.8 virginica
51 4.5 1.5 6.4 3.2 versicolor
[303]:
# 线性回归
sns.regplot(x='petalLength', y='petalWidth',data=df_iris)
[303]:
<Axes: xlabel='petalLength', ylabel='petalWidth'>
../_images/pandas_pandas_note_209_1.png

复制数据#

复制 Frame 或者 Series 的数据使用 copy 方法。

Signature: df.copy(deep: 'bool_t' = True) -> 'FrameOrSeries'

参数 deep 默认为 True

[304]:
df = get_random_df()
df_deep = df.copy()
df_shallow = df.copy(deep=False)
df.iloc[1,1] = 0
[305]:
df
[305]:
A B C D
2020-01-01 1.334778 0.590552 -0.191088 -0.570773
2020-01-02 0.563330 0.000000 -0.388838 1.082667
2020-01-03 -0.395530 0.022657 2.374852 -0.936310
2020-01-04 0.095474 -0.013410 -0.949512 -1.027845
2020-01-05 0.268630 1.993076 0.120371 -0.165745
2020-01-06 -0.026002 0.474532 0.353869 -0.868767
[306]:
df_deep
[306]:
A B C D
2020-01-01 1.334778 0.590552 -0.191088 -0.570773
2020-01-02 0.563330 1.888045 -0.388838 1.082667
2020-01-03 -0.395530 0.022657 2.374852 -0.936310
2020-01-04 0.095474 -0.013410 -0.949512 -1.027845
2020-01-05 0.268630 1.993076 0.120371 -0.165745
2020-01-06 -0.026002 0.474532 0.353869 -0.868767
[307]:
df_shallow
[307]:
A B C D
2020-01-01 1.334778 0.590552 -0.191088 -0.570773
2020-01-02 0.563330 0.000000 -0.388838 1.082667
2020-01-03 -0.395530 0.022657 2.374852 -0.936310
2020-01-04 0.095474 -0.013410 -0.949512 -1.027845
2020-01-05 0.268630 1.993076 0.120371 -0.165745
2020-01-06 -0.026002 0.474532 0.353869 -0.868767

注意:当值为嵌套的结构时,深复制不是递归的。下层原数据变更时也会影响复制后的数据。

[308]:
s = pd.Series([[1, 2], [3, 4]])
deep = s.copy()
s[0][0] = 10  # 这个会影响复制后的数据
s[1] = [5, 6] # 这个不会影响复制后的数据
s
[308]:
0    [10, 2]
1     [5, 6]
dtype: object
[309]:
deep
[309]:
0    [10, 2]
1     [3, 4]
dtype: object

导出数据#

EXCEL#

[310]:
# df.to_excel(output_path)
# 多个 sheet
# with pd.ExcelWriter(output_path) as w:
#     df1.to_excel(w, sheet_name='first_sheet')
#     df2.to_excel(w, sheet_name='second_sheet')

CSV#

[311]:
# 导出周六的数据,格式为 CSV
# df[ (df['Day'] == 'Sat') ].to_csv('test_tmp.csv', mode='w', encoding='utf-8', index=False)

#在前面的文件中追加周日的数据
# df[ (df['Day'] == 'Sun') ].to_csv('test_output.csv', mode='a', header=False, encoding='utf-8', index=False)

Dict#

[312]:
# 输出为 dict 格式
# DataFrame.to_dict可以接受 ‘dict’, ‘list’, ‘series’, ‘split’, ‘records’, ‘index’
df = pd.DataFrame({'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df
[312]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
[313]:
import pprint
for p in ['dict', 'list', 'series', 'split', 'records', 'index']:
    print(f'Parameters:{p}')
    pprint.pprint(df.to_dict(p))
    print('-----------------------------------------------------')
Parameters:dict
{'AAA': {0: 4, 1: 5, 2: 6, 3: 7},
 'BBB': {0: 10, 1: 20, 2: 30, 3: 40},
 'CCC': {0: 100, 1: 50, 2: -30, 3: -50}}
-----------------------------------------------------
Parameters:list
{'AAA': [4, 5, 6, 7], 'BBB': [10, 20, 30, 40], 'CCC': [100, 50, -30, -50]}
-----------------------------------------------------
Parameters:series
{'AAA': 0    4
1    5
2    6
3    7
Name: AAA, dtype: int64,
 'BBB': 0    10
1    20
2    30
3    40
Name: BBB, dtype: int64,
 'CCC': 0    100
1     50
2    -30
3    -50
Name: CCC, dtype: int64}
-----------------------------------------------------
Parameters:split
{'columns': ['AAA', 'BBB', 'CCC'],
 'data': [[4, 10, 100], [5, 20, 50], [6, 30, -30], [7, 40, -50]],
 'index': [0, 1, 2, 3]}
-----------------------------------------------------
Parameters:records
[{'AAA': 4, 'BBB': 10, 'CCC': 100},
 {'AAA': 5, 'BBB': 20, 'CCC': 50},
 {'AAA': 6, 'BBB': 30, 'CCC': -30},
 {'AAA': 7, 'BBB': 40, 'CCC': -50}]
-----------------------------------------------------
Parameters:index
{0: {'AAA': 4, 'BBB': 10, 'CCC': 100},
 1: {'AAA': 5, 'BBB': 20, 'CCC': 50},
 2: {'AAA': 6, 'BBB': 30, 'CCC': -30},
 3: {'AAA': 7, 'BBB': 40, 'CCC': -50}}
-----------------------------------------------------

Cheat Sheet#

英文:https://www.dataquest.io/blog/pandas-cheat-sheet/

中文翻译:http://blog.csdn.net/qq_33399185/article/details/60872853

相关资源#