本文主要是介绍Pandas操作题,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!
Pandas操作题
1.学生数据分析
1.数据导入并展示, 文件位置位于../data/下
- 请将文件student-info.csv中的内容读入pandas DataFrame "df_info",注意该文件的分隔符是分号,并展示前五行
- 请将文件student-score.csv中的内容读入pandas DataFrame "df_score",注意该文件的分隔符是分号,并展示前五行
In [1]:
import pandas as pd
df_info = pd.read_csv("C:/Users/PycharmProjects/pythonProject/ML/Projects/课程/资料/homework/data/student-info.csv",sep = ';')
df_info.head(5)
Out[1]:
ID | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | ... | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | ... | yes | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 6 |
1 | 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | ... | yes | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 4 |
2 | 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | ... | yes | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 10 |
3 | 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | ... | yes | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 2 |
4 | 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | ... | yes | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 4 |
5 rows × 31 columns
In [3]:
import pandas as pd
df_score = pd.read_csv("C:/Users/PycharmProjects/pythonProject/ML/Projects/课程/资料/homework/data/student-score.csv",sep = ';')
df_score.head(5)
Out[3]:
ID | G1 | G2 | G3 | |
---|---|---|---|---|
0 | 0 | 5 | 6 | 6 |
1 | 1 | 5 | 5 | 6 |
2 | 2 | 7 | 8 | 10 |
3 | 3 | 15 | 14 | 15 |
4 | 4 | 6 | 10 | 10 |
2.把两个DataFrame "df_info" 和 "df_score" 按照学生ID对齐拼接起来,结果存在一个新DataFrame "df" 中
In [55]:
# df = pd.concat([df_info,df_score], axis=1)
# df_info.merge(df_score)
df = pd.merge(df_info,df_score,on=['ID'])
3.把"df"所有的列展示出来
In [56]:
df
Out[56]:
ID | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
1 | 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
2 | 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
3 | 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
4 | 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
390 | 390 | MS | M | 20 | U | LE3 | A | 2 | 2 | services | ... | 5 | 5 | 4 | 4 | 5 | 4 | 11 | 9 | 9 | 9 |
391 | 391 | MS | M | 17 | U | LE3 | T | 3 | 1 | services | ... | 2 | 4 | 5 | 3 | 4 | 2 | 3 | 14 | 16 | 16 |
392 | 392 | MS | M | 21 | R | GT3 | T | 1 | 1 | other | ... | 5 | 5 | 3 | 3 | 3 | 3 | 3 | 10 | 8 | 7 |
393 | 393 | MS | M | 18 | R | LE3 | T | 3 | 2 | services | ... | 4 | 4 | 1 | 3 | 4 | 5 | 0 | 11 | 12 | 10 |
394 | 394 | MS | M | 19 | U | LE3 | T | 1 | 1 | other | ... | 3 | 2 | 3 | 3 | 3 | 5 | 5 | 8 | 9 | 9 |
395 rows × 34 columns
In [16]:
df.columns
Out[16]:
Index(['ID', 'school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu','Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime','failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery','higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc','Walc', 'health', 'absences', 'ID', 'G1', 'G2', 'G3'],dtype='object')
4.统计男生的平均分和女生的平均分(G1, G2, G3),使用groupby和aggregate来操作
In [59]:
group_by_sex = df.groupby('sex').agg('mean')
group_by_sex[['G1','G2','G3']]
#group_by_sex = df.groupby('sex').mean()
#group_by_sex[['G1','G2','G3']]
Out[59]:
G1 | G2 | G3 | |
---|---|---|---|
sex | |||
F | 10.620192 | 10.389423 | 9.966346 |
M | 11.229947 | 11.074866 | 10.914439 |
In [60]:
group_by_agg = group_by_sex.aggregate('mean')
group_by_agg[['G1','G2','G3']]
Out[60]:
G1 10.925069 G2 10.732145 G3 10.440392 dtype: float64
5.统计不同年龄的学生(G1, G2, G3)的平均分,并作出柱状图展示
In [70]:
group_by_age = df.groupby('age').mean()
#group_by_age[['G1','G2','G3']]
group_by_age[['G1','G2','G3']].plot(kind='bar')
Out[70]:
<AxesSubplot:xlabel='age'>
6.统计不同学校(school一列)的学生的平均分,并作柱状图展示,要求同一科目的两个学校成绩贴在一起展示
In [88]:
group_by_school=df.groupby('school').agg('mean').transpose()
group_by_school.plot(kind='bar')
#roup_by_school.T.plot(kind='bar')
Out[88]:
<AxesSubplot:>
7.统计一下每个学校三个科目总分的前十名,然后把前十名的同学所有信息输出
In [201]:
#group_by_school=df.groupby(['school']).agg('sum').transpose()
# group_by_school['G1','G2','G3']
df["total"] = df["G1"] + df["G2"]+ df["G3"]
df.groupb
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-201-88778c814a96> in <module>3 4 df["total"] = df["G1"] + df["G2"]+ df["G3"] ----> 5 df.groupby['school']6 # df.groupby(['school','total'])7 TypeError: 'method' object is not subscriptable
2.沃尔玛销售数据整理
1. 数据导入并展示, 文件位置位于 ../data下
- 请将文件 stores.csv 中的内容读入pandas DataFrame "df_stores",并展示前五行以及打印df_stores的shape
In [211]:
import pandas as pd
df_stores = pd.read_csv(r"C:\Users\PycharmProjects\pythonProject\ML\Projects\课程\资料\homework\data\walmart\walmart\stores.csv")
df_stores.head(5)
Out[211]:
Store | Type | Size | |
---|---|---|---|
0 | 1 | A | 151315 |
1 | 2 | A | 202307 |
2 | 3 | B | 37392 |
3 | 4 | A | 205863 |
4 | 5 | B | 34875 |
In [212]:
df_stores.shape
Out[212]:
(45, 3)
- 请将文件 features.csv 中的内容读入pandas DataFrame "df_feats",将df_feats改成只留下以下几个columns: "Store", "Date", "Temperature", "Fuel_Price", "CPI", "Unemployment", 并展示前五行以及打印df_feats的shape
In [215]:
df_feats = pd.read_csv(r"C:\Users\PycharmProjects\pythonProject\ML\Projects\课程\资料\homework\data\walmart\walmart\features.csv")
df_feats.loc[:,["Store", "Date", "Temperature", "Fuel_Price", "CPI", "Unemployment"]].head()
Out[215]:
Store | Date | Temperature | Fuel_Price | CPI | Unemployment | |
---|---|---|---|---|---|---|
0 | 1 | 2010-02-05 | 42.31 | 2.572 | 211.096358 | 8.106 |
1 | 1 | 2010-02-12 | 38.51 | 2.548 | 211.242170 | 8.106 |
2 | 1 | 2010-02-19 | 39.93 | 2.514 | 211.289143 | 8.106 |
3 | 1 | 2010-02-26 | 46.63 | 2.561 | 211.319643 | 8.106 |
4 | 1 | 2010-03-05 | 46.50 | 2.625 | 211.350143 | 8.106 |
In [217]:
df_feats.shape
Out[217]:
(8190, 12)
- 请将文件 sales.csv 中的内容读入pandas DataFrame "df_sales",并展示前五行以及打印df_sales的shape
In [219]:
df_sales = pd.read_csv(r"C:\Users\PycharmProjects\pythonProject\ML\Projects\课程\资料\homework\data\walmart\walmart\sales.csv")
df_sales.head(5)
Out[219]:
Store | Dept | Date | Weekly_Sales | IsHoliday | |
---|---|---|---|---|---|
0 | 1 | 1 | 2010-02-05 | 24924.50 | False |
1 | 1 | 1 | 2010-02-12 | 46039.49 | True |
2 | 1 | 1 | 2010-02-19 | 41595.55 | False |
3 | 1 | 1 | 2010-02-26 | 19403.54 | False |
4 | 1 | 1 | 2010-03-05 | 21827.90 | False |
In [220]:
df_sales.shape
Out[220]:
(421570, 5)
8. 我们发现 df_sales 中每个Store有很多个Dept(department), 我们只关心整家店的销售情况,所以请把这张表格按照Store和Date来计算整店销售总额,IsHoliday这一列不必保留。结果保存在df_sales_by_store中
In [257]:
df_sales_by_store=df_sales.loc[:,["Store", "Dept", "Date", "Weekly_Sales"]].groupby(['Store','Date']).sum()
df_sales_by_store
Out[257]:
Dept | Weekly_Sales | ||
---|---|---|---|
Store | Date | ||
1 | 2010-02-05 | 3184 | 1643690.90 |
2010-02-12 | 3137 | 1641957.44 | |
2010-02-19 | 3184 | 1611968.17 | |
2010-02-26 | 3137 | 1409727.59 | |
2010-03-05 | 3215 | 1554806.68 | |
... | ... | ... | ... |
45 | 2012-09-28 | 2853 | 713173.95 |
2012-10-05 | 2952 | 733455.07 | |
2012-10-12 | 2853 | 734464.36 | |
2012-10-19 | 2853 | 718125.53 | |
2012-10-26 | 2907 | 760281.43 |
6435 rows × 2 columns
9. 使用 Store 和 Date 这两列合并 df_sales_by_store 和 df_feats 两张表格,结果保存在 df_sale_feats 。
In [261]:
df_sale_feats = pd.merge(df_sales_by_store,df_feats,on=['Store','Date'])
df_sale_feats
Out[261]:
Store | Date | Dept | Weekly_Sales | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | IsHoliday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2010-02-05 | 3184 | 1643690.90 | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | False |
1 | 1 | 2010-02-12 | 3137 | 1641957.44 | 38.51 | 2.548 | NaN | NaN | NaN | NaN | NaN | 211.242170 | 8.106 | True |
2 | 1 | 2010-02-19 | 3184 | 1611968.17 | 39.93 | 2.514 | NaN | NaN | NaN | NaN | NaN | 211.289143 | 8.106 | False |
3 | 1 | 2010-02-26 | 3137 | 1409727.59 | 46.63 | 2.561 | NaN | NaN | NaN | NaN | NaN | 211.319643 | 8.106 | False |
4 | 1 | 2010-03-05 | 3215 | 1554806.68 | 46.50 | 2.625 | NaN | NaN | NaN | NaN | NaN | 211.350143 | 8.106 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6430 | 45 | 2012-09-28 | 2853 | 713173.95 | 64.88 | 3.997 | 4556.61 | 20.64 | 1.50 | 1601.01 | 3288.25 | 192.013558 | 8.684 | False |
6431 | 45 | 2012-10-05 | 2952 | 733455.07 | 64.89 | 3.985 | 5046.74 | NaN | 18.82 | 2253.43 | 2340.01 | 192.170412 | 8.667 | False |
6432 | 45 | 2012-10-12 | 2853 | 734464.36 | 54.47 | 4.000 | 1956.28 | NaN | 7.89 | 599.32 | 3990.54 | 192.327265 | 8.667 | False |
6433 | 45 | 2012-10-19 | 2853 | 718125.53 | 56.47 | 3.969 | 2004.02 | NaN | 3.18 | 437.73 | 1537.49 | 192.330854 | 8.667 | False |
6434 | 45 | 2012-10-26 | 2907 | 760281.43 | 58.85 | 3.882 | 4018.91 | 58.08 | 100.00 | 211.94 | 858.33 | 192.308899 | 8.667 | False |
6435 rows × 14 columns
10. 使用 Store 这一列合并 df_sale_feats 和 df_stores 两张表格,结果仍然保留在df_sale_feats 中。
In [266]:
df_sale_feats = pd.merge(df_sale_feats,df_stores,on=['Store'])
df_sale_feats
Out[266]:
Store | Date | Dept | Weekly_Sales | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | IsHoliday | Type | Size | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2010-02-05 | 3184 | 1643690.90 | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | False | A | 151315 |
1 | 1 | 2010-02-12 | 3137 | 1641957.44 | 38.51 | 2.548 | NaN | NaN | NaN | NaN | NaN | 211.242170 | 8.106 | True | A | 151315 |
2 | 1 | 2010-02-19 | 3184 | 1611968.17 | 39.93 | 2.514 | NaN | NaN | NaN | NaN | NaN | 211.289143 | 8.106 | False | A | 151315 |
3 | 1 | 2010-02-26 | 3137 | 1409727.59 | 46.63 | 2.561 | NaN | NaN | NaN | NaN | NaN | 211.319643 | 8.106 | False | A | 151315 |
4 | 1 | 2010-03-05 | 3215 | 1554806.68 | 46.50 | 2.625 | NaN | NaN | NaN | NaN | NaN | 211.350143 | 8.106 | False | A | 151315 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6430 | 45 | 2012-09-28 | 2853 | 713173.95 | 64.88 | 3.997 | 4556.61 | 20.64 | 1.50 | 1601.01 | 3288.25 | 192.013558 | 8.684 | False | B | 118221 |
6431 | 45 | 2012-10-05 | 2952 | 733455.07 | 64.89 | 3.985 | 5046.74 | NaN | 18.82 | 2253.43 | 2340.01 | 192.170412 | 8.667 | False | B | 118221 |
6432 | 45 | 2012-10-12 | 2853 | 734464.36 | 54.47 | 4.000 | 1956.28 | NaN | 7.89 | 599.32 | 3990.54 | 192.327265 | 8.667 | False | B | 118221 |
6433 | 45 | 2012-10-19 | 2853 | 718125.53 | 56.47 | 3.969 | 2004.02 | NaN | 3.18 | 437.73 | 1537.49 | 192.330854 | 8.667 | False | B | 118221 |
6434 | 45 | 2012-10-26 | 2907 | 760281.43 | 58.85 | 3.882 | 4018.91 | 58.08 | 100.00 | 211.94 | 858.33 | 192.308899 | 8.667 | False | B | 118221 |
6435 rows × 16 columns
In [ ]:
这篇关于Pandas操作题的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!