Datacamp 笔记代码 Supervised Learning with scikit-learn 第四章 Preprocessing and pipelines

本文主要是介绍Datacamp 笔记代码 Supervised Learning with scikit-learn 第四章 Preprocessing and pipelines,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

更多原始数据文档和JupyterNotebook
Github: https://github.com/JinnyR/Datacamp_DataScienceTrack_Python

Datacamp track: Data Scientist with Python - Course 21 (4)

Exercise

Exploring categorical features

The Gapminder dataset that you worked with in previous chapters also contained a categorical 'Region' feature, which we dropped in previous exercises since you did not have the tools to deal with it. Now however, you do, so we have added it back in!

Your job in this exercise is to explore this feature. Boxplots are particularly useful for visualizing categorical features such as this.

Instruction

  • Import pandas as pd.
  • Read the CSV file 'gapminder.csv' into a DataFrame called df.
  • Use pandas to create a boxplot showing the variation of life expectancy ('life') by region ('Region'). To do so, pass the column names in to df.boxplot() (in that order).
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from urllib.request import urlretrieve
fn = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1939/datasets/gm_2008_region.csv'
urlretrieve(fn, 'gapminder.csv')
('gapminder.csv', <http.client.HTTPMessage at 0x11515f0b8>)
# Import pandas
import pandas as pd# Read 'gapminder.csv' into a DataFrame: df
df = pd.read_csv('gapminder.csv')# Create a boxplot of life expectancy per region
df.boxplot('life', 'Region', rot=60)# Show the plot
plt.show()

[外链图片转存失败(img-TzFIQdvn-1564048295763)(output_2_0.png)]

Exercise

Creating dummy variables

As Andy discussed in the video, scikit-learn does not accept non-numerical features. You saw in the previous exercise that the 'Region' feature contains very useful information that can predict life expectancy. For example, Sub-Saharan Africa has a lower life expectancy compared to Europe and Central Asia. Therefore, if you are trying to predict life expectancy, it would be preferable to retain the 'Region' feature. To do this, you need to binarize it by creating dummy variables, which is what you will do in this exercise.

Instruction

  • Use the pandas get_dummies() function to create dummy variables from the df DataFrame. Store the result as df_region.
  • Print the columns of df_region. This has been done for you.
  • Use the get_dummies() function again, this time specifying drop_first=True to drop the unneeded dummy variable (in this case, 'Region_America').
  • Hit 'Submit Answer to print the new columns of df_region and take note of how one column was dropped!
# Create dummy variables: df_region
df_region = pd.get_dummies(df)# Print the columns of df_region
print(df_region.columns)# Create dummy variables with drop_first=True: df_region
df_region = pd.get_dummies(df, drop_first=True)# Print the new columns of df_region
print(df_region.columns)
Index(['population', 'fertility', 'HIV', 'CO2', 'BMI_male', 'GDP','BMI_female', 'life', 'child_mortality', 'Region_America','Region_East Asia & Pacific', 'Region_Europe & Central Asia','Region_Middle East & North Africa', 'Region_South Asia','Region_Sub-Saharan Africa'],dtype='object')
Index(['population', 'fertility', 'HIV', 'CO2', 'BMI_male', 'GDP','BMI_female', 'life', 'child_mortality', 'Region_East Asia & Pacific','Region_Europe & Central Asia', 'Region_Middle East & North Africa','Region_South Asia', 'Region_Sub-Saharan Africa'],dtype='object')

Exercise

Regression with categorical features

Having created the dummy variables from the 'Region'feature, you can build regression models as you did before. Here, you’ll use ridge regression to perform 5-fold cross-validation.

The feature array X and target variable array y have been pre-loaded.

Instruction

  • Import Ridge from sklearn.linear_model and cross_val_score from sklearn.model_selection.
  • Instantiate a ridge regressor called ridge with alpha=0.5 and normalize=True.
  • Perform 5-fold cross-validation on X and y using the cross_val_score() function.
  • Print the cross-validated scores.
# modified/added by Jinny
import pandas as pd
import numpy as npfrom urllib.request import urlretrievefn = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1939/datasets/gm_2008_region.csv'
urlretrieve(fn, 'gapminder.csv')df = pd.read_csv('gapminder.csv')df_region = pd.get_dummies(df)df_region = df_region.drop('Region_America', axis=1)X = df_region.drop('life', axis=1).valuesy = df_region['life'].values
# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score# Instantiate a ridge regressor: ridge
ridge = Ridge(alpha=0.5, normalize=True)# Perform 5-fold cross-validation: ridge_cv
ridge_cv = cross_val_score(ridge, X, y, cv=5)# Print the cross-validated scores
print(ridge_cv)
[0.86808336 0.80623545 0.84004203 0.7754344  0.87503712]

Exercise

Dropping missing data

The voting dataset from Chapter 1 contained a bunch of missing values that we dealt with for you behind the scenes. Now, it’s time for you to take care of these yourself!

The unprocessed dataset has been loaded into a DataFrame df. Explore it in the IPython Shell with the .head()method. You will see that there are certain data points labeled with a '?'. These denote missing values. As you saw in the video, different datasets encode missing values in different ways. Sometimes it may be a '9999', other times a 0 - real-world data can be very messy! If you’re lucky, the missing values will already be encoded as NaN. We use NaNbecause it is an efficient and simplified way of internally representing missing data, and it lets us take advantage of pandas methods such as .dropna() and .fillna(), as well as scikit-learn’s Imputation transformer Imputer

这篇关于Datacamp 笔记代码 Supervised Learning with scikit-learn 第四章 Preprocessing and pipelines的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/1028848

相关文章

python实现pdf转word和excel的示例代码

《python实现pdf转word和excel的示例代码》本文主要介绍了python实现pdf转word和excel的示例代码,文中通过示例代码介绍的非常详细,对大家的学习或者工作具有一定的参考学习价... 目录一、引言二、python编程1,PDF转Word2,PDF转Excel三、前端页面效果展示总结一

在MyBatis的XML映射文件中<trim>元素所有场景下的完整使用示例代码

《在MyBatis的XML映射文件中<trim>元素所有场景下的完整使用示例代码》在MyBatis的XML映射文件中,trim元素用于动态添加SQL语句的一部分,处理前缀、后缀及多余的逗号或连接符,示... 在MyBATis的XML映射文件中,<trim>元素用于动态地添加SQL语句的一部分,例如SET或W

使用C#代码计算数学表达式实例

《使用C#代码计算数学表达式实例》这段文字主要讲述了如何使用C#语言来计算数学表达式,该程序通过使用Dictionary保存变量,定义了运算符优先级,并实现了EvaluateExpression方法来... 目录C#代码计算数学表达式该方法很长,因此我将分段描述下面的代码片段显示了下一步以下代码显示该方法如

python多进程实现数据共享的示例代码

《python多进程实现数据共享的示例代码》本文介绍了Python中多进程实现数据共享的方法,包括使用multiprocessing模块和manager模块这两种方法,具有一定的参考价值,感兴趣的可以... 目录背景进程、进程创建进程间通信 进程间共享数据共享list实践背景 安卓ui自动化框架,使用的是

SpringBoot生成和操作PDF的代码详解

《SpringBoot生成和操作PDF的代码详解》本文主要介绍了在SpringBoot项目下,通过代码和操作步骤,详细的介绍了如何操作PDF,希望可以帮助到准备通过JAVA操作PDF的你,项目框架用的... 目录本文简介PDF文件简介代码实现PDF操作基于PDF模板生成,并下载完全基于代码生成,并保存合并P

SpringBoot基于MyBatis-Plus实现Lambda Query查询的示例代码

《SpringBoot基于MyBatis-Plus实现LambdaQuery查询的示例代码》MyBatis-Plus是MyBatis的增强工具,简化了数据库操作,并提高了开发效率,它提供了多种查询方... 目录引言基础环境配置依赖配置(Maven)application.yml 配置表结构设计demo_st

SpringCloud集成AlloyDB的示例代码

《SpringCloud集成AlloyDB的示例代码》AlloyDB是GoogleCloud提供的一种高度可扩展、强性能的关系型数据库服务,它兼容PostgreSQL,并提供了更快的查询性能... 目录1.AlloyDBjavascript是什么?AlloyDB 的工作原理2.搭建测试环境3.代码工程1.

Java调用Python代码的几种方法小结

《Java调用Python代码的几种方法小结》Python语言有丰富的系统管理、数据处理、统计类软件包,因此从java应用中调用Python代码的需求很常见、实用,本文介绍几种方法从java调用Pyt... 目录引言Java core使用ProcessBuilder使用Java脚本引擎总结引言python

Java中ArrayList的8种浅拷贝方式示例代码

《Java中ArrayList的8种浅拷贝方式示例代码》:本文主要介绍Java中ArrayList的8种浅拷贝方式的相关资料,讲解了Java中ArrayList的浅拷贝概念,并详细分享了八种实现浅... 目录引言什么是浅拷贝?ArrayList 浅拷贝的重要性方法一:使用构造函数方法二:使用 addAll(

JAVA利用顺序表实现“杨辉三角”的思路及代码示例

《JAVA利用顺序表实现“杨辉三角”的思路及代码示例》杨辉三角形是中国古代数学的杰出研究成果之一,是我国北宋数学家贾宪于1050年首先发现并使用的,:本文主要介绍JAVA利用顺序表实现杨辉三角的思... 目录一:“杨辉三角”题目链接二:题解代码:三:题解思路:总结一:“杨辉三角”题目链接题目链接:点击这里