This post presents my DataCamp notes and code for Supervised Learning with scikit-learn, Chapter 4: Preprocessing and pipelines. I hope it offers a useful reference for developers tackling these exercises; feel free to follow along!
The original data files and Jupyter notebooks are available here:
Github: https://github.com/JinnyR/Datacamp_DataScienceTrack_Python
Datacamp track: Data Scientist with Python - Course 21 (4)
Exercise
Exploring categorical features
The Gapminder dataset that you worked with in previous chapters also contained a categorical 'Region'
feature, which we dropped in previous exercises since you did not have the tools to deal with it. Now however, you do, so we have added it back in!
Your job in this exercise is to explore this feature. Boxplots are particularly useful for visualizing categorical features such as this.
Instruction
- Import pandas as pd.
- Read the CSV file 'gapminder.csv' into a DataFrame called df.
- Use pandas to create a boxplot showing the variation of life expectancy ('life') by region ('Region'). To do so, pass the column names in to df.boxplot() (in that order).
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from urllib.request import urlretrieve
fn = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1939/datasets/gm_2008_region.csv'
urlretrieve(fn, 'gapminder.csv')
('gapminder.csv', <http.client.HTTPMessage at 0x11515f0b8>)
# Import pandas
import pandas as pd

# Read 'gapminder.csv' into a DataFrame: df
df = pd.read_csv('gapminder.csv')

# Create a boxplot of life expectancy per region
df.boxplot('life', 'Region', rot=60)

# Show the plot
plt.show()
Exercise
Creating dummy variables
As Andy discussed in the video, scikit-learn does not accept non-numerical features. You saw in the previous exercise that the 'Region' feature contains very useful information that can predict life expectancy. For example, Sub-Saharan Africa has a lower life expectancy compared to Europe and Central Asia. Therefore, if you are trying to predict life expectancy, it would be preferable to retain the 'Region' feature. To do this, you need to binarize it by creating dummy variables, which is what you will do in this exercise.
Instruction
- Use the pandas get_dummies() function to create dummy variables from the df DataFrame. Store the result as df_region.
- Print the columns of df_region. This has been done for you.
- Use the get_dummies() function again, this time specifying drop_first=True to drop the unneeded dummy variable (in this case, 'Region_America').
- Hit 'Submit Answer' to print the new columns of df_region and take note of how one column was dropped!
# Create dummy variables: df_region
df_region = pd.get_dummies(df)

# Print the columns of df_region
print(df_region.columns)

# Create dummy variables with drop_first=True: df_region
df_region = pd.get_dummies(df, drop_first=True)

# Print the new columns of df_region
print(df_region.columns)
Index(['population', 'fertility', 'HIV', 'CO2', 'BMI_male', 'GDP', 'BMI_female', 'life', 'child_mortality', 'Region_America', 'Region_East Asia & Pacific', 'Region_Europe & Central Asia', 'Region_Middle East & North Africa', 'Region_South Asia', 'Region_Sub-Saharan Africa'], dtype='object')
Index(['population', 'fertility', 'HIV', 'CO2', 'BMI_male', 'GDP', 'BMI_female', 'life', 'child_mortality', 'Region_East Asia & Pacific', 'Region_Europe & Central Asia', 'Region_Middle East & North Africa', 'Region_South Asia', 'Region_Sub-Saharan Africa'], dtype='object')
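To see exactly what drop_first=True does, here is a minimal sketch on a hypothetical toy DataFrame (not the Gapminder data): one indicator column per category is created, and with drop_first=True the alphabetically first category is omitted, since a row of all zeros already implies it.

```python
import pandas as pd

# Hypothetical toy frame with a single categorical column
toy = pd.DataFrame({"Region": ["America", "Europe", "Asia", "Europe"]})

# Without drop_first: one indicator column per category
full = pd.get_dummies(toy)
print(list(full.columns))  # ['Region_America', 'Region_Asia', 'Region_Europe']

# With drop_first=True: the first category is dropped; it is implied
# whenever all remaining indicator columns are 0
reduced = pd.get_dummies(toy, drop_first=True)
print(list(reduced.columns))  # ['Region_Asia', 'Region_Europe']
```

Dropping one dummy avoids perfectly collinear features (the "dummy variable trap") in linear models.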
Exercise
Regression with categorical features
Having created the dummy variables from the 'Region' feature, you can build regression models as you did before. Here, you’ll use ridge regression to perform 5-fold cross-validation.
The feature array X and target variable array y have been pre-loaded.
Instruction
- Import Ridge from sklearn.linear_model and cross_val_score from sklearn.model_selection.
- Instantiate a ridge regressor called ridge with alpha=0.5 and normalize=True.
- Perform 5-fold cross-validation on X and y using the cross_val_score() function.
- Print the cross-validated scores.
# modified/added by Jinny
import pandas as pd
import numpy as np
from urllib.request import urlretrieve

fn = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1939/datasets/gm_2008_region.csv'
urlretrieve(fn, 'gapminder.csv')

df = pd.read_csv('gapminder.csv')
df_region = pd.get_dummies(df)
df_region = df_region.drop('Region_America', axis=1)
X = df_region.drop('life', axis=1).values
y = df_region['life'].values
# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Instantiate a ridge regressor: ridge
ridge = Ridge(alpha=0.5, normalize=True)

# Perform 5-fold cross-validation: ridge_cv
ridge_cv = cross_val_score(ridge, X, y, cv=5)

# Print the cross-validated scores
print(ridge_cv)
[0.86808336 0.80623545 0.84004203 0.7754344 0.87503712]
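A note for readers on newer scikit-learn: the normalize argument was deprecated in 1.0 and removed in 1.2, so the course code above will raise a TypeError there. The recommended replacement is to scale the features inside a pipeline, so the scaler is re-fit on each training fold and never sees the test fold. A minimal sketch on synthetic data (the course's pre-loaded Gapminder X and y are assumed unavailable here; note that normalize scaled by the feature l2 norm rather than the standard deviation, so scores will differ slightly):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Gapminder feature/target arrays
rng = np.random.default_rng(42)
X = rng.normal(size=(139, 14))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=139)

# Scaling inside the pipeline replaces the removed normalize=True argument
pipe = make_pipeline(StandardScaler(), Ridge(alpha=0.5))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores)
```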
Exercise
Dropping missing data
The voting dataset from Chapter 1 contained a bunch of missing values that we dealt with for you behind the scenes. Now, it’s time for you to take care of these yourself!
The unprocessed dataset has been loaded into a DataFrame df. Explore it in the IPython Shell with the .head() method. You will see that there are certain data points labeled with a '?'. These denote missing values. As you saw in the video, different datasets encode missing values in different ways. Sometimes it may be a '9999', other times a 0 - real-world data can be very messy! If you’re lucky, the missing values will already be encoded as NaN. We use NaN because it is an efficient and simplified way of internally representing missing data, and it lets us take advantage of pandas methods such as .dropna() and .fillna(), as well as scikit-learn’s imputation transformer Imputer.
That wraps up these DataCamp notes and code for Supervised Learning with scikit-learn, Chapter 4: Preprocessing and pipelines. I hope the post proves helpful to fellow programmers!