DataCamp notes and code: Machine Learning with the Experts: School Budgets, Chapter 3: Improving your model

Datacamp track: Data Scientist with Python - Course 22 (3)


Instantiate pipeline

In order to make your life easier as you start to work with all of the data in your original DataFrame, df, it’s time to turn to one of scikit-learn’s most useful objects: the Pipeline.

For the next few exercises, you’ll reacquaint yourself with pipelines and train a classifier on some synthetic (sample) data of multiple datatypes before using the same techniques on the main dataset.

The sample data is stored in the DataFrame, sample_df, which has three kinds of feature data: numeric, text, and numeric with missing values. It also has a label column with two classes, a and b.

In this exercise, your job is to instantiate a pipeline that trains using the numeric column of the sample data.


  • Import Pipeline from sklearn.pipeline.
  • Create training and test sets using the numeric data only. Do this by specifying sample_df[['numeric']] in train_test_split().
  • Instantiate a pipeline as pl by adding the classifier step. Use a name of 'clf' and the same classifier from Chapter 2: OneVsRestClassifier(LogisticRegression()).
  • Fit your pipeline to the training data and compute its accuracy to see it in action! Since this is toy data, you’ll use the default scoring method for now. In the next chapter, you’ll return to log loss scoring.
import numpy as np
import pandas as pd

rng = np.random.RandomState(123)

SIZE = 1000

sample_data = {
    'numeric': rng.normal(0, 10, size=SIZE),
    'text': rng.choice(['', 'foo', 'bar', 'foo bar', 'bar foo'], size=SIZE),
    'with_missing': rng.normal(loc=3, size=SIZE)
}

sample_df = pd.DataFrame(sample_data)

sample_df.loc[rng.choice(sample_df.index, size=np.floor_divide(sample_df.shape[0], 5)), 'with_missing'] = np.nan

foo_values = sample_df.text.str.contains('foo') * 10
bar_values = sample_df.text.str.contains('bar') * -25
no_text = ((foo_values + bar_values) == 0) * 1

val = 2 * sample_df.numeric + -2 * (foo_values + bar_values + no_text) + 4 * sample_df.with_missing.fillna(3)
val += rng.normal(0, 8, size=SIZE)

sample_df['label'] = np.where(val > np.median(val), 'a', 'b')

print(sample_df.head())
     numeric     text  with_missing label
0 -10.856306               4.433240     b
1   9.973454      foo      4.310229     b
2   2.829785  foo bar      2.469828     a
3 -15.062947               2.852981     b
4  -5.786003  foo bar      1.826475     a
# Import Pipeline
from sklearn.pipeline import Pipeline

# Import other necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Split and select numeric data only, no nans
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric']],
                                                    pd.get_dummies(sample_df['label']),
                                                    random_state=22)

# Instantiate Pipeline object: pl
pl = Pipeline([('clf', OneVsRestClassifier(LogisticRegression(solver='liblinear')))])

# Fit the pipeline to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - numeric, no nans: ", accuracy)
Accuracy on sample data - numeric, no nans:  0.62
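
A pipeline whose only step is a classifier should behave like that classifier used on its own. The following is a minimal sketch of that idea on made-up toy data (not the course's sample_df), using a plain LogisticRegression for brevity; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: one numeric feature, label decided by its sign
rng = np.random.RandomState(0)
X = rng.normal(0, 10, size=(200, 1))
y = np.where(X[:, 0] > 0, 'a', 'b')

# One-step pipeline: the only step is the classifier, named 'clf'
pl = Pipeline([('clf', LogisticRegression())])
pl.fit(X, y)

# The same kind of estimator fitted directly, without a pipeline
clf = LogisticRegression().fit(X, y)

# Steps can be looked up by the name given in the (name, estimator) tuple
fitted_clf = pl.named_steps['clf']

# Both objects were trained the same way, so their predictions should agree
same_predictions = bool((pl.predict(X) == clf.predict(X)).all())
print(type(fitted_clf).__name__, same_predictions)
```

The step name matters later on: it is how individual steps are retrieved or configured once the pipeline grows beyond one step.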


Preprocessing numeric features

What would have happened if you had included the 'with_missing' column in the last exercise? Without imputing missing values, the pipeline would not be happy (try it and see). So, in this exercise you’ll improve your pipeline a bit by using the Imputer() imputation transformer from scikit-learn to fill in missing values in your sample data.
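
The failure mentioned above can be reproduced on a tiny example. This is a sketch on made-up data (the array values are assumptions, not taken from sample_df): fitting LogisticRegression on features containing NaN raises a ValueError.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy features with missing values in both columns
X = np.array([[1.0, 2.0],
              [np.nan, 0.5],
              [3.0, np.nan],
              [-1.0, 1.5]])
y = np.array(['a', 'b', 'a', 'b'])

try:
    LogisticRegression().fit(X, y)
    raised = False
except ValueError as err:
    # scikit-learn validates the input and rejects NaN before any training
    raised = True
    print("fit failed:", str(err).splitlines()[0])
```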

By default, the imputer transformer replaces NaNs with the mean value of the column. That’s a good enough imputation strategy for the sample data, so you won’t need to pass anything extra to the imputer.
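
Mean imputation can be checked by hand on a small array. The sketch below uses SimpleImputer from sklearn.impute, which replaced Imputer in newer scikit-learn releases; that substitution and the toy values are assumptions, since the course text itself uses Imputer.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy array: one NaN in each column
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# Default strategy is 'mean': each NaN becomes the mean of its column
imp = SimpleImputer()
X_filled = imp.fit_transform(X)

# Column means computed from the observed values: 2.0 and 15.0
print(imp.statistics_)
print(X_filled)
```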

After importing the transformer, you will edit the steps list used in the previous exercise by inserting a (name, transform) tuple. Recall that steps are processed sequentially, so make sure the new tuple encoding your preprocessing step is put in the right place.
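
Editing the steps list might look like the sketch below: the preprocessing tuple is inserted in front of the classifier so that it runs first. SimpleImputer stands in for Imputer here, and the toy data is made up; both are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Steps list as it would look for a classifier-only pipeline
steps = [('clf', LogisticRegression())]

# Insert the (name, transform) tuple at the front so it runs before 'clf'
steps.insert(0, ('imp', SimpleImputer()))
pl = Pipeline(steps)

step_names = [name for name, _ in pl.steps]
print(step_names)

# Hypothetical toy data with missing values
X = np.array([[1.0, np.nan],
              [2.0, 0.5],
              [np.nan, 1.5],
              [-1.0, -2.0]])
y = np.array(['a', 'a', 'b', 'b'])

# fit() runs the imputer first, then trains the classifier on the filled data
pl.fit(X, y)
predictions = pl.predict(X)
print(predictions)
```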

The sample_df is in the workspace, in case you’d like to take another look. Make sure to select both numeric columns: in the previous exercise we couldn’t use with_missing because we had no preprocessing step!


  • Import Imputer from sklearn.preprocessing.
  • Create training and test sets by selecting the correct subset of sample_df: 'numeric' and 'with_missing'.
  • Add the tuple ('imp', Imputer()) to the correct position in the pipeline. Pipeline processes steps sequentially, so the imputation step should come before the classifier step.
  • Complete the .fit() and .score() methods to fit the pipeline to the data and compute the accuracy.
# Import the Imputer object
# (Imputer was removed in scikit-learn 0.22; newer versions provide sklearn.impute.SimpleImputer instead)
from sklearn.preprocessing import Imputer

# Create training and test sets using only numeric data
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric', 'with_missing']],
                                                    pd.get_dummies(sample_df['label']
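
The steps listed for this exercise could be sketched end to end as follows. This is not the course's solution: it uses a hypothetical stand-in DataFrame df instead of sample_df, SimpleImputer (from sklearn.impute) instead of Imputer, the raw label column instead of pd.get_dummies, and an arbitrary random_state.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for sample_df: two numeric columns and a two-class label
rng = np.random.RandomState(123)
df = pd.DataFrame({
    'numeric': rng.normal(0, 10, size=500),
    'with_missing': rng.normal(loc=3, size=500),
})
df['label'] = np.where(df['numeric'] + df['with_missing'] > 3, 'a', 'b')

# Blank out 100 of the 500 values in one column to mimic missing data
df.loc[rng.choice(df.index, size=100, replace=False), 'with_missing'] = np.nan

# Create training and test sets from both numeric columns
X_train, X_test, y_train, y_test = train_test_split(
    df[['numeric', 'with_missing']],
    df['label'],
    random_state=0,  # arbitrary seed, not taken from the course
)

# Imputation comes before the classifier because steps run in order
pl = Pipeline([
    ('imp', SimpleImputer()),
    ('clf', OneVsRestClassifier(LogisticRegression())),
])

# Fit on the training data and score on the held-out data
pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)
print("Accuracy on stand-in data - all numeric, incl nans: ", accuracy)
```

Because the imputer lives inside the pipeline, the column means are learned from the training split only and then reused when scoring the test split.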




