DataCamp course notes and code: Machine Learning with the Experts: School Budgets, Chapter 3: Improving your model.
More raw data files and Jupyter notebooks:
Github: https://github.com/JinnyR/Datacamp_DataScienceTrack_Python
Datacamp track: Data Scientist with Python - Course 22 (3)
Exercise
Instantiate pipeline
In order to make your life easier as you start to work with all of the data in your original DataFrame, df, it's time to turn to one of scikit-learn's most useful objects: the Pipeline.
For the next few exercises, you'll reacquaint yourself with pipelines and train a classifier on some synthetic (sample) data of multiple datatypes before using the same techniques on the main dataset.
The sample data is stored in the DataFrame sample_df, which has three kinds of feature data: numeric, text, and numeric with missing values. It also has a label column with two classes, a and b.
In this exercise, your job is to instantiate a pipeline that trains using the numeric column of the sample data.
Instruction
- Import Pipeline from sklearn.pipeline.
- Create training and test sets using the numeric data only. Do this by specifying sample_df[['numeric']] in train_test_split().
- Instantiate a pipeline as pl by adding the classifier step. Use a name of 'clf' and the same classifier from Chapter 2: OneVsRestClassifier(LogisticRegression()).
- Fit your pipeline to the training data and compute its accuracy to see it in action! Since this is toy data, you'll use the default scoring method for now. In the next chapter, you'll return to log loss scoring.
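The double brackets in sample_df[['numeric']] matter because scikit-learn estimators expect a two-dimensional feature matrix. The sketch below illustrates the difference on a hypothetical three-row stand-in named toy_df (not the course's sample_df).

```python
import pandas as pd

# Hypothetical stand-in for sample_df, used only to illustrate shapes
toy_df = pd.DataFrame({"numeric": [1.0, 2.0, 3.0], "label": ["a", "b", "a"]})

as_series = toy_df["numeric"]    # single brackets: 1-D Series
as_frame = toy_df[["numeric"]]   # double brackets: 2-D DataFrame with one column

print(as_series.shape)  # (3,)
print(as_frame.shape)   # (3, 1)
```

Passing the one-column DataFrame keeps the features in the (n_samples, n_features) layout that the classifier's fit method works with.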
import numpy as np
import pandas as pd

# Seeded generator so the sample data is reproducible
rng = np.random.RandomState(123)

SIZE = 1000

# Three feature types: numeric, text, and numeric with missing values
sample_data = {
    'numeric': rng.normal(0, 10, size=SIZE),
    'text': rng.choice(['', 'foo', 'bar', 'foo bar', 'bar foo'], size=SIZE),
    'with_missing': rng.normal(loc=3, size=SIZE)
}

sample_df = pd.DataFrame(sample_data)

# Blank out roughly a fifth of the 'with_missing' column
sample_df.loc[rng.choice(sample_df.index, size=np.floor_divide(sample_df.shape[0], 5)), 'with_missing'] = np.nan

foo_values = sample_df.text.str.contains('foo') * 10
bar_values = sample_df.text.str.contains('bar') * -25
no_text = ((foo_values + bar_values) == 0) * 1

# Noisy score built from all three features; the label splits it at the median
val = 2 * sample_df.numeric + -2 * (foo_values + bar_values + no_text) + 4 * sample_df.with_missing.fillna(3)
val += rng.normal(0, 8, size=SIZE)

sample_df['label'] = np.where(val > np.median(val), 'a', 'b')

print(sample_df.head())
     numeric     text  with_missing label
0 -10.856306               4.433240     b
1   9.973454      foo      4.310229     b
2   2.829785  foo bar      2.469828     a
3 -15.062947               2.852981     b
4  -5.786003  foo bar      1.826475     a
# Import Pipeline
from sklearn.pipeline import Pipeline

# Import other necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Split and select numeric data only, no nans
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric']],
                                                    pd.get_dummies(sample_df['label']),
                                                    random_state=22)

# Instantiate Pipeline object: pl
pl = Pipeline([
    ('clf', OneVsRestClassifier(LogisticRegression(solver='liblinear')))
])

# Fit the pipeline to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - numeric, no nans: ", accuracy)
Accuracy on sample data - numeric, no nans: 0.62
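Because the labels are passed as a two-column indicator matrix from pd.get_dummies, the default score here is subset accuracy: a row counts as correct only if every label column matches. The sketch below checks this on small synthetic data. The names toy_X and toy_y and the data itself are assumptions for illustration, and the .astype(int) cast is added only to keep the indicator columns numeric across pandas versions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data: one numeric feature and a noisy two-class label
rng = np.random.RandomState(0)
toy_X = pd.DataFrame({"numeric": rng.normal(0, 10, size=200)})
labels = np.where(toy_X["numeric"] + rng.normal(0, 5, size=200) > 0, "a", "b")
toy_y = pd.get_dummies(pd.Series(labels)).astype(int)  # indicator columns 'a' and 'b'

X_train, X_test, y_train, y_test = train_test_split(toy_X, toy_y, random_state=22)

pl = Pipeline([("clf", OneVsRestClassifier(LogisticRegression(solver="liblinear")))])
pl.fit(X_train, y_train)

# Default scoring of the pipeline
default_score = pl.score(X_test, y_test)

# Subset accuracy computed by hand: all label columns must match in a row
row_matches = (pl.predict(X_test) == y_test.to_numpy()).all(axis=1)
manual_score = row_matches.mean()

print(default_score, manual_score)
```

The two numbers should agree, which is one reason the course later switches to log loss: subset accuracy ignores how confident the predictions are.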
Exercise
Preprocessing numeric features
What would have happened if you had included the 'with_missing' column in the last exercise? Without imputing missing values, the pipeline would fail to fit (try it and see). So, in this exercise you'll improve your pipeline a bit by using the Imputer() imputation transformer from scikit-learn to fill in missing values in your sample data. (Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; on current versions, SimpleImputer from sklearn.impute plays the same role.)
By default, the imputer transformer replaces NaNs with the mean value of the column. That's a good enough imputation strategy for the sample data, so you won't need to pass anything extra to the imputer.
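Both claims can be checked on a tiny array. The sketch below is a minimal illustration that assumes a recent scikit-learn, so it uses SimpleImputer rather than the course's Imputer, and it fits a bare LogisticRegression rather than the full pipeline. The four-row array is made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Hypothetical feature column with one missing value
X = np.array([[1.0], [np.nan], [3.0], [5.0]])
y = np.array([0, 0, 1, 1])

# Fitting directly on data that contains NaN is rejected
try:
    LogisticRegression().fit(X, y)
    fit_failed = False
except ValueError:
    fit_failed = True
    print("fit raised ValueError because X contains NaN")

# The default strategy replaces NaN with the column mean: (1 + 3 + 5) / 3 = 3
imputed = SimpleImputer().fit_transform(X)
print(imputed.ravel())  # [1. 3. 3. 5.]
```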
After importing the transformer, you will edit the steps list used in the previous exercise by inserting a (name, transform) tuple. Recall that steps are processed sequentially, so make sure the new tuple encoding your preprocessing step is put in the right place.
The sample_df is in the workspace, in case you'd like to take another look. Make sure to select both numeric columns: in the previous exercise we couldn't use with_missing because we had no preprocessing step!
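As a rough illustration of step ordering, the sketch below starts from the single-step list of the previous exercise and inserts the imputation tuple in front of the classifier. It assumes a recent scikit-learn and therefore uses SimpleImputer in place of Imputer. The variable names steps and step_names are chosen for this example.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Steps list as in the previous exercise: classifier only
steps = [("clf", OneVsRestClassifier(LogisticRegression(solver="liblinear")))]

# Preprocessing has to run first, so the new tuple goes at position 0
steps.insert(0, ("imp", SimpleImputer()))

pl = Pipeline(steps)

# Steps run in list order: imputation, then classification
step_names = [name for name, _ in pl.steps]
print(step_names)  # ['imp', 'clf']
```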
Instruction
- Import Imputer from sklearn.preprocessing.
- Create training and test sets by selecting the correct subset of sample_df: 'numeric' and 'with_missing'.
- Add the tuple ('imp', Imputer()) to the correct position in the pipeline. Pipeline processes steps sequentially, so the imputation step should come before the classifier step.
- Complete the .fit() and .score() methods to fit the pipeline to the data and compute the accuracy.
# Import the Imputer object
# (scikit-learn 0.22+ removed Imputer; there, use: from sklearn.impute import SimpleImputer)
from sklearn.preprocessing import Imputer

# Create training and test sets using only numeric data
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric', 'with_missing']],
                                                    pd.get_dummies(sample_df['label']
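A minimal, self-contained sketch of the whole exercise, following the instruction steps above, might look like this. It is not the course's solution: it assumes a recent scikit-learn (SimpleImputer instead of Imputer), builds a hypothetical toy_df instead of sample_df, casts the indicator labels with .astype(int) for portability, and picks random_state=0 arbitrarily.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for sample_df with two numeric columns
rng = np.random.RandomState(123)
toy_df = pd.DataFrame({
    "numeric": rng.normal(0, 10, size=500),
    "with_missing": rng.normal(loc=3, size=500),
})
# Label is computed before any values are blanked out
toy_df["label"] = np.where(toy_df["numeric"] + toy_df["with_missing"] > 3, "a", "b")
toy_df.loc[rng.choice(toy_df.index, size=100), "with_missing"] = np.nan

# Select both numeric columns; labels become a two-column indicator matrix
X_train, X_test, y_train, y_test = train_test_split(
    toy_df[["numeric", "with_missing"]],
    pd.get_dummies(toy_df["label"]).astype(int),
    random_state=0,  # arbitrary seed chosen for this sketch
)

# Imputation comes before the classifier
pl = Pipeline([
    ("imp", SimpleImputer()),
    ("clf", OneVsRestClassifier(LogisticRegression(solver="liblinear"))),
])

pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)
print("Accuracy on toy data - all numeric, incl nans:", accuracy)
```

Because the imputer is fit inside the pipeline, the column means are learned from the training split only and then reused on the test split.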