This article presents Datacamp notes and code for Machine Learning with the Experts: School Budgets, Chapter 2: Creating a simple first model.
More raw data, documents, and Jupyter notebooks:
Github: https://github.com/JinnyR/Datacamp_DataScienceTrack_Python
Datacamp track: Data Scientist with Python - Course 22 (2)
Exercise
Setting up a train-test split in scikit-learn
Alright, you’ve been patient and awesome. It’s finally time to start training models!
The first step is to split the data into a training set and a test set. Some labels don't occur very often, but we want to make sure that they appear in both the training and the test sets. We provide a function that will make sure at least min_count examples of each label appear in each split: multilabel_train_test_split.
Feel free to check out the full code for multilabel_train_test_split here.
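To make the rare-label concern concrete, here is a minimal sketch, not the course's implementation, of a check that every label column has at least min_count positive examples in both splits. The helper name labels_covered and the toy indicator matrix are assumptions introduced only for illustration.

```python
import pandas as pd


def labels_covered(y_train, y_test, min_count=1):
    """Return True if every label column has >= min_count positives in both splits."""
    train_ok = (y_train.sum(axis=0) >= min_count).all()
    test_ok = (y_test.sum(axis=0) >= min_count).all()
    return bool(train_ok and test_ok)


# Toy binary indicator matrix (hypothetical): "rare" is set in only 2 of 10 rows.
y = pd.DataFrame({
    "common": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "rare":   [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

# A naive positional split: first 8 rows for training, last 2 for testing.
# "rare" never shows up in the training rows.
naive_train, naive_test = y.iloc[:8], y.iloc[8:]
naive_ok = labels_covered(naive_train, naive_test)

# A label-aware split that puts one "rare" row on each side.
test_idx = [0, 8]
aware_test = y.loc[test_idx]
aware_train = y.drop(index=test_idx)
aware_ok = labels_covered(aware_train, aware_test)

print("naive split covers all labels:", naive_ok)
print("label-aware split covers all labels:", aware_ok)
```

A check like this can be run on the output of any splitting function to confirm that no label is missing from either side.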
You'll start with a simple model that uses just the numeric columns of your DataFrame when calling multilabel_train_test_split. The data has been read into a DataFrame df and a list consisting of just the numeric columns is available as NUMERIC_COLUMNS.
Instruction
- Create a new DataFrame named numeric_data_only by applying the .fillna(-1000) method to the numeric columns (available in the list NUMERIC_COLUMNS) of df.
- Convert the labels (available in the list LABELS) to dummy variables. Save the result as label_dummies.
- In the call to multilabel_train_test_split(), set the size of your test set to be 0.2. Use a seed of 123.
- Fill in the .info() method calls for X_train, X_test, y_train, and y_test.
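The steps above can be sketched on toy data as follows. This is an illustration under stated assumptions, not the exercise solution: the toy df, the column names in NUMERIC_COLUMNS and LABELS, and the helper simple_multilabel_split are all hypothetical stand-ins. The stand-in only guarantees min_count examples of each label in the test set; in the exercise you would call the provided multilabel_train_test_split instead.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (assumptions): in the exercise, df, NUMERIC_COLUMNS and LABELS are provided.
NUMERIC_COLUMNS = ["FTE", "Total"]
LABELS = ["Function", "Use"]

data_rng = np.random.RandomState(0)
n_rows = 40
df = pd.DataFrame({
    "FTE": data_rng.rand(n_rows),
    "Total": data_rng.rand(n_rows) * 1000,
    "Function": ["Teacher", "Admin"] * 20,
    "Use": ["Instruction", "Operations", "Operations", "Instruction"] * 10,
})
df.loc[df.index % 7 == 0, "FTE"] = np.nan  # inject some missing values

# Step 1: keep only the numeric columns and fill missing values with -1000.
numeric_data_only = df[NUMERIC_COLUMNS].fillna(-1000)

# Step 2: convert the label columns to dummy (indicator) variables.
label_dummies = pd.get_dummies(df[LABELS])


def simple_multilabel_split(X, Y, size, min_count=1, seed=None):
    """Hypothetical stand-in: reserve min_count rows per label for the test set,
    then fill the rest of the test set at random."""
    rng = np.random.RandomState(seed)
    Y_arr = Y.to_numpy().astype(int)
    n_test = int(np.floor(len(Y) * size))
    test_pos = set()
    for j in range(Y_arr.shape[1]):
        candidates = np.flatnonzero(Y_arr[:, j] == 1)
        test_pos.update(rng.choice(candidates, size=min_count, replace=False).tolist())
    remaining = np.setdiff1d(np.arange(len(Y)), list(test_pos))
    n_extra = max(n_test - len(test_pos), 0)
    test_pos.update(rng.choice(remaining, size=n_extra, replace=False).tolist())
    mask = np.zeros(len(Y), dtype=bool)
    mask[sorted(test_pos)] = True
    return X[~mask], X[mask], Y[~mask], Y[mask]


# Step 3: hold out 20% of the rows as the test set, with a seed of 123.
X_train, X_test, y_train, y_test = simple_multilabel_split(
    numeric_data_only, label_dummies, size=0.2, seed=123
)

# Step 4: inspect each of the four pieces.
for name, part in [("X_train", X_train), ("X_test", X_test),
                   ("y_train", y_train), ("y_test", y_test)]:
    print(name)
    part.info()
```

The fill value -1000 comes from the instructions; it marks missing entries with a number well outside the range of the toy values so that they remain distinguishable after filling.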
import pandas as pd
import numpy as np
from warnings import warn
from sklearn.feature_extraction.text import CountVectorizer

#### DEFINE SAMPLING UTILITIES

# First multilabel_sample, which is called by multilabel_train_test_split
def multilabel_sample(y, size=1000, min_count=5, seed=None):
    try:
        if (np.unique(y).astype(int) != np.array([0, 1])).all():
            raise ValueError()
    except (TypeError, ValueError):
        raise ValueError('multilabel_sample only works with binary indicator matrices')

    if (y.sum(axis=0) < min_count).any():
        raise ValueError('Some classes do not have enough examples. Change min_count if necessary.')

    if size <= 1:
        size = np.floor(y.shape[0] * size)

    if y.shape[1] * min_count > size:
        msg = "Size less than number of columns * min_count, returning {} items instead of {}."
        warn(msg.format(y.shape[1] * min_count, size))
        size = y.shape[1] * min_count

    rng = np.random.RandomState(seed if seed is not None else np.random.randint(1))

    if isinstance(y, pd.DataFrame):
        choices = y.index
        y = y.values
    else:
        choices = np.arange(y.shape[0])

    sample_idxs = np.array([], dtype=choices.dtype)

    # first, guarantee > min_count of each label
    for j in range(y.shape[1]):
        label_choices = choices[y[:, j] == 1]
        label_idxs_sampled = rng.choice(label_choices, size=min_count, replace=False)
        sample_idxs = np.concatenate([label_idxs_sampled, sample_idxs])

    sample_idxs = np.unique(sample_idxs)

    # now that we have at least min_count of each, we can just random sample
    sample_count = size - sample_idxs.shape[0]

    # get sample_count indices from remaining choices
    remaining_choices = np.setdiff1d(choices, sample_idxs)
    remaining_sampled = rng.choice(remaining_choices, size=sample_count