Datacamp Notes and Code: Machine Learning with the Experts: School Budgets, Chapter 2: Creating a simple first model


More raw data files and Jupyter notebooks:
Github: https://github.com/JinnyR/Datacamp_DataScienceTrack_Python

Datacamp track: Data Scientist with Python - Course 22 (2)

Exercise

Setting up a train-test split in scikit-learn

Alright, you’ve been patient and awesome. It’s finally time to start training models!

The first step is to split the data into a training set and a test set. Some labels don’t occur very often, but we want to make sure that they appear in both the training and the test sets. We provide a function that will make sure at least min_count examples of each label appear in each split: multilabel_train_test_split.
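To make the concern about rare labels concrete, here is a minimal sketch on a toy indicator matrix. The helper `labels_below_min_count` and the toy data are illustrative assumptions, not part of the course code; the example only shows how a naive split can leave a rare label out of one side entirely.

```python
import pandas as pd


def labels_below_min_count(y: pd.DataFrame, min_count: int) -> list:
    """Return the label columns with fewer than min_count positive rows.

    Hypothetical helper for illustration; not part of the course utilities.
    """
    counts = y.sum(axis=0)
    return counts[counts < min_count].index.tolist()


# Toy binary indicator matrix: 'rare' is positive in only one of ten rows.
y = pd.DataFrame({
    "common": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
    "rare":   [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

# A naive split: first 8 rows for training, last 2 rows for testing.
y_train, y_test = y.iloc[:8], y.iloc[8:]

# Labels that never appear in each split.
missing_in_train = labels_below_min_count(y_train, min_count=1)
missing_in_test = labels_below_min_count(y_test, min_count=1)

print(missing_in_train)  # ['rare']
print(missing_in_test)   # []
```

Here the only positive example of `rare` ends up in the test rows, so a model trained on `y_train` would never see that label. Guaranteeing `min_count` examples per label in each split avoids this situation.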

Feel free to check out the full code for multilabel_train_test_split here.

You’ll start with a simple model that uses just the numeric columns of your DataFrame when calling multilabel_train_test_split. The data has been read into a DataFrame df and a list consisting of just the numeric columns is available as NUMERIC_COLUMNS.

Instructions

Create a new DataFrame named numeric_data_only by applying the .fillna(-1000) method to the numeric columns (available in the list NUMERIC_COLUMNS) of df.
Convert the labels (available in the list LABELS) to dummy variables. Save the result as label_dummies.
In the call to multilabel_train_test_split(), set the size of your test set to be 0.2. Use a seed of 123.
Fill in the .info() method calls for X_train, X_test, y_train, and y_test.
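The first two steps can be sketched on a toy DataFrame. The column names and values below are hypothetical stand-ins for the course's `df`, `NUMERIC_COLUMNS`, and `LABELS`; the split itself is left out because `multilabel_train_test_split` is a course-provided function rather than a library call.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the course's NUMERIC_COLUMNS and LABELS lists.
NUMERIC_COLUMNS = ["FTE", "Total"]
LABELS = ["Function", "Use"]

# Toy data with missing numeric values (assumed for illustration only).
df = pd.DataFrame({
    "FTE": [1.0, np.nan, 0.5, np.nan],
    "Total": [5000.0, 120.5, np.nan, 80.0],
    "Function": ["Teacher", "Admin", "Teacher", "Food"],
    "Use": ["Instruction", "Operations", "Instruction", "Operations"],
})

# Step 1: keep only the numeric columns and replace NaN with the sentinel -1000.
numeric_data_only = df[NUMERIC_COLUMNS].fillna(-1000)

# Step 2: one indicator column per (label column, value) pair.
label_dummies = pd.get_dummies(df[LABELS])

print(numeric_data_only)
print(label_dummies.columns.tolist())
```

The sentinel `-1000` keeps missing values distinguishable from real ones, and the resulting `label_dummies` is the kind of binary indicator matrix that the sampling utilities expect.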

import pandas as pd
import numpy as np
from warnings import warn
from sklearn.feature_extraction.text import CountVectorizer

#### DEFINE SAMPLING UTILITIES

# First multilabel_sample, which is called by multilabel_train_test_split
def multilabel_sample(y, size=1000, min_count=5, seed=None):
    try:
        if (np.unique(y).astype(int) != np.array([0, 1])).all():
            raise ValueError()
    except (TypeError, ValueError):
        raise ValueError('multilabel_sample only works with binary indicator matrices')

    if (y.sum(axis=0) < min_count).any():
        raise ValueError('Some classes do not have enough examples. Change min_count if necessary.')

    if size <= 1:
        size = np.floor(y.shape[0] * size)

    if y.shape[1] * min_count > size:
        msg = "Size less than number of columns * min_count, returning {} items instead of {}."
        warn(msg.format(y.shape[1] * min_count, size))
        size = y.shape[1] * min_count

    rng = np.random.RandomState(seed if seed is not None else np.random.randint(1))

    if isinstance(y, pd.DataFrame):
        choices = y.index
        y = y.values
    else:
        choices = np.arange(y.shape[0])

    sample_idxs = np.array([], dtype=choices.dtype)

    # first, guarantee > min_count of each label
    for j in range(y.shape[1]):
        label_choices = choices[y[:, j] == 1]
        label_idxs_sampled = rng.choice(label_choices, size=min_count, replace=False)
        sample_idxs = np.concatenate([label_idxs_sampled, sample_idxs])

    sample_idxs = np.unique(sample_idxs)

    # now that we have at least min_count of each, we can just random sample
    sample_count = size - sample_idxs.shape[0]

    # get sample_count indices from remaining choices
    remaining_choices = np.setdiff1d(choices, sample_idxs)
    remaining_sampled = rng.choice(remaining_choices, size=sample_count