Datacamp 笔记代码 Machine Learning with the Experts: School Budgets 第三章 Improving your model

本文主要是介绍Datacamp 笔记代码 Machine Learning with the Experts: School Budgets 第三章 Improving your model,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

更多原始数据文档和JupyterNotebook
Github: https://github.com/JinnyR/Datacamp_DataScienceTrack_Python

Datacamp track: Data Scientist with Python - Course 22 (3)

Exercise

Instantiate pipeline

In order to make your life easier as you start to work with all of the data in your original DataFrame, df, it’s time to turn to one of scikit-learn’s most useful objects: the Pipeline.

For the next few exercises, you’ll reacquaint yourself with pipelines and train a classifier on some synthetic (sample) data of multiple datatypes before using the same techniques on the main dataset.

The sample data is stored in the DataFrame, sample_df, which has three kinds of feature data: numeric, text, and numeric with missing values. It also has a label column with two classes, a and b.

In this exercise, your job is to instantiate a pipeline that trains using the numeric column of the sample data.

Instruction

  • Import Pipeline from sklearn.pipeline.
  • Create training and test sets using the numeric data only. Do this by specifying sample_df[['numeric']] in train_test_split().
  • Instantiate a pipeline as pl by adding the classifier step. Use a name of 'clf' and the same classifier from Chapter 2: OneVsRestClassifier(LogisticRegression()).
  • Fit your pipeline to the training data and compute its accuracy to see it in action! Since this is toy data, you’ll use the default scoring method for now. In the next chapter, you’ll return to log loss scoring.
import numpy as np
import pandas as pdrng = np.random.RandomState(123)SIZE = 1000sample_data = {'numeric': rng.normal(0, 10, size=SIZE),'text': rng.choice(['', 'foo', 'bar', 'foo bar', 'bar foo'], size=SIZE),'with_missing': rng.normal(loc=3, size=SIZE)
}sample_df = pd.DataFrame(sample_data)sample_df.loc[rng.choice(sample_df.index, size=np.floor_divide(sample_df.shape[0], 5)), 'with_missing'] = np.nanfoo_values = sample_df.text.str.contains('foo') * 10
bar_values = sample_df.text.str.contains('bar') * -25
no_text = ((foo_values + bar_values) == 0) * 1val = 2 * sample_df.numeric + -2 * (foo_values + bar_values + no_text) + 4 * sample_df.with_missing.fillna(3)
val += rng.normal(0, 8, size=SIZE)sample_df['label'] = np.where(val > np.median(val), 'a', 'b')print(sample_df.head())
     numeric     text  with_missing label
0 -10.856306               4.433240     b
1   9.973454      foo      4.310229     b
2   2.829785  foo bar      2.469828     a
3 -15.062947               2.852981     b
4  -5.786003  foo bar      1.826475     a
# Import Pipeline
from sklearn.pipeline import Pipeline# Import other necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier# Split and select numeric data only, no nans 
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric']],pd.get_dummies(sample_df['label']), random_state=22)# Instantiate Pipeline object: pl
pl = Pipeline([('clf', OneVsRestClassifier(LogisticRegression(solver='liblinear')))])# Fit the pipeline to the training data
pl.fit(X_train, y_train)# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - numeric, no nans: ", accuracy)
Accuracy on sample data - numeric, no nans:  0.62

Exercise

Preprocessing numeric features

What would have happened if you had included the with 'with_missing' column in the last exercise? Without imputing missing values, the pipeline would not be happy (try it and see). So, in this exercise you’ll improve your pipeline a bit by using the Imputer() imputation transformer from scikit-learn to fill in missing values in your sample data.

By default, the imputer transformer replaces NaNs with the mean value of the column. That’s a good enough imputation strategy for the sample data, so you won’t need to pass anything extra to the imputer.

After importing the transformer, you will edit the steps list used in the previous exercise by inserting a (name, transform) tuple. Recall that steps are processed sequentially, so make sure the new tuple encoding your preprocessing step is put in the right place.

The sample_df is in the workspace, in case you’d like to take another look. Make sure to select both numeric columns- in the previous exercise we couldn’t use with_missing because we had no preprocessing step!

Instruction

  • Import Imputer from sklearn.preprocessing.
  • Create training and test sets by selecting the correct subset of sample_df: 'numeric' and 'with_missing'.
  • Add the tuple ('imp', Imputer()) to the correct position in the pipeline. Pipeline processes steps sequentially, so the imputation step should come before the classifier step.
  • Complete the .fit() and .score() methods to fit the pipeline to the data and compute the accuracy.
# Import the Imputer object
from sklearn.preprocessing import Imputer# Create training and test sets using only numeric data
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric', 'with_missing']],pd.get_dummies(sample_df['label']

这篇关于Datacamp 笔记代码 Machine Learning with the Experts: School Budgets 第三章 Improving your model的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/1028850

相关文章

C#实现千万数据秒级导入的代码

《C#实现千万数据秒级导入的代码》在实际开发中excel导入很常见,现代社会中很容易遇到大数据处理业务,所以本文我就给大家分享一下千万数据秒级导入怎么实现,文中有详细的代码示例供大家参考,需要的朋友可... 目录前言一、数据存储二、处理逻辑优化前代码处理逻辑优化后的代码总结前言在实际开发中excel导入很

SpringBoot+RustFS 实现文件切片极速上传的实例代码

《SpringBoot+RustFS实现文件切片极速上传的实例代码》本文介绍利用SpringBoot和RustFS构建高性能文件切片上传系统,实现大文件秒传、断点续传和分片上传等功能,具有一定的参考... 目录一、为什么选择 RustFS + SpringBoot?二、环境准备与部署2.1 安装 RustF

Python实现Excel批量样式修改器(附完整代码)

《Python实现Excel批量样式修改器(附完整代码)》这篇文章主要为大家详细介绍了如何使用Python实现一个Excel批量样式修改器,文中的示例代码讲解详细,感兴趣的小伙伴可以跟随小编一起学习一... 目录前言功能特性核心功能界面特性系统要求安装说明使用指南基本操作流程高级功能技术实现核心技术栈关键函

Redis实现高效内存管理的示例代码

《Redis实现高效内存管理的示例代码》Redis内存管理是其核心功能之一,为了高效地利用内存,Redis采用了多种技术和策略,如优化的数据结构、内存分配策略、内存回收、数据压缩等,下面就来详细的介绍... 目录1. 内存分配策略jemalloc 的使用2. 数据压缩和编码ziplist示例代码3. 优化的

Python 基于http.server模块实现简单http服务的代码举例

《Python基于http.server模块实现简单http服务的代码举例》Pythonhttp.server模块通过继承BaseHTTPRequestHandler处理HTTP请求,使用Threa... 目录测试环境代码实现相关介绍模块简介类及相关函数简介参考链接测试环境win11专业版python

Python从Word文档中提取图片并生成PPT的操作代码

《Python从Word文档中提取图片并生成PPT的操作代码》在日常办公场景中,我们经常需要从Word文档中提取图片,并将这些图片整理到PowerPoint幻灯片中,手动完成这一任务既耗时又容易出错,... 目录引言背景与需求解决方案概述代码解析代码核心逻辑说明总结引言在日常办公场景中,我们经常需要从 W

使用Spring Cache本地缓存示例代码

《使用SpringCache本地缓存示例代码》缓存是提高应用程序性能的重要手段,通过将频繁访问的数据存储在内存中,可以减少数据库访问次数,从而加速数据读取,:本文主要介绍使用SpringCac... 目录一、Spring Cache简介核心特点:二、基础配置1. 添加依赖2. 启用缓存3. 缓存配置方案方案

MySQL的配置文件详解及实例代码

《MySQL的配置文件详解及实例代码》MySQL的配置文件是服务器运行的重要组成部分,用于设置服务器操作的各种参数,下面:本文主要介绍MySQL配置文件的相关资料,文中通过代码介绍的非常详细,需要... 目录前言一、配置文件结构1.[mysqld]2.[client]3.[mysql]4.[mysqldum

Python多线程实现大文件快速下载的代码实现

《Python多线程实现大文件快速下载的代码实现》在互联网时代,文件下载是日常操作之一,尤其是大文件,然而,网络条件不稳定或带宽有限时,下载速度会变得很慢,本文将介绍如何使用Python实现多线程下载... 目录引言一、多线程下载原理二、python实现多线程下载代码说明:三、实战案例四、注意事项五、总结引

IDEA与MyEclipse代码量统计方式

《IDEA与MyEclipse代码量统计方式》文章介绍在项目中不安装第三方工具统计代码行数的方法,分别说明MyEclipse通过正则搜索(排除空行和注释)及IDEA使用Statistic插件或调整搜索... 目录项目场景MyEclipse代码量统计IDEA代码量统计总结项目场景在项目中,有时候我们需要统计