This article walks through 5. Hands-on practice (predicting US two-party affiliation); hopefully it offers a useful reference for developers tackling similar problems. Follow along to learn.
Contents
- 1. Data preview
- 1.1 Data.head()
- 1.2 Class distribution preview
- 1.3 Other observations
- 1.4 Tasks
- 2. Data preprocessing
- 2.1 Convert the class values REP and DEM to 0 and 1
- 2.2 One-hot encoding
- 2.3 Train/test split
- 3. Model building
- 3.1 Random forest
- 4. Stacking
- 4.1 Build a dictionary to hold the individual models
- 4.2 Train each model and predict on the test set
1. Data preview
1.1 Data.head()
1.2 Class distribution preview
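Both previews can be reproduced with a minimal sketch (assumptions: the contributions dataset is loaded into a DataFrame named data; the file name input.csv is a placeholder):
import pandas as pd

data = pd.read_csv('input.csv')  # placeholder file name for the contributions data

print(data.head())  # first five rows of the raw data
print(data['cand_pty_affiliation'].value_counts(normalize=True))  # REP/DEM class balance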
1.3 Other observations
- No missing values
- The class column is named 'cand_pty_affiliation'
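The missing-value claim can be verified with a quick check (a sketch, assuming data is the loaded DataFrame):
print(data.isnull().sum().sum())  # 0 means there are no missing cells anywhere in the frame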
1.4 Tasks
- Convert the class values REP and DEM to 0 and 1
- Apply one-hot encoding to the categorical features
2. Data preprocessing
2.1 Convert the class values REP and DEM to 0 and 1
data['cand_pty_affiliation'] = data['cand_pty_affiliation'].replace({'REP': 1, 'DEM': 0})
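A quick sanity check (sketch) confirms that only 0 and 1 remain after the mapping:
print(data['cand_pty_affiliation'].value_counts())  # expect exactly two rows: 0 and 1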
2.2 One-hot encoding
First, separate the features from the class label:
X = data.drop(['cand_pty_affiliation'],axis=1)
y = data['cand_pty_affiliation']
Then apply one-hot encoding, storing the result in sparse format to save memory:
X = pd.get_dummies(X,sparse=True)
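To make the transformation concrete, here is a small self-contained toy example (unrelated to the actual dataset) showing what pd.get_dummies does:
import pandas as pd

toy = pd.DataFrame({'state': ['NY', 'CA', 'NY'], 'amount': [10, 20, 30]})
# The numeric column is kept as-is; 'state' expands into one indicator column per level.
print(pd.get_dummies(toy, sparse=True))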
2.3 Train/test split
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.4, random_state=33)
Since the dataset is fairly large, only 60% is used for training to keep computation manageable.
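The proportion can be double-checked from the split itself (sketch):
print(train_x.shape[0] / X.shape[0])  # ~0.6: 60% of the rows went to training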
3. Model building
3.1 Random forest
Tune the hyperparameters iteratively with GridSearchCV:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

parameters = {'max_depth': np.arange(14, 18), 'min_samples_split': np.arange(5, 8)}
clf = GridSearchCV(estimator=RandomForestClassifier(n_estimators=186, random_state=33, n_jobs=-1),
                   param_grid=parameters, cv=5, n_jobs=-1, scoring='roc_auc')
clf.fit(train_x, train_y)
print(clf.best_score_)
print(clf.best_params_)
This yields the best-performing parameters, which give the final model:
rf = RandomForestClassifier(max_depth=17, min_samples_split=5, n_estimators=186,
                            random_state=33, n_jobs=-1)
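Before stacking, the tuned forest can be evaluated on its own against the held-out test set (a sketch; the exact score depends on the split):
from sklearn.metrics import roc_auc_score

rf.fit(train_x, train_y)
print("RF ROC-AUC: %.3f" % roc_auc_score(test_y, rf.predict_proba(test_x)[:, 1]))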
4. Stacking
4.1 Build a dictionary to hold the individual models
The optimal hyperparameters for each model were already found beforehand.
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.kernel_approximation import Nystroem
from sklearn.kernel_approximation import RBFSampler
from sklearn.pipeline import make_pipeline

SEED = 33  # random seed
def get_models():
    """Generate a library of base learners."""
    nb = GaussianNB()
    svc = SVC(C=100, probability=True)
    knn = KNeighborsClassifier(n_neighbors=3)
    lr = LogisticRegression(C=100, random_state=SEED)
    nn = MLPClassifier((80, 10), early_stopping=False, random_state=SEED)
    gb = GradientBoostingClassifier(n_estimators=100, random_state=SEED)
    rf = RandomForestClassifier(max_depth=17, min_samples_split=5, n_estimators=186,
                                random_state=SEED, n_jobs=-1)

    models = {'svm': svc,
              'knn': knn,
              'naive bayes': nb,
              'mlp-nn': nn,
              'random forest': rf,
              'gbm': gb,
              'logistic': lr,
              }
    return models  # dict mapping each model's name to its tuned instance
4.2 Train each model and predict on the test set
def train_predict(models):
    """Fit models in list on training set and return preds"""
    # np.zeros((rows, cols)) initializes an array to hold every model's predictions:
    # one row per test sample (test_y.shape[0]) and one column per model (len(models)).
    P = np.zeros((test_y.shape[0], len(models)))
    P = pd.DataFrame(P)

    print("Fitting models.")
    cols = list()
    for i, (name, m) in enumerate(models.items()):
        print("%s..." % name, end=" ", flush=False)
        m.fit(train_x, train_y)
        P.iloc[:, i] = m.predict_proba(test_x)[:, 1]  # store model i's predicted probabilities in column i
        cols.append(name)  # record the column name
        print("done")

    P.columns = cols
    print("Done.\n")
    return P
Score each base model:
from sklearn.metrics import roc_auc_score

def score_models(P, y):
    """Score model in prediction DF"""
    print("Scoring models.")
    for m in P.columns:
        score = roc_auc_score(y, P.loc[:, m])
        print("%-26s: %.3f" % (m, score))
    print("Done.\n")
SEED=33
models = get_models()
P = train_predict(models)
score_models(P, test_y)
Scoring models.
knn : 0.838
naive bayes : 0.818
mlp-nn : 0.891
random forest : 0.904
gbm : 0.889
logistic : 0.862
Done.
Using the ML-Ensemble package, treat each model's predictions as new features and plot a heatmap of their pairwise correlations:
# You need ML-Ensemble for this figure; install it with: pip install mlens
import matplotlib.pyplot as plt
from mlens.visualization import corrmat

corrmat(P.corr(), inflate=False)
plt.show()
The correlations are low, and each model on its own already predicts reasonably well — both signs that these models are good candidates for stacking.
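The same judgment can be made numerically rather than visually (sketch):
print(P.corr().round(2))                 # pairwise correlation of base-model predictions
print(round(P.corr().mean().mean(), 2))  # crude average correlation across models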
stacking_predict = P.mean(axis=1)
print("Ensemble ROC-AUC score: %.3f" % roc_auc_score(test_y, stacking_predict))
Ensemble ROC-AUC score: 0.899
Plot the ROC curve of each base learner together with the averaged ensemble:
from sklearn.metrics import roc_curve

def plot_roc_curve(test_y, P_base_learners, P_ensemble, labels, ens_label):
    # test_y: ground-truth labels of the test set
    # P_base_learners: the prediction columns produced by the base models
    # P_ensemble: the combined prediction (here the mean of the base predictions)
    # labels: list of base-model names; ens_label: name for the ensemble curve ('ensemble')
    plt.figure(figsize=(10, 8))
    plt.plot([0, 1], [0, 1], 'k--')  # diagonal reference line

    cm = [plt.cm.rainbow(i)
          for i in np.linspace(0, 1.0, P_base_learners.shape[1] + 1)]

    for i in range(P_base_learners.shape[1]):
        p = P_base_learners[:, i]
        fpr, tpr, _ = roc_curve(test_y, p)
        plt.plot(fpr, tpr, label=labels[i], c=cm[i + 1])

    fpr, tpr, _ = roc_curve(test_y, P_ensemble)
    plt.plot(fpr, tpr, label=ens_label, c=cm[0])
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC curve')
    plt.legend(frameon=False)
    plt.show()
plot_roc_curve(test_y, P.values, P.mean(axis=1), list(P.columns), "ensemble")
The simple average (0.899) fails to beat the best single model (random forest, 0.904), so a meta learner is trained to combine the base predictions instead. SuperLearner fits the base learners over cross-validation folds and trains the meta learner on their out-of-fold predictions, which avoids leakage. Note the original text never shows the meta learner's definition; a logistic regression is assumed below as a plausible stand-in.
from mlens.ensemble import SuperLearner

base_learners = get_models()
meta_learner = LogisticRegression(C=100, random_state=SEED)  # assumed: not defined in the original

# Instantiate the ensemble with 10 folds
sl = SuperLearner(folds=10, random_state=SEED, verbose=2, backend="multiprocessing")

# Add the base learners and the meta learner
sl.add(list(base_learners.values()), proba=True)
sl.add_meta(meta_learner, proba=True)

# Train the ensemble
sl.fit(train_x.values, train_y.values)

# Predict the test set
p_sl = sl.predict_proba(test_x.values)
print("\nSuper Learner ROC-AUC score: %.3f" % roc_auc_score(test_y, p_sl[:, 1]))
That wraps up this article on 5. Hands-on practice (US two-party prediction) — hopefully the walkthrough is helpful to fellow programmers!