本文主要是介绍Datawhale第23期组队学习—集成学习—task6 模型评估与超参数调优,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!
文章目录
- 1.k折交叉验证
- 2. 偏差与方差
- 3. 混淆矩阵与ROC曲线
- 4. 超参数调优
参考来源:1. https://zhuanlan.zhihu.com/p/140040705
2. https://blog.csdn.net/teng_zz/article/details/98027712
1.k折交叉验证
所谓K折交叉验证,就是将数据集等比例划分成K份。将其中的k-1份作为训练集,剩余1份作为测试集。用k-1份数据训练出的模型预测值与剩余的1份样本测试值进行对比,得出均方误差大小。之后将第2份数据作为测试集,其余k-1份作为训练集,以此类推,将此过程重复k次,得到一个平均误差值。
k折交叉验证
from sklearn.model_selection import cross_val_scorescores1 = cross_val_score(estimator=pipe_lr,X = X_train,y = y_train,cv=10,n_jobs=1)
print("CV accuracy scores:%s" % scores1)
print("CV accuracy:%.3f +/-%.3f"%(np.mean(scores1),np.std(scores1)))
分层k折交叉验证
from sklearn.model_selection import StratifiedKFoldkfold = StratifiedKFold(n_splits=10,random_state=1).split(X_train,y_train)
scores2 = []
for k,(train,test) in enumerate(kfold):pipe_lr.fit(X_train[train],y_train[train])score = pipe_lr.score(X_train[test],y_train[test])scores2.append(score)print('Fold:%2d,Class dist.:%s,Acc:%.3f'%(k+1,np.bincount(y_train[train]),score))
print('\nCV accuracy :%.3f +/-%.3f'%(np.mean(scores2),np.std(scores2)))
2. 偏差与方差
通过前面的学习我们可以知道,模型越复杂,对训练集的拟合效果就越好(理论上来说,我们可以通过添加无数的高次项将训练集的误差降低为0)。但是,对训练集的拟合效果越好,通常在测试集上的表现就相对较差。模型的偏差是指,我们使用的模型去估计真实值所产生的误差,不同数据集训练出来的模型之间的差异称为方差。(详情可见大佬直播:https://www.bilibili.com/video/BV1DZ4y1F7Su)
3. 混淆矩阵与ROC曲线
混淆矩阵
- 真阳性TP:预测值和真实值都为正例;
- 真阴性TN:预测值与真实值都为负例;
- 假阳性FP:预测值为正,实际值为负;
- 假阴性FN:预测值为负,实际值为正;
上图是一个简单的混淆矩阵的概念,那什么是混淆矩阵呢?混淆矩阵的每一列代表了预测类别,每一列的总数表示预测为该类别的数据的数目;每一行代表了数据的真实归属类别,每一行的数据总数表示该类别的数据实例的数目。而混淆矩阵主对角线上的数量便是预测正确的数量。由此,产生了一系列度量模型的指标:
- 准确率:分类正确的样本数占总样本的比例,即: A C C = T P + T N F P + F N + T P + T N ACC = \frac{TP+TN}{FP+FN+TP+TN} ACC=FP+FN+TP+TNTP+TN.
- 精度:预测为正且分类正确的样本占预测值为正的比例,即: P R E = T P T P + F P PRE = \frac{TP}{TP+FP} PRE=TP+FPTP.
- 召回率:预测为正且分类正确的样本占预测正确的比例,即: R E C = T P T P + F N REC = \frac{TP}{TP+FN} REC=TP+FNTP.
- F1值:综合衡量精度和召回率,即: F 1 = 2 P R E × R E C P R E + R E C F1 = 2\frac{PRE\times REC}{PRE + REC} F1=2PRE+RECPRE×REC.
- ROC曲线:以假阳率为横轴,真阳率为纵轴画出来的曲线,曲线下方面积越大越好。
绘制混淆矩阵
# 混淆矩阵:
# 加载数据
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",header=None)
'''
乳腺癌数据集:569个恶性和良性肿瘤细胞的样本,M为恶性,B为良性
'''
# 做基本的数据预处理
from sklearn.preprocessing import LabelEncoderX = df.iloc[:,2:].values
y = df.iloc[:,1].values
le = LabelEncoder() #将M-B等字符串编码成计算机能识别的0-1
y = le.fit_transform(y)
le.transform(['M','B'])
# 数据切分8:2
from sklearn.model_selection import train_test_splitX_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,stratify=y,random_state=1)
from sklearn.svm import SVC
pipe_svc = make_pipeline(StandardScaler(),SVC(random_state=1))
from sklearn.metrics import confusion_matrixpipe_svc.fit(X_train,y_train)
y_pred = pipe_svc.predict(X_test)
confmat = confusion_matrix(y_true=y_test,y_pred=y_pred)
fig,ax = plt.subplots(figsize=(2.5,2.5))
ax.matshow(confmat, cmap=plt.cm.Blues,alpha=0.3)
for i in range(confmat.shape[0]):for j in range(confmat.shape[1]):ax.text(x=j,y=i,s=confmat[i,j],va='center',ha='center')
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.show()
绘制ROC曲线
# 绘制ROC曲线:
from sklearn.metrics import roc_curve,auc
from sklearn.metrics import make_scorer,f1_score
scorer = make_scorer(f1_score,pos_label=0)
gs = GridSearchCV(estimator=pipe_svc,param_grid=param_grid,scoring=scorer,cv=10)
y_pred = gs.fit(X_train,y_train).decision_function(X_test)
#y_pred = gs.predict(X_test)
fpr,tpr,threshold = roc_curve(y_test, y_pred) ###计算真阳率和假阳率
roc_auc = auc(fpr,tpr) ###计算auc的值
plt.figure()
lw = 2
plt.figure(figsize=(7,5))
plt.plot(fpr, tpr, color='darkorange',lw=lw, label='ROC curve (area = %0.2f)' % roc_auc) ###假阳率为横坐标,真阳率为纵坐标做曲线
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([-0.05, 1.0])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic ')
plt.legend(loc="lower right")
plt.show()
4. 超参数调优
关于超参数,超参数调优的相关理论,在task4 已经进行了描述。
而超参数调优,主要有两种方法:网格搜索和随机搜索。
# 方式1:网格搜索GridSearchCV()
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import timestart_time = time.time()
pipe_svc = make_pipeline(StandardScaler(),SVC(random_state=1))
param_range = [0.0001,0.001,0.01,0.1,1.0,10.0,100.0,1000.0]
param_grid = [{'svc__C':param_range,'svc__kernel':['linear']},{'svc__C':param_range,'svc__gamma':param_range,'svc__kernel':['rbf']}]
gs = GridSearchCV(estimator=pipe_svc,param_grid=param_grid,scoring='accuracy',cv=10,n_jobs=-1)
gs = gs.fit(X_train,y_train)
end_time = time.time()
print("网格搜索经历时间:%.3f S" % float(end_time-start_time))
print(gs.best_score_)
print(gs.best_params_)
# 方式2:随机网格搜索RandomizedSearchCV()
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
import timestart_time = time.time()
pipe_svc = make_pipeline(StandardScaler(),SVC(random_state=1))
param_range = [0.0001,0.001,0.01,0.1,1.0,10.0,100.0,1000.0]
param_grid = [{'svc__C':param_range,'svc__kernel':['linear']},{'svc__C':param_range,'svc__gamma':param_range,'svc__kernel':['rbf']}]
# param_grid = [{'svc__C':param_range,'svc__kernel':['linear','rbf'],'svc__gamma':param_range}]
gs = RandomizedSearchCV(estimator=pipe_svc, param_distributions=param_grid,scoring='accuracy',cv=10,n_jobs=-1)
gs = gs.fit(X_train,y_train)
end_time = time.time()
print("随机网格搜索经历时间:%.3f S" % float(end_time-start_time))
print(gs.best_score_)
print(gs.best_params_)
这篇关于Datawhale第23期组队学习—集成学习—task6 模型评估与超参数调优的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!