Machine Learning Notes: Automatic XGBoost Hyperparameter Tuning with Hyperopt


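Introduction to Hyperopt

Hyperparameter optimization is an important step in getting the best performance out of a model. scikit-learn offers two popular options, GridSearchCV and RandomizedSearchCV. Hyperopt is a Python library for "distributed asynchronous algorithm configuration / hyperparameter optimization". It provides search algorithms that can outperform random search and can find results comparable to those of grid search. It tunes parameters with Bayesian-style optimization (the algorithm used later in this article is the Tree-structured Parzen Estimator, tpe.suggest), and it can be combined with MongoDB for distributed tuning, so that reasonably good parameters are found quickly.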

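Hyperopt has four key components:

The objective function to minimize (the first argument of fmin)
The parameter search space (built from hp expressions)
The store for the results of each evaluation (a Trials object)
The search algorithm to use (for example tpe.suggest)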

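Introduction to XGBoost

XGBoost is an optimized distributed gradient boosting library that implements machine learning algorithms under the Gradient Boosting framework. The most important factor behind its success is its scalability in all scenarios. Its models are also interpretable, and it is widely used in industrial systems. Compared with GBDT, which uses only first-order derivative information, XGBoost uses both the first and the second derivative of the loss, and it adds a regularization term to the objective function to control model complexity and prevent overfitting.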
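
To make the role of the two derivatives concrete, here is a small sketch of the regularized Newton step for a single leaf, w* = -G / (H + lambda), where G and H are the sums of the first and second derivatives of the loss over the samples in the leaf. The helper names and the toy data are assumptions for illustration; this follows the formula from the XGBoost paper and is not the library's internal code.

```python
import math


def logistic_grad_hess(y_true, margin):
    """First and second derivative of the log loss with respect to the raw margin."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return p - y_true, p * (1.0 - p)


def leaf_weight(grads, hessians, reg_lambda=1.0):
    """Regularized Newton step for one leaf: w* = -G / (H + lambda)."""
    return -sum(grads) / (sum(hessians) + reg_lambda)


# Four samples that fall into the same leaf, all starting from margin 0.
labels = [1, 1, 0, 1]
margins = [0.0, 0.0, 0.0, 0.0]
pairs = [logistic_grad_hess(y, m) for y, m in zip(labels, margins)]
grads = [g for g, _ in pairs]
hessians = [h for _, h in pairs]

w_regularized = leaf_weight(grads, hessians, reg_lambda=1.0)
w_unregularized = leaf_weight(grads, hessians, reg_lambda=0.0)
print(w_regularized, w_unregularized)
```

The L2 term lambda appears in the denominator, so a larger lambda pulls the leaf weight toward zero, which is how the regularization term controls model complexity.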

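XGBoost has three types of parameters: general parameters, booster parameters and task parameters.

General parameters select the booster used for boosting, usually a tree model or a linear model
Booster parameters depend on the booster you have chosen and are the ones tuned to improve model performance
Task parameters define the learning scenario, for example a regression task or a binary classification task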

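An XGBoost model is usually trained with xgboost.train. The signature below is the one from the XGBoost release this note was written against; later releases have changed some of the arguments, so check the documentation of your installed version.

xgboost.train(params, dtrain, num_boost_round=10, evals=(), \
obj=None, feval=None, maximize=False, early_stopping_rounds=None, \
evals_result=None, verbose_eval=True, learning_rates=None, \
xgb_model=None, callbacks=None)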

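Here params is a dict:

params = {
    'booster': 'gbtree',
    'min_child_weight': 100,
    'eta': 0.02,
    'colsample_bytree': 0.7,
    'max_depth': 12,
    'subsample': 0.7,
    'alpha': 1,
    'gamma': 1,
    'silent': 1,                # newer releases use 'verbosity' instead
    'objective': 'reg:linear',  # named 'reg:squarederror' in newer releases
    'seed': 12,
    # verbose_eval is an argument of xgboost.train, not a booster parameter,
    # so it is passed to train() rather than set here
}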

General Parameters

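booster [default=gbtree]
Selects the booster. The two options covered here are gbtree and gblinear: gbtree boosts with tree-based models, gblinear boosts with linear models. The default is gbtree.
silent [default=0]
0 prints run-time messages, 1 runs silently without printing them. The default is 0. Newer XGBoost releases replace this parameter with verbosity.
nthread
Number of threads XGBoost uses. The default is the maximum number of threads available on the system.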

Booster Parameters

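eta [default=0.3]
- Step-size shrinkage used in each update to prevent overfitting. After each boosting step the weights of the new features are available directly, and eta shrinks them to make the boosting process more conservative. The default is 0.3.
- Reducing the weight of each step makes the model more robust. Range: [0, 1]
gamma [default=0]
- A node is split only if the split reduces the loss. gamma specifies the minimum loss reduction required to split a node. The larger the value, the more conservative the algorithm. The right value depends closely on the loss function, so it needs to be tuned.
- Range: [0, ∞)
max_depth [default=6]
- Maximum depth of a tree. The default is 6. The larger max_depth is, the more specific and local the patterns the model learns from the samples.
- Range: [1, ∞)
min_child_weight [default=1]
- Minimum sum of instance weights required in a child node. If a split would produce a leaf whose instance-weight sum is below min_child_weight, the node is not split any further. In a linear regression task this is simply the minimum number of instances required in each node. The larger this parameter, the more conservative the algorithm.
- Range: [0, ∞)
max_delta_step [default=0]
- Limits the maximum step by which each tree's weight estimate may change. A value of 0 means no constraint. A positive value makes the algorithm more conservative.
- Range: [0, ∞)
subsample [default=1]
- Fraction of the training set sampled to build each tree. Setting it to 0.5 means XGBoost randomly draws 50% of the samples before growing a tree, which helps prevent overfitting. Lower values make the algorithm more conservative, but a value that is too small can lead to underfitting.
- Range: (0, 1]
colsample_bytree [default=1]
- Fraction of features sampled when building each tree. The default is 1.
- Range: (0, 1]
colsample_bylevel
- Fraction of columns sampled for each split at each level of the tree.
- Range: (0, 1]
lambda
- L2 regularization term on the weights
- Controls the regularization part of XGBoost.
alpha
- L1 regularization term on the weights
- Useful for very high-dimensional data, where it makes the algorithm run faster.
scale_pos_weight
- When the classes are highly imbalanced, setting this parameter to a positive value helps the algorithm converge faster.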
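
The interaction of gamma, lambda and min_child_weight is easiest to see in the split-gain formula from the XGBoost paper. The sketch below is a simplified illustration of that formula with hypothetical helper names and made-up gradient sums; it is not how the library is implemented internally.

```python
def structure_score(g_sum, h_sum, reg_lambda):
    """Score of a node given its gradient sum G and hessian sum H: G^2 / (H + lambda)."""
    return g_sum * g_sum / (h_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a candidate split, minus the gamma penalty for the extra leaf."""
    parent = structure_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (structure_score(g_left, h_left, reg_lambda)
                + structure_score(g_right, h_right, reg_lambda))
    return 0.5 * (children - parent) - gamma


def should_split(g_left, h_left, g_right, h_right,
                 reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Accept the split only if both children are heavy enough and the gain is positive."""
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0.0


# Made-up gradient and hessian sums for the two children of a candidate split.
candidate = dict(g_left=-4.0, h_left=2.0, g_right=3.0, h_right=2.0)

print(should_split(**candidate))                        # accepted with the defaults
print(should_split(**candidate, gamma=5.0))             # rejected: gain does not exceed gamma
print(should_split(**candidate, min_child_weight=3.0))  # rejected: children are too light
```

Raising gamma or min_child_weight rejects more splits, which is why both parameters make the algorithm more conservative.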

Task Parameters

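objective [default=reg:linear]
- The default reg:linear is named reg:squarederror in newer releases.
- binary:logistic: logistic regression for binary classification. Returns the predicted probability, not the class.
- multi:softmax: multiclass classification using softmax. Returns the predicted class, not the probabilities.
- multi:softprob: same as multi:softmax, but returns the probability of each data point belonging to each class.
base_score [default=0.5]
eval_metric
- Evaluation metric for the validation data. Each objective has its own default metric.
- rmse, logloss, auc and so on
seed
- Random seed. Fixing it makes results reproducible, and it can also be used when tuning parameters.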
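
The difference between the three classification objectives lies in how the raw model output (the margin) is transformed. The plain-Python sketch below mimics those transformations on made-up margins; the function names and values are assumptions, and no XGBoost call is involved.

```python
import math


def sigmoid(margin):
    """binary:logistic style output: probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-margin))


def softmax(margins):
    """multi:softprob style output: one probability per class."""
    top = max(margins)                       # subtract the max for numerical stability
    exps = [math.exp(m - top) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]


binary_probability = sigmoid(0.0)

class_margins = [2.0, 1.0, 0.1]
class_probabilities = softmax(class_margins)

# multi:softmax style output: only the index of the most probable class.
predicted_class = max(range(len(class_probabilities)),
                      key=class_probabilities.__getitem__)

print(binary_probability)
print(class_probabilities)
print(predicted_class)
```

With binary:logistic you still have to apply a threshold to the probability to obtain a class label, whereas multi:softmax has already taken the argmax for you.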

GridSearchCV

from sklearn.model_selection import GridSearchCV, cross_val_score
from xgboost import XGBClassifier

# X, y: feature matrix and labels, prepared beforehand
gs = GridSearchCV(estimator=XGBClassifier(),
                  param_grid={'max_depth': [3, 6, 9], 'learning_rate': [0.001, 0.01, 0.05]},
                  cv=2)
scores = cross_val_score(gs, X, y, cv=2)  # outer 2-fold CV wrapped around the grid search
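
Grid search gets expensive quickly because every combination is trained on every fold. The sketch below counts the model fits implied by the snippet above using only the standard library; it assumes GridSearchCV's default refit=True, which adds one final fit per search.

```python
from itertools import product

param_grid = {'max_depth': [3, 6, 9], 'learning_rate': [0.001, 0.01, 0.05]}

# Enumerate every combination, as an exhaustive grid search does.
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]

inner_cv = 2   # cv=2 inside GridSearchCV
outer_cv = 2   # cv=2 in cross_val_score

# Each search fits every candidate on every inner fold, then refits the best one once.
fits_per_search = len(candidates) * inner_cv + 1
total_fits = outer_cv * fits_per_search

print(len(candidates), fits_per_search, total_fits)
```

Adding a third parameter with three values would triple the number of candidates, which is the motivation for a guided search such as Hyperopt's.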

Automatic tuning with Hyperopt

import numpy as np
import xgboost as xgb
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.metrics import roc_auc_score

SEED = 0  # arbitrary value; the original listing uses SEED without defining it


def score(params):
    print("Training with params: ")
    print(params)
    params = dict(params)  # work on a copy so the caller's dict is not modified
    num_round = int(params.pop('n_estimators'))
    dtrain = xgb.DMatrix(train_features, label=y_train)
    dvalid = xgb.DMatrix(valid_features, label=y_valid)
    watchlist = [(dvalid, 'eval'), (dtrain, 'train')]
    gbm_model = xgb.train(params, dtrain, num_round,
                          evals=watchlist,
                          verbose_eval=True)
    # No early stopping is used, so predict with all num_round trees.
    predictions = gbm_model.predict(dvalid)
    auc = roc_auc_score(y_valid, predictions)
    # TODO: Add the importance for the selected features
    print("\tScore {0}\n\n".format(auc))
    # The score function should return the loss (1-score)
    # since the optimize function looks for the minimum
    loss = 1 - auc
    return {'loss': loss, 'status': STATUS_OK}


def optimize(trials, random_state=SEED):
    """
    This is the optimization function that given a space (space here) of
    hyperparameters and a scoring function (score here), finds the best hyperparameters.
    """
    # To learn more about XGBoost parameters, head to this page:
    # https://github.com/dmlc/xgboost/blob/master/doc/parameter.md
    space = {
        'n_estimators': hp.quniform('n_estimators', 100, 1000, 1),
        'eta': hp.quniform('eta', 0.025, 0.5, 0.025),
        # A problem with max_depth casted to float instead of int with
        # the hp.quniform method.
        'max_depth': hp.choice('max_depth', np.arange(1, 14, dtype=int)),
        'min_child_weight': hp.quniform('min_child_weight', 1, 6, 1),
        'subsample': hp.quniform('subsample', 0.5, 1, 0.05),
        'gamma': hp.quniform('gamma', 0.5, 1, 0.05),
        'colsample_bytree': hp.quniform('colsample_bytree', 0.5, 1, 0.05),
        'eval_metric': 'auc',
        'objective': 'binary:logistic',
        # Increase this number if you have more cores. Otherwise, remove it
        # and it will default to the maximum number.
        'nthread': 4,
        'booster': 'gbtree',
        'tree_method': 'exact',
        'verbosity': 0,  # replaces the deprecated 'silent': 1
        'seed': random_state,
    }
    # Use the fmin function from Hyperopt to find the best hyperparameters
    best = fmin(score, space, algo=tpe.suggest, trials=trials, max_evals=250)
    return best


# Other feature-processing steps go here. They must define
# train_features, valid_features, y_train and y_valid.

if __name__ == "__main__":
    trials = Trials()
    best = optimize(trials)
    print("Best parameters:", best)
