XGBoost Parameter Settings

2024-06-21 07:32

The parameters that can be specified in XGBoost are listed below, each with a detailed description.

There are three types of parameters: general parameters, booster parameters, and task parameters.

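General Parameters

  • booster [default=gbtree]
    • which booster to use: gbtree (tree-based models) or gblinear (linear models)

  • silent [default=0]
    • 0 prints running messages, 1 means silent mode (newer XGBoost releases deprecate this parameter in favor of verbosity)

  • nthread
    • number of threads used to run xgboost; defaults to the maximum number of threads available

  • num_pbuffer [set automatically by XGBoost, no need to be set by the user]
    • size of the prediction buffer, normally set to the number of training instances. The buffers are used to save the prediction results of the last boosting step.

  • num_feature [set automatically by XGBoost, no need to be set by the user]
    • feature dimension used in boosting, set to the maximum dimension of the features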
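As a minimal sketch, the general parameters above could be collected into a plain Python dictionary of the kind the xgboost Python package accepts as its params argument. The values are illustrative assumptions, not recommendations.

```python
import os

# General parameters from the list above, as a plain dict.
# The values are illustrative assumptions, not recommendations.
general_params = {
    "booster": "gbtree",             # or "gblinear"
    "silent": 0,                     # 0 prints messages, 1 is silent mode
    "nthread": os.cpu_count() or 1,  # explicit thread count; XGBoost defaults to the maximum
}

# num_pbuffer and num_feature are intentionally absent: XGBoost sets them itself.
print(general_params)
```

A dictionary like this is typically merged with the booster and task parameters and then passed as the first argument to xgboost.train.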

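Booster Parameters

  • eta [default=0.3, can be thought of as the learning rate]
    • step size shrinkage used in each update to prevent overfitting. After each boosting step, the algorithm directly obtains the weights of the new features, and eta shrinks these weights to make the boosting process more conservative.
    • range: [0,1]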
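The effect of eta can be sketched in a few lines of plain Python. This is an illustration only: the function name and the tree outputs are hypothetical, and the tree outputs are held fixed, whereas in real training later trees would adapt to the shrunken updates.

```python
def boosted_prediction(base_score, tree_outputs, eta=0.3):
    """Add up the per-tree outputs for one instance, each shrunk by eta."""
    prediction = base_score
    for output in tree_outputs:
        prediction += eta * output
    return prediction


# Hypothetical leaf outputs of three trees for a single instance.
tree_outputs = [1.0, 0.5, 0.25]

full_step = boosted_prediction(0.5, tree_outputs, eta=1.0)  # no shrinkage
shrunk = boosted_prediction(0.5, tree_outputs, eta=0.3)     # default shrinkage

print(full_step, shrunk)
```

With a smaller eta each tree moves the prediction less, which is why a lower learning rate usually needs more boosting rounds.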

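  • gamma [default=0, alias: min_split_loss]
    • minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm.
    • range: [0,∞]

  • max_depth [default=6]
    • maximum depth of a tree
    • range: [1,∞]

  • min_child_weight [default=1]
    • minimum sum of instance weights (hessian) needed in a child. If a tree partition step results in a leaf node whose sum of instance weights is less than min_child_weight, the building process gives up further partitioning. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger the value, the more conservative the algorithm.
    • range: [0,∞]

  • max_delta_step [default=0]
    • If the value is 0, there is no constraint. If it is set to a positive value, it makes each update step more conservative. Usually this parameter is not needed, but it can help in logistic regression when the classes are extremely imbalanced; setting it to a value of 1-10 may help control the update.
    • range: [0,∞]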

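  • subsample [default=1]
    • subsample ratio of the training instances. Setting it to 0.5 means that xgboost randomly samples half of the instances to grow trees, which helps prevent overfitting.
    • range: (0,1]

  • colsample_bytree [default=1]
    • subsample ratio of columns when constructing each tree
    • range: (0,1]

  • colsample_bylevel [default=1]
    • subsample ratio of columns for each split, in each level of the tree. This parameter is seldom used, because subsample and colsample_bytree can achieve a similar effect.
    • range: (0,1]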
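The idea behind subsample and colsample_bytree can be sketched with the standard library. The helper below is hypothetical and only mimics the concept (pick a fraction of rows and columns before growing a tree); it is not XGBoost's actual sampling routine.

```python
import random


def sample_rows_and_columns(n_rows, n_cols, subsample=0.5,
                            colsample_bytree=0.5, seed=0):
    """Pick the row and column indices one tree would be grown on."""
    rng = random.Random(seed)
    n_row_keep = max(1, int(n_rows * subsample))
    n_col_keep = max(1, int(n_cols * colsample_bytree))
    rows = sorted(rng.sample(range(n_rows), n_row_keep))
    cols = sorted(rng.sample(range(n_cols), n_col_keep))
    return rows, cols


rows, cols = sample_rows_and_columns(n_rows=100, n_cols=10)
print(len(rows), "rows and", len(cols), "columns kept for this tree")
```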

  • lambda [default=1, alias: reg_lambda]
    • L2 regularization term on weights

  • alpha [default=0, alias: reg_alpha]
    • L1 regularization term on weights
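How gamma, min_child_weight and lambda interact can be sketched with the split gain formula from the XGBoost paper. This is a simplified illustration under stated assumptions: alpha is taken as 0, the function names are hypothetical, G and H stand for the sums of gradients and hessians in a node, and the library's internal scaling of the gain may differ.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split minus the gamma penalty (alpha assumed 0)."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    gain = 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right))
    return gain - gamma


def split_allowed(g_left, h_left, g_right, h_right,
                  reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Reject splits with a too-light child or a non-positive gain."""
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0


# Hypothetical gradient and hessian sums for the two children of a candidate split.
print(split_gain(-4.0, 4.0, 4.0, 4.0))
print(split_allowed(-4.0, 4.0, 4.0, 4.0, gamma=5.0))
print(split_allowed(-4.0, 4.0, 4.0, 4.0, min_child_weight=5.0))
```

Raising gamma, min_child_weight or lambda each makes it harder for a split to pass, which matches the "more conservative" wording in the descriptions above.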

  • tree_method, string [default='auto']
    • The tree construction algorithm used in XGBoost (see the description in the reference paper)
    • The distributed and external memory versions only support the approximate algorithm.
    • Choices: {'auto', 'exact', 'approx'}
      • 'auto': Use a heuristic to choose the faster one.
        • For small to medium datasets, the exact greedy algorithm will be used.
        • For very large datasets, the approximate algorithm will be chosen.
        • Because the old behavior was to always use exact greedy on a single machine, the user will get a message when the approximate algorithm is chosen, to notify them of this choice.

      • 'exact': Exact greedy algorithm.
      • 'approx': Approximate greedy algorithm using sketching and histogram.

  • sketch_eps, [default=0.03]
    • This is only used for the approximate greedy algorithm.
    • This roughly translates into O(1 / sketch_eps) bins. Compared to directly selecting the number of bins, this comes with a theoretical guarantee on sketch accuracy.
    • Usually the user does not have to tune this, but consider setting it to a lower number for more accurate enumeration.
    • range: (0, 1)

  • scale_pos_weight, [default=1]
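    • when the classes are highly imbalanced, setting this parameter to a positive value can make the algorithm converge faster
    • a typical value to consider: sum(negative cases) / sum(positive cases). See the Higgs Kaggle competition demo for examples (R, py1, py2, py3).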
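The suggested value sum(negative cases) / sum(positive cases) can be computed directly from the training labels. A minimal sketch, assuming binary labels encoded as 0 and 1; the function name and the label list are hypothetical.

```python
def suggested_scale_pos_weight(labels):
    """Return sum(negative cases) / sum(positive cases) for 0/1 labels."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("no positive cases in labels")
    return n_neg / n_pos


# Hypothetical imbalanced label set: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
weight = suggested_scale_pos_weight(labels)
print(weight)
```

The result would then be placed in the parameter dictionary under the scale_pos_weight key.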

  • updater, [default='grow_colmaker,prune']
    • A comma-separated string defining the sequence of tree updaters to run, providing a modular way to construct and to modify the trees. This is an advanced parameter that is usually set automatically, depending on some other parameters. However, it can also be set explicitly by a user. The following updater plugins exist:
      • 'grow_colmaker': non-distributed column-based construction of trees.
      • 'distcol': distributed tree construction with column-based data splitting mode.
      • 'grow_histmaker': distributed tree construction with row-based data splitting based on global proposal of histogram counting.
      • 'grow_local_histmaker': based on local histogram counting.
      • 'grow_skmaker': uses the approximate sketching algorithm.
      • 'sync': synchronizes trees in all distributed nodes.
      • 'refresh': refreshes tree's statistics and/or leaf values based on the current data. Note that no random subsampling of data rows is performed.
      • 'prune': prunes the splits where loss < min_split_loss (or gamma).

    • In a distributed setting, the implicit updater sequence value would be adjusted as follows:
      • 'grow_histmaker,prune' when dsplit='row' (or default) and prob_buffer_row == 1 (or default); or when data has multiple sparse pages
      • 'grow_histmaker,refresh,prune' when dsplit='row' and prob_buffer_row < 1
      • 'distcol' when dsplit='col'

  • refresh_leaf, [default=1]
    • This is a parameter of the 'refresh' updater plugin. When this flag is true, tree leaves as well as tree nodes' stats are updated. When it is false, only node stats are updated.

  • process_type, [default='default']
    • A type of boosting process to run.
    • Choices: {'default', 'update'}
      • 'default': the normal boosting process which creates new trees.
      • 'update': starts from an existing model and only updates its trees. In each boosting iteration, a tree from the initial model is taken, a specified sequence of updater plugins is run for that tree, and a modified tree is added to the new model. The new model would have either the same or a smaller number of trees, depending on the number of boosting iterations performed. Currently, the following built-in updater plugins can be meaningfully used with this process type: 'refresh', 'prune'. With 'update', one cannot use updater plugins that create new trees.
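As a rough sketch, a parameter set for the 'update' process type might combine process_type, updater and refresh_leaf as shown below. The check_update_config helper is hypothetical and only encodes the rule stated above, that tree-creating updaters cannot be used with 'update'; the set of tree-creating updaters is an assumption derived from the plugin descriptions in this list.

```python
# Updaters from the list above that construct new trees (assumed from their descriptions).
TREE_CREATING_UPDATERS = {
    "grow_colmaker", "distcol", "grow_histmaker",
    "grow_local_histmaker", "grow_skmaker",
}

# Illustrative settings for refreshing the trees of an existing model.
refresh_params = {
    "process_type": "update",
    "updater": "refresh",
    "refresh_leaf": 1,  # update leaf values as well as node stats
}


def check_update_config(params):
    """Return True if the updater sequence is compatible with process_type."""
    if params.get("process_type", "default") != "update":
        return True
    updaters = [u.strip() for u in params.get("updater", "").split(",") if u.strip()]
    return bool(updaters) and not any(u in TREE_CREATING_UPDATERS for u in updaters)


print(check_update_config(refresh_params))
```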

Task Parameters

  • objective [ default=reg:linear ] defines the loss function to be minimized. The most commonly used values are:
    • "reg:linear" -- linear regression (renamed reg:squarederror in newer XGBoost releases)
    • "reg:logistic" -- logistic regression
    • "binary:logistic" -- logistic regression for binary classification; outputs the predicted probability (not the class)
    • "binary:logitraw" -- logistic regression for binary classification; outputs the score before the logistic transformation
    • "count:poisson" -- Poisson regression for count data; outputs the mean of the Poisson distribution
      • max_delta_step is set to 0.7 by default in Poisson regression (used to safeguard optimization)

    • "multi:softmax" -- sets XGBoost to do multiclass classification; you also need to set num_class (the number of classes)
    • "multi:softprob" -- outputs a probability matrix of dimension ndata * nclass
    • "rank:pairwise" -- sets XGBoost to do a ranking task by minimizing the pairwise loss
    • "reg:gamma" -- gamma regression with log-link. The output is the mean of a gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be gamma-distributed.
    • "reg:tweedie" -- Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be Tweedie-distributed.
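The difference between the raw-score and probability outputs of the classification objectives can be sketched with the standard logistic and softmax transformations. The margin values below are made up for illustration, and the snippet does not call XGBoost itself.

```python
import math


def sigmoid(margin):
    """Logistic transformation of a raw score."""
    return 1.0 / (1.0 + math.exp(-margin))


def softmax(margins):
    """Turn per-class raw scores into probabilities that sum to 1."""
    shift = max(margins)  # subtract the max for numerical stability
    exps = [math.exp(m - shift) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]


# Binary case: "binary:logitraw" would report the raw score,
# "binary:logistic" the probability obtained from it.
raw_score = 0.0
probability = sigmoid(raw_score)

# Multiclass case for one instance: "multi:softprob" reports one probability
# per class, "multi:softmax" only the most likely class.
class_margins = [2.0, 1.0, 0.1]
class_probs = softmax(class_margins)
predicted_class = max(range(len(class_probs)), key=class_probs.__getitem__)

print(probability, class_probs, predicted_class)
```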

  • base_score [ default=0.5 ]
    • the initial prediction score of all instances, global bias
    • for sufficient number of iterations, changing this value will not have too much effect.

  • eval_metric [ default chosen automatically according to the objective ]
    • The available choices are:
      • "rmse": root mean squared error
      • "mae": mean absolute error
      • "logloss": negative log-likelihood
      • "error": binary classification error rate
      • "error@t": binary classification error rate computed with t as the threshold instead of 0.5
      • "merror": multiclass classification error rate, calculated as #(wrong cases)/#(all cases)
      • "mlogloss": multiclass logloss
      • "auc": area under the ROC curve, for ranking evaluation
      • "ndcg": Normalized Discounted Cumulative Gain
      • "map": Mean Average Precision
      • "ndcg@n", "map@n": n can be assigned as an integer to cut off the top positions in the lists for evaluation.
      • "ndcg-", "map-", "ndcg@n-", "map@n-": In XGBoost, NDCG and MAP evaluate the score of a list without any positive samples as 1. By adding "-" to the evaluation metric, XGBoost will evaluate these scores as 0, to be consistent under some conditions.
      • "poisson-nloglik": negative log-likelihood for Poisson regression
      • "gamma-nloglik": negative log-likelihood for gamma regression
      • "gamma-deviance": residual deviance for gamma regression
      • "tweedie-nloglik": negative log-likelihood for Tweedie regression (at a specified value of the tweedie_variance_power parameter)
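A few of the metrics above are simple enough to write out by hand, which can help when checking what XGBoost reports. The functions below are a plain-Python sketch with hypothetical names and made-up data, not the library's implementation.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def logloss(y_true, prob, eps=1e-15):
    """Negative log-likelihood for 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, prob):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


def error_at(y_true, prob, t=0.5):
    """Binary error rate, predicting positive when the probability exceeds t."""
    wrong = sum(1 for y, p in zip(y_true, prob) if (p > t) != (y == 1))
    return wrong / len(y_true)


# Made-up labels and predicted probabilities.
y_true = [1, 0, 1, 0]
prob = [0.9, 0.2, 0.4, 0.6]

print(error_at(y_true, prob), error_at(y_true, prob, t=0.3), logloss(y_true, prob))
```

On this toy data, lowering the threshold from 0.5 to 0.3 changes which instances count as errors, which is the point of the "error@t" variant.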

  • seed [ default=0 ]
    • random number seed.



