XGBoost Parameter Settings

2024-06-21 07:32
Tags: parameters, xgboost, settings

This article introduces XGBoost's parameter settings and is intended as a reference for developers working with the library.

Below are the parameters that can be specified in XGBoost, with detailed descriptions.

There are three groups of parameters: general parameters, booster parameters, and task parameters.

General Parameters

  • booster [default=gbtree]
    • which booster to use: gbtree (tree-based) or gblinear (linear)

  • silent [default=0]
    • 0 prints running messages; 1 means silent mode

  • nthread
    • number of parallel threads used to run XGBoost; defaults to the maximum number of threads available

  • num_pbuffer [set automatically by XGBoost; no need to set manually]
    • size of the prediction buffer, normally set to the number of training instances. The buffer is used to save the prediction results of the last boosting step.

  • num_feature [set automatically by XGBoost; no need to set manually]
    • feature dimension used in boosting; set to the maximum dimension of the features
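
As a sketch, the general parameters above can be collected into a plain Python dict and passed to `xgb.train(params, dtrain)` in the Python `xgboost` package (the specific values here are illustrative assumptions, not recommendations; num_pbuffer and num_feature are set automatically and therefore omitted):

```python
# General parameters for xgb.train(params, dtrain) -- a minimal sketch.
# num_pbuffer and num_feature are set automatically by XGBoost and omitted.
general_params = {
    "booster": "gbtree",  # tree booster; use "gblinear" for a linear model
    "silent": 0,          # 0 = print running messages, 1 = silent mode
    "nthread": 4,         # number of threads; omit to use the maximum available
}
```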

Booster Parameters

  • eta [default=0.3; can be viewed as the learning rate]
    • step-size shrinkage used in updates to prevent overfitting. After each boosting step, the weights of new features can be obtained directly; eta shrinks these weights to make the boosting process more conservative. The default value is 0.3.
    • range: [0,1]

  • gamma [default=0, alias: min_split_loss]
    • the minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm.
    • range: [0,∞]

  • max_depth [default=6]
    • the maximum depth of a tree
    • range: [1,∞]

  • min_child_weight [default=1]
    • the minimum sum of observation weights required in a child node. If a step of tree growth produces a leaf whose sum of observation weights is less than min_child_weight, that step is abandoned. In linear regression mode, this simply corresponds to the minimum number of observations required per node. The larger the value, the more conservative the algorithm.
    • range: [0,∞]

  • max_delta_step [default=0]
    • a value of 0 means no constraint; a positive value makes each update step more conservative. This parameter usually does not need to be set, but it is useful when the classes in a logistic regression training set are extremely imbalanced; setting it to a value in the range 1-10 can help control each update step.
    • range: [0,∞]

  • subsample [default=1]
    • the subsample ratio of the training instances. Setting it to 0.5 means XGBoost randomly samples half of the instances to grow the trees, which helps prevent overfitting.
    • range: (0,1]

  • colsample_bytree [default=1]
    • the subsample ratio of columns (features) used when constructing each tree
    • range: (0,1]

  • colsample_bylevel [default=1]
    • the subsample ratio of columns for each split, at each level of the tree. This parameter is rarely used, since subsample and colsample_bytree can serve the same purpose.
    • range: (0,1]

  • lambda [default=1, alias: reg_lambda]
    • L2 regularization term on the weights

  • alpha [default=0, alias: reg_alpha]
    • L1 regularization term on the weights
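
Putting the booster parameters above together, a starting point for tuning against overfitting might look like the following sketch (the specific values are illustrative assumptions, not recommendations from the source):

```python
# Booster parameters -- illustrative values, not tuned recommendations.
booster_params = {
    "eta": 0.1,               # smaller step-size shrinkage = more conservative
    "gamma": 0,               # minimum loss reduction to make a further split
    "max_depth": 6,           # maximum tree depth
    "min_child_weight": 1,    # minimum sum of observation weights in a child
    "subsample": 0.8,         # row sampling per tree, helps prevent overfitting
    "colsample_bytree": 0.8,  # column sampling per tree
    "lambda": 1,              # L2 regularization on weights
    "alpha": 0,               # L1 regularization on weights
}
```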

  • tree_method, string [default='auto']
    • The tree construction algorithm used in XGBoost (see the description in the reference paper).
    • The distributed and external-memory versions only support the approximate algorithm.
    • Choices: {'auto', 'exact', 'approx'}
      • 'auto': Use a heuristic to choose the faster one.
        • For small to medium datasets, the exact greedy algorithm is used.
        • For very large datasets, the approximate algorithm is chosen.
        • Because the old behavior was to always use the exact greedy algorithm on a single machine, the user gets a message when the approximate algorithm is chosen, to be notified of this choice.

      • 'exact': Exact greedy algorithm.
      • 'approx': Approximate greedy algorithm using sketching and histogram.

  • sketch_eps [default=0.03]
    • Only used by the approximate greedy algorithm.
    • This roughly translates into O(1 / sketch_eps) bins. Compared to directly selecting the number of bins, this comes with a theoretical guarantee on sketch accuracy.
    • Users usually do not have to tune this, but consider setting it to a lower number for more accurate enumeration.
    • range: (0, 1)
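
The relationship between sketch_eps and the approximate number of candidate bins can be sketched as:

```python
# Roughly O(1 / sketch_eps) candidate bins in the approximate algorithm.
sketch_eps = 0.03              # the default value
approx_bins = 1 / sketch_eps   # about 33 bins
# a smaller sketch_eps gives more bins, i.e. a more accurate enumeration
```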

  • scale_pos_weight [default=1]
    • Setting this to a positive value helps the algorithm converge faster when the classes are highly imbalanced.
    • A typical value to consider: sum(negative cases) / sum(positive cases). See the Higgs Kaggle competition demo for examples.
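
The suggested heuristic sum(negative cases) / sum(positive cases) can be computed directly from the training labels; a minimal sketch with a hypothetical imbalanced dataset:

```python
# Compute scale_pos_weight from binary labels (1 = positive, 0 = negative).
labels = [0] * 900 + [1] * 100          # hypothetical imbalanced data
negatives = sum(1 for y in labels if y == 0)
positives = sum(1 for y in labels if y == 1)
scale_pos_weight = negatives / positives
print(scale_pos_weight)  # 9.0
```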

  • updater, [default='grow_colmaker,prune']
    • A comma-separated string defining the sequence of tree updaters to run, providing a modular way to construct and modify trees. This is an advanced parameter that is usually set automatically, depending on other parameters. However, it can also be set explicitly by the user. The following updater plugins exist:
      • 'grow_colmaker': non-distributed column-based construction of trees.
      • 'distcol': distributed tree construction with column-based data splitting mode.
      • 'grow_histmaker': distributed tree construction with row-based data splitting based on global proposal of histogram counting.
      • 'grow_local_histmaker': based on local histogram counting.
      • 'grow_skmaker': uses the approximate sketching algorithm.
      • 'sync': synchronizes trees in all distributed nodes.
      • 'refresh': refreshes tree's statistics and/or leaf values based on the current data. Note that no random subsampling of data rows is performed.
      • 'prune': prunes the splits where loss < min_split_loss (or gamma).

    • In a distributed setting, the implicit updater sequence value would be adjusted as follows:
      • 'grow_histmaker,prune' when dsplit='row' (or default) and prob_buffer_row == 1 (or default); or when data has multiple sparse pages
      • 'grow_histmaker,refresh,prune' when dsplit='row' and prob_buffer_row < 1
      • 'distcol' when dsplit='col'

  • refresh_leaf, [default=1]
    • This is a parameter of the 'refresh' updater plugin. When this flag is true, tree leaves as well as tree nodes' stats are updated. When it is false, only node stats are updated.

  • process_type, [default='default']
    • A type of boosting process to run.
    • Choices: {'default', 'update'}
      • 'default': the normal boosting process which creates new trees.
      • 'update': starts from an existing model and only updates its trees. In each boosting iteration, a tree from the initial model is taken, the specified sequence of updater plugins is run for that tree, and the modified tree is added to the new model. The new model has either the same or a smaller number of trees, depending on the number of boosting iterations performed. Currently, the following built-in updater plugins can be meaningfully used with this process type: 'refresh', 'prune'. With 'update', one cannot use updater plugins that create new trees.
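
As a sketch of the 'update' process type described above, one could re-run the 'refresh' and 'prune' updaters over an existing model's trees (this assumes the Python package's xgb.train, whose xgb_model argument supplies the initial model):

```python
# Parameters for re-processing an existing model's trees -- a sketch.
update_params = {
    "process_type": "update",    # do not create new trees
    "updater": "refresh,prune",  # sequence of updater plugins to run
    "refresh_leaf": 1,           # update leaf values as well as node stats
}
# Assumed usage (requires an already-trained booster and the same data):
#   xgb.train(update_params, dtrain, num_boost_round=n_trees,
#             xgb_model=existing_booster)
```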

Task Parameters

  • objective [default=reg:linear] This parameter defines the loss function to be minimized. The most commonly used values are:
    • "reg:linear" -- linear regression
    • "reg:logistic" -- logistic regression
    • "binary:logistic" -- logistic regression for binary classification; outputs the predicted probability (not the class)
    • "binary:logitraw" -- outputs the score before the logistic transformation
    • "count:poisson" -- Poisson regression for count data; outputs the mean of the Poisson distribution
      • max_delta_step is set to 0.7 by default in Poisson regression (used to safeguard optimization)

    • "multi:softmax" -- sets XGBoost to do multiclass classification; you must also set num_class (the number of classes)
    • "multi:softprob" -- outputs a probability matrix of dimension ndata * nclass
    • "rank:pairwise" -- sets XGBoost to do ranking by minimizing the pairwise loss
    • "reg:gamma" -- gamma regression with log-link; the output is the mean of a gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be gamma-distributed.
    • "reg:tweedie" -- Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be Tweedie-distributed.

  • base_score [ default=0.5 ]
    • the initial prediction score of all instances (global bias)
    • for a sufficient number of iterations, changing this value will not have much effect

  • eval_metric [the default is chosen automatically according to the objective/loss function]
    • The available choices are:
      • "rmse": root mean square error
      • "mae": mean absolute error
      • "logloss": negative log-likelihood
      • "error": binary classification error rate
      • "error@t": classification error rate computed with threshold t instead of 0.5
      • "merror": multiclass classification error rate, computed as #(wrong cases)/#(all cases)
      • "mlogloss": multiclass log loss
      • "auc": area under the ROC curve, for ranking evaluation
      • "ndcg": Normalized Discounted Cumulative Gain
      • "map": Mean Average Precision
      • "ndcg@n", "map@n": n can be assigned as an integer to cut off the top positions in the lists for evaluation
      • "ndcg-", "map-", "ndcg@n-", "map@n-": In XGBoost, NDCG and MAP evaluate the score of a list without any positive samples as 1. Adding "-" to the metric name makes XGBoost evaluate these scores as 0, to be consistent under some conditions.
      • "poisson-nloglik": negative log-likelihood for Poisson regression
      • "gamma-nloglik": negative log-likelihood for gamma regression
      • "gamma-deviance": residual deviance for gamma regression
      • "tweedie-nloglik": negative log-likelihood for Tweedie regression (at a specified value of the tweedie_variance_power parameter)

  • seed [ default=0 ]
    • random number seed.

This concludes this introduction to XGBoost's parameter settings; we hope it proves helpful.


