XGBoost Parameter Settings

2024-06-21 07:32

This article introduces XGBoost's parameter settings; we hope it provides a useful reference for developers.

Below is a list of the parameters that can be specified in XGBoost, with detailed descriptions of each.

There are three categories of parameters in total: general parameters, booster parameters, and task parameters.

General Parameters

  • booster [default=gbtree]
    • 'gbtree' (tree-based model) and 'gblinear' (linear model)

  • silent [default=0]
    • 0 prints running messages, 1 means silent mode

  • nthread
    • Number of threads used to run XGBoost; defaults to the maximum number of threads available

  • num_pbuffer [set automatically by XGBoost, no user setting needed]
    • Size of the prediction buffer, normally set to the number of training instances. The buffer is used to save the prediction results of the last boosting step.

  • num_feature [set automatically by XGBoost, no user setting needed]
    • Feature dimension used in boosting, set to the maximum dimension of the features
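As a minimal sketch, the general parameters above go into the same params dict that the other two categories use (the dict shown here is plain Python; the specific values, such as 4 threads, are illustrative assumptions):

```python
# General parameters as they would appear in the params dict passed to
# xgboost's training call. num_pbuffer and num_feature are omitted
# because XGBoost sets them automatically.
general_params = {
    "booster": "gbtree",  # tree booster; use "gblinear" for a linear model
    "silent": 0,          # 0 prints running messages, 1 is silent mode
    "nthread": 4,         # assumption: 4 threads; defaults to all available
}

print(general_params)
```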

Booster Parameters

  • eta [default=0.3, can be regarded as the learning rate]
    • Step-size shrinkage used in updates to prevent overfitting. After each boosting step, the algorithm directly obtains the weights of the new features; eta shrinks these feature weights to make the boosting process more conservative. The default is 0.3.
    • range: [0,1]

  • gamma [default=0, alias: min_split_loss]
    • Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm.
    • range: [0,∞]

  • max_depth [default=6]
    • Maximum depth of a tree
    • range: [1,∞]

  • min_child_weight [default=1]
    • Minimum sum of instance weights needed in a child. If a tree partition step produces a leaf node whose sum of instance weights is less than min_child_weight, the building process gives up that partition. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger the value, the more conservative the algorithm.
    • range: [0,∞]

  • max_delta_step [default=0]
    • A value of 0 means no constraint; a positive value makes each update step more conservative. This parameter is usually not needed, but it is useful in logistic regression when the classes in the training set are extremely imbalanced; setting it to a value in 1-10 can help control each update step.
    • range: [0,∞]

  • subsample [default=1]
    • Subsample ratio of the training instances. Setting it to 0.5 means XGBoost randomly samples half of the instances for growing trees, which helps prevent overfitting.
    • range: (0,1]

  • colsample_bytree [default=1]
    • Subsample ratio of columns (features) used when constructing each tree
    • range: (0,1]

  • colsample_bylevel [default=1]
    • Subsample ratio of columns for each split, at each level of the tree. This parameter is rarely used, because subsample and colsample_bytree can serve a similar purpose.
    • range: (0,1]

  • lambda [default=1, alias: reg_lambda]
    • L2 regularization term on weights

  • alpha [default=0, alias: reg_alpha]
    • L1 regularization term on weights

  • tree_method, string [default='auto']
    • The tree construction algorithm used in XGBoost (see the description in the reference paper)
    • The distributed and external-memory versions only support the approximate algorithm.
    • Choices: {'auto', 'exact', 'approx'}
      • 'auto': Use a heuristic to choose the faster one.
        • For small to medium datasets, exact greedy will be used.
        • For very large datasets, the approximate algorithm will be chosen.
        • Because the old behavior was to always use exact greedy on a single machine, the user will get a message when the approximate algorithm is chosen, to notify them of this choice.

      • 'exact': Exact greedy algorithm.
      • 'approx': Approximate greedy algorithm using sketching and histogram.

  • sketch_eps, [default=0.03]
    • This is only used for approximate greedy algorithm.
    • This roughly translates into O(1 / sketch_eps) bins. Compared with directly selecting the number of bins, this comes with a theoretical guarantee of sketch accuracy.
    • Usually the user does not have to tune this, but consider setting it to a lower number for more accurate enumeration.
    • range: (0, 1)

  • scale_pos_weight, [default=1]
    • Controls the balance of positive and negative weights; when the classes are highly imbalanced, setting this parameter to a positive value can make the algorithm converge faster
    • A value worth considering: sum(negative cases) / sum(positive cases). See the Higgs Kaggle competition demo for examples.

  • updater, [default='grow_colmaker,prune']
    • A comma-separated string defining the sequence of tree updaters to run, providing a modular way to construct and to modify the trees. This is an advanced parameter that is usually set automatically, depending on some other parameters. However, it can also be set explicitly by a user. The following updater plugins exist:
      • 'grow_colmaker': non-distributed column-based construction of trees.
      • 'distcol': distributed tree construction with column-based data splitting mode.
      • 'grow_histmaker': distributed tree construction with row-based data splitting based on global proposal of histogram counting.
      • 'grow_local_histmaker': based on local histogram counting.
      • 'grow_skmaker': uses the approximate sketching algorithm.
      • 'sync': synchronizes trees in all distributed nodes.
      • 'refresh': refreshes tree's statistics and/or leaf values based on the current data. Note that no random subsampling of data rows is performed.
      • 'prune': prunes the splits where loss < min_split_loss (or gamma).

    • In a distributed setting, the implicit updater sequence value would be adjusted as follows:
      • 'grow_histmaker,prune' when dsplit='row' (or default) and prob_buffer_row == 1 (or default); or when data has multiple sparse pages
      • 'grow_histmaker,refresh,prune' when dsplit='row' and prob_buffer_row < 1
      • 'distcol' when dsplit='col'

  • refresh_leaf, [default=1]
    • This is a parameter of the 'refresh' updater plugin. When this flag is true, tree leaves as well as tree nodes' stats are updated. When it is false, only node stats are updated.

  • process_type, [default='default']
    • A type of boosting process to run.
    • Choices: {'default', 'update'}
      • 'default': the normal boosting process which creates new trees.
      • 'update': starts from an existing model and only updates its trees. In each boosting iteration, a tree from the initial model is taken, a specified sequence of updater plugins is run for that tree, and a modified tree is added to the new model. The new model would have either the same or a smaller number of trees, depending on the number of boosting iterations performed. Currently, the following built-in updater plugins can be meaningfully used with this process type: 'refresh', 'prune'. With 'update', one cannot use updater plugins that create new trees.
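The commonly tuned booster parameters above can be sketched as one params dict. This is a hedged illustration, not recommended values: the labels are a made-up imbalanced vector used only to show the sum(negative)/sum(positive) heuristic for scale_pos_weight.

```python
# Hypothetical imbalanced labels: 90 negative cases, 10 positive cases.
labels = [0] * 90 + [1] * 10

num_neg = labels.count(0)
num_pos = labels.count(1)

# Commonly tuned booster parameters (values are illustrative only).
booster_params = {
    "eta": 0.1,               # smaller step-size shrinkage -> more conservative
    "max_depth": 6,
    "min_child_weight": 1,
    "gamma": 0,               # alias: min_split_loss
    "subsample": 0.8,         # row sampling per tree
    "colsample_bytree": 0.8,  # column sampling per tree
    "lambda": 1,              # L2 regularization (alias: reg_lambda)
    "alpha": 0,               # L1 regularization (alias: reg_alpha)
    # sum(negative cases) / sum(positive cases), as suggested above:
    "scale_pos_weight": num_neg / num_pos,
}

print(booster_params["scale_pos_weight"])  # 9.0
```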

Task Parameters

  • objective [ default=reg:linear ] This parameter defines the loss function to be minimized. The most commonly used values are:
    • "reg:linear" -- linear regression
    • "reg:logistic" -- logistic regression
    • "binary:logistic" -- logistic regression for binary classification; outputs the predicted probability (not the class)
    • "binary:logitraw" -- outputs the score before the logistic transformation
    • "count:poisson" --poisson regression for count data, output mean of poisson distribution
      • max_delta_step is set to 0.7 by default in poisson regression (used to safeguard optimization)

    • "multi:softmax" -- sets XGBoost to do multiclass classification using the softmax objective; you also need to set num_class (the number of classes)
    • "multi:softprob" -- same as softmax, but outputs a probability matrix of dimension ndata * nclass
    • "rank:pairwise" -- sets XGBoost to do ranking tasks by minimizing the pairwise loss
    • "reg:gamma" --gamma regression with log-link. Output is a mean of gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be gamma-distributed
    • "reg:tweedie" --Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be Tweedie-distributed.

  • base_score [ default=0.5 ]
    • the initial prediction score of all instances, global bias
    • for a sufficient number of iterations, changing this value will not have much effect.

  • eval_metric [ by default, chosen automatically according to the objective ]
    • The available options are:
      • "rmse": root mean square error
      • "mae": mean absolute error
      • "logloss": negative log-likelihood
      • "error": binary classification error rate
      • "error@t": error rate computed with t as the decision threshold instead of 0.5
      • "merror": multiclass classification error rate, computed as #(wrong cases)/#(all cases)
      • "mlogloss": multiclass logloss
      • "auc": area under the ROC curve, for ranking evaluation
      • "ndcg": Normalized Discounted Cumulative Gain
      • "map": Mean Average Precision
      • "ndcg@n","map@n": n can be assigned as an integer to cut off the top positions in the lists for evaluation.
      • "ndcg-","map-","ndcg@n-","map@n-": In XGBoost, NDCG and MAP evaluate the score of a list without any positive samples as 1. By adding "-" to the evaluation metric, XGBoost will evaluate these scores as 0, to be consistent under some conditions.

    • "poisson-nloglik": negative log-likelihood for Poisson regression
    • "gamma-nloglik": negative log-likelihood for gamma regression
    • "gamma-deviance": residual deviance for gamma regression
    • "tweedie-nloglik": negative log-likelihood for Tweedie regression (at a specified value of the tweedie_variance_power parameter)

  • seed [ default=0 ]
    • Random number seed.
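The task parameters above can be sketched as a params dict for a hypothetical 3-class problem; to make the "error@t" definition concrete, a small pure-Python helper computes #(wrong)/#(all) at a given threshold t (the labels and scores are made up for illustration):

```python
# Task parameters for a hypothetical 3-class problem.
task_params = {
    "objective": "multi:softmax",  # requires num_class to be set as well
    "num_class": 3,
    "eval_metric": "mlogloss",
    "seed": 0,
}

def error_at_t(y_true, y_score, t):
    """Binary error rate with decision threshold t: #(wrong)/#(all)."""
    y_pred = [1 if s > t else 0 for s in y_score]
    wrong = sum(p != y for p, y in zip(y_pred, y_true))
    return wrong / len(y_true)

y_true = [0, 0, 1, 1]
y_score = [0.2, 0.6, 0.4, 0.9]  # made-up predicted scores
print(error_at_t(y_true, y_score, 0.5))  # 0.5: two of four misclassified
print(error_at_t(y_true, y_score, 0.7))  # 0.25: only one misclassified
```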

That concludes this article on XGBoost's parameter settings; we hope it is helpful to developers!



http://www.chinasem.cn/article/1080588
