R lightgbm tuning: an imbalanced binary classification problem (with code)
Source: 大数据文摘 (Big Data Digest). Using data from a Kaggle competition as a running example, this article walks through how to handle an imbalanced binary classification problem.
The data come from the Kaggle competition "Santander Customer Satisfaction". This is an imbalanced binary classification problem, and the goal is to maximize auc (the area under the ROC curve). The competition has since ended.
The competition page is:
https://www.kaggle.com/c/santander-customer-satisfaction
1. Modeling Approach
This article uses Microsoft's open-source lightgbm algorithm for classification, which runs extremely fast. The steps are:
Read the data;
Parallel computation: lightgbm can parallelize through its own parameters, so the doParallel and foreach packages are not needed;
Feature selection: use the mlr package to keep the features covering 99% of the cumulative chi.squared score;
Tuning: adjust the parameters of lgb.cv step by step, iterating until the results are satisfactory;
Prediction: build the lightgbm model with the tuned parameters and output predictions. The cross-validated auc of the final model was 0.833386, higher than the top Private Leaderboard score (0.829072); note, however, that a CV score computed on the training data is not directly comparable to a leaderboard score computed on held-out test data.
2. The lightgbm Algorithm
The mathematical details of lightgbm are not covered here; if needed, see the GitHub project.
The lightgbm project page:
https://github.com/Microsoft/LightGBM
Reading the Data
options(java.parameters = "-Xmx8g")  ## needed later for feature selection; must be set before loading any packages
library(readr)
lgb_tr1 <- read_csv("C:/Users/Administrator/Documents/kaggle/scs_lgb/train.csv")
lgb_te1 <- read_csv("C:/Users/Administrator/Documents/kaggle/scs_lgb/test.csv")
Data Exploration
1. Set up parallel computation
library(dplyr)
library(mlr)
library(parallelMap)
parallelStartSocket(2)  # remember to call parallelStop() when finished
2. First look at each column
summarizeColumns(lgb_tr1) %>% View()
3. Handle missing values
# impute missing values in integer and numeric columns by their means
imp_tr1 <- impute(
  as.data.frame(lgb_tr1),
  classes = list(integer = imputeMean(), numeric = imputeMean())
)
imp_te1 <- impute(
  as.data.frame(lgb_te1),
  classes = list(integer = imputeMean(), numeric = imputeMean())
)
After imputation:
summarizeColumns(imp_tr1$data) %>% View()
4. Check the class proportions of the training data: the classes are imbalanced
table(lgb_tr1$TARGET)
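The imbalance ratio, which later bounds the weight search, can be computed directly from this table (a small sketch using only objects already in memory):
tab <- table(lgb_tr1$TARGET)
as.numeric(tab[1] / tab[2])  # about 24.27 negatives per positive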
5. Drop constant columns
lgb_tr2 <- removeConstantFeatures(imp_tr1$data)
lgb_te2 <- removeConstantFeatures(imp_te1$data)
6. Keep only the columns shared by the training and test sets
tr2_name <- data.frame(tr2_name = colnames(lgb_tr2))
te2_name <- data.frame(te2_name = colnames(lgb_te2))
tr2_name_inner <- tr2_name %>%
  inner_join(te2_name, by = c('tr2_name' = 'te2_name'))
TARGET <- data.frame(TARGET = lgb_tr2$TARGET)
lgb_tr2 <- lgb_tr2[, tr2_name_inner$tr2_name[2:nrow(tr2_name_inner)]]  # index 2 onward: drops the first shared column (the ID)
lgb_te2 <- lgb_te2[, tr2_name_inner$tr2_name[2:nrow(tr2_name_inner)]]
lgb_tr2 <- cbind(lgb_tr2, TARGET)
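As an aside, the same alignment can be written more compactly with base R's intersect(); a sketch, assuming as above that the first shared column is the ID and that TARGET has already been set aside:
common_cols <- intersect(colnames(lgb_tr2), colnames(lgb_te2))[-1]  # drop the shared ID column
lgb_te2 <- lgb_te2[, common_cols]
lgb_tr2 <- cbind(lgb_tr2[, common_cols], TARGET)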
Notes:
1) Since lightgbm is used, the data are not standardized;
2) lightgbm is extremely efficient and runs very fast on data under 1 GB even without feature selection; features are filtered here anyway, to speed things up further;
3) Feature selection is applied directly, with no derived variables, because the real-world meaning of the features is unknown, which makes it hard to construct sensible new ones.
Feature Selection: the Chi-Squared Test
library(lightgbm)
1. A first pass at the class weight (it will be refined later)
grid_search <- expand.grid(
  weight = seq(1, 30, 2)
  ## table(lgb_tr1$TARGET)[1] / table(lgb_tr1$TARGET)[2] = 24.27261,
  ## so the weight is searched over [1, 30]
)
lgb_rate_1 <- numeric(length = nrow(grid_search))
set.seed(0)
for (i in 1:nrow(grid_search)) {
  w <- grid_search[i, 'weight']  # take the weight from the grid, not the row index
  lgb_weight <- (lgb_tr2$TARGET * w + 1) / sum(lgb_tr2$TARGET * w + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr2[, 1:300]),
    label = lgb_tr2$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc'
  )
  # cross-validation
  lgb_tr2_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    learning_rate = .1,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr2_mod$record_evals$valid$auc$eval)
  lgb_rate_1[i] <- auc_path[length(auc_path)]
}
library(ggplot2)
grid_search$perf <- lgb_rate_1
ggplot(grid_search, aes(x = weight, y = perf)) +
  geom_point()
The plot shows that auc is not very sensitive to the weight, peaking at weight = 5.
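A quick aside on the weighting scheme used above: (TARGET * w + 1) scores each positive case w + 1 and each negative case 1, and the division normalizes the weights to sum to 1. A minimal illustration with toy labels of our own:
w <- 5
tgt <- c(0, 0, 0, 1, 1)                 # toy 0/1 labels
wt  <- (tgt * w + 1) / sum(tgt * w + 1)
wt                                       # negatives: 1/15 each; positives: 6/15 each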
2. Feature selection
1) Select the features
lgb_tr2$TARGET <- factor(lgb_tr2$TARGET)
lgb.task <- makeClassifTask(data = lgb_tr2, target = 'TARGET')
lgb.task.smote <- oversample(lgb.task, rate = 5)  # rate = 5 reuses the weight found above; oversample() replicates minority rows (mlr::smote() would generate synthetic ones instead)
fv_time <- system.time(
  fv <- generateFilterValuesData(
    lgb.task.smote,
    method = c('chi.squared')
    ## information gain or chi-squared both work here; random-forest importance is
    ## not recommended, as it is extremely slow
    ## filtering by IV (information value) is also worth a try
    ## feature engineering caps the achievable target value (here, auc), so the
    ## choice of filtering method can itself be treated as a hyperparameter
  )
)
2) Plot the filter values
# plotFilterValues(fv)
plotFilterValuesGGVIS(fv)  # deprecated in newer mlr releases; plotFilterValues(fv) is the stable alternative
3) Keep the features covering 99% of cumulative chi.squared (lightgbm is efficient enough that even more variables could be kept)
Note: the 99% cutoff (the X in "top X% of chi.squared") can itself be treated as a hyperparameter.
fv_data2 <- fv$data %>%
  arrange(desc(chi.squared)) %>%
  mutate(chi_gain_cul = cumsum(chi.squared) / sum(chi.squared))
fv_data2_filter <- fv_data2 %>% filter(chi_gain_cul <= 0.99)
dim(fv_data2_filter)  ## roughly half of the variables are dropped
fv_feature <- fv_data2_filter$name
lgb_tr3 <- lgb_tr2[, c(fv_feature, 'TARGET')]
lgb_te3 <- lgb_te2[, fv_feature]
4) Write the data out
write_csv(lgb_tr3, 'C:/users/Administrator/Documents/kaggle/scs_lgb/lgb_tr3_chi.csv')
write_csv(lgb_te3, 'C:/users/Administrator/Documents/kaggle/scs_lgb/lgb_te3_chi.csv')
The Algorithm
lgb_tr <- rxImport('C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb_tr3_chi.csv')
lgb_te <- rxImport('C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb_te3_chi.csv')
## better to read lgb_te only at prediction time, to save memory
library(lightgbm)
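rxImport belongs to RevoScaleR, which ships with Microsoft R; on a plain CRAN installation of R, readr works just as well here (a drop-in sketch, assuming the files written above):
library(readr)
lgb_tr <- as.data.frame(read_csv('C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb_tr3_chi.csv'))
lgb_te <- as.data.frame(read_csv('C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb_te3_chi.csv'))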
1. Tune weight
grid_search <- expand.grid(
  weight = 1:30
)
perf_weight_1 <- numeric(length = nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  # with weight = 1:30, the row index i equals the grid's weight value
  lgb_weight <- (lgb_tr$TARGET * i + 1) / sum(lgb_tr$TARGET * i + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc'
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    learning_rate = .1,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_weight_1[i] <- auc_path[length(auc_path)]
}
library(ggplot2)
grid_search$perf <- perf_weight_1
ggplot(grid_search, aes(x = weight, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows auc peaking at weight = 4 and declining thereafter.
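Each tuning step that follows repeats the same lgb.cv boilerplate with one or two parameters swapped. As an aside, a small wrapper can factor out the repetition (a sketch: the function name, arguments, and defaults are our own, not the author's):
cv_auc <- function(dat, params, weight_mult = 4, nrounds = 300) {
  # weighted dataset: positives get weight weight_mult + 1, negatives 1 (then normalized)
  w <- (dat$TARGET * weight_mult + 1) / sum(dat$TARGET * weight_mult + 1)
  dtrain <- lgb.Dataset(
    data = data.matrix(dat[, 1:148]),
    label = dat$TARGET,
    free_raw_data = FALSE,
    weight = w
  )
  mod <- lgb.cv(
    params,
    data = dtrain,
    nrounds = nrounds,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  aucs <- unlist(mod$record_evals$valid$auc$eval)
  aucs[length(aucs)]  # last recorded CV auc
}
For example, cv_auc(lgb_tr, list(objective = 'binary', metric = 'auc', learning_rate = .125)) reproduces one cell of a grid search; the reconstructed steps in the second tuning round below use this helper.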
2. Tune learning_rate
grid_search <- expand.grid(
  learning_rate = 2 ^ (-(8:1))
)
perf_learning_rate_1 <- numeric(length = nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_learning_rate_1[i] <- auc_path[length(auc_path)]
}
grid_search$perf <- perf_learning_rate_1
ggplot(grid_search, aes(x = learning_rate, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows auc peaking at learning_rate = 2^(-5), but the differences across 2^(-(6:3)) are tiny, so learning_rate = .125 (= 2^(-3)) is chosen to speed up training.
3. Tune num_leaves
grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = seq(50, 800, 50)
)
perf_num_leaves_1 <- numeric(length = nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_num_leaves_1[i] <- auc_path[length(auc_path)]
}
grid_search$perf <- perf_num_leaves_1
ggplot(grid_search, aes(x = num_leaves, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows auc peaking at num_leaves = 650.
4. Tune min_data_in_leaf
grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  min_data_in_leaf = 2 ^ (1:7)
)
perf_min_data_in_leaf_1 <- numeric(length = nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    min_data_in_leaf = grid_search[i, 'min_data_in_leaf']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_min_data_in_leaf_1[i] <- auc_path[length(auc_path)]
}
grid_search$perf <- perf_min_data_in_leaf_1
ggplot(grid_search, aes(x = min_data_in_leaf, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows auc is insensitive to min_data_in_leaf, so it is left at its default.
5. Tune max_bin
grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 2 ^ (5:10)
)
perf_max_bin_1 <- numeric(length = nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_max_bin_1[i] <- auc_path[length(auc_path)]
}
grid_search$perf <- perf_max_bin_1
ggplot(grid_search, aes(x = max_bin, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows auc peaking at max_bin = 2^10, the upper edge of the grid, so max_bin needs a finer search.
6. Fine-tune max_bin
grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 100 * (6:15)
)
perf_max_bin_2 <- numeric(length = nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_max_bin_2[i] <- auc_path[length(auc_path)]
}
grid_search$perf <- perf_max_bin_2
ggplot(grid_search, aes(x = max_bin, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows auc peaking at max_bin = 1000.
7. Tune min_data_in_bin
grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 2 ^ (1:9)
)
perf_min_data_in_bin_1 <- numeric(length = nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_min_data_in_bin_1[i] <- auc_path[length(auc_path)]
}
grid_search$perf <- perf_min_data_in_bin_1
ggplot(grid_search, aes(x = min_data_in_bin, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows auc peaking at min_data_in_bin = 8, though the differences are minimal; 8 is carried forward without further fine-tuning.
8. Tune feature_fraction
grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = seq(.5, 1, .02)
)
perf_feature_fraction_1 <- numeric(length = nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_feature_fraction_1[i] <- auc_path[length(auc_path)]
}
grid_search$perf <- perf_feature_fraction_1
ggplot(grid_search, aes(x = feature_fraction, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows auc peaking at feature_fraction = .62; it is stable and strong over [.60, .62] and declines from .64 on.
9. Tune min_sum_hessian
grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = seq(0, .02, .001)
)
perf_min_sum_hessian_1 <- numeric(length = nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_min_sum_hessian_1[i] <- auc_path[length(auc_path)]
}
grid_search$perf <- perf_min_sum_hessian_1
ggplot(grid_search, aes(x = min_sum_hessian, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows auc peaking at min_sum_hessian = .005, with values in [.002, .005] recommended; auc declines beyond .005.
10. Tune the lambda parameters
grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = seq(0, .01, .002),
  lambda_l2 = seq(0, .01, .002)
)
perf_lambda_1 <- numeric(length = nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_lambda_1[i] <- auc_path[length(auc_path)]
}
grid_search$perf <- perf_lambda_1
ggplot(data = grid_search, aes(x = lambda_l1, y = perf)) +
  geom_point() +
  facet_wrap(~ lambda_l2, nrow = 5)
The plot suggests lambda_l1 = 0 and lambda_l2 = 0.
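With a two-dimensional grid like this, a tile plot can be easier to read than facets; a small alternative sketch using the same grid_search data frame:
ggplot(grid_search, aes(x = factor(lambda_l1), y = factor(lambda_l2), fill = perf)) +
  geom_tile() +
  labs(x = 'lambda_l1', y = 'lambda_l2', fill = 'auc')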
11. Tune drop_rate
Note: drop_rate and max_drop are DART-specific parameters; they only take effect when boosting = 'dart' is set, which the code here does not do.
grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = seq(0, 1, .1)
)
perf_drop_rate_1 <- numeric(length = nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_drop_rate_1[i] <- auc_path[length(auc_path)]
}
grid_search$perf <- perf_drop_rate_1
ggplot(data = grid_search, aes(x = drop_rate, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows auc peaking at drop_rate = .2, with 0, .2, and .5 all performing well; the variation over [0, 1] is small.
12. Tune max_drop
grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = seq(1, 10, 2)
)
perf_max_drop_1 <- numeric(length = nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_max_drop_1[i] <- auc_path[length(auc_path)]
}
grid_search$perf <- perf_max_drop_1
ggplot(data = grid_search, aes(x = max_drop, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows auc peaking at max_drop = 5, with little variation over [1, 10].
Second Round of Tuning
1. Tune weight
grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)
perf_weight_2 <- numeric(length = 20)  # one slot per weight value tried below
for (i in 1:20) {
  lgb_weight <- (lgb_tr$TARGET * i + 1) / sum(lgb_tr$TARGET * i + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters (grid_search holds a single row of fixed values)
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[1, 'learning_rate'],
    num_leaves = grid_search[1, 'num_leaves'],
    max_bin = grid_search[1, 'max_bin'],
    min_data_in_bin = grid_search[1, 'min_data_in_bin'],
    feature_fraction = grid_search[1, 'feature_fraction'],
    min_sum_hessian = grid_search[1, 'min_sum_hessian'],
    lambda_l1 = grid_search[1, 'lambda_l1'],
    lambda_l2 = grid_search[1, 'lambda_l2'],
    drop_rate = grid_search[1, 'drop_rate'],
    max_drop = grid_search[1, 'max_drop']
  )
  # cross-validation (learning_rate comes from params, so it is not passed again here)
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_weight_2[i] <- auc_path[length(auc_path)]
}
library(ggplot2)
ggplot(data.frame(num = 1:length(perf_weight_2), perf = perf_weight_2),
       aes(x = num, y = perf)) +
  geom_point() +
  geom_smooth()
The plot shows auc stabilizing once weight >= 3, with the maximum at weight = 7.
2. Tune learning_rate
grid_search <- expand.grid(
  learning_rate = seq(.05, .5, .03),
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)
perf_learning_rate_1 <- numeric(length = nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_learning_rate_1[i] <- auc_path[length(auc_path)]
}
grid_search$perf <- perf_learning_rate_1
ggplot(data = grid_search, aes(x = learning_rate, y = perf)) +
  geom_point() +
  geom_smooth()
Conclusion: auc is maximized at learning_rate = .11.
3. Tune num_leaves
grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = seq(100, 800, 50),
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)
perf_num_leaves_1 <- numeric(length = nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_num_leaves_1[i] <- auc_path[length(auc_path)]
}
grid_search$perf <- perf_num_leaves_1
ggplot(data = grid_search, aes(x = num_leaves, y = perf)) +
  geom_point() +
  geom_smooth()
Conclusion: auc is maximized at num_leaves = 200.
4. Tune max_bin
grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = 200,
  max_bin = seq(100, 1500, 100),
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)
perf_max_bin_1 <- numeric(length = nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_max_bin_1[i] <- auc_path[length(auc_path)]
}
grid_search$perf <- perf_max_bin_1
ggplot(data = grid_search, aes(x = max_bin, y = perf)) +
  geom_point() +
  geom_smooth()
Conclusion: auc is maximized at max_bin = 600; 400 and 800 are acceptable too.
5. Tune min_data_in_bin
grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = seq(5, 50, 5),
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)
perf_min_data_in_bin_1 <- numeric(length = nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_min_data_in_bin_1[i] <- auc_path[length(auc_path)]
}
grid_search$perf <- perf_min_data_in_bin_1
ggplot(data = grid_search, aes(x = min_data_in_bin, y = perf)) +
  geom_point() +
  geom_smooth()
Conclusion: auc is maximized at min_data_in_bin = 45; 25 is also acceptable.
6. Tune feature_fraction
grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = seq(.5, .9, .02),
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)
perf_feature_fraction_1 <- numeric(length = nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_feature_fraction_1[i] <- auc_path[length(auc_path)]
}
grid_search$perf <- perf_feature_fraction_1
ggplot(data = grid_search, aes(x = feature_fraction, y = perf)) +
  geom_point() +
  geom_smooth()
Conclusion: auc is maximized at feature_fraction = .54; .56 and .58 also perform well.
7. Tune min_sum_hessian
grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = .54,
  min_sum_hessian = seq(.001, .008, .0005),
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)
perf_min_sum_hessian_1 <- numeric(length = nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_min_sum_hessian_1[i] <- auc_path[length(auc_path)]
}
grid_search$perf <- perf_min_sum_hessian_1
ggplot(data = grid_search, aes(x = min_sum_hessian, y = perf)) +
  geom_point() +
  geom_smooth()
Conclusion: auc is maximized at min_sum_hessian = .0065; .003 and .0055 are acceptable.
8. Tune the lambda parameters
grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = .54,
  min_sum_hessian = .0065,
  lambda_l1 = seq(0, .001, .0002),
  lambda_l2 = seq(0, .001, .0002),
  drop_rate = .2,
  max_drop = 5
)
perf_lambda_1 <- numeric(length = nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  auc_path <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_lambda_1[i] <- auc_path[length(auc_path)]
}
grid_search$perf <- perf_lambda_1
ggplot(data = grid_search, aes(x = lambda_l1, y = perf)) +
  geom_point() +
  facet_wrap(~ lambda_l2, nrow = 5)
Conclusion: auc is negatively correlated with the lambda penalties overall; lambda_l1 = .0002 and lambda_l2 = .0004 are chosen.
9. Tune drop_rate (see the reconstruction below)
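The code for this step is missing from the original; below is a reconstruction following the same pattern as the first-round drop_rate search, with the second-round values fixed. The search grid is our assumption, chosen to cover the values the conclusion mentions, and the loop uses the cv_auc helper sketched earlier.
grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = .54,
  min_sum_hessian = .0065,
  lambda_l1 = .0002,
  lambda_l2 = .0004,
  drop_rate = seq(0, 1, .05),  # assumed grid
  max_drop = 5
)
perf_drop_rate_2 <- numeric(nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  # note: drop_rate only takes effect under boosting = 'dart'
  params <- c(list(objective = 'binary', metric = 'auc'), as.list(grid_search[i, ]))
  perf_drop_rate_2[i] <- cv_auc(lgb_tr, params, weight_mult = 7)
}
grid_search$perf <- perf_drop_rate_2
ggplot(data = grid_search, aes(x = drop_rate, y = perf)) +
  geom_point() +
  geom_smooth()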
Conclusion: auc is maximized at drop_rate = .4; .15 and .25 are acceptable.
10. Tune max_drop (see the reconstruction below)
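This step's code is also missing; the reconstruction below follows the same pattern, again with an assumed grid (chosen to cover the value used in the final model) and the cv_auc helper.
grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = .54,
  min_sum_hessian = .0065,
  lambda_l1 = .0002,
  lambda_l2 = .0004,
  drop_rate = .4,
  max_drop = seq(2, 20, 2)  # assumed grid; the final model uses 14
)
perf_max_drop_2 <- numeric(nrow(grid_search))
for (i in 1:nrow(grid_search)) {
  params <- c(list(objective = 'binary', metric = 'auc'), as.list(grid_search[i, ]))
  perf_max_drop_2[i] <- cv_auc(lgb_tr, params, weight_mult = 7)
}
grid_search$perf <- perf_max_drop_2
ggplot(data = grid_search, aes(x = max_drop, y = perf)) +
  geom_point() +
  geom_smooth()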
Conclusion: max_drop = 14 is selected; it is the value used in the final model below.
Prediction
1. Weights
lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
2. Training dataset
lgb_train <- lgb.Dataset(
  data = data.matrix(lgb_tr[, 1:148]),
  label = lgb_tr$TARGET,
  free_raw_data = FALSE,
  weight = lgb_weight
)
3. Training
# parameters
params <- list(
  objective = 'binary',  # set explicitly: lightgbm defaults to regression otherwise
  metric = 'auc',
  learning_rate = .11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = .54,
  min_sum_hessian = 0.0065,
  lambda_l1 = .0002,
  lambda_l2 = .0004,
  drop_rate = .4,
  max_drop = 14
)
# model
lgb_mod <- lightgbm(
  params = params,
  data = lgb_train,
  nrounds = 300,
  early_stopping_rounds = 10,
  num_threads = 2
)
# prediction
lgb.pred <- predict(lgb_mod, data.matrix(lgb_te))
4. Results
lgb.pred2 <- matrix(unlist(lgb.pred), ncol = 1)
lgb.pred3 <- data.frame(lgb.pred2)
5. Output
write.csv(lgb.pred3, "C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb.pred1_tr.csv")
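For an actual Kaggle submission, the file needs an ID column next to TARGET; a hedged sketch, assuming the raw test set lgb_te1 (with its ID column) is still in memory and its row order matches lgb_te:
submission <- data.frame(ID = lgb_te1$ID, TARGET = lgb.pred)
write_csv(submission, 'C:/Users/Administrator/Documents/kaggle/scs_lgb/submission.csv')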
A few suggestions for readers still in school:
1. The datasets used for coursework are usually small, so most algorithms and most R functions are viable; for example, you can try random forests with the randomForest package, and parallelize when the data grow somewhat. But once the data reach the GB scale, even a parallelized randomForest will fail and exhaust memory; functions from a commercial R distribution (such as Microsoft R) are a better fit there.
2. Coursework focuses on theory, and its test data are usually clean; real-world data generally have messier, more complex structure.
Editor: Huang Jiyan (黄继彦)