[Kaggle] avazu-ctr-prediction
- Preface
- 1 Packages & data loading
- 1.1 Importing packages
- 1.2 Loading the data
- 1.3 Offline validation on the raw features (0.379626)
- 2 Data exploration, feature engineering & validation
- 2.1 Adding user-identification features
- 2.2 Adding user statistical features
- 2.2.1 User statistical features, part 1
- 2.2.2 Offline validation (0.37178)
- 2.3 Adding label encoding
- 2.3.1 Offline validation (0.353937)
- 2.3.2 User statistical features, part 2 (little validation gain here)
- 2.3.3 Offline validation (0.35388)
- 2.4 Adding count encoding
- 2.4.1 Offline validation (0.352798)
- 3 Summary
Preface
This is the Click-Through Rate Prediction (Avazu) competition on Kaggle. The goal of working through it is to build strong features from the dataset, understand how each feature affects the model, and verify everything with offline validation.
1 Packages & data loading
1.1 Importing packages
import pandas as pd
import numpy as np
import lightgbm as lgb
import matplotlib.pyplot as plt
import gc
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
1.2 Loading the data
path = './Data/'
# The full dataset is very large, so we only read 1,000,000 rows for these experiments
train = pd.read_csv(path + 'train.csv', nrows=1000000)
train_cols = [x for x in train.columns if x != 'id' and x != 'click']
data = train
# apply maps the lambda over each value of the Series
data['day'] = data['hour'].apply(lambda x: str(x)[4:6]).astype(int)
# with only train loaded, test_data is empty; the split mirrors a pipeline where train and test are concatenated
train_data = data.iloc[:train.shape[0], :]
test_data = data.iloc[train.shape[0]:, :]
train_data_train = train_data.loc[:800000, :]
train_data_val = train_data.loc[800000:, :]
train_cols = [x for x in train_data_train.columns if x != 'id' and x != 'click' and train_data_train[x].dtypes != 'O']
y = 'click'
lgb_train = lgb.Dataset(train_data_train[train_cols].values, train_data_train[y])
lgb_eval = lgb.Dataset(train_data_val[train_cols].values, train_data_val[y], reference=lgb_train)
1.3 Offline validation on the raw features (0.379626)
train_cols = [x for x in train.columns if x != 'id' and x != 'click' and train[x].dtypes != 'O']
y = 'click'
lgb_train = lgb.Dataset(train_data_train[train_cols].values, train_data_train[y])
lgb_eval = lgb.Dataset(train_data_val[train_cols].values, train_data_val[y], reference=lgb_train)

params = {
'task': 'train',
'boosting_type': 'gbdt',
'objective': 'binary',
'metric': {'binary_logloss'},
'num_leaves': 255,
'learning_rate': 0.05,
'feature_fraction': 0.95,
'bagging_fraction': 0.85,
'bagging_freq': 5,
'min_data_in_leaf':15,
'verbose': 0
}
print('Start training...')
# train
gbm_val_0 = lgb.train(params,
lgb_train,
num_boost_round=2000,
valid_sets=[lgb_train,lgb_eval],
early_stopping_rounds=50,verbose_eval=10)
Start training...
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.030452 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
Training until validation scores don't improve for 50 rounds
[10] training's binary_logloss: 0.415596 valid_1's binary_logloss: 0.396804
[20] training's binary_logloss: 0.405182 valid_1's binary_logloss: 0.386881
[30] training's binary_logloss: 0.40081 valid_1's binary_logloss: 0.38287
[40] training's binary_logloss: 0.398759 valid_1's binary_logloss: 0.381188
[50] training's binary_logloss: 0.39769 valid_1's binary_logloss: 0.380376
[60] training's binary_logloss: 0.397035 valid_1's binary_logloss: 0.380004
[70] training's binary_logloss: 0.396565 valid_1's binary_logloss: 0.379813
[80] training's binary_logloss: 0.396164 valid_1's binary_logloss: 0.379763
[90] training's binary_logloss: 0.395848 valid_1's binary_logloss: 0.379657
[100] training's binary_logloss: 0.395598 valid_1's binary_logloss: 0.379654
[110] training's binary_logloss: 0.395358 valid_1's binary_logloss: 0.379699
[120] training's binary_logloss: 0.395134 valid_1's binary_logloss: 0.379749
[130] training's binary_logloss: 0.394928 valid_1's binary_logloss: 0.379827
[140] training's binary_logloss: 0.394724 valid_1's binary_logloss: 0.380029
Early stopping, best iteration is:
[93] training's binary_logloss: 0.395774 valid_1's binary_logloss: 0.379626
2 Data exploration, feature engineering & validation
When working on recommendation-style competitions, one thing matters enormously: the user. This dataset contains no real user information, so we need to reconstruct the user, at least roughly, before we can attempt collaborative filtering, clustering, and similar techniques, or extract per-user historical click-through rates.
Before that, here is the meaning of every column, for quick reference.
·id: ad identifier
·click: 0/1 for non-click/click
·hour: format is YYMMDDHH, so 14091123 means 23:00 UTC on September 11, 2014
·C1: anonymized categorical variable
·banner_pos: int; position of the ad on the page, discrete values 0, 1, 2, 3, …
·site_id: site ID
·site_domain: site domain
·site_category: site category
·app_id: string; ID of the app
·app_domain: app domain
·app_category: app category
·device_id: device ID
·device_ip: device IP
·device_model: device model
·device_type: device type
·device_conn_type: device connection type
·C14-C21: anonymized categorical variables
2.1 Adding user-identification features
We assume a user's device does not change, so we use the device-related fields to pin down the user. Before building this feature, we need to dig into the data a little further, starting with obviously anomalous values. Looking at device_id, the value a99f214a appears 33,358,308 times (in the full training set), several orders of magnitude more often than any other value. In other words, a99f214a is most likely how missing device_id values are encoded, so it needs special handling when this field is used to identify users.
# data['device_id'].value_counts()
data['device_ip'].value_counts()
# concatenate the device fields into a rough user identifier
data['user_id'] = data['device_id'] + '_' + data['device_ip'] + '_' + data['device_model']
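Note that the line above still treats a99f214a like an ordinary device_id. A minimal sketch (not in the original run) of the special handling mentioned earlier, assuming we simply drop the placeholder from the key and rely on ip + model for those rows:

# Sketch only: treat the placeholder 'a99f214a' as a missing device_id and
# key those users on ip + model alone, so that millions of unrelated rows
# are not lumped together under one fake device.
mask = data['device_id'] == 'a99f214a'
data.loc[mask, 'user_id'] = 'missing_' + data.loc[mask, 'device_ip'] + '_' + data.loc[mask, 'device_model']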
2.2 Adding user statistical features
2.2.1 User statistical features, part 1
·number of times the user appears on a given app per day
·number of times the user appears per hour and per day
·time elapsed since the user's previous appearance
from datetime import datetime

data['hour'] = data['hour'].map(lambda x: datetime.strptime(str(x), "%y%m%d%H"))
data['dayoftheweek'] = data['hour'].map(lambda x: x.weekday())
data['day'] = data['hour'].map(lambda x: x.day)
data['hour'] = data['hour'].map(lambda x: x.hour)
data['time'] = (data['day'].values - data['day'].min()) * 24 + data['hour'].values

## number of times each user appears on a given app, per day / per hour
for time in ['day', 'time']:
    print('user_id_' + time + '_app')
    data['user_id_' + time + '_app'] = data['user_id'] + '_' + data[time].astype(str) + '_' + data['app_id'].astype(str)
    dic_ = data['user_id_' + time + '_app'].value_counts().to_dict()
    data['user_id_' + time + '_app_count'] = data['user_id_' + time + '_app'].apply(lambda x: dic_[x])
    data.drop('user_id_' + time + '_app', axis=1, inplace=True)

## number of times each user appears per day / per hour
for time in ['day', 'time']:
    print('user_id_' + time)
    data['user_id_' + time] = data['user_id'] + '_' + data[time].astype(str)
    dic_ = data['user_id_' + time].value_counts().to_dict()
    data['user_id_' + time + '_count'] = data['user_id_' + time].apply(lambda x: dic_[x])
    data.drop('user_id_' + time, axis=1, inplace=True)

## time elapsed since the user's previous appearance
data['user_to_lasttime'] = data.groupby('user_id')['time'].diff().values
user_id_day_app
user_id_time_app
user_id_day
user_id_time
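The pattern above (build a concatenated string key, count with value_counts, map the counts back) is easy to read but slow on large data. A hedged alternative sketch, not from the original notebook, that produces the same counts with a grouped transform and no intermediate string columns (the _v2 suffix is hypothetical, chosen to avoid clobbering the features above):

# Same count as 'user_id_day_app_count', computed via groupby + transform('size');
# 'size' counts all rows in each (user_id, day, app_id) group.
data['user_id_day_app_count_v2'] = data.groupby(['user_id', 'day', 'app_id'])['click'].transform('size')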
2.2.2 Offline validation (0.37178)
train_data = data.iloc[:train.shape[0], :]
test_data = data.iloc[train.shape[0]:, :]
train_data_train = train_data.loc[:800000, :]
train_data_val = train_data.loc[800000:, :]
train_cols = [x for x in train_data_train.columns if x != 'id' and x != 'click' and train_data_train[x].dtypes != 'O']
y = 'click'
lgb_train = lgb.Dataset(train_data_train[train_cols].values, train_data_train[y])
lgb_eval = lgb.Dataset(train_data_val[train_cols].values, train_data_val[y], reference=lgb_train)

params = {
'task': 'train',
'boosting_type': 'gbdt',
'objective': 'binary',
'metric': {'binary_logloss'},
'num_leaves': 255,
'learning_rate': 0.05,
'feature_fraction': 0.95,
'bagging_fraction': 0.85,
'bagging_freq': 5,
'min_data_in_leaf':15,
'verbose': 0
}
print('Start training...')
# train
gbm_val_1 = lgb.train(params,
lgb_train,
num_boost_round=2000,
valid_sets=[lgb_train,lgb_eval],
early_stopping_rounds=50,verbose_eval=10)
Start training...
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.035420 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
Training until validation scores don't improve for 50 rounds
[10] training's binary_logloss: 0.41076 valid_1's binary_logloss: 0.393777
[20] training's binary_logloss: 0.398064 valid_1's binary_logloss: 0.38264
[30] training's binary_logloss: 0.392174 valid_1's binary_logloss: 0.377804
[40] training's binary_logloss: 0.389033 valid_1's binary_logloss: 0.375587
[50] training's binary_logloss: 0.387046 valid_1's binary_logloss: 0.374293
[60] training's binary_logloss: 0.385574 valid_1's binary_logloss: 0.37359
[70] training's binary_logloss: 0.384426 valid_1's binary_logloss: 0.373191
[80] training's binary_logloss: 0.38345 valid_1's binary_logloss: 0.372868
[90] training's binary_logloss: 0.382593 valid_1's binary_logloss: 0.372662
[100] training's binary_logloss: 0.381773 valid_1's binary_logloss: 0.372439
[110] training's binary_logloss: 0.380998 valid_1's binary_logloss: 0.372267
[120] training's binary_logloss: 0.380297 valid_1's binary_logloss: 0.372154
[130] training's binary_logloss: 0.379702 valid_1's binary_logloss: 0.372094
[140] training's binary_logloss: 0.379135 valid_1's binary_logloss: 0.372023
[150] training's binary_logloss: 0.378598 valid_1's binary_logloss: 0.371953
[160] training's binary_logloss: 0.378005 valid_1's binary_logloss: 0.371911
[170] training's binary_logloss: 0.377443 valid_1's binary_logloss: 0.371876
[180] training's binary_logloss: 0.376933 valid_1's binary_logloss: 0.371817
[190] training's binary_logloss: 0.376468 valid_1's binary_logloss: 0.371814
[200] training's binary_logloss: 0.375994 valid_1's binary_logloss: 0.371827
[210] training's binary_logloss: 0.375567 valid_1's binary_logloss: 0.371925
[220] training's binary_logloss: 0.375141 valid_1's binary_logloss: 0.371927
[230] training's binary_logloss: 0.374727 valid_1's binary_logloss: 0.371882
Early stopping, best iteration is:
[186] training's binary_logloss: 0.376655 valid_1's binary_logloss: 0.37178
2.3 Adding label encoding
(The difference from before is the extra user_id column.)
from sklearn.preprocessing import LabelEncoder

for col in data.columns:
    if col != 'id' and col != 'click':
        if data[col].dtypes == 'O':
            print(col)
            data[col + '_labelencode'] = LabelEncoder().fit_transform(data[col].values)
site_id
site_domain
site_category
app_id
app_domain
app_category
device_id
device_ip
device_model
user_id
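LabelEncoder sorts the unique values before assigning integer codes. For a tree model such as LightGBM the particular integer assignment is irrelevant, so a plain-pandas equivalent is pd.factorize, which assigns codes in order of first appearance instead. A sketch, with a hypothetical _labelencode2 suffix and an illustrative subset of the object columns:

# Sketch: pandas-only label encoding; the codes differ from LabelEncoder's
# sorted ordering, which does not matter for tree-based models.
for col in ['site_id', 'app_id', 'user_id']:
    data[col + '_labelencode2'] = pd.factorize(data[col])[0]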
2.3.1 Offline validation (0.353937)
train_data = data.iloc[:train.shape[0], :]
test_data = data.iloc[train.shape[0]:, :]
train_data_train = train_data.loc[:800000, :]
train_data_val = train_data.loc[800000:, :]
train_cols = [x for x in train_data_train.columns if x != 'id' and x != 'click' and train_data_train[x].dtypes != 'O']
y = 'click'
lgb_train = lgb.Dataset(train_data_train[train_cols].values, train_data_train[y])
lgb_eval = lgb.Dataset(train_data_val[train_cols].values, train_data_val[y], reference=lgb_train)

params = {
'task': 'train',
'boosting_type': 'gbdt',
'objective': 'binary',
'metric': {'binary_logloss'},
'num_leaves': 255,
'learning_rate': 0.05,
'feature_fraction': 0.95,
'bagging_fraction': 0.85,
'bagging_freq': 5,
'min_data_in_leaf':15,
'verbose': 0
}
print('Start training...')
# train
gbm_val_1 = lgb.train(params,
lgb_train,
num_boost_round=2000,
valid_sets=[lgb_train,lgb_eval],
early_stopping_rounds=50,verbose_eval=10)
Start training...
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.056239 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
Training until validation scores don't improve for 50 rounds
[10] training's binary_logloss: 0.4058 valid_1's binary_logloss: 0.387928
[20] training's binary_logloss: 0.390174 valid_1's binary_logloss: 0.373899
[30] training's binary_logloss: 0.382333 valid_1's binary_logloss: 0.367276
[40] training's binary_logloss: 0.377793 valid_1's binary_logloss: 0.363899
[50] training's binary_logloss: 0.374682 valid_1's binary_logloss: 0.361842
[60] training's binary_logloss: 0.372386 valid_1's binary_logloss: 0.36039
[70] training's binary_logloss: 0.37014 valid_1's binary_logloss: 0.359147
[80] training's binary_logloss: 0.368285 valid_1's binary_logloss: 0.358286
[90] training's binary_logloss: 0.366652 valid_1's binary_logloss: 0.357614
[100] training's binary_logloss: 0.365119 valid_1's binary_logloss: 0.357172
[110] training's binary_logloss: 0.363837 valid_1's binary_logloss: 0.356862
[120] training's binary_logloss: 0.36253 valid_1's binary_logloss: 0.356532
[130] training's binary_logloss: 0.36142 valid_1's binary_logloss: 0.356425
[140] training's binary_logloss: 0.360221 valid_1's binary_logloss: 0.356221
[150] training's binary_logloss: 0.359205 valid_1's binary_logloss: 0.356097
[160] training's binary_logloss: 0.358145 valid_1's binary_logloss: 0.355992
[170] training's binary_logloss: 0.357173 valid_1's binary_logloss: 0.355848
[180] training's binary_logloss: 0.356149 valid_1's binary_logloss: 0.355788
[190] training's binary_logloss: 0.355161 valid_1's binary_logloss: 0.355695
[200] training's binary_logloss: 0.354191 valid_1's binary_logloss: 0.355514
[210] training's binary_logloss: 0.353241 valid_1's binary_logloss: 0.355393
[220] training's binary_logloss: 0.352405 valid_1's binary_logloss: 0.355259
[230] training's binary_logloss: 0.351462 valid_1's binary_logloss: 0.355109
[240] training's binary_logloss: 0.350589 valid_1's binary_logloss: 0.355077
[250] training's binary_logloss: 0.349738 valid_1's binary_logloss: 0.354968
[260] training's binary_logloss: 0.348869 valid_1's binary_logloss: 0.35489
[270] training's binary_logloss: 0.347985 valid_1's binary_logloss: 0.354858
[280] training's binary_logloss: 0.347223 valid_1's binary_logloss: 0.354709
[290] training's binary_logloss: 0.346345 valid_1's binary_logloss: 0.354566
[300] training's binary_logloss: 0.345503 valid_1's binary_logloss: 0.354518
[310] training's binary_logloss: 0.344748 valid_1's binary_logloss: 0.354464
[320] training's binary_logloss: 0.344014 valid_1's binary_logloss: 0.354421
[330] training's binary_logloss: 0.343335 valid_1's binary_logloss: 0.354472
[340] training's binary_logloss: 0.342561 valid_1's binary_logloss: 0.354477
[350] training's binary_logloss: 0.341778 valid_1's binary_logloss: 0.354468
[360] training's binary_logloss: 0.340963 valid_1's binary_logloss: 0.35438
[370] training's binary_logloss: 0.340251 valid_1's binary_logloss: 0.354374
[380] training's binary_logloss: 0.339512 valid_1's binary_logloss: 0.354332
[390] training's binary_logloss: 0.3387 valid_1's binary_logloss: 0.354222
[400] training's binary_logloss: 0.33794 valid_1's binary_logloss: 0.35423
[410] training's binary_logloss: 0.337155 valid_1's binary_logloss: 0.354198
[420] training's binary_logloss: 0.336554 valid_1's binary_logloss: 0.354167
[430] training's binary_logloss: 0.335958 valid_1's binary_logloss: 0.354187
[440] training's binary_logloss: 0.335283 valid_1's binary_logloss: 0.354124
[450] training's binary_logloss: 0.334627 valid_1's binary_logloss: 0.35403
[460] training's binary_logloss: 0.33404 valid_1's binary_logloss: 0.354019
[470] training's binary_logloss: 0.333412 valid_1's binary_logloss: 0.353969
[480] training's binary_logloss: 0.332729 valid_1's binary_logloss: 0.353947
[490] training's binary_logloss: 0.332128 valid_1's binary_logloss: 0.353948
[500] training's binary_logloss: 0.331457 valid_1's binary_logloss: 0.353957
[510] training's binary_logloss: 0.330729 valid_1's binary_logloss: 0.353966
[520] training's binary_logloss: 0.330111 valid_1's binary_logloss: 0.353959
Early stopping, best iteration is:
[478] training's binary_logloss: 0.332869 valid_1's binary_logloss: 0.353937
2.3.2 User statistical features, part 2 (little validation gain here)
# Now look at the C14-C21 columns
train[['C14','C15','C16','C17','C18','C19','C20','C21']].nunique()
C14 606
C15 8
C16 9
C17 162
C18 4
C19 41
C20 161
C21 35
dtype: int64
for time in ['day', 'time']:
    for c in ['C14', 'C17']:
        print('user_id_' + time + '_' + c)
        data['user_id_' + time + '_' + c] = data['user_id'] + '_' + data[time].astype(str) + '_' + data[c].astype(str)
        dic_ = data['user_id_' + time + '_' + c].value_counts().to_dict()
        data['user_id_' + time + '_' + c + '_count'] = data['user_id_' + time + '_' + c].apply(lambda x: dic_[x])
        data.drop('user_id_' + time + '_' + c, axis=1, inplace=True)
user_id_day_C14
user_id_day_C17
user_id_time_C14
user_id_time_C17
2.3.3 Offline validation (0.35388)
train_data = data.iloc[:train.shape[0], :]
test_data = data.iloc[train.shape[0]:, :]
train_data_train = train_data.loc[:800000, :]
train_data_val = train_data.loc[800000:, :]
train_cols = [x for x in train_data_train.columns if x != 'id' and x != 'click' and train_data_train[x].dtypes != 'O']
y = 'click'
lgb_train = lgb.Dataset(train_data_train[train_cols].values, train_data_train[y])
lgb_eval = lgb.Dataset(train_data_val[train_cols].values, train_data_val[y], reference=lgb_train)

params = {
'task': 'train',
'boosting_type': 'gbdt',
'objective': 'binary',
'metric': {'binary_logloss'},
'num_leaves': 255,
'learning_rate': 0.05,
'feature_fraction': 0.95,
'bagging_fraction': 0.85,
'bagging_freq': 5,
'min_data_in_leaf':15,
'verbose': 0
}
print('Start training...')
# train
gbm_val_1 = lgb.train(params,
lgb_train,
num_boost_round=2000,
valid_sets=[lgb_train,lgb_eval],
early_stopping_rounds=50,verbose_eval=10)
Start training...
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.060823 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
Training until validation scores don't improve for 50 rounds
[10] training's binary_logloss: 0.405547 valid_1's binary_logloss: 0.387636
[20] training's binary_logloss: 0.389848 valid_1's binary_logloss: 0.37334
[30] training's binary_logloss: 0.38195 valid_1's binary_logloss: 0.36688
[40] training's binary_logloss: 0.377228 valid_1's binary_logloss: 0.363296
[50] training's binary_logloss: 0.374017 valid_1's binary_logloss: 0.361278
[60] training's binary_logloss: 0.371532 valid_1's binary_logloss: 0.359834
[70] training's binary_logloss: 0.369535 valid_1's binary_logloss: 0.358891
[80] training's binary_logloss: 0.36771 valid_1's binary_logloss: 0.358131
[90] training's binary_logloss: 0.365959 valid_1's binary_logloss: 0.357518
[100] training's binary_logloss: 0.36443 valid_1's binary_logloss: 0.357108
[110] training's binary_logloss: 0.363092 valid_1's binary_logloss: 0.35663
[120] training's binary_logloss: 0.361761 valid_1's binary_logloss: 0.356263
[130] training's binary_logloss: 0.360678 valid_1's binary_logloss: 0.356223
[140] training's binary_logloss: 0.359477 valid_1's binary_logloss: 0.356079
[150] training's binary_logloss: 0.3584 valid_1's binary_logloss: 0.355906
[160] training's binary_logloss: 0.357263 valid_1's binary_logloss: 0.355793
[170] training's binary_logloss: 0.356221 valid_1's binary_logloss: 0.355635
[180] training's binary_logloss: 0.355223 valid_1's binary_logloss: 0.35555
[190] training's binary_logloss: 0.354236 valid_1's binary_logloss: 0.355493
[200] training's binary_logloss: 0.35324 valid_1's binary_logloss: 0.355354
[210] training's binary_logloss: 0.352328 valid_1's binary_logloss: 0.355326
[220] training's binary_logloss: 0.351424 valid_1's binary_logloss: 0.355269
[230] training's binary_logloss: 0.3505 valid_1's binary_logloss: 0.355171
[240] training's binary_logloss: 0.349676 valid_1's binary_logloss: 0.355081
[250] training's binary_logloss: 0.348778 valid_1's binary_logloss: 0.354944
[260] training's binary_logloss: 0.347907 valid_1's binary_logloss: 0.354883
[270] training's binary_logloss: 0.347043 valid_1's binary_logloss: 0.35476
[280] training's binary_logloss: 0.346234 valid_1's binary_logloss: 0.354682
[290] training's binary_logloss: 0.345388 valid_1's binary_logloss: 0.354628
[300] training's binary_logloss: 0.344525 valid_1's binary_logloss: 0.354591
[310] training's binary_logloss: 0.343854 valid_1's binary_logloss: 0.354539
[320] training's binary_logloss: 0.343085 valid_1's binary_logloss: 0.354478
[330] training's binary_logloss: 0.342303 valid_1's binary_logloss: 0.354444
[340] training's binary_logloss: 0.34158 valid_1's binary_logloss: 0.354414
[350] training's binary_logloss: 0.340726 valid_1's binary_logloss: 0.354351
[360] training's binary_logloss: 0.339915 valid_1's binary_logloss: 0.354308
[370] training's binary_logloss: 0.339071 valid_1's binary_logloss: 0.35423
[380] training's binary_logloss: 0.338348 valid_1's binary_logloss: 0.354141
[390] training's binary_logloss: 0.337606 valid_1's binary_logloss: 0.354138
[400] training's binary_logloss: 0.336808 valid_1's binary_logloss: 0.35411
[410] training's binary_logloss: 0.335942 valid_1's binary_logloss: 0.35404
[420] training's binary_logloss: 0.335274 valid_1's binary_logloss: 0.354039
[430] training's binary_logloss: 0.334644 valid_1's binary_logloss: 0.354051
[440] training's binary_logloss: 0.3339 valid_1's binary_logloss: 0.353959
[450] training's binary_logloss: 0.333256 valid_1's binary_logloss: 0.353951
[460] training's binary_logloss: 0.332656 valid_1's binary_logloss: 0.353934
[470] training's binary_logloss: 0.33198 valid_1's binary_logloss: 0.353927
[480] training's binary_logloss: 0.331254 valid_1's binary_logloss: 0.3539
[490] training's binary_logloss: 0.330562 valid_1's binary_logloss: 0.353923
[500] training's binary_logloss: 0.329909 valid_1's binary_logloss: 0.35394
[510] training's binary_logloss: 0.329197 valid_1's binary_logloss: 0.353897
[520] training's binary_logloss: 0.328595 valid_1's binary_logloss: 0.353925
[530] training's binary_logloss: 0.32796 valid_1's binary_logloss: 0.353898
[540] training's binary_logloss: 0.327339 valid_1's binary_logloss: 0.353918
[550] training's binary_logloss: 0.326625 valid_1's binary_logloss: 0.353916
[560] training's binary_logloss: 0.325978 valid_1's binary_logloss: 0.353901
[570] training's binary_logloss: 0.325392 valid_1's binary_logloss: 0.353924
[580] training's binary_logloss: 0.324693 valid_1's binary_logloss: 0.353934
[590] training's binary_logloss: 0.324195 valid_1's binary_logloss: 0.353966
[600] training's binary_logloss: 0.323644 valid_1's binary_logloss: 0.353972
Early stopping, best iteration is:
[558] training's binary_logloss: 0.326104 valid_1's binary_logloss: 0.35388
2.4 Adding count encoding
cate_cols = ['C1', 'banner_pos', 'site_id', 'site_domain', 'site_category',
             'app_id', 'app_domain', 'app_category', 'device_id', 'device_ip',
             'device_model', 'device_type', 'device_conn_type',
             'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21']

for col in cate_cols:
    print(col)
    data[col + '_cnt_code'] = data.groupby(col)['click'].transform('count')
C1
banner_pos
site_id
site_domain
site_category
app_id
app_domain
app_category
device_id
device_ip
device_model
device_type
device_conn_type
C14
C15
C16
C17
C18
C19
C20
C21
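Note that transform('count') counts the non-null click values in each category; since click has no missing values in the training rows, this is exactly the category's frequency. An equivalent sketch (with a hypothetical _cnt_code_v2 suffix) that makes the frequency-encoding reading explicit:

# Sketch: frequency (count) encoding via value_counts + map; this matches
# groupby(col)['click'].transform('count') whenever 'click' has no NaNs.
for col in cate_cols:
    data[col + '_cnt_code_v2'] = data[col].map(data[col].value_counts())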
2.4.1 Offline validation (0.352798)
train_data = data.iloc[:train.shape[0], :]
test_data = data.iloc[train.shape[0]:, :]
train_data_train = train_data.loc[:800000, :]
train_data_val = train_data.loc[800000:, :]
train_cols = [x for x in train_data_train.columns if x != 'id' and x != 'click' and train_data_train[x].dtypes != 'O']
y = 'click'
lgb_train = lgb.Dataset(train_data_train[train_cols].values, train_data_train[y])
lgb_eval = lgb.Dataset(train_data_val[train_cols].values, train_data_val[y], reference=lgb_train)

params = {
'task': 'train',
'boosting_type': 'gbdt',
'objective': 'binary',
'metric': {'binary_logloss'},
'num_leaves': 255,
'learning_rate': 0.05,
'feature_fraction': 0.95,
'bagging_fraction': 0.85,
'bagging_freq': 5,
'min_data_in_leaf':15,
'verbose': 0
}
print('Start training...')
# train
gbm_val_1 = lgb.train(params,
lgb_train,
num_boost_round=2000,
valid_sets=[lgb_train,lgb_eval],
early_stopping_rounds=50,verbose_eval=10)
Start training...
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.070490 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
Training until validation scores don't improve for 50 rounds
[10] training's binary_logloss: 0.404531 valid_1's binary_logloss: 0.387466
[20] training's binary_logloss: 0.388403 valid_1's binary_logloss: 0.373173
[30] training's binary_logloss: 0.380199 valid_1's binary_logloss: 0.366352
[40] training's binary_logloss: 0.375236 valid_1's binary_logloss: 0.362693
[50] training's binary_logloss: 0.371983 valid_1's binary_logloss: 0.3607
[60] training's binary_logloss: 0.369455 valid_1's binary_logloss: 0.359339
[70] training's binary_logloss: 0.367318 valid_1's binary_logloss: 0.358226
[80] training's binary_logloss: 0.365324 valid_1's binary_logloss: 0.357321
[90] training's binary_logloss: 0.363464 valid_1's binary_logloss: 0.356642
[100] training's binary_logloss: 0.361842 valid_1's binary_logloss: 0.356127
[110] training's binary_logloss: 0.360382 valid_1's binary_logloss: 0.355631
[120] training's binary_logloss: 0.359057 valid_1's binary_logloss: 0.355329
[130] training's binary_logloss: 0.357846 valid_1's binary_logloss: 0.355249
[140] training's binary_logloss: 0.356635 valid_1's binary_logloss: 0.354866
[150] training's binary_logloss: 0.35549 valid_1's binary_logloss: 0.354733
[160] training's binary_logloss: 0.354301 valid_1's binary_logloss: 0.354472
[170] training's binary_logloss: 0.353274 valid_1's binary_logloss: 0.354382
[180] training's binary_logloss: 0.35209 valid_1's binary_logloss: 0.354205
[190] training's binary_logloss: 0.351018 valid_1's binary_logloss: 0.354118
[200] training's binary_logloss: 0.349923 valid_1's binary_logloss: 0.353956
[210] training's binary_logloss: 0.348832 valid_1's binary_logloss: 0.353877
[220] training's binary_logloss: 0.34786 valid_1's binary_logloss: 0.353786
[230] training's binary_logloss: 0.346821 valid_1's binary_logloss: 0.353661
[240] training's binary_logloss: 0.345835 valid_1's binary_logloss: 0.353622
[250] training's binary_logloss: 0.344851 valid_1's binary_logloss: 0.353514
[260] training's binary_logloss: 0.343781 valid_1's binary_logloss: 0.353416
[270] training's binary_logloss: 0.342893 valid_1's binary_logloss: 0.353344
[280] training's binary_logloss: 0.341943 valid_1's binary_logloss: 0.353273
[290] training's binary_logloss: 0.341033 valid_1's binary_logloss: 0.353208
[300] training's binary_logloss: 0.340069 valid_1's binary_logloss: 0.35314
[310] training's binary_logloss: 0.339145 valid_1's binary_logloss: 0.353074
[320] training's binary_logloss: 0.338296 valid_1's binary_logloss: 0.35305
[330] training's binary_logloss: 0.337463 valid_1's binary_logloss: 0.353056
[340] training's binary_logloss: 0.336665 valid_1's binary_logloss: 0.353096
[350] training's binary_logloss: 0.33593 valid_1's binary_logloss: 0.353086
[360] training's binary_logloss: 0.335081 valid_1's binary_logloss: 0.353057
[370] training's binary_logloss: 0.33422 valid_1's binary_logloss: 0.352987
[380] training's binary_logloss: 0.33337 valid_1's binary_logloss: 0.352933
[390] training's binary_logloss: 0.33254 valid_1's binary_logloss: 0.352951
[400] training's binary_logloss: 0.331765 valid_1's binary_logloss: 0.352908
[410] training's binary_logloss: 0.33095 valid_1's binary_logloss: 0.352876
[420] training's binary_logloss: 0.330151 valid_1's binary_logloss: 0.352842
[430] training's binary_logloss: 0.329352 valid_1's binary_logloss: 0.352848
[440] training's binary_logloss: 0.328604 valid_1's binary_logloss: 0.352839
[450] training's binary_logloss: 0.327801 valid_1's binary_logloss: 0.352854
[460] training's binary_logloss: 0.327144 valid_1's binary_logloss: 0.352895
[470] training's binary_logloss: 0.326418 valid_1's binary_logloss: 0.352891
[480] training's binary_logloss: 0.325777 valid_1's binary_logloss: 0.35289
Early stopping, best iteration is:
[435] training's binary_logloss: 0.328955 valid_1's binary_logloss: 0.352798
3 Summary
The main lesson: by reconstructing user identities and combining them with the time information, we built several strong features and saw exactly how each one shows up in the model's validation score.
Feature engineering is crucial!!!
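The runs above stop at offline validation. As a closing sketch (not part of the original notebook), this is how the last early-stopped booster, gbm_val_1, would score a feature frame at its best iteration, using the validation split built above as the example input:

# Sketch: predict click probabilities with the trained booster,
# truncated at the early-stopping best iteration.
val_pred = gbm_val_1.predict(train_data_val[train_cols].values, num_iteration=gbm_val_1.best_iteration)
print(val_pred[:5])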