【kaggle】avazu-ctr-prediction

2024-03-29 20:08
Tags: prediction, kaggle, ctr, avazu



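  • Preface
  • 1 Packages & data loading
    • 1.1 Importing packages
    • 1.2 Loading the data
    • 1.3 Offline validation on the raw data (0.379626)
  • 2 Data exploration, feature engineering & validation
    • 2.1 Adding user-identification features
    • 2.2 Adding user statistical features
      • 2.2.1 User statistical features 1
      • 2.2.2 Offline validation (0.37178)
    • 2.3 Adding label encoding
      • 2.3.1 Offline validation (0.353937)
      • 2.3.2 User features 2 (these bring only a small gain here)
      • 2.3.3 Offline validation (0.35388)
    • 2.4 Adding count encoding
      • 2.4.1 Offline validation (0.352798)
  • 3 Summary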

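Preface

This is the Click-Through Rate Prediction competition on Kaggle. The goal of working through it is to build strong features from the dataset, see how each feature affects the model, and verify that effect with offline validation.

1 Packages & data loading

1.1 Importing packages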

import pandas as pd
import numpy as np
import lightgbm as lgb
import matplotlib.pyplot as plt
import gc
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

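1.2 Loading the data

path = './Data/'
# The full dataset is too large, so only the first 1,000,000 rows are used for the experiments
train = pd.read_csv(path + 'train.csv', nrows=1000000)
train_cols = [x for x in train.columns if x != 'id' and x != 'click']
data = train

# apply calls the function on each element of the column;
# here it extracts the day of the month from the YYMMDDHH value
data['day'] = data['hour'].apply(lambda x: str(x)[4:6]).astype(int)
train_data = data.iloc[:train.shape[0], :]
test_data = data.iloc[train.shape[0]:, :]

# First 800,000 rows for training, the rest for validation.
# Note: .loc slices by label and includes both ends, so row 800000 is in both sets
train_data_train = train_data.loc[:800000, :]
train_data_val = train_data.loc[800000:, :]

# Use every non-object column except the ID and the label
train_cols = [x for x in train_data_train.columns if x != 'id' and x != 'click' and train_data_train[x].dtypes != 'O']
y = 'click'
lgb_train = lgb.Dataset(train_data_train[train_cols].values, train_data_train[y])
lgb_eval = lgb.Dataset(train_data_val[train_cols].values, train_data_val[y], reference=lgb_train)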

1.3 Offline validation on the raw data (0.379626)

train_cols = [x for x in train.columns if x != 'id' and x != 'click' and train[x].dtypes != 'O']
y = 'click'
lgb_train = lgb.Dataset(train_data_train[train_cols].values, train_data_train[y])
lgb_eval = lgb.Dataset(train_data_val[train_cols].values, train_data_val[y], reference=lgb_train)

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': {'binary_logloss'},
    'num_leaves': 255,
    'learning_rate': 0.05,
    'feature_fraction': 0.95,
    'bagging_fraction': 0.85,
    'bagging_freq': 5,
    'min_data_in_leaf': 15,
    'verbose': 0
}
print('Start training...')
# train
# (LightGBM >= 4.0: replace the last two arguments with
#  callbacks=[lgb.early_stopping(50), lgb.log_evaluation(10)])
gbm_val_0 = lgb.train(params,
                      lgb_train,
                      num_boost_round=2000,
                      valid_sets=[lgb_train, lgb_eval],
                      early_stopping_rounds=50, verbose_eval=10)
Start training...
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.030452 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
Training until validation scores don't improve for 50 rounds
[10]	training's binary_logloss: 0.415596	valid_1's binary_logloss: 0.396804
[20]	training's binary_logloss: 0.405182	valid_1's binary_logloss: 0.386881
[30]	training's binary_logloss: 0.40081	valid_1's binary_logloss: 0.38287
[40]	training's binary_logloss: 0.398759	valid_1's binary_logloss: 0.381188
[50]	training's binary_logloss: 0.39769	valid_1's binary_logloss: 0.380376
[60]	training's binary_logloss: 0.397035	valid_1's binary_logloss: 0.380004
[70]	training's binary_logloss: 0.396565	valid_1's binary_logloss: 0.379813
[80]	training's binary_logloss: 0.396164	valid_1's binary_logloss: 0.379763
[90]	training's binary_logloss: 0.395848	valid_1's binary_logloss: 0.379657
[100]	training's binary_logloss: 0.395598	valid_1's binary_logloss: 0.379654
[110]	training's binary_logloss: 0.395358	valid_1's binary_logloss: 0.379699
[120]	training's binary_logloss: 0.395134	valid_1's binary_logloss: 0.379749
[130]	training's binary_logloss: 0.394928	valid_1's binary_logloss: 0.379827
[140]	training's binary_logloss: 0.394724	valid_1's binary_logloss: 0.380029
Early stopping, best iteration is:
[93]	training's binary_logloss: 0.395774	valid_1's binary_logloss: 0.379626
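
The `binary_logloss` value in the log above is the mean negative log-likelihood of the 0/1 labels under the predicted click probabilities. A minimal sketch of that computation in plain Python follows. The function name `binary_logloss`, the clipping constant `eps`, and the toy labels are illustrative assumptions, not part of the original notebook.

```python
import math

def binary_logloss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Toy example: 1 click in 5 impressions.
labels = [0, 0, 0, 0, 1]

# Predicting the base click rate for every row is a useful reference point.
base_rate = sum(labels) / len(labels)
constant_loss = binary_logloss(labels, [base_rate] * len(labels))

# Predictions that put more probability on the clicked row score lower.
better_loss = binary_logloss(labels, [0.1, 0.1, 0.1, 0.1, 0.6])

print(round(constant_loss, 4), round(better_loss, 4))
```

Comparing a validation score with the constant-prediction loss on the same rows gives a rough sense of how much the features contribute.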

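2 Data exploration, feature engineering & validation

In recommendation-style competitions the user is a key piece of information, yet this dataset contains no explicit user identity. We therefore need to identify users, at least approximately, before we can apply collaborative filtering or clustering, or extract per-user features such as historical click-through rate.
Before that, we list the meaning of every field for quick reference.

Meaning of each field in the data:
·id: ad identifier
·click: 0/1 for non-click/click
·hour: format YYMMDDHH, so 14091123 means 23:00 UTC on September 11, 2014.
·C1: anonymized categorical variable
·banner_pos: int; position of the ad on the page, a discrete feature taking values 0, 1, 2, 3, …
·site_id: site ID
·site_domain: site domain
·site_category: site category
·app_id: string; ID of the app
·app_domain
·app_category
·device_id: device ID
·device_ip
·device_model
·device_type: device type
·device_conn_type: device connection type
·C14-C21: anonymized categorical variables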
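
As a quick check of the `hour` format described in the list above, the following sketch parses the example value with the standard library. The helper name `parse_hour` is a hypothetical choice for illustration.

```python
from datetime import datetime

def parse_hour(raw):
    """Parse an Avazu `hour` value (YYMMDDHH) into a datetime."""
    return datetime.strptime(str(raw), "%y%m%d%H")

ts = parse_hour(14091123)
print(ts)            # 2014-09-11 23:00:00
print(ts.weekday())  # 3 (Monday is 0, so 3 is Thursday)
```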

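2.1 Adding user-identification features

We assume a user's device does not change, so we use the device-related features to identify users. Before building features we need to explore the data further, starting with a check for anomalies. Looking at device_id, the value a99f214a occurs 33,358,308 times, several orders of magnitude more than any other value. It is therefore probably an encoding of a missing value, and needs special handling when this feature is used to identify users.

# data['device_id'].value_counts()
data['device_ip'].value_counts()
# Note: the placeholder device_id is not treated specially here;
# the three device fields are simply concatenated into one key
data['user_id'] = data['device_id'] + '_' + data['device_ip'] + '_' + data['device_model']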
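
The text calls for special handling of the placeholder device_id, while the cell above concatenates the three fields for every row. One possible handling, sketched below under stated assumptions, keys rows with a real device_id by that ID alone and falls back to IP plus model for placeholder rows. The function `build_user_id`, the `ip_` prefix, and the toy values are hypothetical and not taken from the original notebook.

```python
import pandas as pd

# Value the text identifies as a likely missing-value code for device_id.
PLACEHOLDER_DEVICE_ID = 'a99f214a'

def build_user_id(df):
    """Return a user key; hypothetical rule, not the original notebook's."""
    shared = df['device_id'] == PLACEHOLDER_DEVICE_ID
    # Fallback key for rows with the placeholder: IP plus model only.
    fallback = 'ip_' + df['device_ip'] + '_' + df['device_model']
    # Rows with a real device_id are keyed by that ID alone.
    return df['device_id'].where(~shared, fallback)

# Toy rows with made-up hashed values.
toy = pd.DataFrame({
    'device_id':    ['a99f214a', 'a99f214a', '0f7c61dc'],
    'device_ip':    ['ddd2926e', '96809ac8', 'ddd2926e'],
    'device_model': ['44956a24', '711ee120', '44956a24'],
})
toy['user_id'] = build_user_id(toy)
print(toy['user_id'].tolist())
```

Whether this rule helps would need to be checked with the same offline validation used in the rest of the article.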

2.2 Adding user statistical features

2.2.1 User statistical features 1

·Number of times a user appears on a given app per day
·Number of times a user appears per hour and per day
·Time elapsed since the user's previous appearance

from datetime import datetime
data['hour'] = data['hour'].map(lambda x: datetime.strptime(str(x), "%y%m%d%H"))
data['dayoftheweek'] = data['hour'].map(lambda x: x.weekday())
data['day'] = data['hour'].map(lambda x: x.day)
data['hour'] = data['hour'].map(lambda x: x.hour)
# Hours elapsed since the first day in the data
data['time'] = (data['day'].values - data['day'].min()) * 24 + data['hour'].values

## Number of times a user appears on a given app per day / per hour
for time in ['day', 'time']:
    print('user_id_' + time + '_app')
    data['user_id_' + time + '_app'] = data['user_id'] + '_' + data[time].astype(str) + '_' + data['app_id'].astype(str)
    dic_ = data['user_id_' + time + '_app'].value_counts().to_dict()
    data['user_id_' + time + '_app_count'] = data['user_id_' + time + '_app'].apply(lambda x: dic_[x])
    data.drop('user_id_' + time + '_app', axis=1, inplace=True)

## Number of times a user appears per hour and per day
for time in ['day', 'time']:
    print('user_id_' + time)
    data['user_id_' + time] = data['user_id'] + '_' + data[time].astype(str)
    dic_ = data['user_id_' + time].value_counts().to_dict()
    data['user_id_' + time + '_count'] = data['user_id_' + time].apply(lambda x: dic_[x])
    data.drop('user_id_' + time, axis=1, inplace=True)

## Hours since the user's previous appearance (NaN for a user's first row)
data['user_to_lasttime'] = data.groupby('user_id')['time'].diff().values
user_id_day_app
user_id_time_app
user_id_day
user_id_time
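
To make the three features concrete, here is a small sketch on toy data. The toy frame is made up for illustration. It uses `groupby(...).transform('size')` in place of the dictionary lookup, which should give the same counts.

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u1', 'u2'],
    'day':     [21, 21, 21, 22, 22],
    'hour':    [0, 3, 5, 1, 5],
})

# Hours elapsed since the first day in the data, as in the cell above.
toy['time'] = (toy['day'] - toy['day'].min()) * 24 + toy['hour']

# Rows per (user, day): same result as the value_counts dictionary lookup.
toy['user_id_day_count'] = toy.groupby(['user_id', 'day'])['hour'].transform('size')

# Gap in hours to the same user's previous row; NaN for a user's first row.
toy['user_to_lasttime'] = toy.groupby('user_id')['time'].diff()

print(toy)
```

Because `diff` follows row order, the gap is only meaningful when the rows of each user are in chronological order.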

2.2.2 Offline validation (0.37178)


train_data = data.iloc[:train.shape[0], :]
test_data = data.iloc[train.shape[0]:, :]
# Note: .loc slices by label and includes both ends, so row 800000 is in both sets
train_data_train = train_data.loc[:800000, :]
train_data_val = train_data.loc[800000:, :]
train_cols = [x for x in train_data_train.columns if x != 'id' and x != 'click' and train_data_train[x].dtypes != 'O']
y = 'click'
lgb_train = lgb.Dataset(train_data_train[train_cols].values, train_data_train[y])
lgb_eval = lgb.Dataset(train_data_val[train_cols].values, train_data_val[y], reference=lgb_train)

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': {'binary_logloss'},
    'num_leaves': 255,
    'learning_rate': 0.05,
    'feature_fraction': 0.95,
    'bagging_fraction': 0.85,
    'bagging_freq': 5,
    'min_data_in_leaf': 15,
    'verbose': 0
}
print('Start training...')
# train
# (LightGBM >= 4.0: replace the last two arguments with
#  callbacks=[lgb.early_stopping(50), lgb.log_evaluation(10)])
gbm_val_1 = lgb.train(params,
                      lgb_train,
                      num_boost_round=2000,
                      valid_sets=[lgb_train, lgb_eval],
                      early_stopping_rounds=50, verbose_eval=10)
Start training...
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.035420 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
Training until validation scores don't improve for 50 rounds
[10]	training's binary_logloss: 0.41076	valid_1's binary_logloss: 0.393777
[20]	training's binary_logloss: 0.398064	valid_1's binary_logloss: 0.38264
[30]	training's binary_logloss: 0.392174	valid_1's binary_logloss: 0.377804
[40]	training's binary_logloss: 0.389033	valid_1's binary_logloss: 0.375587
[50]	training's binary_logloss: 0.387046	valid_1's binary_logloss: 0.374293
[60]	training's binary_logloss: 0.385574	valid_1's binary_logloss: 0.37359
[70]	training's binary_logloss: 0.384426	valid_1's binary_logloss: 0.373191
[80]	training's binary_logloss: 0.38345	valid_1's binary_logloss: 0.372868
[90]	training's binary_logloss: 0.382593	valid_1's binary_logloss: 0.372662
[100]	training's binary_logloss: 0.381773	valid_1's binary_logloss: 0.372439
[110]	training's binary_logloss: 0.380998	valid_1's binary_logloss: 0.372267
[120]	training's binary_logloss: 0.380297	valid_1's binary_logloss: 0.372154
[130]	training's binary_logloss: 0.379702	valid_1's binary_logloss: 0.372094
[140]	training's binary_logloss: 0.379135	valid_1's binary_logloss: 0.372023
[150]	training's binary_logloss: 0.378598	valid_1's binary_logloss: 0.371953
[160]	training's binary_logloss: 0.378005	valid_1's binary_logloss: 0.371911
[170]	training's binary_logloss: 0.377443	valid_1's binary_logloss: 0.371876
[180]	training's binary_logloss: 0.376933	valid_1's binary_logloss: 0.371817
[190]	training's binary_logloss: 0.376468	valid_1's binary_logloss: 0.371814
[200]	training's binary_logloss: 0.375994	valid_1's binary_logloss: 0.371827
[210]	training's binary_logloss: 0.375567	valid_1's binary_logloss: 0.371925
[220]	training's binary_logloss: 0.375141	valid_1's binary_logloss: 0.371927
[230]	training's binary_logloss: 0.374727	valid_1's binary_logloss: 0.371882
Early stopping, best iteration is:
[186]	training's binary_logloss: 0.376655	valid_1's binary_logloss: 0.37178

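2.3 Adding label encoding

(The difference from before is the additional user_id column.)

from sklearn.preprocessing import LabelEncoder
# Add an integer-coded copy of every object (string) column
for col in data.columns:
    if col != 'id' and col != 'click':
        if data[col].dtypes == 'O':
            print(col)
            data[col + '_labelencode'] = LabelEncoder().fit_transform(data[col].values)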
site_id
site_domain
site_category
app_id
app_domain
app_category
device_id
device_ip
device_model
user_id
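
For readers unfamiliar with label encoding, the sketch below reproduces the idea in plain Python: each distinct value is replaced by its position among the sorted distinct values. The function `label_encode` and the toy values are illustrative assumptions. The notebook itself uses scikit-learn's `LabelEncoder`.

```python
def label_encode(values):
    """Map each distinct value to its rank among the sorted distinct values."""
    classes = sorted(set(values))
    index = {value: code for code, value in enumerate(classes)}
    return [index[value] for value in values], classes

codes, classes = label_encode(['site_b', 'site_a', 'site_c', 'site_a'])
print(classes)  # ['site_a', 'site_b', 'site_c']
print(codes)    # [1, 0, 2, 0]
```

The integer codes only reflect string sort order. Their use here is that they turn the object columns, which the `train_cols` filter drops, into numeric columns the model can split on.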

2.3.1 Offline validation (0.353937)

train_data = data.iloc[:train.shape[0], :]
test_data = data.iloc[train.shape[0]:, :]
# Note: .loc slices by label and includes both ends, so row 800000 is in both sets
train_data_train = train_data.loc[:800000, :]
train_data_val = train_data.loc[800000:, :]
train_cols = [x for x in train_data_train.columns if x != 'id' and x != 'click' and train_data_train[x].dtypes != 'O']
y = 'click'
lgb_train = lgb.Dataset(train_data_train[train_cols].values, train_data_train[y])
lgb_eval = lgb.Dataset(train_data_val[train_cols].values, train_data_val[y], reference=lgb_train)

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': {'binary_logloss'},
    'num_leaves': 255,
    'learning_rate': 0.05,
    'feature_fraction': 0.95,
    'bagging_fraction': 0.85,
    'bagging_freq': 5,
    'min_data_in_leaf': 15,
    'verbose': 0
}
print('Start training...')
# train
# (LightGBM >= 4.0: replace the last two arguments with
#  callbacks=[lgb.early_stopping(50), lgb.log_evaluation(10)])
gbm_val_1 = lgb.train(params,
                      lgb_train,
                      num_boost_round=2000,
                      valid_sets=[lgb_train, lgb_eval],
                      early_stopping_rounds=50, verbose_eval=10)
Start training...
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.056239 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
Training until validation scores don't improve for 50 rounds
[10]	training's binary_logloss: 0.4058	valid_1's binary_logloss: 0.387928
[20]	training's binary_logloss: 0.390174	valid_1's binary_logloss: 0.373899
[30]	training's binary_logloss: 0.382333	valid_1's binary_logloss: 0.367276
[40]	training's binary_logloss: 0.377793	valid_1's binary_logloss: 0.363899
[50]	training's binary_logloss: 0.374682	valid_1's binary_logloss: 0.361842
[60]	training's binary_logloss: 0.372386	valid_1's binary_logloss: 0.36039
[70]	training's binary_logloss: 0.37014	valid_1's binary_logloss: 0.359147
[80]	training's binary_logloss: 0.368285	valid_1's binary_logloss: 0.358286
[90]	training's binary_logloss: 0.366652	valid_1's binary_logloss: 0.357614
[100]	training's binary_logloss: 0.365119	valid_1's binary_logloss: 0.357172
[110]	training's binary_logloss: 0.363837	valid_1's binary_logloss: 0.356862
[120]	training's binary_logloss: 0.36253	valid_1's binary_logloss: 0.356532
[130]	training's binary_logloss: 0.36142	valid_1's binary_logloss: 0.356425
[140]	training's binary_logloss: 0.360221	valid_1's binary_logloss: 0.356221
[150]	training's binary_logloss: 0.359205	valid_1's binary_logloss: 0.356097
[160]	training's binary_logloss: 0.358145	valid_1's binary_logloss: 0.355992
[170]	training's binary_logloss: 0.357173	valid_1's binary_logloss: 0.355848
[180]	training's binary_logloss: 0.356149	valid_1's binary_logloss: 0.355788
[190]	training's binary_logloss: 0.355161	valid_1's binary_logloss: 0.355695
[200]	training's binary_logloss: 0.354191	valid_1's binary_logloss: 0.355514
[210]	training's binary_logloss: 0.353241	valid_1's binary_logloss: 0.355393
[220]	training's binary_logloss: 0.352405	valid_1's binary_logloss: 0.355259
[230]	training's binary_logloss: 0.351462	valid_1's binary_logloss: 0.355109
[240]	training's binary_logloss: 0.350589	valid_1's binary_logloss: 0.355077
[250]	training's binary_logloss: 0.349738	valid_1's binary_logloss: 0.354968
[260]	training's binary_logloss: 0.348869	valid_1's binary_logloss: 0.35489
[270]	training's binary_logloss: 0.347985	valid_1's binary_logloss: 0.354858
[280]	training's binary_logloss: 0.347223	valid_1's binary_logloss: 0.354709
[290]	training's binary_logloss: 0.346345	valid_1's binary_logloss: 0.354566
[300]	training's binary_logloss: 0.345503	valid_1's binary_logloss: 0.354518
[310]	training's binary_logloss: 0.344748	valid_1's binary_logloss: 0.354464
[320]	training's binary_logloss: 0.344014	valid_1's binary_logloss: 0.354421
[330]	training's binary_logloss: 0.343335	valid_1's binary_logloss: 0.354472
[340]	training's binary_logloss: 0.342561	valid_1's binary_logloss: 0.354477
[350]	training's binary_logloss: 0.341778	valid_1's binary_logloss: 0.354468
[360]	training's binary_logloss: 0.340963	valid_1's binary_logloss: 0.35438
[370]	training's binary_logloss: 0.340251	valid_1's binary_logloss: 0.354374
[380]	training's binary_logloss: 0.339512	valid_1's binary_logloss: 0.354332
[390]	training's binary_logloss: 0.3387	valid_1's binary_logloss: 0.354222
[400]	training's binary_logloss: 0.33794	valid_1's binary_logloss: 0.35423
[410]	training's binary_logloss: 0.337155	valid_1's binary_logloss: 0.354198
[420]	training's binary_logloss: 0.336554	valid_1's binary_logloss: 0.354167
[430]	training's binary_logloss: 0.335958	valid_1's binary_logloss: 0.354187
[440]	training's binary_logloss: 0.335283	valid_1's binary_logloss: 0.354124
[450]	training's binary_logloss: 0.334627	valid_1's binary_logloss: 0.35403
[460]	training's binary_logloss: 0.33404	valid_1's binary_logloss: 0.354019
[470]	training's binary_logloss: 0.333412	valid_1's binary_logloss: 0.353969
[480]	training's binary_logloss: 0.332729	valid_1's binary_logloss: 0.353947
[490]	training's binary_logloss: 0.332128	valid_1's binary_logloss: 0.353948
[500]	training's binary_logloss: 0.331457	valid_1's binary_logloss: 0.353957
[510]	training's binary_logloss: 0.330729	valid_1's binary_logloss: 0.353966
[520]	training's binary_logloss: 0.330111	valid_1's binary_logloss: 0.353959
Early stopping, best iteration is:
[478]	training's binary_logloss: 0.332869	valid_1's binary_logloss: 0.353937

2.3.2 User features 2 (these bring only a small gain here)

# Next, look at the C14-C21 columns
train[['C14','C15','C16','C17','C18','C19','C20','C21']].nunique()
C14    606
C15      8
C16      9
C17    162
C18      4
C19     41
C20    161
C21     35
dtype: int64
# Number of times a user appears with a given C14 / C17 value per day / per hour
for time in ['day', 'time']:
    for c in ['C14', 'C17']:
        print('user_id_' + time + '_' + c)
        data['user_id_' + time + '_' + c] = data['user_id'] + '_' + data[time].astype(str) + '_' + data[c].astype(str)
        dic_ = data['user_id_' + time + '_' + c].value_counts().to_dict()
        data['user_id_' + time + '_' + c + '_count'] = data['user_id_' + time + '_' + c].apply(lambda x: dic_[x])
        data.drop('user_id_' + time + '_' + c, axis=1, inplace=True)
user_id_day_C14
user_id_day_C17
user_id_time_C14
user_id_time_C17
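
The cells so far build a temporary concatenated string column for every key combination. A possible alternative, sketched here as an assumption and not as the notebook's method, groups by the key columns directly. The helper `add_group_count` and the toy values are hypothetical.

```python
import pandas as pd

def add_group_count(df, keys, name):
    """Add column `name` holding the number of rows sharing the same `keys`."""
    df[name] = df.groupby(keys)[keys[0]].transform('size')
    return df

toy = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u1', 'u2'],
    'day':     [21, 21, 22, 21],
    'C14':     [15706, 15706, 15706, 15704],
})
toy = add_group_count(toy, ['user_id', 'day', 'C14'], 'user_id_day_C14_count')
print(toy['user_id_day_C14_count'].tolist())  # [2, 2, 1, 1]
```

This avoids the string concatenation and the dictionary lookup while producing the same kind of count feature.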

2.3.3 Offline validation (0.35388)


train_data = data.iloc[:train.shape[0], :]
test_data = data.iloc[train.shape[0]:, :]
# Note: .loc slices by label and includes both ends, so row 800000 is in both sets
train_data_train = train_data.loc[:800000, :]
train_data_val = train_data.loc[800000:, :]
train_cols = [x for x in train_data_train.columns if x != 'id' and x != 'click' and train_data_train[x].dtypes != 'O']
y = 'click'
lgb_train = lgb.Dataset(train_data_train[train_cols].values, train_data_train[y])
lgb_eval = lgb.Dataset(train_data_val[train_cols].values, train_data_val[y], reference=lgb_train)

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': {'binary_logloss'},
    'num_leaves': 255,
    'learning_rate': 0.05,
    'feature_fraction': 0.95,
    'bagging_fraction': 0.85,
    'bagging_freq': 5,
    'min_data_in_leaf': 15,
    'verbose': 0
}
print('Start training...')
# train
# (LightGBM >= 4.0: replace the last two arguments with
#  callbacks=[lgb.early_stopping(50), lgb.log_evaluation(10)])
gbm_val_1 = lgb.train(params,
                      lgb_train,
                      num_boost_round=2000,
                      valid_sets=[lgb_train, lgb_eval],
                      early_stopping_rounds=50, verbose_eval=10)
Start training...
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.060823 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
Training until validation scores don't improve for 50 rounds
[10]	training's binary_logloss: 0.405547	valid_1's binary_logloss: 0.387636
[20]	training's binary_logloss: 0.389848	valid_1's binary_logloss: 0.37334
[30]	training's binary_logloss: 0.38195	valid_1's binary_logloss: 0.36688
[40]	training's binary_logloss: 0.377228	valid_1's binary_logloss: 0.363296
[50]	training's binary_logloss: 0.374017	valid_1's binary_logloss: 0.361278
[60]	training's binary_logloss: 0.371532	valid_1's binary_logloss: 0.359834
[70]	training's binary_logloss: 0.369535	valid_1's binary_logloss: 0.358891
[80]	training's binary_logloss: 0.36771	valid_1's binary_logloss: 0.358131
[90]	training's binary_logloss: 0.365959	valid_1's binary_logloss: 0.357518
[100]	training's binary_logloss: 0.36443	valid_1's binary_logloss: 0.357108
[110]	training's binary_logloss: 0.363092	valid_1's binary_logloss: 0.35663
[120]	training's binary_logloss: 0.361761	valid_1's binary_logloss: 0.356263
[130]	training's binary_logloss: 0.360678	valid_1's binary_logloss: 0.356223
[140]	training's binary_logloss: 0.359477	valid_1's binary_logloss: 0.356079
[150]	training's binary_logloss: 0.3584	valid_1's binary_logloss: 0.355906
[160]	training's binary_logloss: 0.357263	valid_1's binary_logloss: 0.355793
[170]	training's binary_logloss: 0.356221	valid_1's binary_logloss: 0.355635
[180]	training's binary_logloss: 0.355223	valid_1's binary_logloss: 0.35555
[190]	training's binary_logloss: 0.354236	valid_1's binary_logloss: 0.355493
[200]	training's binary_logloss: 0.35324	valid_1's binary_logloss: 0.355354
[210]	training's binary_logloss: 0.352328	valid_1's binary_logloss: 0.355326
[220]	training's binary_logloss: 0.351424	valid_1's binary_logloss: 0.355269
[230]	training's binary_logloss: 0.3505	valid_1's binary_logloss: 0.355171
[240]	training's binary_logloss: 0.349676	valid_1's binary_logloss: 0.355081
[250]	training's binary_logloss: 0.348778	valid_1's binary_logloss: 0.354944
[260]	training's binary_logloss: 0.347907	valid_1's binary_logloss: 0.354883
[270]	training's binary_logloss: 0.347043	valid_1's binary_logloss: 0.35476
[280]	training's binary_logloss: 0.346234	valid_1's binary_logloss: 0.354682
[290]	training's binary_logloss: 0.345388	valid_1's binary_logloss: 0.354628
[300]	training's binary_logloss: 0.344525	valid_1's binary_logloss: 0.354591
[310]	training's binary_logloss: 0.343854	valid_1's binary_logloss: 0.354539
[320]	training's binary_logloss: 0.343085	valid_1's binary_logloss: 0.354478
[330]	training's binary_logloss: 0.342303	valid_1's binary_logloss: 0.354444
[340]	training's binary_logloss: 0.34158	valid_1's binary_logloss: 0.354414
[350]	training's binary_logloss: 0.340726	valid_1's binary_logloss: 0.354351
[360]	training's binary_logloss: 0.339915	valid_1's binary_logloss: 0.354308
[370]	training's binary_logloss: 0.339071	valid_1's binary_logloss: 0.35423
[380]	training's binary_logloss: 0.338348	valid_1's binary_logloss: 0.354141
[390]	training's binary_logloss: 0.337606	valid_1's binary_logloss: 0.354138
[400]	training's binary_logloss: 0.336808	valid_1's binary_logloss: 0.35411
[410]	training's binary_logloss: 0.335942	valid_1's binary_logloss: 0.35404
[420]	training's binary_logloss: 0.335274	valid_1's binary_logloss: 0.354039
[430]	training's binary_logloss: 0.334644	valid_1's binary_logloss: 0.354051
[440]	training's binary_logloss: 0.3339	valid_1's binary_logloss: 0.353959
[450]	training's binary_logloss: 0.333256	valid_1's binary_logloss: 0.353951
[460]	training's binary_logloss: 0.332656	valid_1's binary_logloss: 0.353934
[470]	training's binary_logloss: 0.33198	valid_1's binary_logloss: 0.353927
[480]	training's binary_logloss: 0.331254	valid_1's binary_logloss: 0.3539
[490]	training's binary_logloss: 0.330562	valid_1's binary_logloss: 0.353923
[500]	training's binary_logloss: 0.329909	valid_1's binary_logloss: 0.35394
[510]	training's binary_logloss: 0.329197	valid_1's binary_logloss: 0.353897
[520]	training's binary_logloss: 0.328595	valid_1's binary_logloss: 0.353925
[530]	training's binary_logloss: 0.32796	valid_1's binary_logloss: 0.353898
[540]	training's binary_logloss: 0.327339	valid_1's binary_logloss: 0.353918
[550]	training's binary_logloss: 0.326625	valid_1's binary_logloss: 0.353916
[560]	training's binary_logloss: 0.325978	valid_1's binary_logloss: 0.353901
[570]	training's binary_logloss: 0.325392	valid_1's binary_logloss: 0.353924
[580]	training's binary_logloss: 0.324693	valid_1's binary_logloss: 0.353934
[590]	training's binary_logloss: 0.324195	valid_1's binary_logloss: 0.353966
[600]	training's binary_logloss: 0.323644	valid_1's binary_logloss: 0.353972
Early stopping, best iteration is:
[558]	training's binary_logloss: 0.326104	valid_1's binary_logloss: 0.35388

2.4 Adding count encoding

cate_cols = ['C1', 'banner_pos', 'site_id', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category', 'device_id', 'device_ip', 'device_model', 'device_type', 'device_conn_type',
             'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21']
for col in cate_cols:
    print(col)
    # Number of rows that share this row's value of col
    data[col + '_cnt_code'] = data.groupby(col)['click'].transform('count')
C1
banner_pos
site_id
site_domain
site_category
app_id
app_domain
app_category
device_id
device_ip
device_model
device_type
device_conn_type
C14
C15
C16
C17
C18
C19
C20
C21
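
The cell above counts each value over all of `data`, which includes the validation rows. The counts do not use the label value, but they do use the validation rows' frequencies. A variant that fits the counts on the training rows only and maps them onto other rows could look like the sketch below. The function `count_encode` and the toy series are illustrative assumptions, not the notebook's method.

```python
import pandas as pd

def count_encode(train_col, apply_col):
    """Map values in `apply_col` to their frequency in `train_col` (0 if unseen)."""
    counts = train_col.value_counts()
    return apply_col.map(counts).fillna(0).astype(int)

train_part = pd.Series(['a', 'a', 'b', 'a', 'c'])
valid_part = pd.Series(['a', 'c', 'd'])

print(count_encode(train_part, train_part).tolist())  # [3, 3, 1, 3, 1]
print(count_encode(train_part, valid_part).tolist())  # [3, 1, 0]
```

Values that never occur in the training rows get a count of 0 here, which is one of several reasonable choices.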

2.4.1 Offline validation (0.352798)


train_data = data.iloc[:train.shape[0], :]
test_data = data.iloc[train.shape[0]:, :]
# Note: .loc slices by label and includes both ends, so row 800000 is in both sets
train_data_train = train_data.loc[:800000, :]
train_data_val = train_data.loc[800000:, :]
train_cols = [x for x in train_data_train.columns if x != 'id' and x != 'click' and train_data_train[x].dtypes != 'O']
y = 'click'
lgb_train = lgb.Dataset(train_data_train[train_cols].values, train_data_train[y])
lgb_eval = lgb.Dataset(train_data_val[train_cols].values, train_data_val[y], reference=lgb_train)

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': {'binary_logloss'},
    'num_leaves': 255,
    'learning_rate': 0.05,
    'feature_fraction': 0.95,
    'bagging_fraction': 0.85,
    'bagging_freq': 5,
    'min_data_in_leaf': 15,
    'verbose': 0
}
print('Start training...')
# train
# (LightGBM >= 4.0: replace the last two arguments with
#  callbacks=[lgb.early_stopping(50), lgb.log_evaluation(10)])
gbm_val_1 = lgb.train(params,
                      lgb_train,
                      num_boost_round=2000,
                      valid_sets=[lgb_train, lgb_eval],
                      early_stopping_rounds=50, verbose_eval=10)
Start training...
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.070490 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
Training until validation scores don't improve for 50 rounds
[10]	training's binary_logloss: 0.404531	valid_1's binary_logloss: 0.387466
[20]	training's binary_logloss: 0.388403	valid_1's binary_logloss: 0.373173
[30]	training's binary_logloss: 0.380199	valid_1's binary_logloss: 0.366352
[40]	training's binary_logloss: 0.375236	valid_1's binary_logloss: 0.362693
[50]	training's binary_logloss: 0.371983	valid_1's binary_logloss: 0.3607
[60]	training's binary_logloss: 0.369455	valid_1's binary_logloss: 0.359339
[70]	training's binary_logloss: 0.367318	valid_1's binary_logloss: 0.358226
[80]	training's binary_logloss: 0.365324	valid_1's binary_logloss: 0.357321
[90]	training's binary_logloss: 0.363464	valid_1's binary_logloss: 0.356642
[100]	training's binary_logloss: 0.361842	valid_1's binary_logloss: 0.356127
[110]	training's binary_logloss: 0.360382	valid_1's binary_logloss: 0.355631
[120]	training's binary_logloss: 0.359057	valid_1's binary_logloss: 0.355329
[130]	training's binary_logloss: 0.357846	valid_1's binary_logloss: 0.355249
[140]	training's binary_logloss: 0.356635	valid_1's binary_logloss: 0.354866
[150]	training's binary_logloss: 0.35549	valid_1's binary_logloss: 0.354733
[160]	training's binary_logloss: 0.354301	valid_1's binary_logloss: 0.354472
[170]	training's binary_logloss: 0.353274	valid_1's binary_logloss: 0.354382
[180]	training's binary_logloss: 0.35209	valid_1's binary_logloss: 0.354205
[190]	training's binary_logloss: 0.351018	valid_1's binary_logloss: 0.354118
[200]	training's binary_logloss: 0.349923	valid_1's binary_logloss: 0.353956
[210]	training's binary_logloss: 0.348832	valid_1's binary_logloss: 0.353877
[220]	training's binary_logloss: 0.34786	valid_1's binary_logloss: 0.353786
[230]	training's binary_logloss: 0.346821	valid_1's binary_logloss: 0.353661
[240]	training's binary_logloss: 0.345835	valid_1's binary_logloss: 0.353622
[250]	training's binary_logloss: 0.344851	valid_1's binary_logloss: 0.353514
[260]	training's binary_logloss: 0.343781	valid_1's binary_logloss: 0.353416
[270]	training's binary_logloss: 0.342893	valid_1's binary_logloss: 0.353344
[280]	training's binary_logloss: 0.341943	valid_1's binary_logloss: 0.353273
[290]	training's binary_logloss: 0.341033	valid_1's binary_logloss: 0.353208
[300]	training's binary_logloss: 0.340069	valid_1's binary_logloss: 0.35314
[310]	training's binary_logloss: 0.339145	valid_1's binary_logloss: 0.353074
[320]	training's binary_logloss: 0.338296	valid_1's binary_logloss: 0.35305
[330]	training's binary_logloss: 0.337463	valid_1's binary_logloss: 0.353056
[340]	training's binary_logloss: 0.336665	valid_1's binary_logloss: 0.353096
[350]	training's binary_logloss: 0.33593	valid_1's binary_logloss: 0.353086
[360]	training's binary_logloss: 0.335081	valid_1's binary_logloss: 0.353057
[370]	training's binary_logloss: 0.33422	valid_1's binary_logloss: 0.352987
[380]	training's binary_logloss: 0.33337	valid_1's binary_logloss: 0.352933
[390]	training's binary_logloss: 0.33254	valid_1's binary_logloss: 0.352951
[400]	training's binary_logloss: 0.331765	valid_1's binary_logloss: 0.352908
[410]	training's binary_logloss: 0.33095	valid_1's binary_logloss: 0.352876
[420]	training's binary_logloss: 0.330151	valid_1's binary_logloss: 0.352842
[430]	training's binary_logloss: 0.329352	valid_1's binary_logloss: 0.352848
[440]	training's binary_logloss: 0.328604	valid_1's binary_logloss: 0.352839
[450]	training's binary_logloss: 0.327801	valid_1's binary_logloss: 0.352854
[460]	training's binary_logloss: 0.327144	valid_1's binary_logloss: 0.352895
[470]	training's binary_logloss: 0.326418	valid_1's binary_logloss: 0.352891
[480]	training's binary_logloss: 0.325777	valid_1's binary_logloss: 0.35289
Early stopping, best iteration is:
[435]	training's binary_logloss: 0.328955	valid_1's binary_logloss: 0.352798

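3 Summary

The main takeaway: we identified users from the device fields, built several important features from that user information together with the time information, and measured the effect of each on the model.

Feature engineering matters!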
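
The validation scores reported in the section headings can be lined up to see where the gains come from. The short script below only restates those reported numbers. The stage labels are shorthand chosen for this sketch.

```python
# Best validation log loss reported for each stage of this article.
scores = [
    ('raw features', 0.379626),
    ('+ user statistics 1', 0.37178),
    ('+ label encoding', 0.353937),
    ('+ user statistics 2', 0.35388),
    ('+ count encoding', 0.352798),
]

baseline = scores[0][1]
previous = baseline
for name, loss in scores:
    step_gain = previous - loss   # improvement over the previous stage
    total_gain = baseline - loss  # improvement over the raw features
    print(f'{name:<22} {loss:.6f}  step {step_gain:.6f}  total {total_gain:.6f}')
    previous = loss
```

By this comparison, label encoding (including the encoded user_id) accounts for the largest single step, and the second set of user statistics for the smallest.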
