【December Top 2】MarTech Challenge: Click Anti-Fraud Prediction
Background
Ad fraud is one of the major challenges in digital marketing: click fraud wastes large amounts of advertisers' money and distorts click data. This competition provides roughly 500,000 click records. Note: the data was generated by simulation, the meaning of certain features has been hidden, and the data has been anonymized.
The task is to predict whether a user's click is a normal click or cheating behavior. Click-fraud prediction applies to feed ads, banner ads, and the Baidu Union ad platform, helping advertisers identify click fraud and lock onto genuine users.
- Competition page: https://aistudio.baidu.com/aistudio/competition/detail/52/0/introduction
- Competition dataset: https://download.csdn.net/download/turkeym4/72338032#
Data and Task
The competition provides 500,000 training records and 150,000 test records. The goal is to predict whether each click is fraudulent.
Field | Type | Description |
---|---|---|
sid | string | Sample id / request session sid |
package | string | Media info: package name (encrypted) |
version | string | Media info: app version |
android_id | string | Media info: external ad slot ID (encrypted) |
media_id | string | Media info: external media ID (encrypted) |
apptype | int | Media info: app category |
timestamp | bigint | Time the request reached the server, in ms |
location | int | Encoded user location (city level) |
fea_hash | int | Encoded user feature (meaning withheld) |
fea1_hash | int | Encoded user feature (meaning withheld) |
cus_type | int | Encoded user feature (meaning withheld) |
ntt | int | Network type: 0-unknown, 1-wired, 2-WiFi, 3-cellular (unknown), 4-2G, 5-3G, 6-4G |
carrier | string | Carrier: 0-unknown, 46000-China Mobile, 46001-China Unicom, 46003-China Telecom |
os | string | Operating system, default android |
osv | string | OS version |
lan | string | Device language, default Chinese |
dev_height | int | Device height |
dev_width | int | Device width |
dev_ppi | int | Screen pixel density |
label | int | Whether the click is fraudulent |
The label field tells us this is a binary classification task, which can be tackled with machine learning algorithms or an MLP.
Approach
The solution splits into two parts:
- Binary classification with machine learning: LGB / XGB / CatBoost
- Binary classification with deep learning: MLP / Wide & Deep / DeepFM
The rough modeling plans are outlined below; for details see the source code in the Gitee repository.
Machine Learning
Machine learning here boils down to feature engineering plus a battle-tested ("ancestral") set of hyperparameters. To get a first baseline out quickly, we usually start with LGB (LightGBM): its biggest strength is being fast while staying accurate.
Feature Processing
Missing-value handling
Inspection shows that missing values occur in the lan and osv columns.
# String-typed columns will need to be converted to numeric (LabelEncoder)
object_cols = train.select_dtypes(include='object').columns
# Count missing values per column
temp = train.isnull().sum()
# Columns with missing values: lan, osv
temp[temp > 0]
# Collect the columns to analyze
features = train.columns.tolist()
features.remove('label')
print(features)
Continuous vs. categorical features
Next, we examine which features are continuous and which are categorical. The conclusion is that osv needs a conversion, while for fea_hash and fea1_hash we start by using the string length as a feature.
for feature in features:
    print(feature, train[feature].nunique())
osv processing
# Clean osv into a comparable integer version code
def trans_osv(osv):
    osv = str(osv).replace(' ', '').replace('.', '').replace('Android_', '') \
                  .replace('十核20G_HD', '').replace('Android', '').replace('W', '')
    if osv == 'nan' or osv == 'GIONEE_YNGA':
        result = 810
    elif osv.count('-') > 0:
        result = int(osv.split('-')[0])
    elif osv == 'f073b_changxiang_v01_b1b8_20180915':
        result = 810
    elif osv == '%E6%B1%9F%E7%81%B5OS+50':
        result = 500
    else:
        result = int(osv)
    # Pad short codes to three digits (8 -> 800, 81 -> 810)
    if result < 10:
        result = result * 100
    elif result < 100:
        result = result * 10
    return int(result)
Finally, apply the transformations to both the training and test sets.
# Select features; `col` is the list of modeling columns built earlier in the
# notebook (all columns minus the excluded ones such as 'label')
features = train[col]
# Build fea_hash_len features
features['fea_hash_len'] = features['fea_hash'].map(lambda x: len(str(x)))
features['fea1_hash_len'] = features['fea1_hash'].map(lambda x: len(str(x)))
# Thinking: why map very long fea_hash values to 0?
# Values longer than 16 digits look like raw hashes rather than numeric codes,
# so we keep only their length and zero the value itself
features['fea_hash'] = features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features['fea1_hash'] = features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features['osv'] = features['osv'].apply(trans_osv)

# The same transformations for the test set
test_features = test[col]
test_features['fea_hash_len'] = test_features['fea_hash'].map(lambda x: len(str(x)))
test_features['fea1_hash_len'] = test_features['fea1_hash'].map(lambda x: len(str(x)))
test_features['fea_hash'] = test_features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features['fea1_hash'] = test_features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features['osv'] = test_features['osv'].apply(trans_osv)
Modeling
Modeling with default-parameter LGB gives a final score of 88.094.
# Train with LightGBM
import pandas as pd
import lightgbm as lgb

model = lgb.LGBMClassifier()
# Fit on everything except timestamp and version (not used in this first version)
model.fit(features.drop(['timestamp', 'version'], axis=1), train['label'])
result = model.predict(test_features.drop(['timestamp', 'version'], axis=1))

# Build the submission file
res = pd.DataFrame(test['sid'])
res['label'] = result
res.to_csv('./baseline.csv', index=False)
res
Optimization directions
Below is the list of approaches tried; the version-by-version comparison is in the score table at the end, and the full code is in the Gitee repository. A minimal sketch of the 5-fold scheme follows the list.
- Add a conversion for version
- Use timestamp in detail: add year, month, day, hour, minute, weekday, and diff features
- Add the difference between osv and version
- Add a conversion for lan
- Add aspect-ratio, screen-area, and pixel-ratio features
- Use custom ("ancestral") parameter sets for LGB and XGB
- Train with 5-fold cross-validation
- Blend multiple 5-fold models
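As a concrete illustration of the last three items, here is a minimal sketch of a 5-fold LGB scheme, assuming the features, train['label'], and test_features objects from the baseline above. The parameter values and the early-stopping callback are illustrative placeholders, not the exact "ancestral" set from the repository.

# A minimal 5-fold LGB sketch (illustrative parameters, not the repo's exact set)
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold

X = features.drop(['timestamp', 'version'], axis=1)
X_test = test_features.drop(['timestamp', 'version'], axis=1)
y = train['label']

test_pred = np.zeros((len(X_test), 2))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2021)
for fold, (trn_idx, val_idx) in enumerate(skf.split(X, y)):
    model = lgb.LGBMClassifier(n_estimators=2000, learning_rate=0.05)
    model.fit(X.iloc[trn_idx], y.iloc[trn_idx],
              eval_set=[(X.iloc[val_idx], y.iloc[val_idx])],
              callbacks=[lgb.early_stopping(100)])
    # Average the per-fold probabilities; blending with 5-fold XGB
    # would average across models in the same way
    test_pred += model.predict_proba(X_test) / skf.n_splits

final_label = test_pred.argmax(axis=1)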
Deep Learning
The deep learning approach here is built primarily on Baidu's PaddlePaddle framework.
Feature Processing
Data processing is largely the same as for machine learning, but since the data feeds a neural network, it must additionally be standardized after processing.
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

# Load the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
test = test.iloc[:, 1:]
train = train.iloc[:, 1:]
train

# Object-typed columns: lan, os, osv, version, fea_hash
# Columns with missing values: lan, osv
# Columns to exclude later: ['os', 'osv', 'lan', 'sid']

features = train.columns.tolist()
features.remove('label')
print(features)

for feature in features:
    print(feature, train[feature].nunique())

# Clean the osv column into a float version number
def osv_trans(x):
    x = str(x).replace('Android_', '').replace('Android ', '').replace('W', '')
    if str(x).find('.') > 0:
        temp_index1 = x.find('.')
        if x.find(' ') > 0:
            temp_index2 = x.find(' ')
        else:
            temp_index2 = len(x)
        if x.find('-') > 0:
            temp_index2 = x.find('-')
        # Keep the major version, merge the rest into the decimals
        result = x[0:temp_index1] + '.' + x[temp_index1 + 1:temp_index2].replace('.', '')
        try:
            return float(result)
        except:
            print(x + '#########')
            return 0
    try:
        return float(x)
    except:
        print(x + '#########')
        return 0

# train['osv'] => LabelEncoder ?
# Fill missing osv values with the mode
train['osv'].fillna('8.1.0', inplace=True)
train['osv'] = train['osv'].apply(osv_trans)

# Same for the test set
test['osv'].fillna('8.1.0', inplace=True)
test['osv'] = test['osv'].apply(osv_trans)
# Encode lan with an explicit mapping
train['lan'].value_counts()
train['lan'].value_counts().index
lan_map = {'zh-CN': 1, 'zh_CN': 2, 'Zh-CN': 3, 'zh-cn': 4, 'zh_CN_#Hans': 5, 'zh': 6, 'ZH': 7, 'cn': 8, 'CN': 9,
           'zh-HK': 10, 'tw': 11, 'TW': 12, 'zh-TW': 13, 'zh-MO': 14, 'en': 15, 'en-GB': 16, 'en-US': 17, 'ko': 18,
           'ja': 19, 'it': 20, 'mi': 21}
train['lan'] = train['lan'].map(lan_map)
test['lan'] = test['lan'].map(lan_map)
test['lan'].value_counts()

# Missing lan values get their own code, 22
train['lan'].fillna(22, inplace=True)
test['lan'].fillna(22, inplace=True)

# Drop the columns that are not modeled
remove_list = ['os', 'sid']
col = features
for i in remove_list:
    col.remove(i)
col
from datetime import datetime

# Raw timestamps are in milliseconds (e.g. 1559892728241.7212),
# so divide by 1000 before converting
train['timestamp'] = train['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000))
test['timestamp'] = test['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000))
test['timestamp']

# Normalize the version column to an integer
def version_trans(x):
    if x == 'V3':
        return 3
    if x == 'v1':
        return 1
    if x == 'P_Final_6':
        return 6
    if x == 'V6':
        return 6
    if x == 'GA3':
        return 3
    if x == 'GA2':
        return 2
    if x == 'V2':
        return 2
    if x == '50':
        return 5
    return int(x)

train['version'] = train['version'].apply(version_trans)
test['version'] = test['version'].apply(version_trans)
train['version'] = train['version'].astype('int')
test['version'] = test['version'].astype('int')
# Select features
features = train[col]
# Build fea_hash_len features; very long values look like raw hashes,
# so only their length is kept and the value itself is zeroed
features['fea_hash_len'] = features['fea_hash'].map(lambda x: len(str(x)))
features['fea1_hash_len'] = features['fea1_hash'].map(lambda x: len(str(x)))
features['fea_hash'] = features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features['fea1_hash'] = features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features

# Same for the test set
test_features = test[col]
test_features['fea_hash_len'] = test_features['fea_hash'].map(lambda x: len(str(x)))
test_features['fea1_hash_len'] = test_features['fea1_hash'].map(lambda x: len(str(x)))
test_features['fea_hash'] = test_features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features['fea1_hash'] = test_features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features
# 创建时间戳索引
temp = pd.DatetimeIndex(features['timestamp'])
features['year'] = temp.year
features['month'] = temp.month
features['day'] = temp.day
features['week_day'] = temp.weekday # 星期几
features['hour'] = temp.hour
features['minute'] = temp.minute# 求时间的diff
start_time = features['timestamp'].min()
features['time_diff'] = features['timestamp'] - start_time
features['time_diff'] = features['time_diff'].dt.days + features['time_diff'].dt.seconds / 3600 / 24
features[['timestamp', 'year', 'month', 'day', 'week_day', 'hour', 'minute', 'time_diff']]# 创建时间戳索引
temp = pd.DatetimeIndex(test_features['timestamp'])
test_features['year'] = temp.year
test_features['month'] = temp.month
test_features['day'] = temp.day
test_features['week_day'] = temp.weekday # 星期几
test_features['hour'] = temp.hour
test_features['minute'] = temp.minute# 求时间的diff
# start_time = features['timestamp'].min()
test_features['time_diff'] = test_features['timestamp'] - start_time
test_features['time_diff'] = test_features['time_diff'].dt.days + test_features['time_diff'].dt.seconds / 3600 / 24
# test_features[['timestamp', 'year', 'month', 'day', 'week_day', 'hour', 'minute', 'time_diff']]
test_features['time_diff']# In[12]:# test['version'].value_counts()
features['dev_height'].value_counts()
features['dev_width'].value_counts()
# Screen-area feature
features['dev_area'] = features['dev_height'] * features['dev_width']
test_features['dev_area'] = test_features['dev_height'] * test_features['dev_width']

"""
Thinking: could dev_ppi and dev_area be combined into a new feature?
features['dev_ppi'].value_counts()
features['dev_area'].astype('float') / features['dev_ppi'].astype('float')
"""
features['carrier'].value_counts()
features['package'].value_counts()

# version_osv: gap between the OS version and the app version
features['osv'].value_counts()
features['version_osv'] = features['osv'] - features['version']
test_features['version_osv'] = test_features['osv'] - test_features['version']

# The raw timestamp is no longer needed
features = features.drop(['timestamp'], axis=1)
test_features = test_features.drop(['timestamp'], axis=1)

# Standardize the features: fit on train, transform both
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
features1 = scaler.fit_transform(features)
test_features1 = scaler.transform(test_features)
Building the Dataset and DataLoader
import paddle
from paddle import nn
from paddle.io import Dataset, DataLoader
import numpy as np

paddle.device.set_device('gpu:0')

# Custom dataset
class MineDataset(Dataset):
    def __init__(self, X, y):
        super(MineDataset, self).__init__()
        self.num_samples = len(X)
        self.X = X
        self.y = y

    def __getitem__(self, idx):
        return self.X.iloc[idx].values.astype('float32'), \
               np.array(self.y.iloc[idx]).astype('int64')

    def __len__(self):
        return self.num_samples

from sklearn.model_selection import train_test_split

# 80/20 train/validation split
train_x, val_x, train_y, val_y = train_test_split(features1, train['label'],
                                                  test_size=0.2, random_state=42)

train_x = pd.DataFrame(train_x, columns=features.columns)
val_x = pd.DataFrame(val_x, columns=features.columns)
train_y = pd.DataFrame(train_y, columns=['label'])
val_y = pd.DataFrame(val_y, columns=['label'])

train_dataloader = DataLoader(MineDataset(train_x, train_y),
                              batch_size=1024, shuffle=True,
                              drop_last=True, num_workers=2)
val_dataloader = DataLoader(MineDataset(val_x, val_y),
                            batch_size=1024, shuffle=True,
                            drop_last=True, num_workers=2)
# MineDataset indexes with .iloc, so the scaled test array is wrapped back into
# a DataFrame; for inference we keep the original order (shuffle=False) and
# every sample (drop_last=False)
test_dataloader = DataLoader(MineDataset(pd.DataFrame(test_features1, columns=features.columns),
                                         pd.Series([0 for i in range(len(test_features1))])),
                             batch_size=1024, shuffle=False,
                             drop_last=False, num_workers=2)
Network Architecture
The first version of the network uses only simple fully connected layers: a tower structure narrowing from 250 units down to 2, with a ReLU and a dropout layer between consecutive linear layers.
class ClassifyModel(nn.Layer):
    def __init__(self, features_len):
        super(ClassifyModel, self).__init__()
        self.fc1 = nn.Linear(in_features=features_len, out_features=250)
        self.ac1 = nn.ReLU()
        self.drop1 = nn.Dropout(p=0.02)
        self.fc2 = nn.Linear(in_features=250, out_features=100)
        self.ac2 = nn.ReLU()
        self.drop2 = nn.Dropout(p=0.02)
        self.fc3 = nn.Linear(in_features=100, out_features=50)
        self.ac3 = nn.ReLU()
        self.drop3 = nn.Dropout(p=0.02)
        self.fc4 = nn.Linear(in_features=50, out_features=25)
        self.ac4 = nn.ReLU()
        self.drop4 = nn.Dropout(p=0.02)
        self.fc5 = nn.Linear(in_features=25, out_features=2)
        # Note: cross_entropy applies softmax to its input internally, so
        # squashing the outputs with Sigmoid first is unusual; returning the
        # raw fc5 logits would be the more conventional choice.
        self.out = nn.Sigmoid()

    def forward(self, input):
        x = self.fc1(input)
        x = self.ac1(x)
        x = self.drop1(x)
        x = self.fc2(x)
        x = self.ac2(x)
        x = self.drop2(x)
        x = self.fc3(x)
        x = self.ac3(x)
        x = self.drop3(x)
        x = self.fc4(x)
        x = self.ac4(x)
        x = self.drop4(x)
        x = self.fc5(x)
        output = self.out(x)
        return output
Training
# Instantiate the model
model = ClassifyModel(int(len(features.columns)))
# Switch to training mode
model.train()
# Optimizer and loss (the functional form of the loss is used in the loop)
opt = paddle.optimizer.AdamW(learning_rate=0.001, parameters=model.parameters())
loss_fn = nn.CrossEntropyLoss()

EPOCHS = 10  # number of passes over the training set
for epoch in range(EPOCHS):
    for iter_id, mini_batch in enumerate(train_dataloader):
        x_train = mini_batch[0]
        y_train = mini_batch[1]
        # Forward pass
        y_pred = model(x_train)
        # Compute the loss (cross_entropy applies softmax internally)
        loss = nn.functional.cross_entropy(y_pred, y_train)
        avg_loss = paddle.mean(loss)
        # Log every 20 iterations
        if iter_id % 20 == 0:
            acc = paddle.metric.accuracy(y_pred, y_train)
            print("epoch: {}, iter: {}, loss is: {}, acc is: {}".format(
                epoch, iter_id, avg_loss.numpy(), acc.numpy()))
        # Backward pass
        avg_loss.backward()
        # Update parameters and clear gradients
        opt.step()
        opt.clear_grad()
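Note that val_dataloader and test_dataloader are never used by the loop above. A minimal sketch of validation and test-set inference, reusing the objects defined earlier, could look like the following; this part is an assumption for illustration, not code from the original notebook:

# Sketch only: evaluate on the validation split and predict on the test set
model.eval()

# Validation accuracy
accs = []
with paddle.no_grad():
    for x_val, y_val in val_dataloader:
        y_pred = model(x_val)
        accs.append(paddle.metric.accuracy(y_pred, y_val).numpy())
print('val acc:', np.mean(accs))

# Predict on the test set; argmax over the two output units gives the label
preds = []
with paddle.no_grad():
    for x_test, _ in test_dataloader:
        preds.append(paddle.argmax(model(x_test), axis=1).numpy())
preds = np.concatenate(preds)

# Assemble the submission in the same format as the LGB baseline
res = pd.DataFrame(test['sid'])
res['label'] = preds
res.to_csv('./paddle_baseline.csv', index=False)

This relies on the test DataLoader keeping the original sample order, which is why shuffle=False and drop_last=False were set above.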
Optimization directions
Again, for space reasons, the following two schemes are left to the source code in the Gitee repository; a minimal Wide & Deep sketch follows the list.
Note: before using the Embedding-based models, run Embedding分析.ipynb first to generate the corresponding dictionary files.
- Wide & Deep built on Embedding layers
- DeepFM built on FM
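For orientation, here is a minimal Wide & Deep sketch in paddle. It is not the repository's implementation: the split into sparse/dense inputs, the vocabulary sizes, and the layer widths are hypothetical placeholders.

# Sketch only: a minimal Wide & Deep model (hypothetical shapes)
import paddle
from paddle import nn

class WideDeep(nn.Layer):
    def __init__(self, sparse_vocab_sizes, dense_dim, emb_dim=8):
        super(WideDeep, self).__init__()
        # Wide part: a linear model over the raw dense features
        self.wide = nn.Linear(dense_dim, 2)
        # Deep part: one embedding table per sparse field, then an MLP
        self.embs = nn.LayerList([nn.Embedding(v, emb_dim) for v in sparse_vocab_sizes])
        deep_in = len(sparse_vocab_sizes) * emb_dim + dense_dim
        self.deep = nn.Sequential(
            nn.Linear(deep_in, 128), nn.ReLU(),
            nn.Linear(128, 32), nn.ReLU(),
            nn.Linear(32, 2),
        )

    def forward(self, sparse_x, dense_x):
        # sparse_x: int64 [batch, n_sparse]; dense_x: float32 [batch, dense_dim]
        emb = paddle.concat([e(sparse_x[:, i]) for i, e in enumerate(self.embs)], axis=1)
        deep_logits = self.deep(paddle.concat([emb, dense_x], axis=1))
        # Sum the wide and deep logits; train with cross_entropy as above
        return self.wide(dense_x) + deep_logits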
Model Scores by Version
Category | Model | Details | Score |
---|---|---|---|
ML | ML v1 | 1. Initial model 2. Features excluded from modeling: ['os', 'version', 'lan', 'sid'] 3. Default-parameter LGB | 88.094 |
 | ML v2 | 1. Based on v1 2. Added version and simple timestamp usage 3. Tested default LGB and XGB | 88.2133 |
 | ML v3 | 1. Based on v2 2. Added lan 3. osv minus version feature 4. "Ancestral" LGB parameters | 88.9487 |
 | ML v4 | 1. Based on v3 2. 5-fold LGB 3. 5-fold XGB 4. Blend | 89.0293 / 89.0253 / 89.054 |
 | ML v5 | 1. Based on v3 2. Added pixel ratio, pixel area, pixel-resolution ratio 3. 5-fold LGB 4. 5-fold XGB 5. Blend | 89.1873 / 89.108 / 89.1713 |
Paddle | Paddle v1 | 1. Feature engineering from ML v3 2. Simple network built with paddle | no submission |
 | Paddle v2 | 1. Based on v1 2. Added embedding dictionary creation (in Embedding分析.ipynb) 3. Hybrid base model with embeddings | 88.71 |
 | Paddle v3 | 1. Based on v2 2. Added a DeepFM component and merged | 87.816 |
TensorFlow | TF v1 | 1. Feature engineering from ML v3 2. Simple network built with TensorFlow | no submission |
FM | FM v1 | 1. First simple model based on FM | 57.2147 |
Final Leaderboard Score
Source Code
https://gitee.com/turkeymz/coggle/tree/master/coggle_202112/mlp