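This post covers Task 01 of the used-car price prediction competition: understanding the task and implementing a baseline.
Task 01 covered understanding the competition and implementing a baseline. After a brief look at the data, all of it was fed to LGB and XGB without any preprocessing, and the submitted result scored 379.5001, currently ranked 64th. The next step is data analysis and feature engineering, processing the data further to improve training and test performance. This is my first competition; with further study I hope to make it onto the first page of the leaderboard. [smirk]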
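Understanding the competition
1. Overview
The competition asks participants to build a model on the given dataset that predicts the transaction price of used cars.
The task is to predict used-car transaction prices. The dataset is visible and downloadable after registration and comes from the used-car transaction records of a trading platform. According to the task description, it contains more than 400,000 records with 31 columns, 15 of which are anonymous variables. To keep the competition fair, 150,000 records are drawn as the training set, 50,000 as test set A and 50,000 as test set B, and fields such as name, model, brand and regionCode are desensitized. The files loaded below contain 250,000 training rows and 40 columns, 24 of them anonymous, so these figures do not match the data used in this post.
The competition is meant to lead people into the world of AI data competitions and is aimed mainly at newcomers, for practice and self-improvement.
2. Evaluation metric
3. Task analysis
- This is a traditional data-mining problem: build a model with data science, machine learning and deep learning methods and obtain a result.
- It is a typical regression problem.
- The main tools are xgb, lgb and catboost, along with common data-mining libraries and frameworks such as pandas, numpy, matplotlib, seaborn, sklearn and keras.
- EDA is used to uncover relationships in the data and to become familiar with it.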
4. Code examples and analysis
4.1 Load the training and test sets and inspect the data
import pandas as pd

path = './'
train = pd.read_csv(path+'car_train_0110.csv', sep=' ')
test = pd.read_csv(path+'car_testA_0110.csv', sep=' ')
print('Train data shape:',train.shape)
print('TestA data shape:',test.shape)
Train data shape: (250000, 40)
TestA data shape: (50000, 39)
# Use .head() for a quick look at the loaded data
train.head()
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_14 | v_15 | v_16 | v_17 | v_18 | v_19 | v_20 | v_21 | v_22 | v_23 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 134890 | 734 | 20160002 | 13.0 | 9 | NaN | 0.0 | 1.0 | 0 | 15.0 | ... | 0.092139 | 0.000000 | 18.763832 | -1.512063 | -1.008718 | -12.100623 | -0.947052 | 9.077297 | 0.581214 | 3.945923 |
1 | 306648 | 196973 | 20080307 | 72.0 | 9 | 7.0 | 5.0 | 1.0 | 173 | 15.0 | ... | 0.001070 | 0.122335 | -5.685612 | -0.489963 | -2.223693 | -0.226865 | -0.658246 | -3.949621 | 4.593618 | -1.145653 |
2 | 340675 | 25347 | 20020312 | 18.0 | 12 | 3.0 | 0.0 | 1.0 | 50 | 12.5 | ... | 0.064410 | 0.003345 | -3.295700 | 1.816499 | 3.554439 | -0.683675 | 0.971495 | 2.625318 | -0.851922 | -1.246135 |
3 | 57332 | 5382 | 20000611 | 38.0 | 8 | 7.0 | 0.0 | 1.0 | 54 | 15.0 | ... | 0.069231 | 0.000000 | -3.405521 | 1.497826 | 4.782636 | 0.039101 | 1.227646 | 3.040629 | -0.801854 | -1.251894 |
4 | 265235 | 173174 | 20030109 | 87.0 | 0 | 5.0 | 5.0 | 1.0 | 131 | 3.0 | ... | 0.000099 | 0.001655 | -4.475429 | 0.124138 | 1.364567 | -0.319848 | -1.131568 | -3.303424 | -1.998466 | -1.279368 |
5 rows × 40 columns
# .info() lists the column names and shows which columns have NaN (missing) values
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 40 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   SaleID             250000 non-null  int64
 1   name               250000 non-null  int64
 2   regDate            250000 non-null  int64
 3   model              250000 non-null  float64
 4   brand              250000 non-null  int64
 5   bodyType           224620 non-null  float64
 6   fuelType           227510 non-null  float64
 7   gearbox            236487 non-null  float64
 8   power              250000 non-null  int64
 9   kilometer          250000 non-null  float64
 10  notRepairedDamage  201464 non-null  float64
 11  regionCode         250000 non-null  int64
 12  seller             250000 non-null  int64
 13  offerType          250000 non-null  int64
 14  creatDate          250000 non-null  int64
 15  price              250000 non-null  int64
 16  v_0                250000 non-null  float64
 17  v_1                250000 non-null  float64
 18  v_2                250000 non-null  float64
 19  v_3                250000 non-null  float64
 20  v_4                250000 non-null  float64
 21  v_5                250000 non-null  float64
 22  v_6                250000 non-null  float64
 23  v_7                250000 non-null  float64
 24  v_8                250000 non-null  float64
 25  v_9                250000 non-null  float64
 26  v_10               250000 non-null  float64
 27  v_11               250000 non-null  float64
 28  v_12               250000 non-null  float64
 29  v_13               250000 non-null  float64
 30  v_14               250000 non-null  float64
 31  v_15               250000 non-null  float64
 32  v_16               250000 non-null  float64
 33  v_17               250000 non-null  float64
 34  v_18               250000 non-null  float64
 35  v_19               250000 non-null  float64
 36  v_20               250000 non-null  float64
 37  v_21               250000 non-null  float64
 38  v_22               250000 non-null  float64
 39  v_23               250000 non-null  float64
dtypes: float64(30), int64(10)
memory usage: 76.3 MB
# List the columns
train.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
       'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
       'v_13', 'v_14', 'v_15', 'v_16', 'v_17', 'v_18', 'v_19', 'v_20', 'v_21',
       'v_22', 'v_23'],
      dtype='object')
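- Each record in the dataset has 40 columns. v_0 through v_23 are 24 numeric anonymous features; the remaining columns are:
- SaleID - sale record ID
- name - car code
- regDate - car registration date
- model - model code
- brand - brand
- bodyType - body type
- fuelType - fuel type
- gearbox - gearbox
- power - engine power
- kilometer - distance driven (kilometres)
- notRepairedDamage - whether the car has damage that has not been repaired
- regionCode - code of the region where the car can be viewed
- seller - seller
- offerType - offer type
- creatDate - date the listing was published
- price - car price
- All columns are numeric. Four of them, 'notRepairedDamage', 'bodyType', 'fuelType' and 'gearbox', contain nulls; the others have no missing values.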
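The null counts can also be read off directly instead of from the `.info()` dump. A minimal sketch on a toy frame: the values are invented for illustration, and only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training frame; the values are invented.
toy = pd.DataFrame({
    "bodyType": [np.nan, 7.0, 3.0, 7.0],
    "fuelType": [0.0, 5.0, np.nan, np.nan],
    "power": [0, 173, 50, 54],
})

# Count missing values per column and keep only the columns that have any.
null_counts = toy.isnull().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_counts)
```

Applied to `train` and `test`, the same two lines would list the affected columns for each set side by side.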
# .describe() gives summary statistics for the numeric columns
train.describe()
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_14 | v_15 | v_16 | v_17 | v_18 | v_19 | v_20 | v_21 | v_22 | v_23 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 250000.000000 | 250000.000000 | 2.500000e+05 | 250000.000000 | 250000.000000 | 224620.000000 | 227510.000000 | 236487.000000 | 250000.000000 | 250000.000000 | ... | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 |
mean | 185351.790768 | 83153.362172 | 2.003401e+07 | 44.911480 | 7.785236 | 4.563271 | 1.665008 | 0.780783 | 115.528412 | 12.577418 | ... | 0.032489 | 0.030408 | 0.014725 | 0.000915 | 0.006273 | 0.006604 | -0.001374 | 0.000609 | -0.004025 | 0.001834 |
std | 107121.188763 | 72540.799964 | 7.770250e+04 | 50.640081 | 7.694010 | 1.912515 | 2.339646 | 0.413717 | 196.141828 | 3.990632 | ... | 0.038792 | 0.049333 | 8.779163 | 5.771081 | 4.880981 | 4.124722 | 3.803626 | 3.555353 | 2.864713 | 2.323680 |
min | 1.000000 | 0.000000 | 1.910000e+07 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.500000 | ... | 0.000000 | 0.000000 | -10.412444 | -15.538236 | -21.009214 | -13.989955 | -9.599285 | -11.181255 | -7.671327 | -2.350888 |
25% | 92501.750000 | 14500.000000 | 1.999061e+07 | 6.000000 | 1.000000 | 3.000000 | 0.000000 | 1.000000 | 70.000000 | 12.500000 | ... | 0.000129 | 0.000000 | -5.552269 | -0.901181 | -3.150385 | -0.478173 | -1.727237 | -3.067073 | -2.092178 | -1.402804 |
50% | 185264.500000 | 65314.500000 | 2.003111e+07 | 27.000000 | 6.000000 | 4.000000 | 0.000000 | 1.000000 | 105.000000 | 15.000000 | ... | 0.001961 | 0.002567 | -3.821770 | 0.223181 | -0.058502 | 0.038427 | -0.995044 | -0.880587 | -1.199807 | -1.145588 |
75% | 278128.500000 | 143761.250000 | 2.008081e+07 | 70.000000 | 11.000000 | 7.000000 | 5.000000 | 1.000000 | 150.000000 | 15.000000 | ... | 0.075672 | 0.056568 | 3.599747 | 1.263737 | 2.800475 | 0.569198 | 1.563382 | 3.269987 | 2.737614 | 0.044865 |
max | 370946.000000 | 233044.000000 | 2.019121e+07 | 250.000000 | 39.000000 | 7.000000 | 6.000000 | 1.000000 | 20000.000000 | 15.000000 | ... | 0.130785 | 0.184340 | 36.756878 | 26.134561 | 23.055660 | 16.576027 | 20.324572 | 14.039422 | 8.764597 | 8.574730 |
8 rows × 40 columns
4.2 Computing classification metrics: examples
import numpy as np
from sklearn import metrics
from sklearn.metrics import accuracy_score

# accuracy
y_pred = [0,1,0,1]
y_true = [0,1,1,1]
print('accuracy:',accuracy_score(y_true=y_true,y_pred=y_pred))
accuracy: 0.75
## Precision,Recall,F1-score
y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]
print('Precision',metrics.precision_score(y_true, y_pred))
print('Recall',metrics.recall_score(y_true, y_pred))
print('F1-score:',metrics.f1_score(y_true, y_pred))
Precision 1.0
Recall 0.5
F1-score: 0.6666666666666666
## AUC
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
print('AUC score:',metrics.roc_auc_score(y_true, y_scores))
AUC score: 0.75
4.3 Computing regression metrics: examples
# MAPE is implemented by hand here
# Note: the denominator is y_true, not |y_true|, so the negative target (-3.0) contributes a negative term
def mape(y_true,y_pred):
    return np.mean(np.abs(y_pred - y_true) / y_true)

y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])
# MSE
print('MSE:',metrics.mean_squared_error(y_true, y_pred))
# RMSE
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
# MAE
print('MAE:',metrics.mean_absolute_error(y_true, y_pred))
# MAPE
print('MAPE:',mape(y_true, y_pred))
MSE: 0.2871428571428571
RMSE: 0.5358571238146014
MAE: 0.4142857142857143
MAPE: 0.07000000000000003
# R2-score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print('R2-score:',metrics.r2_score(y_true, y_pred))
R2-score: 0.9486081370449679
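The `mape` helper above divides by `y_true` itself, so the negative target -3.0 contributes a negative term and partly cancels the other errors in the reported 0.07. A minimal sketch of the more usual definition, with the absolute value of the target in the denominator; the name `mape_abs` is made up for this example.

```python
import numpy as np

def mape_abs(y_true, y_pred):
    """MAPE with |y_true| in the denominator (hypothetical helper name)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_pred - y_true) / np.abs(y_true))

# Same arrays as in the example above.
y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])
print('MAPE (abs denominator):', round(mape_abs(y_true, y_pred), 4))
```

On these arrays the value comes out roughly twice as large as with the signed denominator. Either version is undefined when a target is exactly 0, which matters here because the minimum price in the training set is 0.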
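5. Summary: understanding the competition
Understanding the task is the foundation for tackling any competition problem, and it matters a great deal. It shapes the later feature engineering and model choice and, above all, the direction of the work that follows, such as where to look for features or how to approach the problems that come up. A clear grasp of the idea behind the task and of its business logic also helps build more effective features and models in less time. What should task understanding achieve? It should turn the problem into a high-level solution approach. The points below elaborate:
- 1) What exactly is being understood: Is reading the background description once enough? No. Understanding the task also means laying out the problem intuitively: analysing whether it can be solved and how feasible that is, how valuable the task is, working out the task logic from the background behind it, identifying which external data might be useful, and getting a first picture of the competition data, that is, which data relate to the task and how they relate to each other. Different problems call for very different handling. Put briefly, from a competition or engineering point of view, the questions are what type of problem this is, which metrics are likely to be used, which of them keep offline and online scores consistent, and whether they help us find an offline validation scheme that leads to higher online scores. On the business side, the question is whether you understand the raw features deeply enough to find the relationships between them through EDA and finally construct satisfactory features.
- 2) What can be done once the task is understood: After gaining some understanding of the task and working out the type and nature of the problem and the data, is task understanding finished? No. As after scouting an enemy, we should at least form some corresponding analysis: where the difficulty of the problem probably lies, where the key points are, where better features might be mined, which offline validation scheme is more stable, what methods might resolve overfitting or other problems if they appear, which data are reliable, which need careful handling, and which part of the data is key given the business logic (for example, in a CTR task, what purchasing pattern an ordinary customer roughly follows, or in a wind-power task, whether wind-speed and rotor-speed features are similar for turbines that are close together). This analysis is done at a high level and helps map out the overall approach and the direction of later analysis.
- 3) The evaluation metric: Why single this out? Because it bears on two important issues in later modelling. 1. Local validation: online evaluation is often limited in time and number of submissions, so building a sound local validation set and validation metric is a key step that saves a lot of time. 2. Different metrics are sensitive to errors in different ways for the same predictions, for example AUC, logloss, MAE, RMSE, or a task-specific scoring function. This is likely to affect what later predictions should focus on.
- 4) Conditions hidden in the task background: Some statements in the task description are very useful and can show up later in the defence and in how the problem is thought through, such as efficiency requirements, handling of anomalous data, differences between process flows, model running time, or model robustness. Some of these considerations run through problem framing, features, modelling and post-processing, and some greatly help feature construction or model selection. Conversely, when predictions are poor, it is worth going back to ask whether part of the background was misunderstood or some issue was overlooked.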
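The point about metrics differing in their sensitivity to errors can be made concrete with a small sketch. The numbers below are invented; the two prediction sets have the same MAE, but RMSE penalises the one with a single large miss more heavily.

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error: every unit of error counts the same.
    return np.mean(np.abs(y_pred - y_true))

def rmse(y_true, y_pred):
    # Root mean squared error: large errors are weighted more heavily.
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

y_true = np.array([100.0, 200.0, 300.0, 400.0])
pred_even = y_true + 50.0        # every prediction is off by 50
pred_outlier = y_true.copy()
pred_outlier[-1] += 200.0        # one prediction is off by 200, the rest are exact

print('even    -> MAE:', mae(y_true, pred_even), 'RMSE:', rmse(y_true, pred_even))
print('outlier -> MAE:', mae(y_true, pred_outlier), 'RMSE:', rmse(y_true, pred_outlier))
```

With MAE as the score, as in the validation code later in this post, the two cases are indistinguishable, so a local validation metric should match the online one before model choices are compared.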
Baseline implementation
1. Import the required packages
## Basic tools
import numpy as np
import pandas as pd
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import jn
from IPython.display import display, clear_output
import time
warnings.filterwarnings('ignore')
%matplotlib inline

## Models
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor

## Dimensionality reduction
from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCA
import lightgbm as lgb
import xgboost as xgb

## Parameter search and evaluation
from sklearn.model_selection import GridSearchCV,cross_val_score,StratifiedKFold,train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Install the required packages
#!pip --default-timeout=100 install xgboost -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
2. Read the data and inspect it
path = './'
train = pd.read_csv(path+'car_train_0110.csv', sep=' ')
test = pd.read_csv(path+'car_testA_0110.csv', sep=' ')
print('Train data shape:',train.shape)
print('TestA data shape:',test.shape)
Train data shape: (250000, 40)
TestA data shape: (50000, 39)
test.describe()
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_14 | v_15 | v_16 | v_17 | v_18 | v_19 | v_20 | v_21 | v_22 | v_23 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000.000000 | 50000.000000 | 5.000000e+04 | 50000.000000 | 50000.000000 | 44890.000000 | 45598.000000 | 47287.000000 | 50000.000000 | 50000.000000 | ... | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 |
mean | 556029.053380 | 82878.251420 | 2.003441e+07 | 44.922840 | 7.779420 | 4.556226 | 1.681192 | 0.781081 | 114.116060 | 12.555210 | ... | 0.032570 | 0.030773 | -0.024819 | 0.007051 | -0.008488 | -0.030104 | 0.014609 | -0.003353 | 0.013125 | -0.011936 |
std | 106952.402565 | 72292.076936 | 7.788055e+04 | 50.576255 | 7.661667 | 1.908291 | 2.344829 | 0.413518 | 177.274154 | 4.034901 | ... | 0.038779 | 0.049521 | 8.759663 | 5.784299 | 4.825261 | 4.100561 | 3.812667 | 3.548944 | 2.866774 | 2.316144 |
min | 370951.000000 | 0.000000 | 1.910000e+07 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.500000 | ... | 0.000000 | 0.000000 | -10.196998 | -15.167961 | -21.925773 | -13.682825 | -9.282567 | -11.117367 | -6.365723 | -2.394516 |
25% | 463258.500000 | 14121.250000 | 1.999061e+07 | 6.000000 | 1.000000 | 3.000000 | 0.000000 | 1.000000 | 69.000000 | 12.500000 | ... | 0.000135 | 0.000000 | -5.575131 | -0.891030 | -3.105073 | -0.481952 | -1.697763 | -3.069575 | -2.089326 | -1.402958 |
50% | 556296.000000 | 65359.000000 | 2.003111e+07 | 27.000000 | 6.000000 | 4.000000 | 0.000000 | 1.000000 | 105.000000 | 15.000000 | ... | 0.001949 | 0.002593 | -3.837572 | 0.221379 | -0.081836 | 0.039376 | -0.971210 | -0.877377 | -1.192502 | -1.146398 |
75% | 648862.250000 | 143083.750000 | 2.008091e+07 | 70.000000 | 11.000000 | 7.000000 | 5.000000 | 1.000000 | 150.000000 | 15.000000 | ... | 0.075826 | 0.062063 | 3.531269 | 1.257687 | 2.784538 | 0.560046 | 1.572508 | 3.276918 | 2.772742 | -0.010769 |
max | 741887.000000 | 233028.000000 | 2.019040e+07 | 248.000000 | 39.000000 | 7.000000 | 6.000000 | 1.000000 | 17700.000000 | 15.000000 | ... | 0.135900 | 0.180091 | 36.364986 | 26.043572 | 22.598441 | 16.333051 | 20.273633 | 11.691851 | 7.970303 | 8.749647 |
8 rows × 39 columns
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 39 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   SaleID             50000 non-null  int64
 1   name               50000 non-null  int64
 2   regDate            50000 non-null  int64
 3   model              50000 non-null  float64
 4   brand              50000 non-null  int64
 5   bodyType           44890 non-null  float64
 6   fuelType           45598 non-null  float64
 7   gearbox            47287 non-null  float64
 8   power              50000 non-null  int64
 9   kilometer          50000 non-null  float64
 10  notRepairedDamage  40372 non-null  float64
 11  regionCode         50000 non-null  int64
 12  seller             50000 non-null  int64
 13  offerType          50000 non-null  int64
 14  creatDate          50000 non-null  int64
 15  v_0                50000 non-null  float64
 16  v_1                50000 non-null  float64
 17  v_2                50000 non-null  float64
 18  v_3                50000 non-null  float64
 19  v_4                50000 non-null  float64
 20  v_5                50000 non-null  float64
 21  v_6                50000 non-null  float64
 22  v_7                50000 non-null  float64
 23  v_8                50000 non-null  float64
 24  v_9                50000 non-null  float64
 25  v_10               50000 non-null  float64
 26  v_11               50000 non-null  float64
 27  v_12               50000 non-null  float64
 28  v_13               50000 non-null  float64
 29  v_14               50000 non-null  float64
 30  v_15               50000 non-null  float64
 31  v_16               50000 non-null  float64
 32  v_17               50000 non-null  float64
 33  v_18               50000 non-null  float64
 34  v_19               50000 non-null  float64
 35  v_20               50000 non-null  float64
 36  v_21               50000 non-null  float64
 37  v_22               50000 non-null  float64
 38  v_23               50000 non-null  float64
dtypes: float64(30), int64(9)
memory usage: 14.9 MB
- The test set has nulls in the same feature columns as the training set
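- The describe() output shows that the means of the anonymous feature columns are all close to 0, so these features have already been processed
3. Building features and labels
3.1 Extract the feature column names
- Split the columns into numeric and categorical features
# Numeric feature columns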
numerical_cols = train.select_dtypes(exclude = 'object').columns
print(numerical_cols)
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
       'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
       'v_13', 'v_14', 'v_15', 'v_16', 'v_17', 'v_18', 'v_19', 'v_20', 'v_21',
       'v_22', 'v_23'],
      dtype='object')
# Categorical feature columns -- this differs from the used-car dataset, where categorical_cols holds the categorical features
categorical_cols = train.select_dtypes(include = 'object').columns
print(categorical_cols)
Index([], dtype='object')
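3.2 Build the training and test samples
# Select the feature columns -- for now, use every column as a feature
feature_cols = [col for col in numerical_cols if 'price' not in col]
# Extract the feature and label columns to build the training and test samples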
X_data = train[feature_cols]
Y_data = train['price']
X_test = test[feature_cols]
print('X train shape:',X_data.shape)
print('X test shape:',X_test.shape)
X train shape: (250000, 39)
X test shape: (50000, 39)
X_data
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_14 | v_15 | v_16 | v_17 | v_18 | v_19 | v_20 | v_21 | v_22 | v_23 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 134890 | 734 | 20160002 | 13.0 | 9 | NaN | 0.0 | 1.0 | 0 | 15.0 | ... | 0.092139 | 0.000000 | 18.763832 | -1.512063 | -1.008718 | -12.100623 | -0.947052 | 9.077297 | 0.581214 | 3.945923 |
1 | 306648 | 196973 | 20080307 | 72.0 | 9 | 7.0 | 5.0 | 1.0 | 173 | 15.0 | ... | 0.001070 | 0.122335 | -5.685612 | -0.489963 | -2.223693 | -0.226865 | -0.658246 | -3.949621 | 4.593618 | -1.145653 |
2 | 340675 | 25347 | 20020312 | 18.0 | 12 | 3.0 | 0.0 | 1.0 | 50 | 12.5 | ... | 0.064410 | 0.003345 | -3.295700 | 1.816499 | 3.554439 | -0.683675 | 0.971495 | 2.625318 | -0.851922 | -1.246135 |
3 | 57332 | 5382 | 20000611 | 38.0 | 8 | 7.0 | 0.0 | 1.0 | 54 | 15.0 | ... | 0.069231 | 0.000000 | -3.405521 | 1.497826 | 4.782636 | 0.039101 | 1.227646 | 3.040629 | -0.801854 | -1.251894 |
4 | 265235 | 173174 | 20030109 | 87.0 | 0 | 5.0 | 5.0 | 1.0 | 131 | 3.0 | ... | 0.000099 | 0.001655 | -4.475429 | 0.124138 | 1.364567 | -0.319848 | -1.131568 | -3.303424 | -1.998466 | -1.279368 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
249995 | 10556 | 9332 | 20170003 | 13.0 | 9 | NaN | NaN | 1.0 | 58 | 15.0 | ... | 0.079119 | 0.001447 | 11.782508 | 20.402576 | -2.722772 | 0.462388 | -4.429385 | 7.883413 | 0.698405 | -1.082013 |
249996 | 146710 | 102110 | 20030511 | 29.0 | 17 | 3.0 | 0.0 | 0.0 | 61 | 15.0 | ... | 0.000000 | 0.002342 | -2.988272 | 1.500532 | 3.502201 | -0.761715 | -2.484556 | -2.532968 | -0.940266 | -1.106426 |
249997 | 116066 | 82802 | 20130312 | 124.0 | 16 | 6.0 | 0.0 | 1.0 | 122 | 3.0 | ... | 0.003358 | 0.100760 | -6.939560 | -1.144959 | -5.337949 | 0.896026 | -0.592565 | -3.872725 | 2.135984 | 3.807554 |
249998 | 90082 | 65971 | 20121212 | 111.0 | 4 | 7.0 | 5.0 | 0.0 | 184 | 9.0 | ... | 0.002974 | 0.008251 | -7.222167 | -1.383696 | -5.402794 | -0.409451 | -1.891556 | -3.104789 | -3.777374 | 3.186218 |
249999 | 76453 | 56954 | 20051111 | 13.0 | 9 | 3.0 | 0.0 | 1.0 | 58 | 12.5 | ... | 0.000000 | 0.009071 | 10.491312 | -11.270043 | -0.272595 | -0.026478 | -2.168249 | -0.980042 | -0.955164 | -1.169593 |
250000 rows × 39 columns
# A helper that prints summary statistics, used for the checks below
def Sta_inf(data):
    print('_min',np.min(data))
    print('_max:',np.max(data))
    print('_mean',np.mean(data))
    print('_ptp',np.ptp(data))
    print('_std',np.std(data))
    print('_var',np.var(data))
# Summary statistics of the label
print('Sta of label:')
Sta_inf(Y_data)
Sta of label:
_min 0
_max: 100000
_mean 5599.181116
_ptp 100000
_std 7470.932963236185
_var 55814839.341169
## Plot a histogram of the label to see its distribution -- mostly within 0 to 20000
plt.hist(Y_data)
plt.show()
plt.close()
3.3 Fill missing values with -1
X_data = X_data.fillna(-1)
X_test = X_test.fillna(-1)
4. Model training and prediction
4.1 Five-fold cross-validation with xgb to check how the parameters perform
## xgb-Model
xgr = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, gamma=0, subsample=0.8,\
    colsample_bytree=0.9, max_depth=7) #,objective ='reg:squarederror'
scores_train = []
scores = []
## 5-fold cross-validation
sk=StratifiedKFold(n_splits=5,shuffle=True,random_state=0)
for train_ind,val_ind in sk.split(X_data,Y_data):
    train_x=X_data.iloc[train_ind].values
    train_y=Y_data.iloc[train_ind]
    val_x=X_data.iloc[val_ind].values
    val_y=Y_data.iloc[val_ind]
    xgr.fit(train_x,train_y)
    pred_train_xgb=xgr.predict(train_x)
    pred_xgb=xgr.predict(val_x)
    score_train = mean_absolute_error(train_y,pred_train_xgb)
    scores_train.append(score_train)
    score = mean_absolute_error(val_y,pred_xgb)
    scores.append(score)
# Note: score_train is the last fold's training MAE; np.mean(scores_train) would average all five folds
print('Train mae:',np.mean(score_train))
print('Val mae',np.mean(scores))
Train mae: 355.92984509530174
Val mae 416.219085805194
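The loop above splits with `StratifiedKFold`, which is designed for class labels, and prints `np.mean(score_train)`, which is the last fold's value only. A minimal sketch of an alternative, assuming plain `KFold` for a continuous target and averaging both lists over all folds. The synthetic data and the `LinearRegression` stand-in are assumptions made for this example so that it runs without xgboost.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic regression data, invented for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)

model = LinearRegression()  # stand-in for xgb.XGBRegressor
kf = KFold(n_splits=5, shuffle=True, random_state=0)

train_scores, val_scores = [], []
for train_idx, val_idx in kf.split(X):
    model.fit(X[train_idx], y[train_idx])
    train_scores.append(mean_absolute_error(y[train_idx], model.predict(X[train_idx])))
    val_scores.append(mean_absolute_error(y[val_idx], model.predict(X[val_idx])))

# Average over all folds, not just the last one.
print('Train MAE:', np.mean(train_scores))
print('Val MAE:', np.mean(val_scores))
```

Swapping the stand-in for the `xgr` model and the arrays for `X_data.values` and `Y_data.values` would give the same loop shape as the original.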
4.2 Define the xgb and lgb model functions
def build_model_xgb(x_train,y_train):
    model = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0, subsample=0.8,\
        colsample_bytree=0.9, max_depth=7) #, objective ='reg:squarederror'
    model.fit(x_train, y_train)
    return model

def build_model_lgb(x_train,y_train):
    estimator = lgb.LGBMRegressor(num_leaves=127,n_estimators = 150)
    param_grid = {'learning_rate': [0.01, 0.05, 0.1, 0.2],}
    # Grid-search the learning rate; the fitted GridSearchCV object is returned and used for prediction
    gbm = GridSearchCV(estimator, param_grid)
    gbm.fit(x_train, y_train)
    return gbm
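`build_model_lgb` returns the fitted `GridSearchCV` object, so the selected learning rate is never shown. With no `scoring` argument, `GridSearchCV` ranks candidates by the estimator's own `score` method, which is R² for regressors, not MAE. A minimal sketch of scoring the search with MAE and inspecting the result; the `Ridge` estimator, the `alpha` grid and the synthetic data are stand-ins chosen so that the example runs without lightgbm.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, invented for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.0, 0.5, -1.5, 2.0]) + rng.normal(scale=0.1, size=120)

# Ridge stands in for lgb.LGBMRegressor; the grid is over alpha instead of learning_rate.
search = GridSearchCV(
    Ridge(),
    param_grid={'alpha': [0.01, 0.1, 1.0, 10.0]},
    scoring='neg_mean_absolute_error',  # rank candidates by MAE, like the validation code
    cv=5,
)
search.fit(X, y)

print('best params:', search.best_params_)
print('best CV MAE:', -search.best_score_)
```

Calling `predict` on the search object uses the best estimator refitted on all the training data, which is what the prediction code in 4.3 relies on.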
4.3 Split the data (Train, Val) for training, evaluation and prediction
## Split data with val
x_train,x_val,y_train,y_val = train_test_split(X_data,Y_data,test_size=0.3)
print('Train lgb...')
model_lgb = build_model_lgb(x_train,y_train)
val_lgb = model_lgb.predict(x_val)
MAE_lgb = mean_absolute_error(y_val,val_lgb)
print('MAE of val with lgb:',MAE_lgb)
print('Predict lgb...')
model_lgb_pre = build_model_lgb(X_data,Y_data)
subA_lgb = model_lgb_pre.predict(X_test)
print('Sta of Predict lgb:')
Sta_inf(subA_lgb)
Train lgb...
MAE of val with lgb: 405.1620812257069
Predict lgb...
Sta of Predict lgb:
_min -962.1276302396495
_max: 93490.91395949211
_mean 5607.366367495364
_ptp 94453.04158973176
_std 7416.690755216019
_var 55007301.75850676
print('Train xgb...')
model_xgb = build_model_xgb(x_train,y_train)
val_xgb = model_xgb.predict(x_val)
MAE_xgb = mean_absolute_error(y_val,val_xgb)
print('MAE of val with xgb:',MAE_xgb)
print('Predict xgb...')
model_xgb_pre = build_model_xgb(X_data,Y_data)
subA_xgb = model_xgb_pre.predict(X_test)
print('Sta of Predict xgb:')
Sta_inf(subA_xgb)
Train xgb...
MAE of val with xgb: 412.91539463291707
Predict xgb...
Sta of Predict xgb:
_min -604.3489
_max: 93991.46
_mean 5603.4546
_ptp 94595.81
_std 7376.0728
_var 54406452.0
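4.4 Weighted blending of the two models' results
## A simple weighted blend is used here
val_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*val_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*val_xgb
val_Weighted[val_Weighted<0]=10 # Some predictions are negative, but a negative price cannot occur in reality, so those values are set to 10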
print('MAE of val with Weighted ensemble:',mean_absolute_error(y_val,val_Weighted))
MAE of val with Weighted ensemble: 387.83417717031836
sub_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*subA_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*subA_xgb
## Check the distribution of the predictions
plt.hist(sub_Weighted)
plt.show()
plt.close()
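The blend weights each model by the other model's share of the total validation MAE, so the model with the lower MAE gets the larger weight and the two weights sum to 1. The original code replaces negative values in `val_Weighted` but not in `sub_Weighted`. A small sketch with the validation MAEs reported above; the toy predictions are invented, and flooring at 0 instead of the 10 used above is an assumption of this example.

```python
import numpy as np

def blend_weights(mae_a, mae_b):
    """Weights inversely related to validation MAE; they sum to 1."""
    total = mae_a + mae_b
    return 1 - mae_a / total, 1 - mae_b / total

# Validation MAEs reported above for lgb and xgb.
w_lgb, w_xgb = blend_weights(405.1620812257069, 412.91539463291707)
print('w_lgb:', round(w_lgb, 4), 'w_xgb:', round(w_xgb, 4))

# Toy predictions (invented) to show the blend followed by a non-negative clip.
pred_lgb = np.array([5800.0, -120.0, 3000.0])
pred_xgb = np.array([5820.0, -80.0, 3100.0])
blended = w_lgb * pred_lgb + w_xgb * pred_xgb
blended = np.clip(blended, 0, None)  # assumption: floor at 0 rather than 10
print(blended)
```

With MAEs this close, the weights are nearly equal, so the blend is close to a plain average of the two models.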
4.5 Write out the results
sub = pd.DataFrame()
sub['SaleID'] = X_test.SaleID
sub['price'] = sub_Weighted
sub.to_csv('./sub_Weighted.csv',index=False)
sub.head()
| | SaleID | price |
---|---|---|
0 | 720326 | 5812.952572 |
1 | 714316 | 1372.148145 |
2 | 704693 | 3057.827599 |
3 | 624972 | 599.607068 |
4 | 669753 | 6902.600236 |
5. Submission result