Used Car Price Prediction Task 01: Understanding the Competition and Implementing a Baseline

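Task 01 covers understanding the competition and implementing a baseline. After a brief look at the data, I trained LGB and XGB on all of it with no real preprocessing and submitted the result, which scored 379.5001 and currently ranks 64th. The next step is data analysis and feature engineering, to process the data further and improve training and test performance. This is my first competition; with more study I hope to make it onto the first page of the leaderboard.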

Understanding the Competition

1. Competition overview

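The competition asks participants to build a model on the given dataset that predicts the transaction price of used cars.
The dataset can be viewed and downloaded after registration. It comes from the used-car transaction records of a trading platform and contains more than 400,000 records with 31 columns, 15 of which are anonymous variables. To keep the competition fair, 150,000 records are drawn as the training set, 50,000 as test set A, and 50,000 as test set B, and fields such as name, model, brand, and regionCode are anonymized.
The problem is meant to introduce people to AI data competitions and is aimed mainly at newcomers who want to practise and improve.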

2. Evaluation metric

[Figures: definition of the evaluation metric (images unavailable)]
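The baseline later in this article scores every model with mean_absolute_error, so a reasonable assumption is that the metric here is mean absolute error, MAE = (1/n) * sum(|y_i - yhat_i|). The sketch below computes it by hand in plain Python; the function name and the sample prices are made up for illustration.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute gap between true and predicted prices."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical prices, for illustration only
true_prices = [5000, 1200, 30000]
pred_prices = [5400, 1000, 29000]

mae_value = mean_absolute_error_manual(true_prices, pred_prices)
print("MAE:", mae_value)
```

Because every error counts in proportion to its size, a lower MAE means the predictions are closer to the true prices on average.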

3. Problem analysis

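This is a traditional data mining problem: a model is built with data science, machine learning, and deep learning methods to produce the result.
It is a typical regression problem.
The main tools are xgb, lgb, and catboost, along with common data mining libraries and frameworks such as pandas, numpy, matplotlib, seaborn, sklearn, and keras.
EDA is used to uncover relationships in the data and to get familiar with it.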

4. Code examples and analysis

4.1 Load the training and test sets and inspect the data

import pandas as pd

path = './'
# The data files are space-separated
train = pd.read_csv(path+'car_train_0110.csv', sep=' ')
test = pd.read_csv(path+'car_testA_0110.csv', sep=' ')
print('Train data shape:',train.shape)
print('TestA data shape:',test.shape)

Train data shape: (250000, 40)
TestA data shape: (50000, 39)

# Use .head() for a quick look at the loaded data
train.head()

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_14	v_15	v_16	v_17	v_18	v_19	v_20	v_21	v_22	v_23
0	134890	734	20160002	13.0	9	NaN	0.0	1.0	0	15.0	...	0.092139	0.000000	18.763832	-1.512063	-1.008718	-12.100623	-0.947052	9.077297	0.581214	3.945923
1	306648	196973	20080307	72.0	9	7.0	5.0	1.0	173	15.0	...	0.001070	0.122335	-5.685612	-0.489963	-2.223693	-0.226865	-0.658246	-3.949621	4.593618	-1.145653
2	340675	25347	20020312	18.0	12	3.0	0.0	1.0	50	12.5	...	0.064410	0.003345	-3.295700	1.816499	3.554439	-0.683675	0.971495	2.625318	-0.851922	-1.246135
3	57332	5382	20000611	38.0	8	7.0	0.0	1.0	54	15.0	...	0.069231	0.000000	-3.405521	1.497826	4.782636	0.039101	1.227646	3.040629	-0.801854	-1.251894
4	265235	173174	20030109	87.0	0	5.0	5.0	1.0	131	3.0	...	0.000099	0.001655	-4.475429	0.124138	1.364567	-0.319848	-1.131568	-3.303424	-1.998466	-1.279368

5 rows × 40 columns

# Use .info() to see the column names and how many values are missing (NaN)
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 40 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   SaleID             250000 non-null  int64
 1   name               250000 non-null  int64
 2   regDate            250000 non-null  int64
 3   model              250000 non-null  float64
 4   brand              250000 non-null  int64
 5   bodyType           224620 non-null  float64
 6   fuelType           227510 non-null  float64
 7   gearbox            236487 non-null  float64
 8   power              250000 non-null  int64
 9   kilometer          250000 non-null  float64
 10  notRepairedDamage  201464 non-null  float64
 11  regionCode         250000 non-null  int64
 12  seller             250000 non-null  int64
 13  offerType          250000 non-null  int64
 14  creatDate          250000 non-null  int64
 15  price              250000 non-null  int64
 16  v_0                250000 non-null  float64
 17  v_1                250000 non-null  float64
 18  v_2                250000 non-null  float64
 19  v_3                250000 non-null  float64
 20  v_4                250000 non-null  float64
 21  v_5                250000 non-null  float64
 22  v_6                250000 non-null  float64
 23  v_7                250000 non-null  float64
 24  v_8                250000 non-null  float64
 25  v_9                250000 non-null  float64
 26  v_10               250000 non-null  float64
 27  v_11               250000 non-null  float64
 28  v_12               250000 non-null  float64
 29  v_13               250000 non-null  float64
 30  v_14               250000 non-null  float64
 31  v_15               250000 non-null  float64
 32  v_16               250000 non-null  float64
 33  v_17               250000 non-null  float64
 34  v_18               250000 non-null  float64
 35  v_19               250000 non-null  float64
 36  v_20               250000 non-null  float64
 37  v_21               250000 non-null  float64
 38  v_22               250000 non-null  float64
 39  v_23               250000 non-null  float64
dtypes: float64(30), int64(10)
memory usage: 76.3 MB

# List the columns
train.columns

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
       'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
       'v_13', 'v_14', 'v_15', 'v_16', 'v_17', 'v_18', 'v_19', 'v_20', 'v_21',
       'v_22', 'v_23'],
      dtype='object')

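The training set has 40 columns. v_0 through v_23 are 24 numeric anonymous features; the remaining columns are:
SaleID - sale sample ID
name - vehicle code
regDate - vehicle registration date
model - model code
brand - brand
bodyType - body type
fuelType - fuel type
gearbox - gearbox
power - engine power
kilometer - distance driven
notRepairedDamage - whether the car has unrepaired damage
regionCode - code of the region where the car is viewed
seller - seller
offerType - offer type
creatDate - date the ad was posted
price - car price
All columns are numeric. Four of them (notRepairedDamage, bodyType, fuelType, gearbox) contain nulls; the others have no missing values.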
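To put numbers on the missing values, the sketch below turns the non-null counts printed by train.info() into missing shares. The counts are copied from that output; the variable names are only illustrative. With the DataFrame loaded, train.isnull().sum() should give the same missing counts directly.

```python
TOTAL_ROWS = 250000

# Non-null counts copied from the train.info() output above
non_null_counts = {
    "bodyType": 224620,
    "fuelType": 227510,
    "gearbox": 236487,
    "notRepairedDamage": 201464,
}

# Share of rows in which each column is missing
missing_share = {
    col: (TOTAL_ROWS - count) / TOTAL_ROWS
    for col, count in non_null_counts.items()
}

# Print the columns from most to least missing
for col, share in sorted(missing_share.items(), key=lambda kv: -kv[1]):
    print(f"{col:<18} missing: {share:.2%}")
```

notRepairedDamage is the sparsest column, with roughly one value in five missing, so how it is filled is likely to matter more than for the other three.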

# Use .describe() to see summary statistics for the numeric columns
train.describe()

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_14	v_15	v_16	v_17	v_18	v_19	v_20	v_21	v_22	v_23
count	250000.000000	250000.000000	2.500000e+05	250000.000000	250000.000000	224620.000000	227510.000000	236487.000000	250000.000000	250000.000000	...	250000.000000	250000.000000	250000.000000	250000.000000	250000.000000	250000.000000	250000.000000	250000.000000	250000.000000	250000.000000
mean	185351.790768	83153.362172	2.003401e+07	44.911480	7.785236	4.563271	1.665008	0.780783	115.528412	12.577418	...	0.032489	0.030408	0.014725	0.000915	0.006273	0.006604	-0.001374	0.000609	-0.004025	0.001834
std	107121.188763	72540.799964	7.770250e+04	50.640081	7.694010	1.912515	2.339646	0.413717	196.141828	3.990632	...	0.038792	0.049333	8.779163	5.771081	4.880981	4.124722	3.803626	3.555353	2.864713	2.323680
min	1.000000	0.000000	1.910000e+07	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.500000	...	0.000000	0.000000	-10.412444	-15.538236	-21.009214	-13.989955	-9.599285	-11.181255	-7.671327	-2.350888
25%	92501.750000	14500.000000	1.999061e+07	6.000000	1.000000	3.000000	0.000000	1.000000	70.000000	12.500000	...	0.000129	0.000000	-5.552269	-0.901181	-3.150385	-0.478173	-1.727237	-3.067073	-2.092178	-1.402804
50%	185264.500000	65314.500000	2.003111e+07	27.000000	6.000000	4.000000	0.000000	1.000000	105.000000	15.000000	...	0.001961	0.002567	-3.821770	0.223181	-0.058502	0.038427	-0.995044	-0.880587	-1.199807	-1.145588
75%	278128.500000	143761.250000	2.008081e+07	70.000000	11.000000	7.000000	5.000000	1.000000	150.000000	15.000000	...	0.075672	0.056568	3.599747	1.263737	2.800475	0.569198	1.563382	3.269987	2.737614	0.044865
max	370946.000000	233044.000000	2.019121e+07	250.000000	39.000000	7.000000	6.000000	1.000000	20000.000000	15.000000	...	0.130785	0.184340	36.756878	26.134561	23.055660	16.576027	20.324572	14.039422	8.764597	8.574730

8 rows × 40 columns

4.2 Classification metric examples

import numpy as np
from sklearn import metrics
from sklearn.metrics import accuracy_score

# accuracy
y_pred = [0,1,0,1]
y_true = [0,1,1,1]
print('accuracy:',accuracy_score(y_true=y_true,y_pred=y_pred))

accuracy: 0.75

## Precision,Recall,F1-score
y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]
print('Precision',metrics.precision_score(y_true, y_pred))
print('Recall',metrics.recall_score(y_true, y_pred))
print('F1-score:',metrics.f1_score(y_true, y_pred))

Precision 1.0
Recall 0.5
F1-score: 0.6666666666666666
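To see where these three numbers come from, here is a minimal sketch that derives them from the true-positive, false-positive, and false-negative counts. The helper name is hypothetical, and it handles only the binary case with label 1 as the positive class.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Binary precision, recall and F1 computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Same inputs as the sklearn example above
precision, recall, f1 = precision_recall_f1([0, 1, 0, 1], [0, 1, 0, 0])
print(precision, recall, f1)
```

There is one true positive, no false positive, and one false negative, which gives precision 1.0, recall 0.5, and an F1 of 2/3.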

## AUC
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
print('AUC score:',metrics.roc_auc_score(y_true, y_scores))

AUC score: 0.75
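AUC can also be read as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. The sketch below counts those pairs directly; it is an illustrative helper rather than how sklearn computes the value, and it counts ties as half a win.

```python
def auc_by_pairs(y_true, y_scores):
    """Fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_scores) if t == 1]
    neg = [s for t, s in zip(y_true, y_scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Same inputs as the sklearn example above
auc_value = auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print("AUC:", auc_value)
```

Of the four positive/negative pairs, only (0.35, 0.4) is ranked the wrong way round, so the result is 3/4.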

4.3 Regression metric examples

# MAPE has to be implemented by hand here.
# Note: the denominator is y_true rather than |y_true|, so the negative
# target below contributes a negative term to the mean.
def mape(y_true, y_pred):
    return np.mean(np.abs(y_pred - y_true) / y_true)

y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])

# MSE
print('MSE:',metrics.mean_squared_error(y_true, y_pred))
# RMSE
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
# MAE
print('MAE:',metrics.mean_absolute_error(y_true, y_pred))
# MAPE
print('MAPE:',mape(y_true, y_pred))

MSE: 0.2871428571428571
RMSE: 0.5358571238146014
MAE: 0.4142857142857143
MAPE: 0.07000000000000003
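The mape helper above divides by y_true itself, so the sample with target -3.0 subtracts from the average instead of adding to it. The usual definition takes the absolute value of the denominator. The variant below is a sketch of that definition in plain Python; the name mape_abs is made up for this example.

```python
def mape_abs(y_true, y_pred):
    """MAPE with |y_true| in the denominator; undefined for zero targets."""
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is 0")
    ratios = [abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)]
    return sum(ratios) / len(ratios)


# Same inputs as the regression example above
y_true = [1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0]
y_pred = [1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2]

mape_value = mape_abs(y_true, y_pred)
print("MAPE with |y_true|:", round(mape_value, 4))
```

On these inputs the absolute-denominator version comes out at about 0.146, roughly double the 0.07 printed above.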

# R2-score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print('R2-score:',metrics.r2_score(y_true, y_pred))

R2-score: 0.9486081370449679
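R2 compares the model's squared error with that of always predicting the mean: R2 = 1 - SS_res / SS_tot. A minimal hand-rolled version is sketched below (the function name is illustrative).

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


# Same inputs as the sklearn example above
r2_value = r2_manual([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
print("R2:", r2_value)
```

Here SS_res is 1.5 and SS_tot is 29.1875, which reproduces the value printed by metrics.r2_score.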

5. Summary: understanding the competition

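Understanding the problem is the foundation for tackling any competition. It shapes the later feature engineering and the choice of model, and above all it sets the direction of the work that follows, such as which features to mine or how to diagnose and fix problems. A clear grasp of the thinking and business logic behind the problem also helps you build effective features and models in less time. What should this understanding achieve? It should turn the problem into a high-level plan of attack. The points below look at this from several angles: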

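1) What exactly are we trying to understand? Is reading the background once enough? It is not. Understanding the problem also means laying it out intuitively and judging whether it is feasible, how feasible, and how much value there is in solving it. Work from the background to the task and its logic, consider which external data might be useful, and get a first feel for the provided data: what task-related data exists and how the fields relate to one another. Different problems call for very different handling. Put briefly, from a competition or engineering point of view: what type of problem is this, which metrics are likely to be used, will those metrics behave consistently online and offline, and do they support an offline validation scheme that helps push the online score higher? On the business side, do you understand the raw features well enough to explore their relationships through EDA and build features you are satisfied with?
2) What can you do once you understand the problem? Having worked out the problem type and gained a feel for the data, are you done? Not yet. Like scouting the enemy before a battle, you should now form some analysis of your own: where the difficulties and key points probably lie, where better features might be mined, which offline validation scheme is likely to be more stable, what you could try if overfitting or other issues appear, which data is reliable, which needs careful processing, and which is likely to be key given the business logic (for a CTR problem, what purchase-behaviour patterns a typical customer shows; for a wind-power problem, whether neighbouring turbines have similar wind-speed and rotor-speed features). This analysis is done at a high level; it helps map out the overall approach and the direction of later work.
3) The evaluation metric. This deserves separate treatment because it bears on two important questions in later modelling. First, local validation: online evaluation is usually limited in time and in number of submissions, so building a sensible local validation set and local metric is a key step that saves a lot of time. Second, different metrics are sensitive to errors in different ways even on the same predictions, for example AUC, logloss, MAE, RMSE, or a competition-specific scoring function. The metric is therefore likely to influence what later predictions should focus on.
4) Conditions hidden in the background. Some statements in the problem description are very useful and can show up later in the defence and in your thinking, such as efficiency requirements, detection and handling of anomalous data, differences between process flows, model running time, and model robustness. Some of these ideas run through the whole pipeline (problem framing, features, models, post-processing), and others help a great deal with feature construction or model choice. Conversely, when a model predicts poorly, it is sometimes worth asking whether part of the background was misunderstood or some issue was overlooked.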
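Point 3 says that different metrics react differently to the same errors. The toy comparison below, with made-up error values, shows one such difference: two sets of predictions with the same MAE can have very different RMSE when the error is concentrated in a single sample.

```python
import math


def mae(errors):
    """Mean absolute error over a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error over a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical residuals (predicted price minus true price)
even_errors = [100, -100, 100, -100]  # every prediction is off by 100
one_big_miss = [0, 0, 0, 400]         # same total absolute error, one outlier

for name, errs in [("even", even_errors), ("one big miss", one_big_miss)]:
    print(f"{name:>12}: MAE={mae(errs):.1f}  RMSE={rmse(errs):.1f}")
```

Both cases have an MAE of 100, but the RMSE doubles in the second one. Under an MAE-style metric a few large misses cost no more than many small ones of the same total size, which is worth keeping in mind when choosing a training objective and a local validation score.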

Baseline implementation

1. Import the required packages

## Basic tools
import numpy as np
import pandas as pd
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import jn
from IPython.display import display, clear_output
import time

warnings.filterwarnings('ignore')
%matplotlib inline

## Models
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor

## Dimensionality reduction
from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCA

import lightgbm as lgb
import xgboost as xgb

## Parameter search and evaluation
from sklearn.model_selection import GridSearchCV,cross_val_score,StratifiedKFold,train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Install the required packages
#!pip --default-timeout=100 install xgboost -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com

2. Read the data and take a first look

path = './'
train = pd.read_csv(path+'car_train_0110.csv', sep=' ')
test = pd.read_csv(path+'car_testA_0110.csv', sep=' ')
print('Train data shape:',train.shape)
print('TestA data shape:',test.shape)

Train data shape: (250000, 40)
TestA data shape: (50000, 39)

train.head()

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_14	v_15	v_16	v_17	v_18	v_19	v_20	v_21	v_22	v_23
0	134890	734	20160002	13.0	9	NaN	0.0	1.0	0	15.0	...	0.092139	0.000000	18.763832	-1.512063	-1.008718	-12.100623	-0.947052	9.077297	0.581214	3.945923
1	306648	196973	20080307	72.0	9	7.0	5.0	1.0	173	15.0	...	0.001070	0.122335	-5.685612	-0.489963	-2.223693	-0.226865	-0.658246	-3.949621	4.593618	-1.145653
2	340675	25347	20020312	18.0	12	3.0	0.0	1.0	50	12.5	...	0.064410	0.003345	-3.295700	1.816499	3.554439	-0.683675	0.971495	2.625318	-0.851922	-1.246135
3	57332	5382	20000611	38.0	8	7.0	0.0	1.0	54	15.0	...	0.069231	0.000000	-3.405521	1.497826	4.782636	0.039101	1.227646	3.040629	-0.801854	-1.251894
4	265235	173174	20030109	87.0	0	5.0	5.0	1.0	131	3.0	...	0.000099	0.001655	-4.475429	0.124138	1.364567	-0.319848	-1.131568	-3.303424	-1.998466	-1.279368

5 rows × 40 columns

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 40 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   SaleID             250000 non-null  int64
 1   name               250000 non-null  int64
 2   regDate            250000 non-null  int64
 3   model              250000 non-null  float64
 4   brand              250000 non-null  int64
 5   bodyType           224620 non-null  float64
 6   fuelType           227510 non-null  float64
 7   gearbox            236487 non-null  float64
 8   power              250000 non-null  int64
 9   kilometer          250000 non-null  float64
 10  notRepairedDamage  201464 non-null  float64
 11  regionCode         250000 non-null  int64
 12  seller             250000 non-null  int64
 13  offerType          250000 non-null  int64
 14  creatDate          250000 non-null  int64
 15  price              250000 non-null  int64
 16  v_0                250000 non-null  float64
 17  v_1                250000 non-null  float64
 18  v_2                250000 non-null  float64
 19  v_3                250000 non-null  float64
 20  v_4                250000 non-null  float64
 21  v_5                250000 non-null  float64
 22  v_6                250000 non-null  float64
 23  v_7                250000 non-null  float64
 24  v_8                250000 non-null  float64
 25  v_9                250000 non-null  float64
 26  v_10               250000 non-null  float64
 27  v_11               250000 non-null  float64
 28  v_12               250000 non-null  float64
 29  v_13               250000 non-null  float64
 30  v_14               250000 non-null  float64
 31  v_15               250000 non-null  float64
 32  v_16               250000 non-null  float64
 33  v_17               250000 non-null  float64
 34  v_18               250000 non-null  float64
 35  v_19               250000 non-null  float64
 36  v_20               250000 non-null  float64
 37  v_21               250000 non-null  float64
 38  v_22               250000 non-null  float64
 39  v_23               250000 non-null  float64
dtypes: float64(30), int64(10)
memory usage: 76.3 MB

test.describe()

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_14	v_15	v_16	v_17	v_18	v_19	v_20	v_21	v_22	v_23
count	50000.000000	50000.000000	5.000000e+04	50000.000000	50000.000000	44890.000000	45598.000000	47287.000000	50000.000000	50000.000000	...	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000
mean	556029.053380	82878.251420	2.003441e+07	44.922840	7.779420	4.556226	1.681192	0.781081	114.116060	12.555210	...	0.032570	0.030773	-0.024819	0.007051	-0.008488	-0.030104	0.014609	-0.003353	0.013125	-0.011936
std	106952.402565	72292.076936	7.788055e+04	50.576255	7.661667	1.908291	2.344829	0.413518	177.274154	4.034901	...	0.038779	0.049521	8.759663	5.784299	4.825261	4.100561	3.812667	3.548944	2.866774	2.316144
min	370951.000000	0.000000	1.910000e+07	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.500000	...	0.000000	0.000000	-10.196998	-15.167961	-21.925773	-13.682825	-9.282567	-11.117367	-6.365723	-2.394516
25%	463258.500000	14121.250000	1.999061e+07	6.000000	1.000000	3.000000	0.000000	1.000000	69.000000	12.500000	...	0.000135	0.000000	-5.575131	-0.891030	-3.105073	-0.481952	-1.697763	-3.069575	-2.089326	-1.402958
50%	556296.000000	65359.000000	2.003111e+07	27.000000	6.000000	4.000000	0.000000	1.000000	105.000000	15.000000	...	0.001949	0.002593	-3.837572	0.221379	-0.081836	0.039376	-0.971210	-0.877377	-1.192502	-1.146398
75%	648862.250000	143083.750000	2.008091e+07	70.000000	11.000000	7.000000	5.000000	1.000000	150.000000	15.000000	...	0.075826	0.062063	3.531269	1.257687	2.784538	0.560046	1.572508	3.276918	2.772742	-0.010769
max	741887.000000	233028.000000	2.019040e+07	248.000000	39.000000	7.000000	6.000000	1.000000	17700.000000	15.000000	...	0.135900	0.180091	36.364986	26.043572	22.598441	16.333051	20.273633	11.691851	7.970303	8.749647

8 rows × 39 columns

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 39 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   SaleID             50000 non-null  int64
 1   name               50000 non-null  int64
 2   regDate            50000 non-null  int64
 3   model              50000 non-null  float64
 4   brand              50000 non-null  int64
 5   bodyType           44890 non-null  float64
 6   fuelType           45598 non-null  float64
 7   gearbox            47287 non-null  float64
 8   power              50000 non-null  int64
 9   kilometer          50000 non-null  float64
 10  notRepairedDamage  40372 non-null  float64
 11  regionCode         50000 non-null  int64
 12  seller             50000 non-null  int64
 13  offerType          50000 non-null  int64
 14  creatDate          50000 non-null  int64
 15  v_0                50000 non-null  float64
 16  v_1                50000 non-null  float64
 17  v_2                50000 non-null  float64
 18  v_3                50000 non-null  float64
 19  v_4                50000 non-null  float64
 20  v_5                50000 non-null  float64
 21  v_6                50000 non-null  float64
 22  v_7                50000 non-null  float64
 23  v_8                50000 non-null  float64
 24  v_9                50000 non-null  float64
 25  v_10               50000 non-null  float64
 26  v_11               50000 non-null  float64
 27  v_12               50000 non-null  float64
 28  v_13               50000 non-null  float64
 29  v_14               50000 non-null  float64
 30  v_15               50000 non-null  float64
 31  v_16               50000 non-null  float64
 32  v_17               50000 non-null  float64
 33  v_18               50000 non-null  float64
 34  v_19               50000 non-null  float64
 35  v_20               50000 non-null  float64
 36  v_21               50000 non-null  float64
 37  v_22               50000 non-null  float64
 38  v_23               50000 non-null  float64
dtypes: float64(30), int64(9)
memory usage: 14.9 MB

The test set has nulls in the same columns as the training set.

train.describe()

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_14	v_15	v_16	v_17	v_18	v_19	v_20	v_21	v_22	v_23
count	250000.000000	250000.000000	2.500000e+05	250000.000000	250000.000000	224620.000000	227510.000000	236487.000000	250000.000000	250000.000000	...	250000.000000	250000.000000	250000.000000	250000.000000	250000.000000	250000.000000	250000.000000	250000.000000	250000.000000	250000.000000
mean	185351.790768	83153.362172	2.003401e+07	44.911480	7.785236	4.563271	1.665008	0.780783	115.528412	12.577418	...	0.032489	0.030408	0.014725	0.000915	0.006273	0.006604	-0.001374	0.000609	-0.004025	0.001834
std	107121.188763	72540.799964	7.770250e+04	50.640081	7.694010	1.912515	2.339646	0.413717	196.141828	3.990632	...	0.038792	0.049333	8.779163	5.771081	4.880981	4.124722	3.803626	3.555353	2.864713	2.323680
min	1.000000	0.000000	1.910000e+07	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.500000	...	0.000000	0.000000	-10.412444	-15.538236	-21.009214	-13.989955	-9.599285	-11.181255	-7.671327	-2.350888
25%	92501.750000	14500.000000	1.999061e+07	6.000000	1.000000	3.000000	0.000000	1.000000	70.000000	12.500000	...	0.000129	0.000000	-5.552269	-0.901181	-3.150385	-0.478173	-1.727237	-3.067073	-2.092178	-1.402804
50%	185264.500000	65314.500000	2.003111e+07	27.000000	6.000000	4.000000	0.000000	1.000000	105.000000	15.000000	...	0.001961	0.002567	-3.821770	0.223181	-0.058502	0.038427	-0.995044	-0.880587	-1.199807	-1.145588
75%	278128.500000	143761.250000	2.008081e+07	70.000000	11.000000	7.000000	5.000000	1.000000	150.000000	15.000000	...	0.075672	0.056568	3.599747	1.263737	2.800475	0.569198	1.563382	3.269987	2.737614	0.044865
max	370946.000000	233044.000000	2.019121e+07	250.000000	39.000000	7.000000	6.000000	1.000000	20000.000000	15.000000	...	0.130785	0.184340	36.756878	26.134561	23.055660	16.576027	20.324572	14.039422	8.764597	8.574730

8 rows × 40 columns

The means of the anonymous feature columns are all close to 0, which suggests these features have already been preprocessed.

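3. Building features and labels

3.1 Extract the feature column names

Split the columns into numerical and categorical features.

# Numerical columns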
numerical_cols = train.select_dtypes(exclude = 'object').columns
print(numerical_cols)

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
       'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
       'v_13', 'v_14', 'v_15', 'v_16', 'v_17', 'v_18', 'v_19', 'v_20', 'v_21',
       'v_22', 'v_23'],
      dtype='object')

# Categorical columns -- empty here, unlike the standard used-car baseline data, where this selection picks up categorical columns
categorical_cols = train.select_dtypes(include = 'object').columns
print(categorical_cols)

Index([], dtype='object')

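3.2 Build the training and test samples

# Select the feature columns -- start by using every column except the label
feature_cols = [col for col in numerical_cols if 'price' not in col]

# Build the training and test samples from the feature columns and the label column
X_data = train[feature_cols]
Y_data = train['price']
X_test = test[feature_cols]
print('X train shape:',X_data.shape)
print('X test shape:',X_test.shape)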

X train shape: (250000, 39)
X test shape: (50000, 39)

X_data

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_14	v_15	v_16	v_17	v_18	v_19	v_20	v_21	v_22	v_23
0	134890	734	20160002	13.0	9	NaN	0.0	1.0	0	15.0	...	0.092139	0.000000	18.763832	-1.512063	-1.008718	-12.100623	-0.947052	9.077297	0.581214	3.945923
1	306648	196973	20080307	72.0	9	7.0	5.0	1.0	173	15.0	...	0.001070	0.122335	-5.685612	-0.489963	-2.223693	-0.226865	-0.658246	-3.949621	4.593618	-1.145653
2	340675	25347	20020312	18.0	12	3.0	0.0	1.0	50	12.5	...	0.064410	0.003345	-3.295700	1.816499	3.554439	-0.683675	0.971495	2.625318	-0.851922	-1.246135
3	57332	5382	20000611	38.0	8	7.0	0.0	1.0	54	15.0	...	0.069231	0.000000	-3.405521	1.497826	4.782636	0.039101	1.227646	3.040629	-0.801854	-1.251894
4	265235	173174	20030109	87.0	0	5.0	5.0	1.0	131	3.0	...	0.000099	0.001655	-4.475429	0.124138	1.364567	-0.319848	-1.131568	-3.303424	-1.998466	-1.279368
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
249995	10556	9332	20170003	13.0	9	NaN	NaN	1.0	58	15.0	...	0.079119	0.001447	11.782508	20.402576	-2.722772	0.462388	-4.429385	7.883413	0.698405	-1.082013
249996	146710	102110	20030511	29.0	17	3.0	0.0	0.0	61	15.0	...	0.000000	0.002342	-2.988272	1.500532	3.502201	-0.761715	-2.484556	-2.532968	-0.940266	-1.106426
249997	116066	82802	20130312	124.0	16	6.0	0.0	1.0	122	3.0	...	0.003358	0.100760	-6.939560	-1.144959	-5.337949	0.896026	-0.592565	-3.872725	2.135984	3.807554
249998	90082	65971	20121212	111.0	4	7.0	5.0	0.0	184	9.0	...	0.002974	0.008251	-7.222167	-1.383696	-5.402794	-0.409451	-1.891556	-3.104789	-3.777374	3.186218
249999	76453	56954	20051111	13.0	9	3.0	0.0	1.0	58	12.5	...	0.000000	0.009071	10.491312	-11.270043	-0.272595	-0.026478	-2.168249	-0.980042	-0.955164	-1.169593

250000 rows × 39 columns

# A helper that prints summary statistics, reused in later checks
def Sta_inf(data):
    print('_min',np.min(data))
    print('_max:',np.max(data))
    print('_mean',np.mean(data))
    print('_ptp',np.ptp(data))
    print('_std',np.std(data))
    print('_var',np.var(data))

# Summary statistics of the label
print('Sta of label:')
Sta_inf(Y_data)

Sta of label:
_min 0
_max: 100000
_mean 5599.181116
_ptp 100000
_std 7470.932963236185
_var 55814839.341169

## Plot a histogram of the label -- most values fall in the 0 to 20000 range
plt.hist(Y_data)
plt.show()
plt.close()

[Figure: histogram of the price label]

3.3 Fill missing values with -1

X_data = X_data.fillna(-1)
X_test = X_test.fillna(-1)

4. Model training and prediction

4.1 Five-fold cross-validation with xgb to check the parameter settings

## xgb-Model
xgr = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, gamma=0, subsample=0.8,
                       colsample_bytree=0.9, max_depth=7) #,objective ='reg:squarederror'
scores_train = []
scores = []

## 5-fold cross-validation
# StratifiedKFold treats each distinct integer price as a class here
sk=StratifiedKFold(n_splits=5,shuffle=True,random_state=0)
for train_ind,val_ind in sk.split(X_data,Y_data):
    train_x=X_data.iloc[train_ind].values
    train_y=Y_data.iloc[train_ind]
    val_x=X_data.iloc[val_ind].values
    val_y=Y_data.iloc[val_ind]

    xgr.fit(train_x,train_y)
    pred_train_xgb=xgr.predict(train_x)
    pred_xgb=xgr.predict(val_x)

    score_train = mean_absolute_error(train_y,pred_train_xgb)
    scores_train.append(score_train)
    score = mean_absolute_error(val_y,pred_xgb)
    scores.append(score)

# Note: score_train is the last fold's training MAE; np.mean(scores_train) would average all folds
print('Train mae:',np.mean(score_train))
print('Val mae',np.mean(scores))

Train mae: 355.92984509530174
Val mae 416.219085805194
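To make the mechanics of the loop above concrete, here is a small standard-library sketch of K-fold validation. It is not the sklearn implementation: kfold_indices is a hypothetical helper, the prices are made up, and the "model" simply predicts the mean of the training fold. The point is only that every sample lands in exactly one validation fold and the fold scores are averaged at the end.

```python
import random


def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs that partition range(n_samples)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# Hypothetical prices; the "model" predicts the training-fold mean
prices = [1200, 3500, 800, 15000, 4200, 6100, 950, 22000, 3000, 7800]

fold_maes = []
for train_idx, val_idx in kfold_indices(len(prices), n_splits=5, seed=0):
    train_mean = sum(prices[i] for i in train_idx) / len(train_idx)
    fold_mae = sum(abs(prices[i] - train_mean) for i in val_idx) / len(val_idx)
    fold_maes.append(fold_mae)

print("Val MAE averaged over folds:", sum(fold_maes) / len(fold_maes))
```

Averaging over all folds, rather than reading off a single fold, is what makes the validation score a steadier guide when comparing parameter settings.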

4.2 Define the xgb and lgb model functions

def build_model_xgb(x_train,y_train):
    model = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0, subsample=0.8,
                             colsample_bytree=0.9, max_depth=7) #, objective ='reg:squarederror'
    model.fit(x_train, y_train)
    return model

def build_model_lgb(x_train,y_train):
    estimator = lgb.LGBMRegressor(num_leaves=127,n_estimators = 150)
    # Grid search over the learning rate only
    param_grid = {
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
    }
    gbm = GridSearchCV(estimator, param_grid)
    gbm.fit(x_train, y_train)
    return gbm

4.3 Split the data (train/val), then train, evaluate, and predict

## Split data with val
x_train,x_val,y_train,y_val = train_test_split(X_data,Y_data,test_size=0.3)

print('Train lgb...')
model_lgb = build_model_lgb(x_train,y_train)
val_lgb = model_lgb.predict(x_val)
MAE_lgb = mean_absolute_error(y_val,val_lgb)
print('MAE of val with lgb:',MAE_lgb)
print('Predict lgb...')
model_lgb_pre = build_model_lgb(X_data,Y_data)
subA_lgb = model_lgb_pre.predict(X_test)
print('Sta of Predict lgb:')
Sta_inf(subA_lgb)

Train lgb...
MAE of val with lgb: 405.1620812257069
Predict lgb...
Sta of Predict lgb:
_min -962.1276302396495
_max: 93490.91395949211
_mean 5607.366367495364
_ptp 94453.04158973176
_std 7416.690755216019
_var 55007301.75850676

print('Train xgb...')
model_xgb = build_model_xgb(x_train,y_train)
val_xgb = model_xgb.predict(x_val)
MAE_xgb = mean_absolute_error(y_val,val_xgb)
print('MAE of val with xgb:',MAE_xgb)
print('Predict xgb...')
model_xgb_pre = build_model_xgb(X_data,Y_data)
subA_xgb = model_xgb_pre.predict(X_test)
print('Sta of Predict xgb:')
Sta_inf(subA_xgb)

Train xgb...
MAE of val with xgb: 412.91539463291707
Predict xgb...
Sta of Predict xgb:
_min -604.3489
_max: 93991.46
_mean 5603.4546
_ptp 94595.81
_std 7376.0728
_var 54406452.0

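4.4 Weighted blending of the two models' predictions

## A simple weighted blend: the model with the lower validation MAE gets the larger weight
val_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*val_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*val_xgb
val_Weighted[val_Weighted<0]=10 # Some predictions are negative, but a real price cannot be negative, so they are set to 10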
print('MAE of val with Weighted ensemble:',mean_absolute_error(y_val,val_Weighted))

MAE of val with Weighted ensemble: 387.83417717031836
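The weighting formula is easier to read once the two weights are written out. The sketch below plugs in the validation MAEs reported in section 4.3; the helper names and the sample predictions are made up for illustration.

```python
def inverse_error_weights(mae_a, mae_b):
    """Weights 1 - mae/(mae_a + mae_b); they sum to 1 and favour the lower MAE."""
    total = mae_a + mae_b
    return 1 - mae_a / total, 1 - mae_b / total


def blend(pred_a, pred_b, w_a, w_b, floor=10):
    """Weighted average; negative blends are replaced by `floor`, as in the cell above."""
    blended_values = []
    for a, b in zip(pred_a, pred_b):
        value = w_a * a + w_b * b
        blended_values.append(floor if value < 0 else value)
    return blended_values


# Validation MAEs reported in section 4.3
MAE_lgb = 405.1620812257069
MAE_xgb = 412.91539463291707
w_lgb, w_xgb = inverse_error_weights(MAE_lgb, MAE_xgb)

# Hypothetical predictions from the two models
blended = blend([5000.0, -900.0], [5200.0, -600.0], w_lgb, w_xgb)
print(round(w_lgb, 4), round(w_xgb, 4), blended)
```

Since the two MAEs are close, the weights land near 0.5 each, so this blend behaves almost like a plain average. The gain on validation comes mostly from the two models making different errors.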

sub_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*subA_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*subA_xgb
## Check the distribution of the predicted values
plt.hist(sub_Weighted)
plt.show()
plt.close()

4.5 Write the results

sub = pd.DataFrame()
sub['SaleID'] = X_test.SaleID
sub['price'] = sub_Weighted
sub.to_csv('./sub_Weighted.csv',index=False)

sub.head()

	SaleID	price
0	720326	5812.952572
1	714316	1372.148145
2	704693	3057.827599
3	624972	599.607068
4	669753	6902.600236
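Both models produced negative minimum predictions on the test set, and sub_Weighted is not clipped the way val_Weighted is, so a quick sanity check on the submission file before uploading may be worthwhile. The sketch below is a hypothetical checker using only the standard library; the first two sample rows come from sub.head(), and the third is a made-up row with a negative price.

```python
import csv
import io


def negative_price_ids(csv_text, expected_rows):
    """Check header and row count, then return SaleIDs whose price is negative."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["SaleID", "price"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = list(reader)
    if len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(rows)}")
    return [row["SaleID"] for row in rows if float(row["price"]) < 0]


# First two rows match sub.head() above; the third is a made-up negative price
sample_csv = (
    "SaleID,price\n"
    "720326,5812.952572\n"
    "714316,1372.148145\n"
    "999999,-12.5\n"
)

bad_ids = negative_price_ids(sample_csv, expected_rows=3)
print("SaleIDs with negative price:", bad_ids)
```

For the real file, the text of sub_Weighted.csv would be passed in with expected_rows=50000, the size of test set A.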

5. Submission result

[Figure: screenshot of the submission result; as noted at the top, the score was 379.5001, ranked 64th at the time]
