Used Car Price Prediction Task01: Understanding the Competition and Implementing a Baseline

2023-11-05 15:30

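Task01 covers understanding the competition and implementing a baseline. After a brief look at the data, I trained LGB and XGB on all of it without any preprocessing and submitted the combined predictions, scoring 379.5001 and currently ranking 64th. The next step is data analysis and feature engineering, processing the data further to improve training and test performance. This is my first competition, and I hope to reach the first page of the leaderboard as I keep learning.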

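Understanding the Competition

1. Competition overview

Participants are asked to build a model on the given dataset that predicts the transaction price of used cars.
The task is to predict used-car transaction prices. The dataset is visible and downloadable after registration and comes from the used-car transaction records of a trading platform. It contains more than 400,000 records with 31 columns of variables, 15 of which are anonymous. To keep the competition fair, 150,000 records are drawn as the training set, 50,000 as test set A, and 50,000 as test set B, and fields such as name, model, brand, and regionCode are desensitized.
The competition is meant to introduce people to the world of AI data competitions and is aimed mainly at newcomers for self-practice and self-improvement.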

2. Evaluation metric

[Images: definition of the evaluation metric (unavailable)]

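3. Task analysis

  1. This is a traditional data mining problem: the result comes from modeling with data science, machine learning, and deep learning methods.
  2. It is a typical regression problem.
  3. The main tools are xgb, lgb, and catboost, along with common data mining libraries and frameworks such as pandas, numpy, matplotlib, seaborn, sklearn, and keras.
  4. Use EDA to uncover relationships in the data and to get familiar with it.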
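As a rough illustration of the EDA step for a regression task, the sketch below ranks features by their correlation with the target. The data are synthetic and the coefficients are made up; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training table; every value here is synthetic.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "power": rng.integers(50, 200, size=n).astype(float),
    "kilometer": rng.choice([5.0, 10.0, 12.5, 15.0], size=n),
})
# Synthetic price: rises with power, falls with mileage, plus noise.
toy["price"] = 40 * toy["power"] - 150 * toy["kilometer"] + rng.normal(0, 300, size=n)

# Correlation of each feature with the regression target.
corr_with_price = toy.corr()["price"].drop("price").sort_values(ascending=False)
print(corr_with_price)
```

On real data the same two lines at the end can be pointed at the training DataFrame to get a first ranking of candidate features.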

4. Code examples and analysis

4.1 Load the training and test sets and inspect the data

import pandas as pd

path = './'
train = pd.read_csv(path + 'car_train_0110.csv', sep=' ')
test = pd.read_csv(path + 'car_testA_0110.csv', sep=' ')
print('Train data shape:', train.shape)
print('TestA data shape:', test.shape)
Train data shape: (250000, 40)
TestA data shape: (50000, 39)
# Use .head() to take a quick look at the loaded data
train.head()
[Output omitted: first five rows of train, 5 rows × 40 columns]
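The files are space-separated, which is why `sep=' '` is passed to `read_csv`. A minimal sketch on an in-memory string shows the effect; the three columns and their values are made up for illustration.

```python
import io
import pandas as pd

# Two made-up rows in the same space-separated layout as the competition files.
raw = "SaleID name price\n1 734 520\n2 196973 5500\n"

df = pd.read_csv(io.StringIO(raw), sep=' ')
print(df.shape)
print(df.dtypes)
```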

# .info() gives a quick view of the column names and the NaN (missing-value) counts
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 40 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   SaleID             250000 non-null  int64
 1   name               250000 non-null  int64
 2   regDate            250000 non-null  int64
 3   model              250000 non-null  float64
 4   brand              250000 non-null  int64
 5   bodyType           224620 non-null  float64
 6   fuelType           227510 non-null  float64
 7   gearbox            236487 non-null  float64
 8   power              250000 non-null  int64
 9   kilometer          250000 non-null  float64
 10  notRepairedDamage  201464 non-null  float64
 11  regionCode         250000 non-null  int64
 12  seller             250000 non-null  int64
 13  offerType          250000 non-null  int64
 14  creatDate          250000 non-null  int64
 15  price              250000 non-null  int64
 16  v_0                250000 non-null  float64
 17  v_1                250000 non-null  float64
 18  v_2                250000 non-null  float64
 19  v_3                250000 non-null  float64
 20  v_4                250000 non-null  float64
 21  v_5                250000 non-null  float64
 22  v_6                250000 non-null  float64
 23  v_7                250000 non-null  float64
 24  v_8                250000 non-null  float64
 25  v_9                250000 non-null  float64
 26  v_10               250000 non-null  float64
 27  v_11               250000 non-null  float64
 28  v_12               250000 non-null  float64
 29  v_13               250000 non-null  float64
 30  v_14               250000 non-null  float64
 31  v_15               250000 non-null  float64
 32  v_16               250000 non-null  float64
 33  v_17               250000 non-null  float64
 34  v_18               250000 non-null  float64
 35  v_19               250000 non-null  float64
 36  v_20               250000 non-null  float64
 37  v_21               250000 non-null  float64
 38  v_22               250000 non-null  float64
 39  v_23               250000 non-null  float64
dtypes: float64(30), int64(10)
memory usage: 76.3 MB
# List the columns
train.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
       'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
       'v_13', 'v_14', 'v_15', 'v_16', 'v_17', 'v_18', 'v_19', 'v_20', 'v_21',
       'v_22', 'v_23'],
      dtype='object')
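  • Each record has 40 columns, including the 24 numeric anonymous features v_0 to v_23. The remaining columns are:
  • SaleID - sale record ID
  • name - car code
  • regDate - car registration date
  • model - model code
  • brand - brand
  • bodyType - body type
  • fuelType - fuel type
  • gearbox - gearbox
  • power - engine power
  • kilometer - kilometers driven
  • notRepairedDamage - whether the car has unrepaired damage
  • regionCode - code of the region where the car is viewed
  • seller - seller
  • offerType - offer type
  • creatDate - date the listing was posted
  • price - car price
  • All columns are numeric. Only four of them, 'notRepairedDamage', 'bodyType', 'fuelType', and 'gearbox', contain null values; the rest have none.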
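The null counts read off `.info()` can also be computed directly. The following is a minimal sketch on a synthetic frame; the column names follow the dataset, while the values and the `report` layout are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic frame: column names follow the dataset, values are made up.
toy = pd.DataFrame({
    "bodyType": [3.0, np.nan, 7.0, 5.0],
    "gearbox": [1.0, 0.0, np.nan, np.nan],
    "power": [0, 173, 50, 54],
})

# Count and share of missing values per column, keeping only columns with gaps.
missing = toy.isnull().sum()
report = pd.DataFrame({"n_missing": missing, "share": missing / len(toy)})
report = report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(report)
```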
# .describe() shows summary statistics for the numeric columns
train.describe()
              SaleID           name        regDate          model          brand       bodyType       fuelType        gearbox          power      kilometer
count  250000.000000  250000.000000   2.500000e+05  250000.000000  250000.000000  224620.000000  227510.000000  236487.000000  250000.000000  250000.000000
mean   185351.790768   83153.362172   2.003401e+07      44.911480       7.785236       4.563271       1.665008       0.780783     115.528412      12.577418
std    107121.188763   72540.799964   7.770250e+04      50.640081       7.694010       1.912515       2.339646       0.413717     196.141828       3.990632
min         1.000000       0.000000   1.910000e+07       0.000000       0.000000       0.000000       0.000000       0.000000       0.000000       0.500000
25%     92501.750000   14500.000000   1.999061e+07       6.000000       1.000000       3.000000       0.000000       1.000000      70.000000      12.500000
50%    185264.500000   65314.500000   2.003111e+07      27.000000       6.000000       4.000000       0.000000       1.000000     105.000000      15.000000
75%    278128.500000  143761.250000   2.008081e+07      70.000000      11.000000       7.000000       5.000000       1.000000     150.000000      15.000000
max    370946.000000  233044.000000   2.019121e+07     250.000000      39.000000       7.000000       6.000000       1.000000   20000.000000      15.000000
...
                v_14           v_15           v_16           v_17           v_18           v_19           v_20           v_21           v_22           v_23
count  250000.000000  250000.000000  250000.000000  250000.000000  250000.000000  250000.000000  250000.000000  250000.000000  250000.000000  250000.000000
mean        0.032489       0.030408       0.014725       0.000915       0.006273       0.006604      -0.001374       0.000609      -0.004025       0.001834
std         0.038792       0.049333       8.779163       5.771081       4.880981       4.124722       3.803626       3.555353       2.864713       2.323680
min         0.000000       0.000000     -10.412444     -15.538236     -21.009214     -13.989955      -9.599285     -11.181255      -7.671327      -2.350888
25%         0.000129       0.000000      -5.552269      -0.901181      -3.150385      -0.478173      -1.727237      -3.067073      -2.092178      -1.402804
50%         0.001961       0.002567      -3.821770       0.223181      -0.058502       0.038427      -0.995044      -0.880587      -1.199807      -1.145588
75%         0.075672       0.056568       3.599747       1.263737       2.800475       0.569198       1.563382       3.269987       2.737614       0.044865
max         0.130785       0.184340      36.756878      26.134561      23.055660      16.576027      20.324572      14.039422       8.764597       8.574730

8 rows × 40 columns

4.2 Classification metric examples

import numpy as np
from sklearn import metrics
from sklearn.metrics import accuracy_score

# Accuracy
y_pred = [0, 1, 0, 1]
y_true = [0, 1, 1, 1]
print('accuracy:', accuracy_score(y_true=y_true, y_pred=y_pred))
accuracy: 0.75
## Precision, Recall, F1-score
y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]
print('Precision', metrics.precision_score(y_true, y_pred))
print('Recall', metrics.recall_score(y_true, y_pred))
print('F1-score:', metrics.f1_score(y_true, y_pred))
Precision 1.0
Recall 0.5
F1-score: 0.6666666666666666
## AUC
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
print('AUC score:', metrics.roc_auc_score(y_true, y_scores))
AUC score: 0.75
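To make explicit what the precision, recall, and F1 calls compute, here is a hand-written sketch for binary labels. The helper name `precision_recall_f1` is hypothetical, and the zero-division fallbacks are a choice made for this sketch.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Same labels as the Precision/Recall/F1 example above.
p, r, f1 = precision_recall_f1(y_true=[0, 1, 0, 1], y_pred=[0, 1, 0, 0])
print(p, r, f1)
```

With one true positive, no false positives, and one false negative, this gives precision 1.0 and recall 0.5, matching the scikit-learn output above.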

4.3 Regression metric examples

import numpy as np
from sklearn import metrics

# MAPE has to be implemented by hand
# Note: this version divides by y_true rather than |y_true|, so the negative
# target (-3.0) contributes a negative term to the mean
def mape(y_true, y_pred):
    return np.mean(np.abs(y_pred - y_true) / y_true)

y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])

# MSE
print('MSE:', metrics.mean_squared_error(y_true, y_pred))
# RMSE
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
# MAE
print('MAE:', metrics.mean_absolute_error(y_true, y_pred))
# MAPE
print('MAPE:', mape(y_true, y_pred))
MSE: 0.2871428571428571
RMSE: 0.5358571238146014
MAE: 0.4142857142857143
MAPE: 0.07000000000000003
# R2-score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print('R2-score:', metrics.r2_score(y_true, y_pred))
R2-score: 0.9486081370449679
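The hand-written `mape` above divides by `y_true` without taking its absolute value, so a negative target pulls the average down. A variant that takes the absolute value of the whole ratio is sketched below; the name `mape_abs` is hypothetical, and the function assumes no target is exactly zero.

```python
import numpy as np

def mape_abs(y_true, y_pred):
    """MAPE with the absolute value taken over the whole ratio."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_pred - y_true) / y_true)))

# Same arrays as the regression example above.
y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])
mape_value = mape_abs(y_true, y_pred)
print('MAPE (absolute ratio):', mape_value)
```

On these arrays the result is larger than the 0.07 printed above, because the last sample now adds to the error instead of subtracting from it. Recent scikit-learn releases also ship `sklearn.metrics.mean_absolute_percentage_error`, which may be preferable where it is available.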

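5. Summary of understanding the competition

Understanding the problem is the foundation for approaching any competition, and it matters a great deal. It shapes the later feature engineering and the choice of model, and above all it sets the direction of the follow-up work, such as where to mine features or how to tackle the problems that come up. A clear grasp of the idea behind the problem and of its business logic also helps you build more effective features and models in less time. What should this understanding achieve? It should turn a problem into a high-level approach to solving it. The points below expand on this: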

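  • 1) What exactly are we trying to understand? Is reading the background description once enough? It is not. Understanding the problem also means laying it out intuitively: judging whether it is feasible, how feasible it is, and whether it is worth doing. To sort out a problem, start from the background that motivates the task and work out the task logic, consider which external data might be useful, and get a first feel for the competition data: which data relate to the task and how they are logically connected. Different problems call for very different handling. Put briefly, from a competition or engineering point of view: what type of problem is this, which metrics will be used, will those metrics behave consistently online and offline, and do they help us find an offline validation scheme that leads to higher online scores? On the business side: do you understand the raw features well, and can you use EDA to find the relationships between them and finally construct satisfying features?
  • 2) What can we do once we understand the problem? Once the type and nature of the problem and the data are clear, is the job done? It is not. As after scouting the enemy, we should at least form some analysis: where the difficulty might lie, where the key points are, where better features can be mined, which offline validation scheme is more stable, which methods might address overfitting or other issues when they appear, which data are reliable, which data need careful processing, and which part of the data is key given the business logic. For example, in a CTR problem, what purchasing pattern does a typical customer follow? In a wind-power problem, if turbines are close together, are related features such as wind speed and rotation speed very similar? This analysis is done at a high level and helps clarify the overall line of attack and the direction of later analysis.
  • 3) The evaluation metric. Why single this out? Because it bears on two important issues in later model prediction. 1. Local validation: online evaluation is often limited in time and number of submissions, so building a sensible local validation set and local evaluation metric is a key step that saves a lot of time. 2. Different metrics differ in how sensitive they are to error for the same predictions, for example AUC, logloss, MAE, RMSE, or a competition-specific evaluation function, and this is likely to affect what later predictions should focus on.
  • 4) Conditions that may be hidden in the background. Some statements in the problem description are valuable and can show up later in the defense and in how you think about the problem, for example requirements on efficiency, identifying and handling abnormal data, differences between process flows, model running time, or model robustness. Some of these considerations run through problem analysis, features, modeling, and post-processing, and some greatly help feature construction or model selection. Conversely, when the model predicts poorly, it is sometimes worth stepping back to ask whether some part of the background was misunderstood or some issue was overlooked.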
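The point in 3) about metrics reacting differently to the same errors can be seen in a small sketch. The prices and both prediction sets below are hypothetical; they are built so that the total absolute error is identical.

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Two prediction sets with the same total absolute error:
# one spreads the error evenly, the other puts it all on one sample.
y_true = np.array([5000.0, 6000.0, 7000.0, 8000.0])
even = y_true + np.array([500.0, 500.0, 500.0, 500.0])
spiky = y_true + np.array([0.0, 0.0, 0.0, 2000.0])

print('even : MAE', mae(y_true, even), 'RMSE', rmse(y_true, even))
print('spiky: MAE', mae(y_true, spiky), 'RMSE', rmse(y_true, spiky))
```

MAE cannot tell the two apart, while RMSE penalizes the concentrated error more heavily, so the choice of metric changes which model looks better.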

Baseline implementation

1. Import the required packages

## Basic tools
import numpy as np
import pandas as pd
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import jn
from IPython.display import display, clear_output
import time

warnings.filterwarnings('ignore')
%matplotlib inline

## Models
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

## Dimensionality reduction
from sklearn.decomposition import PCA, FastICA, FactorAnalysis, SparsePCA

import lightgbm as lgb
import xgboost as xgb

## Parameter search and evaluation
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Install the required packages if needed
#!pip --default-timeout=100 install xgboost -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com

2. Read the data and inspect it

path = './'
train = pd.read_csv(path + 'car_train_0110.csv', sep=' ')
test = pd.read_csv(path + 'car_testA_0110.csv', sep=' ')
print('Train data shape:', train.shape)
print('TestA data shape:', test.shape)
Train data shape: (250000, 40)
TestA data shape: (50000, 39)
train.head()
train.info()
test.describe()
              SaleID           name        regDate          model          brand       bodyType       fuelType        gearbox          power      kilometer
count   50000.000000   50000.000000   5.000000e+04   50000.000000   50000.000000   44890.000000   45598.000000   47287.000000   50000.000000   50000.000000
mean   556029.053380   82878.251420   2.003441e+07      44.922840       7.779420       4.556226       1.681192       0.781081     114.116060      12.555210
std    106952.402565   72292.076936   7.788055e+04      50.576255       7.661667       1.908291       2.344829       0.413518     177.274154       4.034901
min    370951.000000       0.000000   1.910000e+07       0.000000       0.000000       0.000000       0.000000       0.000000       0.000000       0.500000
25%    463258.500000   14121.250000   1.999061e+07       6.000000       1.000000       3.000000       0.000000       1.000000      69.000000      12.500000
50%    556296.000000   65359.000000   2.003111e+07      27.000000       6.000000       4.000000       0.000000       1.000000     105.000000      15.000000
75%    648862.250000  143083.750000   2.008091e+07      70.000000      11.000000       7.000000       5.000000       1.000000     150.000000      15.000000
max    741887.000000  233028.000000   2.019040e+07     248.000000      39.000000       7.000000       6.000000       1.000000   17700.000000      15.000000
...
                v_14           v_15           v_16           v_17           v_18           v_19           v_20           v_21           v_22           v_23
count   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000
mean        0.032570       0.030773      -0.024819       0.007051      -0.008488      -0.030104       0.014609      -0.003353       0.013125      -0.011936
std         0.038779       0.049521       8.759663       5.784299       4.825261       4.100561       3.812667       3.548944       2.866774       2.316144
min         0.000000       0.000000     -10.196998     -15.167961     -21.925773     -13.682825      -9.282567     -11.117367      -6.365723      -2.394516
25%         0.000135       0.000000      -5.575131      -0.891030      -3.105073      -0.481952      -1.697763      -3.069575      -2.089326      -1.402958
50%         0.001949       0.002593      -3.837572       0.221379      -0.081836       0.039376      -0.971210      -0.877377      -1.192502      -1.146398
75%         0.075826       0.062063       3.531269       1.257687       2.784538       0.560046       1.572508       3.276918       2.772742      -0.010769
max         0.135900       0.180091      36.364986      26.043572      22.598441      16.333051      20.273633      11.691851       7.970303       8.749647

8 rows × 39 columns

test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 39 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   SaleID             50000 non-null  int64
 1   name               50000 non-null  int64
 2   regDate            50000 non-null  int64
 3   model              50000 non-null  float64
 4   brand              50000 non-null  int64
 5   bodyType           44890 non-null  float64
 6   fuelType           45598 non-null  float64
 7   gearbox            47287 non-null  float64
 8   power              50000 non-null  int64
 9   kilometer          50000 non-null  float64
 10  notRepairedDamage  40372 non-null  float64
 11  regionCode         50000 non-null  int64
 12  seller             50000 non-null  int64
 13  offerType          50000 non-null  int64
 14  creatDate          50000 non-null  int64
 15  v_0                50000 non-null  float64
 16  v_1                50000 non-null  float64
 17  v_2                50000 non-null  float64
 18  v_3                50000 non-null  float64
 19  v_4                50000 non-null  float64
 20  v_5                50000 non-null  float64
 21  v_6                50000 non-null  float64
 22  v_7                50000 non-null  float64
 23  v_8                50000 non-null  float64
 24  v_9                50000 non-null  float64
 25  v_10               50000 non-null  float64
 26  v_11               50000 non-null  float64
 27  v_12               50000 non-null  float64
 28  v_13               50000 non-null  float64
 29  v_14               50000 non-null  float64
 30  v_15               50000 non-null  float64
 31  v_16               50000 non-null  float64
 32  v_17               50000 non-null  float64
 33  v_18               50000 non-null  float64
 34  v_19               50000 non-null  float64
 35  v_20               50000 non-null  float64
 36  v_21               50000 non-null  float64
 37  v_22               50000 non-null  float64
 38  v_23               50000 non-null  float64
dtypes: float64(30), int64(9)
memory usage: 14.9 MB
  • The test set has null values in the same feature columns as the training set
  • The anonymous feature columns all have means close to 0, which suggests they have already been processed

3. Creating features and labels

3.1 Extract the feature column names

  • Split the columns into numeric and categorical features
# Numeric feature columns
numerical_cols = train.select_dtypes(exclude='object').columns
print(numerical_cols)
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
       'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
       'v_13', 'v_14', 'v_15', 'v_16', 'v_17', 'v_18', 'v_19', 'v_20', 'v_21',
       'v_22', 'v_23'],
      dtype='object')
# Categorical feature columns -- unlike the original used-car baseline, where categorical_cols holds the categorical features, this dataset has no object-dtype columns
categorical_cols = train.select_dtypes(include='object').columns
print(categorical_cols)
Index([], dtype='object')
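Since the categorical list comes out empty here, a toy frame makes the `select_dtypes` split easier to see. In this sketch `notRepairedDamage` is deliberately stored as strings with object dtype; that is an assumption for illustration only, as the column is float64 in the data above.

```python
import pandas as pd

# Synthetic frame with one object-dtype column.
toy = pd.DataFrame({
    "power": [75, 110, 60],
    "kilometer": [12.5, 15.0, 3.0],
    "notRepairedDamage": pd.Series(["0.0", "-", "1.0"], dtype=object),
})

numerical_cols = toy.select_dtypes(exclude="object").columns.tolist()
categorical_cols = toy.select_dtypes(include="object").columns.tolist()
print(numerical_cols, categorical_cols)
```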

3.2 Build the training and test samples

# Select the feature columns -- for now use every column except the label
feature_cols = [col for col in numerical_cols if 'price' not in col]

# Extract the feature and label columns to build the training and test samples
X_data = train[feature_cols]
Y_data = train['price']
X_test = test[feature_cols]
print('X train shape:', X_data.shape)
print('X test shape:', X_test.shape)
X train shape: (250000, 39)
X test shape: (50000, 39)
X_data
[Output omitted: preview of X_data, 250000 rows × 39 columns]

# Helper that prints basic statistics, reused in later steps
def Sta_inf(data):
    print('_min', np.min(data))
    print('_max:', np.max(data))
    print('_mean', np.mean(data))
    print('_ptp', np.ptp(data))
    print('_std', np.std(data))
    print('_var', np.var(data))

# Basic distribution statistics of the label
print('Sta of label:')
Sta_inf(Y_data)
Sta of label:
_min 0
_max: 100000
_mean 5599.181116
_ptp 100000
_std 7470.932963236185
_var 55814839.341169
## Plot a histogram of the label to view its distribution -- mostly within 0--20000
plt.hist(Y_data)
plt.show()
plt.close()


[Figure: histogram of the label (price) distribution]
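A histogram shows the shape of the label, and a few numbers can back it up. The sketch below uses synthetic log-normal "prices", which are an assumption and not the competition data, to show how the share of samples under a threshold and a few quantiles describe a right-skewed target.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic right-skewed "prices" (log-normal); not the competition data.
prices = rng.lognormal(mean=8.0, sigma=1.0, size=10_000)

share_below_20k = float(np.mean(prices < 20_000))
q50, q90, q99 = np.quantile(prices, [0.5, 0.9, 0.99])
print(f'share below 20000: {share_below_20k:.3f}')
print(f'median {q50:.0f}, 90% {q90:.0f}, 99% {q99:.0f}')
print('mean above median (right skew):', prices.mean() > q50)
```

The same calls applied to `Y_data` would quantify the observation that most labels fall between 0 and 20000.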

3.3 Fill missing values with -1

X_data = X_data.fillna(-1)
X_test = X_test.fillna(-1)
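A small sketch of what the fill does, on made-up values. The reasoning in the comment is an assumption about why -1 is a workable choice here: the summary statistics above show these coded columns start at 0, so -1 does not collide with a real code.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "bodyType": [3.0, np.nan, 7.0],
    "gearbox": [np.nan, 1.0, 0.0],
})

# -1 lies outside the observed range of these codes (they start at 0),
# so it acts as a distinct "missing" marker that a tree can split on.
X_filled = X.fillna(-1)
remaining_nan = int(X_filled.isnull().sum().sum())
print(X_filled)
print('remaining NaN:', remaining_nan)
```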

4. Model training and prediction

4.1 Use 5-fold cross-validation with xgb to check how the parameters perform

## xgb-Model
xgr = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, gamma=0, subsample=0.8,
                       colsample_bytree=0.9, max_depth=7)  # , objective='reg:squarederror'
scores_train = []
scores = []
## 5-fold cross-validation
sk = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_ind, val_ind in sk.split(X_data, Y_data):
    train_x = X_data.iloc[train_ind].values
    train_y = Y_data.iloc[train_ind]
    val_x = X_data.iloc[val_ind].values
    val_y = Y_data.iloc[val_ind]

    xgr.fit(train_x, train_y)
    pred_train_xgb = xgr.predict(train_x)
    pred_xgb = xgr.predict(val_x)

    score_train = mean_absolute_error(train_y, pred_train_xgb)
    scores_train.append(score_train)
    score = mean_absolute_error(val_y, pred_xgb)
    scores.append(score)

# Note: score_train holds only the last fold's training MAE; scores_train holds all five folds
print('Train mae:', np.mean(score_train))
print('Val mae', np.mean(scores))
Train mae: 355.92984509530174
Val mae 416.219085805194
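For a continuous target, plain `KFold` is a common alternative to `StratifiedKFold`, and averaging the per-fold lists gives the train and validation MAE over all folds. The sketch below is a variant under stated assumptions, not the article's run: it uses synthetic data and scikit-learn's `GradientBoostingRegressor` in place of XGBoost so that it runs with scikit-learn alone.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic regression data standing in for X_data / Y_data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

model = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

train_scores, val_scores = [], []
for train_idx, val_idx in kf.split(X):
    model.fit(X[train_idx], y[train_idx])
    train_scores.append(mean_absolute_error(y[train_idx], model.predict(X[train_idx])))
    val_scores.append(mean_absolute_error(y[val_idx], model.predict(X[val_idx])))

# Average over all folds, not just the last one.
print('Train MAE:', np.mean(train_scores))
print('Val MAE:', np.mean(val_scores))
```

The gap between the two averages gives a first indication of how much the model overfits.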

4.2 Define the xgb and lgb model functions

def build_model_xgb(x_train, y_train):
    model = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0, subsample=0.8,
                             colsample_bytree=0.9, max_depth=7)  # , objective='reg:squarederror'
    model.fit(x_train, y_train)
    return model

def build_model_lgb(x_train, y_train):
    estimator = lgb.LGBMRegressor(num_leaves=127, n_estimators=150)
    param_grid = {
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
    }
    # Grid-search the learning rate with GridSearchCV's default CV and scoring
    gbm = GridSearchCV(estimator, param_grid)
    gbm.fit(x_train, y_train)
    return gbm
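`GridSearchCV` as called above ranks candidates with the estimator's default score. If the goal is to select by MAE, the `scoring` argument can be set explicitly. The following sketch assumes synthetic data and a scikit-learn regressor standing in for LightGBM; the grid is the same learning-rate list.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.3, size=200)

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=40, random_state=0),
    param_grid={'learning_rate': [0.01, 0.05, 0.1, 0.2]},
    scoring='neg_mean_absolute_error',  # higher is better, so MAE is negated
    cv=3,
)
search.fit(X, y)
best_cv_mae = -search.best_score_
print('best params:', search.best_params_)
print('best CV MAE:', best_cv_mae)
```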

4.3 Split the dataset (Train, Val) for model training, evaluation, and prediction

## Split data with val
x_train,x_val,y_train,y_val = train_test_split(X_data,Y_data,test_size=0.3)
print('Train lgb...')
model_lgb = build_model_lgb(x_train,y_train)
val_lgb = model_lgb.predict(x_val)
MAE_lgb = mean_absolute_error(y_val,val_lgb)
print('MAE of val with lgb:',MAE_lgb)
print('Predict lgb...')
model_lgb_pre = build_model_lgb(X_data,Y_data)
subA_lgb = model_lgb_pre.predict(X_test)
print('Sta of Predict lgb:')
Sta_inf(subA_lgb)
Train lgb...
MAE of val with lgb: 405.1620812257069
Predict lgb...
Sta of Predict lgb:
_min -962.1276302396495
_max: 93490.91395949211
_mean 5607.366367495364
_ptp 94453.04158973176
_std 7416.690755216019
_var 55007301.75850676
print('Train xgb...')
model_xgb = build_model_xgb(x_train,y_train)
val_xgb = model_xgb.predict(x_val)
MAE_xgb = mean_absolute_error(y_val,val_xgb)
print('MAE of val with xgb:',MAE_xgb)
print('Predict xgb...')
model_xgb_pre = build_model_xgb(X_data,Y_data)
subA_xgb = model_xgb_pre.predict(X_test)
print('Sta of Predict xgb:')
Sta_inf(subA_xgb)
Train xgb...
MAE of val with xgb: 412.91539463291707
Predict xgb...
Sta of Predict xgb:
_min -604.3489
_max: 93991.46
_mean 5603.4546
_ptp 94595.81
_std 7376.0728
_var 54406452.0
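The `train_test_split` call above does not fix a seed, so the validation MAE changes from run to run. A minimal sketch on toy arrays shows how `random_state` makes the split repeatable; the value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

first = train_test_split(X, y, test_size=0.3, random_state=42)
second = train_test_split(X, y, test_size=0.3, random_state=42)
x_train, x_val, y_train, y_val = first

same_split = all(np.array_equal(u, v) for u, v in zip(first, second))
print(x_train.shape, x_val.shape)
print('same split:', same_split)
```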
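4.4 Weighted fusion of the two models' results

## A simple weighted fusion: each model's weight is the other model's share of the total MAE, so the model with the lower MAE gets the larger weight
val_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*val_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*val_xgb
val_Weighted[val_Weighted<0]=10 # some predictions are negative, but a real price is never negative, so they are replaced with a small positive value
print('MAE of val with Weighted ensemble:',mean_absolute_error(y_val,val_Weighted))
MAE of val with Weighted ensemble: 387.83417717031836
sub_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*subA_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*subA_xgb
## View the distribution of the predicted values
plt.hist(sub_Weighted)
plt.show()
plt.close()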


[Figure: histogram produced by the code above]
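The weighting rule can be wrapped in a small function to make its behavior easier to check. This is a sketch: the function name `inverse_mae_blend` and its `floor` parameter are hypothetical, the MAE values are rounded from the logs above, and the predictions are made up.

```python
import numpy as np

def inverse_mae_blend(pred_a, pred_b, mae_a, mae_b, floor=10.0):
    """Blend two prediction arrays; the model with the lower MAE gets more weight.

    `floor` (a hypothetical parameter) replaces negative blended values.
    """
    total = mae_a + mae_b
    w_a = 1.0 - mae_a / total   # equals mae_b / total
    w_b = 1.0 - mae_b / total   # equals mae_a / total
    blended = w_a * np.asarray(pred_a, dtype=float) + w_b * np.asarray(pred_b, dtype=float)
    blended[blended < 0] = floor
    return blended, (w_a, w_b)

blended, (w_lgb, w_xgb) = inverse_mae_blend(
    pred_a=[1000.0, -50.0, 7000.0],   # made-up lgb predictions
    pred_b=[1200.0, -30.0, 6800.0],   # made-up xgb predictions
    mae_a=405.16, mae_b=412.92,
)
print(blended, w_lgb, w_xgb)
```

The two weights always sum to 1. The same negative-value replacement could also be applied to the blended test predictions before writing the submission.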

4.5 Write the results
sub = pd.DataFrame()
sub['SaleID'] = X_test.SaleID
sub['price'] = sub_Weighted
sub.to_csv('./sub_Weighted.csv',index=False)
sub.head()
   SaleID        price
0  720326  5812.952572
1  714316  1372.148145
2  704693  3057.827599
3  624972   599.607068
4  669753  6902.600236
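Before uploading, it can help to read the file back and check its shape and contents. The sketch below writes to an in-memory buffer instead of disk and uses a few hypothetical rows; the specific checks are suggestions, not requirements stated by the competition.

```python
import io
import pandas as pd

# Hypothetical predictions; in the article these come from sub_Weighted.
sub = pd.DataFrame({'SaleID': [720326, 714316, 704693],
                    'price': [5812.95, 1372.15, 3057.83]})

buffer = io.StringIO()
sub.to_csv(buffer, index=False)   # same call as above, written to memory
buffer.seek(0)
check = pd.read_csv(buffer)

assert list(check.columns) == ['SaleID', 'price']
assert check['SaleID'].is_unique
assert check['price'].notnull().all() and (check['price'] >= 0).all()
print('submission looks well-formed:', check.shape)
```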

5. Submission result

[Image: screenshot of the submission result (unavailable)]

