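ML regression: a detailed guide to predicting used-car transaction prices with data preprocessing (missing-value/outlier/special-value handling, long-tail to normal transformation, log transform of the target, bar/box/violin plot visualization, feature construction, feature selection)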
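Contents
Used-car transaction price prediction
Competition background
Field descriptions
Regression prediction of used-car transaction prices with data preprocessing and the LightGBM algorithm
# 1. Define the dataset
# 1.1 Load the training and test sets
# 1.2 Take a quick look at the data
# 1.3 Separate features and label
# 1.4 Merge the training and test sets (tagging the data source) so that all operations (feature processing, feature construction, etc.) stay in sync
# 1.5 Split features by type
# B1.7 Correct field data types
# B1.8 Recompute statistics after correction
# T1.1 Count the sub-categories of each [categorical] feature
# T1.2 Count the diversity of each [categorical] feature
# 2. Feature engineering / dataset preprocessing
# 2.1 Missing-value analysis and handling
# 2.1.1 Missing-value statistics
# T1 Bar chart of the sample count (non-null values) for every feature
# T2 Bar chart of the null ratio, for features with missing values only
# 2.1.2 Missing-value imputation
# T1 Imputation for the two main data types
# 2.2 Outlier analysis and handling
# 2.2.2 Outlier handling
# T2 Delete outlier samples based on the 3-sigma standard-deviation rule, with a before/after box-plot comparison
# T3 Truncate (clip) outliers: applies to outliers only; the cutoff threshold depends on the distribution
# 2.3 Special-value analysis and handling
# T1 Replace special characters in a field
# 2.4 Special-field analysis and handling
# 2.4.1 Find severely imbalanced/skewed fields
# 2.5 Variable-distribution analysis and handling
# 2.5.1 Compute and visualize skewness (skew) and kurtosis (kurt) for all variables
# 2.5.2 Transform long-tailed [numeric] features toward a normal distribution
# 2.6 Target-variable analysis and handling
# 2.6.1 Inspect the distribution of the target variable
# 2.6.2 Compute skew and kurt of the target variable
# 2.6.3 Log-transform the target variable
# 2.7 [Categorical] feature analysis
# 2.7.1 Richness statistics of each feature and their visualization
# 2.7.2 Bar/box/violin plots of each feature against the target variable
# 2.8 [Numeric] feature analysis and handling
# 2.8.1 Distribution visualization of [numeric] features
# 2.8.2 Correlation analysis of [numeric] features
# T1 Heat map of the PCC (Pearson correlation coefficient) between [numeric] features
# T3 Scatter plots between [numeric] features
# 2.9 Construct features
# 2.10 Data normalization
# 2.11 Define the model-input features
# 2.11.1 Drop features
# 2.11.2 Feature selection
# T2 Wrapper
# T3 Embedded (most commonly used)
# 2.12 Export the model-input dataset
3. Model training and validation
ML regression: a detailed guide to predicting used-car transaction prices with data preprocessing and LiR/XGBoost etc. (feature importance, cross-training curve visualization, linear vs. nonlinear algorithm comparison, tuning of three models, fusion of three models)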
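Related articles
ML regression: a detailed guide to predicting used-car transaction prices with data preprocessing (missing-value/outlier/special-value handling, long-tail to normal transformation, log transform of the target, bar/box/violin plot visualization, feature construction, feature selection)
ML regression: a detailed guide to predicting used-car transaction prices with data preprocessing and LiR/XGBoost etc. (feature importance, cross-training curve visualization, linear vs. nonlinear algorithm comparison, tuning of three models, fusion of three models)
ML regression: code implementation of used-car transaction price prediction with data preprocessing (missing-value/outlier/special-value handling, long-tail to normal transformation, log transform of the target, bar/box/violin plot visualization, feature construction, feature selection)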
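Used-car transaction price prediction
Official site: Data Mining for Beginners - Used-Car Transaction Price Prediction (learning competition, problem and data), Tianchi Competition, Alibaba Cloud Tianchi
Competition background
The competition is set in the used-car market and asks participants to predict the transaction price of used cars.
Field descriptions
The data comes from used-car transaction records on a trading platform. The full dataset contains more than 400,000 records and 31 columns, 15 of which are anonymous variables. To keep the competition fair, 150,000 records are drawn as the training set, 50,000 as test set A, and 50,000 as test set B. Fields such as name, model, brand, and regionCode are desensitized (anonymized).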
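Field | Description |
SaleID | Transaction ID, unique code |
name | Car transaction name (car code), desensitized |
regDate | Car registration date, e.g. 20160101 means January 1, 2016 |
model | Model code, desensitized |
brand | Car brand, desensitized |
bodyType | Body type: limousine: 0, microcar: 1, van: 2, bus: 3, convertible: 4, two-door car: 5, business vehicle: 6, mixer truck: 7 |
fuelType | Fuel type: gasoline: 0, diesel: 1, liquefied petroleum gas: 2, natural gas: 3, hybrid: 4, other: 5, electric: 6 |
gearbox | Gearbox: manual: 0, automatic: 1 |
power | Engine power, range [0, 600] |
kilometer | Distance the car has been driven, in units of 10,000 km |
notRepairedDamage | The car has damage that has not been repaired: yes: 0, no: 1 |
regionCode | Region code, desensitized |
seller | Seller: private: 0, non-private: 1 |
offerType | Offer type: offer: 0, request: 1 |
creatDate | Date the car was listed, i.e. when it went on sale |
price | Used-car transaction price (prediction target) |
v-series features | Anonymous features: 15 in total, v_0 through v_14 |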
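Regression prediction of used-car transaction prices with data preprocessing and the LightGBM algorithm
# 1. Define the dataset
# 1.1 Load the training and test sets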
SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | notRepairedDamage | regionCode | seller | offerType | creatDate | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
0 | 736 | 20040402 | 30 | 6 | 1 | 0 | 0 | 60 | 12.5 | 0 | 1046 | 0 | 0 | 20160404 | 1850 | 43.35779631 | 3.966344166 | 0.050257094 | 2.159744094 | 1.143786187 | 0.235675907 | 0.101988241 | 0.129548661 | 0.022816367 | 0.097461829 | -2.881803239 | 2.804096771 | -2.420820793 | 0.795291943 | 0.9147625 |
1 | 2262 | 20030301 | 40 | 1 | 2 | 0 | 0 | 0 | 15 | - | 4366 | 0 | 0 | 20160309 | 3600 | 45.30527302 | 5.236111898 | 0.137925324 | 1.38065746 | -1.422164921 | 0.264777256 | 0.121003594 | 0.135730707 | 0.026597448 | 0.020581663 | -4.900481882 | 2.096337644 | -1.030482837 | -1.722673775 | 0.245522411 |
2 | 14874 | 20040403 | 115 | 15 | 1 | 0 | 0 | 163 | 12.5 | 0 | 2806 | 0 | 0 | 20160402 | 6222 | 45.97835906 | 4.823792215 | 1.319524152 | -0.998467274 | -0.996911035 | 0.251410148 | 0.114912277 | 0.165147493 | 0.062172837 | 0.027074824 | -4.84674926 | 1.803558941 | 1.565329625 | -0.832687327 | -0.229962856 |
3 | 71865 | 19960908 | 109 | 10 | 0 | 0 | 1 | 193 | 15 | 0 | 434 | 0 | 0 | 20160312 | 2400 | 45.6874782 | 4.492574134 | -0.050615843 | 0.883599671 | -2.228078725 | 0.274293171 | 0.110300085 | 0.121963746 | 0.033394547 | 0 | -4.509598824 | 1.285939744 | -0.501867908 | -2.438352737 | -0.478699379 |
4 | 111080 | 20120103 | 110 | 5 | 1 | 0 | 0 | 68 | 5 | 0 | 6977 | 0 | 0 | 20160313 | 5200 | 44.38351084 | 2.031433258 | 0.572168948 | -1.571239028 | 2.246088325 | 0.228035622 | 0.073205054 | 0.091880479 | 0.078819385 | 0.121534241 | -1.896240279 | 0.910783134 | 0.931109559 | 2.83451782 | 1.923481963 |
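As a rough illustration of this loading step, the sketch below reads a tiny in-memory sample with pandas. The space separator is an assumption about how the competition files are delimited, and the sample rows only mirror a few columns of the table above; in practice a file path would replace the `StringIO` object.

```python
import io
import pandas as pd

# A tiny stand-in for the training file (assumption: space-separated text).
raw = io.StringIO(
    "SaleID name regDate power notRepairedDamage price\n"
    "0 736 20040402 60 0.0 1850\n"
    "1 2262 20030301 0 - 3600\n"
)

# sep=" " tells pandas to split on single spaces instead of commas.
train = pd.read_csv(raw, sep=" ")

print(train.shape)
print(train.dtypes)
```

Because the `notRepairedDamage` column mixes numbers with the `-` placeholder, pandas cannot parse it as numeric, which is consistent with it being the only non-numeric column in the summary that follows.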
# 1.2 Take a quick look at the data
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   SaleID             150000 non-null  int64
 1   name               150000 non-null  int64
 2   regDate            150000 non-null  int64
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object
 11  regionCode         150000 non-null  int64
 12  seller             150000 non-null  int64
 13  offerType          150000 non-null  int64
 14  creatDate          150000 non-null  int64
 15  price              150000 non-null  int64
 16  v_0                150000 non-null  float64
 17  v_1                150000 non-null  float64
 18  v_2                150000 non-null  float64
 19  v_3                150000 non-null  float64
 20  v_4                150000 non-null  float64
 21  v_5                150000 non-null  float64
 22  v_6                150000 non-null  float64
 23  v_7                150000 non-null  float64
 24  v_8                150000 non-null  float64
 25  v_9                150000 non-null  float64
 26  v_10               150000 non-null  float64
 27  v_11               150000 non-null  float64
 28  v_12               150000 non-null  float64
 29  v_13               150000 non-null  float64
 30  v_14               150000 non-null  float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
used_car.info: None
used_car.shape: (150000, 31) 31 150000
used_car.columns: Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode','seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3','v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12','v_13', 'v_14'],dtype='object')
used_car.dtypes: float64 20
int64 10
object 1
dtype: int64
used_car.head: SaleID name regDate model ... v_11 v_12 v_13 v_14
0 0 736 20040402 30.0 ... 2.804097 -2.420821 0.795292 0.914762
1 1 2262 20030301 40.0 ... 2.096338 -1.030483 -1.722674 0.245522
2 2 14874 20040403 115.0 ... 1.803559 1.565330 -0.832687 -0.229963
3 3 71865 19960908 109.0 ... 1.285940 -0.501868 -2.438353 -0.478699
4 4 111080 20120103 110.0 ... 0.910783 0.931110 2.834518 1.923482
149995 149995 163978 20000607 121.0 ... -2.983973 0.589167 -1.304370 -0.302592
149996 149996 184535 20091102 116.0 ... -2.774615 2.553994 0.924196 -0.272160
149997 149997 147587 20101003 60.0 ... -1.630677 2.290197 1.891922 0.414931
149998 149998 45907 20060312 34.0 ... -2.633719 1.414937 0.431981 -1.659014
149999 149999 177672 19990204 19.0 ... -3.179913 0.031724 -1.483350 -0.342674
[10 rows x 31 columns]
statistic | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | regionCode | seller | offerType | creatDate | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
count | 150000 | 150000 | 150000 | 149999 | 150000 | 145494 | 141320 | 144019 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 |
mean | 74999.5 | 68349.17287 | 20034170.51 | 47.12902086 | 8.052733333 | 1.792369445 | 0.375842061 | 0.224942542 | 119.3165467 | 12.59716 | 2583.077267 | 6.67E-06 | 0 | 20160330.79 | 5923.327333 | 44.40626753 | -0.044809123 | 0.080765058 | 0.078833423 | 0.017874615 | 0.248203528 | 0.044923004 | 0.124692461 | 0.058143855 | 0.061995895 | -0.001000239 | 0.009034543 | 0.004812595 | 0.000312612 | -0.000688231 |
std | 43301.41453 | 61103.87509 | 53649.87926 | 49.53603965 | 7.864956341 | 1.760639503 | 0.548676623 | 0.417545932 | 177.1684192 | 3.919575532 | 1885.363218 | 0.002581989 | 0 | 106.7328088 | 7501.998477 | 2.457547906 | 3.641893018 | 2.929617945 | 2.026514036 | 1.193661387 | 0.045803971 | 0.051742787 | 0.20140953 | 0.029185756 | 0.035691979 | 3.772386394 | 3.286071221 | 2.517477676 | 1.288987639 | 1.038685151 |
min | 0 | 0 | 19910001 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5 | 0 | 0 | 0 | 20150618 | 11 | 30.45197649 | -4.295588903 | -4.47067143 | -7.275036707 | -4.364565242 | 0 | 0 | 0 | 0 | 0 | -9.16819241 | -5.558206704 | -9.639552114 | -4.153898796 | -6.546555965 |
25% | 37499.75 | 11156 | 19990912 | 10 | 1 | 0 | 0 | 0 | 75 | 12.5 | 1018 | 0 | 0 | 20160313 | 1300 | 43.13579888 | -3.192349286 | -0.9706712 | -1.462580044 | -0.921191484 | 0.243615353 | 3.81E-05 | 0.062473533 | 0.035333687 | 0.033930177 | -3.72230288 | -1.951543007 | -1.871845761 | -1.057788984 | -0.437033668 |
50% | 74999.5 | 51638 | 20030912 | 30 | 6 | 1 | 0 | 0 | 110 | 15 | 2196 | 0 | 0 | 20160321 | 3250 | 44.61026572 | -3.052671416 | -0.38294689 | 0.099721985 | -0.075910429 | 0.257797966 | 0.000812059 | 0.095865898 | 0.057013598 | 0.058483667 | 1.624076331 | -0.358052697 | -0.130753318 | -0.036244604 | 0.141245993 |
75% | 112499.25 | 118841.25 | 20071109 | 66 | 13 | 3 | 1 | 0 | 150 | 15 | 3843 | 0 | 0 | 20160329 | 7700 | 46.0047209 | 4.000669795 | 0.241334852 | 1.565838202 | 0.868758435 | 0.265297259 | 0.102009298 | 0.125242945 | 0.079381571 | 0.087490548 | 2.844356776 | 1.255021657 | 1.776932949 | 0.942813083 | 0.680378075 |
max | 149999 | 196812 | 20151212 | 247 | 39 | 7 | 6 | 1 | 19312 | 15 | 8120 | 1 | 0 | 20160407 | 99999 | 52.30417826 | 7.320308375 | 19.0354965 | 9.854701534 | 6.82935164 | 0.291838113 | 0.151419596 | 1.404936375 | 0.160790985 | 0.222787488 | 12.35701062 | 18.81904247 | 13.84779152 | 11.14766861 | 8.658417877 |
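# 1.3 Separate features and label
# 1.4 Merge the training and test sets (tagging the data source) so that all operations (feature processing, feature construction, etc.) stay in sync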
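A minimal sketch of the merge-and-tag idea, assuming a helper column called `source` (a hypothetical name, not taken from the original code) and made-up rows:

```python
import pandas as pd

# Tiny stand-ins for the real training and test sets (values are made up).
train = pd.DataFrame({"SaleID": [0, 1, 2], "power": [60, 0, 163], "price": [1850, 3600, 6222]})
test = pd.DataFrame({"SaleID": [150000, 150001], "power": [313, 75]})

# Tag every row with its origin so the two parts can be separated again later.
train["source"] = "train"
test["source"] = "test"

# Stack the frames; the label column missing from the test set becomes NaN.
full = pd.concat([train, test], ignore_index=True, sort=False)

# ... shared preprocessing / feature construction would run on `full` here ...

# Split back using the origin tag.
train_part = full[full["source"] == "train"].drop(columns="source")
test_part = full[full["source"] == "test"].drop(columns=["source", "price"])
```

Processing both parts in one frame keeps encodings and scalings identical across them; statistics that involve the label still need to be computed from the training rows only.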
# 1.5 Split features by type
float64 20 ['model', 'bodyType', 'fuelType', 'gearbox', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']
int32 0 []
int64 10 ['SaleID', 'name', 'regDate', 'brand', 'power', 'regionCode', 'seller', 'offerType', 'creatDate', 'price']
object_category_bool 1 ['notRepairedDamage']
others 0 []
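One way such a per-dtype listing could be produced is with `DataFrame.select_dtypes`. The function name `group_columns_by_dtype` is hypothetical and the frame is a made-up miniature; the group labels follow the printout above.

```python
import pandas as pd

df = pd.DataFrame({
    "SaleID": pd.Series([0, 1, 2], dtype="int64"),
    "model": pd.Series([30.0, 40.0, 115.0], dtype="float64"),
    "kilometer": pd.Series([12.5, 15.0, 12.5], dtype="float64"),
    "notRepairedDamage": pd.Series(["0.0", "-", "0.0"], dtype=object),
})

def group_columns_by_dtype(frame):
    """Return {group name: [column names]} using the same groups as the printout."""
    groups = {
        "float64": frame.select_dtypes(include=["float64"]).columns.tolist(),
        "int32": frame.select_dtypes(include=["int32"]).columns.tolist(),
        "int64": frame.select_dtypes(include=["int64"]).columns.tolist(),
        "object_category_bool": frame.select_dtypes(
            include=["object", "category", "bool"]
        ).columns.tolist(),
    }
    # Anything not captured above falls into "others".
    seen = {c for cols in groups.values() for c in cols}
    groups["others"] = [c for c in frame.columns if c not in seen]
    return groups

groups = group_columns_by_dtype(df)
for name, cols in groups.items():
    print(name, len(cols), cols)
```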
# B1.7 Correct field data types
# B1.8 Recompute statistics after correction
# T1.1 Count the sub-categories of each [categorical] feature
Fields restored to their correct data types:
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   SaleID             150000 non-null  int64
 1   name               150000 non-null  int64
 2   regDate            150000 non-null  int64
 3   model              149999 non-null  object
 4   brand              150000 non-null  object
 5   bodyType           145494 non-null  object
 6   fuelType           141320 non-null  object
 7   gearbox            144019 non-null  object
 8   power              150000 non-null  int64
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object
 11  regionCode         150000 non-null  int64
 12  seller             150000 non-null  int64
 13  offerType          150000 non-null  int64
 14  creatDate          150000 non-null  int64
 15  price              150000 non-null  int64
 16  v_0                150000 non-null  float64
 17  v_1                150000 non-null  float64
 18  v_2                150000 non-null  float64
 19  v_3                150000 non-null  float64
 20  v_4                150000 non-null  float64
 21  v_5                150000 non-null  float64
 22  v_6                150000 non-null  float64
 23  v_7                150000 non-null  float64
 24  v_8                150000 non-null  float64
 25  v_9                150000 non-null  float64
 26  v_10               150000 non-null  float64
 27  v_11               150000 non-null  float64
 28  v_12               150000 non-null  float64
 29  v_13               150000 non-null  float64
 30  v_14               150000 non-null  float64
dtypes: float64(16), int64(9), object(6)
memory usage: 35.5+ MB
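The dtype correction reported above (numerically coded categories such as model and brand becoming `object`) could be done along these lines. This is only a sketch on made-up rows, and the list `cat_cols` is an illustrative subset of the six columns shown as `object` in the listing.

```python
import pandas as pd

df = pd.DataFrame({
    "model": [30.0, 40.0, None],                        # category code stored as float
    "brand": pd.Series([6, 1, 15], dtype="int64"),      # category code stored as int
    "power": pd.Series([60, 0, 163], dtype="int64"),    # genuinely numeric
})

# Columns whose numbers are labels rather than quantities.
cat_cols = ["model", "brand"]

# Cast only those columns; the numeric ones keep their dtype.
df = df.astype({col: object for col in cat_cols})

print(df.dtypes)
```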
# T1.2 Count the diversity of each [categorical] feature
model | counts | brand | counts | bodyType | counts | fuelType | counts | gearbox | counts | notRepairedDamage | counts |
0 | 11762 | 0 | 31480 | 0 | 41420 | 0 | 91656 | 0 | 111623 | 0.0 | 111361 |
19 | 9573 | 4 | 16737 | 1 | 35272 | 1 | 46991 | 1 | 32396 | - | 24324 |
4 | 8445 | 14 | 16089 | 2 | 30324 | 2 | 2212 | null | 5981 | 1.0 | 14315 |
1 | 6038 | 10 | 14249 | 3 | 13491 | 3 | 262 | | | null | 0 |
29 | 5186 | 1 | 13794 | 4 | 9609 | 4 | 118 | ||||
48 | 5052 | 6 | 10217 | 5 | 7607 | 5 | 45 | ||||
40 | 4502 | 9 | 7306 | 6 | 6482 | 6 | 36 | ||||
26 | 4496 | 5 | 4665 | 7 | 1289 | null | 8680 | ||||
8 | 4391 | 13 | 3817 | null | 4506 | ||||||
31 | 3827 | 11 | 2945 | ||||||||
13 | 3762 | 3 | 2461 | ||||||||
17 | 3121 | 7 | 2361 | ||||||||
65 | 2730 | 16 | 2223 | ||||||||
49 | 2608 | 8 | 2077 | ||||||||
46 | 2454 | 25 | 2064 | ||||||||
30 | 2342 | 27 | 2053 | ||||||||
44 | 2195 | 21 | 1547 | ||||||||
5 | 2063 | 15 | 1458 | ||||||||
10 | 2004 | 19 | 1388 | ||||||||
21 | 1872 | 20 | 1236 | ||||||||
73 | 1789 | 12 | 1109 | ||||||||
11 | 1775 | 22 | 1085 | ||||||||
23 | 1696 | 26 | 966 | ||||||||
22 | 1524 | 30 | 940 | ||||||||
69 | 1522 | 17 | 913 | ||||||||
63 | 1469 | 24 | 772 | ||||||||
7 | 1460 | 28 | 649 | ||||||||
16 | 1349 | 32 | 592 | ||||||||
88 | 1309 | 29 | 406 | ||||||||
66 | 1250 | 37 | 333 | ||||||||
60 | 1177 | 2 | 321 | ||||||||
67 | 1084 | 31 | 318 | ||||||||
41 | 1078 | 18 | 316 | ||||||||
104 | 1020 | 36 | 228 | ||||||||
87 | 965 | 34 | 227 | ||||||||
115 | 927 | 33 | 218 | ||||||||
3 | 920 | 23 | 186 | ||||||||
121 | 811 | 35 | 180 | ||||||||
32 | 705 | 38 | 65 | ||||||||
77 | 675 | 39 | 9 | ||||||||
98 | 662 | null | 0 | ||||||||
247 | 1 | ||||||||||
null | 1 |
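Per-category counts that also report nulls, as in the table above, can be obtained with `value_counts(dropna=False)`. A small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "gearbox": [0.0, 0.0, 1.0, None, 0.0],
    "notRepairedDamage": ["0.0", "-", "0.0", "1.0", "-"],
})

# Count every distinct value per column, keeping NaN as its own row.
counts = {col: df[col].value_counts(dropna=False) for col in df.columns}

# Number of distinct non-null categories per column (the "diversity").
diversity = df.nunique(dropna=True)

for col, vc in counts.items():
    print(col)
    print(vc)
print(diversity)
```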
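# 2. Feature engineering / dataset preprocessing
# 2.1 Missing-value analysis and handling
# 2.1.1 Missing-value statistics
# T1 Bar chart of the sample count (non-null values) for every feature
# T2 Bar chart of the null ratio, for features with missing values only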
{'fuelType': 0.057866666666666663, 'gearbox': 0.03987333333333333, 'bodyType': 0.03004, 'model': 6.666666666666667e-06}
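A dictionary like the one above is simply the share of nulls per column, restricted to columns that have any. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fuelType": [0, np.nan, 1, np.nan],
    "gearbox": [0, 1, np.nan, 0],
    "power": [60, 0, 163, 193],
})

# isnull().mean() gives the fraction of missing values in each column.
ratio = df.isnull().mean()

# Keep only columns with missing values, largest ratio first.
missing_ratio = ratio[ratio > 0].sort_values(ascending=False).to_dict()
print(missing_ratio)
```

For the real data the same arithmetic reproduces the printed figures, e.g. 8680 / 150000 ≈ 0.0579 for fuelType.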
# 2.1.2 Missing-value imputation
# T1 Imputation for the two main data types
column | NaN count before fillna | NaN count after fillna |
SaleID | 0 | 0 |
name | 0 | 0 |
regDate | 0 | 0 |
model | 1 | 0 |
brand | 0 | 0 |
bodyType | 4506 | 0 |
fuelType | 8680 | 0 |
gearbox | 5981 | 0 |
power | 0 | 0 |
kilometer | 0 | 0 |
notRepairedDamage | 0 | 0 |
regionCode | 0 | 0 |
seller | 0 | 0 |
offerType | 0 | 0 |
creatDate | 0 | 0 |
price | 0 | 0 |
v_0 | 0 | 0 |
v_1 | 0 | 0 |
v_2 | 0 | 0 |
v_3 | 0 | 0 |
v_4 | 0 | 0 |
v_5 | 0 | 0 |
v_6 | 0 | 0 |
v_7 | 0 | 0 |
v_8 | 0 | 0 |
v_9 | 0 | 0 |
v_10 | 0 | 0 |
v_11 | 0 | 0 |
v_12 | 0 | 0 |
v_13 | 0 | 0 |
v_14 | 0 | 0 |
dtype: int64 | dtype: int64 |
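A sketch of filling the two kinds of columns differently. Using the literal `'missing'` for categorical columns matches the `'missing'` class that appears in the label-encoding dictionary later in this article; filling numeric columns with the median is an assumption made here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "bodyType": pd.Series([1.0, np.nan, 2.0], dtype=object),  # categorical
    "power": [60.0, np.nan, 163.0],                            # numeric
})
cat_cols = ["bodyType"]
num_cols = ["power"]

print("before fillna:", df.isnull().sum().to_dict())

# Categorical columns: make "unknown" an explicit category.
for col in cat_cols:
    df[col] = df[col].fillna("missing")

# Numeric columns: fill with the column median (illustrative choice).
for col in num_cols:
    df[col] = df[col].fillna(df[col].median())

print("after fillna:", df.isnull().sum().to_dict())
```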
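# 2.2 Outlier analysis and handling
# 2.2.2 Outlier handling
# T2 Delete outlier samples based on the 3-sigma standard-deviation rule, with a before/after box-plot comparison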
3-Sigma,Delete number is: 963
Now column number is: 149037
outliers_low: Description of data less than the lower bound is:
count 0.0
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Name: power, dtype: float64
outliers_up: Description of data larger than the upper bound is:
count 963.000000
mean 846.836968
std 1929.418081
min 376.000000
25% 400.000000
50% 436.000000
75% 514.000000
max 19312.000000
Name: power, dtype: float64
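One tentative observation about the numbers above: the smallest deleted value is 376, and with the quartiles from the earlier summary table (25% = 75, 75% = 150) a box-plot fence of Q3 + 3 × IQR equals 375, whereas mean + 3 × std would be roughly 650.8. The source does not show which rule the code actually applied, so the sketch below implements both on made-up data; the function names are hypothetical.

```python
import pandas as pd

def sigma_bounds(s, n_sigma=3.0):
    """Lower/upper fences at mean -/+ n_sigma standard deviations."""
    return s.mean() - n_sigma * s.std(), s.mean() + n_sigma * s.std()

def box_bounds(s, scale=3.0):
    """Lower/upper fences at Q1 - scale*IQR and Q3 + scale*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - scale * iqr, q3 + scale * iqr

def drop_outliers(frame, col, bounds_fn):
    """Return (frame without outlier rows, the removed values of `col`)."""
    low, up = bounds_fn(frame[col])
    mask = (frame[col] < low) | (frame[col] > up)
    print(f"{bounds_fn.__name__}: delete number is {int(mask.sum())}")
    return frame[~mask].reset_index(drop=True), frame.loc[mask, col]

data = pd.DataFrame({"power": [0, 60, 75, 110, 150, 163, 193, 19312]})
cleaned_sigma, removed_sigma = drop_outliers(data, "power", sigma_bounds)
cleaned_box, removed_box = drop_outliers(data, "power", box_bounds)
```

On this tiny sample the extreme value inflates the standard deviation so much that the sigma rule removes nothing, while the quartile-based fence still flags it. That masking effect is one reason to inspect the distribution before choosing a rule.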
# T3 Truncate (clip) outliers: applies to outliers only; the cutoff threshold depends on the distribution
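Truncation keeps the row but caps the value. A minimal sketch with `Series.clip`; the cap of 600 is taken from the documented range of the power field and is an assumption about the threshold, which in practice should follow the observed distribution.

```python
import pandas as pd

power = pd.Series([0, 60, 163, 376, 19312], name="power")

# Cap only the values above the chosen upper threshold; others are unchanged.
upper = 600
clipped = power.clip(upper=upper)

print(clipped.tolist())
```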
# 2.3 Special-value analysis and handling
# T1 Replace special characters in a field
df_train:0.0 135685
1.0 14315
Name: notRepairedDamage, dtype: int64
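The counts above are consistent with the `-` placeholder having been mapped to 0.0: the earlier table lists 111361 rows with 0.0 and 24324 rows with `-`, and 111361 + 24324 = 135685. A sketch of that replacement on made-up values:

```python
import pandas as pd

s = pd.Series(["0.0", "-", "1.0", "0.0", "-"], name="notRepairedDamage")

# Replace the placeholder with the chosen value, then convert to numbers.
cleaned = s.replace("-", "0.0").astype(float)

print(cleaned.value_counts())
```

Mapping the placeholder to an existing class merges "unknown" with "0"; treating it as a missing value or as a separate category would be an alternative design.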
# 2.4 Special-field analysis and handling
# 2.4.1 Find severely imbalanced/skewed fields
seller 0 149999
1 1
Name: seller, dtype: int64
offerType 0 150000
Name: offerType, dtype: int64
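Fields like seller and offerType, where one value covers almost every row, can be found by checking the share of the most frequent value. In the sketch below the function name and the 0.85 threshold are hypothetical choices, and the data is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "seller": [0] * 9 + [1],
    "offerType": [0] * 10,
    "gearbox": [0, 1] * 5,
})

def dominant_share(frame):
    """Share of rows taken by the most frequent value of each column."""
    return pd.Series({
        col: frame[col].value_counts(normalize=True, dropna=False).iloc[0]
        for col in frame.columns
    })

shares = dominant_share(df)
threshold = 0.85  # illustrative cut-off for "severely imbalanced"
skewed = shares[shares >= threshold].index.tolist()
print(shares)
print(skewed)
```

Columns flagged this way carry almost no information for the model and are natural candidates for removal.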
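# 2.5 Variable-distribution analysis and handling
# 2.5.1 Compute and visualize skewness (skew) and kurtosis (kurt) for all variables
# 2.5.2 Transform long-tailed [numeric] features toward a normal distribution
# 2.6 Target-variable analysis and handling
# 2.6.1 Inspect the distribution of the target variable
# 2.6.2 Compute skew and kurt of the target variable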
price Skewness: 3.3464867626369608
price Kurtosis: 18.995183355632562
# 2.6.3 Log-transform the target variable
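A strongly right-skewed target such as price (skewness about 3.35 above) is commonly compressed with a logarithm before fitting. The sketch below uses `log1p`/`expm1` on made-up prices; whether the original code used `log` or `log1p` is not shown, so treat that choice as an assumption.

```python
import numpy as np
import pandas as pd

price = pd.Series([1850, 3600, 6222, 2400, 5200, 99999], name="price")

# Skewness and kurtosis of the raw target.
skew_before, kurt_before = price.skew(), price.kurt()

# log1p(x) = log(1 + x) compresses the long right tail.
log_price = np.log1p(price)
skew_after = log_price.skew()

# Predictions made on the log scale are mapped back with expm1.
restored = np.expm1(log_price)

print(f"skew before: {skew_before:.3f}, after: {skew_after:.3f}")
```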
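# 2.7 [Categorical] feature analysis
# 2.7.1 Richness statistics of each feature and their visualization
# 2.7.2 Bar/box/violin plots of each feature against the target variable
# 2.8 [Numeric] feature analysis and handling
# 2.8.1 Distribution visualization of [numeric] features
# 2.8.2 Correlation analysis of [numeric] features
# T1 Heat map of the PCC (Pearson correlation coefficient) between [numeric] features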
corr sort_values price 1.000000
v_12 0.692823
v_8 0.685798
v_0 0.628397
regDate 0.611959
power 0.219834
v_5 0.164317
v_2 0.085322
v_6 0.068970
v_1 0.060914
v_14 0.035911
regionCode 0.014036
creatDate 0.002955
name 0.002030
SaleID -0.001043
seller -0.002004
v_13 -0.013993
brand -0.043799
v_7 -0.053024
v_4 -0.147085
v_9 -0.206205
v_10 -0.246175
v_11 -0.275320
kilometer -0.440519
v_3 -0.730946
offerType NaN
Name: price, dtype: float64
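A ranking like the one above is one column of the Pearson correlation matrix, sorted. The sketch below uses made-up rows and also shows why offerType comes out as NaN: a constant column has zero variance, so its correlation is undefined.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [1850, 3600, 6222, 2400, 5200],
    "kilometer": [12.5, 15.0, 12.5, 15.0, 5.0],
    "power": [60, 0, 163, 193, 68],
    "offerType": [0, 0, 0, 0, 0],   # constant column
})

# Pearson correlation of every numeric column with the target, highest first.
corr_with_price = df.corr()["price"].sort_values(ascending=False)
print(corr_with_price)
```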
# T3 Scatter plots between [numeric] features
# 2.9 Construct features
Int64Index: 150000 entries, 0 to 149999
Data columns (total 41 columns):
 #   Column                  Non-Null Count   Dtype
---  ------                  --------------   -----
 0   SaleID                  150000 non-null  float64
 1   name                    150000 non-null  float64
 2   regDate                 150000 non-null  float64
 3   model                   150000 non-null  int32
 4   brand                   150000 non-null  float64
 5   bodyType                150000 non-null  int32
 6   fuelType                150000 non-null  int32
 7   gearbox                 150000 non-null  int32
 8   power                   150000 non-null  float64
 9   kilometer               150000 non-null  float64
 10  notRepairedDamage       150000 non-null  int32
 11  regionCode              150000 non-null  float64
 12  seller                  150000 non-null  float64
 13  offerType               150000 non-null  float64
 14  creatDate               150000 non-null  float64
 15  price                   150000 non-null  int64
 16  v_0                     150000 non-null  float64
 17  v_1                     150000 non-null  float64
 18  v_2                     150000 non-null  float64
 19  v_3                     150000 non-null  float64
 20  v_4                     150000 non-null  float64
 21  v_5                     150000 non-null  float64
 22  v_6                     150000 non-null  float64
 23  v_7                     150000 non-null  float64
 24  v_8                     150000 non-null  float64
 25  v_9                     150000 non-null  float64
 26  v_10                    150000 non-null  float64
 27  v_11                    150000 non-null  float64
 28  v_12                    150000 non-null  float64
 29  v_13                    150000 non-null  float64
 30  v_14                    150000 non-null  float64
 31  city                    150000 non-null  int32
 32  used_time               150000 non-null  float64
 33  brand_amount            150000 non-null  float64
 34  price_max_GBYbrand      150000 non-null  float64
 35  price_median_GBYbrand   150000 non-null  float64
 36  price_min_GBYbrand      150000 non-null  float64
 37  price_sum_GBYbrand      150000 non-null  float64
 38  price_std_GBYbrand      150000 non-null  float64
 39  price_average_GBYbrand  150000 non-null  float64
 40  power_bin               150000 non-null  float64
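The listing shows ten new columns (city, used_time, the brand statistics, power_bin), but not how they were built. The sketch below is one plausible construction on made-up rows, and every formula in it is an assumption: used_time as the days between registration and listing, city as the leading digit(s) of regionCode, brand statistics via groupby, and power_bin from hypothetical bin edges of width 10.

```python
import pandas as pd

df = pd.DataFrame({
    "regDate": [20040402, 20030301, 19910001],   # last one has month "00"
    "creatDate": [20160404, 20160309, 20160313],
    "regionCode": [1046, 4366, 434],
    "brand": [6, 1, 6],
    "power": [60, 0, 163],
    "price": [1850, 3600, 6222],
})

# used_time: days between listing and registration; invalid dates become NaT.
reg = pd.to_datetime(df["regDate"].astype(str), format="%Y%m%d", errors="coerce")
crt = pd.to_datetime(df["creatDate"].astype(str), format="%Y%m%d", errors="coerce")
df["used_time"] = (crt - reg).dt.days

# city: drop the last three digits of the region code (assumed encoding).
df["city"] = df["regionCode"].astype(str).str[:-3].replace("", "missing")

# Brand-level price statistics, merged back onto every row of that brand.
brand_stats = df.groupby("brand")["price"].agg(
    brand_amount="count",
    price_max_GBYbrand="max",
    price_median_GBYbrand="median",
    price_min_GBYbrand="min",
    price_sum_GBYbrand="sum",
    price_std_GBYbrand="std",
    price_average_GBYbrand="mean",
).reset_index()
df = df.merge(brand_stats, on="brand", how="left")

# power_bin: bucket index for power in steps of 10 (hypothetical edges).
bins = [i * 10 for i in range(31)]
df["power_bin"] = pd.cut(df["power"], bins=bins, labels=False)

print(df[["used_time", "city", "brand_amount", "power_bin"]])
```

Because the brand statistics are derived from the label, they should be computed on training rows only and then joined to the test rows, otherwise the target leaks into the features.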
# 2.10 Data normalization
catcols2LabelEncoder: 7 ['model', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'city', 'power_bin']
LEDict {'model': {'0.0': 0, '1.0': 1, '10.0': 2, '100.0': 3, '101.0': 4, …… '93.0': 241, '94.0': 242, '95.0': 243, '96.0': 244, '97.0': 245, '98.0': 246, '99.0': 247, 'missing': 248},
'bodyType': {'0.0': 0, '1.0': 1, '2.0': 2, '3.0': 3, '4.0': 4, '5.0': 5, '6.0': 6, '7.0': 7, 'missing': 8},
'fuelType': {'0.0': 0, '1.0': 1, '2.0': 2, '3.0': 3, '4.0': 4, '5.0': 5, '6.0': 6, 'missing': 7},
'gearbox': {'0.0': 0, '1.0': 1, 'missing': 2},
'notRepairedDamage': {'0.0': 0, '1.0': 1},
'city': {'1': 0, '2': 1, '3': 2, '4': 3, '5': 4, '6': 5, '7': 6, '8': 7, 'missing': 8},
'power_bin': {'0.0': 0, '1.0': 1, '10.0': 2, '11.0': 3, '12.0': 4, '13.0': 5, '14.0': 6, '15.0': 7, '16.0': 8, '17.0': 9, '18.0': 10, '19.0': 11, '2.0': 12, '20.0': 13, '21.0': 14, '22.0': 15, '23.0': 16, '24.0': 17, '25.0': 18, '26.0': 19, '27.0': 20, '28.0': 21, '29.0': 22, '3.0': 23, '4.0': 24, '5.0': 25, '6.0': 26, '7.0': 27, '8.0': 28, '9.0': 29, 'missing': 30}}
after Encoder None
        SaleID      name  ...  price_average_GBYbrand  power_bin
0 0.000000 0.003740 ... 0.073848 0
1 0.000007 0.011493 ... 0.234956 4
2 0.000013 0.075575 ... 0.251439 3
3 0.000020 0.365145 ... 0.212120 3
4 0.000027 0.564396 ... 0.065144 0
... ... ... ... ... ...
149995 0.999973 0.833171 ... 0.212120 3
149996 0.999980 0.937621 ... 0.100505 2
149997 0.999987 0.749888 ... 0.100505 1
149998 0.999993 0.233253 ... 0.212120 3
149999 1.000000 0.902750 ... 0.135830 3
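The output above combines two operations: min-max scaling of numeric columns and label encoding of categorical ones. The sketch below reproduces both with plain pandas on made-up rows; the helper names are hypothetical, and a library encoder such as scikit-learn's `LabelEncoder` would be the usual tool. Sorting the categories as strings yields the same ordering seen in the dictionary above ('10.0' before '100.0' before '2.0').

```python
import pandas as pd

df = pd.DataFrame({
    "kilometer": [12.5, 15.0, 5.0, 0.5],
    "gearbox": ["0.0", "1.0", "missing", "0.0"],
    "model": ["10.0", "2.0", "100.0", "missing"],
})
num_cols = ["kilometer"]
cat_cols = ["gearbox", "model"]

def min_max_scale(col):
    """Rescale a numeric column linearly to the range [0, 1]."""
    return (col - col.min()) / (col.max() - col.min())

def label_encode(col):
    """Map each category (as a string, sorted) to an integer code."""
    as_str = col.astype(str)
    mapping = {value: code for code, value in enumerate(sorted(as_str.unique()))}
    return as_str.map(mapping), mapping

for col in num_cols:
    df[col] = min_max_scale(df[col])

le_dict = {}
for col in cat_cols:
    df[col], le_dict[col] = label_encode(df[col])

print(le_dict)
print(df)
```

With kilometer ranging from 0.5 to 15, a value of 12.5 scales to (12.5 − 0.5) / 14.5 ≈ 0.8276, which matches the kilometer value in the first row of the exported dataset further down.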
# 2.11 Define the model-input features
# 2.11.1 Drop features
# 2.11.2 Feature selection
# T2 Wrapper
k_featurenames ('bodyType', 'gearbox', 'kilometer', 'v_0', 'v_3', 'v_7', 'v_14', 'used_time', 'price_average_GBYbrand', 'power_bin')
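A wrapper method scores candidate feature subsets by actually fitting a model on each. The source does not show which tool produced the tuple above (the label resembles the `k_feature_names_` attribute of mlxtend's `SequentialFeatureSelector`, but that is a guess), so the sketch below hand-rolls greedy forward selection with least squares and k-fold validation on synthetic data. All names and data are illustrative.

```python
import numpy as np

def cv_mse(X, y, n_splits=3):
    """Mean squared error of ordinary least squares under simple k-fold CV."""
    idx = np.arange(len(y))
    errors = []
    for fold in np.array_split(idx, n_splits):
        train = np.setdiff1d(idx, fold)
        A = np.column_stack([np.ones(len(train)), X[train]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        B = np.column_stack([np.ones(len(fold)), X[fold]])
        errors.append(np.mean((B @ coef - y[fold]) ** 2))
    return float(np.mean(errors))

def forward_select(X, y, names, k):
    """Greedily add the feature that lowers the CV error most, k times."""
    selected = []
    while len(selected) < k:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        best = min(remaining, key=lambda j: cv_mse(X[:, selected + [j]], y))
        selected.append(best)
    return tuple(names[j] for j in selected)

# Synthetic data: only the first and third columns drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=120)
names = ["v_0", "v_1", "v_3", "noise"]

chosen = forward_select(X, y, names, k=2)
print(chosen)
```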
# T3 Embedded (most commonly used)
LiR_MSE: 15993321.471365392
LiR_R2: 0.7057326262665655
intercept: -480467.6143789641
coef: [('v_5', 547248.1399627327), ('v_6', 517106.21250813385), ('v_7', 497333.878927629), ('v_10', 365570.90980079107), ('v_11', 171543.6146836947), ('v_8', 164227.00112090845), ('v_9', 128578.71403340848), ('power', 48863.6068485829), ('v_4', 43508.82539409367), ('v_14', 19828.850095900943), ('price_average_GBYbrand', 10572.754737316918), ('brand_amount', 6968.85289671065), ('price_median_GBYbrand', 6595.631072990875), ('price_max_GBYbrand', 2237.7971368071658), ('price_std_GBYbrand', 956.376637996673), ('gearbox', 679.4055026736423), ('used_time', 387.4132818355945), ('power_bin', 291.5175148434141), ('bodyType', 217.02045635721151), ('model', -2.4899364779927495), ('city', -10.258028861593232), ('notRepairedDamage', -20.486887939604173), ('fuelType', -24.736780561186862), ('price_min_GBYbrand', -3762.1215956763376), ('kilometer', -4299.815762643461), ('price_sum_GBYbrand', -6953.314648619096), ('v_0', -67643.70870061051), ('v_2', -142475.32076890446), ('v_13', -148508.8116222008), ('v_3', -276643.4143410439), ('v_12', -303764.0882419921), ('v_1', -379287.1351181704)]
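The printout above pairs each feature with its linear-regression coefficient, sorted from largest to smallest, alongside MSE and R². A sketch of the same report using NumPy least squares on synthetic data (the original presumably used a library regressor; the feature names here are only labels for made-up columns):

```python
import numpy as np

rng = np.random.default_rng(1)
names = ["power", "kilometer", "city"]

# Synthetic features already scaled to [0, 1]; "city" has no real effect.
X = rng.uniform(size=(200, 3))
y = 5.0 * X[:, 0] - 4.0 * X[:, 1] + 1.0 + rng.normal(scale=0.01, size=200)

# Fit y = intercept + X @ coefs by ordinary least squares.
A = np.column_stack([np.ones(len(y)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, coefs = beta[0], beta[1:]

# Evaluation metrics on the fitted data.
pred = A @ beta
mse = float(np.mean((pred - y) ** 2))
r2 = float(1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2))

# Coefficients sorted from largest to smallest, as in the printout.
ranked = sorted(zip(names, coefs), key=lambda item: item[1], reverse=True)

print("LiR_MSE:", mse)
print("LiR_R2:", r2)
print("intercept:", intercept)
print("coef:", ranked)
```

Coefficient size is only comparable across features when they share a scale, which is one reason the normalization step precedes this one; strongly correlated features can still produce large coefficients of opposite sign.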
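# Take a small sample and, for a single feature, compare the distribution of the model's predictions with that of the true labels
# 2.12 Export the model-input dataset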
model | brand | bodyType | fuelType | gearbox | power | kilometer | notRepairedDamage | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | city | used_time | brand_amount | price_max_GBYbrand | price_median_GBYbrand | price_min_GBYbrand | price_sum_GBYbrand | price_std_GBYbrand | price_average_GBYbrand | power_bin |
172 | 0.153846154 | 1 | 0 | 0 | 0.003106877 | 0.827586207 | 0 | 1850 | 0.590595856 | 0.711260858 | 0.192329457 | 0.550783711 | 0.49208436 | 0.807556985 | 0.673547173 | 0.092209629 | 0.141900787 | 0.437465451 | 0.292047846 | 0.343037207 | 0.307345583 | 0.323443384 | 0.49071564 | 0 | 0.470440114 | 0.324362111 | 0.587029733 | 0.029269972 | 0.002063983 | 0.211594546 | 0.186944095 | 0.073847955 | 0 |
183 | 0.025641026 | 2 | 0 | 0 | 0 | 1 | 0 | 3600 | 0.679716245 | 0.820573785 | 0.196059042 | 0.505302185 | 0.262857081 | 0.907274422 | 0.799127704 | 0.09660986 | 0.165416289 | 0.092382489 | 0.19826575 | 0.314003614 | 0.366540781 | 0.158887319 | 0.446701089 | 3 | 0.511167068 | 0.438022306 | 0.998980422 | 0.191081267 | 0.004127967 | 0.734020942 | 0.399306567 | 0.234956097 | 4 |
19 | 0.384615385 | 1 | 0 | 0 | 0.008440348 | 0.827586207 | 0 | 6222 | 0.710517994 | 0.785077631 | 0.246326649 | 0.366413622 | 0.300846812 | 0.861471263 | 0.758899638 | 0.117548023 | 0.386668675 | 0.12152758 | 0.200762016 | 0.301993289 | 0.477060408 | 0.217050409 | 0.415429397 | 1 | 0.470111671 | 0.046042388 | 0.433578101 | 0.259986226 | 0.091847265 | 0.082280125 | 0.22063358 | 0.251439007 | 3 |
12 | 0.256410256 | 0 | 0 | 1 | 0.009993786 | 1 | 0 | 2400 | 0.69720671 | 0.756563426 | 0.188038118 | 0.476284942 | 0.190861388 | 0.939881251 | 0.728439962 | 0.086810868 | 0.207689175 | 0 | 0.216425071 | 0.280759589 | 0.389047154 | 0.112115708 | 0.399070505 | 8 | 0.770418218 | 0.452480061 | 0.979412764 | 0.153236915 | 0.004127967 | 0.692603009 | 0.382034156 | 0.2121201 | 3 |
14 | 0.128205128 | 1 | 0 | 0 | 0.003521127 | 0.310344828 | 0 | 5200 | 0.637534583 | 0.544686477 | 0.214532645 | 0.332976348 | 0.590557679 | 0.781377111 | 0.483458255 | 0.06539832 | 0.490197784 | 0.545516459 | 0.337834311 | 0.265369968 | 0.450057777 | 0.456712468 | 0.557057054 | 5 | 0.157981169 | 0.147945728 | 0.294544743 | 0.046487603 | 0.009287926 | 0.088308958 | 0.126353182 | 0.065143906 | 0 |
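3. Model training and validation
ML regression: a detailed guide to predicting used-car transaction prices with data preprocessing and LiR/XGBoost etc. (feature importance, cross-training curve visualization, linear vs. nonlinear algorithm comparison, tuning of three models, fusion of three models)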
https://yunyaniu.blog.csdn.net/article/details/129280091