Used-Car Price Prediction, Task 02: Exploratory Data Analysis

2024-01-05 23:40


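  • Task 02 covered plotting for data analysis
      1. Learned how to use sns.pairplot()
      2. Learned how to use sns.distplot()
      3. Typed out the task's data analysis once and added explanatory comments
      4. Dropped the two abnormal categorical columns and three columns picked by their correlation with price, then re-ran the prediction. As the figure shows, the result did not improve, so further processing and feature engineering (task03) are needed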

[Figure: prediction result after dropping these columns]

What follows is the data-analysis walkthrough, done by following the tutorial.

# Imports
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
  • Read the data
Train_data = pd.read_csv('car_train_0110.csv', sep=' ')
Test_data = pd.read_csv('car_testA_0110.csv', sep=' ')
# Show the first and last five rows (DataFrame.append was removed in pandas 2.0, so use pd.concat)
pd.concat([Train_data.head(), Train_data.tail()])
        SaleID    name   regDate  model  brand  bodyType  fuelType  gearbox  power  kilometer  ...      v_14      v_15       v_16        v_17       v_18        v_19       v_20       v_21       v_22       v_23
0       134890     734  20160002   13.0      9       NaN       0.0      1.0      0       15.0  ...  0.092139  0.000000  18.763832   -1.512063  -1.008718  -12.100623  -0.947052   9.077297   0.581214   3.945923
1       306648  196973  20080307   72.0      9       7.0       5.0      1.0    173       15.0  ...  0.001070  0.122335  -5.685612   -0.489963  -2.223693   -0.226865  -0.658246  -3.949621   4.593618  -1.145653
2       340675   25347  20020312   18.0     12       3.0       0.0      1.0     50       12.5  ...  0.064410  0.003345  -3.295700    1.816499   3.554439   -0.683675   0.971495   2.625318  -0.851922  -1.246135
3        57332    5382  20000611   38.0      8       7.0       0.0      1.0     54       15.0  ...  0.069231  0.000000  -3.405521    1.497826   4.782636    0.039101   1.227646   3.040629  -0.801854  -1.251894
4       265235  173174  20030109   87.0      0       5.0       5.0      1.0    131        3.0  ...  0.000099  0.001655  -4.475429    0.124138   1.364567   -0.319848  -1.131568  -3.303424  -1.998466  -1.279368
249995   10556    9332  20170003   13.0      9       NaN       NaN      1.0     58       15.0  ...  0.079119  0.001447  11.782508   20.402576  -2.722772    0.462388  -4.429385   7.883413   0.698405  -1.082013
249996  146710  102110  20030511   29.0     17       3.0       0.0      0.0     61       15.0  ...  0.000000  0.002342  -2.988272    1.500532   3.502201   -0.761715  -2.484556  -2.532968  -0.940266  -1.106426
249997  116066   82802  20130312  124.0     16       6.0       0.0      1.0    122        3.0  ...  0.003358  0.100760  -6.939560   -1.144959  -5.337949    0.896026  -0.592565  -3.872725   2.135984   3.807554
249998   90082   65971  20121212  111.0      4       7.0       5.0      0.0    184        9.0  ...  0.002974  0.008251  -7.222167   -1.383696  -5.402794   -0.409451  -1.891556  -3.104789  -3.777374   3.186218
249999   76453   56954  20051111   13.0      9       3.0       0.0      1.0     58       12.5  ...  0.000000  0.009071  10.491312  -11.270043  -0.272595   -0.026478  -2.168249  -0.980042  -0.955164  -1.169593

10 rows × 40 columns

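  • name - car code
  • regDate - car registration date ***
  • model - model code
  • brand - brand
  • bodyType - body type
  • fuelType - fuel type
  • gearbox - gearbox (transmission)
  • power - engine power
  • kilometer - distance driven
  • notRepairedDamage - car has damage that has not been repaired ***
  • regionCode - code of the region where the car can be viewed
  • seller - seller
  • offerType - offer type
  • creatDate - date the ad was posted
  • price - car price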
Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode','seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3','v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12','v_13', 'v_14', 'v_15', 'v_16', 'v_17', 'v_18', 'v_19', 'v_20', 'v_21','v_22', 'v_23'],dtype='object')
# Keep the names of the non-anonymized columns in a plain list
Train_data_part = ['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode','seller', 'offerType', 'creatDate', 'price']
Train_data_part
['SaleID','name','regDate','model','brand','bodyType','fuelType','gearbox','power','kilometer','notRepairedDamage','regionCode','seller','offerType','creatDate','price']
Train_data.describe()
              SaleID           name        regDate          model          brand       bodyType       fuelType        gearbox          power      kilometer  ...           v_14           v_15           v_16           v_17           v_18           v_19           v_20           v_21           v_22           v_23
count  250000.000000  250000.000000   2.500000e+05  250000.000000  250000.000000  224620.000000  227510.000000  236487.000000  250000.000000  250000.000000  ...  250000.000000  250000.000000  250000.000000  250000.000000  250000.000000  250000.000000  250000.000000  250000.000000  250000.000000  250000.000000
mean   185351.790768   83153.362172   2.003401e+07      44.911480       7.785236       4.563271       1.665008       0.780783     115.528412      12.577418  ...       0.032489       0.030408       0.014725       0.000915       0.006273       0.006604      -0.001374       0.000609      -0.004025       0.001834
std    107121.188763   72540.799964   7.770250e+04      50.640081       7.694010       1.912515       2.339646       0.413717     196.141828       3.990632  ...       0.038792       0.049333       8.779163       5.771081       4.880981       4.124722       3.803626       3.555353       2.864713       2.323680
min         1.000000       0.000000   1.910000e+07       0.000000       0.000000       0.000000       0.000000       0.000000       0.000000       0.500000  ...       0.000000       0.000000     -10.412444     -15.538236     -21.009214     -13.989955      -9.599285     -11.181255      -7.671327      -2.350888
25%     92501.750000   14500.000000   1.999061e+07       6.000000       1.000000       3.000000       0.000000       1.000000      70.000000      12.500000  ...       0.000129       0.000000      -5.552269      -0.901181      -3.150385      -0.478173      -1.727237      -3.067073      -2.092178      -1.402804
50%    185264.500000   65314.500000   2.003111e+07      27.000000       6.000000       4.000000       0.000000       1.000000     105.000000      15.000000  ...       0.001961       0.002567      -3.821770       0.223181      -0.058502       0.038427      -0.995044      -0.880587      -1.199807      -1.145588
75%    278128.500000  143761.250000   2.008081e+07      70.000000      11.000000       7.000000       5.000000       1.000000     150.000000      15.000000  ...       0.075672       0.056568       3.599747       1.263737       2.800475       0.569198       1.563382       3.269987       2.737614       0.044865
max    370946.000000  233044.000000   2.019121e+07     250.000000      39.000000       7.000000       6.000000       1.000000   20000.000000      15.000000  ...       0.130785       0.184340      36.756878      26.134561      23.055660      16.576027      20.324572      14.039422       8.764597       8.574730

8 rows × 40 columns

Test_data.describe()

The max of power here (20000) looks abnormal.
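One way to follow up on that suspicion is to count how many rows exceed a plausible power value and cap them rather than drop the rows. This is only a sketch on a small made-up frame: the cap of 600 and the column name power_clipped are assumptions for illustration, not values from the tutorial.

```python
import pandas as pd

# Toy stand-in for Train_data['power']; the values are made up for illustration.
df = pd.DataFrame({"power": [0, 75, 150, 20000, 1090, 60]})

POWER_CAP = 600  # assumed upper bound on plausible power, not from the tutorial

# Count the rows that look implausible.
n_outliers = int((df["power"] > POWER_CAP).sum())
print("rows above cap:", n_outliers)

# Cap the extreme values instead of dropping the rows.
df["power_clipped"] = df["power"].clip(upper=POWER_CAP)
print(df)
```

Clipping keeps every row, which matters when the same transformation has to be applied to the test set; whether to clip, drop, or leave the values alone is a modeling choice.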

Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 40 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             250000 non-null  int64  
 1   name               250000 non-null  int64  
 2   regDate            250000 non-null  int64  
 3   model              250000 non-null  float64
 4   brand              250000 non-null  int64  
 5   bodyType           224620 non-null  float64
 6   fuelType           227510 non-null  float64
 7   gearbox            236487 non-null  float64
 8   power              250000 non-null  int64  
 9   kilometer          250000 non-null  float64
 10  notRepairedDamage  201464 non-null  float64
 11  regionCode         250000 non-null  int64  
 12  seller             250000 non-null  int64  
 13  offerType          250000 non-null  int64  
 14  creatDate          250000 non-null  int64  
 15  price              250000 non-null  int64  
 16  v_0                250000 non-null  float64
 17  v_1                250000 non-null  float64
 18  v_2                250000 non-null  float64
 19  v_3                250000 non-null  float64
 20  v_4                250000 non-null  float64
 21  v_5                250000 non-null  float64
 22  v_6                250000 non-null  float64
 23  v_7                250000 non-null  float64
 24  v_8                250000 non-null  float64
 25  v_9                250000 non-null  float64
 26  v_10               250000 non-null  float64
 27  v_11               250000 non-null  float64
 28  v_12               250000 non-null  float64
 29  v_13               250000 non-null  float64
 30  v_14               250000 non-null  float64
 31  v_15               250000 non-null  float64
 32  v_16               250000 non-null  float64
 33  v_17               250000 non-null  float64
 34  v_18               250000 non-null  float64
 35  v_19               250000 non-null  float64
 36  v_20               250000 non-null  float64
 37  v_21               250000 non-null  float64
 38  v_22               250000 non-null  float64
 39  v_23               250000 non-null  float64
dtypes: float64(30), int64(10)
memory usage: 76.3 MB
Test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 39 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   SaleID             50000 non-null  int64  
 1   name               50000 non-null  int64  
 2   regDate            50000 non-null  int64  
 3   model              50000 non-null  float64
 4   brand              50000 non-null  int64  
 5   bodyType           44890 non-null  float64
 6   fuelType           45598 non-null  float64
 7   gearbox            47287 non-null  float64
 8   power              50000 non-null  int64  
 9   kilometer          50000 non-null  float64
 10  notRepairedDamage  40372 non-null  float64
 11  regionCode         50000 non-null  int64  
 12  seller             50000 non-null  int64  
 13  offerType          50000 non-null  int64  
 14  creatDate          50000 non-null  int64  
 15  v_0                50000 non-null  float64
 16  v_1                50000 non-null  float64
 17  v_2                50000 non-null  float64
 18  v_3                50000 non-null  float64
 19  v_4                50000 non-null  float64
 20  v_5                50000 non-null  float64
 21  v_6                50000 non-null  float64
 22  v_7                50000 non-null  float64
 23  v_8                50000 non-null  float64
 24  v_9                50000 non-null  float64
 25  v_10               50000 non-null  float64
 26  v_11               50000 non-null  float64
 27  v_12               50000 non-null  float64
 28  v_13               50000 non-null  float64
 29  v_14               50000 non-null  float64
 30  v_15               50000 non-null  float64
 31  v_16               50000 non-null  float64
 32  v_17               50000 non-null  float64
 33  v_18               50000 non-null  float64
 34  v_19               50000 non-null  float64
 35  v_20               50000 non-null  float64
 36  v_21               50000 non-null  float64
 37  v_22               50000 non-null  float64
 38  v_23               50000 non-null  float64
dtypes: float64(30), int64(9)
memory usage: 14.9 MB
# Check each column for NaN values
Train_data.isnull()
        SaleID   name  regDate  model  brand  bodyType  fuelType  gearbox  power  kilometer  ...   v_14   v_15   v_16   v_17   v_18   v_19   v_20   v_21   v_22   v_23
0        False  False    False  False  False      True     False    False  False      False  ...  False  False  False  False  False  False  False  False  False  False
1        False  False    False  False  False     False     False    False  False      False  ...  False  False  False  False  False  False  False  False  False  False
2        False  False    False  False  False     False     False    False  False      False  ...  False  False  False  False  False  False  False  False  False  False
3        False  False    False  False  False     False     False    False  False      False  ...  False  False  False  False  False  False  False  False  False  False
4        False  False    False  False  False     False     False    False  False      False  ...  False  False  False  False  False  False  False  False  False  False
...        ...    ...      ...    ...    ...       ...       ...      ...    ...        ...  ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
249995   False  False    False  False  False      True      True    False  False      False  ...  False  False  False  False  False  False  False  False  False  False
249996   False  False    False  False  False     False     False    False  False      False  ...  False  False  False  False  False  False  False  False  False  False
249997   False  False    False  False  False     False     False    False  False      False  ...  False  False  False  False  False  False  False  False  False  False
249998   False  False    False  False  False     False     False    False  False      False  ...  False  False  False  False  False  False  False  False  False  False
249999   False  False    False  False  False     False     False    False  False      False  ...  False  False  False  False  False  False  False  False  False  False

250000 rows × 40 columns

Train_data.isnull().sum() # sum() adds up each column, giving the number of NaNs per column
SaleID                   0
name                     0
regDate                  0
model                    0
brand                    0
bodyType             25380
fuelType             22490
gearbox              13513
power                    0
kilometer                0
notRepairedDamage    48536
regionCode               0
seller                   0
offerType                0
creatDate                0
price                    0
v_0                      0
v_1                      0
v_2                      0
v_3                      0
v_4                      0
v_5                      0
v_6                      0
v_7                      0
v_8                      0
v_9                      0
v_10                     0
v_11                     0
v_12                     0
v_13                     0
v_14                     0
v_15                     0
v_16                     0
v_17                     0
v_18                     0
v_19                     0
v_20                     0
v_21                     0
v_22                     0
v_23                     0
dtype: int64

Visualizing NaN values

missing = Train_data.isnull().sum() # number of NaNs per column
missing = missing[missing > 0] # keep only the columns that actually have missing values
type(missing)
pandas.core.series.Series
missing
bodyType             25380
fuelType             22490
gearbox              13513
notRepairedDamage    48536
dtype: int64
# inplace=True modifies the original object
missing.sort_values(inplace=True)
missing # after sorting
gearbox              13513
fuelType             22490
bodyType             25380
notRepairedDamage    48536
dtype: int64
# Bar plot: feature names on the x-axis, missing counts on the y-axis
missing.plot.bar()

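The two statements above show at a glance which columns contain NaN and print how many. The main point is to check whether the NaN count is really large. If it is small, the usual choice is to fill the values in; with tree models such as lgb (LightGBM) they can be left missing so the trees handle them on their own; if a column has too many NaNs, consider dropping it.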
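The rule of thumb above can be sketched in a few lines. The frame below is made up, the 0.5 cut-off is an assumed threshold rather than anything from the tutorial, and filling with the mode is just one possible choice.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for Train_data; the values are made up.
df = pd.DataFrame({
    "gearbox": [1.0, 0.0, np.nan, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
    "mostly_missing": [np.nan] * 8 + [1.0, 2.0],
    "power": [75, 60, 150, 140, 0, 90, 101, 116, 54, 131],
})

DROP_THRESHOLD = 0.5  # assumed cut-off; tune it for your own data

# Fraction of NaN per column.
missing_ratio = df.isnull().mean()

# Columns with too many NaNs are dropped, columns with a few are filled.
to_drop = missing_ratio[missing_ratio > DROP_THRESHOLD].index.tolist()
to_fill = missing_ratio[(missing_ratio > 0) & (missing_ratio <= DROP_THRESHOLD)].index.tolist()

df_clean = df.drop(columns=to_drop)
for col in to_fill:
    # Fill with the most frequent value; a tree model could skip this step.
    df_clean[col] = df_clean[col].fillna(df_clean[col].mode().iloc[0])

print(missing_ratio)
print("dropped:", to_drop, "filled:", to_fill)
```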

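# Visualize the missing values
msno.matrix(Train_data.sample(250))

msno.bar(Train_data.sample(1000))
# In the 1,000 sampled rows, the bars show which columns have fewer than 1,000 non-null values; the count is printed above each bar

# Visualize the missing values in the test set
msno.matrix(Test_data)

msno.bar(Test_data.sample(1000))

  • The missing-value patterns of the training set and the test set are very similar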
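The similarity can also be checked numerically rather than by eye. The sketch below reuses the counts from the info() and isnull().sum() outputs earlier in this post (250000 training rows, 50000 test rows); the frame name compare is just an illustrative choice.

```python
import pandas as pd

# Missing counts in the training set, from Train_data.isnull().sum().
train_missing = pd.Series(
    {"bodyType": 25380, "fuelType": 22490, "gearbox": 13513, "notRepairedDamage": 48536}
)
# Missing counts in the test set: 50000 rows minus the non-null counts from Test_data.info().
test_missing = pd.Series(
    {
        "bodyType": 50000 - 44890,
        "fuelType": 50000 - 45598,
        "gearbox": 50000 - 47287,
        "notRepairedDamage": 50000 - 40372,
    }
)

# Compare the missing ratios side by side.
compare = pd.DataFrame({"train": train_missing / 250000, "test": test_missing / 50000})
compare["abs_diff"] = (compare["train"] - compare["test"]).abs()
print(compare.round(4))
```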

Outlier detection

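  • .value_counts() lists the distinct values in a feature column and how often each occurs
# .value_counts() lists the distinct values in a feature column and their counts
Train_data['notRepairedDamage'].value_counts()
1.0    176922
0.0     24542
Name: notRepairedDamage, dtype: int64
# Train_data.value_counts()
# In the raw used-car data this feature is categorical and '-' also stands for a missing value,
# so the line below would replace '-' with NaN
# Train_data['notRepairedDamage'].replace('-', np.nan, inplace=True)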
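For a file where the column still arrives as strings with '-' placeholders, the commented-out replacement could look roughly like this. The toy Series is made up; in the data loaded above the column is already float64, so this step is not needed there.

```python
import numpy as np
import pandas as pd

# Toy column mimicking the raw form described above, where '-' marks a missing value.
s = pd.Series(["0.0", "-", "1.0", "0.0", "-"], name="notRepairedDamage")

# Turn the placeholder into NaN, then convert the remaining strings to floats.
cleaned = s.replace("-", np.nan).astype(float)

print(cleaned)
print("missing:", int(cleaned.isnull().sum()))
```

Assigning the result back (instead of using inplace=True) avoids surprises with chained indexing.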

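The following two categorical features are severely skewed and generally do not help prediction, so they are dropped here. You can dig into them further, but it is rarely worthwhile.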

Train_data["seller"].value_counts()
1    249999
0         1
Name: seller, dtype: int64
Test_data["seller"].value_counts()
1    50000
Name: seller, dtype: int64
Train_data["offerType"].value_counts()
0    249991
1         9
Name: offerType, dtype: int64
Test_data['offerType'].value_counts()
0    49999
1        1
Name: offerType, dtype: int64
del Train_data["seller"] 
del Train_data["offerType"] 
del Test_data["seller"] 
del Test_data["offerType"]
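Instead of spotting such columns by eye, the share of the most frequent value can be computed for every column. A minimal sketch on made-up data; the 0.95 threshold is an assumption, not a value from the tutorial.

```python
import pandas as pd

# Toy frame: two nearly constant columns and one reasonably balanced one.
df = pd.DataFrame({
    "seller": [1] * 99 + [0],
    "offerType": [0] * 98 + [1] * 2,
    "gearbox": [1] * 78 + [0] * 22,
})

SKEW_THRESHOLD = 0.95  # assumed: flag columns where one value covers more than 95% of rows

# Share of the most frequent value in each column.
top_share = df.apply(lambda col: col.value_counts(normalize=True).iloc[0])
skewed_cols = top_share[top_share > SKEW_THRESHOLD].index.tolist()

# Drop the flagged columns; the same list should be applied to the test set.
df_reduced = df.drop(columns=skewed_cols)
print(top_share)
print("dropped:", skewed_cols)
```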

value_counts() for every feature

for f in Train_data.columns:
    print(f)
    print(Train_data[f].value_counts())
SaleID
2049      1
265515    1
277805    1
271662    1
312626    1
         ..
107105    1
113250    1
111203    1
98917     1
2047      1
Name: SaleID, Length: 250000, dtype: int64
name
451       452
73        429
1791      428
821       391
243       346
          ...
92419       1
88325       1
82182       1
84231       1
157427      1
Name: name, Length: 164312, dtype: int64
regDate
20000010    306
20000001    288
20000002    288
20000007    279
20000008    278
            ...
19850904      1
19851010      1
19750511      1
19870912      1
19400705      1
Name: regDate, Length: 7537, dtype: int64
model
0.0      20344
6.0      17741
4.0      13837
1.0      13634
12.0      8841
           ...
226.0        5
245.0        5
243.0        4
249.0        4
250.0        1
Name: model, Length: 251, dtype: int64
brand
0     53699
4     27109
11    26944
10    23762
1     22144
6     17202
9     12210
5      7343
15     6500
12     4704
7      3839
3      3831
17     3543
13     3502
8      3374
28     3161
19     2561
18     2451
16     2274
22     2264
23     2088
14     1892
24     1678
25     1611
20     1610
27     1392
29     1259
34      963
30      604
2       570
31      540
21      522
38      516
35      415
32      406
36      377
33      368
37      324
26      307
39      141
Name: brand, dtype: int64
bodyType
7.0    64571
3.0    53858
4.0    45646
5.0    20343
6.0    15290
2.0    12755
1.0     9882
0.0     2275
Name: bodyType, dtype: int64
fuelType
0.0    150664
5.0     72494
4.0      3577
3.0       385
2.0       183
1.0       147
6.0        60
Name: fuelType, dtype: int64
gearbox
1.0    184645
0.0     51842
Name: gearbox, dtype: int64
power
0        27280
75       16158
60       10765
150      10373
140       9145
           ...
1986         1
1090         1
10311        1
960          1
3454         1
Name: power, Length: 703, dtype: int64
kilometer
15.0    162161
12.5     25743
10.0     10777
9.0       8424
8.0       7434
7.0       6642
6.0       5859
5.0       5100
0.5       4634
4.0       4204
3.0       4021
2.0       3749
1.0       1252
Name: kilometer, dtype: int64
notRepairedDamage
1.0    176922
0.0     24542
Name: notRepairedDamage, dtype: int64
regionCode
487     550
868     424
149     236
539     227
32      216
        ...
7959      1
8002      1
6715      1
7117      1
4144      1
Name: regionCode, Length: 8081, dtype: int64
creatDate
20160403    9758
20160404    9521
20160320    9176
20160312    8946
20160321    8895
             ...
20150618       1
20160114       1
20160201       1
20150611       1
20140310       1
Name: creatDate, Length: 107, dtype: int64
price
0        7312
500      3815
1500     3587
1000     3149
1200     3071
          ...
11320       1
7230        1
11448       1
9529        1
8188        1
Name: price, Length: 4585, dtype: int64
v_0
71.666307    2
72.346416    2
78.107692    2
71.715545    2
73.734706    2
            ..
71.161494    1
70.253614    1
70.797686    1
74.588185    1
77.825581    1
Name: v_0, Length: 249747, dtype: int64
v_1
-1.470958     2
-3.128523     2
-3.224945     2
-3.293795     2
-3.322763     2
             ..
-3.970355     1
 11.487790    1
-3.456756     1
-3.746283     1
-3.579301     1
Name: v_1, Length: 249747, dtype: int64
v_2
-0.527186    2
-0.998414    2
 0.652201    2
 0.356107    2
-9.312859    2
            ..
-0.815970    1
-6.729062    1
-1.683035    1
 0.171102    1
 0.852139    1
Name: v_2, Length: 249747, dtype: int64
v_3
 3.580573     2
-0.633228     2
-2.541859     2
 0.161395     2
 20.571558    2
             ..
 1.067454     1
-0.826230     1
-6.306510     1
 0.201140     1
 4.146853     1
Name: v_3, Length: 249747, dtype: int64
v_4
 2.038620     2
-0.591751     2
-12.603294    2
-0.321072     2
-0.429618     2
             ..
 0.742918     1
 2.722358     1
-0.317880     1
-0.356648     1
 1.327513     1
Name: v_4, Length: 249747, dtype: int64
v_5
 1.273623    2
-1.589295    2
-2.350140    2
 0.080770    2
-2.434300    2
            ..
-1.374641    1
-2.369201    1
-2.194464    1
 1.226827    1
-1.218480    1
Name: v_5, Length: 249747, dtype: int64
v_6
 3.854950    2
-2.337177    2
-2.840736    2
-2.988814    2
 0.912034    2
            ..
-8.718013    1
 3.185567    1
 3.443525    1
-2.653621    1
-3.138425    1
Name: v_6, Length: 249747, dtype: int64
v_7
-2.915058    2
-2.518469    2
-1.175198    2
-3.672233    2
-2.563102    2
            ..
-1.171334    1
-2.324847    1
 4.015706    1
-1.895407    1
-2.156468    1
Name: v_7, Length: 249747, dtype: int64
v_8
0.000000    48244
0.315924        2
0.315905        2
0.314498        2
0.315560        2
              ...
0.315494        1
0.289243        1
0.316095        1
0.316209        1
0.315702        1
Name: v_8, Length: 201543, dtype: int64
v_9
1.101174    2
0.118624    2
0.164335    2
0.114609    2
0.112811    2
           ..
1.110851    1
1.101634    1
0.116084    1
0.090707    1
0.112558    1
Name: v_9, Length: 249747, dtype: int64
v_10
0.000000    25342
0.081665        2
0.086726        2
0.081616        2
0.081701        2
              ...
0.089640        1
0.091852        1
0.082066        1
0.081448        1
0.087517        1
Name: v_10, Length: 224427, dtype: int64
v_11
0.000000    7421
0.121584       2
0.102037       2
0.166840       2
0.134519       2
             ...
0.092895       1
0.108411       1
0.131894       1
0.075781       1
0.078286       1
Name: v_11, Length: 242335, dtype: int64
v_12
0.000000    22426
0.053098        2
0.053437        2
0.055474        2
0.053432        2
              ...
0.053485        1
0.053471        1
0.055616        1
0.053447        1
0.053329        1
Name: v_12, Length: 227338, dtype: int64
v_13
0.000000    13495
0.130205        2
0.123467        2
0.123337        2
0.130232        2
              ...
0.123242        1
0.123755        1
0.123252        1
0.123047        1
0.123567        1
Name: v_13, Length: 236266, dtype: int64
v_14
0.000000    53857
0.003751        2
0.000746        2
0.002838        2
0.002283        2
              ...
0.094690        1
0.000690        1
0.086957        1
0.002928        1
0.083676        1
Name: v_14, Length: 195953, dtype: int64
v_15
0.000000    97223
0.010717        2
0.012704        2
0.143362        2
0.005417        2
              ...
0.001720        1
0.003263        1
0.007882        1
0.005242        1
0.094839        1
Name: v_15, Length: 152622, dtype: int64
v_16
-3.254226     2
-2.855248     2
-4.373334     2
 7.744677     2
-2.847659     2
             ..
 10.816862    1
-6.670231     1
-6.291694     1
-4.147668     1
-6.389964     1
Name: v_16, Length: 249747, dtype: int64
v_17
 0.498971     2
 0.217974     2
-13.000712    2
-0.675390     2
-1.530593     2
             ..
 1.397213     1
 0.610112     1
 2.335480     1
-1.500048     1
 5.289472     1
Name: v_17, Length: 249747, dtype: int64
v_18
-3.753102    2
 7.731945    2
-0.058593    2
-1.171759    2
-2.045338    2
            ..
-0.860834    1
 0.643066    1
 5.023034    1
-2.016881    1
 6.565301    1
Name: v_18, Length: 249747, dtype: int64
v_19
 0.082562    2
-0.469708    2
-0.138257    2
-0.657417    2
 0.862429    2
            ..
-1.249647    1
-0.664831    1
-0.660867    1
 0.040847    1
 0.700206    1
Name: v_19, Length: 249747, dtype: int64
v_20
-1.214032    2
-2.031659    2
-2.426898    2
-1.542005    2
-0.657360    2
            ..
-2.098785    1
 0.725159    1
-4.682086    1
 0.342639    1
 1.612570    1
Name: v_20, Length: 249747, dtype: int64
v_21
-3.244933    2
-3.440059    2
-4.070917    2
-3.001142    2
-4.153741    2
            ..
-3.289808    1
 4.942931    1
 2.670356    1
-3.793230    1
 3.226273    1
Name: v_21, Length: 249747, dtype: int64
v_22
-2.957315    2
 4.311760    2
-2.101273    2
-0.936764    2
-2.562937    2
            ..
 6.155533    1
-2.379816    1
-1.419529    1
 4.872987    1
 5.062041    1
Name: v_22, Length: 249747, dtype: int64
v_23
-1.044909    2
-1.259228    2
-1.183081    2
-0.989739    2
-0.958179    2
            ..
-1.946367    1
-1.503159    1
-1.175261    1
-0.908192    1
-1.182790    1
Name: v_23, Length: 249747, dtype: int64
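  • bodyType : 8 categories
  • fuelType : 7 categories
  • gearbox : 2 categories
  • kilometer : 13 categories
  • notRepairedDamage : 2 categories
  • seller : 2 categories, but severely skewed **
  • offerType : 2 categories, but severely skewed **
  • v_8, v_10, v_11, v_12, v_13, v_14 and v_15 each have one value (0.0) that occurs far more often than any other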
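The category counts in this list can be read off programmatically with nunique() instead of scanning the printed value_counts(). A small sketch on synthetic data; the cut-off of 20 distinct values is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: two low-cardinality columns and one continuous column.
df = pd.DataFrame({
    "gearbox": rng.integers(0, 2, size=200),
    "bodyType": rng.integers(0, 8, size=200),
    "v_0": rng.normal(size=200),
})

MAX_CATEGORIES = 20  # assumed: at most this many distinct values counts as categorical

# Number of distinct values per column.
n_unique = df.nunique()
categorical = n_unique[n_unique <= MAX_CATEGORIES].index.tolist()
numeric = n_unique[n_unique > MAX_CATEGORIES].index.tolist()

print(n_unique)
print("categorical:", categorical)
print("numeric:", numeric)
```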

Understanding the distribution of the target

type(Train_data['price'])
pandas.core.series.Series
Train_data['price'].value_counts()
0        7312
500      3815
1500     3587
1000     3149
1200     3071
          ...
11320       1
7230        1
11448       1
9529        1
8188        1
Name: price, Length: 4585, dtype: int64
## 1) Distribution of price (unbounded Johnson distribution, i.e. Johnson SU, etc.)
import scipy.stats as st
y = Train_data['price']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
<AxesSubplot:title={'center':'Log Normal'}, xlabel='price'>


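figure syntax and usage

(1) figure syntax

figure(num=None, figsize=None, dpi=None, facecolor=None, edgecolor=None, frameon=True)

  • num: figure number or name; a number is treated as an ID, a string as a name
  • figsize: width and height of the figure, in inches
  • dpi: resolution of the figure in pixels per inch; if omitted, matplotlib falls back to rcParams['figure.dpi']. For scale, 1 inch is about 2.54 cm and A4 paper is 21 × 29.7 cm
  • facecolor: background color
  • edgecolor: border color
  • frameon: whether to draw the frame

(2) Example: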

fig=plt.figure(figsize=(4,3),facecolor='blue')
plt.plot([1,2,3,4],[3,5,7,9])
plt.show()


plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=True,rug=True, fit=st.lognorm)
# kde=True also draws the kernel density estimate; rug=True marks each observation on the axis
<AxesSubplot:title={'center':'Log Normal'}, xlabel='price', ylabel='Density'>


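Plotting with seaborn

  • seaborn.distplot(a, bins=None, hist=True, kde=True, rug=False, fit=None, hist_kws=None, kde_kws=None, rug_kws=None, fit_kws=None, color=None, vertical=False, norm_hist=False, axlabel=None, label=None, ax=None) (deprecated since seaborn 0.11; histplot and displot are the replacements)

Setting kde to True

  • Kernel density estimation (KDE)

  • Kernel density estimation is used in probability theory to estimate an unknown density function and is a nonparametric method. It uses no prior knowledge about the data's distribution and attaches no assumptions to it; it studies the shape of the distribution starting from the sample itself, which is why it is highly regarded in both statistical theory and applications.

    • hist: bool, optional # whether to draw the histogram; default True
    • kde: bool, optional # whether to draw the kernel density estimate; default True
    • rug: bool, optional # whether to draw a small tick for each observation (a rug plot); default False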
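To make the idea of kernel density estimation concrete, here is a hand-rolled Gaussian KDE in NumPy. It is a sketch, not what seaborn runs internally: the function name gaussian_kde_1d, the normal-reference bandwidth rule, and the synthetic sample are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    # Standardized distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Mean over samples, rescaled by the bandwidth so the result integrates to 1.
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(42)
samples = rng.normal(loc=0.0, scale=1.0, size=500)

# Normal-reference rule of thumb for the bandwidth (an assumed choice).
bandwidth = 1.06 * samples.std(ddof=1) * len(samples) ** (-1 / 5)

grid = np.linspace(-6.0, 6.0, 601)
density = gaussian_kde_1d(samples, grid, bandwidth)

# A density should integrate to roughly 1; check with a Riemann sum.
area = float(np.sum(density) * (grid[1] - grid[0]))
print("bandwidth:", round(float(bandwidth), 3), "area:", round(area, 3))
```

A smaller bandwidth gives a wigglier curve that follows individual points; a larger one smooths more and can hide real structure such as a second mode.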

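Handling the target distribution

Price is not normally distributed, so it should be transformed before regression. A log transform works well, but the best fit is the unbounded Johnson (Johnson SU) distribution.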
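A claim like "best fit" can be made quantitative by comparing the log-likelihood of each fitted distribution. The sketch below does this on synthetic right-skewed data, not on the competition prices, so the ranking it prints says nothing about the real data; fixing floc=0 for the log-normal fit is also an assumption made to keep that fit stable.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
# Synthetic right-skewed sample standing in for price; not the real data.
y = rng.lognormal(mean=1.0, sigma=0.6, size=2000)

# Candidate distributions and any extra keyword arguments for fit().
candidates = {
    "norm": (st.norm, {}),
    "lognorm": (st.lognorm, {"floc": 0}),  # assumed: location fixed at 0
    "johnsonsu": (st.johnsonsu, {}),
}

log_lik = {}
for name, (dist, fit_kwargs) in candidates.items():
    params = dist.fit(y, **fit_kwargs)  # maximum-likelihood fit
    log_lik[name] = float(np.sum(dist.logpdf(y, *params)))

# Higher log-likelihood means the fitted curve explains the sample better.
ranking = sorted(log_lik, key=log_lik.get, reverse=True)
for name in ranking:
    print(f"{name:10s} log-likelihood = {log_lik[name]:.1f}")
```

Johnson SU has four parameters against two for the normal, so a higher likelihood is partly expected; criteria such as AIC penalize the extra parameters.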

## 2) Check skewness and kurtosis
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())
Skewness: 3.535346
Kurtosis: 21.230678
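A skewness well above 0 and a large kurtosis point to a long right tail. The log transform mentioned earlier is one common remedy; the sketch below shows its effect on a synthetic log-normal sample (made-up data, so the numbers will differ from those for the real price column).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic right-skewed "price" column; not the competition data.
price = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=5000), name="price")

# log1p(x) = log(1 + x) stays defined when a price is 0.
log_price = np.log1p(price)

skew_before, skew_after = price.skew(), log_price.skew()
kurt_before, kurt_after = price.kurt(), log_price.kurt()

print("Skewness: %f -> %f" % (skew_before, skew_after))
print("Kurtosis: %f -> %f" % (kurt_before, kurt_after))
```

If a model is trained on log1p(price), its predictions have to be mapped back with np.expm1 before they are compared with real prices.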


Train_data.skew(), Train_data.kurt()
(SaleID                0.001712
 name                  0.513079
 regDate              -1.540844
 model                 1.499765
 brand                 1.314846
 bodyType             -0.070459
 fuelType              0.701802
 gearbox              -1.357379
 power                58.590829
 kilometer            -1.557472
 notRepairedDamage    -2.312519
 regionCode            0.690405
 creatDate           -95.428563
 price                 3.535346
 v_0                  -1.504738
 v_1                   1.582428
 v_2                   1.198679
 v_3                   1.352193
 v_4                   0.217941
 v_5                   2.052749
 v_6                   0.090718
 v_7                   0.823610
 v_8                  -1.532964
 v_9                   1.529931
 v_10                 -2.584452
 v_11                 -0.906428
 v_12                 -2.842834
 v_13                 -3.869655
 v_14                  0.491706
 v_15                  1.308716
 v_16                  1.662893
 v_17                  0.233318
 v_18                  0.814453
 v_19                  0.100073
 v_20                  2.001253
 v_21                  0.180020
 v_22                  0.819133
 v_23                  1.357847
 dtype: float64,
 SaleID                  -1.201476
 name                    -1.084474
 regDate                 11.041006
 model                    1.741896
 brand                    1.814245
 bodyType                -1.070358
 fuelType                -1.495782
 gearbox                 -0.157525
 power                 4473.885260
 kilometer                1.250933
 notRepairedDamage        3.347777
 regionCode              -0.352973
 creatDate            11376.694263
 price                   21.230678
 v_0                      2.901641
 v_1                      1.098703
 v_2                      3.749872
 v_3                      4.294578
 v_4                      6.953348
 v_5                      6.489791
 v_6                     -0.564878
 v_7                     -0.729838
 v_8                      0.370812
 v_9                      0.377943
 v_10                     4.796855
 v_11                     1.547812
 v_12                     6.136342
 v_13                    13.199575
 v_14                    -1.597532
 v_15                    -0.029594
 v_16                     2.240928
 v_17                     2.569341
 v_18                     2.967738
 v_19                     6.923953
 v_20                     6.852809
 v_21                    -0.759948
 v_22                    -0.741708
 v_23                     0.143713
 dtype: float64)
sns.distplot(Train_data.skew(),color='blue',axlabel ='Skewness')
# axlabel and label set the axis label and the legend label
<AxesSubplot:xlabel='Skewness', ylabel='Density'>


sns.distplot(Train_data.kurt(),color='orange',axlabel ='Kurtosis')
<AxesSubplot:xlabel='Kurtosis', ylabel='Density'>






http://www.chinasem.cn/article/574535
