Used-Car Price Prediction task02: Exploratory Data Analysis

2024-01-05 23:40


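  • task02 covered data analysis and plotting
      1. Learned how to use sns.pairplot()
      2. Learned how to use sns.distplot()
      3. Typed out the task's data-analysis code once and added explanatory comments
      4. Ran a prediction after dropping the two anomalous categorical feature columns and three columns chosen by their correlation with price; as the figure shows, the result did not improve. Further processing and feature engineering are needed (task03)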

[Figure: prediction result after dropping those columns]

Below is the data analysis process, following the tutorial.

# Import packages
import warnings
warnings.filterwarnings('ignore') 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
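  • Load the data
Train_data = pd.read_csv('car_train_0110.csv', sep=' ')
Test_data = pd.read_csv('car_testA_0110.csv', sep=' ')
# First five and last five rows (DataFrame.append was removed in pandas 2.0, so use pd.concat)
pd.concat([Train_data.head(), Train_data.tail()])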
[Output: table of the first five and last five rows of Train_data (index 0 to 4 and 249995 to 249999), with columns SaleID, name, regDate, model, brand, bodyType, fuelType, gearbox, power, kilometer, ..., v_14 through v_23]

10 rows × 40 columns

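  • name - car code
  • regDate - car registration date – ***
  • model - model code
  • brand - brand
  • bodyType - body type
  • fuelType - fuel type
  • gearbox - gearbox (transmission)
  • power - engine power
  • kilometer - kilometers driven
  • notRepairedDamage - the car has damage that has not yet been repaired – ***
  • regionCode - code of the region where the car can be viewed
  • seller - seller
  • offerType - offer type
  • creatDate - date the ad was posted
  • price - car price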
Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
       'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
       'v_13', 'v_14', 'v_15', 'v_16', 'v_17', 'v_18', 'v_19', 'v_20', 'v_21',
       'v_22', 'v_23'],
      dtype='object')
# Plain list of the named (non-anonymous) columns, SaleID through price
Train_data_part = ['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode', 'seller', 'offerType', 'creatDate', 'price']
Train_data_part
['SaleID','name','regDate','model','brand','bodyType','fuelType','gearbox','power','kilometer','notRepairedDamage','regionCode','seller','offerType','creatDate','price']
Train_data.describe()
              SaleID           name        regDate          model          brand       bodyType       fuelType        gearbox          power      kilometer  ...
count  250000.000000  250000.000000   2.500000e+05  250000.000000  250000.000000  224620.000000  227510.000000  236487.000000  250000.000000  250000.000000  ...
mean   185351.790768   83153.362172   2.003401e+07      44.911480       7.785236       4.563271       1.665008       0.780783     115.528412      12.577418  ...
std    107121.188763   72540.799964   7.770250e+04      50.640081       7.694010       1.912515       2.339646       0.413717     196.141828       3.990632  ...
min         1.000000       0.000000   1.910000e+07       0.000000       0.000000       0.000000       0.000000       0.000000       0.000000       0.500000  ...
25%     92501.750000   14500.000000   1.999061e+07       6.000000       1.000000       3.000000       0.000000       1.000000      70.000000      12.500000  ...
50%    185264.500000   65314.500000   2.003111e+07      27.000000       6.000000       4.000000       0.000000       1.000000     105.000000      15.000000  ...
75%    278128.500000  143761.250000   2.008081e+07      70.000000      11.000000       7.000000       5.000000       1.000000     150.000000      15.000000  ...
max    370946.000000  233044.000000   2.019121e+07     250.000000      39.000000       7.000000       6.000000       1.000000   20000.000000      15.000000  ...

       ...           v_14           v_15           v_16           v_17           v_18           v_19           v_20           v_21           v_22           v_23
count  ...  250000.000000  250000.000000  250000.000000  250000.000000  250000.000000  250000.000000  250000.000000  250000.000000  250000.000000  250000.000000
mean   ...       0.032489       0.030408       0.014725       0.000915       0.006273       0.006604      -0.001374       0.000609      -0.004025       0.001834
std    ...       0.038792       0.049333       8.779163       5.771081       4.880981       4.124722       3.803626       3.555353       2.864713       2.323680
min    ...       0.000000       0.000000     -10.412444     -15.538236     -21.009214     -13.989955      -9.599285     -11.181255      -7.671327      -2.350888
25%    ...       0.000129       0.000000      -5.552269      -0.901181      -3.150385      -0.478173      -1.727237      -3.067073      -2.092178      -1.402804
50%    ...       0.001961       0.002567      -3.821770       0.223181      -0.058502       0.038427      -0.995044      -0.880587      -1.199807      -1.145588
75%    ...       0.075672       0.056568       3.599747       1.263737       2.800475       0.569198       1.563382       3.269987       2.737614       0.044865
max    ...       0.130785       0.184340      36.756878      26.134561      23.055660      16.576027      20.324572      14.039422       8.764597       8.574730

8 rows × 40 columns

Test_data.describe()

The max of power (20000) looks anomalous.
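One common way to follow up on a suspicion like this is an interquartile-range (IQR) rule. The sketch below is only an illustration on a small made-up Series, not the tutorial's method; the helper name `iqr_outlier_mask` and the factor `k=3.0` are assumptions.

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 3.0) -> pd.Series:
    """Return True where a value lies more than k * IQR outside the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Toy values only; 20000 mimics the suspicious maximum reported by describe()
power = pd.Series([70, 75, 105, 110, 150, 140, 60, 20000])
mask = iqr_outlier_mask(power)
print(power[mask])
```

Whether flagged rows should be clipped, dropped, or kept is a modeling decision; the rule only points at candidates.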

Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 40 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             250000 non-null  int64  
 1   name               250000 non-null  int64  
 2   regDate            250000 non-null  int64  
 3   model              250000 non-null  float64
 4   brand              250000 non-null  int64  
 5   bodyType           224620 non-null  float64
 6   fuelType           227510 non-null  float64
 7   gearbox            236487 non-null  float64
 8   power              250000 non-null  int64  
 9   kilometer          250000 non-null  float64
 10  notRepairedDamage  201464 non-null  float64
 11  regionCode         250000 non-null  int64  
 12  seller             250000 non-null  int64  
 13  offerType          250000 non-null  int64  
 14  creatDate          250000 non-null  int64  
 15  price              250000 non-null  int64  
 16  v_0                250000 non-null  float64
 17  v_1                250000 non-null  float64
 18  v_2                250000 non-null  float64
 19  v_3                250000 non-null  float64
 20  v_4                250000 non-null  float64
 21  v_5                250000 non-null  float64
 22  v_6                250000 non-null  float64
 23  v_7                250000 non-null  float64
 24  v_8                250000 non-null  float64
 25  v_9                250000 non-null  float64
 26  v_10               250000 non-null  float64
 27  v_11               250000 non-null  float64
 28  v_12               250000 non-null  float64
 29  v_13               250000 non-null  float64
 30  v_14               250000 non-null  float64
 31  v_15               250000 non-null  float64
 32  v_16               250000 non-null  float64
 33  v_17               250000 non-null  float64
 34  v_18               250000 non-null  float64
 35  v_19               250000 non-null  float64
 36  v_20               250000 non-null  float64
 37  v_21               250000 non-null  float64
 38  v_22               250000 non-null  float64
 39  v_23               250000 non-null  float64
dtypes: float64(30), int64(10)
memory usage: 76.3 MB
Test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 39 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   SaleID             50000 non-null  int64  
 1   name               50000 non-null  int64  
 2   regDate            50000 non-null  int64  
 3   model              50000 non-null  float64
 4   brand              50000 non-null  int64  
 5   bodyType           44890 non-null  float64
 6   fuelType           45598 non-null  float64
 7   gearbox            47287 non-null  float64
 8   power              50000 non-null  int64  
 9   kilometer          50000 non-null  float64
 10  notRepairedDamage  40372 non-null  float64
 11  regionCode         50000 non-null  int64  
 12  seller             50000 non-null  int64  
 13  offerType          50000 non-null  int64  
 14  creatDate          50000 non-null  int64  
 15  v_0                50000 non-null  float64
 16  v_1                50000 non-null  float64
 17  v_2                50000 non-null  float64
 18  v_3                50000 non-null  float64
 19  v_4                50000 non-null  float64
 20  v_5                50000 non-null  float64
 21  v_6                50000 non-null  float64
 22  v_7                50000 non-null  float64
 23  v_8                50000 non-null  float64
 24  v_9                50000 non-null  float64
 25  v_10               50000 non-null  float64
 26  v_11               50000 non-null  float64
 27  v_12               50000 non-null  float64
 28  v_13               50000 non-null  float64
 29  v_14               50000 non-null  float64
 30  v_15               50000 non-null  float64
 31  v_16               50000 non-null  float64
 32  v_17               50000 non-null  float64
 33  v_18               50000 non-null  float64
 34  v_19               50000 non-null  float64
 35  v_20               50000 non-null  float64
 36  v_21               50000 non-null  float64
 37  v_22               50000 non-null  float64
 38  v_23               50000 non-null  float64
dtypes: float64(30), int64(9)
memory usage: 14.9 MB
# Check each column for NaN values
Train_data.isnull()
SaleIDnameregDatemodelbrandbodyTypefuelTypegearboxpowerkilometer...v_14v_15v_16v_17v_18v_19v_20v_21v_22v_23
0FalseFalseFalseFalseFalseTrueFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
1FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
2FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
3FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
4FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
..................................................................
249995FalseFalseFalseFalseFalseTrueTrueFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
249996FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
249997FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
249998FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
249999FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse

250000 rows × 40 columns

Train_data.isnull().sum() # sum() adds up each column, giving the NaN count per column
SaleID                   0
name                     0
regDate                  0
model                    0
brand                    0
bodyType             25380
fuelType             22490
gearbox              13513
power                    0
kilometer                0
notRepairedDamage    48536
regionCode               0
seller                   0
offerType                0
creatDate                0
price                    0
v_0                      0
v_1                      0
v_2                      0
v_3                      0
v_4                      0
v_5                      0
v_6                      0
v_7                      0
v_8                      0
v_9                      0
v_10                     0
v_11                     0
v_12                     0
v_13                     0
v_14                     0
v_15                     0
v_16                     0
v_17                     0
v_18                     0
v_19                     0
v_20                     0
v_21                     0
v_22                     0
v_23                     0
dtype: int64

Visualizing NaN values

missing = Train_data.isnull().sum() # number of NaN values per column
missing = missing[missing > 0] # keep only the columns that have missing values
type(missing)
pandas.core.series.Series
missing
bodyType             25380
fuelType             22490
gearbox              13513
notRepairedDamage    48536
dtype: int64
# inplace=True modifies the Series itself instead of returning a sorted copy
missing.sort_values(inplace=True)
missing # after sorting (ascending by count); the unsorted order is the output shown just before the sort
gearbox              13513
fuelType             22490
bodyType             25380
notRepairedDamage    48536
dtype: int64
# Plot: feature names on the x-axis, counts on the y-axis
missing.plot.bar()

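[Figure: bar chart of missing-value counts per column (output_25_1.png)]
The two statements above make it easy to see which columns contain NaN and to print how many each has. The main question is whether the NaN count is really large. If it is small, the usual choice is to fill the values in. With tree models such as LightGBM (lgb) you can leave the gaps as they are and let the trees handle them. If a column has too many NaN values, consider dropping it.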
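The fill-or-drop decision described above can be sketched as follows. This is a minimal illustration on a made-up frame, not the tutorial's code; the column `mostly_empty`, the 0.5 cut-off in `DROP_THRESHOLD`, and the choice of a mode fill are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for Train_data; names and values are made up
df = pd.DataFrame({
    "gearbox": [1.0, 0.0, np.nan, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
    "mostly_empty": [np.nan] * 8 + [1.0, 2.0],
    "power": [75, 60, 150, 140, 0, 105, 90, 116, 170, 101],
})

missing_ratio = df.isnull().mean()   # fraction of NaN per column

DROP_THRESHOLD = 0.5                 # assumed cut-off; tune per project
to_drop = missing_ratio[missing_ratio > DROP_THRESHOLD].index.tolist()
to_fill = missing_ratio[(missing_ratio > 0) & (missing_ratio <= DROP_THRESHOLD)].index.tolist()

cleaned = df.drop(columns=to_drop)
for col in to_fill:
    # A mode fill suits a category code such as gearbox; a median may suit numeric columns
    cleaned[col] = cleaned[col].fillna(cleaned[col].mode().iloc[0])

print(missing_ratio)
print(to_drop, to_fill)
```

For a tree model that accepts NaN natively, the fill loop could simply be skipped.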

# Visualize the missing values
msno.matrix(Train_data.sample(250))

[Figure: missingno matrix for a 250-row sample of Train_data]

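msno.bar(Train_data.sample(1000))
# Shows which columns have fewer than 1000 non-null values in the 1000-row sample; each bar is labeled with its count

[Figure: missingno bar chart for a 1000-row sample of Train_data]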

# Visualize the missing values in the test set
msno.matrix(Test_data)

[Figure: missingno matrix for Test_data]

msno.bar(Test_data.sample(1000))

[Figure: missingno bar chart for a 1000-row sample of Test_data]

  • The missing values are distributed very similarly in the training set and the test set
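The same comparison can be made numerically. The sketch below reuses the non-null counts printed by the two info() calls earlier in this post; the variable names are illustrative.

```python
import pandas as pd

# Non-null counts copied from the Train_data.info() and Test_data.info() outputs
train_non_null = pd.Series({"bodyType": 224620, "fuelType": 227510,
                            "gearbox": 236487, "notRepairedDamage": 201464})
test_non_null = pd.Series({"bodyType": 44890, "fuelType": 45598,
                           "gearbox": 47287, "notRepairedDamage": 40372})

compare = pd.DataFrame({
    "train_missing": 1 - train_non_null / 250000,   # 250000 training rows
    "test_missing": 1 - test_non_null / 50000,      # 50000 test rows
})
compare["abs_diff"] = (compare["train_missing"] - compare["test_missing"]).abs()
print(compare.round(4))
```

The missing share of every affected column differs by well under one percentage point between the two sets, which matches what the plots suggest.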

Outlier detection

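  • .value_counts() returns the distinct values in a feature column together with their counts
# .value_counts() lists the distinct values in this feature column and how often each occurs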
Train_data['notRepairedDamage'].value_counts()
1.0    176922
0.0     24542
Name: notRepairedDamage, dtype: int64
# Train_data.value_counts()
# In the original used-car data this feature is categorical and '-' also marks a missing value;
# the line below replaces '-' with NaN
# Train_data['notRepairedDamage'].replace('-', np.nan, inplace=True)
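In this file info() already reports notRepairedDamage as float64 with only 0.0 and 1.0, which is presumably why the replace stays commented out. For a raw file where the placeholder is still present, the step might look like the sketch below; the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy column: the codes arrive as strings, with '-' meaning "unknown"
raw = pd.Series(["0.0", "-", "1.0", "0.0", "-"], name="notRepairedDamage")

cleaned = raw.replace("-", np.nan).astype(float)   # '-' becomes NaN, the rest become floats
print(cleaned.isnull().sum())
print(cleaned.value_counts())
```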

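The following two categorical features are severely skewed and generally do nothing for prediction, so they are dropped here. You could keep digging into them, but it is rarely worthwhile.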

Train_data["seller"].value_counts()
1    249999
0         1
Name: seller, dtype: int64
Test_data["seller"].value_counts()
1    50000
Name: seller, dtype: int64
Train_data["offerType"].value_counts()
0    249991
1         9
Name: offerType, dtype: int64
Test_data['offerType'].value_counts()
0    49999
1        1
Name: offerType, dtype: int64
del Train_data["seller"] 
del Train_data["offerType"] 
del Test_data["seller"] 
del Test_data["offerType"]
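Instead of inspecting value_counts() column by column, the share of the most frequent value can be used to flag such columns. This is a sketch on made-up data; the helper `dominant_share` and the 0.99 cut-off are assumptions, not part of the tutorial.

```python
import pandas as pd

def dominant_share(s: pd.Series) -> float:
    """Share of non-null rows taken by the most frequent value."""
    return float(s.value_counts(normalize=True).iloc[0])

# Toy frame; seller mimics the 249999-to-1 split at a smaller scale
df = pd.DataFrame({
    "seller": [1] * 999 + [0],
    "gearbox": [1] * 780 + [0] * 220,
})

THRESHOLD = 0.99   # assumed cut-off for "severely skewed"
skewed_cols = [c for c in df.columns if dominant_share(df[c]) > THRESHOLD]
print(skewed_cols)
```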

value_counts() for all features

for f in Train_data.columns:
    print(f)
    print(Train_data[f].value_counts())
SaleID
2049      1
265515    1
277805    1
271662    1
312626    1
         ..
107105    1
113250    1
111203    1
98917     1
2047      1
Name: SaleID, Length: 250000, dtype: int64
name
451       452
73        429
1791      428
821       391
243       346
          ...
92419       1
88325       1
82182       1
84231       1
157427      1
Name: name, Length: 164312, dtype: int64
regDate
20000010    306
20000001    288
20000002    288
20000007    279
20000008    278
            ...
19850904      1
19851010      1
19750511      1
19870912      1
19400705      1
Name: regDate, Length: 7537, dtype: int64
model
0.0      20344
6.0      17741
4.0      13837
1.0      13634
12.0      8841
           ...
226.0        5
245.0        5
243.0        4
249.0        4
250.0        1
Name: model, Length: 251, dtype: int64
brand
0     53699
4     27109
11    26944
10    23762
1     22144
6     17202
9     12210
5      7343
15     6500
12     4704
7      3839
3      3831
17     3543
13     3502
8      3374
28     3161
19     2561
18     2451
16     2274
22     2264
23     2088
14     1892
24     1678
25     1611
20     1610
27     1392
29     1259
34      963
30      604
2       570
31      540
21      522
38      516
35      415
32      406
36      377
33      368
37      324
26      307
39      141
Name: brand, dtype: int64
bodyType
7.0    64571
3.0    53858
4.0    45646
5.0    20343
6.0    15290
2.0    12755
1.0     9882
0.0     2275
Name: bodyType, dtype: int64
fuelType
0.0    150664
5.0     72494
4.0      3577
3.0       385
2.0       183
1.0       147
6.0        60
Name: fuelType, dtype: int64
gearbox
1.0    184645
0.0     51842
Name: gearbox, dtype: int64
power
0        27280
75       16158
60       10765
150      10373
140       9145
           ...
1986         1
1090         1
10311        1
960          1
3454         1
Name: power, Length: 703, dtype: int64
kilometer
15.0    162161
12.5     25743
10.0     10777
9.0       8424
8.0       7434
7.0       6642
6.0       5859
5.0       5100
0.5       4634
4.0       4204
3.0       4021
2.0       3749
1.0       1252
Name: kilometer, dtype: int64
notRepairedDamage
1.0    176922
0.0     24542
Name: notRepairedDamage, dtype: int64
regionCode
487     550
868     424
149     236
539     227
32      216
        ...
7959      1
8002      1
6715      1
7117      1
4144      1
Name: regionCode, Length: 8081, dtype: int64
creatDate
20160403    9758
20160404    9521
20160320    9176
20160312    8946
20160321    8895
             ...
20150618       1
20160114       1
20160201       1
20150611       1
20140310       1
Name: creatDate, Length: 107, dtype: int64
price
0        7312
500      3815
1500     3587
1000     3149
1200     3071
          ...
11320       1
7230        1
11448       1
9529        1
8188        1
Name: price, Length: 4585, dtype: int64
v_0
71.666307    2
72.346416    2
78.107692    2
71.715545    2
73.734706    2
            ..
71.161494    1
70.253614    1
70.797686    1
74.588185    1
77.825581    1
Name: v_0, Length: 249747, dtype: int64
v_1
-1.470958     2
-3.128523     2
-3.224945     2
-3.293795     2
-3.322763     2
             ..
-3.970355     1
 11.487790    1
-3.456756     1
-3.746283     1
-3.579301     1
Name: v_1, Length: 249747, dtype: int64
v_2
-0.527186    2
-0.998414    2
 0.652201    2
 0.356107    2
-9.312859    2
            ..
-0.815970    1
-6.729062    1
-1.683035    1
 0.171102    1
 0.852139    1
Name: v_2, Length: 249747, dtype: int64
v_3
 3.580573     2
-0.633228     2
-2.541859     2
 0.161395     2
 20.571558    2
             ..
 1.067454     1
-0.826230     1
-6.306510     1
 0.201140     1
 4.146853     1
Name: v_3, Length: 249747, dtype: int64
v_4
 2.038620     2
-0.591751     2
-12.603294    2
-0.321072     2
-0.429618     2
             ..
 0.742918     1
 2.722358     1
-0.317880     1
-0.356648     1
 1.327513     1
Name: v_4, Length: 249747, dtype: int64
v_5
 1.273623    2
-1.589295    2
-2.350140    2
 0.080770    2
-2.434300    2
            ..
-1.374641    1
-2.369201    1
-2.194464    1
 1.226827    1
-1.218480    1
Name: v_5, Length: 249747, dtype: int64
v_6
 3.854950    2
-2.337177    2
-2.840736    2
-2.988814    2
 0.912034    2
            ..
-8.718013    1
 3.185567    1
 3.443525    1
-2.653621    1
-3.138425    1
Name: v_6, Length: 249747, dtype: int64
v_7
-2.915058    2
-2.518469    2
-1.175198    2
-3.672233    2
-2.563102    2
            ..
-1.171334    1
-2.324847    1
 4.015706    1
-1.895407    1
-2.156468    1
Name: v_7, Length: 249747, dtype: int64
v_8
0.000000    48244
0.315924        2
0.315905        2
0.314498        2
0.315560        2
              ...
0.315494        1
0.289243        1
0.316095        1
0.316209        1
0.315702        1
Name: v_8, Length: 201543, dtype: int64
v_9
1.101174    2
0.118624    2
0.164335    2
0.114609    2
0.112811    2
           ..
1.110851    1
1.101634    1
0.116084    1
0.090707    1
0.112558    1
Name: v_9, Length: 249747, dtype: int64
v_10
0.000000    25342
0.081665        2
0.086726        2
0.081616        2
0.081701        2
              ...
0.089640        1
0.091852        1
0.082066        1
0.081448        1
0.087517        1
Name: v_10, Length: 224427, dtype: int64
v_11
0.000000    7421
0.121584       2
0.102037       2
0.166840       2
0.134519       2
             ...
0.092895       1
0.108411       1
0.131894       1
0.075781       1
0.078286       1
Name: v_11, Length: 242335, dtype: int64
v_12
0.000000    22426
0.053098        2
0.053437        2
0.055474        2
0.053432        2
              ...
0.053485        1
0.053471        1
0.055616        1
0.053447        1
0.053329        1
Name: v_12, Length: 227338, dtype: int64
v_13
0.000000    13495
0.130205        2
0.123467        2
0.123337        2
0.130232        2
              ...
0.123242        1
0.123755        1
0.123252        1
0.123047        1
0.123567        1
Name: v_13, Length: 236266, dtype: int64
v_14
0.000000    53857
0.003751        2
0.000746        2
0.002838        2
0.002283        2
              ...
0.094690        1
0.000690        1
0.086957        1
0.002928        1
0.083676        1
Name: v_14, Length: 195953, dtype: int64
v_15
0.000000    97223
0.010717        2
0.012704        2
0.143362        2
0.005417        2
              ...
0.001720        1
0.003263        1
0.007882        1
0.005242        1
0.094839        1
Name: v_15, Length: 152622, dtype: int64
v_16
-3.254226     2
-2.855248     2
-4.373334     2
 7.744677     2
-2.847659     2
             ..
 10.816862    1
-6.670231     1
-6.291694     1
-4.147668     1
-6.389964     1
Name: v_16, Length: 249747, dtype: int64
v_17
 0.498971     2
 0.217974     2
-13.000712    2
-0.675390     2
-1.530593     2
             ..
 1.397213     1
 0.610112     1
 2.335480     1
-1.500048     1
 5.289472     1
Name: v_17, Length: 249747, dtype: int64
v_18
-3.753102    2
 7.731945    2
-0.058593    2
-1.171759    2
-2.045338    2
            ..
-0.860834    1
 0.643066    1
 5.023034    1
-2.016881    1
 6.565301    1
Name: v_18, Length: 249747, dtype: int64
v_19
 0.082562    2
-0.469708    2
-0.138257    2
-0.657417    2
 0.862429    2
            ..
-1.249647    1
-0.664831    1
-0.660867    1
 0.040847    1
 0.700206    1
Name: v_19, Length: 249747, dtype: int64
v_20
-1.214032    2
-2.031659    2
-2.426898    2
-1.542005    2
-0.657360    2
            ..
-2.098785    1
 0.725159    1
-4.682086    1
 0.342639    1
 1.612570    1
Name: v_20, Length: 249747, dtype: int64
v_21
-3.244933    2
-3.440059    2
-4.070917    2
-3.001142    2
-4.153741    2
            ..
-3.289808    1
 4.942931    1
 2.670356    1
-3.793230    1
 3.226273    1
Name: v_21, Length: 249747, dtype: int64
v_22
-2.957315    2
 4.311760    2
-2.101273    2
-0.936764    2
-2.562937    2
            ..
 6.155533    1
-2.379816    1
-1.419529    1
 4.872987    1
 5.062041    1
Name: v_22, Length: 249747, dtype: int64
v_23
-1.044909    2
-1.259228    2
-1.183081    2
-0.989739    2
-0.958179    2
            ..
-1.946367    1
-1.503159    1
-1.175261    1
-0.908192    1
-1.182790    1
Name: v_23, Length: 249747, dtype: int64
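  • bodyType : eight categories
  • fuelType : seven categories
  • gearbox : two categories
  • kilometer : 13 categories
  • notRepairedDamage : two categories
  • seller : two categories, but severely skewed **
  • offerType : two categories, but severely skewed **
  • v_8, v_10, v_11, v_12, v_13, v_14, v_15 : each has one value (0.0) whose count is far larger than that of any other value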

Understanding the distribution of the target

type(Train_data['price'])
pandas.core.series.Series
Train_data['price'].value_counts()
0        7312
500      3815
1500     3587
1000     3149
1200     3071
          ...
11320       1
7230        1
11448       1
9529        1
8188        1
Name: price, Length: 4585, dtype: int64
## 1) Distribution of price (Johnson SU and others)
import scipy.stats as st
y = Train_data['price']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
<AxesSubplot:title={'center':'Log Normal'}, xlabel='price'>

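[Figure: price histogram with Johnson SU fit]

[Figure: price histogram with normal fit]

[Figure: price histogram with log-normal fit]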
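The three fits can also be compared numerically rather than by eye. The sketch below fits the same three scipy distributions by maximum likelihood and compares their log-likelihoods. It runs on a synthetic right-skewed sample, not the competition data, and pinning the log-normal location at 0 with `floc=0` is an assumption made for a stable fit.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
# Synthetic right-skewed sample standing in for price; NOT the real data
y = rng.lognormal(mean=1.0, sigma=0.8, size=2000)

candidates = {
    "norm": (st.norm, {}),
    "lognorm": (st.lognorm, {"floc": 0}),   # location pinned at 0 (assumption)
    "johnsonsu": (st.johnsonsu, {}),
}

loglik = {}
for name, (dist, fit_kwargs) in candidates.items():
    params = dist.fit(y, **fit_kwargs)                  # maximum-likelihood parameters
    loglik[name] = float(dist.logpdf(y, *params).sum())

# Higher log-likelihood means the fitted curve explains the sample better
for name, ll in sorted(loglik.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:10s} log-likelihood = {ll:.1f}")
```

On real prices that include zeros, `floc=0` would not be usable for the log-normal fit; the location would have to be left free or the zeros handled first.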

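figure syntax and usage

(1) figure syntax

figure(num=None, figsize=None, dpi=None, facecolor=None, edgecolor=None, frameon=True)

  • num: figure number or name; a number is used as the ID, a string as the name
  • figsize: width and height of the figure, in inches
  • dpi: resolution of the figure in pixels per inch; when left as None, matplotlib falls back to rcParams["figure.dpi"], which is 100 in its stock settings. For scale, 1 inch is about 2.54 cm and an A4 sheet is 21 × 29.7 cm
  • facecolor: background color
  • edgecolor: border color
  • frameon: whether to draw the frame

(2) Example: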

fig=plt.figure(figsize=(4,3),facecolor='blue')
plt.plot([1,2,3,4],[3,5,7,9])
plt.show()

[Figure: line plot on a 4 × 3 inch figure with a blue background]
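To make the relation between figsize and dpi concrete, the pixel size of a figure is its size in inches multiplied by the dpi. The sketch below checks this for a 4 × 3 inch figure at 100 dpi; the Agg backend is selected only so that it runs without a display, and the figure name "demo" is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

fig = plt.figure(num="demo", figsize=(4, 3), dpi=100)
width_px, height_px = fig.get_size_inches() * fig.dpi   # inches * pixels per inch
print(width_px, height_px)
plt.close(fig)
```

Doubling dpi while keeping figsize fixed doubles both pixel dimensions, which is the usual way to get a sharper saved image of the same layout.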

plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=True,rug=True, fit=st.lognorm)
# kde=True
<AxesSubplot:title={'center':'Log Normal'}, xlabel='price', ylabel='Density'>

[Figure: price histogram with KDE curve, rug marks, and log-normal fit]

Plotting with seaborn

  • seaborn.distplot(a, bins=None, hist=True, kde=True, rug=False, fit=None, hist_kws=None, kde_kws=None, rug_kws=None, fit_kws=None, color=None, vertical=False, norm_hist=False, axlabel=None, label=None, ax=None)

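Setting kde to True

  • Kernel density estimation (KDE)

  • Kernel density estimation is used in probability theory to estimate an unknown density function and is one of the nonparametric methods. Because it uses no prior knowledge about the data distribution and imposes no assumptions on it, it studies the characteristics of the distribution starting from the data sample itself, and it is therefore highly regarded in both statistical theory and applications.

    • hist: bool, optional. Whether to draw the histogram; default True
    • kde: bool, optional. Whether to draw the kernel density estimate; default True
    • rug: bool, optional. Whether to draw a small tick for each observation (a rug plot); default False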
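As an illustration of what a kernel density estimate computes, the sketch below builds one by hand with a Gaussian kernel: the estimate at a point x is the average of bell curves of width h centred on the samples, f(x) = 1/(n·h) · Σ K((x − x_i)/h). The data are synthetic, the function name `gaussian_kde_1d` is made up, and the rule-of-thumb bandwidth is one common choice, not necessarily what seaborn uses.

```python
import numpy as np

def gaussian_kde_1d(samples: np.ndarray, grid: np.ndarray, bandwidth: float) -> np.ndarray:
    """Average of Gaussian bumps of width `bandwidth` centred on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=500)   # synthetic data
grid = np.linspace(-6, 6, 601)

# Rule-of-thumb bandwidth (an assumption): 1.06 * std * n^(-1/5)
bandwidth = 1.06 * samples.std(ddof=1) * len(samples) ** (-1 / 5)

density = gaussian_kde_1d(samples, grid, bandwidth)
area = float((density * (grid[1] - grid[0])).sum())  # Riemann sum; close to 1 for a density
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual samples; a larger one gives a smoother curve that may hide real structure.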

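Handling the target distribution

price is not normally distributed, so it needs to be transformed before running a regression. A log transform works well, but the best fit is the Johnson SU (unbounded Johnson) distribution.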
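A minimal sketch of the log transform mentioned above, on synthetic data rather than the real prices. `np.log1p` is used instead of a plain log as an assumption, because the price column contains zeros and log(0) is undefined; `np.expm1` is its inverse.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed target standing in for price, with one zero added
price = pd.Series(np.round(rng.lognormal(mean=8.0, sigma=1.0, size=5000)))
price.iloc[0] = 0

log_price = np.log1p(price)      # log(1 + x) stays finite when price is 0
print(price.skew(), log_price.skew())

restored = np.expm1(log_price)   # inverse transform, for turning predictions back into prices
```

A model trained on `log_price` predicts in log space, so its outputs need the inverse transform before they are compared with actual prices.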

## 2) Check skewness and kurtosis
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())
Skewness: 3.535346
Kurtosis: 21.230678

[Figure: distribution of price]
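For reading these numbers: pandas reports sample-size-corrected skewness and excess kurtosis, where a normal distribution scores 0 on both, so values of about 3.5 and 21.2 indicate a long right tail and much heavier tails than normal. The sketch below checks the pandas methods against the corresponding scipy functions on a synthetic sample.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

rng = np.random.default_rng(0)
x = pd.Series(rng.exponential(scale=1.0, size=1000))   # synthetic skewed sample

# bias=False applies the same sample-size correction that pandas uses
print(x.skew(), st.skew(x, bias=False))
# fisher=True gives excess kurtosis (normal distribution = 0), as pandas does
print(x.kurt(), st.kurtosis(x, fisher=True, bias=False))
```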

Train_data.skew(), Train_data.kurt()
(SaleID                0.001712
name                  0.513079
regDate              -1.540844
model                 1.499765
brand                 1.314846
bodyType             -0.070459
fuelType              0.701802
gearbox              -1.357379
power                58.590829
kilometer            -1.557472
notRepairedDamage    -2.312519
regionCode            0.690405
creatDate           -95.428563
price                 3.535346
v_0                  -1.504738
v_1                   1.582428
v_2                   1.198679
v_3                   1.352193
v_4                   0.217941
v_5                   2.052749
v_6                   0.090718
v_7                   0.823610
v_8                  -1.532964
v_9                   1.529931
v_10                 -2.584452
v_11                 -0.906428
v_12                 -2.842834
v_13                 -3.869655
v_14                  0.491706
v_15                  1.308716
v_16                  1.662893
v_17                  0.233318
v_18                  0.814453
v_19                  0.100073
v_20                  2.001253
v_21                  0.180020
v_22                  0.819133
v_23                  1.357847
dtype: float64,
SaleID                  -1.201476
name                    -1.084474
regDate                 11.041006
model                    1.741896
brand                    1.814245
bodyType                -1.070358
fuelType                -1.495782
gearbox                 -0.157525
power                 4473.885260
kilometer                1.250933
notRepairedDamage        3.347777
regionCode              -0.352973
creatDate            11376.694263
price                   21.230678
v_0                      2.901641
v_1                      1.098703
v_2                      3.749872
v_3                      4.294578
v_4                      6.953348
v_5                      6.489791
v_6                     -0.564878
v_7                     -0.729838
v_8                      0.370812
v_9                      0.377943
v_10                     4.796855
v_11                     1.547812
v_12                     6.136342
v_13                    13.199575
v_14                    -1.597532
v_15                    -0.029594
v_16                     2.240928
v_17                     2.569341
v_18                     2.967738
v_19                     6.923953
v_20                     6.852809
v_21                    -0.759948
v_22                    -0.741708
v_23                     0.143713
dtype: float64)
sns.distplot(Train_data.skew(),color='blue',axlabel ='Skewness')
# axlabel sets the x-axis label; label sets the legend label
<AxesSubplot:xlabel='Skewness', ylabel='Density'>

[Figure: distribution of the per-column skewness values]

sns.distplot(Train_data.kurt(),color='orange',axlabel ='Kurtosis')
<AxesSubplot:xlabel='Kurtosis', ylabel='Density'>

[Figure: distribution of the per-column kurtosis values]

