This post covers Task 02 of the used-car price prediction exercise: exploratory data analysis (EDA).
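- Task 02 covered analyzing and plotting the data
- Learned how to use sns.pairplot()
- Learned how to use sns.distplot()
- Typed out the Task 02 data analysis by hand and added explanatory comments
- Dropped the two columns with anomalous categorical features and three columns flagged for their correlation with price, then re-ran the prediction; the score did not improve, so further processing and feature engineering (Task 03) are needed
What follows is the data analysis, done by following the tutorial.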
# Import packages
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
- Load the data
Train_data = pd.read_csv('car_train_0110.csv', sep=' ')
Test_data = pd.read_csv('car_testA_0110.csv', sep=' ')
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same head + tail view
pd.concat([Train_data.head(), Train_data.tail()])
 | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_14 | v_15 | v_16 | v_17 | v_18 | v_19 | v_20 | v_21 | v_22 | v_23 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 134890 | 734 | 20160002 | 13.0 | 9 | NaN | 0.0 | 1.0 | 0 | 15.0 | ... | 0.092139 | 0.000000 | 18.763832 | -1.512063 | -1.008718 | -12.100623 | -0.947052 | 9.077297 | 0.581214 | 3.945923 |
1 | 306648 | 196973 | 20080307 | 72.0 | 9 | 7.0 | 5.0 | 1.0 | 173 | 15.0 | ... | 0.001070 | 0.122335 | -5.685612 | -0.489963 | -2.223693 | -0.226865 | -0.658246 | -3.949621 | 4.593618 | -1.145653 |
2 | 340675 | 25347 | 20020312 | 18.0 | 12 | 3.0 | 0.0 | 1.0 | 50 | 12.5 | ... | 0.064410 | 0.003345 | -3.295700 | 1.816499 | 3.554439 | -0.683675 | 0.971495 | 2.625318 | -0.851922 | -1.246135 |
3 | 57332 | 5382 | 20000611 | 38.0 | 8 | 7.0 | 0.0 | 1.0 | 54 | 15.0 | ... | 0.069231 | 0.000000 | -3.405521 | 1.497826 | 4.782636 | 0.039101 | 1.227646 | 3.040629 | -0.801854 | -1.251894 |
4 | 265235 | 173174 | 20030109 | 87.0 | 0 | 5.0 | 5.0 | 1.0 | 131 | 3.0 | ... | 0.000099 | 0.001655 | -4.475429 | 0.124138 | 1.364567 | -0.319848 | -1.131568 | -3.303424 | -1.998466 | -1.279368 |
249995 | 10556 | 9332 | 20170003 | 13.0 | 9 | NaN | NaN | 1.0 | 58 | 15.0 | ... | 0.079119 | 0.001447 | 11.782508 | 20.402576 | -2.722772 | 0.462388 | -4.429385 | 7.883413 | 0.698405 | -1.082013 |
249996 | 146710 | 102110 | 20030511 | 29.0 | 17 | 3.0 | 0.0 | 0.0 | 61 | 15.0 | ... | 0.000000 | 0.002342 | -2.988272 | 1.500532 | 3.502201 | -0.761715 | -2.484556 | -2.532968 | -0.940266 | -1.106426 |
249997 | 116066 | 82802 | 20130312 | 124.0 | 16 | 6.0 | 0.0 | 1.0 | 122 | 3.0 | ... | 0.003358 | 0.100760 | -6.939560 | -1.144959 | -5.337949 | 0.896026 | -0.592565 | -3.872725 | 2.135984 | 3.807554 |
249998 | 90082 | 65971 | 20121212 | 111.0 | 4 | 7.0 | 5.0 | 0.0 | 184 | 9.0 | ... | 0.002974 | 0.008251 | -7.222167 | -1.383696 | -5.402794 | -0.409451 | -1.891556 | -3.104789 | -3.777374 | 3.186218 |
249999 | 76453 | 56954 | 20051111 | 13.0 | 9 | 3.0 | 0.0 | 1.0 | 58 | 12.5 | ... | 0.000000 | 0.009071 | 10.491312 | -11.270043 | -0.272595 | -0.026478 | -2.168249 | -0.980042 | -0.955164 | -1.169593 |
10 rows × 40 columns
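- name - car code
- regDate - vehicle registration date ***
- model - model code
- brand - brand
- bodyType - body type
- fuelType - fuel type
- gearbox - gearbox (transmission)
- power - engine power
- kilometer - distance driven
- notRepairedDamage - whether the car has unrepaired damage ***
- regionCode - code of the viewing region
- seller - seller type
- offerType - offer type
- creatDate - date the listing was posted
- price - car price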
Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode','seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3','v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12','v_13', 'v_14', 'v_15', 'v_16', 'v_17', 'v_18', 'v_19', 'v_20', 'v_21','v_22', 'v_23'],dtype='object')
# Names of the non-anonymous columns (everything except v_0 ... v_23)
Train_data_part = ['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode','seller', 'offerType', 'creatDate', 'price']
Train_data_part
['SaleID','name','regDate','model','brand','bodyType','fuelType','gearbox','power','kilometer','notRepairedDamage','regionCode','seller','offerType','creatDate','price']
Train_data.describe()
 | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_14 | v_15 | v_16 | v_17 | v_18 | v_19 | v_20 | v_21 | v_22 | v_23 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 250000.000000 | 250000.000000 | 2.500000e+05 | 250000.000000 | 250000.000000 | 224620.000000 | 227510.000000 | 236487.000000 | 250000.000000 | 250000.000000 | ... | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 |
mean | 185351.790768 | 83153.362172 | 2.003401e+07 | 44.911480 | 7.785236 | 4.563271 | 1.665008 | 0.780783 | 115.528412 | 12.577418 | ... | 0.032489 | 0.030408 | 0.014725 | 0.000915 | 0.006273 | 0.006604 | -0.001374 | 0.000609 | -0.004025 | 0.001834 |
std | 107121.188763 | 72540.799964 | 7.770250e+04 | 50.640081 | 7.694010 | 1.912515 | 2.339646 | 0.413717 | 196.141828 | 3.990632 | ... | 0.038792 | 0.049333 | 8.779163 | 5.771081 | 4.880981 | 4.124722 | 3.803626 | 3.555353 | 2.864713 | 2.323680 |
min | 1.000000 | 0.000000 | 1.910000e+07 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.500000 | ... | 0.000000 | 0.000000 | -10.412444 | -15.538236 | -21.009214 | -13.989955 | -9.599285 | -11.181255 | -7.671327 | -2.350888 |
25% | 92501.750000 | 14500.000000 | 1.999061e+07 | 6.000000 | 1.000000 | 3.000000 | 0.000000 | 1.000000 | 70.000000 | 12.500000 | ... | 0.000129 | 0.000000 | -5.552269 | -0.901181 | -3.150385 | -0.478173 | -1.727237 | -3.067073 | -2.092178 | -1.402804 |
50% | 185264.500000 | 65314.500000 | 2.003111e+07 | 27.000000 | 6.000000 | 4.000000 | 0.000000 | 1.000000 | 105.000000 | 15.000000 | ... | 0.001961 | 0.002567 | -3.821770 | 0.223181 | -0.058502 | 0.038427 | -0.995044 | -0.880587 | -1.199807 | -1.145588 |
75% | 278128.500000 | 143761.250000 | 2.008081e+07 | 70.000000 | 11.000000 | 7.000000 | 5.000000 | 1.000000 | 150.000000 | 15.000000 | ... | 0.075672 | 0.056568 | 3.599747 | 1.263737 | 2.800475 | 0.569198 | 1.563382 | 3.269987 | 2.737614 | 0.044865 |
max | 370946.000000 | 233044.000000 | 2.019121e+07 | 250.000000 | 39.000000 | 7.000000 | 6.000000 | 1.000000 | 20000.000000 | 15.000000 | ... | 0.130785 | 0.184340 | 36.756878 | 26.134561 | 23.055660 | 16.576027 | 20.324572 | 14.039422 | 8.764597 | 8.574730 |
8 rows × 40 columns
Test_data.describe()
The max of power (20000) looks anomalous.
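One way to follow up on the suspicious power maximum is to count how many rows exceed a plausible ceiling. The sketch below uses a small made-up Series, and the cutoff of 600 is an assumption for illustration, not a value taken from this analysis.

```python
import pandas as pd

# Toy stand-in for Train_data['power']; the values are made up.
power = pd.Series([0, 75, 105, 150, 20000, 60, 1986], name="power")

# Hypothetical ceiling for a plausible engine power; adjust to your data.
POWER_CAP = 600

too_high = power > POWER_CAP
print("rows above cap:", int(too_high.sum()))

# One common remedy is to clip the values instead of dropping the rows.
power_clipped = power.clip(upper=POWER_CAP)
print("max after clipping:", int(power_clipped.max()))
```

Clipping keeps the row count unchanged, which matters if the same treatment has to be applied to the test set.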
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 40 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   SaleID             250000 non-null  int64
 1   name               250000 non-null  int64
 2   regDate            250000 non-null  int64
 3   model              250000 non-null  float64
 4   brand              250000 non-null  int64
 5   bodyType           224620 non-null  float64
 6   fuelType           227510 non-null  float64
 7   gearbox            236487 non-null  float64
 8   power              250000 non-null  int64
 9   kilometer          250000 non-null  float64
 10  notRepairedDamage  201464 non-null  float64
 11  regionCode         250000 non-null  int64
 12  seller             250000 non-null  int64
 13  offerType          250000 non-null  int64
 14  creatDate          250000 non-null  int64
 15  price              250000 non-null  int64
 16  v_0                250000 non-null  float64
 17  v_1                250000 non-null  float64
 18  v_2                250000 non-null  float64
 19  v_3                250000 non-null  float64
 20  v_4                250000 non-null  float64
 21  v_5                250000 non-null  float64
 22  v_6                250000 non-null  float64
 23  v_7                250000 non-null  float64
 24  v_8                250000 non-null  float64
 25  v_9                250000 non-null  float64
 26  v_10               250000 non-null  float64
 27  v_11               250000 non-null  float64
 28  v_12               250000 non-null  float64
 29  v_13               250000 non-null  float64
 30  v_14               250000 non-null  float64
 31  v_15               250000 non-null  float64
 32  v_16               250000 non-null  float64
 33  v_17               250000 non-null  float64
 34  v_18               250000 non-null  float64
 35  v_19               250000 non-null  float64
 36  v_20               250000 non-null  float64
 37  v_21               250000 non-null  float64
 38  v_22               250000 non-null  float64
 39  v_23               250000 non-null  float64
dtypes: float64(30), int64(10)
memory usage: 76.3 MB
Test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 39 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   SaleID             50000 non-null  int64
 1   name               50000 non-null  int64
 2   regDate            50000 non-null  int64
 3   model              50000 non-null  float64
 4   brand              50000 non-null  int64
 5   bodyType           44890 non-null  float64
 6   fuelType           45598 non-null  float64
 7   gearbox            47287 non-null  float64
 8   power              50000 non-null  int64
 9   kilometer          50000 non-null  float64
 10  notRepairedDamage  40372 non-null  float64
 11  regionCode         50000 non-null  int64
 12  seller             50000 non-null  int64
 13  offerType          50000 non-null  int64
 14  creatDate          50000 non-null  int64
 15  v_0                50000 non-null  float64
 16  v_1                50000 non-null  float64
 17  v_2                50000 non-null  float64
 18  v_3                50000 non-null  float64
 19  v_4                50000 non-null  float64
 20  v_5                50000 non-null  float64
 21  v_6                50000 non-null  float64
 22  v_7                50000 non-null  float64
 23  v_8                50000 non-null  float64
 24  v_9                50000 non-null  float64
 25  v_10               50000 non-null  float64
 26  v_11               50000 non-null  float64
 27  v_12               50000 non-null  float64
 28  v_13               50000 non-null  float64
 29  v_14               50000 non-null  float64
 30  v_15               50000 non-null  float64
 31  v_16               50000 non-null  float64
 32  v_17               50000 non-null  float64
 33  v_18               50000 non-null  float64
 34  v_19               50000 non-null  float64
 35  v_20               50000 non-null  float64
 36  v_21               50000 non-null  float64
 37  v_22               50000 non-null  float64
 38  v_23               50000 non-null  float64
dtypes: float64(30), int64(9)
memory usage: 14.9 MB
# Check each column for NaN values
Train_data.isnull()
 | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_14 | v_15 | v_16 | v_17 | v_18 | v_19 | v_20 | v_21 | v_22 | v_23 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | True | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
249995 | False | False | False | False | False | True | True | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
249996 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
249997 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
249998 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
249999 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
250000 rows × 40 columns
Train_data.isnull().sum() # sum() adds up each column, giving its NaN count
SaleID 0
name 0
regDate 0
model 0
brand 0
bodyType 25380
fuelType 22490
gearbox 13513
power 0
kilometer 0
notRepairedDamage 48536
regionCode 0
seller 0
offerType 0
creatDate 0
price 0
v_0 0
v_1 0
v_2 0
v_3 0
v_4 0
v_5 0
v_6 0
v_7 0
v_8 0
v_9 0
v_10 0
v_11 0
v_12 0
v_13 0
v_14 0
v_15 0
v_16 0
v_17 0
v_18 0
v_19 0
v_20 0
v_21 0
v_22 0
v_23 0
dtype: int64
Visualizing NaN values
missing = Train_data.isnull().sum() # number of NaN values per column
missing = missing[missing > 0] # keep only the columns that have missing values
type(missing)
pandas.core.series.Series
missing
bodyType 25380
fuelType 22490
gearbox 13513
notRepairedDamage 48536
dtype: int64
# inplace=True modifies the Series itself instead of returning a copy
missing.sort_values(inplace=True)
missing # after sorting (ascending)
gearbox 13513
fuelType 22490
bodyType 25380
notRepairedDamage 48536
dtype: int64
# Bar chart: feature names on the x-axis, NaN counts on the y-axis
missing.plot.bar()
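The two statements above make it easy to see which columns contain NaN and how many NaN values each one has. The point is to judge whether the NaN count is really large: if it is small, the usual choice is to fill the values; with tree models such as lgb you can leave them missing and let the trees handle them; if a column has too many NaN values, consider dropping it.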
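The fill / leave / drop decision described above can be sketched as a small report helper. The function name `missing_report`, the toy data, and the 0.5 cutoff are all assumptions made for illustration; the tutorial does not prescribe a specific threshold.

```python
import numpy as np
import pandas as pd

def missing_report(df, drop_threshold=0.5):
    """Return per-column NaN counts and ratios plus a suggested action."""
    counts = df.isnull().sum()
    report = pd.DataFrame({"n_missing": counts, "ratio": counts / len(df)})
    report = report[report["n_missing"] > 0].sort_values("ratio")
    # drop_threshold is an illustrative cutoff, not a rule from the tutorial.
    report["suggestion"] = np.where(
        report["ratio"] > drop_threshold,
        "consider dropping",
        "fill, or leave to a tree model",
    )
    return report

# Made-up frame: one lightly missing column, one mostly empty, one complete.
toy = pd.DataFrame({
    "gearbox": [1.0, 0.0, np.nan, 1.0],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
    "power": [75, 60, 150, 0],
})
report = missing_report(toy)
print(report)
```

Columns without missing values are left out of the report, mirroring the `missing[missing > 0]` filter used earlier.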
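# Visualize the missing values
msno.matrix(Train_data.sample(250))
msno.bar(Train_data.sample(1000))
# The bars show which columns have fewer than 1000 non-null values in the 1000-row sample; the count is labeled above each bar
# Visualize the missing values in the test set
msno.matrix(Test_data)
msno.bar(Test_data.sample(1000))
- The missing-value patterns in the training and test sets are very similar
Outlier detection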
Train_data.info()
- .value_counts() lists the distinct values in a feature column and how often each occurs
# Distinct values of notRepairedDamage and their counts
Train_data['notRepairedDamage'].value_counts()
1.0 176922
0.0 24542
Name: notRepairedDamage, dtype: int64
# Train_data.value_counts()
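# In the raw used-car data this feature is categorical and '-' also stands for a missing value; the line below would replace '-' with NaN
# Train_data['notRepairedDamage'].replace('-', np.nan, inplace=True)
The following two categorical features are severely skewed and generally do not help prediction, so they are dropped here. You can dig into them further, but it is rarely worthwhile.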
Train_data["seller"].value_counts()
1 249999
0 1
Name: seller, dtype: int64
Test_data["seller"].value_counts()
1 50000
Name: seller, dtype: int64
Train_data["offerType"].value_counts()
0 249991
1 9
Name: offerType, dtype: int64
Test_data['offerType'].value_counts()
0 49999
1 1
Name: offerType, dtype: int64
del Train_data["seller"]
del Train_data["offerType"]
del Test_data["seller"]
del Test_data["offerType"]
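The reasoning behind dropping seller and offerType can be sketched as a reusable check: flag any column whose most frequent value covers almost every row. The helper name `near_constant_columns`, the 0.99 threshold, and the toy frame are assumptions for illustration only.

```python
import pandas as pd

def near_constant_columns(df, threshold=0.99):
    """List columns whose most frequent value covers more than `threshold` of rows."""
    flagged = []
    for col in df.columns:
        # Share of rows taken by the single most common value (NaN included).
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share > threshold:
            flagged.append(col)
    return flagged

# Made-up frame: 'seller' is almost constant, 'gearbox' is not.
toy = pd.DataFrame({
    "seller": [1] * 999 + [0],
    "gearbox": [1] * 700 + [0] * 300,
})
print(near_constant_columns(toy))
```

Running the same check on both the training and test frames helps make sure a column is dropped from both, as the `del` statements above do.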
value_counts() for every feature
for f in Train_data.columns:
    print(f)
    print(Train_data[f].value_counts())
SaleID
2049 1
265515 1
277805 1
271662 1
312626 1
..
107105 1
113250 1
111203 1
98917 1
2047 1
Name: SaleID, Length: 250000, dtype: int64
name
451 452
73 429
1791 428
821 391
243 346
...
92419 1
88325 1
82182 1
84231 1
157427 1
Name: name, Length: 164312, dtype: int64
regDate
20000010 306
20000001 288
20000002 288
20000007 279
20000008 278
...
19850904 1
19851010 1
19750511 1
19870912 1
19400705 1
Name: regDate, Length: 7537, dtype: int64
model
0.0 20344
6.0 17741
4.0 13837
1.0 13634
12.0 8841
...
226.0 5
245.0 5
243.0 4
249.0 4
250.0 1
Name: model, Length: 251, dtype: int64
brand
0 53699
4 27109
11 26944
10 23762
1 22144
6 17202
9 12210
5 7343
15 6500
12 4704
7 3839
3 3831
17 3543
13 3502
8 3374
28 3161
19 2561
18 2451
16 2274
22 2264
23 2088
14 1892
24 1678
25 1611
20 1610
27 1392
29 1259
34 963
30 604
2 570
31 540
21 522
38 516
35 415
32 406
36 377
33 368
37 324
26 307
39 141
Name: brand, dtype: int64
bodyType
7.0 64571
3.0 53858
4.0 45646
5.0 20343
6.0 15290
2.0 12755
1.0 9882
0.0 2275
Name: bodyType, dtype: int64
fuelType
0.0 150664
5.0 72494
4.0 3577
3.0 385
2.0 183
1.0 147
6.0 60
Name: fuelType, dtype: int64
gearbox
1.0 184645
0.0 51842
Name: gearbox, dtype: int64
power
0 27280
75 16158
60 10765
150 10373
140 9145
...
1986 1
1090 1
10311 1
960 1
3454 1
Name: power, Length: 703, dtype: int64
kilometer
15.0 162161
12.5 25743
10.0 10777
9.0 8424
8.0 7434
7.0 6642
6.0 5859
5.0 5100
0.5 4634
4.0 4204
3.0 4021
2.0 3749
1.0 1252
Name: kilometer, dtype: int64
notRepairedDamage
1.0 176922
0.0 24542
Name: notRepairedDamage, dtype: int64
regionCode
487 550
868 424
149 236
539 227
32 216
...
7959 1
8002 1
6715 1
7117 1
4144 1
Name: regionCode, Length: 8081, dtype: int64
creatDate
20160403 9758
20160404 9521
20160320 9176
20160312 8946
20160321 8895
...
20150618 1
20160114 1
20160201 1
20150611 1
20140310 1
Name: creatDate, Length: 107, dtype: int64
price
0 7312
500 3815
1500 3587
1000 3149
1200 3071
...
11320 1
7230 1
11448 1
9529 1
8188 1
Name: price, Length: 4585, dtype: int64
v_0
71.666307 2
72.346416 2
78.107692 2
71.715545 2
73.734706 2
..
71.161494 1
70.253614 1
70.797686 1
74.588185 1
77.825581 1
Name: v_0, Length: 249747, dtype: int64
v_1
-1.470958 2
-3.128523 2
-3.224945 2
-3.293795 2
-3.322763 2
..
-3.970355 1
11.487790 1
-3.456756 1
-3.746283 1
-3.579301 1
Name: v_1, Length: 249747, dtype: int64
v_2
-0.527186 2
-0.998414 2
0.652201 2
0.356107 2
-9.312859 2
..
-0.815970 1
-6.729062 1
-1.683035 1
0.171102 1
0.852139 1
Name: v_2, Length: 249747, dtype: int64
v_3
3.580573 2
-0.633228 2
-2.541859 2
0.161395 2
20.571558 2
..
1.067454 1
-0.826230 1
-6.306510 1
0.201140 1
4.146853 1
Name: v_3, Length: 249747, dtype: int64
v_4
2.038620 2
-0.591751 2
-12.603294 2
-0.321072 2
-0.429618 2
..
0.742918 1
2.722358 1
-0.317880 1
-0.356648 1
1.327513 1
Name: v_4, Length: 249747, dtype: int64
v_5
1.273623 2
-1.589295 2
-2.350140 2
0.080770 2
-2.434300 2
..
-1.374641 1
-2.369201 1
-2.194464 1
1.226827 1
-1.218480 1
Name: v_5, Length: 249747, dtype: int64
v_6
3.854950 2
-2.337177 2
-2.840736 2
-2.988814 2
0.912034 2
..
-8.718013 1
3.185567 1
3.443525 1
-2.653621 1
-3.138425 1
Name: v_6, Length: 249747, dtype: int64
v_7
-2.915058 2
-2.518469 2
-1.175198 2
-3.672233 2
-2.563102 2
..
-1.171334 1
-2.324847 1
4.015706 1
-1.895407 1
-2.156468 1
Name: v_7, Length: 249747, dtype: int64
v_8
0.000000 48244
0.315924 2
0.315905 2
0.314498 2
0.315560 2
...
0.315494 1
0.289243 1
0.316095 1
0.316209 1
0.315702 1
Name: v_8, Length: 201543, dtype: int64
v_9
1.101174 2
0.118624 2
0.164335 2
0.114609 2
0.112811 2
..
1.110851 1
1.101634 1
0.116084 1
0.090707 1
0.112558 1
Name: v_9, Length: 249747, dtype: int64
v_10
0.000000 25342
0.081665 2
0.086726 2
0.081616 2
0.081701 2
...
0.089640 1
0.091852 1
0.082066 1
0.081448 1
0.087517 1
Name: v_10, Length: 224427, dtype: int64
v_11
0.000000 7421
0.121584 2
0.102037 2
0.166840 2
0.134519 2
...
0.092895 1
0.108411 1
0.131894 1
0.075781 1
0.078286 1
Name: v_11, Length: 242335, dtype: int64
v_12
0.000000 22426
0.053098 2
0.053437 2
0.055474 2
0.053432 2
...
0.053485 1
0.053471 1
0.055616 1
0.053447 1
0.053329 1
Name: v_12, Length: 227338, dtype: int64
v_13
0.000000 13495
0.130205 2
0.123467 2
0.123337 2
0.130232 2
...
0.123242 1
0.123755 1
0.123252 1
0.123047 1
0.123567 1
Name: v_13, Length: 236266, dtype: int64
v_14
0.000000 53857
0.003751 2
0.000746 2
0.002838 2
0.002283 2
...
0.094690 1
0.000690 1
0.086957 1
0.002928 1
0.083676 1
Name: v_14, Length: 195953, dtype: int64
v_15
0.000000 97223
0.010717 2
0.012704 2
0.143362 2
0.005417 2
...
0.001720 1
0.003263 1
0.007882 1
0.005242 1
0.094839 1
Name: v_15, Length: 152622, dtype: int64
v_16
-3.254226 2
-2.855248 2
-4.373334 2
7.744677 2
-2.847659 2
..
10.816862 1
-6.670231 1
-6.291694 1
-4.147668 1
-6.389964 1
Name: v_16, Length: 249747, dtype: int64
v_17
0.498971 2
0.217974 2
-13.000712 2
-0.675390 2
-1.530593 2
..
1.397213 1
0.610112 1
2.335480 1
-1.500048 1
5.289472 1
Name: v_17, Length: 249747, dtype: int64
v_18
-3.753102 2
7.731945 2
-0.058593 2
-1.171759 2
-2.045338 2
..
-0.860834 1
0.643066 1
5.023034 1
-2.016881 1
6.565301 1
Name: v_18, Length: 249747, dtype: int64
v_19
0.082562 2
-0.469708 2
-0.138257 2
-0.657417 2
0.862429 2
..
-1.249647 1
-0.664831 1
-0.660867 1
0.040847 1
0.700206 1
Name: v_19, Length: 249747, dtype: int64
v_20
-1.214032 2
-2.031659 2
-2.426898 2
-1.542005 2
-0.657360 2
..
-2.098785 1
0.725159 1
-4.682086 1
0.342639 1
1.612570 1
Name: v_20, Length: 249747, dtype: int64
v_21
-3.244933 2
-3.440059 2
-4.070917 2
-3.001142 2
-4.153741 2
..
-3.289808 1
4.942931 1
2.670356 1
-3.793230 1
3.226273 1
Name: v_21, Length: 249747, dtype: int64
v_22
-2.957315 2
4.311760 2
-2.101273 2
-0.936764 2
-2.562937 2
..
6.155533 1
-2.379816 1
-1.419529 1
4.872987 1
5.062041 1
Name: v_22, Length: 249747, dtype: int64
v_23
-1.044909 2
-1.259228 2
-1.183081 2
-0.989739 2
-0.958179 2
..
-1.946367 1
-1.503159 1
-1.175261 1
-0.908192 1
-1.182790 1
Name: v_23, Length: 249747, dtype: int64
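- bodyType: eight categories
- fuelType: seven categories
- gearbox: two categories
- kilometer: 13 categories
- notRepairedDamage: two categories
- seller: two categories, but severely skewed **
- offerType: two categories, but severely skewed **
- v_8, v_10, v_11, v_12, v_13, v_14, v_15: each has one value (0) whose count is far larger than any other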
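The category counts listed above were read off the value_counts() output by eye. A tabular version can be sketched as follows; the toy frame and its values are made up, and the column names `n_unique`, `top_share`, and `top_value` are chosen here purely for illustration.

```python
import pandas as pd

# Made-up stand-in for a few Train_data columns.
toy = pd.DataFrame({
    "gearbox": [1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
    "kilometer": [15.0, 12.5, 15.0, 10.0, 15.0, 0.5],
    "v_8": [0.0, 0.0, 0.0, 0.3159, 0.3145, 0.2892],
})

summary = pd.DataFrame({
    # Number of distinct non-null values per column.
    "n_unique": toy.nunique(),
    # Share of rows taken by the single most frequent value.
    "top_share": toy.apply(lambda s: s.value_counts(normalize=True).iloc[0]),
    # The most frequent value itself.
    "top_value": toy.apply(lambda s: s.value_counts().index[0]),
})
print(summary)
```

A column with many distinct values but a large `top_share` on a single value, like the toy v_8 here, matches the pattern noted for v_8 and v_10 to v_15.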
Understanding the distribution of the target
type(Train_data['price'])
pandas.core.series.Series
Train_data['price'].value_counts()
0 7312
500 3815
1500 3587
1000 3149
1200 3071
...
11320 1
7230 1
11448 1
9529 1
8188 1
Name: price, Length: 4585, dtype: int64
## 1) Distribution of price (unbounded Johnson distribution, etc.)
import scipy.stats as st
y = Train_data['price']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
<AxesSubplot:title={'center':'Log Normal'}, xlabel='price'>
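The three plots compare candidate distributions visually. A numeric comparison can be sketched by fitting each candidate with scipy.stats and summing the log-density of the data under the fit (higher is better). The data below is synthetic and strictly positive, which is an assumption: the real price column contains zeros, so a log-normal fit with the location fixed at 0 would need those rows handled first.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices"; a stand-in for Train_data['price'].
y = rng.lognormal(mean=8.0, sigma=1.0, size=2000)

# Fit each candidate by maximum likelihood.
norm_params = st.norm.fit(y)
lognorm_params = st.lognorm.fit(y, floc=0)  # fix the location at 0 for stability

# Score each fit by the total log-likelihood of the data.
loglik = {
    "norm": st.norm.logpdf(y, *norm_params).sum(),
    "lognorm": st.lognorm.logpdf(y, *lognorm_params).sum(),
}
for name, value in loglik.items():
    print(f"{name}: {value:.1f}")
```

`st.johnsonsu` exposes the same `fit` and `logpdf` methods, so it can be added to the comparison in the same way.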
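figure syntax and usage
(1) figure syntax
figure(num=None, figsize=None, dpi=None, facecolor=None, edgecolor=None, frameon=True)
- num: figure number or name; a number is used as the ID, a string as the name
- figsize: width and height of the figure, in inches
- dpi: resolution of the figure in pixels per inch; when omitted it falls back to rcParams['figure.dpi']. For reference, 1 inch is 2.54 cm and A4 paper measures 21 × 29.7 cm
- facecolor: background color
- edgecolor: border color
- frameon: whether to draw the frame
(2) Example: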
fig=plt.figure(figsize=(4,3),facecolor='blue')
plt.plot([1,2,3,4],[3,5,7,9])
plt.show()
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=True,rug=True, fit=st.lognorm)
# kde=True
<AxesSubplot:title={'center':'Log Normal'}, xlabel='price', ylabel='Density'>
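Plotting with seaborn
- seaborn.distplot(a, bins=None, hist=True, kde=True, rug=False, fit=None, hist_kws=None, kde_kws=None, rug_kws=None, fit_kws=None, color=None, vertical=False, norm_hist=False, axlabel=None, label=None, ax=None)
Setting kde to True
Kernel density estimation (KDE)
Kernel density estimation is used in probability theory to estimate an unknown density function and is a non-parametric method. Because it uses no prior knowledge about the data distribution and imposes no assumptions on it, it studies the distribution's characteristics starting from the data sample itself, and is therefore highly valued in both statistical theory and applications.
- hist: bool, optional # whether to draw the histogram; default True
- kde: bool, optional # whether to draw the kernel density estimate; default True
- rug: bool, optional # whether to draw a small tick for each observation (rug plot); default False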
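To make the kernel density estimation idea concrete, here is a minimal NumPy sketch of a one-dimensional Gaussian KDE. It is not seaborn's implementation; the function name `gaussian_kde_1d` and the normal-reference bandwidth rule are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Normal-reference rule of thumb, one common default choice.
        bandwidth = 1.06 * samples.std(ddof=1) * n ** (-1 / 5)
    # One standard-normal bump per sample, averaged and rescaled by the bandwidth.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (n * bandwidth)

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=500)
grid = np.linspace(-6, 6, 601)
density = gaussian_kde_1d(data, grid)

# A density should integrate to about 1 over a grid that covers the data.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(f"area under the estimate: {area:.3f}")
```

No distribution family is assumed anywhere in the function: the estimate is built only from the samples, which is what makes the method non-parametric.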
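Handling the target distribution
Price is not normally distributed, so it needs to be transformed before running a regression. The log transform works well, but the best fit is the unbounded Johnson distribution.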
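The effect of a log transform on a right-skewed target can be sketched on synthetic data. Using `np.log1p` rather than `np.log` is a choice made here because the real price column contains zeros; the data below is made up and is not the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed prices; a stand-in for Train_data['price'].
price = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=5000))

log_price = np.log1p(price)      # log(1 + x), defined even when price is 0
restored = np.expm1(log_price)   # inverse transform, e.g. for model predictions

print(f"skew before: {price.skew():.3f}")
print(f"skew after:  {log_price.skew():.3f}")
```

A model trained on `log_price` predicts on the log scale, so its outputs need `np.expm1` before they are compared with real prices.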
## 2) Check skewness and kurtosis
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())
Skewness: 3.535346
Kurtosis: 21.230678
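For reference, the two statistics can be computed by hand. The sketch below implements the bias-adjusted sample skewness and excess kurtosis, which is what pandas' `skew()` and `kurt()` return as far as I know, and compares the results with pandas on a handful of made-up values.

```python
import numpy as np
import pandas as pd

def sample_skew(x):
    """Bias-adjusted sample skewness (adjusted Fisher-Pearson coefficient)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    d = x - x.mean()
    m2, m3 = np.mean(d ** 2), np.mean(d ** 3)
    return np.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5

def sample_excess_kurtosis(x):
    """Bias-adjusted excess kurtosis (0 for a normal distribution)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    d = x - x.mean()
    m2, m4 = np.mean(d ** 2), np.mean(d ** 4)
    g2 = m4 / m2 ** 2 - 3
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)

# Made-up values with one large outlier, to produce a long right tail.
values = pd.Series([0, 500, 1500, 1000, 1200, 11320, 7230, 99999])
print(sample_skew(values), values.skew())
print(sample_excess_kurtosis(values), values.kurt())
```

Positive skewness means a long right tail, and a large positive excess kurtosis means heavier tails than a normal distribution, which matches the price values printed above.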
Train_data.skew(), Train_data.kurt()
(SaleID 0.001712
name 0.513079
regDate -1.540844
model 1.499765
brand 1.314846
bodyType -0.070459
fuelType 0.701802
gearbox -1.357379
power 58.590829
kilometer -1.557472
notRepairedDamage -2.312519
regionCode 0.690405
creatDate -95.428563
price 3.535346
v_0 -1.504738
v_1 1.582428
v_2 1.198679
v_3 1.352193
v_4 0.217941
v_5 2.052749
v_6 0.090718
v_7 0.823610
v_8 -1.532964
v_9 1.529931
v_10 -2.584452
v_11 -0.906428
v_12 -2.842834
v_13 -3.869655
v_14 0.491706
v_15 1.308716
v_16 1.662893
v_17 0.233318
v_18 0.814453
v_19 0.100073
v_20 2.001253
v_21 0.180020
v_22 0.819133
v_23 1.357847
dtype: float64,
SaleID -1.201476
name -1.084474
regDate 11.041006
model 1.741896
brand 1.814245
bodyType -1.070358
fuelType -1.495782
gearbox -0.157525
power 4473.885260
kilometer 1.250933
notRepairedDamage 3.347777
regionCode -0.352973
creatDate 11376.694263
price 21.230678
v_0 2.901641
v_1 1.098703
v_2 3.749872
v_3 4.294578
v_4 6.953348
v_5 6.489791
v_6 -0.564878
v_7 -0.729838
v_8 0.370812
v_9 0.377943
v_10 4.796855
v_11 1.547812
v_12 6.136342
v_13 13.199575
v_14 -1.597532
v_15 -0.029594
v_16 2.240928
v_17 2.569341
v_18 2.967738
v_19 6.923953
v_20 6.852809
v_21 -0.759948
v_22 -0.741708
v_23 0.143713
dtype: float64)
sns.distplot(Train_data.skew(),color='blue',axlabel ='Skewness')
# axlabel and label set the axis label and the legend label
<AxesSubplot:xlabel='Skewness', ylabel='Density'>
sns.distplot(Train_data.kurt(),color='orange',axlabel ='Kurtosis')
<AxesSubplot:xlabel='Kurtosis', ylabel='Density'>