本文主要是介绍7.房价预测新一轮空值填充,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!
- 代码地址:appke/Los-House-Prices: 洛杉矶房价预测
新一轮空值填充
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# 忽略警告信息
import warnings
warnings.filterwarnings("ignore")
data = pd.read_csv('datas/los_data.csv')
train = data
查看空值情况
def getNullCount():x = train.isnull().sum()print(x[x>0])x[x>0].sort_values().plot.bar()
getNullCount()
LotFrontage 259
Alley 1369
MasVnrType 8
MasVnrArea 8
BsmtQual 37
BsmtCond 37
BsmtExposure 38
BsmtFinType1 37
BsmtFinType2 38
Electrical 1
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageQual 81
GarageCond 81
PoolQC 1453
Fence 1179
MiscFeature 1406
dtype: int64
1. LotFrontage 填充
思路1:房子宽度,是否和小区有关?和小区设计有关?
取不同的Neighborhood的均值和中位数
sns.violinplot('LotFrontage', 'Neighborhood', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x115c14da0>
# getNullCount()
neighborhood_group=train.groupby('Neighborhood')
lot_median=neighborhood_group['LotFrontage'].median()
lot_mean=neighborhood_group['LotFrontage'].mean()
# 对空值情况
# train[train['LotFrontage'].isnull()]['Neighborhood']
思路2:是否和LotArea 有关呢?
房子宽度 和 房屋面积(不缺失)
train['LotFrontage'].corr(train['LotArea'])
0.42609501877180833
train['LotFrontage'].corr(np.sqrt(train['LotArea']))
0.6020022167939361
# 添加1列
train['SqrtLotArea']=np.sqrt(train['LotArea'])
train.head()
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | SqrtLotArea | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 | 91.923882 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 | 97.979590 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 | 106.066017 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 | 97.724101 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 | 119.415242 |
5 rows × 82 columns
# 相关度较大
# 拟合的曲线 y = 0.6 * x
sns.lmplot('SqrtLotArea','LotFrontage', data=train)
<seaborn.axisgrid.FacetGrid at 0x115e61d68>
sns.jointplot('SqrtLotArea', 'LotFrontage', data=train)
<seaborn.axisgrid.JointGrid at 0x115e541d0>
# 把空值拿出来
filter = train['LotFrontage'].isnull()
train.LotFrontage[filter] = 0.6 * train.SqrtLotArea[filter]
train['LotFrontage'].isnull().sum()
0
2. MasVnrType 和 MasVnrArea 的填充
砖石镶板种类
# getNullCount()
# MasVnrArea 和 SalePrice 的关系
# 基本没什么关系
plt.scatter(train['MasVnrArea'], train['SalePrice'])
<matplotlib.collections.PathCollection at 0x115bdec18>
train['MasVnrArea'].corr(train['SalePrice'])
0.4774930470957163
# 什么都没有价格也不低,相关性不太强
sns.boxplot(train['MasVnrType'], train['SalePrice'], data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x115bfdf98>
# TypeError: unsupported operand type(s) for /: 'str' and 'int'
# train['MasVnrType'].corr(train['SalePrice'])
# 只有8个空值,有高频填充即可
# 进行聚类,看它 个类 有多少个
train.groupby(['MasVnrType']).count()
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | SqrtLotArea | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MasVnrType | |||||||||||||||||||||
BrkCmn | 15 | 15 | 15 | 15 | 15 | 15 | 0 | 15 | 15 | 15 | ... | 0 | 7 | 2 | 15 | 15 | 15 | 15 | 15 | 15 | 15 |
BrkFace | 445 | 445 | 445 | 445 | 445 | 445 | 10 | 445 | 445 | 445 | ... | 2 | 80 | 11 | 445 | 445 | 445 | 445 | 445 | 445 | 445 |
None | 864 | 864 | 864 | 864 | 864 | 864 | 77 | 864 | 864 | 864 | ... | 4 | 188 | 40 | 864 | 864 | 864 | 864 | 864 | 864 | 864 |
Stone | 128 | 128 | 128 | 128 | 128 | 128 | 3 | 128 | 128 | 128 | ... | 1 | 6 | 1 | 128 | 128 | 128 | 128 | 128 | 128 | 128 |
4 rows × 81 columns
# 把None的过滤出来
# 看 MasVnrType 和 MasVnrArea的相关性
train[train.MasVnrType=='None'][['MasVnrType', 'MasVnrArea']].head()
MasVnrType | MasVnrArea | |
---|---|---|
1 | None | 0.0 |
3 | None | 0.0 |
5 | None | 0.0 |
8 | None | 0.0 |
9 | None | 0.0 |
# 散点图 + 回归
# 斜率比较缓,相关性比较低
sns.lmplot('MasVnrArea', 'SalePrice', hue='MasVnrType', data=train)
<seaborn.axisgrid.FacetGrid at 0x113245b38>
filter=train['MasVnrArea'].isnull()
# getNullCount()
(filter==True).sum()
8
train.MasVnrArea[filter]=0.0
filter=train['MasVnrType'].isnull()
train.MasVnrType[filter]='None'
后续在EDA的特征工程中,可以使用如下填充办法:
分类进行预测MasVnrType,使用回归MasVnrAre
getNullCount()
Alley 1369
BsmtQual 37
BsmtCond 37
BsmtExposure 38
BsmtFinType1 37
BsmtFinType2 38
Electrical 1
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageQual 81
GarageCond 81
PoolQC 1453
Fence 1179
MiscFeature 1406
dtype: int64
3. Electrical的填充
电力系统,保险丝、空气匝
# 先看和真正目标的 有多大的相关性
sns.boxplot('Electrical', 'SalePrice', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x11362d438>
# 统计下,用众数填充
train.groupby('Electrical').count()
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | SqrtLotArea | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Electrical | |||||||||||||||||||||
FuseA | 94 | 94 | 94 | 94 | 94 | 94 | 12 | 94 | 94 | 94 | ... | 0 | 23 | 2 | 94 | 94 | 94 | 94 | 94 | 94 | 94 |
FuseF | 27 | 27 | 27 | 27 | 27 | 27 | 4 | 27 | 27 | 27 | ... | 0 | 2 | 0 | 27 | 27 | 27 | 27 | 27 | 27 | 27 |
FuseP | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 3 | ... | 0 | 0 | 0 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
Mix | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
SBrkr | 1334 | 1334 | 1334 | 1334 | 1334 | 1334 | 73 | 1334 | 1334 | 1334 | ... | 7 | 256 | 52 | 1334 | 1334 | 1334 | 1334 | 1334 | 1334 | 1334 |
5 rows × 81 columns
# 用众数,在朴素贝叶斯中概率也大,用其它的分类模型学习它的数量已经决定它的可能性了
# 数量上已经不均衡了
filter=train['Electrical'].isnull()
train.Electrical[filter]='SBrkr'
# getNullCount()
4. Alley 小路 的填充
EDA 后半段考虑删掉
80%以上是空值,可以删除了
# 小巷的数据分布,只有2个
# 推断 NA不填 本身就是None
train['Alley'].value_counts()
Grvl 50
Pave 41
Name: Alley, dtype: int64
# train['Alley'].value_counts() 与groupby().count()效果一样?
sns.stripplot('Alley', 'SalePrice', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1131eb2e8>
# 大概率都是None
train['Alley']=train['Alley'].fillna('None')
getNullCount()
BsmtQual 37
BsmtCond 37
BsmtExposure 38
BsmtFinType1 37
BsmtFinType2 38
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageQual 81
GarageCond 81
PoolQC 1453
Fence 1179
MiscFeature 1406
dtype: int64
5.BaseMent 群填充
TotalBsmtSF 是一个完整的关于Basement的列,可以拿出来与SalePrice相关性分析
# train.columns
# 本身和房价做分析,相关性挺高
plt.scatter(train['TotalBsmtSF'], train['SalePrice'])
<matplotlib.collections.PathCollection at 0x1139304a8>
# 未缺失的值 和 房价做分析
basement_cols=['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','BsmtFinSF1','BsmtFinSF2']
# 把为空的 筛选出来
# 代表没有地下室
print(train[basement_cols+['TotalBsmtSF']][train['BsmtQual'].isnull()==True])
Empty DataFrame
Columns: [BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, BsmtFinSF1, BsmtFinSF2, TotalBsmtSF]
Index: []
# train[basement_cols+['TotalBsmtSF']]
# [train['BsmtQual'].isnull()==True]
# 本身没有地下室填 NA的,偷懒没有填
train[train.BsmtCond=='NA']
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | SqrtLotArea |
---|
0 rows × 82 columns
for col in basement_cols:if 'FinSF' not in col:train[col]=train[col].fillna('None')
getNullCount()
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageQual 81
GarageCond 81
PoolQC 1453
Fence 1179
MiscFeature 1406
dtype: int64
6.FireplaceQu填充
火炉
sns.lmplot('Fireplaces', 'SalePrice', data=train)
<seaborn.axisgrid.FacetGrid at 0x11393b208>
sns.lmplot('Fireplaces', 'SalePrice', data=train, hue='FireplaceQu')
<seaborn.axisgrid.FacetGrid at 0x113a06ef0>
# 上图只给了5个类型,快查表中有6类包括None,推断少了一类None
train['FireplaceQu']=train['FireplaceQu'].fillna('None')
getNullCount()
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageQual 81
GarageCond 81
PoolQC 1453
Fence 1179
MiscFeature 1406
dtype: int64
7.Carage 车库列群填充
# 相关性,跟房价关系挺大的
sns.lmplot('GarageArea', 'SalePrice', data=train)
<seaborn.axisgrid.FacetGrid at 0x116914be0>
# 分布,拟正太分布,本身特征的分布
# 核概率分布曲线
sns.distplot(train['GarageArea'], color='y', kde=True)
<matplotlib.axes._subplots.AxesSubplot at 0x11691b668>
# 能放多少量车,3个车位
sns.violinplot(train['GarageCars'], train['SalePrice'])
<matplotlib.axes._subplots.AxesSubplot at 0x116babf98>
# train.columns
garage_cols=['GarageType','GarageQual','GarageCond','GarageYrBlt','GarageFinish','GarageCars','GarageArea']
train[garage_cols][train['GarageType'].isnull()].head()
# print(train[basement_cols+['TotalBsmtSF']][train['BsmtQual'].isnull()==True])
GarageType | GarageQual | GarageCond | GarageYrBlt | GarageFinish | GarageCars | GarageArea | |
---|---|---|---|---|---|---|---|
39 | NaN | NaN | NaN | NaN | NaN | 0 | 0 |
48 | NaN | NaN | NaN | NaN | NaN | 0 | 0 |
78 | NaN | NaN | NaN | NaN | NaN | 0 | 0 |
88 | NaN | NaN | NaN | NaN | NaN | 0 | 0 |
89 | NaN | NaN | NaN | NaN | NaN | 0 | 0 |
for col in garage_cols:if train[col].dtype==np.object:train[col]=train[col].fillna('None')else:train[col]=train[col].fillna(0)
getNullCount()
PoolQC 1453
Fence 1179
MiscFeature 1406
dtype: int64
8.PoolQC 泳池质量 填充
train.filter(like='Pool', axis=1).head()
PoolArea | PoolQC | |
---|---|---|
0 | 0 | NaN |
1 | 0 | NaN |
2 | 0 | NaN |
3 | 0 | NaN |
4 | 0 | NaN |
sns.distplot(train['PoolArea'], color='g', kde=True)
<matplotlib.axes._subplots.AxesSubplot at 0x116a60438>
# 根本没有泳池
train.PoolQC=train['PoolQC'].fillna('None')
getNullCount()
Fence 1179
MiscFeature 1406
dtype: int64
9.Fence 围栏填充
sns.violinplot(train['Fence'], train['SalePrice'])
<matplotlib.axes._subplots.AxesSubplot at 0x1170339e8>
总共有5个类型,然而可以统计出来的类型一共只有4类,可以断定最后一类NA用空值代替了
有没有围栏,价格都差不多
train.Fence = train['Fence'].fillna('None')
# 回看填充后的数据分布 与 SalePrice的关系
sns.violinplot(train['Fence'], train['SalePrice'])
<matplotlib.axes._subplots.AxesSubplot at 0x1168f93c8>
getNullCount()
MiscFeature 1406
dtype: int64
10.MiscFeature 其余特征,备注特征填充
# 看看它的类型与 SalePrice 全不全
sns.violinplot(train['MiscFeature'], train['SalePrice'])
<matplotlib.axes._subplots.AxesSubplot at 0x116901160>
因为MiscFeature本来有5类,然而只统计出四类,因此可以断定第五类NA其实是空值代表的
train.MiscFeature = train['MiscFeature'].fillna('None')
# getNullCount() 没空值了
保存处理后的数据
#不保存行索引
train.to_csv('datas/house_data.csv', index=0)
这篇关于7.房价预测新一轮空值填充的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!