AI Competition 4: Kaggle in Practice, Predicting Survival in the Titanic Disaster


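1. Project overview
2. Loading the data
3. Data description
4. Data exploration
- 4.1 Inspecting the data
- 4.2 Merging the data
- 4.3 Checking for missing values
- 4.4 Relationships between features and the label
  - 4.4.1 Port of embarkation and survival rate
  - 4.4.2 Number of parents/children and survival rate
  - 4.4.3 Number of siblings/spouses and survival rate
  - 4.4.4 Ticket class and survival rate
  - 4.4.5 Sex and survival rate
  - 4.4.6 Age and survival rate
  - 4.4.7 Fare and survival rate
5. Data preprocessing
- 5.1 Preprocessing overview
- 5.2 Data cleaning
- 5.3 Feature engineering
  - 5.3.1 Title
  - 5.3.2 Family size
  - 5.3.3 Deck
  - 5.3.4 Number of passengers sharing a ticket
  - 5.3.5 Filling missing Age values with a random forest
- 5.4 Group identification
- 5.5 Subset selection
6. Model selection
- 6.1 Building the models
- 6.2 Comparing the algorithms
- 6.3 Visualizing the comparison
7. Model tuning
- 7.1 Gradient boosting
- 7.2 Logistic regression
- 7.3 SVC (support vector classifier)
- 7.4 LDA model
8. Model evaluation
- 8.1 Accuracy
- 8.2 ROC curves
- 8.3 Confusion matrices
9. Prediction
- 9.1 Uploading the predictions
- 9.2 Project summary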

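1. Project overview

Construction of the Titanic began in Ireland on 31 March 1909. She was launched on 31 May 1911 and completed her sea trials on 2 April of the following year. She was the largest and most luxuriously fitted passenger liner of her day and was reputed to be "unsinkable". Ironically, her maiden voyage ended in disaster: on 10 April 1912 she left Southampton, England, called at Cherbourg in France and Queenstown in Ireland, and headed for New York. At about 23:40 on 14 April she struck an iceberg, which split the hull plating and let water in. At about 02:20 the next morning she broke in two and sank into the Atlantic. Of the 2,224 passengers and crew on board, more than 1,500 died.
Titanic survival prediction is a classic starter project for learning machine learning. Kaggle hosts it as the competition "Titanic: Machine Learning from Disaster", in which we explore what kinds of people were more likely to survive the disaster and build a model that predicts passenger survival.
This project uses visualization to understand the data, uses feature engineering to derive more informative features, and then uses group effects to find tightly linked groups of passengers and adjust their data. Several methods are compared during model selection, including Gradient Boosting Classifier and Logistic Regression, and the Gradient Boosting Classifier is finally used to predict survival.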

2. Loading the data

import warnings 
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(style='white',context='notebook',palette='muted')  # set the seaborn style
import matplotlib.pyplot as plt
train=pd.read_csv('./train.csv')
test=pd.read_csv('./test.csv')
display(train.head())

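3. Data description

(1) Survived: whether the passenger survived (the label)
- 0: died
- 1: survived
(2) Pclass: ticket class
- 1 (1st class): upper
- 2 (2nd class): middle
- 3 (3rd class): lower
(3) SibSp: number of siblings and spouses travelling with the passenger
(4) Parch: number of parents and children travelling with the passenger
(5) Cabin: the passenger's cabin number. It has two parts, a deck letter and a room number; in C88, for example, C is the deck and 88 is the room
(6) Embarked: the port where the passenger boarded, one of three values
- S: Southampton (England)
- C: Cherbourg (France)
- Q: Queenstown (now Cobh, Ireland)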

4. Data exploration

4.1 Inspecting the data

print('Training data shape:',train.shape)  # check the shapes
print('Test data shape:',test.shape)
Training data shape: (891, 12)
Test data shape: (418, 11)
display(train.head(),test.head())  # first five rows of each

4.2 Merging the data

full=pd.concat([train,test],ignore_index=True) # merge the training and test data so they are processed together (DataFrame.append was removed in pandas 2.0)
full.describe()
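
To make the merge step concrete, here is a minimal sketch on two invented frames that stand in for train.csv and test.csv (the names toy_train, toy_test and toy_full and all values are made up for illustration). It shows that rows coming from the unlabeled frame get NaN in Survived, which is why Survived has only 891 non-null values after the merge.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values invented for illustration)
toy_train = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1], 'Fare': [7.25, 71.28]})
toy_test = pd.DataFrame({'PassengerId': [3], 'Fare': [8.05]})

# Stack the rows; ignore_index=True renumbers the index from 0
toy_full = pd.concat([toy_train, toy_test], ignore_index=True)

# The test row has no label, so Survived is NaN there
n_unlabeled = int(toy_full['Survived'].isnull().sum())
print(toy_full)
print('rows without a label:', n_unlabeled)
```

Later steps can therefore separate the two parts again by testing whether Survived is null.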

4.3 Checking for missing values

full.info()  # Age, Cabin, Embarked and Fare have missing values; Cabin is missing for more than three quarters of the rows
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

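4.4 Relationships between features and the label

4.4.1 Port of embarkation and survival rate

sns.barplot(data=train,x='Embarked',y='Survived')

s = full.groupby('Embarked')['Survived'].value_counts().to_frame() # number of deaths and survivors for each port of embarkation
display(s)
s2 = s/s.groupby(level=0).transform('sum')  # share within each port, i.e. the survival rate (sum(level=0) was removed in pandas 2.0)
display(s2)
pd.merge(s,s2,left_index=True,right_index=True,suffixes=['_num','_rate'])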
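
Because Survived is coded 0/1, its mean within a group is that group's survival rate, so the per-port figures can also be obtained with one groupby. The sketch below uses invented rows (toy_df is hypothetical), not the competition data.

```python
import pandas as pd

# Invented rows for illustration only
toy_df = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'S', 'C', 'C', 'Q', 'Q'],
    'Survived': [0, 0, 0, 1, 1, 1, 0, 1],
})

# count = passengers with a label, sum = survivors, mean = survival rate
port_stats = toy_df.groupby('Embarked')['Survived'].agg(
    passengers='count', survivors='sum', rate='mean')
print(port_stats)
```

The same call works on the labeled rows of the real data; rows whose Survived is NaN are ignored by count, sum and mean.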

sns.catplot(data=train,x='Pclass',col='Embarked',kind='count',height=3)  # ticket class counts per port (height replaces the old size argument)

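4.4.2 Number of parents/children and survival rate

sns.barplot(data=train,x='Parch',y='Survived')  # survival is higher when a moderate number of parents and children travel with the passenger

4.4.3 Number of siblings/spouses and survival rate

sns.barplot(data=train,x='SibSp',y='Survived')  # survival is higher when a moderate number of siblings and spouses travel with the passenger

4.4.4 Ticket class and survival rate

sns.barplot(data=train,x='Pclass',y='Survived')   # the higher the ticket class, the higher the survival rate

4.4.5 Sex and survival rate

sns.barplot(data=train,x='Sex',y='Survived')    # women survived at a far higher rate than men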

4.4.6 Age and survival rate

ageFacet=sns.FacetGrid(train,hue='Survived',aspect=3)  # create the grid
ageFacet.map(sns.kdeplot,'Age',fill=True)  # draw one density curve per class (fill replaces the deprecated shade argument, seaborn >= 0.11)
ageFacet.set(xlim=(0,train['Age'].max()))  # other settings: axis range, labels, etc.
ageFacet.add_legend()  # passengers aged 0-10 have a higher survival rate

4.4.7 Fare and survival rate

ageFacet=sns.FacetGrid(train,hue='Survived',aspect=3)   # aspect is the width-to-height ratio of each facet
ageFacet.map(sns.kdeplot,'Fare',fill=True)
ageFacet.set(xlim=(0,150))
ageFacet.add_legend() # Fare is right-skewed (long right tail); its skewness of 4.37 is large, so the values are log-transformed to keep a few very high fares from dominating

farePlot=sns.histplot(full['Fare'][full['Fare'].notnull()],kde=True,label='skewness:%.2f'%(full['Fare'].skew()))  # histplot replaces the deprecated distplot
farePlot.legend(loc='best')   # Fare distribution before the transform

full['Fare']=full['Fare'].map(lambda x: np.log(x) if x > 0 else x)   # log-transform Fare (zero fares are left unchanged)
farePlot=sns.histplot(full['Fare'][full['Fare'].notnull()],kde=True,label='skewness:%.2f'%(full['Fare'].skew())) 
farePlot.legend(loc='best')    # Fare distribution after the transform
plt.savefig('./10-Fare票价分布.png',dpi = 200)
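
The effect of the log transform on skewness can be checked numerically. The following sketch uses a handful of invented fare values with one very large fare (they are not taken from the dataset) and compares the sample skewness before and after taking logs.

```python
import numpy as np
import pandas as pd

# Invented fares with a long right tail (not the competition data)
fares = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 30.0, 52.0, 80.0, 150.0, 512.33])

skew_before = fares.skew()      # positive value = long right tail
log_fares = np.log(fares)       # all values are > 0 here, so no guard is needed
skew_after = log_fares.skew()

print('skewness before: %.2f, after: %.2f' % (skew_before, skew_after))
```

A positive skewness means the tail is on the right, which is why compressing large values with a logarithm pulls the distribution closer to symmetric.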

5. Data preprocessing

5.1 Preprocessing overview

(1) Data cleaning (handling missing values and outliers)
(2) Feature engineering (building new features from an understanding of the existing ones, to bring out more of what the data contains)
(3) Group identification (finding records that show a clear group effect and go against the overall pattern, and adjusting them)
(4) Subset selection (reducing dimensionality by keeping a subset of features)

5.2 Data cleaning

# Handle missing values and outliers. Four fields have missing values (Cabin, Embarked, Fare and Age); no obvious outliers were found. Age has many missing values and is continuous
full.info()  # the output is the same as the listing in section 4.3

### 5.2.1 Filling missing Cabin values
full['Cabin']=full['Cabin'].fillna('U')  # fill missing values with U (Unknown)
full['Cabin'].head()
0       U
1     C85
2       U
3    C123
4       U
Name: Cabin, dtype: object

### 5.2.2 Filling missing Embarked values
display(full[full['Embarked'].isnull()])
display(full['Embarked'].value_counts())  # distribution of Embarked
full['Embarked']=full['Embarked'].fillna('S')  # fill missing values with S

### 5.2.3 Filling the missing Fare value
display(full[full['Fare'].isnull()])  # inspect the missing row: a 3rd-class passenger who embarked at Southampton (S), cabin unknown
price = full[(full['Pclass']==3)&(full['Embarked']=='S')&(full['Cabin']=='U')]['Fare'].mean()
full['Fare']=full['Fare'].fillna(price) # fill with the mean fare of 3rd-class passengers who embarked at Southampton with unknown cabin (Fare was already log-transformed in 4.4.7, so this is a mean of log fares)
full.info()

5.3 Feature engineering

Building on an understanding of the original features, feature engineering combines and transforms the existing data into new features that expose more information.

5.3.1 Title

# Passenger names contain a title. Titles reflect a passenger's status, and survival may differ considerably between statuses, so the title is extracted from Name and its relationship with Survived is examined
#(1) Build the new feature Title
full['Title']=full['Name'].map(lambda x:x.split(',')[1].split('.')[0].strip())
full['Title'].value_counts()  # distribution of titles
Mr              757
Miss            260
...
Dona              1
Jonkheer          1
Name: Title, dtype: int64
#(2) Consolidate the titles
TitleDict={}
TitleDict['Mr']='Mr'
TitleDict['Mlle']='Miss'
TitleDict['Miss']='Miss'
TitleDict['Master']='Master'
TitleDict['Jonkheer']='Master'
TitleDict['Mme']='Mrs'
TitleDict['Ms']='Mrs'
TitleDict['Mrs']='Mrs'
TitleDict['Don']='Royalty'
TitleDict['Sir']='Royalty'
TitleDict['the Countess']='Royalty'
TitleDict['Dona']='Royalty'
TitleDict['Lady']='Royalty'
TitleDict['Capt']='Officer'
TitleDict['Col']='Officer'
TitleDict['Major']='Officer'
TitleDict['Dr']='Officer'
TitleDict['Rev']='Officer'
full['Title']=full['Title'].map(TitleDict)
full['Title'].value_counts()
Mr         757
Miss       262
Mrs        200
Master      62
Officer     23
Royalty      5
Name: Title, dtype: int64
#(3) Visualize the new feature against the label
sns.barplot(data=full,x='Title',y='Survived')  # passengers titled 'Mr' or 'Officer' have a clearly lower survival rate
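
The title extraction above relies on names having the form "Surname, Title. Given names". As an alternative sketch of the same idea (not the author's code), a single regular expression with pandas' str.extract captures the text between the first comma and the following period. The three sample names below follow that format and are used only for illustration.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Capture everything between the first comma and the next period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles.tolist())
```

A name that does not match the pattern yields NaN instead of raising an IndexError, which makes malformed rows easy to spot with isnull().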

5.3.2 Family size

# Parch and SibSp are combined into familyNum, the total number of family members travelling together (including the passenger), and familyNum is then binned into a new field, familySize
# SibSp: number of siblings and spouses travelling with the passenger
# Parch: number of parents and children travelling with the passenger
full['familyNum']=full['Parch']+full['SibSp'] + 1
sns.barplot(data=full,x='familyNum',y='Survived') # survival is higher for families of 2-4 and lower for passengers travelling alone or in very large families

def familysize(familyNum):
    # bin the family by number of members: small (0), medium (1), large (2)
    if familyNum == 1:
        return 0
    elif 2 <= familyNum <= 4:
        return 1
    else:
        return 2
full['familySize']=full['familyNum'].map(familysize)
full['familySize'].value_counts()
0    790
1    437
2     82
Name: familySize, dtype: int64
sns.barplot(data=full,x='familySize',y='Survived')  # familySize against Survived: medium-sized families survive at a higher rate
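
The same three-way binning can be written without a helper function by using pd.cut. This is an alternative sketch on a few made-up family sizes, with bin edges chosen to reproduce the small/medium/large rule described above (1 member, 2 to 4 members, 5 or more).

```python
import numpy as np
import pandas as pd

# Made-up family sizes for illustration
family_num = pd.Series([1, 2, 4, 5, 11])

# Right-closed bins: (0, 1] -> 0, (1, 4] -> 1, (4, inf) -> 2
family_size = pd.cut(family_num, bins=[0, 1, 4, np.inf], labels=[0, 1, 2]).astype(int)
print(family_size.tolist())
```

Keeping the edges in one list makes it easy to try other cut points when exploring how the grouping affects the model.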

5.3.3 Deck

# The first letter of Cabin identifies the deck. It also reflects which group a passenger belongs to and may be related to survival, since the position of a cabin mattered when the ship struck the iceberg
full['Cabin'].unique()
array(['U', 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       ...
       'C51', 'C97', 'D22', 'B10', 'E45', 'E52', 'A11', 'B11', 'C80',
       'C89', 'F E46', 'B26', 'F E57', 'A18', 'E60', 'E39 E41',
       'B52 B54 B56', 'C39', 'B24', 'D40', 'D38', 'C105'], dtype=object)
full['Deck']=full['Cabin'].map(lambda x:x[0])   # take the first letter of Cabin
sns.barplot(data=full,x='Deck',y='Survived')  # survival rate by deck: higher for decks B, D and E, lower for U and T
# plt.savefig('./14-Deck与Survived关系.png',dpi = 200)

5.3.4 Number of passengers sharing a ticket

# The number of passengers on the same ticket number varies and may also be related to survival
TickCountDict=full['Ticket'].value_counts()  # number of passengers per ticket number
TickCountDict.head(20) 
CA. 2343        11
CA 2144          8
1601             8
...
349909           5
PC 17760         4
Name: Ticket, dtype: int64
full['TickCom']=full['Ticket'].map(TickCountDict)  # add the per-ticket passenger count to the dataset
full['TickCom'].head()
sns.barplot(data=full,x='TickCom',y='Survived')  # TickCom against Survived: survival is higher when TickCom is moderate

def TickCountGroup(num):
    # split TickCom into three groups
    if 2 <= num <= 4:
        return 0
    elif num == 1 or 5 <= num <= 8:
        return 1
    else:
        return 2
full['TickGroup']=full['TickCom'].map(TickCountGroup)  # TickGroup category of each passenger
sns.barplot(data=full,x='TickGroup',y='Survived') # TickGroup against Survived

5.3.5 Filling missing Age values with a random forest

# Look at the correlation between Age and Parch, Pclass, Sex, SibSp, Title, familyNum, familySize, Deck, TickCom, TickGroup and so on, and pick the more strongly correlated variables for the prediction model
full[full['Age'].notnull()].corr(numeric_only=True)  # numeric columns only (needed from pandas 2.0, where text columns raise an error)

#(1) Select the features
agePre=full[['Age','Parch','Pclass','SibSp','familyNum','TickCom','Title']]  
agePre=pd.get_dummies(agePre)  # one-hot encoding
ageCorrDf=agePre.corr()
ageCorrDf['Age'].sort_values()
Pclass          -0.408106
Title_Master    -0.385380
Title_Miss      -0.282977
SibSp           -0.243699
familyNum       -0.240229
TickCom         -0.185284
Parch           -0.150917
Title_Royalty    0.057337
Title_Officer    0.166771
Title_Mr         0.183965
Title_Mrs        0.215091
Age              1.000000
Name: Age, dtype: float64
agePre.head()

#(2) Split the data
ageKnown=agePre[agePre['Age'].notnull()]   # rows with a known Age, used to learn the pattern
ageUnKnown=agePre[agePre['Age'].isnull()]  # rows whose Age will be filled in
ageKnown_X=ageKnown.drop(['Age'],axis=1)
ageKnown_y=ageKnown['Age']
ageUnKnown_X=ageUnKnown.drop(['Age'],axis=1) # features of the rows to predict
#(3) Build the model
from sklearn.ensemble import RandomForestRegressor # random forest model
rfr=RandomForestRegressor(random_state=None,n_estimators=500,n_jobs=-1)
rfr.fit(ageKnown_X,ageKnown_y)
RandomForestRegressor(n_estimators=500, n_jobs=-1)
#(4) Model score
score = rfr.score(ageKnown_X,ageKnown_y)  # R^2 measured on the same rows the model was fitted to
print('Age model score:',score)
#(5) Predict the ages
ageUnKnown_predict = rfr.predict(ageUnKnown_X)
#(6) Fill in the predictions
full.loc[full['Age'].isnull(),['Age']] = ageUnKnown_predict
full.info()   # no feature has missing values any more (Survived is missing only for the test rows)
Age model score: 0.5861009056104122
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1309 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1309 non-null   float64
 10  Cabin        1309 non-null   object 
 11  Embarked     1309 non-null   object 
 12  Title        1309 non-null   object 
 13  familyNum    1309 non-null   int64  
 14  familySize   1309 non-null   int64  
 15  Deck         1309 non-null   object 
 16  TickCom      1309 non-null   int64  
 17  TickGroup    1309 non-null   int64  
dtypes: float64(3), int64(8), object(7)
memory usage: 184.2+ KB
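
The score reported for the age model is R^2 computed on the rows the forest was trained on, which tends to be optimistic. A cross-validated R^2 gives a fairer picture of how well unseen ages would be predicted. The sketch below illustrates the comparison on synthetic data (the features, coefficients and noise level are all invented), so the numbers it prints say nothing about the Titanic model itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-in: a target that depends on two discrete features plus noise
X = rng.integers(0, 4, size=(300, 2)).astype(float)
y = 20 + 8 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 5, size=300)

rfr = RandomForestRegressor(n_estimators=100, random_state=0)
train_r2 = rfr.fit(X, y).score(X, y)                           # R^2 on the training rows
cv_r2 = cross_val_score(rfr, X, y, cv=5, scoring='r2').mean()  # mean R^2 on held-out folds
print('training R^2: %.3f, 5-fold CV R^2: %.3f' % (train_r2, cv_r2))
```

On the real data the same two lines can be applied to ageKnown_X and ageKnown_y; fixing random_state also makes the imputed ages reproducible between runs.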

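5.4 Group identification

5.4.1 A useful prediction model can be built from the relationships between the existing features and the label, but some passengers who clearly share a common trait may not follow the logic of the overall model. If passengers with such a group effect are identified and their data adjusted, model accuracy can improve. In the Titanic case the question is whether passengers with the same surname show a clear group effect.

5.4.2 Two subsets are extracted and checked for a surname group effect (sex and age are the features most closely tied to survival, so they are used as the split conditions):
(1) Males over 12: find the surnames whose male members all survived
(2) Females and children aged 12 or under: find the surnames whose women and children all died
full.head()

#(1) Extract each passenger's surname and the number of passengers with that surname
full['Surname']=full['Name'].map(lambda x:x.split(',')[0].strip())
SurNameDict=full['Surname'].value_counts()
full['SurnameNum']=full['Surname'].map(SurNameDict)
#(2) Males over 12 travelling with family
MaleDf=full[(full['Sex']=='male')&(full['Age']>12)&(full['familyNum']>=2)]
#(3) Analyze the group effect among males
MSurNamDf=MaleDf['Survived'].groupby(MaleDf['Surname']).mean()
MSurNamDf.head()
MSurNamDf.value_counts()
0.0    89
1.0    19
0.5     3
Name: Survived, dtype: int64
# Most males sharing a surname either all survived or all died. This group effect is used to adjust the data of males in surnames with a survival rate of 1, raising the chance that they are predicted as survivors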
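
To show what the surname statistics measure, here is a minimal sketch on an invented six-row table (the surnames and outcomes are made up). The mean of Survived within a surname is 1.0 when everyone in the group survived, 0.0 when nobody did, and something in between for mixed groups.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    'Surname':  ['Smith', 'Smith', 'Jones', 'Jones', 'Brown', 'Brown'],
    'Survived': [1, 1, 0, 0, 1, 0],
})

# Mean survival per surname
rate_by_surname = toy.groupby('Surname')['Survived'].mean()
# How many surnames fall at each rate
rate_counts = rate_by_surname.value_counts()
# Surnames whose members all survived
all_survived = rate_by_surname[rate_by_surname == 1].index.tolist()

print(rate_counts)
print(all_survived)
```

When most surnames land at exactly 0.0 or 1.0, as in the counts reported above, members of a group tend to share the same fate.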

5.4.3 Group effect among women and children
#(1) Extract each passenger's surname and the number of passengers with that surname
full['Surname']=full['Name'].map(lambda x:x.split(',')[0].strip())
SurNameDict=full['Surname'].value_counts()
full['SurnameNum']=full['Surname'].map(SurNameDict)
#(2) Women and children aged 12 or under travelling with family
FemChildDf=full[((full['Sex']=='female')|(full['Age']<=12))&(full['familyNum']>=2)]
FCSurNamDf=FemChildDf['Survived'].groupby(FemChildDf['Surname']).mean()
FCSurNamDf.head()
FCSurNamDf.value_counts()
#(3) As with the males, women and children sharing a surname mostly either all survived or all died. This group effect is used to adjust the data of women and children in surnames with a survival rate of 0, raising the chance that they are predicted as not surviving.

5.4.4 Adjust the two groups of test-set records for these surnames:
(1) Males: set Sex to female and Age to 5
(2) Women and children: set Sex to male and Age to 60
#(1) Surnames with a survival rate of 1
MSurNamDict=MSurNamDf[MSurNamDf.values==1].index
MSurNamDict
#(2) Surnames with a survival rate of 0
FCSurNamDict=FCSurNamDf[FCSurNamDf.values==0].index
FCSurNamDict
#(3) Build both row masks before changing anything: once Sex has been overwritten, a condition that tests Sex no longer selects the same rows
maleMask=(full['Survived'].isnull())&(full['Surname'].isin(MSurNamDict))&(full['Sex']=='male')
femChildMask=(full['Survived'].isnull())&(full['Surname'].isin(FCSurNamDict))&((full['Sex']=='female')|(full['Age']<=12))
#(4) Adjust the males in these surnames: Sex becomes female, Age becomes 5
full.loc[maleMask,'Sex']='female'
full.loc[maleMask,'Age']=5
#(5) Adjust the women and children in these surnames: Sex becomes male, Age becomes 60
full.loc[femChildMask,'Sex']='male'
full.loc[femChildMask,'Age']=60
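
One subtlety with paired .loc assignments of this kind: a Boolean condition that tests a column is evaluated afresh every time it is written out, so if the first assignment changes that column, repeating the condition no longer selects the same rows. The toy sketch below (invented rows, hypothetical helper make_toy) contrasts repeating the condition with computing the mask once.

```python
import pandas as pd

def make_toy():
    # Invented rows for illustration only
    return pd.DataFrame({'Sex': ['male', 'male', 'female'], 'Age': [30.0, 40.0, 25.0]})

# Variant 1: the condition is evaluated again after Sex has already changed
a = make_toy()
a.loc[a['Sex'] == 'male', 'Sex'] = 'female'
a.loc[a['Sex'] == 'male', 'Age'] = 5      # no row is 'male' any more, so nothing is updated

# Variant 2: the mask is computed once and reused for both columns
b = make_toy()
mask = b['Sex'] == 'male'
b.loc[mask, 'Sex'] = 'female'
b.loc[mask, 'Age'] = 5

print(a['Age'].tolist())
print(b['Age'].tolist())
```

Only the second variant changes Age for the rows that were male at the start, which is the behavior the description "set Sex to female and Age to 5" calls for.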

5.5 Subset selection

5.5.1 The processing so far has added dimensions to the data, so the feature set is reduced to keep it effective. Features that correlate more strongly with the label Survived are kept, and redundant or weakly correlated ones are dropped
fullSel=full.drop(['Cabin','Name','Ticket','PassengerId','Surname','SurnameNum'],axis=1)  # manual selection
corrDf=fullSel.corr(numeric_only=True)  # correlation of each numeric feature with the label
corrDf['Survived'].sort_values(ascending=True)
Pclass       -0.338481
TickGroup    -0.319278
Age          -0.059466
SibSp        -0.035322
familyNum     0.016639
TickCom       0.064962
Parch         0.081629
familySize    0.108631
Fare          0.331805
Survived      1.000000
Name: Survived, dtype: float64

5.5.2 Heatmap of the correlation between Survived and the other features
plt.figure(figsize=(8,8))   
sns.heatmap(fullSel[['Survived','Age','Embarked','Fare','Parch','Pclass','Sex','SibSp','Title','familyNum','familySize','Deck','TickCom','TickGroup']].corr(numeric_only=True),cmap='BrBG',annot=True,linewidths=.5)  # the text columns (Embarked, Sex, Title, Deck) are left out of the correlation
_ = plt.xticks(rotation=45)

fullSel=fullSel.drop(['Age','Parch','SibSp','familyNum','TickCom'],axis=1)  # drop the features with low correlation
fullSel=pd.get_dummies(fullSel)  # one-hot encoding
fullSel.head()

6. Model selection

This project compares the results of several machine learning algorithms, including SVC, Decision Tree, Gradient Boosting, LDA, KNN and Logistic Regression, compares the better performers in more detail, and finally chooses Gradient Boosting to predict passenger survival.

6.1 Building the models

Algorithms considered:
(1) SVC
(2) DecisionTree
(3) ExtraTrees
(4) GradientBoosting
(5) RandomForest
(6) KNN
(7) LogisticRegression
(8) LinearDiscriminantAnalysis
(9) XGBoost
#(1) Split into training data and data to predict
experData=fullSel[fullSel['Survived'].notnull()]  # training data
preData=fullSel[fullSel['Survived'].isnull()]   # data to predict
experData_X=experData.drop('Survived',axis=1)
experData_y=experData['Survived']
preData_X=preData.drop('Survived',axis=1)  # rows without a label
#(2) Imports
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,ExtraTreesClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV,cross_val_score,StratifiedKFold
#(3) Set up kfold: stratified cross-validation splits
kfold=StratifiedKFold(n_splits=10)
#(4) Collect the different models
classifiers=[]
classifiers.append(SVC())
classifiers.append(DecisionTreeClassifier())
classifiers.append(RandomForestClassifier())
classifiers.append(ExtraTreesClassifier())
classifiers.append(GradientBoostingClassifier())
classifiers.append(KNeighborsClassifier())
classifiers.append(LogisticRegression())
classifiers.append(LinearDiscriminantAnalysis())
classifiers.append(XGBClassifier())

6.2 Comparing the algorithms

#(1) Collect the cross-validation results of each algorithm
cv_results=[]
for classifier in classifiers:
    cv_results.append(cross_val_score(classifier,experData_X,experData_y,scoring='accuracy',cv=kfold,n_jobs=-1))
#(2) Mean and standard deviation of each model's scores
cv_means=[]
cv_std=[]
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())
#(3) Summarize
cvResDf=pd.DataFrame({'cv_mean':cv_means,'cv_std':cv_std,'algorithm':['SVC','DecisionTreeCla','RandomForestCla','ExtraTreesCla','GradientBoostingCla','KNN','LR','LDA','Xgboost']})
cvResDf
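
The comparison loop can be tried end to end on synthetic data. The sketch below is an illustration under stated assumptions, not the author's setup: the data comes from make_classification, only two models are compared, and the folds are shuffled with a fixed random_state so that repeated runs use the same splits.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (a stand-in for experData_X / experData_y)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Stratified folds keep the class ratio the same in every split
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    'LR': LogisticRegression(max_iter=1000),
    'DecisionTree': DecisionTreeClassifier(random_state=0),
}
summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=kfold)
    summary[name] = (scores.mean(), scores.std())
    print('%s: %.3f +/- %.3f' % (name, scores.mean(), scores.std()))
```

Reporting the standard deviation next to the mean helps judge whether a gap between two models is larger than the fold-to-fold variation.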

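6.3 Visualizing the comparison

cvResSorted=cvResDf.sort_values(by='cv_mean',ascending=False)
bar = sns.barplot(data=cvResSorted,x='cv_mean',y='algorithm',**{'xerr':cvResSorted['cv_std'].values})  # error bars come from the sorted frame so that they line up with the bars
bar.set(xlim = (0.7,0.9))  # SVC, LR, LDA and GradientBoostingCla perform well on this problem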

7. Model tuning

Based on the results above, four models (SVC, LDA, GradientBoostingCla and LR) are compared further. Each is built and tuned in turn.

7.1 Gradient boosting

GBC = GradientBoostingClassifier()
# loss is left at its default, the log loss (named 'deviance' before scikit-learn 1.1 and 'log_loss' afterwards)
gb_param_grid = {'n_estimators' : [100,200,300],'learning_rate': [0.1, 0.05, 0.01],'max_depth': [4, 8],'min_samples_leaf': [100,150],'max_features': [0.3, 0.1]}
modelgsGBC = GridSearchCV(GBC,param_grid = gb_param_grid, cv=kfold, scoring="accuracy", n_jobs= -1, verbose = 1)
modelgsGBC.fit(experData_X,experData_y)
modelgsGBC.best_score_
Fitting 10 folds for each of 72 candidates, totalling 720 fits
0.838414481897628
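
Besides best_score_, a fitted GridSearchCV object also exposes best_params_ and a cv_results_ table, which show which combination won and how close the others came. The following sketch runs a deliberately tiny grid on synthetic data (the data and grid values are assumptions chosen to keep it fast) and is meant only to illustrate how to read those attributes.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic data and a deliberately small grid, for illustration only
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
grid = {'n_estimators': [20, 40], 'max_depth': [2, 3]}

search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid=grid,
                      cv=StratifiedKFold(n_splits=3), scoring='accuracy')
search.fit(X, y)

# Winning combination and its mean cross-validated accuracy
print(search.best_params_, round(search.best_score_, 3))

# One row per candidate, best first
results = pd.DataFrame(search.cv_results_)[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(results.sort_values('rank_test_score'))
```

If the best value of a parameter sits at the edge of its list, that is usually a hint to widen the grid in that direction.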

7.2 Logistic regression

modelLR=LogisticRegression(solver='liblinear')  # liblinear supports both the l1 and l2 penalties searched below; the default lbfgs solver does not support l1
LR_param_grid = {'C' : [1,2,3],'penalty':['l1','l2']}
modelgsLR = GridSearchCV(modelLR,param_grid = LR_param_grid, cv=kfold, scoring="accuracy", n_jobs= -1, verbose = 1)
modelgsLR.fit(experData_X,experData_y)
modelgsLR.best_score_
Fitting 10 folds for each of 6 candidates, totalling 60 fits
0.830561797752809

7.3 SVC (support vector classifier)

svc = SVC()
svc_param_grid = {'C' : [0.1,0.5,1,2,3,5,10],'kernel':['rbf','poly','sigmoid']}
modelgsSVC = GridSearchCV(svc,param_grid = svc_param_grid, cv=kfold, scoring="accuracy", n_jobs= -1, verbose = 1)
modelgsSVC.fit(experData_X,experData_y)
modelgsSVC.best_score_
Fitting 10 folds for each of 21 candidates, totalling 210 fits
0.8350187265917602

7.4 LDA model

lda = LinearDiscriminantAnalysis()
lda_param_grid = {'solver' : ['svd', 'lsqr', 'eigen'],'tol':[0.000001,0.00001,0.0001,0.001,0.01]}
modelgsLDA = GridSearchCV(lda,param_grid = lda_param_grid, cv=kfold, scoring="accuracy", n_jobs= -1, verbose = 1)
modelgsLDA.fit(experData_X,experData_y)
modelgsLDA.best_score_
Fitting 10 folds for each of 15 candidates, totalling 150 fits
0.8283270911360798

8. Model evaluation

8.1 Accuracy

# Best cross-validated accuracy of each tuned model, as found in section 7
print('GBC:',modelgsGBC.best_score_)
print('LR:',modelgsLR.best_score_)
print('SVC:',modelgsSVC.best_score_)
print('LDA:',modelgsLDA.best_score_)
GBC: 0.838414481897628
LR: 0.830561797752809
SVC: 0.8350187265917602
LDA: 0.8283270911360798
# The GBC model has the highest score (accuracy)

8.2 ROC curves

from sklearn.metrics import roc_curve, auc  # ROC curve and AUC

def plot_roc(y_true, y_pred, title):
    # draw the ROC curve for one model and return its AUC
    fpr,tpr,threshold = roc_curve(y_true, y_pred)  # false positive rate and true positive rate
    roc_auc = auc(fpr,tpr)  # area under the curve
    lw = 2
    plt.figure(figsize=(10,10))
    plt.plot(fpr, tpr, color='r',lw=lw, label='ROC curve (area = %0.3f)' % roc_auc)  # x axis: false positive rate, y axis: true positive rate
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(title)
    plt.legend(loc="lower right")
    plt.show()
    return roc_auc

# Note: the predictions are hard 0/1 labels made on the training data, so each curve has a single operating point
modelgsGBCtestpre_y=modelgsGBC.predict(experData_X).astype(int)   # ROC curve of the GBDT model
plot_roc(experData_y, modelgsGBCtestpre_y, 'Titanic GradientBoostingClassifier Model')

modelgsLRtestpre_y=modelgsLR.predict(experData_X).astype(int)   # ROC curve of the LR model
plot_roc(experData_y, modelgsLRtestpre_y, 'Titanic LogisticRegression Model')

modelgsSVCtestpre_y=modelgsSVC.predict(experData_X).astype(int)  # ROC curve of the SVC model
plot_roc(experData_y, modelgsSVCtestpre_y, 'Titanic SVC Model')
# The ROC curves of GBDT, LR and SVC all bend toward the upper left, with AUC values of 0.838, 0.825 and 0.818 respectively, so the GBDT model performs best
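
The curves above are drawn from 0/1 predictions on the training data. An ROC curve is usually traced from continuous scores instead, so that every threshold contributes a point, and it is usually evaluated on data the model has not seen. The sketch below illustrates the difference on synthetic data with a held-out split; the data, the split and the default model settings are assumptions for illustration, and the resulting numbers are unrelated to the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data with a held-out part for evaluation
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

hard = clf.predict(X_te)                # 0/1 labels: a single operating point
proba = clf.predict_proba(X_te)[:, 1]   # probability of class 1: many thresholds

fpr_hard, tpr_hard, _ = roc_curve(y_te, hard)
fpr_prob, tpr_prob, _ = roc_curve(y_te, proba)
auc_prob = roc_auc_score(y_te, proba)

print('points on the curve from labels:', len(fpr_hard))
print('points on the curve from probabilities:', len(fpr_prob))
print('AUC from probabilities: %.3f' % auc_prob)
```

For SVC, decision_function can play the role of predict_proba, since roc_curve only needs scores that rank the samples.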

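8.3 Confusion matrices

from sklearn.metrics import confusion_matrix
print('GradientBoostingClassifier confusion matrix:\n',confusion_matrix(experData_y,modelgsGBCtestpre_y))
print('LogisticRegression confusion matrix:\n',confusion_matrix(experData_y,modelgsLRtestpre_y))
print('SVC confusion matrix:\n',confusion_matrix(experData_y,modelgsSVCtestpre_y))
GradientBoostingClassifier confusion matrix:
 [[502  47]
 [ 82 260]]
LogisticRegression confusion matrix:
 [[480  69]
 [ 77 265]]
SVC confusion matrix:
 [[492  57]
 [ 89 253]]
From the scikit-learn documentation of confusion_matrix: thus in binary classification, the count of true negatives is $C_{0,0}$, false negatives is $C_{1,0}$, true positives is $C_{1,1}$ and false positives is $C_{0,1}$.

8.3.1 What the confusion matrices show. The rates below treat class 0 (did not survive) as the positive class, so TPR is $C_{0,0}/(C_{0,0}+C_{0,1})$ and FPR is $C_{1,0}/(C_{1,0}+C_{1,1})$
(1) GBDT model: TPR is 502/(502+47)=0.914, FPR is 82/(82+260)=0.240
(2) LR model: TPR is 0.874, FPR is 0.225
(3) SVC model: TPR is 0.896, FPR is 0.260
(4) GBC has the highest TPR of the three, and its FPR lies between those of LR and SVC. Taking everything into account, this project uses the GBC model for prediction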
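
The rates can be recomputed directly from the three matrices printed above. Following scikit-learn's layout (rows are true classes, columns are predicted classes), the sketch below reports the recall of each class separately, so the result does not depend on which class is labeled "positive". The dictionary and key names are chosen here for illustration.

```python
import numpy as np

# Confusion matrices as printed above (rows: true class 0/1, columns: predicted 0/1)
matrices = {
    'GBC': np.array([[502, 47], [82, 260]]),
    'LR':  np.array([[480, 69], [77, 265]]),
    'SVC': np.array([[492, 57], [89, 253]]),
}

rates = {}
for name, cm in matrices.items():
    tn, fp, fn, tp = cm.ravel()   # class 1 (survived) taken as positive, as scikit-learn does
    rates[name] = {
        'recall_survived': tp / (tp + fn),   # share of survivors predicted as survivors
        'recall_died': tn / (tn + fp),       # share of non-survivors predicted as non-survivors
        'accuracy': (tp + tn) / cm.sum(),
    }
    print(name, {k: round(float(v), 3) for k, v in rates[name].items()})
```

Reading both recalls side by side shows that all three models recognize non-survivors more reliably than survivors on the training data.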

9. Prediction

y_ =modelgsGBC.predict(preData_X)
y_ = y_.astype(int)
GBCpreResultDf=pd.DataFrame()  # collect the predictions
GBCpreResultDf['PassengerId']=full['PassengerId'][full['Survived'].isnull()]
GBCpreResultDf['Survived']= y_
GBCpreResultDf
GBCpreResultDf.to_csv('./lufengkun_titanic.csv',index=False) # export the predictions to a csv file
display(GBCpreResultDf.head())

test  # display the original test set

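9.1 Uploading the predictions

After the results were uploaded to Kaggle, the final score was 0.79186, ranking in roughly the top 3%.

9.2 Project summary

This Kaggle project drew on the analysis approaches and data-handling techniques of many other competition solutions, such as accounting for group effects, log-transforming the data and comparing several models. The model was improved in three main ways to raise accuracy:
(1) Model selection: several models were built and first compared by score, and the final prediction model was chosen by weighing several performance measures
(2) Feature mining and selection: new features were derived, the accuracy of the model was tested with different feature choices, and the final training feature set was picked on that basis
(3) Data tidying: how missing values are filled and how records that do not fit their group are handled also directly affects the final predictions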
