Machine Learning: A Titanic Survival Prediction Case Study
Introduction: I have been studying machine learning for a while now. On Kaggle I came across a classic beginner exercise, the Titanic disaster competition, so today I am working through it myself and writing up my notes along the way.
I. First, download the data from the Kaggle competition page
After downloading and extracting the archive, we get three files: the training set train.csv, the test set test.csv, and gender_submission.csv, a sample submission mapping each passenger Id to a Survived value.
With that, the project data is ready.
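If you want a quick first look at what we downloaded, here is a minimal sketch (the shapes shown in the comments are the standard ones for this competition):
import pandas as pd
print(pd.read_csv("train.csv").shape)               # (891, 12)
print(pd.read_csv("test.csv").shape)                # (418, 11) -- no Survived column
print(pd.read_csv("gender_submission.csv").head())  # PassengerId plus an example Survived column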
II. Feature extraction (data analysis and cleaning)
A note up front: I used to do application development, and many developers like me are used to receiving config tables that are already prepared and can be plugged straight into business logic. Machine learning is different. The main workflow is feature extraction, model building, model training, and model evaluation, and feature extraction is the critical step: everything that follows is built on top of it.
1. Create a project folder named taitan and put the three data files and one code file (taitan.py) in it. All the code below goes into taitan.py (it is a small project, so there is no need for multiple files).
Import the libraries we need for feature extraction:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
2. First, let's take a look at the data in train.csv
trainData = pd.read_csv("train.csv")
trainData.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
From the output above we can read off the following:
a. Setting aside Survived, which is the value we want to predict, we have 11 candidate features; we will assess their importance below.
b. There are 891 passenger records in total. Age is missing 177 values; Cabin has only 204 values, so most are missing; Embarked is missing just 2.
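A quick way to get those missing counts directly (a small sketch; the numbers follow from the info() output above):
print(trainData.isnull().sum())
# Age         177
# Cabin       687
# Embarked      2
# (all other columns: 0)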
3. Let's go through each feature's relationship with Survived, one at a time
a. PassengerId: the passenger's index number; this should have little to do with Survived.
(We will skip Name for now; it is a special case and we will come back to it.)
b. Pclass: the ticket class, effectively a proxy for social standing. First, look at the Pclass values:
print(trainData['Pclass'])
0      3
1      1
2      3
3      1
4      3
      ..
886    2
887    1
888    3
889    1
890    3
Name: Pclass, Length: 891, dtype: int64
Pclass is an int column whose values are all in {1, 2, 3}.
Now let's look at the relationship between Pclass and Survived:
pclass = trainData['Pclass'].groupby(trainData['Survived'])
print(pclass.value_counts().unstack())
Output:
Pclass      1   2    3
Survived
0          80  97  372
1         136  87  119
pclass = trainData['Pclass'].groupby(trainData['Survived'])
pclass.value_counts().unstack().plot(kind='bar')
plt.show()
When Pclass = 1, more passengers survived than died;
when Pclass = 2, the numbers of survivors and deaths are roughly equal;
when Pclass = 3, far more passengers died than survived.
So Pclass clearly matters for Survived. Presumably wealthier passengers had better-placed, less crowded cabins, which made escape easier.
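The same table expressed as survival rates (a small sketch; the rates follow directly from the counts above):
print(trainData.groupby('Pclass')['Survived'].mean())
# Pclass
# 1    0.629630
# 2    0.472826
# 3    0.242363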
c. Sex: gender. As with Pclass, let's go straight to the numbers and the chart
grouped = trainData['Sex'].groupby(trainData['Survived'])
print(grouped.value_counts().unstack())
Output:
Sex       female  male
Survived
0             81   468
1            233   109
grouped = trainData['Sex'].groupby(trainData['Survived'])
grouped.value_counts().unstack().plot(kind='bar')
plt.show()
The chart and the numbers together show that among the dead there are few women and many men, while among the survivors it is the reverse, and the gap is large. Gender has a very strong effect on survival.
It brings to mind the phrase: women and children first!
Sex is not numeric, and since machine learning models work on numbers, we one-hot encode it into two new columns ("Sex_female", "Sex_male"): Sex_female=1, Sex_male=0 means female; Sex_female=0, Sex_male=1 means male. We then append these columns to the original DataFrame.
The code:
dummies_Sex = pd.get_dummies(trainData['Sex'], prefix='Sex').astype("int64")
trainData = pd.concat([trainData, dummies_Sex], axis=1)  # merge the new columns back into trainData
trainData.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
 12  Sex_female   891 non-null    int64
 13  Sex_male     891 non-null    int64
dtypes: float64(2), int64(7), object(5)
memory usage: 97.6+ KB
(We would normally analyze Age next, but Age has missing values, so we defer it until later.)
d. SibSp and Parch: the number of siblings/spouses aboard and the number of parents/children aboard, respectively
First, SibSp versus Survived:
grouped = trainData['SibSp'].groupby(trainData['Survived'])
print(grouped.value_counts().unstack())
Output:
SibSp         0      1     2     3     4    5    8
Survived
0         398.0   97.0  15.0  12.0  15.0  5.0  7.0
1         210.0  112.0  13.0   4.0   3.0  NaN  NaN
grouped = trainData['SibSp'].groupby(trainData['Survived'])
grouped.value_counts().unstack().plot(kind='bar')
plt.show()
Passengers with one or two siblings/spouses aboard survived at a noticeably higher rate than those with none; beyond two, survival drops off.
Now Parch versus Survived:
grouped = trainData['Parch'].groupby(trainData['Survived'])
print(grouped.value_counts().unstack())
Output:
Parch         0     1     2    3    4    5    6
Survived
0         445.0  53.0  40.0  2.0  4.0  4.0  1.0
1         233.0  65.0  40.0  3.0  NaN  1.0  NaN
grouped = trainData['Parch'].groupby(trainData['Survived'])
grouped.value_counts().unstack().plot(kind='bar')
plt.show()
Again, passengers with one to three parents/children aboard survived at a higher rate than those with none, and the rate falls off for larger values.
Let's merge the Parch and SibSp columns into a single s_p feature, the size of the family aboard (counting the passenger themselves, hence the +1):
trainData["s_p"] = trainData["SibSp"] + trainData["Parch"] + 1  # both source columns are complete int64 data, so no extra cleaning is needed
trainData.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
 12  Sex_female   891 non-null    int64
 13  Sex_male     891 non-null    int64
 14  s_p          891 non-null    int64
dtypes: float64(2), int64(8), object(5)
memory usage: 104.5+ KB
Now, let's look at the relationship between s_p and Survived.
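A quick sketch, following the same pattern as the earlier features (plot output omitted):
grouped = trainData['s_p'].groupby(trainData['Survived'])
print(grouped.value_counts().unstack())
grouped.value_counts().unstack().plot(kind='bar')
plt.show()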
Looking at Parch, SibSp, and s_p together: traveling with a small family raises the survival rate, but once the group size exceeds 4 it falls again. You can imagine that having a few people helping each other improves the odds of being rescued, but with too many, trying to save one or two others is manageable while trying to save more may drag everyone down.
e. Embarked: the port of embarkation
Straight to the numbers and the chart:
grouped = trainData['Embarked'].groupby(trainData['Survived'])
print(grouped.value_counts().unstack())
Output:
Embarked   C   Q    S
Survived
0         75  47  427
1         93  30  217
grouped = trainData['Embarked'].groupby(trainData['Survived'])
grouped.value_counts().unstack().plot(kind='bar')
plt.show()
From the chart, the survival rate is about 55% for port C, about 39% for port Q, and about 34% for port S, so Embarked does carry some signal about Survived.
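The rates can be computed directly (a small sketch; the values follow from the counts above):
print(trainData.groupby('Embarked')['Survived'].mean())
# Embarked
# C    0.553571
# Q    0.389610
# S    0.336957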
But Embarked is not numeric, so as with Sex we have to convert it before training; first, though, the 2 missing Embarked values need to be filled in.
Pclass (ticket class) and Fare (ticket price) should be related to Embarked (port of embarkation): higher-status passengers likely cluster at certain ports and fare levels. Let's look at the three together:
print(trainData.groupby(by=["Pclass","Embarked"]).Fare.median())
Pclass  Embarked
1       C           78.2667
        Q           90.0000
        S           52.0000
2       C           24.0000
        Q           12.3500
        S           13.5000
3       C            7.8958
        Q            7.7500
        S            8.0500
Name: Fare, dtype: float64
Now look at the two passengers with missing Embarked:
print(trainData[pd.isna(trainData["Embarked"])])
     PassengerId  Survived  Pclass  ...  Sex_male  s_p
61            62         1       1  ...         0    1
829          830         1       1  ...         0    1

[2 rows x 15 columns]
Both passengers have Pclass 1 and both paid a Fare of 80. Comparing that with the median fares by Pclass and Embarked above (the Pclass 1 / port C median of 78.27 is the closest match), we can fill both missing Embarked values with C.
trainData['Embarked'] = trainData['Embarked'].fillna('C')
trainData.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     891 non-null    object
 12  Sex_female   891 non-null    int64
 13  Sex_male     891 non-null    int64
 14  s_p          891 non-null    int64
dtypes: float64(2), int64(8), object(5)
memory usage: 104.5+ KB
Next, one-hot encode Embarked:
dummies_Embarked = pd.get_dummies(trainData['Embarked'], prefix='Embarked').astype("int64")
trainData = pd.concat([trainData, dummies_Embarked], axis=1)  # merge the new columns back into trainData
trainData.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     891 non-null    object
 12  Sex_female   891 non-null    int64
 13  Sex_male     891 non-null    int64
 14  s_p          891 non-null    int64
 15  Embarked_C   891 non-null    int64
 16  Embarked_Q   891 non-null    int64
 17  Embarked_S   891 non-null    int64
dtypes: float64(2), int64(11), object(5)
memory usage: 125.4+ KB
f. Name
Usually a name has little predictive value, but the Name feature here is different: it contains titles such as Mr, Mrs, and Master, which mark a degree of social standing. We group these titles as follows:
'Capt', 'Col', 'Major', 'Dr', 'Rev' -> "Officer"
'Don', 'Sir', 'the Countess', 'Dona', 'Lady' -> "Royalty"
'Mme', 'Ms', 'Mrs' -> "Mrs"
'Mlle', 'Miss' -> "Miss"
'Master', 'Jonkheer' -> "Master"
'Mr' -> "Mr" (unchanged)
# the title sits between the comma and the period, e.g. "Braund, Mr. Owen Harris"
trainData['Name_flag'] = trainData['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
trainData['Name_flag'].replace(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer', inplace=True)
trainData['Name_flag'].replace(['Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty', inplace=True)
trainData['Name_flag'].replace(['Mme', 'Ms', 'Mrs'], 'Mrs', inplace=True)
trainData['Name_flag'].replace(['Mlle', 'Miss'], 'Miss', inplace=True)
trainData['Name_flag'].replace(['Master', 'Jonkheer'], 'Master', inplace=True)
# 'Mr' already maps to itself, so no replacement is needed
trainData.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     891 non-null    object
 12  Sex_female   891 non-null    int64
 13  Sex_male     891 non-null    int64
 14  s_p          891 non-null    int64
 15  Embarked_C   891 non-null    int64
 16  Embarked_Q   891 non-null    int64
 17  Embarked_S   891 non-null    int64
 18  Name_flag    891 non-null    object
dtypes: float64(2), int64(11), object(6)
memory usage: 132.4+ KB
Now let's check how Name_flag correlates with Survived:
grouped = trainData['Name_flag'].groupby(trainData['Survived'])
print(grouped.value_counts().unstack())
Name_flag  Master  Miss   Mr  Mrs  Officer  Royalty
Survived
0              18    55  436   26       13        1
1              23   129   81  101        5        3
grouped = trainData['Name_flag'].groupby(trainData['Survived'])
grouped.value_counts().unstack().plot(kind='bar')
plt.show()
From these numbers, Miss and Mrs survived at much higher rates than the other groups.
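For reference, the exact rates (a small sketch; the values follow from the table above):
print(trainData.groupby('Name_flag')['Survived'].mean())
# Name_flag
# Master     0.560976
# Miss       0.701087
# Mr         0.156673
# Mrs        0.795276
# Officer    0.277778
# Royalty    0.750000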
Next, one-hot encode Name_flag:
dummies_Name = pd.get_dummies(trainData['Name_flag'], prefix='Name_flag').astype("int64")
trainData = pd.concat([trainData, dummies_Name], axis=1)  # merge the new columns back into trainData
trainData.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 25 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   PassengerId        891 non-null    int64
 1   Survived           891 non-null    int64
 2   Pclass             891 non-null    int64
 3   Name               891 non-null    object
 4   Sex                891 non-null    object
 5   Age                714 non-null    float64
 6   SibSp              891 non-null    int64
 7   Parch              891 non-null    int64
 8   Ticket             891 non-null    object
 9   Fare               891 non-null    float64
 10  Cabin              204 non-null    object
 11  Embarked           891 non-null    object
 12  Sex_female         891 non-null    int64
 13  Sex_male           891 non-null    int64
 14  s_p                891 non-null    int64
 15  Embarked_C         891 non-null    int64
 16  Embarked_Q         891 non-null    int64
 17  Embarked_S         891 non-null    int64
 18  Name_flag          891 non-null    object
 19  Name_flag_Master   891 non-null    int64
 20  Name_flag_Miss     891 non-null    int64
 21  Name_flag_Mr       891 non-null    int64
 22  Name_flag_Mrs      891 non-null    int64
 23  Name_flag_Officer  891 non-null    int64
 24  Name_flag_Royalty  891 non-null    int64
dtypes: float64(2), int64(17), object(6)
memory usage: 174.1+ KB
g. Age
Age is missing 177 of 891 values, too many for a mode or mean fill to be reasonable, so here we estimate the missing ages with a random forest regressor trained on the other features.
from sklearn.ensemble import RandomForestRegressor

train_part = trainData[["Age", 'Pclass', 'Sex_female', 'Sex_male', 'Name_flag_Master', "Name_flag_Miss", "Name_flag_Mr", "Name_flag_Mrs", "Name_flag_Officer", "Name_flag_Royalty"]]
train_part = pd.get_dummies(train_part)  # a no-op here: every selected column is already numeric
age_has = train_part[train_part.Age.notnull()].values  # rows with a known Age
age_no = train_part[train_part.Age.isnull()].values    # rows whose Age we will predict
y = age_has[:, 0]   # Age is column 0
X = age_has[:, 1:]  # the remaining columns are the predictors
rfr_clf = RandomForestRegressor(random_state=60, n_estimators=100, n_jobs=-1)
rfr_clf.fit(X, y)
result = rfr_clf.predict(age_no[:, 1:])
trainData.loc[trainData.Age.isnull(), 'Age'] = result  # write the predictions back
trainData.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 25 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   PassengerId        891 non-null    int64
 1   Survived           891 non-null    int64
 2   Pclass             891 non-null    int64
 3   Name               891 non-null    object
 4   Sex                891 non-null    object
 5   Age                891 non-null    float64
 6   SibSp              891 non-null    int64
 7   Parch              891 non-null    int64
 8   Ticket             891 non-null    object
 9   Fare               891 non-null    float64
 10  Cabin              204 non-null    object
 11  Embarked           891 non-null    object
 12  Sex_female         891 non-null    int64
 13  Sex_male           891 non-null    int64
 14  s_p                891 non-null    int64
 15  Embarked_C         891 non-null    int64
 16  Embarked_Q         891 non-null    int64
 17  Embarked_S         891 non-null    int64
 18  Name_flag          891 non-null    object
 19  Name_flag_Master   891 non-null    int64
 20  Name_flag_Miss     891 non-null    int64
 21  Name_flag_Mr       891 non-null    int64
 22  Name_flag_Mrs      891 non-null    int64
 23  Name_flag_Officer  891 non-null    int64
 24  Name_flag_Royalty  891 non-null    int64
dtypes: float64(2), int64(17), object(6)
memory usage: 174.1+ KB
With that, the missing Age values are filled in.
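As a quick sanity check on the fill, you can plot the completed Age column; it should still look like a plausible age distribution (a minimal sketch):
trainData['Age'].hist(bins=30)  # histogram of the now-complete Age column
plt.show()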
h. Cabin has far too many missing values (687 of 891), so I simply drop it here.
At this point the features in the training data have been cleaned and extracted, and we can move on to training the model.
III. Training the model
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import SGDClassifier
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor

trainData = pd.read_csv("train.csv")

# Sex -> one-hot columns
dummies_Sex = pd.get_dummies(trainData['Sex'], prefix='Sex').astype("int64")
trainData = pd.concat([trainData, dummies_Sex], axis=1)

# family size aboard (including the passenger)
trainData["s_p"] = trainData["SibSp"] + trainData["Parch"] + 1

# fill the two missing Embarked values, then one-hot encode
trainData['Embarked'] = trainData['Embarked'].fillna('C')
dummies_Embarked = pd.get_dummies(trainData['Embarked'], prefix='Embarked').astype("int64")
trainData = pd.concat([trainData, dummies_Embarked], axis=1)

# extract the title from Name and bucket it
trainData['Name_flag'] = trainData['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
trainData['Name_flag'].replace(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer', inplace=True)
trainData['Name_flag'].replace(['Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty', inplace=True)
trainData['Name_flag'].replace(['Mme', 'Ms', 'Mrs'], 'Mrs', inplace=True)
trainData['Name_flag'].replace(['Mlle', 'Miss'], 'Miss', inplace=True)
trainData['Name_flag'].replace(['Master', 'Jonkheer'], 'Master', inplace=True)
dummies_Name = pd.get_dummies(trainData['Name_flag'], prefix='Name_flag').astype("int64")
trainData = pd.concat([trainData, dummies_Name], axis=1)

# impute missing Age values with a random forest
train_part = trainData[["Age", 'Pclass', 'Sex_female', 'Sex_male', 'Name_flag_Master', "Name_flag_Miss", "Name_flag_Mr", "Name_flag_Mrs", "Name_flag_Officer", "Name_flag_Royalty"]]
train_part = pd.get_dummies(train_part)  # a no-op here: every selected column is already numeric
age_has = train_part[train_part.Age.notnull()].values
age_no = train_part[train_part.Age.isnull()].values
y = age_has[:, 0]
X = age_has[:, 1:]
rfr_clf = RandomForestRegressor(random_state=60, n_estimators=100, n_jobs=-1)
rfr_clf.fit(X, y)
trainData.loc[trainData.Age.isnull(), 'Age'] = rfr_clf.predict(age_no[:, 1:])

# assemble the training matrix and shuffle the rows
# (note: the permutation is unseeded, so exact results vary run to run)
train_Model_data = trainData[["Survived", "Age", 'Pclass', 'Sex_female', 'Sex_male', 'Name_flag_Master', "Name_flag_Miss", "Name_flag_Mr", "Name_flag_Mrs", "Name_flag_Officer", "Name_flag_Royalty", "s_p", "Embarked_C", "Embarked_S", "Embarked_Q"]].values
dataLen = len(train_Model_data)
train_x = train_Model_data[:, 1:]
train_y = train_Model_data[:, 0]
shuffle_index = np.random.permutation(dataLen)
train_x = train_x[shuffle_index]
train_y = train_y[shuffle_index]

# 3-fold cross-validation of a logistic-regression SGD classifier
# (recent scikit-learn requires shuffle=True when random_state is set,
# and spells the loss 'log_loss'; older versions used loss='log')
skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
sgd_clf = SGDClassifier(loss='log_loss', random_state=42, max_iter=1000, tol=1e-4)
for train_index, test_index in skfolds.split(train_x, train_y):
    clone_clf = clone(sgd_clf)
    X_train_folds = train_x[train_index]
    y_train_folds = train_y[train_index]
    X_test_folds = train_x[test_index]
    y_test_folds = train_y[test_index]
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_folds)
    print(y_pred)
    n_correct = sum(y_pred == y_test_folds)
    print(n_correct / len(y_pred))
Output:
[297 per-fold predictions of 0. / 1. omitted]
0.7676767676767676
[297 per-fold predictions of 0. / 1. omitted]
0.8249158249158249
[297 per-fold predictions of 0. / 1. omitted]
0.8282828282828283
I did not evaluate this model on the data in test.csv; I only ran 3-fold cross-validation on the train.csv data.
Final result: the three fold accuracies were roughly 0.768, 0.825, and 0.828, with the best fold scoring 0.8282828282828283.
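If you do want to submit to Kaggle, the same feature steps have to be applied to test.csv (including the random-forest Age fill, since test.csv also has missing ages). A minimal sketch, where preprocess() is a hypothetical helper wrapping the feature steps shown above:

testData = preprocess(pd.read_csv("test.csv"))  # hypothetical helper: dummies, s_p, Age fill as above
features = ['Age', 'Pclass', 'Sex_female', 'Sex_male',
            'Name_flag_Master', 'Name_flag_Miss', 'Name_flag_Mr', 'Name_flag_Mrs',
            'Name_flag_Officer', 'Name_flag_Royalty',
            's_p', 'Embarked_C', 'Embarked_S', 'Embarked_Q']
sgd_clf.fit(train_x, train_y)  # refit on all training rows
pred = sgd_clf.predict(testData[features].values).astype(int)
pd.DataFrame({'PassengerId': testData['PassengerId'],
              'Survived': pred}).to_csv('submission.csv', index=False)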