本文主要是介绍Titanic_Data Analysis,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!
本项目提供了两份数据:train.csv文件作为训练构建与生存相关的模型;另一份test.csv文件则用于测试集,用我们构建出来的模型预测生存情况;
PassengerId --Id,具有唯一标识的作用,即每个人对应一个Id
survived --是否幸存 1表示是 0表示否
pclass --船舱等级 1:一等舱 2:二等舱 3:三等舱
Name --姓名,通常西方人的姓名
Sex --性别,female女性 male 男性
Age --年龄
SibSp --同船配偶以及兄弟姐妹的人数
Parch --同船父母或子女的人数
Ticket --船票
Fare --票价
Cabin --舱位
Embarked --登船港口
#读取数据
import pandas as pd
df_train,df_test = pd.read_csv('train.csv'),pd.read_csv('test.csv')
从训练集开始
#查看前五行数据
df_train.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
#查看后5行数据
df_train.tail()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.00 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.00 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.00 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |
#查看数据信息,其中包含数据纬度、数据类型、所占空间等信息
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
数据纬度:891行 X 12列
缺失字段:Age,Cabin,Embarked
数据类型:两个64位浮点型,5个64位整型,5个python对象
#描述性统计
df_train.describe()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
1.除了python对象之外的数据类型,均参与了计算
2.38.4%的人幸存,死亡率很高
3.年龄现有数据714,缺失占比714/891 = 20%
4.同船兄弟姐妹与配偶人数最大为8,同船父母或子女最大数则为6,看来有大家庭小家庭之分;
5.票价最小为0,最大为512.3,均值为32.20,中位数为14.45,正偏,贫富差距不小;
#那么python对象对应的数据查看
df_train[['Name','Sex','Ticket','Cabin','Embarked']].describe()
Name | Sex | Ticket | Cabin | Embarked | |
---|---|---|---|---|---|
count | 891 | 891 | 891 | 204 | 889 |
unique | 891 | 2 | 681 | 147 | 3 |
top | Lam, Mr. Ali | male | 1601 | B96 B98 | S |
freq | 1 | 577 | 7 | 4 | 644 |
特征分析
# 1.PassengerId,id仅仅是来标识乘客的唯一性,必然与幸存无关
# 2.Pclass
#船舱等级,一等级是整个船最昂贵奢华的地方,有钱人才能享受,有没有可能一等舱有钱人比三等舱的穷人更容易幸存呢?
import numpy as np
import matplotlib.pyplot as plt
#生成pclass-survive的列联表
Pclass_Survived = pd.crosstab(df_train['Pclass'],df_train['Survived'])
Pclass_Survived
Survived | 0 | 1 |
---|---|---|
Pclass | ||
1 | 80 | 136 |
2 | 97 | 87 |
3 | 372 | 119 |
Pclass_Survived.count()
Survived
0 3
1 3
dtype: int64
Pclass_Survived.index
Int64Index([1, 2, 3], dtype='int64', name='Pclass')
#绘制堆积柱形图
Pclass_Survived.plot(kind = 'bar',stacked = True)
Survived_len = len(Pclass_Survived.count())
print(Survived_len)
Pclass_index = np.arange(len(Pclass_Survived.index))
print(Pclass_index)plt.xticks(Pclass_Survived.index-1,Pclass_Survived.index,rotation = 360)
plt.title('Survived status by pclass')
2
[0 1 2]Text(0.5,1,'Survived status by pclass')
其中列联表就等于一下操作
#生成Survived为0时,每个Pclass的总计数
Pclass_Survived_0 = df_train.Pclass[df_train['Survived'] == 0].value_counts()
#生成Survived为1时,每个Pclass的总计数
Pclass_Survived_1 = df_train.Pclass[df_train['Survived'] == 1].value_counts()
#将两个状况合并为一个dataFram
Pclass_Survived = pd.DataFrame({0:Pclass_Survived_0,1:Pclass_Survived_1})
Pclass_Survived
3 372
2 97
1 80
Name: Pclass, dtype: int64
0 | 1 | |
---|---|---|
1 | 80 | 136 |
2 | 97 | 87 |
3 | 372 | 119 |
#Name
#姓名,总数为891个且有891种不同的结果,没多大意义,但值得注意的是性命中有头衔存在的,头衔是身份地位的象征,是否身份地位越高更容易生存?
#首先提取出头衔
import re
df_train['Appellation'] = df_train.Name.apply(lambda x : re.search('\w+\.',x).group()).str.replace('.','')
#查看多钟不同的结果
df_train.Appellation.unique()
array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms','Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'Countess','Jonkheer'], dtype=object)
#头衔解读:Mr既可用于已婚男性,也可用于未婚男性;Mrs已婚女性;Miss通常用来称呼未婚女性,但有时也用于称呼自己不了解的年龄较大的妇女;
#Master: 男童或男婴;Don:大学教师;Rev: 牧师;Dr:医生或博士;Mme:女士;Ms:既可用于已婚女性也可用于未婚女性;Major:陆军少校;
#Lady : 公侯伯爵的女儿;Sir:常用来称呼上级长官;Mlle:小姐;Col:上校;Capt:船长;Countess:伯爵夫人;Jonkheer:乡绅;
#性别与头衔的对应的人数
Appellation_Sex = pd.crosstab(df_train.Appellation,df_train.Sex)
Appellation_Sex.T
Appellation | Capt | Col | Countess | Don | Dr | Jonkheer | Lady | Major | Master | Miss | Mlle | Mme | Mr | Mrs | Ms | Rev | Sir |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sex | |||||||||||||||||
female | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 182 | 2 | 1 | 0 | 125 | 1 | 0 | 0 |
male | 1 | 2 | 0 | 1 | 6 | 1 | 0 | 2 | 40 | 0 | 0 | 0 | 517 | 0 | 0 | 6 | 1 |
df_train.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Appellation | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | Mr |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Mrs |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | Miss |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | Mrs |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | Mr |
#将少数部分用Rare表示,将‘Mlle’,‘Ms’用‘MIss’代替,将‘Mme’用‘Mrs’代替
df_train['Appellation'] = df_train['Appellation'].replace(['Capt','Col','Countess','Don','Dr','Jonkheer','Lady','Major','Rev','Sir'],'Rare')
df_train['Appellation'] = df_train['Appellation'].replace(['Mlle','Ms'],'Miss')
df_train['Appellation'] = df_train['Appellation'].replace('Mme','Mrs')
df_train.Appellation.unique()
array(['Mr', 'Mrs', 'Miss', 'Master', 'Rare'], dtype=object)
#头衔和幸存者相关吗?
#绘制柱状图
Appellation_Survived = pd.crosstab(df_train['Appellation'],df_train['Survived'])
Appellation_Survived.plot(kind = 'bar')
print(np.arange(len(Appellation_Survived.index)-1))
plt.xticks(np.arange(len(Appellation_Survived.index)),Appellation_Survived.index,rotation = 360)
plt.title('Survived status by Appelation')
[0 1 2 3]Text(0.5,1,'Survived status by Appelation')
Sex
#性别,女士优先,但这种紧急关头,会让女士先上救生艇吗
#生成列联表
Sex_Survived = pd.crosstab(df_train['Sex'],df_train['Survived'])
Survived_len = len(Sex_Survived.count())
print(Survived_len)
Sex_index = np.arange(len(Sex_Survived.index))
print(Sex_survived.index)
print(Sex_index)
single_width = 0.35for i in range(Survived_len):SurvivedName = Sex_Survived.columns[i]print(SurvivedName)SexCount = Sex_Survived[SurvivedName]print(SexCount)SexLocation = Sex_index * 1.05 + (i - 1/2)*single_widthprint(SexLocation)#绘制柱状图plt.bar(SexLocation,SexCount,width = single_width)for x,y in zip(SexLocation,SexCount):#添加数据标签plt.text(x,y,'%.0f'%y,ha = 'center',va= 'bottom')
index = Sex_index * 1.05
plt.xticks(index,Sex_Survived.index,rotation = 360)
plt.title('Survived status by sex')
2
Index(['female', 'male'], dtype='object', name='Sex')
[0 1]
0
Sex
female 81
male 468
Name: 0, dtype: int64
[-0.175 0.875]
1
Sex
female 233
male 109
Name: 1, dtype: int64
[0.175 1.225]Text(0.5,1,'Survived status by sex')
#结果可以看出,女性的幸存率远高于男性
Age
#由于Age特征存在缺失值,处理完缺失值,再对其进行分析
SibSp --同船配偶以及兄弟姐妹的人数
#从之前的描述性统计了解到,兄弟姐妹与配偶的人数最多的为8,最少为0,哪个更容易生存呢?
#生成列联表
SibSp_Survived = pd.crosstab(df_train['SibSp'],df_train['Survived'])
#print(SibSp_Survived)
#print(np.arange(len(SibSp_Survived.index)))
SibSp_Survived.plot(kind = 'bar')
plt.xticks(np.arange(len(SibSp_Survived.index)),SibSp_Survived.index,rotation = 360)
plt.title('Survived status by SibSp')
Text(0.5,1,'Survived status by SibSp')
Parch --同船父母或子女的人数
#通过上面的描述性统计了解到,同样也可以分为大家庭,小家庭Parch_Survived = pd.crosstab(df_train['Parch'],df_train['Survived'])
Parch_Survived.plot(kind = 'bar')
plt.xticks(np.arange(len(Parch_Survived.index)),Parch_Survived.index,rotation = 360)
plt.title('Survived status by Parch')
Text(0.5,1,'Survived status by Parch')
Parch_Survived = pd.crosstab(df_train[df_train.Parch >= 3]['Parch'],df_train['Survived'])
Parch_Survived.plot(kind = 'bar')
plt.xticks(np.arange(len(Parch_Survived.index)),Parch_Survived.index,rotation = 360)
plt.title('Survived status by Parch')
Text(0.5,1,'Survived status by Parch')
##可以看到,大部分Parch为0,幸存率不大,当为1,2,3时,有所增加,再往上又有所减小
Ticket --船票
#总人数891,船票有681种,说明部分人共用一张票,什么人能共用一张票呢?需要对他们进行归类,共用票的归位一类,独自使用的归位一类;
#计算每张船票的使用的人数
Ticket_Count = df_train.groupby('Ticket',as_index = False)['PassengerId'].count()
#获取使用人数为1的船票
Ticket_Count_0 = Ticket_Count[Ticket_Count.PassengerId == 1]['Ticket']
#当船票在已经筛选出使用人数为1的船票中时,将0赋值给GroupTicket,否则将1赋值给GroupTicket
df_train['GroupTicket'] = np.where(df_train.Ticket.isin(Ticket_Count_0),0,1)
#绘制柱形图
GroupTicket_Survived = pd.crosstab(df_train['GroupTicket'],df_train['Survived'])
GroupTicket_Survived.plot(kind = 'bar')
plt.xticks(GroupTicket_Survived.index,rotation = 360)
plt.title('Survived status by GroupTicket')
Text(0.5,1,'Survived status by GroupTicket')
#很明显,船上有同伴比孤身一人幸存机会大
Fare --票价
#对Fare进行分组,2**10>891 分成10组,组距为(最大值-最小值)/10取值60
bins = [0,60,120,180,240,300,360,420,480,540,600]
df_train['GroupFare'] = pd.cut(df_train.Fare,bins,right = False)
GroupFare_Survived = pd.crosstab(df_train['GroupFare'],df_train['Survived'])
GroupFare_Survived
GroupFare_Survived.plot(kind = 'bar')
plt.title('Survived status by GroupFare')GroupFare_Survived.iloc[2:].plot(kind = 'bar')
plt.title('Survived status by GroupFare(Fare > 120)')
Text(0.5,1,'Survived status by GroupFare(Fare > 120)')
#可以看到随着票价的增长,幸存机会也会变大
Cabin --舱位 #Embarked --登船港口
#由于含有大量缺失值,处理完缺失值再对其进行分析
四.特征工程
缺失值主要是由人为原因和机械原因造成的数据缺失,在pandas中用NaN或者NaT表示,它的处理方式有多种:
1.用某些集中趋势度量(平均数,众数)进行对缺失值进行填充;
2.用统计模型来预测缺失值,比如回归模型、决策树、随即森林;
3.删除缺失值;
4.保留缺失值;
#在处理缺失值之前,应当将数据拷贝一份,以保证原始数据的完整性;
train = df_train.copy()
1.Embarked缺失值处理
通过以上,我们已经知道Embarked字段中缺失2个,且数据中S最多,达到644,占比644/891=72%,那么就采用众数进行填充;
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])
# 0 or index :获取列的众数;1 or columns :获取行的众数
2.Cabin缺失值处理
Cabin缺失值687个,占比687/891=77%,缺失数据太多,是否删除呢?舱位缺失可能代表这些人没有舱位,不妨用‘NO’来填充;
train['Cabin'] = train['Cabin'].fillna('NO')
3.Age缺失值处理
Age缺失177个,占比177/891=20%,缺失数据也不少,而且Age在本次分析中也尤其重要,孩子和老人属于弱势群体,应当更容易获救,不能删除也不能保留;
采用头衔相对应的年龄中位数进行填充
#求出每个头衔对应的年龄的中位数
Age_Appellation_median = train.groupby('Appellation')['Age'].median()
#在当前表设置Appellation为索引
train.set_index('Appellation',inplace = True)
#在当前表填充缺失值
train.Age.fillna(Age_Appellation_median,inplace = True)
#重置索引
train.reset_index(inplace = True)
#检查一下是否有缺失值
#第一种方法:返回0即表示没有缺失值
train.Age.isnull().sum()
0
#第二种方法:返回False即表示没有缺失值
train.Age.isnull().any()
False
#第三种方法:描述性统计
train.Age.describe()
count 891.000000
mean 29.390202
std 13.265322
min 0.420000
25% 21.000000
50% 30.000000
75% 35.000000
max 80.000000
Name: Age, dtype: float64
对缺失特征分析
Embarked --登船港口
#绘制柱状图
Embarked_Survived = pd.crosstab(train['Embarked'],train['Survived'])
Embarked_Survived.plot(kind = 'bar')
plt.xticks(np.arange(len(Embarked_Survived.index)),Embarked_Survived.index,rotation = 360)
plt.title('Survived status by Embarked')
Text(0.5,1,'Survived status by Embarked')
#C港生存几率会明显高于Q,S港
#Cabin
#将没有舱位的归为0,有舱位的归位1
train['GroupCabin'] = np.where(train.Cabin == 'NO',0,1)
#绘制柱状图
GroupCabin_Survived = pd.crosstab(train['GroupCabin'],train['Survived'])
GroupCabin_Survived.plot(kind = 'bar')
plt.xticks(np.arange(len(GroupCabin_Survived.index)),GroupCabin_Survived.index,rotation = 360)
plt.title('Survived ststus by GroupCabin')
Text(0.5,1,'Survived ststus by GroupCabin')
#有舱位比没有舱位的生存几率大
Age
#对Age进行分组
bins = [0,9,18,27,36,45,54,63,72,81,90]
train['GroupAge'] = pd.cut(train.Age,bins)
GroupAge_Survived = pd.crosstab(train['GroupAge'],train['Survived'])
#绘制柱状图
GroupAge_Survived.plot(kind = 'bar')
plt.xticks(np.arange(len(GroupAge_Survived.index)),GroupAge_Survived.index,rotation = 90)
plt.title('Survived status by GroupAge')
Text(0.5,1,'Survived status by GroupAge')
#如图,孩子的幸存几率很大
这篇关于Titanic_Data Analysis的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!