Kaggle: Titanic Passenger Survival Prediction Source Code (with the Titanic Dataset)

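This article presents source code for predicting which Titanic passengers survived, based on the Kaggle Titanic competition data (a download link for the dataset is at the end). It is meant as a reference for developers working on similar problems.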

Data Visualization and Analysis

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

titanic=pd.read_csv('train.csv')
#print(titanic.head())
#Set one column as the index
#print(titanic.set_index('PassengerId').head())

# =============================================================================
# #Draw a pie chart showing the proportion of male and female passengers
# #sum the instances of males and females
# males=(titanic['Sex']=='male').sum()
# females=(titanic['Sex']=='female').sum()
# #put them into a list called proportions
# proportions=[males,females]
# #Create a pie chart
# plt.pie(
# #        using proportions
#         proportions,
# #        with the labels being the two sexes
#         labels=['Males','Females'],
# #        with no shadows
#         shadow=False,
# #        with colors
#         colors=['blue','red'],
# #        pull the first slice (Males) out slightly
#         explode=(0.15,0),
#         startangle=90,
#         autopct='%1.1f%%'
#         )
# plt.axis('equal')
# plt.title("Sex Proportion")
# plt.tight_layout()
# plt.show()
# =============================================================================

# =============================================================================
# #Draw a scatter plot of Fare against Age, colored by whether the passenger survived
# #creates the plot using
# lm=sns.lmplot(x='Age',y='Fare',data=titanic,hue='Survived',fit_reg=False)
# #set title
# lm.set(title='Fare x Age')
# #get the axes object and tweak it
# axes=lm.axes
# axes[0,0].set_ylim(-5,)
# axes[0,0].set_xlim(-5,85)
# =============================================================================

# =============================================================================
# #Draw a histogram of ticket fares
# #sort the fares from highest to lowest
# df=titanic.Fare.sort_values(ascending=False)
# #create bins interval using numpy
# binsVal=np.arange(0,600,10)
# #create the plot
# plt.hist(df,bins=binsVal)
# plt.xlabel('Fare')
# plt.ylabel('Frequency')
# plt.title('Fare Paid Histogram')
# plt.show()
# =============================================================================

#Which sex has the higher mean age
#print(titanic.groupby('Sex').Age.mean())
#Print descriptive statistics of age for each sex
#print(titanic.groupby('Sex').Age.describe())
#print(titanic.groupby(['Sex','Survived']).Fare.describe())
#Sort by Survived first, then by Fare
#a=titanic.sort_values(['Survived','Fare'],ascending=False)
#print(a)
#Select the rows whose Name starts with the letter A
#b=titanic[titanic.Name.str.startswith('A')]
#print(b)
#Look up whether three particular passengers survived
#c=titanic.loc[titanic.Name.isin(['Youseff, Mr. Gerious','Saad, Mr. Amin','Yousif, Mr. Wazli'])\
#              ,['Name','Survived']]
#print(c)
# =============================================================================
# ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
# ts = ts.cumsum()
# ts.plot()
# plt.show()
# 
# df = pd.DataFrame(np.random.randn(1000, 4),index=ts.index,columns=['A', 'B', 'C', 'D'])
# df=df.cumsum()
# plt.figure()
# df.plot()
# plt.legend(loc='best')
# plt.show()
# =============================================================================
#For each column, how many values are missing
#print(titanic.isnull().sum())
#For each column, how many values are present
#print(titanic.shape[0]-titanic.isnull().sum())
#Check the data type of each column
#print(titanic.info())
#print(titanic.dtypes)
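
The missing-value and group-by checks above can be tried without downloading train.csv. The following is a minimal sketch on a hand-made DataFrame; the name toy and all of its values are invented for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv (all values are made up)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "Survived": [0, 1, 1, 1, 0],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Number of present values in each column (row count minus missing count)
present_per_column = toy.shape[0] - missing_per_column

# Mean age per sex; mean() skips NaN ages by default
mean_age_by_sex = toy.groupby("Sex").Age.mean()

print(missing_per_column)
print(present_per_column)
print(mean_age_by_sex)
```

Because mean() ignores missing ages, the group means are computed only over passengers whose age is known. This is worth keeping in mind before the main program fills the gaps with the median.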

Main Program
# -*- coding: utf-8 -*-
"""
Created on Tue Apr 10 17:21:16 2018

@author: CSH
"""

import pandas as pd
titanic=pd.read_csv("train.csv")
#print(titanic.describe())

#Fill missing ages with the median age
titanic["Age"]=titanic["Age"].fillna(titanic["Age"].median())
#print(titanic.describe())

#print(titanic["Sex"].unique())
#Encode Sex as a number: male=0, female=1
titanic.loc[titanic["Sex"]=="male","Sex"]=0
titanic.loc[titanic["Sex"]=="female","Sex"]=1

#print(titanic["Embarked"].value_counts())
#Fill missing ports with "S", then encode: S=0, C=1, Q=2
titanic["Embarked"]=titanic["Embarked"].fillna("S")
titanic.loc[titanic["Embarked"]=="S","Embarked"]=0
titanic.loc[titanic["Embarked"]=="C","Embarked"]=1
titanic.loc[titanic["Embarked"]=="Q","Embarked"]=2
#Linear regression
# =============================================================================
# #sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
# from sklearn.linear_model import LinearRegression
# from sklearn.model_selection import KFold
# predictors=["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]
# alg=LinearRegression()
# kf=KFold(n_splits=3)
# predictions=[]
# for train,test in kf.split(titanic):
#     train_predictors=(titanic[predictors].iloc[train,:])
#     train_target=titanic["Survived"].iloc[train]
#     alg.fit(train_predictors,train_target)
#     test_predictions=alg.predict(titanic[predictors].iloc[test,:])
#     predictions.append(test_predictions)
# 
# 
# import numpy as np
# predictions=np.concatenate(predictions,axis=0)
# #threshold the regression output at 0.5 to get a 0/1 prediction
# predictions[predictions>.5]=1
# predictions[predictions<=.5]=0
# accuracy=sum(predictions==titanic["Survived"])/len(predictions)
# print(accuracy)
# =============================================================================
#Logistic regression
# =============================================================================
from sklearn.linear_model import LogisticRegression
#sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn import model_selection
# predictors=["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]
# alg=LogisticRegression(random_state=1)
# scores=model_selection.cross_val_score(alg,titanic[predictors],titanic["Survived"],cv=3)
# print(scores.mean())
# =============================================================================
#Random forest
# =============================================================================
# from sklearn import model_selection
# from sklearn.ensemble import RandomForestClassifier
# predictors=["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]
# alg=RandomForestClassifier(random_state=1,n_estimators=150,min_samples_split=12,min_samples_leaf=1)
# kf=model_selection.KFold(n_splits=3)
# scores=model_selection.cross_val_score(alg,titanic[predictors],titanic["Survived"],cv=kf)
# print(scores.mean())
# =============================================================================

#New features: family size and name length
titanic["FamilySize"]=titanic["SibSp"]+titanic["Parch"]
titanic["NameLength"]=titanic["Name"].apply(lambda x:len(x))

#Extract the title from each name
import re
def get_title(name):
    #A title is a run of letters followed by a period, e.g. "Mr."
    title_search=re.search(r'([A-Za-z]+)\.',name)
    if title_search:
        return title_search.group(1)
    return ""

titles=titanic["Name"].apply(get_title)
#print(titles.value_counts())

#Map each title to an integer code
title_mapping={"Mr":1,"Miss":2,"Mrs":3,"Master":4,"Dr":5,"Rev":6,"Mlle":7,"Major":8,"Col":9,"Ms":10,"Mme":11,"Lady":12,"Sir":13,"Capt":14,"Don":15,"Jonkheer":16,"Countess":17}
for k,v in title_mapping.items():
    titles[titles==k]=v
#print(titles.value_counts())
titanic["Title"]=titles
#Feature selection
# =============================================================================
# import numpy as np
# from sklearn.feature_selection import SelectKBest,f_classif
# import matplotlib.pyplot as plt
# predictors=["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked","FamilySize","Title","NameLength"]
# selector=SelectKBest(f_classif,k=5)
# selector.fit(titanic[predictors],titanic["Survived"])
# #convert p-values to scores: a smaller p-value gives a taller bar
# scores=-np.log10(selector.pvalues_)
# 
# plt.bar(range(len(predictors)),scores)
# plt.xticks(range(len(predictors)),predictors,rotation='vertical')
# plt.show()
# =============================================================================

#Random forest on a reduced feature set
# =============================================================================
# from sklearn import model_selection
# from sklearn.ensemble import RandomForestClassifier
# predictors=["Pclass","Sex","Fare","Title","NameLength"]
# alg=RandomForestClassifier(random_state=1,n_estimators=50,min_samples_split=12,min_samples_leaf=1)
# kf=model_selection.KFold(n_splits=3)
# scores=model_selection.cross_val_score(alg,titanic[predictors],titanic["Survived"],cv=kf)
# print(scores.mean())
# =============================================================================

#Ensemble learning
#sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np

#Each entry pairs a classifier with the features it is trained on
algorithms=[
    [GradientBoostingClassifier(random_state=1,n_estimators=25,max_depth=3),["Pclass","Sex","Fare","Title","NameLength"]],
    [LogisticRegression(random_state=1),["Pclass","Sex","Fare","Title","NameLength"]]
]

kf=KFold(n_splits=3)
predictions=[]
for train,test in kf.split(titanic):
    train_target=titanic["Survived"].iloc[train]
    full_test_predictions=[]
    #Fit each classifier on the training fold and predict survival probabilities on the test fold
    for alg,predictors in algorithms:
        alg.fit(titanic[predictors].iloc[train,:].astype(float),train_target)
        test_predictions=alg.predict_proba(titanic[predictors].iloc[test,:].astype(float))[:,1]
        full_test_predictions.append(test_predictions)
    #Average the two probabilities, then threshold at 0.5
    test_predictions=(full_test_predictions[0]+full_test_predictions[1])/2
    test_predictions[test_predictions<=.5]=0
    test_predictions[test_predictions>.5]=1
    predictions.append(test_predictions)

predictions=np.concatenate(predictions,axis=0)
accuracy=sum(predictions==titanic["Survived"])/len(predictions)
print(accuracy)
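
The ensemble step averages the survival probabilities from two classifiers and thresholds the result at 0.5. The sketch below isolates that idea on synthetic data so it runs without train.csv. The data from make_classification, the variable names, and the max_iter setting are assumptions made for illustration; they are not part of the original program.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for the five Titanic features and the Survived column (assumption)
X, y = make_classification(n_samples=300, n_features=5, random_state=1)

models = [
    GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3),
    LogisticRegression(random_state=1, max_iter=1000),
]

kf = KFold(n_splits=3)
fold_predictions = []
for train_idx, test_idx in kf.split(X):
    # Positive-class probability from each model on the held-out fold
    probs = []
    for model in models:
        model.fit(X[train_idx], y[train_idx])
        probs.append(model.predict_proba(X[test_idx])[:, 1])
    # Average the probabilities, then threshold at 0.5
    averaged = np.mean(probs, axis=0)
    fold_predictions.append((averaged > 0.5).astype(int))

# KFold does not shuffle by default, so the folds are contiguous and in
# order; concatenating them lines the predictions up with y
predictions = np.concatenate(fold_predictions)
accuracy = (predictions == y).mean()
print(accuracy)
```

Averaging probabilities rather than hard 0/1 votes lets a confident model outweigh an uncertain one, which is one reason this style of ensemble is often preferred when there are only two classifiers.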

Appendix: dataset download link: https://pan.baidu.com/s/1K1USWVQQOEM9OLr3M1pniw  Password: n8wz
