Kaggle: Source Code for Predicting Titanic Passenger Survival (Titanic Dataset Included)

This article presents source code for predicting passenger survival on the Kaggle Titanic dataset, as a reference for developers working on similar problems.

Data visualization and analysis

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

titanic=pd.read_csv('train.csv')
#print(titanic.head())
# Set a column as the index
#print(titanic.set_index('PassengerId').head())

# =============================================================================
# # Draw a pie chart showing the proportion of male and female passengers
# #sum the instances of males and females
# males=(titanic['Sex']=='male').sum()
# females=(titanic['Sex']=='female').sum()
# #put them into a list called proportions
# proportions=[males,females]
# #Create a pie chart
# plt.pie(
# #        using proportions
#         proportions,
# #        with the labels being the two sexes
#         labels=['Males','Females'],
# #        with no shadows
#         shadow=False,
# #        with colors
#         colors=['blue','red'],
#         explode=(0.15,0),
#         startangle=90,
#         autopct='%1.1f%%'
#         )
# plt.axis('equal')
# plt.title("Sex Proportion")
# plt.tight_layout()
# plt.show()
# =============================================================================

# =============================================================================
# # Draw a scatter plot of ticket fare against passenger age, colored by survival
# #creates the plot using
# lm=sns.lmplot(x='Age',y='Fare',data=titanic,hue='Survived',fit_reg=False)
# #set title
# lm.set(title='Fare x Age')
# #get the axes object and tweak it
# axes=lm.axes
# axes[0,0].set_ylim(-5,)
# axes[0,0].set_xlim(-5,85)
# =============================================================================

# =============================================================================
# # Draw a histogram of ticket fares
# #sort the fares from highest to lowest
# df=titanic.Fare.sort_values(ascending=False)
# #create bins interval using numpy
# binsVal=np.arange(0,600,10)
# #create the plot
# plt.hist(df,bins=binsVal)
# plt.xlabel('Fare')
# plt.ylabel('Frequency')
# plt.title('Fare Paid Histogram')
# plt.show()
# =============================================================================

# Which sex has the higher mean age?
#print(titanic.groupby('Sex').Age.mean())
# Print descriptive statistics of age for each sex
#print(titanic.groupby('Sex').Age.describe())
#print(titanic.groupby(['Sex','Survived']).Fare.describe())
# Sort by Survived first, then by Fare
#a=titanic.sort_values(['Survived','Fare'],ascending=False)
#print(a)
# Select the rows whose Name starts with the letter A
#b=titanic[titanic.Name.str.startswith('A')]
#print(b)
# Look up the survival status of three specific passengers
#c=titanic.loc[titanic.Name.isin(['Youseff, Mr. Gerious','Saad, Mr. Amin','Yousif, Mr. Wazli'])\
#              ,['Name','Survived']]
#print(c)
# =============================================================================
# ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
# ts = ts.cumsum()
# ts.plot()
# plt.show()
# 
# df = pd.DataFrame(np.random.randn(1000, 4),index=ts.index,columns=['A', 'B', 'C', 'D'])
# df=df.cumsum()
# plt.figure()
# df.plot()
# plt.legend(loc='best')
# plt.show()
# =============================================================================
# Number of missing values in each column
#print(titanic.isnull().sum())
# Number of non-missing values in each column
#print(titanic.shape[0]-titanic.isnull().sum())
# Check the data type of each column
#print(titanic.info())
#print(titanic.dtypes)
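
The missing-value and group-by checks above can be tried without train.csv. The following is a minimal sketch, assuming a small hand-made DataFrame named toy; the name and every value in it are invented for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv; all values are invented.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Number of missing values in each column
missing = toy.isnull().sum()

# Number of non-missing values in each column
present = toy.shape[0] - missing

# Mean age per sex; mean() skips NaN ages by default
mean_age = toy.groupby("Sex").Age.mean()

print(missing)
print(present)
print(mean_age)
```

On the real data the same three expressions apply unchanged with titanic in place of toy.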

Main program
# -*- coding: utf-8 -*-
"""
Created on Tue Apr 10 17:21:16 2018

@author: CSH
"""

import pandas as pd
titanic=pd.read_csv("train.csv")
#print(titanic.describe())

# Fill missing ages with the median age
titanic["Age"]=titanic["Age"].fillna(titanic["Age"].median())
#print(titanic.describe())

# Encode Sex as integers: male=0, female=1
#print(titanic["Sex"].unique())
titanic["Sex"]=titanic["Sex"].map({"male":0,"female":1})

# Fill missing embarkation ports with "S", then encode S=0, C=1, Q=2
#print(titanic["Embarked"].value_counts())
titanic["Embarked"]=titanic["Embarked"].fillna("S")
titanic["Embarked"]=titanic["Embarked"].map({"S":0,"C":1,"Q":2})
# Linear regression
# =============================================================================
# from sklearn.linear_model import LinearRegression
# from sklearn.model_selection import KFold
# predictors=["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]
# alg=LinearRegression()
# kf=KFold(n_splits=3)
# predictions=[]
# for train,test in kf.split(titanic):
#     train_predictors=(titanic[predictors].iloc[train,:])
#     train_target=titanic["Survived"].iloc[train]
#     alg.fit(train_predictors,train_target)
#     test_predictions=alg.predict(titanic[predictors].iloc[test,:])
#     predictions.append(test_predictions)
# 
# 
# import numpy as np
# predictions=np.concatenate(predictions,axis=0)
# predictions[predictions>.5]=1
# predictions[predictions<=.5]=0
# accuracy=sum(predictions==titanic["Survived"])/len(predictions)
# print(accuracy)
# =============================================================================
# Logistic regression
# =============================================================================
from sklearn.linear_model import LogisticRegression
# from sklearn.model_selection import cross_val_score
# predictors=["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]
# alg=LogisticRegression(random_state=1)
# scores=cross_val_score(alg,titanic[predictors],titanic["Survived"],cv=3)
# print(scores.mean())
# =============================================================================
# Random forest
# =============================================================================
# from sklearn.model_selection import KFold,cross_val_score
# from sklearn.ensemble import RandomForestClassifier
# predictors=["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]
# alg=RandomForestClassifier(random_state=1,n_estimators=150,min_samples_split=12,min_samples_leaf=1)
# kf=KFold(n_splits=3)
# scores=cross_val_score(alg,titanic[predictors],titanic["Survived"],cv=kf)
# print(scores.mean())
# =============================================================================

# Engineered features: family size and name length
titanic["FamilySize"]=titanic["SibSp"]+titanic["Parch"]
titanic["NameLength"]=titanic["Name"].apply(len)

# Extract the title from each name
import re
def get_title(name):
    # A title is a run of letters followed by a period, e.g. "Mr."
    title_search=re.search(r'([A-Za-z]+)\.',name)
    if title_search:
        return title_search.group(1)
    return ""

titles=titanic["Name"].apply(get_title)
#print(titles.value_counts())
# Map each title to an integer code; titles missing from the mapping are left unchanged
title_mapping={"Mr":1,"Miss":2,"Mrs":3,"Master":4,"Dr":5,"Rev":6,"Mlle":7,"Major":8,"Col":9,"Ms":10,"Mme":11,"Lady":12,"Sir":13,"Capt":14,"Don":15,"Jonkheer":16,"Countess":17}
titles=titles.map(lambda t:title_mapping.get(t,t))
#print(titles.value_counts())
titanic["Title"]=titles
# Feature selection
# =============================================================================
# import numpy as np
# from sklearn.feature_selection import SelectKBest,f_classif
# import matplotlib.pyplot as plt
# predictors=["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked","FamilySize","Title","NameLength"]
# selector=SelectKBest(f_classif,k=5)
# selector.fit(titanic[predictors],titanic["Survived"])
# scores=-np.log10(selector.pvalues_)
# 
# plt.bar(range(len(predictors)),scores)
# plt.xticks(range(len(predictors)),predictors,rotation='vertical')
# plt.show()
# =============================================================================

# =============================================================================
# from sklearn.model_selection import KFold,cross_val_score
# from sklearn.ensemble import RandomForestClassifier
# predictors=["Pclass","Sex","Fare","Title","NameLength"]
# alg=RandomForestClassifier(random_state=1,n_estimators=50,min_samples_split=12,min_samples_leaf=1)
# kf=KFold(n_splits=3)
# scores=cross_val_score(alg,titanic[predictors],titanic["Survived"],cv=kf)
# print(scores.mean())
# =============================================================================

# Ensemble learning: average the predicted probabilities of two models
# (sklearn.cross_validation was removed in scikit-learn 0.20; KFold now lives in model_selection)
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
algorithms=[
    [GradientBoostingClassifier(random_state=1,n_estimators=25,max_depth=3),["Pclass","Sex","Fare","Title","NameLength"]],
    [LogisticRegression(random_state=1),["Pclass","Sex","Fare","Title","NameLength"]]
]

# random_state is omitted because it has no effect without shuffle=True
kf=KFold(n_splits=3)
predictions=[]
for train,test in kf.split(titanic):
    train_target=titanic["Survived"].iloc[train]
    full_test_predictions=[]
    for alg,predictors in algorithms:
        alg.fit(titanic[predictors].iloc[train,:].astype(float),train_target)
        test_predictions=alg.predict_proba(titanic[predictors].iloc[test,:].astype(float))[:,1]
        full_test_predictions.append(test_predictions)
    # Average the two models' survival probabilities, then threshold at 0.5
    test_predictions=(full_test_predictions[0]+full_test_predictions[1])/2
    test_predictions[test_predictions<=.5]=0
    test_predictions[test_predictions>.5]=1
    predictions.append(test_predictions)

predictions=np.concatenate(predictions,axis=0)
accuracy=sum(predictions==titanic["Survived"])/len(predictions)
print(accuracy)
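
The averaging scheme in the ensemble step can also be tried without train.csv. The following is a rough sketch under stated assumptions: the data comes from scikit-learn's make_classification rather than the Titanic features, the helper name soft_vote_cv is hypothetical, and max_iter=1000 is an added setting that the original code does not use.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for the Titanic features (assumption, not the real data)
X, y = make_classification(n_samples=300, n_features=5, random_state=1)

models = [
    GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3),
    LogisticRegression(random_state=1, max_iter=1000),
]

def soft_vote_cv(models, X, y, n_splits=3):
    """Hypothetical helper: out-of-fold predictions from averaged probabilities."""
    predictions = np.empty(len(y), dtype=int)
    for train, test in KFold(n_splits=n_splits).split(X):
        probs = []
        for model in models:
            model.fit(X[train], y[train])
            # Probability of the positive class for the held-out fold
            probs.append(model.predict_proba(X[test])[:, 1])
        # Unweighted average across models, thresholded at 0.5
        avg = np.mean(probs, axis=0)
        predictions[test] = (avg > 0.5).astype(int)
    return predictions

predictions = soft_vote_cv(models, X, y)
accuracy = (predictions == y).mean()
print(accuracy)
```

Writing each fold's result into predictions[test] keeps the output aligned with y even if the folds were shuffled, whereas concatenating the folds relies on KFold returning them in order.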

Attachment: download link: https://pan.baidu.com/s/1K1USWVQQOEM9OLr3M1pniw (password: n8wz)

http://www.chinasem.cn/article/563037