Competition page: https://www.kaggle.com/c/titanic
Task: given the passenger information provided, predict whether each passenger Survived or not.
Introduction
Machine learning was on hold for a full three months. Partly my research had to move forward, above all revising a paper (progress was painfully slow; I finally understand why revisions take at least three months!). Partly I was studying data structures and algorithms to sharpen my programming. I had never imagined hand-coding gradient descent, logistic regression, backpropagation and the like, but now I can basically write them all (if somewhat laboriously). A side note: written tests and interviews lean heavily on coding skill, so keep solving problems and writing code whenever you can, or you will get rusty!

Enough small talk. This post summarizes my approach to a first Kaggle problem, in particular how to write a Baseline and how to optimize it (following along in a Jupyter Notebook works best). It draws heavily on the walkthrough by teacher Han (寒老师) at July Online (七月在线), from which I benefited a great deal. To stress it once more: what follows is very simple, the most basic data-mining workflow. Kaggle -> Kernels is full of experts you can learn from, provided you know what to look for: how they clean data, engineer features, plot, pick models, do model selection, and so on. Reading with a purpose is what makes it pay off. Alright, no more chatter; time to hammer out the Baseline!
Method
1. Load the basic packages
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
2. Import and inspect the data
df_origin = pd.read_csv('train.csv')

def read_data(df):
    print(df.head(10))
    print(df.describe())
    print(df.info())
    cols = df.columns
    print(cols)
    for c in cols:
        print(c, df[c].unique())

read_data(df_origin)
3. Plot each attribute (alone and in combination) against Survived to see what the relationships look like.
fig = plt.figure()
fig.set(alpha=0.8)
plt.rcParams['font.sans-serif'] = ['SimHei']   # default font
plt.rcParams['axes.unicode_minus'] = False     # keep the minus sign '-' from rendering as a box in saved figures
'''
**************Survived distribution****************
'''
plt.subplot(331)
plt.subplots_adjust(wspace=0.4, hspace=0.4)
df_origin.Survived.value_counts().plot(kind='bar')
plt.title('Survival (1 = survived)')   # puts a title on our graph
plt.ylabel('Count')
'''
**************Passenger class distribution****************
'''
plt.subplot(332)
df_origin.Pclass.value_counts().plot(kind='bar')
plt.title('Passenger class distribution')
plt.ylabel('Count')
'''
**************Survival by age****************
'''
plt.subplot(333)
plt.scatter(df_origin.Survived, df_origin.Age)
plt.title('Survival by age')   # fixed: the original title duplicated the class plot
plt.ylabel('Age')
'''
**************Age distribution by passenger class****************
'''
plt.subplot2grid((3, 3), (1, 0), colspan=2)
df_origin.Age[df_origin.Pclass == 1].plot(kind='kde')
df_origin.Age[df_origin.Pclass == 2].plot(kind='kde')
df_origin.Age[df_origin.Pclass == 3].plot(kind='kde')
plt.xlabel('Age')
plt.ylabel('Density')
plt.title('Age distribution by passenger class')
plt.legend(('1st class', '2nd class', '3rd class'), loc='best')
'''
**************Passengers by port of embarkation****************
'''
plt.subplot(336)
df_origin.Embarked.value_counts().plot(kind='bar')
plt.title('Passengers per embarkation port')
plt.ylabel('Count')
'''
**************Sex distribution****************
'''
plt.subplot(337)
df_origin.Sex.value_counts().plot(kind='bar')
plt.title('Passengers by sex')
plt.ylabel('Count')
'''
**************SibSp counts****************
'''
plt.subplot(338)
df_origin.SibSp.value_counts().plot(kind='bar')
plt.title('SibSp counts')   # fixed: the original title duplicated the embarkation plot
plt.ylabel('Count')
'''
**************Parch counts****************
'''
plt.subplot(339)
df_origin.Parch.value_counts().plot(kind='bar')
plt.title('Parch counts')
plt.ylabel('Count')
plt.show()
'''
**************Each attribute vs. survival****************
'''
# Survival by passenger class
fig = plt.figure()
fig.set(alpha=0.2)   # alpha for the figure background
plt.rcParams['font.sans-serif'] = ['SimHei']   # default font
plt.rcParams['axes.unicode_minus'] = False     # keep the minus sign '-' from rendering as a box in saved figures
Survived_1 = df_origin.Pclass[df_origin.Survived == 1].value_counts()
Survived_0 = df_origin.Pclass[df_origin.Survived == 0].value_counts()
pd.DataFrame({'Survived': Survived_1, 'Not survived': Survived_0}).plot(kind='bar')
plt.title('Survival by passenger class')
plt.xlabel('Passenger class')
plt.ylabel('Count')
# Survival by port of embarkation
Survived_1 = df_origin.Embarked[df_origin.Survived == 1].value_counts()
Survived_0 = df_origin.Embarked[df_origin.Survived == 0].value_counts()
df = pd.DataFrame({'Survived': Survived_1, 'Not survived': Survived_0})
df.plot(kind='bar')
plt.title('Survival by embarkation port')
plt.xlabel('Embarkation port')
plt.ylabel('Count')
# Survival by sex
Survived_1 = df_origin.Sex[df_origin.Survived == 1].value_counts()
Survived_0 = df_origin.Sex[df_origin.Survived == 0].value_counts()
df = pd.DataFrame({'Survived': Survived_1, 'Not survived': Survived_0})
df.plot(kind='bar')
plt.title('Survival by sex')
plt.xlabel('Sex')
plt.ylabel('Count')
# Survival vs. number of siblings/spouses aboard (SibSp)
Survived_1 = df_origin.SibSp[df_origin.Survived == 1].value_counts()
Survived_0 = df_origin.SibSp[df_origin.Survived == 0].value_counts()
df = pd.DataFrame({'Survived': Survived_1, 'Not survived': Survived_0})
df.plot(kind='bar')
plt.title('Survival by SibSp')
plt.xlabel('SibSp')
plt.ylabel('Count')
# Survival vs. number of parents/children aboard (Parch)
Survived_1 = df_origin.Parch[df_origin.Survived == 1].value_counts()
Survived_0 = df_origin.Parch[df_origin.Survived == 0].value_counts()
df = pd.DataFrame({'Survived': Survived_1, 'Not survived': Survived_0})
df.plot(kind='bar')
plt.title('Survival by Parch')
plt.xlabel('Parch')
plt.ylabel('Count')
plt.show()
4. From the analysis above, women and passengers in higher-class cabins look more likely to survive. Is that really so? Combine the features and take a look.
'''
***********Combined features vs. survival****************
Visualization version
'''
# Survival by cabin class and sex
fig = plt.figure()
fig.set(alpha=0.2)   # alpha for the figure background
plt.suptitle('Survival by cabin class and sex')   # figure-level title, so it isn't overwritten by the subplots
plt.subplot(141)
df_origin.Survived[(df_origin.Pclass != 3) & (df_origin.Sex == 'female')].value_counts().plot(kind='bar', label='female highclass', color='#FA2479')
plt.xticks([0, 1], ['Survived', 'Not survived'], rotation=0)
plt.legend(['Female / high class'], loc='best')
plt.subplot(142)
df_origin.Survived[(df_origin.Pclass == 3) & (df_origin.Sex == 'female')].value_counts().plot(kind='bar', label='female lowclass', color='pink')
plt.xticks([0, 1], ['Survived', 'Not survived'], rotation=0)
plt.legend(['Female / low class'], loc='best')
plt.subplot(143)
df_origin.Survived[(df_origin.Pclass != 3) & (df_origin.Sex == 'male')].value_counts().plot(kind='bar', label='male highclass', color='lightblue')
plt.xticks([0, 1], ['Not survived', 'Survived'], rotation=0)
plt.legend(['Male / high class'], loc='best')
plt.subplot(144)
df_origin.Survived[(df_origin.Pclass == 3) & (df_origin.Sex == 'male')].value_counts().plot(kind='bar', label='male lowclass', color='steelblue')
plt.xticks([0, 1], ['Not survived', 'Survived'], rotation=0)
plt.legend(['Male / low class'], loc='best')
plt.show()
'''
********Combined features vs. survival*******
groupby version
'''
g = df_origin.groupby(['SibSp', 'Survived'])
df = pd.DataFrame(g.count()['PassengerId'])   # fixed: count is a method, so g.count()['PassengerId']
print(df)
5. Handle the special features. Two columns contain missing values: Cabin and Age.
Cabin is categorical and mostly missing, so instead of imputing values we simply recode it by whether a cabin value is present or not.
# Mark the non-missing rows first: doing the isnull() assignment first would make
# every row non-null and the second line would then overwrite everything with 'Yes'
df_origin.loc[(df_origin.Cabin.notnull()), 'Cabin'] = 'Yes'
df_origin.loc[(df_origin.Cabin.isnull()), 'Cabin'] = 'No'
Age is continuous with only a few missing values, so we can either fit a model to fill them in, or discretize Age into bands and weight them (children and the elderly were more likely to be saved).
Here we take the fitting approach first; a sketch of the discretization alternative follows the code below.
from sklearn.ensemble import RandomForestRegressor

age_df = df_origin[['Age', 'Fare', 'Pclass', 'SibSp', 'Parch', 'Survived']]
known_age = age_df[age_df.Age.notnull()].values     # .values replaces the long-removed as_matrix()
unknown_age = age_df[age_df.Age.isnull()].values
y = known_age[:, 0]    # Age is the regression target
X = known_age[:, 1:]   # the remaining columns are the features
rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
rfr.fit(X, y)
predictedAge = rfr.predict(unknown_age[:, 1:])
df_origin.loc[(df_origin.Age.isnull()), 'Age'] = predictedAge
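The discretization route mentioned above is never shown in the post. A minimal sketch using pd.cut; the bin edges, labels, and names below are illustrative choices of mine, not from the source, and it is computed on the side so it doesn't disturb the pipeline below:

# Alternative to regression imputation: discretize Age into bands
# (bin edges and labels are illustrative, not from the original post)
age_bins = [0, 12, 18, 60, 100]
age_labels = ['child', 'teen', 'adult', 'senior']
age_band = pd.cut(df_origin['Age'], bins=age_bins, labels=age_labels)
# the bands can then be one-hot encoded like any other categorical feature
dummies_age = pd.get_dummies(age_band, prefix='Age_Band')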
6. Handle the categorical features.
Convert the categorical values into numeric ones.
Methods: discretization, factorization, one-hot encoding, dummy variables, pd.get_dummies.
Take Embarked as an example. It is a single attribute whose values can be ['S', 'C', 'Q'], and we flatten it into three attributes: 'Embarked_C', 'Embarked_S', 'Embarked_Q'.
A row whose Embarked was S gets 1 under 'Embarked_S' and 0 under 'Embarked_C' and 'Embarked_Q'.
A row whose Embarked was C gets 1 under 'Embarked_C' and 0 under 'Embarked_S' and 'Embarked_Q'.
A row whose Embarked was Q gets 1 under 'Embarked_Q' and 0 under 'Embarked_C' and 'Embarked_S'.
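To make the expansion concrete, a quick toy demonstration (not from the source) of what pd.get_dummies produces; the expected output is shown as comments (newer pandas prints booleans instead of 0/1):

sample = pd.DataFrame({'Embarked': ['S', 'C', 'Q']})
print(pd.get_dummies(sample['Embarked'], prefix='Embarked'))
#    Embarked_C  Embarked_Q  Embarked_S
# 0           0           0           1
# 1           1           0           0
# 2           0           1           0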
dummies_Cabin = pd.get_dummies(df_origin['Cabin'], prefix='Cabin')
dummies_Embarked = pd.get_dummies(df_origin['Embarked'], prefix='Embarked')
dummies_Sex = pd.get_dummies(df_origin['Sex'], prefix='Sex')
dummies_Pclass = pd.get_dummies(df_origin['Pclass'], prefix='Pclass')
# Append the encoded columns to the dataframe, then drop the raw columns
# (the original called drop() without inplace=True or reassignment, so it had no effect)
df = pd.concat([df_origin, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
df.drop(['Cabin', 'Name', 'Embarked', 'Ticket', 'Sex', 'Pclass'], axis=1, inplace=True)
7. Handle the numeric features.
When the value ranges of the attributes differ wildly, convergence speed takes massive damage, and the optimizer may not converge at all!
So we do scaling. Note that StandardScaler standardizes each feature to zero mean and unit variance; it doesn't literally squash values into [-1, 1], but most values end up in a small range around 0.
Method: sklearn.preprocessing
from sklearn.preprocessing import StandardScaler

# StandardScaler expects a 2D array, so select each column as a one-column frame
df['Age_Scaled'] = StandardScaler().fit_transform(df[['Age']])
df['Fare_Scaled'] = StandardScaler().fit_transform(df[['Fare']])
8. Build the new training set (label plus features) as a matrix and start training a model; here we use logistic regression.
from sklearn import linear_model

train_df = df.filter(regex='PassengerId|Survived|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')
train_np = train_df.values
train_df.to_csv('train_feature_1.csv', index=False)
# Build the input/output matrices: column 0 is PassengerId, column 1 is Survived
y = train_np[:, 1]
X = train_np[:, 2:]
# named lr (later cells reference it by this name); liblinear is required for the l1 penalty
lr = linear_model.LogisticRegression(C=1.0, penalty='l1', tol=1e-6, solver='liblinear')
lr.fit(X, y)
9. Apply the same operations to the test set to obtain the same features.
data_test = pd.read_csv('test.csv')
# Fill the missing Fare with the mean fare
# (the original averaged the boolean notnull() mask, which is a bug)
data_test.loc[(data_test.Fare.isnull()), 'Fare'] = data_test.Fare.mean()
# Apply the same feature transformations to test_data as to train_data
# First fill the missing ages with the same RandomForestRegressor approach
tmp_df = data_test[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
null_age = tmp_df[data_test.Age.isnull()].values
notnull_age = tmp_df[data_test.Age.notnull()].values
y = notnull_age[:, 0]
X = notnull_age[:, 1:]
rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
rfr.fit(X, y)
X = null_age[:, 1:]
predictedAges = rfr.predict(X)
# Fill the ages predicted from the feature matrix X back in
data_test.loc[(data_test.Age.isnull()), 'Age'] = predictedAges
# As with the training set: mark Cabin presence before filling (order matters)
data_test.loc[(data_test.Cabin.notnull()), 'Cabin'] = 'Yes'
data_test.loc[(data_test.Cabin.isnull()), 'Cabin'] = 'No'
dummies_Cabin = pd.get_dummies(data_test['Cabin'], prefix='Cabin')
dummies_Embarked = pd.get_dummies(data_test['Embarked'], prefix='Embarked')
dummies_Sex = pd.get_dummies(data_test['Sex'], prefix='Sex')
dummies_Pclass = pd.get_dummies(data_test['Pclass'], prefix='Pclass')
df_test = pd.concat([data_test, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
df_test.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)
# Strictly speaking the scalers fitted on the training set should be reused here;
# the original refits on the test set, which we keep for fidelity
df_test['Age_Scaled'] = StandardScaler().fit_transform(df_test[['Age']])
df_test['Fare_Scaled'] = StandardScaler().fit_transform(df_test[['Fare']])
test = df_test.filter(regex='PassengerId|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')
test.to_csv('test_feature_1.csv', index=False)
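The post stops at saving the test features and never shows the final prediction and submission step. A minimal sketch, assuming the lr model fitted in step 8 and the test frame built above (the output filename is my own choice):

# Predict with the fitted logistic regression and write a Kaggle submission file.
# Column 0 of `test` is PassengerId; the remaining columns match X from training.
predictions = lr.predict(test.values[:, 1:])
submission = pd.DataFrame({
    'PassengerId': test['PassengerId'].astype(int),
    'Survived': predictions.astype(int)
})
submission.to_csv('logistic_regression_predictions.csv', index=False)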
*********************** The preliminary work is done; next comes the model- and feature-level work **********************
1. How do we judge whether the model we built is any good?
Cross-validation! Cross-validation! Cross-validation! ...
train_data = pd.read_csv('train_feature_1.csv')
train_data.head(10)
train_data.describe()
train_data = train_data.values
y = train_data[:, 1]   # in train_data, Survived is the second column
X = train_data[:, 2:]
C = 1.0
'''
Cross-validation: this is the validation stage, not the tuning stage
model_selection
Cross-validation: evaluating estimator performance
'''
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

kf = KFold(n_splits=5)
scores = cross_val_score(lr, X, y, scoring='f1', cv=kf)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
This tells us the model's (F1) score on the validation folds.
2. But what state is the model really in: underfitting or overfitting? Check the learning curve!
The famous learning curve helps us judge which state our model is in.
Plot the number of training samples on the x-axis and the scores on the training and cross-validation sets on the y-axis:
Underfit: high bias
Just right: bias and variance both small
Overfit: high variance
Overfitting typically shows up as a high training score with a much lower cross-validation score, leaving a large gap between the two.
If there is no overfitting, do more feature engineering: add newly derived features or combined features to the model.
(There is a lot of subtlety here; how to evaluate a model deserves its own write-up.)
from sklearn.model_selection import learning_curve

train_sizes_abs, train_scores, test_scores = learning_curve(
    lr, X, y, train_sizes=np.linspace(.01, 1., 100), cv=kf)
train_scores_mean = np.mean(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
title = 'Learning curve'
plt.figure()
plt.rcParams['font.sans-serif'] = ['SimHei']   # default font
plt.rcParams['axes.unicode_minus'] = False     # keep the minus sign '-' from rendering as a box in saved figures
plt.grid()
plt.plot(train_sizes_abs, train_scores_mean, color='b', label='Training score')
plt.plot(train_sizes_abs, test_scores_mean, color='r', label='Cross-validation score')
# fixed: the fill colors were swapped relative to the line colors
plt.fill_between(train_sizes_abs, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.1, color='b')
plt.fill_between(train_sizes_abs, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.1, color='r')
plt.xlabel('Number of training samples')
plt.ylabel('Score')
plt.title(title)
plt.legend(loc='best')   # fixed: labels were set but the legend was never drawn
plt.show()
3. The model is clearly underfitting, so the Baseline needs optimization!
But which attributes should we optimize, and how? The method is cross-validation (to get a reliable, stable model).
First look at the model coefficients (coef_), since their magnitude correlates with each feature's influence on the final decision.
# the feature columns start at position 2 (after PassengerId and Survived)
print(pd.DataFrame({'coef': list(lr.coef_.T), 'columns': list(train_df.columns)[2:]}))
Then split the training data, predict on the held-out part, and inspect the bad cases: look carefully at the samples we got wrong and ask which features are at fault and where our processing is still too coarse.
# Split the data
origin_data_train = pd.read_csv('train_feature_1.csv')
from sklearn.model_selection import train_test_split
train_cv, test_cv = train_test_split(origin_data_train)
# Fit the model on the training split
train_df = train_cv.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')
clf = linear_model.LogisticRegression(C=1.0, penalty='l1', tol=1e-6, solver='liblinear')
clf.fit(train_df.values[:, 1:], train_df.values[:, 0])
# Predict on the cross-validation split
cv_df = test_cv.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')
predictions = clf.predict(cv_df.values[:, 1:])
# Pull the mispredicted cases out of the original dataframe
# (fixed: the original indexed with [:0] instead of [:, 0])
train_data = pd.read_csv('train_feature_1.csv')
bad_cases = train_data.loc[train_data['PassengerId'].isin(
    test_cv[predictions != cv_df.values[:, 0]]['PassengerId'].values)]
print(bad_cases)
******************************************** All pre-optimization work is now complete *********************************************
Now we can try other learning algorithms. For the Baseline it is best to start with LR; after that, move on to model ensembling, xgboost, GBDT, and so on.
'''
Trying and comparing various algorithms
'''
# Logistic regression
from sklearn import linear_model, model_selection
lr = linear_model.LogisticRegression(penalty='l2', tol=0.0001, C=C)
lr.fit(X, y)
# Support vector machine
from sklearn.svm import SVC
rbf_svc = SVC(C=C, kernel='rbf', degree=3)
rbf_svc.fit(X, y)
# Neural network
from sklearn.neural_network import MLPClassifier
nn = MLPClassifier(hidden_layer_sizes=(100, 100), activation='relu', solver='adam',
                   alpha=0.0001, batch_size='auto', learning_rate='constant',
                   learning_rate_init=0.001, power_t=0.5, max_iter=400, tol=0.0001)
nn.fit(X, y)
'''
xgboost: model ensembling and tuning
'''
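The xgboost block is left empty in the original. A minimal sketch of dropping xgboost into the same X and y, assuming the xgboost package is installed (the hyperparameters are illustrative, not tuned):

# XGBoost gradient-boosted trees on the same training matrices
from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
xgb.fit(X, y)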
'''
sklearn.ensemble: model ensembling and tuning
'''
# Random forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=10, max_depth=9, min_samples_split=6, min_samples_leaf=4)
rf.fit(X, y)
# Bagging
from sklearn.ensemble import BaggingClassifier
clf = SVC(C=C, kernel='rbf', degree=3)
bagging_clf = BaggingClassifier(clf, n_estimators=20, max_samples=0.8, max_features=1.0,
                                bootstrap=True, bootstrap_features=False)
bagging_clf.fit(X, y)
# AdaBoost
from sklearn.ensemble import AdaBoostClassifier
adb = AdaBoostClassifier()
adb.fit(X, y)
# VotingClassifier
from sklearn.ensemble import VotingClassifier
vc = VotingClassifier(estimators=[('lr', lr), ('rf', rf), ('adb', adb)],
                      voting='soft', weights=None, n_jobs=1)
vc.fit(X, y)
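The comparison itself is not shown in the source. One simple way to rank the models above on an equal footing is to reuse cross_val_score with the same folds (default scoring is accuracy); this sketch assumes the kf split and imports from the cross-validation step:

# Score every fitted candidate with the same 5-fold cross-validation
models = {'lr': lr, 'rbf_svc': rbf_svc, 'nn': nn, 'rf': rf,
          'bagging': bagging_clf, 'adaboost': adb, 'voting': vc}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=kf)
    print('%-10s mean accuracy: %.4f (+/- %.4f)' % (name, scores.mean(), scores.std()))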
Comparing these algorithms on top of the Baseline, SVC does a little better, landing somewhere past rank 3000 on the leaderboard. Ha, at my own expense: clearly plenty of optimization remains!
A summary diagram will go here; I will flesh it out later.