Kaggle - Titanic Survival Prediction


This is my first Kaggle competition, and the Titanic challenge makes a good entry point. The goal is to predict each passenger's survival from the passenger information provided; I used Python 3 throughout.

I. Data Overview

From the Kaggle platform we know that the training set has 891 records and the test set has 418. The variables provided are:

Variable    Definition                                   Key
survival    Survival                                     0 = No, 1 = Yes
pclass      Ticket class                                 1 = 1st, 2 = 2nd, 3 = 3rd
sex         Sex
Age         Age in years
sibsp       # of siblings / spouses aboard the Titanic
parch       # of parents / children aboard the Titanic
ticket      Ticket number
fare        Passenger fare
cabin       Cabin number
embarked    Port of Embarkation                          C = Cherbourg, Q = Queenstown, S = Southampton

Notes:
- pclass is a proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower.
- Age is fractional if less than 1; if the age is estimated, it is in the form xx.5.
- sibsp: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored).
- parch: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch = 0 for them.

First, inspect the basic information of the training and test sets to get an overall sense of the data size, each feature's dtype, and which features have missing values:

import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection is the current home
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

# Load the data
train = pd.read_csv('/Users/jingxuan.ljx/Documents/machine learning/kaggle/Titanic/train.csv')
test = pd.read_csv('/Users/jingxuan.ljx/Documents/machine learning/kaggle/Titanic/test.csv')
# The original used DataFrame.append, which was removed in pandas 2.0
train_test_combined = pd.concat([train, test], ignore_index=True)

# Inspect the basic information
print (train.info())
print (test.info())

The output is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

So: in the training set, Age, Cabin, and Embarked have missing values; in the test set, Age, Cabin, and Fare have missing values.
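As a side note, a more compact way to list only the missing counts is isnull().sum(); a minimal sketch (my addition, assuming train and test are the DataFrames loaded above):

# Count missing values per column
print (train.isnull().sum())
print (test.isnull().sum())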

Next, look at the concrete format of the data:

# Print the first 5 rows by default
print (train.head())

I use the Sublime editor, and with this many columns the printout wraps across several lines and is hard to read, so I inspected the data directly on Kaggle instead. (The screenshot of the Kaggle data preview is omitted here.)
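If you would rather keep everything in the terminal, pandas' display options can stop the wrapping; a minimal sketch (my addition, standard pandas options):

# Widen the console output so head() stays readable with many columns
pd.set_option('display.max_columns', None)  # show all columns
pd.set_option('display.width', 200)         # allow wider lines before wrapping
print (train.head())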

II. Preliminary Data Analysis

1. Basic passenger attributes

For the categorical variables Survived, Sex, Pclass, and Embarked, use pie charts to show their composition. For the discrete numeric variables SibSp and Parch, use bar charts to show their distributions. For the continuous numeric variables Age and Fare, use histograms.

# Pie charts for the categorical variables
# labeldistance: how far the labels sit from the center, 1.1 = 1.1x the radius
# autopct: format of the percentage text inside the pie, e.g. "%5.2f%%"
# shadow: whether the pie has a shadow
# startangle: starting angle; 0 starts counter-clockwise from 3 o'clock, 90 usually looks better
# pctdistance: distance of the percentage text from the center
plt.subplot(2,2,1)
survived_counts = train['Survived'].value_counts()
survived_labels = ['Died','Survived']
plt.pie(x=survived_counts, labels=survived_labels, autopct="%5.2f%%", pctdistance=0.6,
        shadow=False, labeldistance=1.1, startangle=90)
plt.title('Survived')
plt.axis('equal')  # draw the pie as a perfect circle
#plt.show()

plt.subplot(2,2,2)
gender_counts = train['Sex'].value_counts()
plt.pie(x=gender_counts, labels=gender_counts.keys(), autopct="%5.2f%%", pctdistance=0.6,
        shadow=False, labeldistance=1.1, startangle=90)
plt.title('Gender')
plt.axis('equal')

plt.subplot(2,2,3)
pclass_counts = train['Pclass'].value_counts()
plt.pie(x=pclass_counts, labels=pclass_counts.keys(), autopct="%5.2f%%", pctdistance=0.6,
        shadow=False, labeldistance=1.1, startangle=90)
plt.title('Pclass')
plt.axis('equal')

plt.subplot(2,2,4)
embarked_counts = train['Embarked'].value_counts()
plt.pie(x=embarked_counts, labels=embarked_counts.keys(), autopct="%5.2f%%", pctdistance=0.6,
        shadow=False, labeldistance=1.1, startangle=90)
plt.title('Embarked')
plt.axis('equal')
plt.show()

# Bar charts for the discrete counts, histograms for the continuous variables
plt.subplot(2,2,1)
sibsp_counts = train['SibSp'].value_counts().to_dict()
plt.bar(list(sibsp_counts.keys()), list(sibsp_counts.values()))
plt.title('SibSp')

plt.subplot(2,2,2)
parch_counts = train['Parch'].value_counts().to_dict()
plt.bar(list(parch_counts.keys()), list(parch_counts.values()))
plt.title('Parch')

plt.style.use('ggplot')
plt.subplot(2,2,3)
plt.hist(train.Age.dropna(), bins=np.arange(0,100,5), range=(0,100), color='steelblue', edgecolor='k')  # dropna(): hist cannot handle NaN ages
plt.title('Age')

plt.subplot(2,2,4)
plt.hist(train.Fare, bins=20, color='steelblue', edgecolor='k')
plt.title('Fare')
plt.show()

2. Relationship between individual factors and survival

(1) Sex:

Compute the survival rate for each sex:

print (train.groupby('Sex')['Survived'].value_counts())
print (train.groupby('Sex')['Survived'].mean())

The output is:

Sex     Survived
female  1           233
        0            81
male    0           468
        1           109

Sex
female    0.742038
male      0.188908

So: the survival rate is 74.20% for women but only 18.89% for men. Women were far more likely to survive, so sex is an important factor.
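To put a number on "important", one option (my addition, not part of the original analysis) is a chi-square test of independence between Sex and Survived using scipy:

from scipy.stats import chi2_contingency

# Contingency table of Sex vs. Survived, then a chi-square independence test
contingency = pd.crosstab(train['Sex'], train['Survived'])
chi2_stat, p_value, dof, expected = chi2_contingency(contingency)
print (chi2_stat, p_value)  # a tiny p-value means Sex and Survived are strongly associated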

(2) Age:

Compute the survival rate at each age:

fig, axis1 = plt.subplots(1, 1, figsize=(18,4))
train_age = train.dropna(subset=['Age']).copy()  # .copy() avoids SettingWithCopyWarning below
train_age["Age_int"] = train_age["Age"].astype(int)
train_age.groupby('Age_int')['Survived'].mean().plot(kind='bar')
plt.show()

The output is a bar chart of the survival rate at each integer age (figure omitted).

So: young children have a relatively high survival rate, while several of the older age groups have a survival rate of 0. Next, look at the exact counts of survivors and non-survivors at each age.

print (train_age.groupby('Age_int')['Survived'].value_counts())

The output is:

Age_int  Survived
0        1            7
1        1            5
         0            2
2        0            7
         1            3
3        1            5
         0            1
4        1            7
         0            3
5        1            4
6        1            2
         0            1
7        0            2
         1            1
8        0            2
         1            2
9        0            6
         1            2
10       0            2
11       0            3
         1            1
12       1            1
13       1            2
14       0            4
         1            3
15       1            4
         0            1
16       0           11
         1            6
17       0            7
         1            6
18       0           17
         1            9
19       0           16
         1            9
20       0           13
         1            3
21       0           19
         1            5
22       0           16
         1           11
23       0           11
         1            5
24       0           16
         1           15
25       0           17
         1            6
26       0           12
         1            6
27       1           11
         0            7
28       0           20
         1            7
29       0           12
         1            8
30       0           17
         1           10
31       0            9
         1            8
32       0           10
         1           10
33       0            9
         1            6
34       0           10
         1            6
35       1           11
         0            7
36       0           12
         1           11
37       0            5
         1            1
38       0            6
         1            5
39       0            9
         1            5
40       0            9
         1            6
41       0            4
         1            2
42       0            7
         1            6
43       0            4
         1            1
44       0            6
         1            3
45       0            9
         1            5
46       0            3
47       0            8
         1            1
48       1            6
         0            3
49       1            4
         0            2
50       0            5
         1            5
51       0            5
         1            2
52       0            3
         1            3
53       1            1
54       0            5
         1            3
55       0            2
         1            1
56       0            2
         1            2
57       0            2
58       1            3
         0            2
59       0            2
60       0            2
         1            2
61       0            3
62       0            2
         1            2
63       1            2
64       0            2
65       0            3
66       0            1
70       0            3
71       0            2
74       0            1
80       1            1

Next, bin age into several groups and compute the survival rate for each. Children under 1 year old have a 100% survival rate, so treat them as their own group, then use 1-15, 15-55, and >55 as the other three groups.

train_age['Age_derived'] = pd.cut(train_age['Age'], bins=[0,0.99,14.99,54.99,100])
print (train_age.groupby('Age_derived')['Survived'].value_counts())
print (train_age.groupby('Age_derived')['Survived'].mean())

The output is:

Age_derived     Survived
(0.0, 0.99]     1             7
(0.99, 14.99]   1            38
                0            33
(14.99, 54.99]  0           362
                1           232
(54.99, 100.0]  0            29
                1            13

Age_derived
(0.0, 0.99]       1.000000
(0.99, 14.99]     0.535211
(14.99, 54.99]    0.390572
(54.99, 100.0]    0.309524

So: children have a higher survival rate than adults and the elderly.

(3) Ticket class (Pclass):

Compute the survival rate for each ticket class:

print (train.groupby('Pclass')['Survived'].value_counts())
print (train.groupby('Pclass')['Survived'].mean())

The output is:

Pclass  Survived
1       1           136
        0            80
2       0            97
        1            87
3       0           372
        1           119

Pclass
1    0.629630
2    0.472826
3    0.242363

So: the survival rate is 62.96% in first class, 47.28% in second class, and 24.24% in third class. Ticket class is therefore another important factor.

(4) Port of embarkation:

Compute the survival rate for passengers from each port of embarkation:

print (train.groupby('Embarked')['Survived'].value_counts())
print (train.groupby('Embarked')['Survived'].mean())

The output is:

Embarked  Survived
C         1            93
          0            75
Q         0            47
          1            30
S         0           427
          1           217

Embarked
C    0.553571
Q    0.389610
S    0.336957

So: the survival rate is 55.36% for port C, 38.96% for port Q, and 33.70% for port S. Port C is noticeably higher, so the port of embarkation may also influence survival.

(5) Siblings/spouses aboard (SibSp) and parents/children aboard (Parch):

Compute the survival rate for each value of SibSp and of Parch:

print (train.groupby('SibSp')['Survived'].value_counts())
print (train.groupby('SibSp')['Survived'].mean())
print (train.groupby('Parch')['Survived'].value_counts())
print (train.groupby('Parch')['Survived'].mean())

The output is:

SibSp  Survived
0      0           398
       1           210
1      1           112
       0            97
2      0            15
       1            13
3      0            12
       1             4
4      0            15
       1             3
5      0             5
8      0             7

SibSp
0    0.345395
1    0.535885
2    0.464286
3    0.250000
4    0.166667
5    0.000000
8    0.000000

Parch  Survived
0      0           445
       1           233
1      1            65
       0            53
2      0            40
       1            40
3      1             3
       0             2
4      0             4
5      0             4
       1             1
6      0             1

Parch
0    0.343658
1    0.550847
2    0.500000
3    0.600000
4    0.000000
5    0.200000
6    0.000000

So: passengers traveling alone had a lower survival rate, but having too many relatives aboard also lowered it.
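This can be made concrete with a derived "traveling alone" flag (my addition):

# Survival rate for passengers traveling alone vs. with relatives aboard
train['Alone'] = (train['SibSp'] + train['Parch']) == 0
print (train.groupby('Alone')['Survived'].mean())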

(6) Cabin:

Cabin has a very high missing rate, too high to impute. For now, split it into missing vs. not missing and compute the survival rate for each.

train.loc[train['Cabin'].isnull(),'Cabin_derived'] = 'Missing'
train.loc[train['Cabin'].notnull(),'Cabin_derived'] = 'Not Missing'
print (train.groupby('Cabin_derived')['Survived'].value_counts())
print (train.groupby('Cabin_derived')['Survived'].mean())

The output is:

Cabin_derived  Survived
Missing        0           481
               1           206
Not Missing    1           136
               0            68

Cabin_derived
Missing        0.299854
Not Missing    0.666667

So: passengers with a missing Cabin survived at 29.99%, versus 66.67% for those with a recorded Cabin, so whether Cabin is missing may be related to survival.

(7) Fare:

First check whether fares differ between survivors and non-survivors:

print (train['Fare'][train['Survived'] == 0].describe())
print (train['Fare'][train['Survived'] == 1].describe())

The output is:

count    549.000000
mean      22.117887
std       31.388207
min        0.000000
25%        7.854200
50%       10.500000
75%       26.000000
max      263.000000

count    342.000000
mean      48.395408
std       66.596998
min        0.000000
25%       12.475000
50%       26.000000
75%       57.000000
max      512.329200

So: the median fare is 26 for survivors versus 10.5 for non-survivors, a clear difference.
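Since Fare is heavily right-skewed, a rank-based test is a reasonable way to confirm the gap; a sketch using scipy's Mann-Whitney U test (my addition, not in the original):

from scipy.stats import mannwhitneyu

fare_died = train.loc[train['Survived'] == 0, 'Fare']
fare_survived = train.loc[train['Survived'] == 1, 'Fare']
stat, p = mannwhitneyu(fare_died, fare_survived, alternative='two-sided')
print (stat, p)  # a small p-value indicates the two fare distributions differ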

(8) Name:

At first glance every passenger's name is unique, so this feature seems worthless. That intuition is quite wrong: in the Titanic data, Name is very important and carries useful information. First look at what Name actually contains:

print (train.Name)

The output is:

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
5                                       Moran, Mr. James
6                                McCarthy, Mr. Timothy J
7                         Palsson, Master. Gosta Leonard
8      Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                    Nasser, Mrs. Nicholas (Adele Achem)
10                       Sandstrom, Miss. Marguerite Rut
11                              Bonnell, Miss. Elizabeth
12                        Saundercock, Mr. William Henry
13                           Andersson, Mr. Anders Johan
14                  Vestrom, Miss. Hulda Amanda Adolfina
15                      Hewlett, Mrs. (Mary D Kingcome) 
16                                  Rice, Master. Eugene
17                          Williams, Mr. Charles Eugene
18     Vander Planke, Mrs. Julius (Emelia Maria Vande...
19                               Masselmani, Mrs. Fatima
...

So: the names contain a title such as Mr., Mrs., Miss., or Master. Let's first extract that as a standalone feature, Title:

train['Title'] = train['Name'].map(lambda x: re.compile(r", (.*?)\.").findall(x)[0])
print (train['Title'].value_counts())

The output is:

Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Col               2
Major             2
Mlle              2
Mme               1
Ms                1
Don               1
Sir               1
Jonkheer          1
Capt              1
Lady              1
the Countess      1
Name: Title, dtype: int64

The Master title has a fairly large count; let's see which group of passengers it refers to:

print (train[train['Title'] == 'Master'][['Survived','Title','Sex','Parch','SibSp','Fare','Age','Embarked']])

The output is:

     Survived   Title   Sex  Parch  SibSp      Fare    Age Embarked
7           0  Master  male      1      3   21.0750   2.00        S
16          0  Master  male      1      4   29.1250   2.00        Q
50          0  Master  male      1      4   39.6875   7.00        S
59          0  Master  male      2      5   46.9000  11.00        S
63          0  Master  male      2      3   27.9000   4.00        S
65          1  Master  male      1      1   15.2458    NaN        C
78          1  Master  male      2      0   29.0000   0.83        S
125         1  Master  male      0      1   11.2417  12.00        C
159         0  Master  male      2      8   69.5500    NaN        S
164         0  Master  male      1      4   39.6875   1.00        S
165         1  Master  male      2      0   20.5250   9.00        S
171         0  Master  male      1      4   29.1250   4.00        Q
176         0  Master  male      1      3   25.4667    NaN        S
...

As we can see, Master refers to young boys.
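A quick way to confirm this (my addition) is to look at the age distribution of the Master group:

# Ages of passengers titled 'Master' -- all children in the training set
print (train.loc[train['Title'] == 'Master', 'Age'].describe())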

There are many distinct Title values, so merge them and then check whether survival differs across titles:

train['Title'] = train['Title'].replace(['Lady', 'the Countess','Capt', 'Col','Don',
        'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
train['Title'] = train['Title'].replace(['Mlle', 'Ms'], 'Miss')
train['Title'] = train['Title'].replace('Mme', 'Mrs')
# These prints produce the output shown next (they were missing from the original snippet)
print (train.groupby('Title')['Survived'].value_counts())
print (train.groupby('Title')['Survived'].mean())

The output is:

Title   Survived
Master  1            23
        0            17
Miss    1           130
        0            55
Mr      0           436
        1            81
Mrs     1           100
        0            26
Rare    0            15
        1             8

Title
Master    0.575000
Miss      0.702703
Mr        0.156673
Mrs       0.793651
Rare      0.347826

So: Title is a factor that affects survival.

III. Data Preprocessing

This covers missing-value imputation, discretization of the continuous numeric variables, and dummy encoding of the categorical variables. The preprocessing is done on the training and test sets combined.

1. Missing-value imputation

From the analysis above, Age, Cabin, and Embarked have missing values in the training set, and Age, Cabin, and Fare in the test set. Cabin's missing rate (>70%) is too high, so we do not impute it.
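The missing rate itself is easy to verify (my addition); from the info() output above, 1309 - 295 = 1014 Cabin values are missing:

# Fraction of missing Cabin values in the combined data (about 0.77)
print (train_test_combined['Cabin'].isnull().mean())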

(1) Imputing Embarked:

Embarked is a categorical variable for the port of embarkation, with values C, Q, S. Fill missing values with the most frequent value.

train_test_combined['Embarked'].fillna(train_test_combined['Embarked'].mode().iloc[0], inplace=True)

(2) Imputing Fare:

Fare is numeric; impute missing values with the mean fare of the corresponding Pclass.

train_test_combined['Fare'] = train_test_combined[['Fare']].fillna(train_test_combined.groupby('Pclass').transform('mean'))
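The one-liner above leans on fillna's DataFrame alignment, and on newer pandas versions transform('mean') over the whole frame can fail on non-numeric columns; an explicit, version-robust equivalent (my addition) is shown below, and the same pattern works for the Age imputation in the next step:

# Explicit equivalent: fill missing fares with the mean fare of the same Pclass
train_test_combined['Fare'] = train_test_combined['Fare'].fillna(
    train_test_combined.groupby('Pclass')['Fare'].transform('mean'))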

(3) Imputing Age:

Age is numeric; impute missing values with the mean age of the corresponding Title (Mr, Mrs, Miss, Master, etc.).

train_test_combined['Title'] = train_test_combined['Name'].map(lambda x: re.compile(r", (.*?)\.").findall(x)[0])
train_test_combined['Age'] = train_test_combined[['Age']].fillna(train_test_combined.groupby('Title').transform('mean'))

After imputation, check the data again:

print (train_test_combined.info())

The output is:

Data columns (total 13 columns):
Age            1309 non-null float64
Cabin          295 non-null object
Embarked       1309 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
Title          1309 non-null object
dtypes: float64(3), int64(4), object(6)

2. Discretizing the continuous numeric variables

(1) Age:

Based on the earlier analysis of age vs. survival, split age into four bins: <1, 1 to <15, 15 to <55, and >=55.

train_test_combined['Age_derived'] = pd.cut(train_test_combined['Age'], bins=[0,0.99,14.99,54.99,100],labels=['baby','child','adult','older'])
age_dummy = pd.get_dummies(train_test_combined['Age_derived']).rename(columns=lambda x: 'Age_' + str(x))
train_test_combined = pd.concat([train_test_combined,age_dummy],axis=1)

(2) Fare:

Inspecting Ticket shows that some passengers share the same ticket number, i.e. there are group tickets, so the group fare needs to be split evenly across its members.

print (train_test_combined.Ticket.value_counts())

The output is:

CA. 2343              11
CA 2144                8
1601                   8
347082                 7
3101295                7
PC 17608               7
S.O.C. 14879           7
347077                 7
19950                  6
347088                 6
113781                 6
382652                 6
...

Split the group fares evenly:

train_test_combined['Group_ticket'] = train_test_combined['Fare'].groupby(by=train_test_combined['Ticket']).transform('count')
train_test_combined['Fare'] = train_test_combined['Fare']/train_test_combined['Group_ticket']

Check Fare's mean, median, and other summary statistics:

print (train_test_combined['Fare'].describe())

The output is:

count    1309.000000
mean       14.756516
std        13.550515
min         0.000000
25%         7.550000
50%         8.050000
75%        15.000000
max       128.082300
Name: Fare, dtype: float64

Using P25 and P75 as cut points, split Fare into three tiers: Low_fare <= 7.55, Median_fare 7.55-15.00, High_fare > 15.00.

train_test_combined['Fare_derived'] = pd.cut(train_test_combined['Fare'], bins=[-1,7.55,15.00,130], labels=['Low_fare','Median_fare','High_fare'])
fare_dummy = pd.get_dummies(train_test_combined['Fare_derived']).rename(columns=lambda x: str(x))
train_test_combined = pd.concat([train_test_combined,fare_dummy],axis=1)

3. Family Size

SibSp and Parch both count relatives aboard, so add them together into a new variable, Family_size.

train_test_combined['Family_size'] = train_test_combined['Parch'] + train_test_combined['SibSp']
print (train_test_combined.groupby('Family_size')['Survived'].value_counts())
print (train_test_combined.groupby('Family_size')['Survived'].mean())

The output is:

Family_size  Survived
0            0           374
             1           163
1            1            89
             0            72
2            1            59
             0            43
3            1            21
             0             8
4            0            12
             1             3
5            0            19
             1             3
6            0             8
             1             4
7            0             6
10           0             7

Family_size
0     0.303538
1     0.552795
2     0.578431
3     0.724138
4     0.200000
5     0.136364
6     0.333333
7     0.000000
10    0.000000

Survival is lower both for passengers traveling alone and when the family is too large. Split Family_size into three categories: Single, Small family, and Large family.

def family_size_category(family_size):
    if family_size == 0:
        return 'Single'
    elif family_size <= 3:
        return 'Small family'
    else:
        return 'Large family'

train_test_combined['Family_size_category'] = train_test_combined['Family_size'].map(family_size_category)
family_dummy = pd.get_dummies(train_test_combined['Family_size_category']).rename(columns=lambda x: str(x))
train_test_combined = pd.concat([train_test_combined,family_dummy],axis=1)

4. Title

Extract the Title feature from Name:

train_test_combined['Title'] = train_test_combined['Name'].map(lambda x: re.compile(r", (.*?)\.").findall(x)[0])
train_test_combined['Title'] = train_test_combined['Title'].replace(['Lady', 'the Countess','Capt', 'Col','Don',
        'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
train_test_combined['Title'] = train_test_combined['Title'].replace(['Mlle', 'Ms'], 'Miss')
train_test_combined['Title'] = train_test_combined['Title'].replace('Mme', 'Mrs')
title_dummy = pd.get_dummies(train_test_combined['Title']).rename(columns=lambda x: 'Title_' + str(x))
train_test_combined = pd.concat([train_test_combined,title_dummy],axis=1)

5. Cabin

Create a new variable based on whether Cabin is missing:

train_test_combined.loc[train_test_combined['Cabin'].isnull(),'Cabin_derived'] = 'Missing'
train_test_combined.loc[train_test_combined['Cabin'].notnull(),'Cabin_derived'] = 'Not Missing'
cabin_dummy = pd.get_dummies(train_test_combined['Cabin_derived']).rename(columns=lambda x: 'Cabin_' + str(x))
train_test_combined = pd.concat([train_test_combined,cabin_dummy],axis=1)

6. Pclass, Sex, and Embarked

These three variables only need dummy encoding, nothing else.

# Dummy-encode Pclass
pclass_dummy = pd.get_dummies(train_test_combined['Pclass']).rename(columns=lambda x: 'Pclass_' + str(x))
train_test_combined = pd.concat([train_test_combined,pclass_dummy],axis=1)

# Dummy-encode Sex
sex_dummy = pd.get_dummies(train_test_combined['Sex']).rename(columns=lambda x: str(x))
train_test_combined = pd.concat([train_test_combined,sex_dummy],axis=1)

# Dummy-encode Embarked
embarked_dummy = pd.get_dummies(train_test_combined['Embarked']).rename(columns=lambda x: 'Embarked_' + str(x))
train_test_combined = pd.concat([train_test_combined,embarked_dummy],axis=1)

Finally, split the combined data back into training and test sets and keep the useful features.

train = train_test_combined[:891]
test = train_test_combined[891:]

selected_features = ['Embarked_C', 'female', 'male', 'Embarked_Q', 'Embarked_S', 'Age_baby', 'Age_child',
                     'Age_adult', 'Age_older', 'Low_fare', 'Median_fare', 'High_fare', 'Large family',
                     'Single', 'Small family', 'Title_Master', 'Title_Miss', 'Title_Mr', 'Title_Mrs',
                     'Title_Rare', 'Cabin_Missing', 'Cabin_Not Missing', 'Pclass_1', 'Pclass_2', 'Pclass_3']

x_train = train[selected_features]
x_test = test[selected_features]
y_train = train['Survived']

Preprocessing is now complete, and we can move on to building models for prediction.
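Before modeling, it is worth a quick sanity check that the design matrices have the expected shapes and no remaining missing values (my addition):

# Sanity checks on the final feature matrices
print (x_train.shape, x_test.shape, y_train.shape)  # expect (891, 25), (418, 25), (891,)
assert not x_train.isnull().any().any()
assert not x_test.isnull().any().any()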

IV. Modeling and Analysis

1. Logistic regression

Use grid search with cross-validation (GridSearchCV) to find the best value of the hyperparameter C:

lr = LogisticRegression(random_state=33)
param_lr = {'C': np.logspace(-4,4,9)}
grid_lr = GridSearchCV(estimator=lr, param_grid=param_lr, cv=5)
grid_lr.fit(x_train, y_train)
# grid_scores_ existed in scikit-learn < 0.20; on newer versions inspect grid_lr.cv_results_ instead
print (grid_lr.grid_scores_,'\n', 'Best param: ' ,grid_lr.best_params_, '\n', 'Best score: ', grid_lr.best_score_)

The output is:

[mean: 0.64646, std: 0.00833, params: {'C': 0.0001}, 
mean: 0.70595, std: 0.01292, params: {'C': 0.001}, 
mean: 0.80471, std: 0.02215, params: {'C': 0.01}, 
mean: 0.82043, std: 0.00361, params: {'C': 0.10000000000000001}, 
mean: 0.82492, std: 0.02629, params: {'C': 1.0}, 
mean: 0.82379, std: 0.02747, params: {'C': 10.0}, 
mean: 0.82492, std: 0.02813, params: {'C': 100.0}, 
mean: 0.82492, std: 0.02813, params: {'C': 1000.0}, 
mean: 0.82492, std: 0.02813, params: {'C': 10000.0}] Best param:  {'C': 1.0} Best score:  0.8249158249158249

Print each feature's coefficient:

print (pd.DataFrame({"columns":list(x_train.columns), "coef":list(grid_lr.best_estimator_.coef_.T)}))

The output is:

              columns                coef
0          Embarked_C     [0.23649956536]
1              female    [0.892754957337]
2                male   [-0.817790866598]
3          Embarked_Q   [0.0560917611675]
4          Embarked_S   [-0.217627235788]
5            Age_baby    [0.903880875824]
6           Age_child    [0.307975441906]
7           Age_adult    [-0.12853864715]
8           Age_older    [-1.00835357984]
9            Low_fare   [-0.343780990932]
10        Median_fare   [-0.102505740604]
11          High_fare    [0.521250822275]
12       Large family    [-1.40958453387]
13             Single    [0.864627435362]
14       Small family    [0.619921189252]
15       Title_Master     [1.76928521042]
16         Title_Miss  [0.00766966811902]
17           Title_Mr    [-1.21722551405]
18          Title_Mrs    [0.469708936608]
19         Title_Rare   [-0.954474210357]
20      Cabin_Missing    [-0.35111453535]
21  Cabin_Not Missing    [0.426078626089]
22           Pclass_1    [0.279724883526]
23           Pclass_2    [0.295636224026]
24           Pclass_3   [-0.500397016812]

Now use the tuned model to predict the test set and save the result locally.

lr_y_predict = grid_lr.predict(x_test).astype('int')  # predict with the tuned model (the original mistakenly called the unfitted lr)
lr_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':lr_y_predict})
lr_submission.to_csv('../lr_submission.csv', index=False)

Finally, make a submission on Kaggle. The score is 0.7799.

2. Decision tree

Use GridSearchCV to find the best max_depth and min_samples_split:

clf = tree.DecisionTreeClassifier(random_state=33)
param_clf = {'max_depth':[3,5,10,15,20,25],'min_samples_split':[2,4,6,8,10,15,20]}
grid_clf = GridSearchCV(estimator=clf, param_grid=param_clf, cv=5)
grid_clf.fit(x_train, y_train)
print (grid_clf.grid_scores_,'\n', 'Best param: ' ,grid_clf.best_params_, '\n', 'Best score: ', grid_clf.best_score_)

# Print the feature importances
feature_imp_sorted_clf = pd.DataFrame({'feature': list(x_train.columns),
                                       'importance': grid_clf.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
print (feature_imp_sorted_clf)

# Write out the predictions
clf_y_predict = grid_clf.predict(x_test).astype('int')
clf_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':clf_y_predict})
clf_submission.to_csv('../clf_submission.csv', index=False)

The output is:

[mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 2}, 
mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 4}, 
mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 6}, 
mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 8}, 
mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 10}, 
mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 15}, 
mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 20}, 
mean: 0.83277, std: 0.02694, params: {'max_depth': 5, 'min_samples_split': 2}, 
mean: 0.83277, std: 0.02694, params: {'max_depth': 5, 'min_samples_split': 4}, 
mean: 0.83277, std: 0.02694, params: {'max_depth': 5, 'min_samples_split': 6}, 
mean: 0.83389, std: 0.02877, params: {'max_depth': 5, 'min_samples_split': 8}, 
mean: 0.83389, std: 0.02877, params: {'max_depth': 5, 'min_samples_split': 10}, 
mean: 0.83389, std: 0.02877, params: {'max_depth': 5, 'min_samples_split': 15}, 
mean: 0.83277, std: 0.02694, params: {'max_depth': 5, 'min_samples_split': 20}, 
mean: 0.81930, std: 0.01400, params: {'max_depth': 10, 'min_samples_split': 2}, 
mean: 0.81930, std: 0.01848, params: {'max_depth': 10, 'min_samples_split': 4}, 
mean: 0.82043, std: 0.01939, params: {'max_depth': 10, 'min_samples_split': 6}, 
mean: 0.82267, std: 0.02194, params: {'max_depth': 10, 'min_samples_split': 8}, 
mean: 0.82492, std: 0.02281, params: {'max_depth': 10, 'min_samples_split': 10}, 
mean: 0.82604, std: 0.02161, params: {'max_depth': 10, 'min_samples_split': 15}, 
mean: 0.82716, std: 0.01968, params: {'max_depth': 10, 'min_samples_split': 20}, 
mean: 0.81818, std: 0.01438, params: {'max_depth': 15, 'min_samples_split': 2}, 
mean: 0.81706, std: 0.01711, params: {'max_depth': 15, 'min_samples_split': 4}, 
mean: 0.81818, std: 0.01787, params: {'max_depth': 15, 'min_samples_split': 6}, 
mean: 0.82379, std: 0.02051, params: {'max_depth': 15, 'min_samples_split': 8}, 
mean: 0.82828, std: 0.02255, params: {'max_depth': 15, 'min_samples_split': 10}, 
mean: 0.82604, std: 0.02161, params: {'max_depth': 15, 'min_samples_split': 15}, 
mean: 0.82716, std: 0.01968, params: {'max_depth': 15, 'min_samples_split': 20}, 
mean: 0.81818, std: 0.01438, params: {'max_depth': 20, 'min_samples_split': 2}, 
mean: 0.81706, std: 0.01711, params: {'max_depth': 20, 'min_samples_split': 4}, 
mean: 0.81818, std: 0.01787, params: {'max_depth': 20, 'min_samples_split': 6}, 
mean: 0.82379, std: 0.02051, params: {'max_depth': 20, 'min_samples_split': 8}, 
mean: 0.82828, std: 0.02255, params: {'max_depth': 20, 'min_samples_split': 10}, 
mean: 0.82604, std: 0.02161, params: {'max_depth': 20, 'min_samples_split': 15}, 
mean: 0.82716, std: 0.01968, params: {'max_depth': 20, 'min_samples_split': 20}, 
mean: 0.81818, std: 0.01438, params: {'max_depth': 25, 'min_samples_split': 2}, 
mean: 0.81706, std: 0.01711, params: {'max_depth': 25, 'min_samples_split': 4}, 
mean: 0.81818, std: 0.01787, params: {'max_depth': 25, 'min_samples_split': 6}, 
mean: 0.82379, std: 0.02051, params: {'max_depth': 25, 'min_samples_split': 8}, 
mean: 0.82828, std: 0.02255, params: {'max_depth': 25, 'min_samples_split': 10}, 
mean: 0.82604, std: 0.02161, params: {'max_depth': 25, 'min_samples_split': 15}, 
mean: 0.82716, std: 0.01968, params: {'max_depth': 25, 'min_samples_split': 20}] Best param:  {'max_depth': 5, 'min_samples_split': 8} Best score:  0.8338945005611672

              feature  importance
17           Title_Mr    0.579502
12       Large family    0.135564
19         Title_Rare    0.066667
21  Cabin_Not Missing    0.065133
24           Pclass_3    0.045870
9            Low_fare    0.041589
4          Embarked_S    0.020851
2                male    0.014137
7           Age_adult    0.008480
23           Pclass_2    0.007741
11          High_fare    0.007008
22           Pclass_1    0.002868
13             Single    0.001521
14       Small family    0.001146
0          Embarked_C    0.001003
3          Embarked_Q    0.000633
18          Title_Mrs    0.000288
20      Cabin_Missing    0.000000
5            Age_baby    0.000000
16         Title_Miss    0.000000
6           Age_child    0.000000
1              female    0.000000
10        Median_fare    0.000000
8           Age_older    0.000000
15       Title_Master    0.000000

Visualize the fitted decision tree:

print (grid_clf.best_estimator_)
# Refit a tree with the best parameters found above (equivalent to grid_clf.best_estimator_;
# redundant defaults from the printed repr are dropped for compatibility with newer scikit-learn)
clf = tree.DecisionTreeClassifier(criterion='gini', max_depth=5, min_samples_split=8, random_state=33)
clf.fit(x_train, y_train)

import os
import pydotplus

# Make the graphviz binaries visible (this path is specific to a Homebrew install on macOS)
os.environ["PATH"] += os.pathsep + '/usr/local/Cellar/graphviz/2.40.1/bin/'

data_feature_name = list(x_train.columns)
dot_data = tree.export_graphviz(clf, out_file=None, feature_names=data_feature_name,
                                filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("TitanicTree.pdf")
print('Visible tree plot saved as pdf.')

The output is the rendered tree saved as TitanicTree.pdf (figure omitted).

Submitting on Kaggle gives an accuracy of 0.78947.

3. Random Forest

Tune with GridSearchCV in stages: first fix n_estimators, then max_features, and finally max_depth, min_samples_leaf, and min_samples_split.

# Stage 1: tune n_estimators
rf = RandomForestClassifier(random_state=33)
param_rf = {'n_estimators':[i for i in range(10,50,5)]}
#param_rf = {'n_estimators':[10,50,100,200,500,1000]}
grid_rf = GridSearchCV(estimator=rf, param_grid=param_rf, cv=5)
grid_rf.fit(x_train, y_train)
print (grid_rf.grid_scores_,'\n', 'Best param: ' ,grid_rf.best_params_, '\n', 'Best score: ', grid_rf.best_score_)

# Stage 2: tune max_features
rf = RandomForestClassifier(random_state=33, n_estimators=20)
param_rf = {'max_features':[i for i in range(2,23,2)]}
grid_rf = GridSearchCV(estimator=rf, param_grid=param_rf, cv=5)
grid_rf.fit(x_train, y_train)
print (grid_rf.grid_scores_,'\n', 'Best param: ' ,grid_rf.best_params_, '\n', 'Best score: ', grid_rf.best_score_)

# Stage 3: tune max_depth and min_samples_split
rf = RandomForestClassifier(random_state=33, n_estimators=20, max_features=18)
param_rf = {'max_depth':[i for i in range(10,25,5)],'min_samples_split':[i for i in range(12,21,2)]}
grid_rf = GridSearchCV(estimator=rf, param_grid=param_rf, cv=5)
grid_rf.fit(x_train, y_train)
print (grid_rf.grid_scores_,'\n', 'Best param: ' ,grid_rf.best_params_, '\n', 'Best score: ', grid_rf.best_score_)

# Stage 4: tune min_samples_split and min_samples_leaf
rf = RandomForestClassifier(random_state=33, n_estimators=20, max_features=18, max_depth=10)
param_rf = {'min_samples_split':[i for i in range(12,25,2)],'min_samples_leaf':[i for i in range(2,21,2)]}
grid_rf = GridSearchCV(estimator=rf, param_grid=param_rf, cv=5)
grid_rf.fit(x_train, y_train)
print (grid_rf.grid_scores_,'\n', 'Best param: ' ,grid_rf.best_params_, '\n', 'Best score: ', grid_rf.best_score_)

# Final model with the chosen parameters
rf = RandomForestClassifier(random_state=33, n_estimators=20, max_features=18, max_depth=10,
                            min_samples_leaf=2, min_samples_split=22, oob_score=True)
rf.fit(x_train, y_train)
#print (rf.oob_score_)
rf_y_predict = rf.predict(x_test).astype('int')
rf_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':rf_y_predict})
rf_submission.to_csv('../rf_submission.csv', index=False)

The final parameter combination is n_estimators=20, max_features=18, max_depth=10, min_samples_leaf=2, min_samples_split=22, with a best CV score of 0.8439955106621774.
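As with the decision tree, the fitted forest exposes feature importances, which makes a useful cross-check (my addition):

# Feature importances of the final random forest
rf_imp = pd.DataFrame({'feature': list(x_train.columns),
                       'importance': rf.feature_importances_}).sort_values('importance', ascending=False)
print (rf_imp.head(10))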

Submitting on Kaggle gives 0.79425.

4. Adaboost

AdaBoost has few parameters to tune. Use GridSearchCV to find the best n_estimators and learning_rate; these two need to be tuned together.

ada = AdaBoostClassifier(random_state=33)
param_ada = {'n_estimators':[500,1000,2000,5000],'learning_rate':[0.001,0.01,0.1]}
grid_ada = GridSearchCV(estimator = ada, param_grid = param_ada, cv = 5)
grid_ada.fit(x_train,y_train)
print (grid_ada.grid_scores_,'\n', 'Best param: ' ,grid_ada.best_params_, '\n', 'Best score: ', grid_ada.best_score_)

ada_y_predict = grid_ada.predict(x_test).astype('int')
ada_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':ada_y_predict})
ada_submission.to_csv('../ada_submission.csv', index=False)

The output is:

[mean: 0.77890, std: 0.01317, params: {'learning_rate': 0.001, 'n_estimators': 500}, 
mean: 0.78676, std: 0.01813, params: {'learning_rate': 0.001, 'n_estimators': 1000}, 
mean: 0.79125, std: 0.01352, params: {'learning_rate': 0.001, 'n_estimators': 2000}, 
mean: 0.81818, std: 0.01382, params: {'learning_rate': 0.001, 'n_estimators': 5000}, 
mean: 0.81818, std: 0.01382, params: {'learning_rate': 0.01, 'n_estimators': 500}, 
mean: 0.82941, std: 0.01887, params: {'learning_rate': 0.01, 'n_estimators': 1000}, 
mean: 0.82828, std: 0.02010, params: {'learning_rate': 0.01, 'n_estimators': 2000}, 
mean: 0.82492, std: 0.02700, params: {'learning_rate': 0.01, 'n_estimators': 5000}, 
mean: 0.82492, std: 0.02700, params: {'learning_rate': 0.1, 'n_estimators': 500}, 
mean: 0.82155, std: 0.02737, params: {'learning_rate': 0.1, 'n_estimators': 1000}, 
mean: 0.82267, std: 0.02647, params: {'learning_rate': 0.1, 'n_estimators': 2000}, 
mean: 0.82379, std: 0.02674, params: {'learning_rate': 0.1, 'n_estimators': 5000}] Best param:  {'learning_rate': 0.01, 'n_estimators': 1000} Best score:  0.8294051627384961

Submitting on Kaggle gives 0.78947.

5. Gradient tree boosting

Tune with GridSearchCV in stages: first n_estimators and learning_rate, then max_depth, min_samples_leaf, and min_samples_split.

# Stage 1: tune n_estimators and learning_rate together
gtb = GradientBoostingClassifier(random_state=33, subsample=0.8)
param_gtb = {'n_estimators':[500,1000,2000,5000],'learning_rate':[0.001,0.005,0.01,0.02]}
grid_gtb = GridSearchCV(estimator=gtb, param_grid=param_gtb, cv=5)
grid_gtb.fit(x_train, y_train)
print (grid_gtb.grid_scores_,'\n', 'Best param: ' ,grid_gtb.best_params_, '\n', 'Best score: ', grid_gtb.best_score_)

# Stage 2: tune max_depth and min_samples_split
gtb = GradientBoostingClassifier(random_state=33, subsample=0.8, n_estimators=1000, learning_rate=0.001)
param_gtb = {'max_depth':[i for i in range(10,25,5)],'min_samples_split':[i for i in range(12,21,2)]}
grid_gtb = GridSearchCV(estimator=gtb, param_grid=param_gtb, cv=5)
grid_gtb.fit(x_train, y_train)
print (grid_gtb.grid_scores_,'\n', 'Best param: ' ,grid_gtb.best_params_, '\n', 'Best score: ', grid_gtb.best_score_)

# Stage 3: tune min_samples_split and min_samples_leaf
gtb = GradientBoostingClassifier(random_state=33, subsample=0.8, n_estimators=1000, learning_rate=0.001, max_depth=10)
param_gtb = {'min_samples_split':[i for i in range(10,18,2)],'min_samples_leaf':[i for i in range(14,19,2)]}
grid_gtb = GridSearchCV(estimator=gtb, param_grid=param_gtb, cv=5)
grid_gtb.fit(x_train, y_train)
print (grid_gtb.grid_scores_,'\n', 'Best param: ' ,grid_gtb.best_params_, '\n', 'Best score: ', grid_gtb.best_score_)

# Final model with the chosen parameters
gtb = GradientBoostingClassifier(random_state=33, subsample=0.8, n_estimators=1000, learning_rate=0.001,
                                 max_depth=10, min_samples_split=10, min_samples_leaf=16)
gtb.fit(x_train, y_train)
gtb_y_predict = gtb.predict(x_test).astype('int')
print (cross_val_score(gtb, x_train, y_train, cv=5).mean())
gtb_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':gtb_y_predict})
gtb_submission.to_csv('../gtb_submission.csv', index=False)

The final parameter combination is n_estimators=1000, learning_rate=0.001, max_depth=10, min_samples_split=10, min_samples_leaf=16, with a best CV score of 0.8417508417508418.

Submitting on Kaggle gives 0.80382.

V. An Alternative Prediction Method

We know that most women survived and most men did not. How do we decide which women did not survive and which men did? A reasonable assumption is that families lived or died together: if a mother survived, her children likely survived too, and if a child died, the mother likely died as well. Since some families appear in both the training and test sets, we can use the fate of a family's women and boys in the training set to predict the fate of members of the same family in the test set. The rule: for a boy in the test set, if the women and boys of his family in the training set all survived, predict that he survived; for a woman in the test set, if the women and boys of her family all died, predict that she died. Everyone else is predicted by sex alone: women survive, men die.

# Load the data
train = pd.read_csv('/Users/jingxuan.ljx/Documents/machine learning/kaggle/Titanic/train.csv')
test = pd.read_csv('/Users/jingxuan.ljx/Documents/machine learning/kaggle/Titanic/test.csv')

# Surname key (Pclass appended: members of the same family should share the same ticket class)
train['Surname'] = [train.iloc[i]['Name'].split(',')[0] + str(train.iloc[i]['Pclass']) for i in range(len(train))]
test['Surname'] = [test.iloc[i]['Name'].split(',')[0] + str(test.iloc[i]['Pclass']) for i in range(len(test))]

train['Family_size'] = train['Parch'] + train['SibSp']
test['Family_size'] = test['Parch'] + test['SibSp']

train['Title'] = train['Name'].map(lambda x: re.compile(r", (.*?)\.").findall(x)[0])
test['Title'] = test['Name'].map(lambda x: re.compile(r", (.*?)\.").findall(x)[0])

# Group the training set's women and boys by family
boy = (train.Name.str.contains('Master')) | ((train.Sex=='male') & (train.Age<13))
female = train.Sex=='female'
boy_or_female = boy | female
boy_femSurvival = train[boy_or_female].groupby('Surname')['Survived'].mean().to_frame()
boy_femSurvived = list(boy_femSurvival[boy_femSurvival['Survived']==1].index)  # families whose women/boys all survived
boy_femDied = list(boy_femSurvival[boy_femSurvival['Survived']==0].index)      # families whose women/boys all died

def boy_female_survival(input_dataset):
    for i in range(len(input_dataset)):
        if (input_dataset.iloc[i]['Surname'] in boy_femSurvived
                and input_dataset.iloc[i]['Family_size'] > 0
                and (input_dataset.iloc[i]['Sex'] == 'female'
                     or (input_dataset.iloc[i]['Title'] == 'Master'
                         or (input_dataset.iloc[i]['Sex'] == 'male' and input_dataset.iloc[i]['Age'] < 13)))):
            input_dataset.loc[i,'Survived'] = 1
        elif input_dataset.iloc[i]['Surname'] in boy_femDied and input_dataset.iloc[i]['Family_size'] > 0:
            input_dataset.loc[i,'Survived'] = 0

boy_female_survival(test)
#print (test[test['Survived'] == 1][['Name', 'Age', 'Sex', 'Pclass','Family_size']])

test_out1 = test[test['Survived'].notnull()]
test1 = test[test['Survived'].isnull()].copy()
test1.index = range(0,len(test1))

# Predict the remaining passengers by sex alone
def gender_survival(sex):
    if sex == 'female':
        return 1
    else:
        return 0

test1['Survived'] = test1['Sex'].map(gender_survival)

# Merge the two sets of predictions
test_out = pd.concat([test_out1, test1], axis=0).sort_values(by='PassengerId')
test_submission = test_out[['PassengerId','Survived']]
test_submission['Survived'] = test_submission['Survived'].astype('int')
test_submission.to_csv('../test_submission.csv', index=False)

Submitting on Kaggle gives 0.81339, better than any of the models above.
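For context, the gender-only baseline is easy to score on the training set (my addition); the surname rule then improves on it by reassigning a handful of boys and women:

# Training-set accuracy of 'all females survive, all males die'
gender_baseline = (train['Sex'] == 'female').astype(int)
print ((gender_baseline == train['Survived']).mean())  # about 0.787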

 

References:

1. How to score over 82% Titanic
2. Kaggle_Titanic Survival Prediction -- a detailed walkthrough (Kaggle_Titanic生存预测 -- 详细流程吐血梳理)
