This post is a reading note on 【Snapshot Ensembles】《Snapshot Ensembles: Train 1, Get M for Free》.
https://github.com/gaohuang/SnapshotEnsemble
ICLR-2017
Contents
- 1 Background and Motivation
- 2 Related Work
- 3 Advantages / Contributions
- 4 Method
- 5 Experiments
- 5.1 Datasets and Metrics
- 5.2 Snapshot Ensemble Results
- 5.3 Diversity of Model Ensembles
- 6 Conclusion(own) / Future work
1 Background and Motivation
In deep learning, the traditional ensembling approach averages the predictions of multiple independently trained models, which is computationally expensive.
The authors take a different route: borrowing the idea of repeatedly restarting the learning rate from 【SGDR】《SGDR: Stochastic Gradient Descent with Warm Restarts》, they take a snapshot of the model at each restart, ensembling multiple neural networks at no additional training cost.
2 Related Work
- “implicit” ensembles
  Dropout / DropConnect / Stochastic Depth / Swapout
  The proposed method is orthogonal to these techniques and can be combined with them.
- test-time cost of ensembles
  This work instead reduces the training cost. Some prior methods are similar in spirit; the distinguishing point here is that "we take snapshots only when the model reaches a minimum".
3 Advantages / Contributions
Building on and fully exploiting the warm-restart technique of SGDR, the paper proposes a novel ensembling method, the snapshot ensemble, which greatly reduces the training cost of ensembling and improves results with standard architectures on multiple public datasets.
4 Method
L and k are DenseNet hyper-parameters (network depth and growth rate).
B is the total training budget in epochs.
The learning rate schedule of the method is formulated as

$$\alpha(t) = \frac{\alpha_0}{2}\left(\cos\left(\frac{\pi \operatorname{mod}(t-1,\ \lceil T/M \rceil)}{\lceil T/M \rceil}\right) + 1\right)$$

Here the smallest unit of $t$ is a single iteration (mini-batch update) rather than an epoch, which differs from the original SGDR paper.
$T$ is the total number of training iterations.
$M$ is the number of cycles the training process is split into ("split the training process into M cycles"), e.g., $M = 6$ in Figure 2.
average the last (and therefore most accurate) m out of M models
At the start of each cycle $t = 1$ and $\alpha = \alpha_0$; when $t = \lceil T/M \rceil$, $\alpha$ is close to 0. The learning rate therefore oscillates with a period of $\lceil T/M \rceil$ iterations.
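To make the schedule concrete, here is a minimal Python sketch of this cyclic cosine annealing rule; the function name and the example values are illustrative assumptions, not code from the paper's repository.

```python
import math

def snapshot_lr(alpha0: float, t: int, T: int, M: int) -> float:
    """Cyclic cosine annealing: t is the current iteration (1-indexed),
    T the total number of iterations, M the number of cycles."""
    cycle_len = math.ceil(T / M)
    t_cur = (t - 1) % cycle_len          # position within the current cycle
    return alpha0 / 2.0 * (math.cos(math.pi * t_cur / cycle_len) + 1.0)

# alpha decays from alpha0 to ~0 within each cycle and restarts at the next one
alpha0, T, M = 0.1, 60000, 6             # assumed values, for illustration only
for t in (1, 5000, 10000, 10001):
    print(t, round(snapshot_lr(alpha0, t, T, M), 6))
```

A snapshot is taken at the end of each cycle, right before the learning rate is reset, when the model sits near a local minimum.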
Ensembling: take "the average of the last m (m ≤ M) models' softmax outputs".
$$h_{\text{Ensemble}}(x) = \frac{1}{m}\sum_{i=0}^{m-1} h_{M-i}(x)$$
$h_i$ denotes the softmax score of snapshot $i$.
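A minimal sketch of this test-time averaging (PyTorch-flavoured; the `snapshots` list of already-loaded models is an assumption, not part of the released code):

```python
import torch

@torch.no_grad()
def snapshot_ensemble_predict(snapshots, x, m):
    """Average the softmax outputs of the last m snapshots.
    `snapshots` is ordered from the 1st to the M-th (last) cycle;
    `x` is a batch of inputs."""
    probs = [torch.softmax(net(x), dim=1) for net in snapshots[-m:]]
    return torch.stack(probs, dim=0).mean(dim=0)   # h_Ensemble(x)
```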
5 Experiments
5.1 Datasets and Metrics
- CIFAR-10, top-1 error
- CIFAR-100, top-1 error
- SVHN (Street View House Numbers), top-1 error
- Tiny ImageNet (200 classes, 500 training and 50 validation images per class, 64×64), top-1 error
- ImageNet, top-1 error
5.2 Snapshot Ensemble Results
(1) Accuracy

- Single model: trained with the standard step learning rate schedule.
- Dropout: Single model + Dropout.
- NoCycle Snapshot Ensemble: same learning rate schedule as the Single model; several snapshots taken during one training run are simply ensembled.
- SingleCycle Ensembles: the network is re-initialized at the beginning of every cosine learning rate cycle, rather than using the parameters from the previous optimization cycle, so each cycle trains a model from scratch and the cycles are independent. With e.g. M = 6, SingleCycle initializes the network 6 times, whereas Snapshot Ensembles initialize it only once and let each cycle continue from the previous cycle's weights (see the sketch after this list).
- Snapshot Ensembles: the method proposed in the paper.

The SingleCycle Ensemble is weaker than the Snapshot Ensemble because "it is difficult to train a large model from scratch in only a few epochs."
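To make that distinction concrete, here is a schematic training loop in Python; `make_model`, `train_one_cycle`, and `save_snapshot` are hypothetical helpers used only for illustration, not functions from the released code:

```python
def run_cycles(M, epochs_per_cycle, single_cycle=False):
    """Snapshot Ensembles keep the weights across cycles; SingleCycle
    Ensembles re-initialize the network at the start of every cycle."""
    model = make_model()
    snapshots = []
    for cycle in range(M):
        if single_cycle:
            model = make_model()                   # fresh random weights each cycle
        train_one_cycle(model, epochs_per_cycle)   # cosine annealing within the cycle
        snapshots.append(save_snapshot(model))     # snapshot at the learning rate minimum
    return snapshots
```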
On ImageNet, M = 2 performs better than M = 3.
(2) Ensemble Size

With an ensemble size of roughly 3 or more snapshots, the ensemble outperforms the single-model baseline.
(3) Restart Learning Rate

Figure 3 shows that "ensembles with the larger restart learning rate perform better", presumably because a larger restart learning rate increases the diversity of the local minima that are reached.
(4) Varying Number of Cycles

Snapshot Ensembles are relatively robust with respect to different values of M; setting M to 4~8 works reasonably well.
(5) Varying Training Budget

With M fixed to 6, the authors compare SingleCycle Ensembles and Snapshot Ensembles under different training budgets (60~300 epochs).
Snapshot Ensembles are less sensitive to the training budget than SingleCycle Ensembles: as the training budget decreases, Snapshot Ensembles still yield competitive results, and when the budget is limited (e.g., fewer than 150 epochs) their advantage is clear.
Interestingly, on CIFAR-100 the SingleCycle Ensemble overtakes the Snapshot Ensemble once the budget exceeds about 250 epochs, suggesting that 250/6 epochs per cycle is already enough for a model trained from scratch to converge, and training longer may lead to overfitting.
(6) Comparison with True Ensembles

"The true ensemble method averages models that are trained with 300 full epochs": n models, each trained for the full 300 epochs, are then ensembled.
The results are impressive; brute force, plain and simple, works wonders.
5.3 Diversity of Model Ensembles
(1) Parameter Space

Compute the loss for a convex combination of model parameters:

$$J(\lambda\,\theta_1 + (1-\lambda)\,\theta_2)$$

- $\lambda = 0$: only model $\theta_2$ (one of the snapshots M1~M5)
- $\lambda = 1$: only model $\theta_1$ (the final snapshot, M6)
Two models that converge to a similar minimum will have smooth parameter interpolations, whereas models that converge to different minima will likely have a non-convex interpolation, with a spike in error when λ is between 0 and 1.
The steeper and more jagged the interpolation curve, the more sensitive the pair of models is to the mixing coefficient $\lambda$, indicating that the two models converged to rather different minima even if they achieve similar validation error rates.
From the standard lr schedule panels of Figure 5:
Mixing the 4-th and 5-th models with the 6-th model is not sensitive to the mixing ratio $\lambda$, which suggests that the last few models "lie in the same minimum as the final model, and therefore likely add limited diversity to the ensemble".
Note that the y-axis scales of the first two subplots differ from the last two; cosine annealing still converges to better solutions than the standard learning rate schedule.
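A sketch of how such an interpolation curve could be computed (PyTorch-style; `model`, the two state dicts, the evaluation loss, and the λ grid are assumed placeholders, not the authors' code):

```python
import torch

def interpolate_state_dicts(sd1, sd2, lam):
    """Convex combination lam * theta_1 + (1 - lam) * theta_2 of two snapshots."""
    return {k: lam * sd1[k] + (1.0 - lam) * sd2[k] for k in sd1}

@torch.no_grad()
def interpolation_curve(model, sd1, sd2, eval_loss, lambdas):
    """Evaluate J(lam * theta_1 + (1 - lam) * theta_2) over a grid of lambda values."""
    losses = []
    for lam in lambdas:
        model.load_state_dict(interpolate_state_dicts(sd1, sd2, lam))
        losses.append(eval_loss(model))   # e.g. mean cross-entropy on the training set
    return losses
```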
(2) Activation space

Although "different local minima often have very similar error rates, the corresponding neural networks tend to make different mistakes".
With the standard learning rate schedule, the last few snapshots are highly correlated (they are trained with similar, small learning rates), which is unfavorable for ensembling; the authors' cyclic schedule weakens this correlation.
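A rough numpy sketch of one way to quantify the correlation between two snapshots' predictions (the softmax matrices over a common test set are assumed inputs; this is a crude diversity proxy, not the paper's exact measurement):

```python
import numpy as np

def softmax_correlation(probs_a: np.ndarray, probs_b: np.ndarray) -> float:
    """Pearson correlation between the flattened softmax outputs of two
    snapshots on the same test set; lower correlation hints that the two
    models make different mistakes and add more diversity to the ensemble."""
    return float(np.corrcoef(probs_a.ravel(), probs_b.ravel())[0, 1])
```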
6 Conclusion(own) / Future work
- Future work will explore combining Snapshot Ensembles with traditional ensembles.
- SGD can escape spurious saddle points and local minima, yet these local minima still contain useful information that an ensemble can exploit.
- The paper deepens the understanding of ensembling and of convergence: different models may reach similar accuracy (different local minima) but make different mistakes, which is exactly what benefits an ensemble.
- Related reading: 使用余弦退火逃离局部最优点——快照集成(Snapshot Ensembles)在Keras上的应用 (Escaping local optima with cosine annealing: Snapshot Ensembles in Keras)
- Related reading: 模型训练Tricks——Snapshot Ensembling (Model training tricks: Snapshot Ensembling)