This post is a reading note on 【Snapshot Ensembles】《Snapshot Ensembles: Train 1, Get M for Free》.
https://github.com/gaohuang/SnapshotEnsemble
ICLR-2017
Contents
- 1 Background and Motivation
- 2 Related Work
- 3 Advantages / Contributions
- 4 Method
- 5 Experiments
- 5.1 Datasets and Metrics
- 5.2 Snapshot Ensemble Results
- 5.3 Diversity of Model Ensembles
- 6 Conclusion(own) / Future work
1 Background and Motivation
In deep learning, the traditional ensembling approach averages the predictions of multiple independently trained models, which is computationally expensive.
The authors take a different route: borrowing the idea of repeatedly restarting the learning rate from 【SGDR】《SGDR: Stochastic Gradient Descent with Warm Restarts》, they take a snapshot of the model at each restart, ensembling multiple neural networks at no additional training cost.
2 Related Work
- “implicit” ensembles
  Dropout / DropConnect / Stochastic Depth / Swapout
  The proposed method is orthogonal to these techniques and can be combined with them.
- test-time cost of ensembles
  This work instead reduces the training cost. Some prior methods are similar in spirit; the distinguishing point here is that "we take snapshots only when the model reaches a minimum".
3 Advantages / Contributions
Building on and fully exploiting the warm-restart technique of SGDR, the paper proposes a novel ensembling method, the snapshot ensemble, which greatly reduces the training cost of ensembling and improves results with standard architectures on multiple public datasets.
4 Method
L and k are DenseNet hyper-parameters (network depth and growth rate).
B is the total training budget in epochs.
The learning rate schedule of the method is formulated as

$$\alpha(t) = \frac{\alpha_0}{2}\left(\cos\left(\frac{\pi \operatorname{mod}(t-1,\ \lceil T/M \rceil)}{\lceil T/M \rceil}\right) + 1\right)$$

Here the smallest unit of $t$ is a single iteration (mini-batch update) rather than an epoch, which differs from the original SGDR paper.
$T$ is the total number of training iterations.
$M$ is the number of cycles the training process is split into ("split the training process into M cycles"), e.g., $M = 6$ in Figure 2.
average the last (and therefore most accurate) m out of M models
At the start of each cycle $t = 1$ and $\alpha = \alpha_0$; when $t = \lceil T/M \rceil$, $\alpha$ is close to 0. The learning rate therefore oscillates with a period of $\lceil T/M \rceil$ iterations.
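To make the schedule concrete, here is a minimal Python sketch of this cyclic cosine annealing rule; the function name and the example values are illustrative assumptions, not code from the paper's repository.

```python
import math

def snapshot_lr(alpha0: float, t: int, T: int, M: int) -> float:
    """Cyclic cosine annealing: t is the current iteration (1-indexed),
    T the total number of iterations, M the number of cycles."""
    cycle_len = math.ceil(T / M)
    t_cur = (t - 1) % cycle_len          # position within the current cycle
    return alpha0 / 2.0 * (math.cos(math.pi * t_cur / cycle_len) + 1.0)

# alpha decays from alpha0 to ~0 within each cycle and restarts at the next one
alpha0, T, M = 0.1, 60000, 6             # assumed values, for illustration only
for t in (1, 5000, 10000, 10001):
    print(t, round(snapshot_lr(alpha0, t, T, M), 6))
```

A snapshot is taken at the end of each cycle, right before the learning rate is reset, when the model sits near a local minimum.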
Ensembling: take "the average of the last m (m ≤ M) models' softmax outputs".
$$h_{\text{Ensemble}}(x) = \frac{1}{m}\sum_{i=0}^{m-1} h_{M-i}(x)$$
$h_i$ denotes the softmax score of snapshot $i$.
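A minimal sketch of this test-time averaging (PyTorch-flavoured; the `snapshots` list of already-loaded models is an assumption, not part of the released code):

```python
import torch

@torch.no_grad()
def snapshot_ensemble_predict(snapshots, x, m):
    """Average the softmax outputs of the last m snapshots.
    `snapshots` is ordered from the 1st to the M-th (last) cycle;
    `x` is a batch of inputs."""
    probs = [torch.softmax(net(x), dim=1) for net in snapshots[-m:]]
    return torch.stack(probs, dim=0).mean(dim=0)   # h_Ensemble(x)
```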
5 Experiments
5.1 Datasets and Metrics
- CIFAR-10, top-1 error
- CIFAR-100, top-1 error
- SVHN (Street View House Numbers), top-1 error
- Tiny ImageNet (200 classes, 500 training and 50 validation images per class, 64×64), top-1 error
- ImageNet, top-1 error
5.2 Snapshot Ensemble Results
(1) Accuracy

- Single model: trained with the standard step learning rate schedule.
- Dropout: Single model + Dropout.
- NoCycle Snapshot Ensemble: same learning rate schedule as the Single model; several snapshots taken during one training run are simply ensembled.
- SingleCycle Ensembles: the network is re-initialized at the beginning of every cosine learning rate cycle, rather than using the parameters from the previous optimization cycle, so each cycle trains a model from scratch and the cycles are independent. With e.g. M = 6, SingleCycle initializes the network 6 times, whereas Snapshot Ensembles initialize it only once and let each cycle continue from the previous cycle's weights (see the sketch after this list).
- Snapshot Ensembles: the method proposed in the paper.

The SingleCycle Ensemble is weaker than the Snapshot Ensemble because "it is difficult to train a large model from scratch in only a few epochs."
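To make that distinction concrete, here is a schematic training loop in Python; `make_model`, `train_one_cycle`, and `save_snapshot` are hypothetical helpers used only for illustration, not functions from the released code:

```python
def run_cycles(M, epochs_per_cycle, single_cycle=False):
    """Snapshot Ensembles keep the weights across cycles; SingleCycle
    Ensembles re-initialize the network at the start of every cycle."""
    model = make_model()
    snapshots = []
    for cycle in range(M):
        if single_cycle:
            model = make_model()                   # fresh random weights each cycle
        train_one_cycle(model, epochs_per_cycle)   # cosine annealing within the cycle
        snapshots.append(save_snapshot(model))     # snapshot at the learning rate minimum
    return snapshots
```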
On ImageNet, M = 2 performs better than M = 3.
(2) Ensemble Size

With an ensemble size of roughly 3 or more snapshots, the ensemble outperforms the single-model baseline.
(3) Restart Learning Rate

Figure 3 shows that "ensembles with the larger restart learning rate perform better", presumably because a larger restart learning rate increases the diversity of the local minima that are reached.
(4) Varying Number of Cycles

Snapshot Ensembles are relatively robust with respect to different values of M; setting M to 4~8 works reasonably well.
(5) Varying Training Budget

With M fixed to 6, the authors compare SingleCycle Ensembles and Snapshot Ensembles under different training budgets (60~300 epochs).
Snapshot Ensembles are less sensitive to the training budget than SingleCycle Ensembles: as the training budget decreases, Snapshot Ensembles still yield competitive results, and when the budget is limited (e.g., fewer than 150 epochs) their advantage is clear.
Interestingly, on CIFAR-100 the SingleCycle Ensemble overtakes the Snapshot Ensemble once the budget exceeds about 250 epochs, suggesting that 250/6 epochs per cycle is already enough for a model trained from scratch to converge, and training longer may lead to overfitting.
(6) Comparison with True Ensembles

"The true ensemble method averages models that are trained with 300 full epochs": n models, each trained for the full 300 epochs, are then ensembled.
The results are impressive; brute force, plain and simple, works wonders.
5.3 Diversity of Model Ensembles
(1) Parameter Space

Compute the loss for a convex combination of model parameters:

$$J(\lambda\,\theta_1 + (1-\lambda)\,\theta_2)$$

- $\lambda = 0$: only model $\theta_2$ (one of the snapshots M1~M5)
- $\lambda = 1$: only model $\theta_1$ (the final snapshot, M6)
Two models that converge to a similar minimum will have smooth parameter interpolations, whereas models that converge to different minima will likely have a non-convex interpolation, with a spike in error when λ is between 0 and 1.
The steeper and more jagged the interpolation curve, the more sensitive the pair of models is to the mixing coefficient $\lambda$, indicating that the two models converged to rather different minima even if they achieve similar validation error rates.
From the standard lr schedule panels of Figure 5:
Mixing the 4-th and 5-th models with the 6-th model is not sensitive to the mixing ratio $\lambda$, which suggests that the last few models "lie in the same minimum as the final model, and therefore likely add limited diversity to the ensemble".
Note that the y-axis scales of the first two subplots differ from the last two; cosine annealing still converges to better solutions than the standard learning rate schedule.
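A sketch of how such an interpolation curve could be computed (PyTorch-style; `model`, the two state dicts, the evaluation loss, and the λ grid are assumed placeholders, not the authors' code):

```python
import torch

def interpolate_state_dicts(sd1, sd2, lam):
    """Convex combination lam * theta_1 + (1 - lam) * theta_2 of two snapshots."""
    return {k: lam * sd1[k] + (1.0 - lam) * sd2[k] for k in sd1}

@torch.no_grad()
def interpolation_curve(model, sd1, sd2, eval_loss, lambdas):
    """Evaluate J(lam * theta_1 + (1 - lam) * theta_2) over a grid of lambda values."""
    losses = []
    for lam in lambdas:
        model.load_state_dict(interpolate_state_dicts(sd1, sd2, lam))
        losses.append(eval_loss(model))   # e.g. mean cross-entropy on the training set
    return losses
```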
(2) Activation space

Although "different local minima often have very similar error rates, the corresponding neural networks tend to make different mistakes".
With the standard learning rate schedule, the last few snapshots are highly correlated (they are trained with similar, small learning rates), which is unfavorable for ensembling; the authors' cyclic schedule weakens this correlation.
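A rough numpy sketch of one way to quantify the correlation between two snapshots' predictions (the softmax matrices over a common test set are assumed inputs; this is a crude diversity proxy, not the paper's exact measurement):

```python
import numpy as np

def softmax_correlation(probs_a: np.ndarray, probs_b: np.ndarray) -> float:
    """Pearson correlation between the flattened softmax outputs of two
    snapshots on the same test set; lower correlation hints that the two
    models make different mistakes and add more diversity to the ensemble."""
    return float(np.corrcoef(probs_a.ravel(), probs_b.ravel())[0, 1])
```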
6 Conclusion(own) / Future work
- Future work will explore combining Snapshot Ensembles with traditional ensembles.
- SGD can escape spurious saddle points and local minima, yet these local minima still contain useful information that an ensemble can exploit.
- The paper deepens the understanding of ensembling and of convergence: different models may reach similar accuracy (different local minima) but make different mistakes, which is exactly what benefits an ensemble.
- Related reading: 使用余弦退火逃离局部最优点——快照集成(Snapshot Ensembles)在Keras上的应用 (Escaping local optima with cosine annealing: Snapshot Ensembles in Keras)
- Related reading: 模型训练Tricks——Snapshot Ensembling (Model training tricks: Snapshot Ensembling)