Paper: Evolving Deep Convolutional Neural Networks for Image Classification

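Source paper: "Evolving Deep Convolutional Neural Networks for Image Classification"
For CNN performance, both the depth and the weight initialization matter.
I. Convolutional Neural Networks
I watched a video on Bilibili explaining CNNs. A convolution kernel (filter) is really just the set of values in the connection matrix. Convolution and pooling both serve to learn image features, steering the network toward the more meaningful local features. Convolution shares weights (shared weight), which reduces the number of parameters.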
Types of convolution:

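Padded convolution (the ordinary, standard convolution): input [5×5] --padding--> [7×7] --filter [3×3]--> [5×5]
Unpadded convolution
Dilated convolution: the kernel samples the input at spaced-out positions, skipping the pixels in between, which gives a wider field of view at the same computational cost.
Transposed convolution: reverses the shape change of a convolution (upsampling). It restores the spatial size, not the original values.
Depthwise separable convolution: implements the spatial convolution and the channel convolution as two separate steps, as shown in the figure.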

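As the figure above shows, for a 3×3 convolution with 2 input channels and 3 output channels, a standard convolution needs 2 × 3 × 3 × 3 = 54 parameters. (Why multiply by 2? Because the input has 2 channels: each of the 3 output filters holds one 3×3 kernel per input channel.) A depthwise separable convolution first applies one 3×3 convolution to each input channel, producing one feature map per channel, 2 in total. It then uses three 1×1 convolutions to form differently weighted linear combinations of those 2 channels. The total is 2 × 3 × 3 × 1 + 2 × 1 × 1 × 3 = 24 parameters, less than half of the standard convolution.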
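The two counts above can be checked with a few lines of Python. This is only a sketch: the function names are my own, and bias terms are ignored, as in the arithmetic above.

```python
def standard_conv_params(in_channels, out_channels, k):
    """Weights in a standard k x k convolution (bias terms ignored)."""
    return in_channels * out_channels * k * k


def depthwise_separable_params(in_channels, out_channels, k):
    """Weights in a depthwise k x k step plus a pointwise 1 x 1 step."""
    depthwise = in_channels * k * k          # one k x k kernel per input channel
    pointwise = in_channels * out_channels   # 1 x 1 kernels that mix the channels
    return depthwise + pointwise


standard = standard_conv_params(2, 3, 3)
separable = depthwise_separable_params(2, 3, 3)
print(standard, separable)  # 54 24
```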

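One point in the paper that I did not understand at first: If the input data is with multiple channels, say three, one feature map will also require three different filters, and each filter convolves on each channel, then the results are summed element-wised.
The number of filters equals the number of feature maps obtained after convolving with those filters, so convolving the input with 3 filters should give 3 feature maps? The two statements agree once the terminology is lined up: the "three different filters" in the paper are the three 2-D kernels (one per input channel) that together form a single 3-D filter, and their summed outputs form one feature map. Producing N feature maps takes N such 3-D filters.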
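To convince myself, here is a minimal pure-Python sketch of that sentence from the paper. The helper names `correlate2d_valid` and `one_feature_map` are made up for illustration, and the all-ones input is just a toy case.

```python
def correlate2d_valid(channel, kernel):
    """Unpadded sliding-window product-sum of one channel with one 2-D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(channel) - kh + 1
    out_w = len(channel[0]) - kw + 1
    return [[sum(channel[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def one_feature_map(image, filt):
    """image: C channels of H x W; filt: C kernels of kh x kw (one 3-D filter)."""
    per_channel = [correlate2d_valid(ch, k) for ch, k in zip(image, filt)]
    rows, cols = len(per_channel[0]), len(per_channel[0][0])
    # The element-wise sum over the channel axis collapses C results into one map.
    return [[sum(m[i][j] for m in per_channel) for j in range(cols)]
            for i in range(rows)]


image = [[[1.0] * 5 for _ in range(5)] for _ in range(3)]  # 3 channels, 5x5, all ones
filt = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]   # 3 kernels, 3x3, all ones
fmap = one_feature_map(image, filt)
print(len(fmap), len(fmap[0]), fmap[0][0])  # 3 3 27.0
```

Three kernels go in, but only one 3×3 feature map comes out, because the per-channel results are added together.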
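II. Weight Initialization
In deep learning, weight initialization has a strong effect on both convergence speed and model quality.
The initial parameters should be chosen so that the objective function (e.g. the loss function) is easy to optimize, i.e. so that training can reach a good optimum.
Initialization methods fall into three categories: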

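Initializing the connection weights with a constant: for example the zero initializer, the one initializer and other fixed-value initializers
Distribution initializers: the values in the weight matrix follow, for example, a uniform distribution or a Gaussian distribution (the Gaussian distribution is simply the normal distribution)
Initialization with some prior knowledge, to which the well-known Xavier initializer belongs.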
Xavier Initializer
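Compared with plain random initialization, the Xavier initializer mitigates exploding and vanishing gradients. It is based on "variance consistency": the variance of the signals should stay roughly the same from layer to layer.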
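As a rough illustration, the uniform variant of Xavier initialization draws weights from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)), which gives a variance of 2 / (fan_in + fan_out). The sketch below uses only the standard library; the function name and the layer sizes are my own choices, not something taken from the paper.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


rng = random.Random(0)
weights = xavier_uniform(fan_in=256, fan_out=128, rng=rng)

limit = math.sqrt(6.0 / (256 + 128))
flat = [w for row in weights for w in row]
variance = sum(w * w for w in flat) / len(flat)  # the mean is ~0, so this is ~the variance
print(round(limit, 3))  # 0.125
```

The empirical `variance` should land close to 2 / (256 + 128).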

The paper says: The Xavier initializer is presented on the usage of the sigmoid activation, while the widely used activation function in CNNs is the ReLU.
In the proposed algorithm, we use GA to evolve the proper mean and standard deviation for the Gaussian distribution.

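III. Genetic Algorithms (GAs)
Individual encoding, population initialization, fitness evaluation of individuals, binary tournament selection of the mating parents into the mating pool, crossover, mutation, offspring generation, merging, environmental selection.
The paper uses the GA for two purposes: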

Searching for the CNN architecture
Initializing the connection weights
IV. Overview

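Framework of EvoCNN:
P0 <-- initialize the population with the variable-length gene encoding strategy;
t = 0;
while the evolution has not finished (e.g. the maximum generation has not been reached) do
    evaluate the fitness of every individual in Pt;
    S <-- select parent solutions with the proposed modified binary tournament selection;
    Qt <-- generate offspring from S with the proposed genetic operators;
    Pt+1 <-- environmental selection from Pt + Qt with the strategy described in the paper;
    t = t + 1;
end
Select the "best" individual from the last generation and decode it into the corresponding CNN;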
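The loop above can be sketched in Python as a function that takes each step as a plug-in. Everything below the `evolve` function is a toy stand-in that I made up so the loop can run: individuals are one-element lists, the "error" is the distance to 3, and the operators are deliberately simple. None of it is the paper's actual encoding or operators.

```python
import random


def evolve(init_population, evaluate, select_parents, make_offspring,
           environmental_selection, max_generations):
    population = init_population()                          # P0
    for _ in range(max_generations):                        # main loop over t
        fitness = [evaluate(ind) for ind in population]     # evaluate Pt
        parents = select_parents(population, fitness)       # S
        offspring = make_offspring(parents)                 # Qt
        offspring_fitness = [evaluate(ind) for ind in offspring]
        population = environmental_selection(               # Pt+1 from Pt + Qt
            population + offspring, fitness + offspring_fitness)
    fitness = [evaluate(ind) for ind in population]
    best_index = min(range(len(population)), key=fitness.__getitem__)
    return population[best_index]                           # lowest error wins


# Toy plug-ins (hypothetical, for illustration only).
rng = random.Random(1)
initial = [[rng.uniform(-10, 10)] for _ in range(8)]


def evaluate(ind):
    return (ind[0] - 3.0) ** 2


def select_parents(pop, fit):
    def tournament():
        a, b = rng.sample(range(len(pop)), 2)
        return pop[a] if fit[a] <= fit[b] else pop[b]
    return [tournament() for _ in pop]


def make_offspring(parents):
    return [[(p[0] + q[0]) / 2 + rng.gauss(0, 0.5)]
            for p, q in zip(parents[::2], parents[1::2])]


def environmental_selection(candidates, fit, size=8):
    ranked = sorted(zip(fit, candidates), key=lambda pair: pair[0])
    return [ind for _, ind in ranked[:size]]


best = evolve(lambda: list(initial), evaluate, select_parents,
              make_offspring, environmental_selection, max_generations=20)
```

Because this toy environmental selection always keeps the best candidates, the best error can never get worse from one generation to the next.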

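1. Gene Encoding Strategy
There are three types of units: convolutional, pooling and fully connected. Each unit stores the parameters that match its function, including the values of its connection weight matrix (represented by a mean and a standard deviation).
The connection weights are sampled from the corresponding Gaussian distribution.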
2. Population Initialization
The gene encoding of an individual is split into two parts: part 1 holds the convolutional and pooling units, part 2 holds the fully connected units.
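A possible way to picture the encoding and the two-part initialization is sketched below. The field names, value ranges and length limits are all my assumptions; the notes above only say that each unit keeps the parameters matching its function plus a mean and a standard deviation for its weights. Starting part 1 with a convolutional unit is also an assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class ConvUnit:
    n_feature_maps: int
    kernel_size: int
    weight_mean: float   # mean of the Gaussian the filter weights are drawn from
    weight_std: float    # standard deviation of that Gaussian


@dataclass
class PoolUnit:
    kernel_size: int
    kind: str            # "max" or "mean"


@dataclass
class FullUnit:
    n_neurons: int
    weight_mean: float
    weight_std: float


def random_conv(rng):
    return ConvUnit(rng.choice([16, 32, 64]), rng.choice([3, 5]),
                    rng.uniform(-1.0, 1.0), rng.uniform(0.01, 1.0))


def random_individual(rng, max_part1=6, max_part2=3):
    """Variable-length chromosome: part 1 (conv/pool) and part 2 (full)."""
    part1 = [random_conv(rng)]  # assumption: start with a convolutional unit
    for _ in range(rng.randint(0, max_part1 - 1)):
        if rng.random() < 0.5:
            part1.append(random_conv(rng))
        else:
            part1.append(PoolUnit(2, rng.choice(["max", "mean"])))
    part2 = [FullUnit(rng.choice([64, 128, 256]),
                      rng.uniform(-1.0, 1.0), rng.uniform(0.01, 1.0))
             for _ in range(rng.randint(1, max_part2))]
    return part1, part2


rng = random.Random(0)
part1, part2 = random_individual(rng)
```

Different individuals end up with different lengths, which is what "variable-length" refers to.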
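3. Fitness Evaluation
Decode: build the CNN from the encoded information, and initialize each weight matrix from the mean and standard deviation stored in the gene.
Each individual is then trained in turn, and the weights are updated during training. After training, the individual is evaluated on the validation dataset. Because only a limited amount of data can be processed at once, the validation dataset is split into several batches that are processed one after another. The number of parameters, the mean classification error and the standard deviation of the classification error are considered together as the fitness. Training runs N times, but validation is performed only once.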
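The three fitness components can be computed from the per-batch validation errors as in the sketch below. The function name and the example numbers are made up, and using the population standard deviation (dividing by the number of batches) is my assumption.

```python
import math


def fitness(batch_errors, n_parameters):
    """Return (mean error, std of error, parameter count) for one individual."""
    mean = sum(batch_errors) / len(batch_errors)
    var = sum((e - mean) ** 2 for e in batch_errors) / len(batch_errors)
    return mean, math.sqrt(var), n_parameters


# Hypothetical classification errors measured on three validation batches.
batch_errors = [0.10, 0.20, 0.30]
mean_err, std_err, n_params = fitness(batch_errors, n_parameters=54)
print(mean_err, std_err, n_params)
```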
The number of connection weights is also chosen as an individual's quality based on the principle of Occam's razor.
Occam's razor is a way of thinking:

Entities should not be multiplied unnecessarily.
Put simply, it is the principle of "simple and effective": when two models perform comparably, prefer the simpler one.

With the conventions, each represented CNN is trained on the training dataset, and the fitness is estimated on the validation dataset; the test dataset is used after the GA ends, once the best one is chosen.
4. Slack Binary Tournament Selection
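Fitness has three components: the number of parameters in the connection weight matrices, the mean classification error, and its standard deviation. Selection has to balance all three.
In the previous genetic algorithm paper I read, fitness was determined by classification accuracy.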
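One way to balance the three components in a binary tournament is sketched below. The order of the comparisons and the thresholds `alpha` and `beta` are my assumptions for illustration; check the algorithm in the paper for the exact rule.

```python
def slack_tournament(a, b, alpha=0.01, beta=10000):
    """a, b: (mean_error, std_error, n_params). Returns the preferred one."""
    if abs(a[0] - b[0]) > alpha:          # clear gap in mean error
        return a if a[0] < b[0] else b
    if abs(a[2] - b[2]) > beta:           # similar error: prefer fewer weights
        return a if a[2] < b[2] else b
    return a if a[1] <= b[1] else b       # otherwise prefer the steadier error


big_accurate = (0.10, 0.02, 500000)
small_poor = (0.20, 0.01, 100000)
small_close = (0.105, 0.01, 100000)

print(slack_tournament(big_accurate, small_poor) is big_accurate)   # True
print(slack_tournament(big_accurate, small_close) is small_close)   # True
```

The "slack" is the tolerance: a slightly worse error is accepted when it buys a much smaller network.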
5. Offspring Generation

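Randomly select two individuals from the mating pool as parents
Apply the crossover operator to the two selected parents to generate offspring
Apply the mutation operator to the offspring
Store the two newly generated offspring, remove the two parents from the mating pool, and repeat the previous three steps until the mating pool is empty
Crossover
Unit Collection phase (group the units by type, so that units of the same type sit together)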
Unit Alignment and crossover phase
Unit Restore phase
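Because the encoded information is represented by real numbers, simulated binary crossover (SBX) and polynomial mutation (PM) are used here.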

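Simulated binary crossover (SBX) operator
Suppose two parents P1, P2 produce two children C1, C2.
Single-point crossover on binary encodings has two properties, the Average Property and the Spread Factor Property. SBX reproduces both for real numbers by sampling from a probability density function.
Average Property: the mean of the decoded children equals the mean of the decoded parents.
Spread Factor Property: the spread factor β is the ratio of the difference between the children to the difference between the parents, β = |C1 - C2| / |P1 - P2|, and β is roughly equal to 1.
This post explains it in detail: "The crossover operation SBX under real-valued encoding"
"A summary of several crossover operators in genetic algorithms"
n is the distribution index.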
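The commonly cited form of SBX for a single real-valued gene can be written as below. The function name and the example values are mine. The third return value is only there so the two properties can be checked.

```python
import random


def sbx(p1, p2, n, rng):
    """Simulated binary crossover on one real gene; n is the distribution index."""
    u = rng.random()
    if u <= 0.5:
        beta = (2.0 * u) ** (1.0 / (n + 1))
    else:
        beta = (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (n + 1))
    c1 = 0.5 * ((1 + beta) * p1 + (1 - beta) * p2)
    c2 = 0.5 * ((1 - beta) * p1 + (1 + beta) * p2)
    return c1, c2, beta


rng = random.Random(0)
p1, p2 = 2.0, 6.0
c1, c2, beta = sbx(p1, p2, n=20, rng=rng)
print(c1 + c2, p1 + p2)  # the two sums agree (Average Property)
```

A larger n concentrates β around 1, so the children stay close to the parents. A smaller n lets them spread further.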
Polynomial mutation (PM) operator: [formula image from the original post not preserved]
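For reference, the basic textbook form of polynomial mutation looks like the sketch below. It may not match the formula shown in the original image exactly, and the bounds and the distribution index `eta` used here are arbitrary example values.

```python
import random


def polynomial_mutation(x, low, high, eta, rng):
    """Perturb x inside [low, high]; a larger eta gives smaller perturbations."""
    u = rng.random()
    if u < 0.5:
        delta = (2.0 * u) ** (1.0 / (eta + 1)) - 1.0
    else:
        delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta + 1))
    mutated = x + delta * (high - low)
    return min(max(mutated, low), high)  # clip back into the allowed range


rng = random.Random(0)
samples = [polynomial_mutation(0.5, 0.0, 1.0, eta=20, rng=rng) for _ in range(1000)]
print(min(samples), max(samples))
```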
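6. Environmental Selection
First, a fraction of individuals with promising mean values is selected. The rest are then chosen from the remaining individuals with the modified binary tournament selection described earlier. Following the Pareto principle, the elite individuals make up 20% of the new population. Likewise, 20% of the data are selected randomly from the training images as the validation dataset.
Pareto principle
Also known as the 80/20 rule.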
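A minimal sketch of this two-step selection is given below. It uses a plain binary tournament on the mean error as a stand-in for the modified tournament, and it removes each winner from the pool so nobody is picked twice. Both choices, as well as the names, are my assumptions.

```python
import random


def environmental_selection(candidates, size, rng, elite_fraction=0.2):
    """candidates: list of (mean_error, name). Returns `size` survivors."""
    ranked = sorted(candidates, key=lambda c: c[0])
    n_elite = int(size * elite_fraction)
    survivors = ranked[:n_elite]          # elites: the best 20% of the new population
    rest = ranked[n_elite:]
    while len(survivors) < size:          # fill the remainder by tournaments
        a, b = rng.sample(rest, 2)
        winner = a if a[0] <= b[0] else b
        survivors.append(winner)
        rest.remove(winner)
    return survivors


rng = random.Random(0)
candidates = [(rng.random(), "ind%d" % i) for i in range(20)]  # parents + offspring
survivors = environmental_selection(candidates, size=10, rng=rng)
print(len(survivors))  # 10
```

Keeping a few elites protects the best solutions, while the tournaments keep some diversity in the rest of the population.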
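During the final training phase, each individual is subjected to BatchNorm to speed up training and to weight decay with a unified value to prevent overfitting.
Batch normalization: speeds up the training of deep networks by reducing internal covariate shift. During training, as the network gets deeper, the distribution of each layer's inputs gradually drifts toward the saturated region of the activation function. The gradients in the lower layers then vanish during backpropagation, which is why convergence gets slower and slower.
Normalization forcibly "pulls" the distribution back to mean 0 and variance 1, so that the inputs of the activation function fall in the region where the nonlinearity is sensitive to its input. A small change in the input then leads to a larger change in the loss function, which avoids vanishing gradients and speeds up convergence.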
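The "pull back" step itself is just a standardization over the batch, as in this small sketch. The learnable scale and shift that a real BatchNorm layer applies afterwards are left out, and the activation values are made up.

```python
import math


def batch_normalize(values, eps=1e-5):
    """Standardize one feature over a batch to roughly zero mean, unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


activations = [4.0, 6.0, 8.0, 10.0]
normalized = batch_normalize(activations)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
print(norm_mean, norm_var)
```

The small `eps` only guards against division by zero, which is why the variance comes out very slightly below 1.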
V. Datasets
Nine datasets are used, namely