DeepLearning 0.1 documentation (translated): Denoising Autoencoders (dA)
Original article: http://deeplearning.net/tutorial/dA.html
Contents:
- Autoencoders
- Denoising Autoencoders
- Putting it All Together
- Running the Code
Note: This section assumes the reader has already read through Classifying MNIST digits using Logistic Regression and Multilayer Perceptron. Additionally, it uses the following new Theano functions and concepts: T.tanh, shared variables, basic arithmetic ops, T.grad, Random numbers, floatX. If you intend to run the code on a GPU, also read GPU.

Note: The code for this section is available for download here: code.
The denoising autoencoder is an extension of the classical autoencoder, and it was introduced in [Vincent08] as a building block for deep networks. We will start with a short discussion of autoencoders.
Autoencoders
See section 4.6 of [Bengio09] for an overview of autoencoders. An autoencoder takes an input x ∈ [0,1]^d and first maps it (with an encoder) to a hidden representation y ∈ [0,1]^{d'} through a deterministic mapping, e.g.:

y = s(Wx + b)

where s is a non-linearity such as the sigmoid. The latent representation y, or code, is then mapped back (with a decoder) into a reconstruction z of the same shape as x, through a similar transformation, e.g.:

z = s(W'y + b')

(Here, the prime symbol does not indicate matrix transposition.) z should be seen as a prediction of x, given the code y. Optionally, the weight matrix W' of the reverse mapping may be constrained to be the transpose of the forward mapping, W' = W^T; this is referred to as tied weights. The parameters of this model (namely W, b, b', and, if one does not use tied weights, also W') are optimized such that the average reconstruction error is minimized.
The reconstruction error can be measured in many ways, depending on the appropriate distributional assumptions on the input given the code. The traditional squared error L(x, z) = ||x - z||^2 can be used. If the input is interpreted as a bit vector or a vector of bit probabilities, the cross-entropy of the reconstruction can be used instead:

L_H(x, z) = - \sum_{k=1}^{d} [ x_k \log z_k + (1 - x_k) \log(1 - z_k) ]
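As a quick numerical illustration of the two formulas above, here is a small numpy sketch (it is not part of the original tutorial code; the toy sizes and random initialization are chosen only for illustration):

import numpy

def sigmoid(a):
    return 1.0 / (1.0 + numpy.exp(-a))

rng = numpy.random.RandomState(0)
d, d_hidden = 8, 3                         # toy dimensions, for illustration only
x = rng.uniform(size=d)                    # an input in [0, 1]^d
W = rng.uniform(low=-0.1, high=0.1, size=(d, d_hidden))
b = numpy.zeros(d_hidden)
b_prime = numpy.zeros(d)

y = sigmoid(numpy.dot(x, W) + b)           # code:           y = s(Wx + b)
z = sigmoid(numpy.dot(y, W.T) + b_prime)   # reconstruction: z = s(W'y + b'), with tied weights W' = W^T
L = -numpy.sum(x * numpy.log(z) + (1 - x) * numpy.log(1 - z))   # cross-entropy L_H(x, z)
print(L)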
The hope is that the code y is a distributed representation that captures the coordinates along the main factors of variation in the data. This is similar to the way a projection onto the principal components captures the main factors of variation. Indeed, if there is one linear hidden layer (the code) and the mean squared error criterion is used to train the network, then the k hidden units learn to project the input onto the span of the first k principal components of the data. If the hidden layer is non-linear, the autoencoder behaves differently from PCA, with the ability to capture multi-modal aspects of the input distribution.

Because y is viewed as a lossy compression of x, it cannot be a good (small-loss) compression for all x. Learning therefore drives it towards a compression that is good in particular for training examples, and hopefully for other inputs as well, but not for arbitrary inputs. That is the sense in which an autoencoder generalizes: it gives low reconstruction error on test examples from the same distribution as the training examples, but generally high reconstruction error on samples drawn at random from the input space.
We want to implement the autoencoder as a class, using Theano, so that it can later be used to build a stacked autoencoder. The first step is to create shared variables for the parameters of the autoencoder W, b and b'. (Since we are using tied weights in this tutorial, W' will be taken as W^T):
# imports needed by the code below
import numpy
import theano
import theano.tensor as T
from theano.tensor.shared_randomstreams import RandomStreams


class dA(object):
    """Denoising Auto-Encoder class (dA)

    A denoising autoencoder tries to reconstruct the input from a corrupted
    version of it by projecting it first in a latent space and reprojecting
    it afterwards back in the input space. Please refer to Vincent et al.,
    2008 for more details. If x is the input then equation (1) computes a
    partially destroyed version of x by means of a stochastic mapping q_D.
    Equation (2) computes the projection of the input into the latent space.
    Equation (3) computes the reconstruction of the input, while equation (4)
    computes the reconstruction error.

    .. math::

        \tilde{x} ~ q_D(\tilde{x}|x)                                     (1)

        y = s(W \tilde{x} + b)                                           (2)

        z = s(W' y + b')                                                 (3)

        L(x, z) = -sum_{k=1}^d [x_k \log z_k + (1-x_k) \log(1-z_k)]      (4)

    """

    def __init__(
        self,
        numpy_rng,
        theano_rng=None,
        input=None,
        n_visible=784,
        n_hidden=500,
        W=None,
        bhid=None,
        bvis=None
    ):
        """
        Initialize the dA class by specifying the number of visible units (the
        dimension d of the input), the number of hidden units (the dimension
        d' of the latent or hidden space) and the corruption level. The
        constructor also receives symbolic variables for the input, weights
        and bias. Such symbolic variables are useful when, for example, the
        input is the result of some computations, or when weights are shared
        between the dA and an MLP layer. When dealing with SdAs this always
        happens: the dA on layer 2 gets as input the output of the dA on
        layer 1, and the weights of the dA are used in the second stage of
        training to construct an MLP.

        :type numpy_rng: numpy.random.RandomState
        :param numpy_rng: numpy random number generator used to generate weights

        :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams
        :param theano_rng: Theano random generator; if None is given one is
                           generated based on a seed drawn from `rng`

        :type input: theano.tensor.TensorType
        :param input: a symbolic description of the input or None for
                      standalone dA

        :type n_visible: int
        :param n_visible: number of visible units

        :type n_hidden: int
        :param n_hidden: number of hidden units

        :type W: theano.tensor.TensorType
        :param W: Theano variable pointing to a set of weights that should be
                  shared between the dA and another architecture; if dA should
                  be standalone set this to None

        :type bhid: theano.tensor.TensorType
        :param bhid: Theano variable pointing to a set of bias values (for
                     hidden units) that should be shared between dA and
                     another architecture; if dA should be standalone set this
                     to None

        :type bvis: theano.tensor.TensorType
        :param bvis: Theano variable pointing to a set of bias values (for
                     visible units) that should be shared between dA and
                     another architecture; if dA should be standalone set this
                     to None

        """
        self.n_visible = n_visible
        self.n_hidden = n_hidden

        # create a Theano random generator that gives symbolic random values
        if not theano_rng:
            theano_rng = RandomStreams(numpy_rng.randint(2 ** 30))

        # note : W' was written as `W_prime` and b' as `b_prime`
        if not W:
            # W is initialized with `initial_W`, which is uniformly sampled
            # from -4*sqrt(6./(n_visible+n_hidden)) and
            # 4*sqrt(6./(n_hidden+n_visible)); the output of uniform is
            # converted using asarray to dtype theano.config.floatX so
            # that the code is runnable on GPU
            initial_W = numpy.asarray(
                numpy_rng.uniform(
                    low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    high=4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    size=(n_visible, n_hidden)
                ),
                dtype=theano.config.floatX
            )
            W = theano.shared(value=initial_W, name='W', borrow=True)

        if not bvis:
            bvis = theano.shared(
                value=numpy.zeros(n_visible, dtype=theano.config.floatX),
                borrow=True
            )

        if not bhid:
            bhid = theano.shared(
                value=numpy.zeros(n_hidden, dtype=theano.config.floatX),
                name='b',
                borrow=True
            )

        self.W = W
        # b corresponds to the bias of the hidden units
        self.b = bhid
        # b_prime corresponds to the bias of the visible units
        self.b_prime = bvis
        # tied weights, therefore W_prime is W transpose
        self.W_prime = self.W.T
        self.theano_rng = theano_rng
        # if no input is given, generate a variable representing the input
        if input is None:
            # we use a matrix because we expect a minibatch of several
            # examples, each example being a row
            self.x = T.dmatrix(name='input')
        else:
            self.x = input

        self.params = [self.W, self.b, self.b_prime]
Note that we pass the symbolic input to the autoencoder as a parameter. This is so that we can concatenate layers of autoencoders to form a deep network: the symbolic output (the y above) of layer k will be the symbolic input of layer k+1.
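For instance, because the input is symbolic, the code of one autoencoder can be fed directly as the input of the next one. The following is only an illustrative sketch, not part of the tutorial: it assumes rng and theano_rng have been created as in the training script further below, uses get_hidden_values (defined next), and the layer sizes and names da1/da2 are arbitrary.

x = T.matrix('x')
da1 = dA(numpy_rng=rng, theano_rng=theano_rng, input=x,
         n_visible=28 * 28, n_hidden=500)
# the symbolic code of the first dA becomes the symbolic input of the second
da2 = dA(numpy_rng=rng, theano_rng=theano_rng,
         input=da1.get_hidden_values(x),
         n_visible=500, n_hidden=250)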
Now we can express the computation of the latent representation and of the reconstructed signal:
def get_hidden_values(self, input):
    """ Computes the values of the hidden layer """
    return T.nnet.sigmoid(T.dot(input, self.W) + self.b)
def get_reconstructed_input(self, hidden):
    """ Computes the reconstructed input given the values of the
    hidden layer """
    return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime)
And using these functions we can compute the cost and the updates of one stochastic gradient descent step:
def get_cost_updates(self, corruption_level, learning_rate):
    """ This function computes the cost and the updates for one training
    step of the dA """

    tilde_x = self.get_corrupted_input(self.x, corruption_level)
    y = self.get_hidden_values(tilde_x)
    z = self.get_reconstructed_input(y)
    # note : we sum over the size of a datapoint; if we are using
    #        minibatches, L will be a vector, with one entry per
    #        example in minibatch
    L = - T.sum(self.x * T.log(z) + (1 - self.x) * T.log(1 - z), axis=1)
    # note : L is now a vector, where each element is the
    #        cross-entropy cost of the reconstruction of the
    #        corresponding example of the minibatch. We need to
    #        compute the average of all these to get the cost of
    #        the minibatch
    cost = T.mean(L)

    # compute the gradients of the cost of the `dA` with respect
    # to its parameters
    gparams = T.grad(cost, self.params)
    # generate the list of updates
    updates = [
        (param, param - learning_rate * gparam)
        for param, gparam in zip(self.params, gparams)
    ]

    return (cost, updates)
We can now define a function that, applied iteratively, will update the parameters W, b and b_prime such that the reconstruction cost is approximately minimized:
da = dA(
    numpy_rng=rng,
    theano_rng=theano_rng,
    input=x,
    n_visible=28 * 28,
    n_hidden=500
)

cost, updates = da.get_cost_updates(
    corruption_level=0.,
    learning_rate=learning_rate
)

train_da = theano.function(
    [index],
    cost,
    updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size]
    }
)
If there is no constraint besides minimizing the reconstruction error, one might expect an autoencoder with n inputs and an encoding of dimension n (or greater) to learn the identity function, merely mapping an input to its copy. Such an autoencoder would not distinguish test examples (from the training distribution) from other input configurations.
Surprisingly, experiments reported in [Bengio07] suggest that, in practice, non-linear autoencoders with more hidden units than inputs (called overcomplete), trained with stochastic gradient descent, do yield useful representations. (Here, "useful" means that a network taking the encoding as input has low classification error.)
A simple explanation is that stochastic gradient descent with early stopping is similar to an L2 regularization of the parameters. To achieve perfect reconstruction of continuous inputs, a one-hidden-layer autoencoder with non-linear hidden units (exactly like the one in the code above) needs very small weights in the first (encoding) layer, to bring the non-linearity of the hidden units into their linear regime, and very large weights in the second (decoding) layer. With binary inputs, very large weights are also needed to completely minimize the reconstruction error. Since implicit or explicit regularization makes it difficult to reach large-weight solutions, the optimization algorithm finds encodings that only work well for examples similar to those in the training set, which is what we want. It means that the representation exploits statistical regularities present in the training set, rather than merely learning to replicate the input.
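To make the role of regularization concrete, one could add an explicit L2 weight-decay term to the reconstruction cost. The tutorial does not do this; the sketch below is only illustrative, weight_decay is a hypothetical hyperparameter, and da, learning_rate, index, x, batch_size and train_set_x are assumed to come from the training script shown in this tutorial.

weight_decay = 1e-4      # hypothetical hyperparameter, not from the tutorial
cost, updates = da.get_cost_updates(corruption_level=0.,
                                    learning_rate=learning_rate)
# penalize large weights explicitly, on top of the reconstruction cost
l2_cost = cost + weight_decay * T.sum(da.W ** 2)
gparams = T.grad(l2_cost, da.params)
updates = [(param, param - learning_rate * gparam)
           for param, gparam in zip(da.params, gparams)]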
There are other ways by which an autoencoder with more hidden units than inputs could be prevented from learning the identity function and made to capture something useful about the input in its hidden representation. One option is to add sparsity (forcing many of the hidden units to be zero or near-zero); sparsity has been exploited very successfully by many [Ranzato07] [Lee08] (a rough sketch follows below). Another option is to add randomness in the transformation from input to reconstruction. This technique is used in Restricted Boltzmann Machines (discussed later in Restricted Boltzmann Machines (RBM)), as well as in the denoising autoencoders discussed below.
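As a rough sketch of the sparsity option (this is not part of the tutorial), one could add an L1 penalty on the mean hidden activation to the reconstruction cost; sparsity_weight is a hypothetical hyperparameter and da is the instance built in the training script of this tutorial.

sparsity_weight = 1e-3                     # hypothetical hyperparameter
y = da.get_hidden_values(da.x)
z = da.get_reconstructed_input(y)
reconstruction_cost = T.mean(
    -T.sum(da.x * T.log(z) + (1 - da.x) * T.log(1 - z), axis=1))
# encourage hidden activations to stay near zero
sparse_cost = reconstruction_cost + sparsity_weight * T.mean(T.abs_(y))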
Denoising Autoencoders
The idea behind denoising autoencoders is simple. In order to force the hidden layer to discover more robust features and prevent it from simply learning the identity, we train the autoencoder to reconstruct the input from a corrupted version of it.
The denoising autoencoder is a stochastic version of the autoencoder. Intuitively, a denoising autoencoder does two things: it tries to encode the input (preserve the information about the input), and it tries to undo the effect of a corruption process stochastically applied to the input of the autoencoder. The latter can only be done by capturing the statistical dependencies between the inputs. The denoising autoencoder can be understood from different perspectives (the manifold learning perspective, the stochastic operator perspective, a bottom-up information-theoretic perspective, and a top-down generative model perspective), all of which are explained in [Vincent08]. See also section 7.2 of [Bengio09] for an overview of autoencoders.
In [Vincent08], the stochastic corruption process randomly sets some of the inputs (as many as half of them) to zero. Hence, for randomly selected subsets of missing features, the denoising autoencoder tries to predict the corrupted (i.e., missing) values from the uncorrupted (i.e., non-missing) values. Note that being able to predict any subset of the variables from the rest is a sufficient condition for completely capturing the joint distribution of a set of variables (this is how Gibbs sampling works).
To convert the autoencoder class into a denoising autoencoder class, all we need to do is add a stochastic corruption step operating on the input. The input can be corrupted in many ways, but in this tutorial we will stick to the original corruption mechanism of randomly masking entries of the input by setting them to zero. The code below does just that:
from theano.tensor.shared_randomstreams import RandomStreams


def get_corrupted_input(self, input, corruption_level):
    """This function keeps ``1-corruption_level`` entries of the inputs the
    same and zeroes out a randomly selected subset of size
    ``corruption_level``
    Note : first argument of theano.rng.binomial is the shape(size) of
           random numbers that it should produce
           second argument is the number of trials
           third argument is the probability of success of any trial

           this will produce an array of 0s and 1s where 1 has a
           probability of 1 - ``corruption_level`` and 0 with
           ``corruption_level``
    """
    return self.theano_rng.binomial(size=input.shape, n=1,
                                    p=1 - corruption_level) * input
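The following tiny numpy snippet (an illustration added here, not part of the tutorial) shows on concrete numbers what this masking corruption does: each entry is kept with probability 1 - corruption_level and zeroed out otherwise.

import numpy

rng = numpy.random.RandomState(123)
x = rng.uniform(size=(1, 10))                  # a toy "minibatch" of one example
corruption_level = 0.3
mask = rng.binomial(n=1, p=1 - corruption_level, size=x.shape)
tilde_x = mask * x                             # on average 30% of the entries become zero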
In the stacked autoencoder class (Stacked Autoencoders), the weights of the dA class have to be shared with those of the corresponding sigmoid layer. For this reason, the constructor of the dA also receives Theano variables pointing to the shared parameters. If those parameters are left as None, new ones will be constructed.
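For instance (a sketch only, not shown in this tutorial): assuming the HiddenLayer class from the Multilayer Perceptron tutorial, a sigmoid layer and a dA can be made to share the same parameters by passing the layer's shared variables to the dA constructor; rng, theano_rng and x are assumed to be defined as in the training script below.

sigmoid_layer = HiddenLayer(rng=rng, input=x,
                            n_in=28 * 28, n_out=500,
                            activation=T.nnet.sigmoid)
# the dA reuses (does not copy) the layer's weights and hidden biases
da_layer = dA(numpy_rng=rng, theano_rng=theano_rng, input=x,
              n_visible=28 * 28, n_hidden=500,
              W=sigmoid_layer.W, bhid=sigmoid_layer.b)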
The final denoising autoencoder class becomes:
class dA(object):
    """Denoising Auto-Encoder class (dA)

    A denoising autoencoder tries to reconstruct the input from a corrupted
    version of it by projecting it first in a latent space and reprojecting
    it afterwards back in the input space. Please refer to Vincent et al.,
    2008 for more details. If x is the input then equation (1) computes a
    partially destroyed version of x by means of a stochastic mapping q_D.
    Equation (2) computes the projection of the input into the latent space.
    Equation (3) computes the reconstruction of the input, while equation (4)
    computes the reconstruction error.

    .. math::

        \tilde{x} ~ q_D(\tilde{x}|x)                                     (1)

        y = s(W \tilde{x} + b)                                           (2)

        z = s(W' y + b')                                                 (3)

        L(x, z) = -sum_{k=1}^d [x_k \log z_k + (1-x_k) \log(1-z_k)]      (4)

    """

    def __init__(self, numpy_rng, theano_rng=None, input=None,
                 n_visible=784, n_hidden=500, W=None, bhid=None, bvis=None):
        """
        Initialize the dA class by specifying the number of visible units (the
        dimension d of the input), the number of hidden units (the dimension
        d' of the latent or hidden space) and the corruption level. The
        constructor also receives symbolic variables for the input, weights
        and bias. Such symbolic variables are useful when, for example, the
        input is the result of some computations, or when weights are shared
        between the dA and an MLP layer. When dealing with SdAs this always
        happens: the dA on layer 2 gets as input the output of the dA on
        layer 1, and the weights of the dA are used in the second stage of
        training to construct an MLP.

        :type numpy_rng: numpy.random.RandomState
        :param numpy_rng: numpy random number generator used to generate weights

        :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams
        :param theano_rng: Theano random generator; if None is given one is
                           generated based on a seed drawn from `rng`

        :type input: theano.tensor.TensorType
        :param input: a symbolic description of the input or None for
                      standalone dA

        :type n_visible: int
        :param n_visible: number of visible units

        :type n_hidden: int
        :param n_hidden: number of hidden units

        :type W: theano.tensor.TensorType
        :param W: Theano variable pointing to a set of weights that should be
                  shared between the dA and another architecture; if dA should
                  be standalone set this to None

        :type bhid: theano.tensor.TensorType
        :param bhid: Theano variable pointing to a set of bias values (for
                     hidden units) that should be shared between dA and
                     another architecture; if dA should be standalone set this
                     to None

        :type bvis: theano.tensor.TensorType
        :param bvis: Theano variable pointing to a set of bias values (for
                     visible units) that should be shared between dA and
                     another architecture; if dA should be standalone set this
                     to None

        """
        self.n_visible = n_visible
        self.n_hidden = n_hidden

        # create a Theano random generator that gives symbolic random values
        if not theano_rng:
            theano_rng = RandomStreams(numpy_rng.randint(2 ** 30))

        # note : W' was written as `W_prime` and b' as `b_prime`
        if not W:
            # W is initialized with `initial_W`, which is uniformly sampled
            # from -4.*sqrt(6./(n_visible+n_hidden)) and
            # 4.*sqrt(6./(n_hidden+n_visible)); the output of uniform is
            # converted using asarray to dtype theano.config.floatX so that
            # the code is runnable on GPU
            initial_W = numpy.asarray(
                numpy_rng.uniform(
                    low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    high=4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    size=(n_visible, n_hidden)),
                dtype=theano.config.floatX)
            W = theano.shared(value=initial_W, name='W')

        if not bvis:
            bvis = theano.shared(
                value=numpy.zeros(n_visible, dtype=theano.config.floatX),
                name='bvis')

        if not bhid:
            bhid = theano.shared(
                value=numpy.zeros(n_hidden, dtype=theano.config.floatX),
                name='bhid')

        self.W = W
        # b corresponds to the bias of the hidden units
        self.b = bhid
        # b_prime corresponds to the bias of the visible units
        self.b_prime = bvis
        # tied weights, therefore W_prime is W transpose
        self.W_prime = self.W.T
        self.theano_rng = theano_rng
        # if no input is given, generate a variable representing the input
        if input is None:
            # we use a matrix because we expect a minibatch of several
            # examples, each example being a row
            self.x = T.dmatrix(name='input')
        else:
            self.x = input

        self.params = [self.W, self.b, self.b_prime]

    def get_corrupted_input(self, input, corruption_level):
        """This function keeps ``1-corruption_level`` entries of the inputs
        the same and zeroes out a randomly selected subset of size
        ``corruption_level``
        Note : first argument of theano.rng.binomial is the shape(size) of
               random numbers that it should produce
               second argument is the number of trials
               third argument is the probability of success of any trial

               this will produce an array of 0s and 1s where 1 has a
               probability of 1 - ``corruption_level`` and 0 with
               ``corruption_level``
        """
        return self.theano_rng.binomial(size=input.shape, n=1,
                                        p=1 - corruption_level) * input

    def get_hidden_values(self, input):
        """ Computes the values of the hidden layer """
        return T.nnet.sigmoid(T.dot(input, self.W) + self.b)

    def get_reconstructed_input(self, hidden):
        """ Computes the reconstructed input given the values of the
        hidden layer """
        return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime)

    def get_cost_updates(self, corruption_level, learning_rate):
        """ This function computes the cost and the updates for one training
        step of the dA """

        tilde_x = self.get_corrupted_input(self.x, corruption_level)
        y = self.get_hidden_values(tilde_x)
        z = self.get_reconstructed_input(y)
        # note : we sum over the size of a datapoint; if we are using
        #        minibatches, L will be a vector, with one entry per
        #        example in minibatch
        L = -T.sum(self.x * T.log(z) + (1 - self.x) * T.log(1 - z), axis=1)
        # note : L is now a vector, where each element is the cross-entropy
        #        cost of the reconstruction of the corresponding example of
        #        the minibatch. We need to compute the average of all these
        #        to get the cost of the minibatch
        cost = T.mean(L)

        # compute the gradients of the cost of the `dA` with respect
        # to its parameters
        gparams = T.grad(cost, self.params)
        # generate the list of updates
        updates = []
        for param, gparam in zip(self.params, gparams):
            updates.append((param, param - learning_rate * gparam))

        return (cost, updates)
Putting it All Together
It is now easy to construct an instance of our dA class and train it:
# allocate symbolic variables for the data
index = T.lscalar() # index to a [mini]batch
x = T.matrix('x')    # the data is presented as rasterized images

######################
# BUILDING THE MODEL #
######################

rng = numpy.random.RandomState(123)
theano_rng = RandomStreams(rng.randint(2 ** 30))

da = dA(numpy_rng=rng, theano_rng=theano_rng, input=x,
        n_visible=28 * 28, n_hidden=500)

cost, updates = da.get_cost_updates(corruption_level=0.2,
                                    learning_rate=learning_rate)

train_da = theano.function(
    [index], cost, updates=updates,
    givens={x: train_set_x[index * batch_size: (index + 1) * batch_size]})

start_time = time.clock()

############
# TRAINING #
############

# go through training epochs
for epoch in xrange(training_epochs):
    # go through training set
    c = []
    for batch_index in xrange(n_train_batches):
        c.append(train_da(batch_index))

    print 'Training epoch %d, cost ' % epoch, numpy.mean(c)

end_time = time.clock()

training_time = (end_time - start_time)

print ('Training took %f minutes' % (training_time / 60.))
In order to get a feeling for what the network has learned, we are going to plot the filters (defined by the weight matrix). Bear in mind, however, that this does not tell the whole story, since we neglect the biases and normalize the weights to values between 0 and 1.
To plot the filters we need the help of the tile_raster_images function (see Plotting Samples and Filters), so we encourage the reader to become familiar with it. With the help of the Python Imaging Library, the following lines of code will save the filters as an image:
# requires the Python Imaging Library
try:
    import PIL.Image as Image
except ImportError:
    import Image

image = Image.fromarray(tile_raster_images(
    X=da.W.get_value(borrow=True).T,
    img_shape=(28, 28), tile_shape=(10, 10),
    tile_spacing=(1, 1)))
image.save('filters_corruption_30.png')
Running the Code
To run the code:
python dA.py
The resulting filters when we do not use any noise:
The filters for 30 percent noise: