【Deep Learning】Variational Autoencoder ELBO: An Elegant Mathematical Derivation
Variational Autoencoder
- In this note, we talk about generative models, where $x$ denotes the given data, $z$ denotes the latent variable, and $\theta,\phi$ denote the parameters of the models.
Latent Variable Model
- Generate $x$ from the latent variable $z$: $p(x,z)=p(z)p(x|z)$
- Training: Maximum likelihood
$$
\begin{align*}
L(\theta)&=\sum_{x\in D}\log p(x)\\
&=\sum_{x\in D}\log \sum_{z}p(x,z;\theta)\\
&=\sum_{x\in D}\log \sum_{z} q(z)\frac{p(x,z;\theta)}{q(z)} && \text{importance sampling}\\
&\ge\sum_{x\in D}\sum_{z}q(z)\log \frac{p(x,z;\theta)}{q(z)} && \text{concavity of } \log \text{ (Jensen's inequality)}
\end{align*}
$$
- Assumption: $\sum_z q(z)=1$, i.e. $q$ is a distribution over $z$. The summation over $z$ can be regarded as an expectation (it is written as a sum just for simplicity).
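As a sanity check, the bound above can be verified numerically on a tiny discrete toy model (all the distributions below are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: discrete latent z in {0,1,2}, discrete x in {0,...,4}.
p_z = np.array([0.5, 0.3, 0.2])             # prior p(z)
p_x_given_z = rng.dirichlet(np.ones(5), 3)  # likelihood p(x|z), one row per z

x = 2                                       # an observed data point
p_xz = p_z * p_x_given_z[:, x]              # joint p(x, z) for each z
log_px = np.log(p_xz.sum())                 # exact log-evidence log p(x)

q = np.array([0.2, 0.5, 0.3])               # any distribution over z, sums to 1
elbo = np.sum(q * np.log(p_xz / q))         # sum_z q(z) log p(x,z)/q(z)

assert elbo <= log_px + 1e-12               # Jensen's inequality: ELBO <= log p(x)
```

Any valid $q$ works here; the inequality holds regardless of how badly $q$ approximates the posterior.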
ELBO
- In the above derivation, $\sum_z q(z)\log \frac{p(x,z;\theta)}{q(z)}$ is the Evidence Lower Bound (ELBO) of $\log p(x)$
- When $q(z)=p(z|x;\theta)$,
$$\sum_z q(z)\log\frac{p(x,z;\theta)}{q(z)}=\sum_z p(z|x;\theta)\log p(x;\theta)=\log p(x;\theta)$$
since $\frac{p(x,z;\theta)}{p(z|x;\theta)}=p(x;\theta)$ for every $z$.
- We can set $q(z)=p(z|x;\theta)$ to optimize a tight lower bound of $\log p(x;\theta)$
- We call $p(z|x;\theta)$ the posterior.
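The tightness of the bound when $q$ equals the posterior can also be checked numerically, on a hypothetical discrete model (the numbers are arbitrary):

```python
import numpy as np

# Toy discrete model: latent z in {0,1,2}, one fixed observation x0.
p_z = np.array([0.5, 0.3, 0.2])          # prior p(z)
p_x_given_z = np.array([0.1, 0.7, 0.4])  # p(x = x0 | z) for each z
p_xz = p_z * p_x_given_z                 # joint p(x0, z)
log_px = np.log(p_xz.sum())              # exact log p(x0)

q = p_xz / p_xz.sum()                    # posterior p(z|x0) = p(x0,z) / p(x0)
elbo = np.sum(q * np.log(p_xz / q))      # ELBO with q set to the posterior

assert abs(elbo - log_px) < 1e-12        # the bound is tight: ELBO == log p(x)
```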
- Don’t know $p(z|x;\theta)$? Use a network $q(z;\phi)$ to parameterize $p(z|x)$.
- Optimize $q(z;\phi)\approx p(z|x;\theta)$ and $p(x|z;\theta)$ alternately.
- Since we use $q(z;\phi)$ to approximate $p(z|x;\theta)$, what is the distance metric between them?
- $KL(q||p)=\sum_z q(z)\log \frac{q(z)}{p(z)}$
- Compared to $KL(p||q)$, $KL(q||p)$ is the reverse KL.
- Empirically, we often use $KL(p||q)$, where $p$ is the ground-truth distribution; that’s why $KL(q||p)$ is called ‘reverse’.
- The procedure of finding such a $\phi$ is called Variational Inference: $\min_\phi KL(q||p)$.
- Look at the optimization of $KL(q||p)$:
$$
\begin{align*}
KL(q(z;\phi)||p(z|x))&=\sum_{z}q(z;\phi)\log \frac{q(z;\phi)}{p(z|x)}\\
&=\sum_{z}q(z;\phi)\log \frac{q(z;\phi)p(x)}{p(z,x)}\\
&=\log p(x)-\sum_zq(z;\phi)\log \frac{p(z,x)}{q(z;\phi)}
\end{align*}
$$
Amazing! $\sum_{z}q(z;\phi)\log \frac{p(z,x)}{q(z;\phi)}$ is just the ELBO! When we minimize $KL(q||p)$, we are also maximizing the ELBO, which means the objectives we alternately train for $p(x|z;\theta)$ and $q(z;\phi)$ are, magically, the same! What’s more, we can also find that
$$\log p(x) = KL(q(z;\phi)||p(z|x)) + \mathrm{ELBO} = \mathrm{ApproxError} + \mathrm{ELBO}$$
which verifies that the ELBO is a lower bound of $\log p(x)$, and their difference is exactly the approximation error between $q(z;\phi)$ and $p(z|x)$.
- Notice: since $q(z;\phi)\approx p(z|x;\theta)$, $q$ depends on $x$; hence we can write $q(z|x;\phi)$ instead of $q(z;\phi)$, which is named Amortized Variational Inference.
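The decomposition $\log p(x) = KL + \mathrm{ELBO}$ can itself be checked numerically on a toy discrete model (all numbers below are arbitrary illustrations):

```python
import numpy as np

# Toy discrete model at a fixed observation x0.
p_z = np.array([0.5, 0.3, 0.2])          # prior p(z)
p_x_given_z = np.array([0.1, 0.7, 0.4])  # p(x0 | z)
p_xz = p_z * p_x_given_z                 # joint p(x0, z)
log_px = np.log(p_xz.sum())              # exact log p(x0)
posterior = p_xz / p_xz.sum()            # exact posterior p(z | x0)

q = np.array([0.25, 0.25, 0.5])          # an arbitrary approximate posterior
kl = np.sum(q * np.log(q / posterior))   # KL(q || p(z|x0)) -- the approx. error
elbo = np.sum(q * np.log(p_xz / q))      # the ELBO under this q

# The identity holds exactly, for ANY q, not just good ones.
assert abs(log_px - (kl + elbo)) < 1e-12
```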
- Now the ELBO is our only joint objective. Train $\theta,\phi$ together!
$$
\begin{align*}
J(\theta,\phi;x)&=\sum_z q(z|x;\phi)\log\frac{p(x,z;\theta)}{q(z|x;\phi)}\\
&=\sum_z q(z|x;\phi)\left( \log p(x|z;\theta)+\log p(z;\theta)-\log q(z|x;\phi)\right)\\
&=\sum_z q(z|x;\phi)\log p(x|z;\theta)-\sum_z q(z|x;\phi)\log \frac{q(z|x;\phi)}{p(z;\theta)}\\
&=\mathbb{E}_{z\sim q(\cdot|x;\phi)}\log p(x|z;\theta)-KL(q(z|x;\phi)||p(z;\theta))
\end{align*}
$$
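The first and last lines of this derivation can be confirmed to agree on a small made-up discrete example:

```python
import numpy as np

p_z = np.array([0.5, 0.3, 0.2])          # prior p(z; theta)
p_x_given_z = np.array([0.1, 0.7, 0.4])  # decoder p(x|z; theta) at the observed x
q = np.array([0.25, 0.25, 0.5])          # encoder q(z|x; phi)

# Form 1: J = sum_z q(z|x) log p(x,z) / q(z|x)
j1 = np.sum(q * np.log(p_z * p_x_given_z / q))

# Form 2: J = E_q[log p(x|z)] - KL(q(z|x) || p(z))
recon = np.sum(q * np.log(p_x_given_z))  # expected reconstruction log-likelihood
kl = np.sum(q * np.log(q / p_z))         # KL between encoder and prior
j2 = recon - kl

assert abs(j1 - j2) < 1e-12              # the two forms are algebraically equal
```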
VAE
- Practically, we obtain the VAE from the ELBO.
- Assume that
$$p(z)\sim N(0,I),\quad q(z|x;\phi)\sim N(\mu_\phi(x),\sigma_\phi(x)),\quad p(x|z;\theta)\sim N(\mu_\theta(z),\sigma_\theta(z))$$
They are all Gaussian, whose means and variances come from networks.
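Under these Gaussian assumptions, the $KL(q(z|x;\phi)||p(z))$ term has the well-known closed form $\frac{1}{2}\sum_i\left(\mu_i^2+\sigma_i^2-1-\log\sigma_i^2\right)$. A sketch that checks this formula against direct numerical integration (the values of $\mu,\sigma$ are arbitrary):

```python
import numpy as np

def kl_to_std_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dims."""
    return 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - np.log(sigma**2))

# Check the 1-D case against a dense numerical integral of q(z) log(q(z)/p(z)).
mu, sigma = 0.7, 1.3
z = np.linspace(-12.0, 12.0, 400001)
dz = z[1] - z[0]
q = np.exp(-(z - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
p = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
kl_numeric = np.sum(q * np.log(q / p)) * dz

kl_closed = kl_to_std_normal(np.array([mu]), np.array([sigma]))
assert abs(kl_closed - kl_numeric) < 1e-6
```

This closed form is what VAE implementations typically use for the KL term, since it avoids any sampling noise in that part of the objective.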
- Let $q(z|x;\phi)$ be the encoder and $p(x|z;\theta)$ be the decoder; then $\mathbb{E}_{z\sim q(\cdot|x;\phi)}\log p(x|z;\theta)$ represents the reconstruction error:
- The error after encoding into the latent space, then decoding back into the original space.
- We want this term to be large, so that the original data can be recovered with high probability.
- Re-parameterization trick:
- In the $\mathbb{E}_{z\sim q(\cdot|x;\phi)}\log p(x|z;\theta)$ term, $\phi$ parameterizes the sampling distribution, so the gradient cannot be back-propagated through the sampling step directly.
- Sample $z'\sim N(0,I)$, then compute $z=\mu+z'\cdot \sigma$.
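A minimal numpy sketch of the trick (the mean and std below stand in for a hypothetical encoder's output for one input $x$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder output for one input x: mean and std of q(z|x; phi).
mu = np.array([0.5, -1.0])
sigma = np.array([0.8, 1.2])

# Re-parameterization: sample eps ~ N(0, I), then z = mu + sigma * eps.
# z is now a deterministic function of (mu, sigma), so gradients can flow
# through mu and sigma even though eps itself is random.
eps = rng.standard_normal((100000, 2))
z = mu + sigma * eps

# Sanity check: the samples indeed follow N(mu, diag(sigma^2)).
assert np.allclose(z.mean(axis=0), mu, atol=0.02)
assert np.allclose(z.std(axis=0), sigma, atol=0.02)
```

In an autodiff framework the same line `z = mu + sigma * eps` is what makes the reconstruction term differentiable with respect to $\phi$.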
Conclusion
- The amazing and elegant mathematical derivation behind the VAE inspired me to write this blog.
- Furthermore, the VAE shows great stability across many tasks compared to the GAN. There are still more pros and cons to discuss.