【Deep Learning】Variational Autoencoder ELBO: An Elegant Mathematical Derivation
Variational Autoencoder
- In this note, we talk about generative models, where $x$ denotes the given data, $z$ denotes the latent variable, and $\theta,\phi$ denote the parameters of the models.
Latent Variable Model
- Generate $x$ from the latent variable $z$: $p(x,z)=p(z)p(x|z)$
- Training: Maximum likelihood
$$
\begin{align*}
L(\theta)&=\sum_{x\in D}\log p(x)\\
&=\sum_{x\in D}\log \sum_{z}p(x,z;\theta)\\
&=\sum_{x\in D}\log \sum_{z} q(z)\frac{p(x,z;\theta)}{q(z)} && \text{importance sampling}\\
&\ge\sum_{x\in D}\sum_{z}q(z)\log \frac{p(x,z;\theta)}{q(z)} && \text{concavity of } \log \text{ (Jensen's inequality)}
\end{align*}
$$
- Assumption: $\sum_z q(z)=1$, i.e. $q$ is a distribution over $z$. The summation over $z$ can be regarded as an expectation (it is written as a sum just for simplicity).
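As a sanity check, the bound above can be verified numerically on a tiny discrete toy model (all the distributions below are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: discrete latent z in {0,1,2}, discrete x in {0,...,4}.
p_z = np.array([0.5, 0.3, 0.2])             # prior p(z)
p_x_given_z = rng.dirichlet(np.ones(5), 3)  # likelihood p(x|z), one row per z

x = 2                                       # an observed data point
p_xz = p_z * p_x_given_z[:, x]              # joint p(x, z) for each z
log_px = np.log(p_xz.sum())                 # exact log-evidence log p(x)

q = np.array([0.2, 0.5, 0.3])               # any distribution over z, sums to 1
elbo = np.sum(q * np.log(p_xz / q))         # sum_z q(z) log p(x,z)/q(z)

assert elbo <= log_px + 1e-12               # Jensen's inequality: ELBO <= log p(x)
```

Any valid $q$ works here; the inequality holds regardless of how badly $q$ approximates the posterior.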
ELBO
- In the above derivation, $\sum_z q(z)\log \frac{p(x,z;\theta)}{q(z)}$ is the Evidence Lower Bound (ELBO) of $\log p(x)$
- When $q(z)=p(z|x;\theta)$,
$$\sum_z q(z)\log\frac{p(x,z;\theta)}{q(z)}=\sum_z p(z|x;\theta)\log p(x;\theta)=\log p(x;\theta)$$
since $\frac{p(x,z;\theta)}{p(z|x;\theta)}=p(x;\theta)$ for every $z$.
- We can set $q(z)=p(z|x;\theta)$ to optimize a tight lower bound of $\log p(x;\theta)$
- We call $p(z|x;\theta)$ the posterior.
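The tightness of the bound when $q$ equals the posterior can also be checked numerically, on a hypothetical discrete model (the numbers are arbitrary):

```python
import numpy as np

# Toy discrete model: latent z in {0,1,2}, one fixed observation x0.
p_z = np.array([0.5, 0.3, 0.2])          # prior p(z)
p_x_given_z = np.array([0.1, 0.7, 0.4])  # p(x = x0 | z) for each z
p_xz = p_z * p_x_given_z                 # joint p(x0, z)
log_px = np.log(p_xz.sum())              # exact log p(x0)

q = p_xz / p_xz.sum()                    # posterior p(z|x0) = p(x0,z) / p(x0)
elbo = np.sum(q * np.log(p_xz / q))      # ELBO with q set to the posterior

assert abs(elbo - log_px) < 1e-12        # the bound is tight: ELBO == log p(x)
```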
- Don’t know $p(z|x;\theta)$? Use a network $q(z;\phi)$ to parameterize $p(z|x)$.
- Optimize $q(z;\phi)\approx p(z|x;\theta)$ and $p(x|z;\theta)$ alternately.
- Since we use $q(z;\phi)$ to approximate $p(z|x;\theta)$, what is the distance metric between them?
- $KL(q||p)=\sum_z q(z)\log \frac{q(z)}{p(z)}$
- Compared to $KL(p||q)$, $KL(q||p)$ is the reverse KL.
- Empirically, we often use $KL(p||q)$, where $p$ is the ground-truth distribution; that’s why $KL(q||p)$ is called ‘reverse’.
- The procedure of finding such a $\phi$ is called Variational Inference: $\min_\phi KL(q||p)$.
- Look at the optimization of $KL(q||p)$:
$$
\begin{align*}
KL(q(z;\phi)||p(z|x))&=\sum_{z}q(z;\phi)\log \frac{q(z;\phi)}{p(z|x)}\\
&=\sum_{z}q(z;\phi)\log \frac{q(z;\phi)p(x)}{p(z,x)}\\
&=\log p(x)-\sum_zq(z;\phi)\log \frac{p(z,x)}{q(z;\phi)}
\end{align*}
$$
Amazing! $\sum_{z}q(z;\phi)\log \frac{p(z,x)}{q(z;\phi)}$ is just the ELBO! When we minimize $KL(q||p)$, we are also maximizing the ELBO, which means the objectives we alternately train for $p(x|z;\theta)$ and $q(z;\phi)$ are, magically, the same! What’s more, we can also find that
$$\log p(x) = KL(q(z;\phi)||p(z|x)) + \mathrm{ELBO} = \mathrm{ApproxError} + \mathrm{ELBO}$$
which verifies that the ELBO is a lower bound of $\log p(x)$, and their difference is exactly the approximation error between $q(z;\phi)$ and $p(z|x)$.
- Notice: since $q(z;\phi)\approx p(z|x;\theta)$, $q$ depends on $x$; hence we can write $q(z|x;\phi)$ instead of $q(z;\phi)$, which is named Amortized Variational Inference.
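The decomposition $\log p(x) = KL + \mathrm{ELBO}$ can itself be checked numerically on a toy discrete model (all numbers below are arbitrary illustrations):

```python
import numpy as np

# Toy discrete model at a fixed observation x0.
p_z = np.array([0.5, 0.3, 0.2])          # prior p(z)
p_x_given_z = np.array([0.1, 0.7, 0.4])  # p(x0 | z)
p_xz = p_z * p_x_given_z                 # joint p(x0, z)
log_px = np.log(p_xz.sum())              # exact log p(x0)
posterior = p_xz / p_xz.sum()            # exact posterior p(z | x0)

q = np.array([0.25, 0.25, 0.5])          # an arbitrary approximate posterior
kl = np.sum(q * np.log(q / posterior))   # KL(q || p(z|x0)) -- the approx. error
elbo = np.sum(q * np.log(p_xz / q))      # the ELBO under this q

# The identity holds exactly, for ANY q, not just good ones.
assert abs(log_px - (kl + elbo)) < 1e-12
```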
- Now the ELBO is our only joint objective. Train $\theta,\phi$ together!
$$
\begin{align*}
J(\theta,\phi;x)&=\sum_z q(z|x;\phi)\log\frac{p(x,z;\theta)}{q(z|x;\phi)}\\
&=\sum_z q(z|x;\phi)\left( \log p(x|z;\theta)+\log p(z;\theta)-\log q(z|x;\phi)\right)\\
&=\sum_z q(z|x;\phi)\log p(x|z;\theta)-\sum_z q(z|x;\phi)\log \frac{q(z|x;\phi)}{p(z;\theta)}\\
&=\mathbb{E}_{z\sim q(\cdot|x;\phi)}\log p(x|z;\theta)-KL(q(z|x;\phi)||p(z;\theta))
\end{align*}
$$
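The first and last lines of this derivation can be confirmed to agree on a small made-up discrete example:

```python
import numpy as np

p_z = np.array([0.5, 0.3, 0.2])          # prior p(z; theta)
p_x_given_z = np.array([0.1, 0.7, 0.4])  # decoder p(x|z; theta) at the observed x
q = np.array([0.25, 0.25, 0.5])          # encoder q(z|x; phi)

# Form 1: J = sum_z q(z|x) log p(x,z) / q(z|x)
j1 = np.sum(q * np.log(p_z * p_x_given_z / q))

# Form 2: J = E_q[log p(x|z)] - KL(q(z|x) || p(z))
recon = np.sum(q * np.log(p_x_given_z))  # expected reconstruction log-likelihood
kl = np.sum(q * np.log(q / p_z))         # KL between encoder and prior
j2 = recon - kl

assert abs(j1 - j2) < 1e-12              # the two forms are algebraically equal
```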
VAE
- Practically, we obtain the VAE from the ELBO.
- Assume that
$$p(z)\sim N(0,I),\quad q(z|x;\phi)\sim N(\mu_\phi(x),\sigma_\phi(x)),\quad p(x|z;\theta)\sim N(\mu_\theta(z),\sigma_\theta(z))$$
They are all Gaussian, whose means and variances come from networks.
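Under these Gaussian assumptions, the $KL(q(z|x;\phi)||p(z))$ term has the well-known closed form $\frac{1}{2}\sum_i\left(\mu_i^2+\sigma_i^2-1-\log\sigma_i^2\right)$. A sketch that checks this formula against direct numerical integration (the values of $\mu,\sigma$ are arbitrary):

```python
import numpy as np

def kl_to_std_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dims."""
    return 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - np.log(sigma**2))

# Check the 1-D case against a dense numerical integral of q(z) log(q(z)/p(z)).
mu, sigma = 0.7, 1.3
z = np.linspace(-12.0, 12.0, 400001)
dz = z[1] - z[0]
q = np.exp(-(z - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
p = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
kl_numeric = np.sum(q * np.log(q / p)) * dz

kl_closed = kl_to_std_normal(np.array([mu]), np.array([sigma]))
assert abs(kl_closed - kl_numeric) < 1e-6
```

This closed form is what VAE implementations typically use for the KL term, since it avoids any sampling noise in that part of the objective.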
- Let $q(z|x;\phi)$ be the encoder and $p(x|z;\theta)$ be the decoder; then $\mathbb{E}_{z\sim q(\cdot|x;\phi)}\log p(x|z;\theta)$ represents the reconstruction error:
- The error after encoding into the latent space, then decoding back into the original space.
- We want this term to be large, so that the original data can be recovered with high probability.
- Re-parameterization trick:
- In the $\mathbb{E}_{z\sim q(\cdot|x;\phi)}\log p(x|z;\theta)$ term, $\phi$ parameterizes the sampling distribution, so the gradient cannot be back-propagated through the sampling step directly.
- Sample $z'\sim N(0,I)$, then compute $z=\mu+z'\cdot \sigma$.
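A minimal numpy sketch of the trick (the mean and std below stand in for a hypothetical encoder's output for one input $x$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder output for one input x: mean and std of q(z|x; phi).
mu = np.array([0.5, -1.0])
sigma = np.array([0.8, 1.2])

# Re-parameterization: sample eps ~ N(0, I), then z = mu + sigma * eps.
# z is now a deterministic function of (mu, sigma), so gradients can flow
# through mu and sigma even though eps itself is random.
eps = rng.standard_normal((100000, 2))
z = mu + sigma * eps

# Sanity check: the samples indeed follow N(mu, diag(sigma^2)).
assert np.allclose(z.mean(axis=0), mu, atol=0.02)
assert np.allclose(z.std(axis=0), sigma, atol=0.02)
```

In an autodiff framework the same line `z = mu + sigma * eps` is what makes the reconstruction term differentiable with respect to $\phi$.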
Conclusion
- The amazing and elegant mathematical derivation behind the VAE inspired me to write this blog.
- Furthermore, the VAE shows great stability across many tasks compared to the GAN. There are still more pros and cons to discuss.