This article introduces exponential family distributions and how they are used in variational inference.
Exponential Family Distributions
- The pdf / pmf of an exponential family distribution can be written as:
$$p(x\mid\eta) = h(x)\exp\left(T(x)^T\eta - A(\eta)\right)$$
Here, $T(x)$ and $h(x)$ are functions of $x$ only, while $A(\eta)$ is a function of $\eta$ only. $T(x)$ is called the sufficient statistic, and $A(\eta)$ is called the log-normalizer. $A(\eta)$ plays an important role in variational inference. Because the density must integrate to 1, the log-normalizer satisfies:
$$\frac{\int h(x)\exp(T(x)^T\eta)\,dx}{\exp(A(\eta))} = 1 \quad\Longrightarrow\quad A(\eta) = \log\int h(x)\exp(T(x)^T\eta)\,dx$$
- Many familiar distributions belong to the exponential family, for example:
Normal, beta, Poisson, gamma, Bernoulli, chi-squared, geometric, exponential, categorical… (the Bernoulli case is worked out just below, and the Gaussian case follows.)
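As a quick illustration of the definition, the Bernoulli distribution with parameter $\theta$ can be rewritten in this form (a short worked example added here for concreteness):

$$
\begin{aligned}
p(x\mid\theta) &= \theta^x(1-\theta)^{1-x},\qquad x\in\{0,1\}\\
&= \exp\left(x\log\frac{\theta}{1-\theta} + \log(1-\theta)\right)
\end{aligned}
$$

so $h(x)=1$, $T(x)=x$, $\eta=\log\frac{\theta}{1-\theta}$, and $A(\eta)=-\log(1-\theta)=\log(1+e^{\eta})$.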
- Take the Gaussian distribution as an example:
$$p(x\mid\theta) = p(x\mid\mu,\sigma^2) = N(\mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
- Example: writing the Gaussian in exponential family form amounts to replacing the mean and variance parameters with $\eta_1, \eta_2$.
$$
\begin{aligned}
N(x\mid\mu,\sigma^2) &= (2\pi\sigma^2)^{-\frac{1}{2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\\
&= \exp\left(-\frac{x^2-2x\mu+\mu^2}{2\sigma^2}-\frac{1}{2}\ln(2\pi\sigma^2)\right)\\
&= \exp\left(-\frac{1}{2\sigma^2}x^2+\frac{\mu}{\sigma^2}x-\frac{\mu^2}{2\sigma^2}-\frac{1}{2}\ln(2\pi\sigma^2)\right)\\
&= \exp\left(\begin{bmatrix} x \\ x^2 \end{bmatrix}^T\begin{bmatrix} \frac{\mu}{\sigma^2}\\ -\frac{1}{2\sigma^2} \end{bmatrix}-\frac{\mu^2}{2\sigma^2}-\frac{1}{2}\ln(2\pi\sigma^2)\right)
\end{aligned}
$$
From this we can read off:
$$
\begin{aligned}
T(x) &= \begin{bmatrix} x \\ x^2 \end{bmatrix}\\
\eta &= \begin{bmatrix} \eta_1\\ \eta_2 \end{bmatrix} = \begin{bmatrix} \frac{\mu}{\sigma^2}\\ -\frac{1}{2\sigma^2} \end{bmatrix}\\
\theta &= \begin{bmatrix} \mu\\ \sigma^2 \end{bmatrix} = \begin{bmatrix} -\frac{\eta_1}{2\eta_2}\\ -\frac{1}{2\eta_2} \end{bmatrix}\\
A(\eta) &= -\frac{\eta_1^2}{4\eta_2}-\frac{1}{2}\ln(-2\eta_2)
\end{aligned}
$$
So the mean and variance can be expressed as:
$$
\eta_2 = -\frac{1}{2\sigma^2} \;\Rightarrow\; \sigma^2 = -\frac{1}{2\eta_2},\qquad
\mu = \eta_1\sigma^2 = \eta_1\cdot\frac{-1}{2\eta_2} = -\frac{\eta_1}{2\eta_2}
$$
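As a quick numerical sanity check of this mapping, the sketch below converts between $(\mu,\sigma^2)$ and $(\eta_1,\eta_2)$ and verifies that the exponential family form reproduces the usual Gaussian density (illustrative code; the function names are ours, and $h(x)=(2\pi)^{-1/2}$ here since the $\frac{1}{2}\ln(2\pi)$ factor is not absorbed into $A(\eta)$):

```python
import numpy as np
from scipy.stats import norm

def natural_from_standard(mu, sigma2):
    """Map (mu, sigma^2) to natural parameters (eta1, eta2)."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)

def standard_from_natural(eta1, eta2):
    """Invert the mapping: mu = -eta1/(2*eta2), sigma^2 = -1/(2*eta2)."""
    return -eta1 / (2.0 * eta2), -1.0 / (2.0 * eta2)

def log_normalizer(eta1, eta2):
    """A(eta) = -eta1^2/(4*eta2) - 0.5*ln(-2*eta2)."""
    return -eta1**2 / (4.0 * eta2) - 0.5 * np.log(-2.0 * eta2)

def expfam_logpdf(x, eta1, eta2):
    """log of h(x) * exp(T(x)^T eta - A(eta)), with h(x) = 1/sqrt(2*pi)."""
    return (-0.5 * np.log(2 * np.pi)
            + eta1 * x + eta2 * x**2
            - log_normalizer(eta1, eta2))

mu, sigma2 = 1.5, 0.7
eta1, eta2 = natural_from_standard(mu, sigma2)
x = np.linspace(-2, 5, 7)
# The exponential-family form should match the usual Gaussian density.
assert np.allclose(expfam_logpdf(x, eta1, eta2),
                   norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)))
assert np.allclose(standard_from_natural(eta1, eta2), (mu, sigma2))
```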
- Why are exponential family distributions useful?
- If a conditional distribution can be written in the form above, many problems become much easier to solve.
- For example, solving $\underset{\eta}{\operatorname{argmax}}\,[\log p(X\mid\eta)]$ (maximum likelihood):
$$
\begin{aligned}
\underset{\eta}{\operatorname{argmax}}\,[\log p(X\mid\eta)] &= \underset{\eta}{\operatorname{argmax}}\left[\log\prod_{i=1}^{N}p(x_i\mid\eta)\right]\\
&= \underset{\eta}{\operatorname{argmax}}\sum_{i=1}^{N}\left[\log h(x_i)+T(x_i)^T\eta-A(\eta)\right]\\
&= \underset{\eta}{\operatorname{argmax}}\left[\sum_{i=1}^{N}T(x_i)^T\eta-NA(\eta)\right]
\end{aligned}
$$
Denote the objective above by $L(\eta)$; then
$$\frac{\partial L(\eta)}{\partial\eta} = \sum_{i=1}^{N}T(x_i) - NA'(\eta) = 0$$
That is:
$$A'(\eta) = \frac{\sum_{i=1}^{N}T(x_i)}{N}$$
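For the Gaussian example, $T(x)=(x,\,x^2)^T$ and $A'(\eta)=\left(\mu,\ \mu^2+\sigma^2\right)$, so this condition matches the model's first two moments to the sample moments, giving $\hat\mu=\bar x$ and $\hat\sigma^2=\overline{x^2}-\bar x^2$. A minimal numerical check (illustrative code, not from the original):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)

# Sample averages of the sufficient statistics T(x) = (x, x^2).
t1_bar = x.mean()
t2_bar = (x**2).mean()

# A'(eta) = E[T(x)] = (mu, mu^2 + sigma^2); matching these to the sample
# moments gives the maximum-likelihood estimates.
mu_hat = t1_bar
sigma2_hat = t2_bar - t1_bar**2

assert np.isclose(mu_hat, x.mean())
assert np.isclose(sigma2_hat, x.var())   # np.var uses the 1/N (MLE) form
print(mu_hat, sigma2_hat)
```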
- Conjugacy:
$$p(\beta\mid x) \propto p(x\mid\beta)\,p(\beta)$$
If the likelihood and the prior are conjugate, the posterior belongs to the same family of distributions as the prior.
If the likelihood is an exponential family distribution, in theory a conjugate prior (itself an exponential family distribution) can always be found (see the Beta-Bernoulli sketch below).
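A standard illustration (added here, not part of the original derivation) is the Beta prior for a Bernoulli likelihood: the posterior is again a Beta, with the observed counts simply added to the prior's parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(n=1, p=0.3, size=50)      # Bernoulli(0.3) observations

# Beta(a0, b0) prior on the Bernoulli parameter.
a0, b0 = 2.0, 2.0

# Conjugacy: the posterior is again a Beta, with the sufficient statistics
# (number of successes / failures) added to the prior parameters.
a_post = a0 + x.sum()
b_post = b0 + len(x) - x.sum()

print("posterior:", f"Beta({a_post}, {b_post})",
      "mean =", a_post / (a_post + b_post))
```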
- A useful identity: $A_l'(\beta)=E_{p(x\mid\beta)}[T(x)]$
Proof:
$$p(x\mid\beta)=h(x)\exp\left(T(x)^T\beta-A_l(\beta)\right)$$
Since $\int p(x\mid\beta)\,dx=1$, differentiating both sides with respect to $\beta$ gives
$$
\begin{aligned}
0 = \frac{\partial}{\partial\beta}\int p(x\mid\beta)\,dx
&= \int_x \frac{\partial}{\partial\beta}\,h(x)\exp\left[T(x)^T\beta-A_l(\beta)\right]dx\\
&= \int_x h(x)\exp\left[T(x)^T\beta-A_l(\beta)\right]\left(T(x)-A_l'(\beta)\right)dx\\
&= \int_x h(x)\exp\left[T(x)^T\beta-A_l(\beta)\right]T(x)\,dx-\int_x h(x)\exp\left[T(x)^T\beta-A_l(\beta)\right]A_l'(\beta)\,dx\\
&= E_{p(x\mid\beta)}[T(x)]-A_l'(\beta)
\end{aligned}
$$
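For the Gaussian log-normalizer derived earlier, this identity can be checked numerically: the gradient of $A(\eta)$ should equal $\left(E[x],\,E[x^2]\right)=\left(\mu,\ \mu^2+\sigma^2\right)$. A small sketch using finite differences (illustrative code):

```python
import numpy as np

def log_normalizer(eta1, eta2):
    """Gaussian log-normalizer A(eta) from the example above."""
    return -eta1**2 / (4.0 * eta2) - 0.5 * np.log(-2.0 * eta2)

mu, sigma2 = 0.8, 2.0
eta1, eta2 = mu / sigma2, -1.0 / (2.0 * sigma2)

# Finite-difference gradient of A(eta).
eps = 1e-6
dA_deta1 = (log_normalizer(eta1 + eps, eta2)
            - log_normalizer(eta1 - eps, eta2)) / (2 * eps)
dA_deta2 = (log_normalizer(eta1, eta2 + eps)
            - log_normalizer(eta1, eta2 - eps)) / (2 * eps)

# E[T(x)] = (E[x], E[x^2]) = (mu, mu^2 + sigma^2).
assert np.isclose(dA_deta1, mu, atol=1e-4)
assert np.isclose(dA_deta2, mu**2 + sigma2, atol=1e-4)
```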
- Let $X$ denote the data, $Z$ the latent variables, and $\beta$ the parameters.
The posterior distribution:
$$
\begin{aligned}
p(\beta,Z\mid X) &= p(\beta\mid Z,X)\,p(Z\mid X)\\
&= p(Z\mid\beta,X)\,p(\beta\mid X)
\end{aligned}
$$
The two conditional posteriors $p(\beta\mid Z,X)$ and $p(Z\mid\beta,X)$ are both assumed to be exponential family distributions.
Then:
$$p(\beta\mid Z,X)=h(\beta)\exp\left(T(\beta)^T\eta(Z,X)-A_l(\eta(Z,X))\right)$$
In variational inference, we want to approximate $p(\beta\mid Z,X)$ with a distribution $q(\beta\mid\lambda)$, i.e.:
$$p(\beta\mid Z,X)\approx q(\beta\mid\lambda)=h(\beta)\exp\left(T(\beta)^T\lambda-A_g(\lambda)\right)$$
We then adjust $\lambda$ iteratively so that $q(\beta\mid\lambda)$ moves closer and closer to $p(\beta\mid Z,X)$, i.e., so that the ELBO increases.
Likewise for $p(Z\mid\beta,X)$:
$$
\begin{aligned}
p(Z\mid\beta,X) &= h(Z)\exp\left(T(Z)^T\eta(\beta,X)-A_l(\eta(\beta,X))\right)\\
&\approx q(Z\mid\phi)=h(Z)\exp\left(T(Z)^T\phi-A_g(\phi)\right)
\end{aligned}
$$
The ELBO is:
$$L(q)=E_{q(Z,\beta)}[\log p(X,Z,\beta)]-E_{q(Z,\beta)}[\log q(Z,\beta)]$$
Written in terms of the variational parameters, the ELBO becomes:
$$L(\lambda,\phi)=E_{q(Z,\beta)}[\log p(X,Z,\beta)]-E_{q(Z,\beta)}[\log q(Z,\beta)]$$
Goal: find $\lambda$ and $\phi$ that maximize the ELBO.
Method:
- Fix one parameter and optimize the other, then alternate (coordinate ascent).
Concretely:
- Fix $\phi$ and optimize $\lambda$.
$$
\begin{aligned}
L(\lambda,\phi) &= E_{q(Z,\beta)}[\log p(X,Z,\beta)]-E_{q(Z,\beta)}[\log q(Z,\beta)]\\
&= E_{q(Z,\beta)}[\log p(\beta\mid X,Z)+\log p(Z\mid X)]-E_{q(Z,\beta)}[\log q(\beta)]-E_{q(Z,\beta)}[\log q(Z)]+\text{const}\\
&= E_{q(Z,\beta)}[\log p(\beta\mid X,Z)]-E_{q(Z,\beta)}[\log q(\beta\mid\lambda)]+\text{const}
\end{aligned}
$$
where $q(Z,\beta)=q(\beta\mid\lambda)\,q(Z\mid\phi)$ under the mean-field factorization, and the constant absorbs all terms that do not depend on $\lambda$.
- Substitute the exponential family forms of $p(\beta\mid Z,X)$ and $q(\beta\mid\lambda)$ into the expression above.
$$
\begin{aligned}
L(\lambda,\phi) &= E_{q(Z,\beta)}[\log h(\beta)]+E_{q(Z,\beta)}[T(\beta)^T\eta(Z,X)]-E_{q(Z,\beta)}[A_l(\eta(Z,X))]\\
&\quad -E_{q(Z,\beta)}[\log h(\beta)]-E_{q(Z,\beta)}[T(\beta)^T\lambda]+A_g(\lambda)\\
&= E_{q(\beta)}[T(\beta)]^T\,E_{q(Z)}[\eta(Z,X)]-E_{q(Z)}[A_l(\eta(Z,X))]-E_{q(\beta)}[T(\beta)]^T\lambda+A_g(\lambda)\\
&= A_g'(\lambda)^T\,E_{q(Z)}[\eta(Z,X)]-A_g'(\lambda)^T\lambda+A_g(\lambda)
\end{aligned}
$$
Here the identity $E_{q(\beta\mid\lambda)}[T(\beta)]=A_g'(\lambda)$ (the result proved above, applied to $q$) is used, and the term $E_{q(Z)}[A_l(\eta(Z,X))]$ is dropped in the last line because it does not depend on $\lambda$.
- Take the derivative of the above with respect to $\lambda$:
$$
\begin{aligned}
\frac{\partial L(\lambda,\phi)}{\partial\lambda} &= A_g''(\lambda)\,E_{q(Z)}[\eta(Z,X)]-A_g''(\lambda)\,\lambda-A_g'(\lambda)+A_g'(\lambda)\\
&= A_g''(\lambda)\left(E_{q(Z)}[\eta(Z,X)]-\lambda\right)=0
\end{aligned}
$$
- If $A_g''(\lambda)\neq 0$, then
$$\lambda=E_{q(Z\mid\phi)}[\eta(Z,X)]$$
By the same argument, fixing $\lambda$ and optimizing $\phi$ gives
$$\phi=E_{q(\beta\mid\lambda)}[\eta(X,\beta)]$$
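As an illustration of these coordinate-ascent updates, here is a minimal sketch for a toy conjugate model (chosen by us, not from the original text): a global mean $\mu\sim N(0,\sigma_\mu^2)$, local latents $z_i\sim N(\mu,\sigma_z^2)$, and observations $x_i\sim N(z_i,\sigma_x^2)$. All complete conditionals are Gaussian, so $q(\mu\mid\lambda)$ and each $q(z_i\mid\phi_i)$ are Gaussians, and the fixed-point equations above reduce to plugging the other factor's expected value into the conditional's natural parameters.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: mu ~ N(0, s_mu2), z_i ~ N(mu, s_z2), x_i ~ N(z_i, s_x2).
s_mu2, s_z2, s_x2 = 10.0, 1.0, 0.5
N = 200
true_mu = 3.0
z = rng.normal(true_mu, np.sqrt(s_z2), size=N)
x = rng.normal(z, np.sqrt(s_x2))

# Gaussian natural parameters: eta = (mean/var, -1/(2*var)).
def mean_of(eta1, eta2):
    return -eta1 / (2.0 * eta2)

# Variational natural parameters: lambda for q(mu), one pair phi_i per q(z_i).
lam = np.array([0.0, -0.5])
phi = np.stack([np.zeros(N), -0.5 * np.ones(N)], axis=1)

for _ in range(50):
    # lambda = E_{q(Z|phi)}[eta(Z, X)]: the complete conditional p(mu | z, x)
    # has natural parameters ( sum_i z_i/s_z2 , -1/2*(1/s_mu2 + N/s_z2) ).
    Ez = mean_of(phi[:, 0], phi[:, 1])
    lam = np.array([Ez.sum() / s_z2,
                    -0.5 * (1.0 / s_mu2 + N / s_z2)])

    # phi_i = E_{q(mu|lambda)}[eta(mu, x_i)]: p(z_i | mu, x_i) has natural
    # parameters ( mu/s_z2 + x_i/s_x2 , -1/2*(1/s_z2 + 1/s_x2) ).
    Emu = mean_of(lam[0], lam[1])
    phi[:, 0] = Emu / s_z2 + x / s_x2
    phi[:, 1] = -0.5 * (1.0 / s_z2 + 1.0 / s_x2)

print("E_q[mu] =", mean_of(lam[0], lam[1]), " sample mean of x =", x.mean())
```

Each iteration simply swaps in the expected natural parameters of the other factor, which is exactly the pair of updates $\lambda=E_{q(Z\mid\phi)}[\eta(Z,X)]$ and $\phi=E_{q(\beta\mid\lambda)}[\eta(X,\beta)]$ derived above.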