20210914-A Complete Guide to DDPM-周阅


A Complete Guide to DDPM

Terminology

[Figure: ddpm]

$q(x_t)$: the distribution of the diffusion (forward) process
$p(x_t)$: the distribution of the reverse process

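Note:
$\beta_1,\dots,\beta_T$ is the variance schedule. $\theta$ denotes the learnable parameters; $\theta$ acts on the forward input, as detailed later.
The reverse process starts from $p(x_T):=N(x_T;0,I)$. $\theta$ is learned so that $p_{\theta}(x_0)$ approximates $q_{data}(x_0)$, i.e. so that $p_{\theta}(x_0)$ is maximized; this is also how the loss of a diffusion model is defined, as detailed later.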

Definitions

Diffusion Process

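The diffusion process is defined by the variance schedule $\beta_1,\dots,\beta_T$:

$$
\begin{aligned}
q(x_t|x_{t-1}) &:= N(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I) \\
q(x_{1:T}|x_0) &:= \prod_{t=1}^{T} q(x_t|x_{t-1})
\end{aligned}
$$

where $\beta_t$ is the variance schedule, also called the diffusion rate.

Here $\prod$ is not a factorial: it is the product of the per-step conditional densities along the Markov chain, so the steps are chained (nested) one after another, similar to the notation used in normalizing flows.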
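The forward definition can be simulated directly. Below is a minimal sketch for scalar data using only the Python standard library. The linear schedule from 1e-4 to 0.02 with T = 1000 and the helper names (`linear_beta_schedule`, `forward_step`, `forward_chain`) are assumptions made for illustration.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (assumed schedule, stored 0-indexed)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t) for a scalar x_prev."""
    noise = rng.gauss(0.0, 1.0)
    return math.sqrt(1.0 - beta_t) * x_prev + math.sqrt(beta_t) * noise

def forward_chain(x0, betas, rng):
    """Run the whole Markov chain and return [x_0, x_1, ..., x_T]."""
    xs = [x0]
    for beta_t in betas:
        xs.append(forward_step(xs[-1], beta_t, rng))
    return xs

rng = random.Random(0)
betas = linear_beta_schedule(T=1000)
trajectory = forward_chain(x0=3.0, betas=betas, rng=rng)
print(len(trajectory), trajectory[0], trajectory[-1])
```

Each call to `forward_step` is one factor $q(x_t|x_{t-1})$ of the product, and `forward_chain` applies the factors one after another.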

Reverse Process

The reverse process starts from $p(x_T):=N(x_T;0,I)$:

$$
\begin{aligned}
p(x_T) &:= N(x_T; 0, I) \\
p_{\theta}(x_{t-1}|x_t) &:= N(x_{t-1}; \mu_{\theta}(x_t,t), \Sigma_{\theta}(x_t,t) I) \\
p_{\theta}(x_{0:T}) &= p_{\theta}(x_0, x_1, \dots, x_T) = p(x_T) \cdot \prod_{t=1}^{T} p_{\theta}(x_{t-1}|x_t)
\end{aligned}
$$

Again, $\prod$ is not a factorial: it is the product of the per-step conditional densities, chaining the steps as in normalizing flows.

Loss

$\theta$ is learned so that $p_{\theta}(x_0)$ approximates $q_{data}(x_0)$; that is, $\theta$ is obtained by maximum log-likelihood estimation, which is equivalent to minimizing the negative log-likelihood.

To express $p_{\theta}(x_0)$, let $p_{latent}(x_T):=N(x_T;0,I)$; then:

$$
\begin{aligned}
p_{\theta}(x_0, x_1, \dots, x_{T-1}|x_T) \cdot p(x_T) &= p_{\theta}(x_0, x_1, \dots, x_T) \\
p_{\theta}(x_0) &= \int p_{\theta}(x_0, x_1, \dots, x_T)\, dx_1 dx_2 \dots dx_T = \int p_{\theta}(x_{0:T})\, dx_{1:T}
\end{aligned}
$$

$\theta$ is found by minimizing a variational bound on the negative log-likelihood (this diffusion process is known as the variance preserving forward SDE). The negative log-likelihood is:

$$
-E_{q_{data}(x_0)} \log p_{\theta}(x_0) = -E_{q_{data}(x_0)} \left( \log E_{q(x_1,\dots,x_T|x_0)} \left[ \frac{p_{\theta}(x_0, x_1, \dots, x_{T-1}|x_T) \cdot p(x_T)}{q(x_1,\dots,x_T|x_0)} \right] \right)
$$

Optimizing the usual variational bound: moving the expectations in front of the log (Jensen's inequality) and merging them into a single expectation over the joint distribution gives:

$$
-E_{q_{data}(x_0)} \log p_{\theta}(x_0) \leq -E_{q(x_0,\dots,x_T)} \log \left[ \frac{p_{\theta}(x_0, x_1, \dots, x_{T-1}|x_T) \cdot p(x_T)}{q(x_1,\dots,x_T|x_0)} \right]
$$

In shorthand:

$$
-E_{q_{data}(x_0)} \log p_{\theta}(x_0) \leq -E_{q(x_{0:T})} \log \left[ \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \right] =: L
$$

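$L$ can be further simplified to:

$$
L = E_q \left[ \underbrace{D_{KL}(q(x_T|x_0)\,\|\,p(x_T))}_{L_T} + \sum_{t>1} \underbrace{D_{KL}(q(x_{t-1}|x_t,x_0)\,\|\,p_{\theta}(x_{t-1}|x_t))}_{L_{t-1}} \underbrace{- \log p_{\theta}(x_0|x_1)}_{L_0} \right]
$$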

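The detailed derivation is as follows (note: the derivation below omits the minus sign, following [2]; for the derivation with the minus sign see [3], Extra information, A "Extended derivations"):

[Figure: proof, the step-by-step derivation]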
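For reference, the standard argument can be sketched in text form. This is a reconstruction under the definitions above, written without the leading minus sign (that is, for $-L$); it is not a transcription of the original figure. The key step uses the Markov property and Bayes' rule, $q(x_t|x_{t-1}) = q(x_t|x_{t-1},x_0) = \frac{q(x_{t-1}|x_t,x_0)\, q(x_t|x_0)}{q(x_{t-1}|x_0)}$ for $t \geq 2$, after which the ratios $\frac{q(x_{t-1}|x_0)}{q(x_t|x_0)}$ telescope.

$$
\begin{aligned}
-L &= E_q \left[ \log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \right] \\
&= E_q \left[ \log p(x_T) + \sum_{t=1}^{T} \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})} \right] \\
&= E_q \left[ \log p(x_T) + \sum_{t=2}^{T} \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t,x_0)} + \sum_{t=2}^{T} \log \frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} + \log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)} \right] \\
&= E_q \left[ \log \frac{p(x_T)}{q(x_T|x_0)} + \sum_{t=2}^{T} \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t,x_0)} + \log p_{\theta}(x_0|x_1) \right] \\
&= -E_q \left[ D_{KL}(q(x_T|x_0)\,\|\,p(x_T)) + \sum_{t>1} D_{KL}(q(x_{t-1}|x_t,x_0)\,\|\,p_{\theta}(x_{t-1}|x_t)) - \log p_{\theta}(x_0|x_1) \right]
\end{aligned}
$$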

Note that computing the loss requires

$$
\begin{aligned}
& q(x_T|x_0) \\
& q(x_{t-1}|x_t,x_0) \\
& p_{\theta}(x_{t-1}|x_t),\ p_{\theta}(x_0|x_1)
\end{aligned}
$$

The first two distributions are derived below. The third is the distribution proposed by DDPM [3], which makes DDPM resemble denoising score matching [4].

Derivations

$q(x_T|x_0)$

To obtain the distribution of $x_t$ given $x_0$, start from $q(x_{1:T}|x_0):=\prod_{t=1}^Tq(x_t|x_{t-1})$.

Take $q(x_3|x_0)$ as an example. When $t=3$, with independent noises $\epsilon_1, \epsilon_2, \epsilon_3 \sim N(0, I)$:

$$
\begin{aligned}
q(x_1|x_0) &:= N(x_1; \sqrt{1-\beta_1}\,x_0, \beta_1 I), \text{ so } x_1 = \sqrt{1-\beta_1}\,x_0 + \sqrt{\beta_1}\,\epsilon_1 \\
q(x_2|x_1) &:= N(x_2; \sqrt{1-\beta_2}\,x_1, \beta_2 I), \text{ so } x_2 = \sqrt{1-\beta_2}\,(\sqrt{1-\beta_1}\,x_0 + \sqrt{\beta_1}\,\epsilon_1) + \sqrt{\beta_2}\,\epsilon_2 \\
q(x_3|x_2) &:= N(x_3; \sqrt{1-\beta_3}\,x_2, \beta_3 I), \text{ so } x_3 = \sqrt{1-\beta_3}\,[\sqrt{1-\beta_2}\,(\sqrt{1-\beta_1}\,x_0 + \sqrt{\beta_1}\,\epsilon_1) + \sqrt{\beta_2}\,\epsilon_2] + \sqrt{\beta_3}\,\epsilon_3
\end{aligned}
$$

Let $\alpha_t=1-\beta_t,\overline {\alpha_t}=\prod_{s=1}^t\alpha_s$. Then the mean and variance of $q(x_3|x_0)$ are

$$
\begin{aligned}
mean &= \sqrt{1-\beta_3}\sqrt{1-\beta_2}\sqrt{1-\beta_1} \cdot x_0 = \sqrt{\overline{\alpha_3}} \cdot x_0 \\
std^2 &= (\sqrt{1-\beta_3}\sqrt{1-\beta_2}\sqrt{\beta_1})^2 + (\sqrt{1-\beta_3}\sqrt{\beta_2})^2 + (\sqrt{\beta_3})^2 \\
&= \alpha_3\alpha_2(1-\alpha_1) + \alpha_3(1-\alpha_2) + (1-\alpha_3) = 1 - \alpha_1\alpha_2\alpha_3 = 1 - \overline{\alpha_3}
\end{aligned}
$$

Hence $q(x_3|x_0)=N(x_3;\sqrt{\overline {\alpha_3}}\cdot x_0,(1-\overline {\alpha_3})I)$. By the same argument, generalizing to all $t$ gives:

$$
q(x_t|x_0) := N(x_t; \sqrt{\overline{\alpha_t}} \cdot x_0, (1-\overline{\alpha_t}) I)
$$
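The closed form can be checked numerically without any sampling: propagate the mean coefficient and the variance one step at a time and compare with $\sqrt{\overline{\alpha_t}}$ and $1-\overline{\alpha_t}$. This is a minimal sketch for the scalar case; the three beta values and the function names are arbitrary illustrative choices.

```python
import math

def closed_form(betas, t):
    """Return (mean coefficient, variance) of q(x_t | x_0) from alpha_bar_t."""
    alpha_bar = 1.0
    for beta in betas[:t]:
        alpha_bar *= 1.0 - beta
    return math.sqrt(alpha_bar), 1.0 - alpha_bar

def by_recursion(betas, t):
    """Propagate the mean coefficient and the variance one step at a time."""
    coef, var = 1.0, 0.0  # x_0 itself: coefficient 1, no noise yet
    for beta in betas[:t]:
        coef = math.sqrt(1.0 - beta) * coef  # mean is scaled by sqrt(1 - beta_t)
        var = (1.0 - beta) * var + beta      # old noise is scaled, new noise is added
    return coef, var

betas = [0.1, 0.2, 0.3]  # arbitrary illustrative values for t = 1, 2, 3
for t in range(1, len(betas) + 1):
    print(t, closed_form(betas, t), by_recursion(betas, t))
```

The variance update `(1 - beta) * var + beta` is the scalar form of the sum of squares written out for $t=3$ above.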

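Alternatively, the general case can be derived by iterating the one-step relation, with $\epsilon_t, \overline{\epsilon}, \epsilon \sim N(0, I)$:

$$
\begin{aligned}
x_t &= \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_t \\
&= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\,\overline{\epsilon} \\
&= \dots \\
&= \sqrt{\overline{\alpha_t}}\,x_0 + \sqrt{1-\overline{\alpha_t}}\,\epsilon
\end{aligned}
$$

Each step merges two independent Gaussian noises, whose variances add: $\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1-\alpha_t\alpha_{t-1}$.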

$q(x_{t-1}|x_t,x_0)$

By Bayes' rule, the law of total probability, etc.:

$$
\begin{aligned}
q(x_{t-1}|x_t,x_0) &= \frac{q(x_{t-1},x_t,x_0)}{q(x_t,x_0)} = \frac{q(x_t|x_{t-1},x_0) \cdot q(x_0,x_{t-1})}{q(x_t,x_0)} \\
&= \frac{q(x_t|x_{t-1},x_0) \cdot q(x_{t-1}|x_0) \cdot q(x_0)}{q(x_t|x_0) \cdot q(x_0)} = \frac{q(x_t|x_{t-1},x_0) \cdot q(x_{t-1}|x_0)}{q(x_t|x_0)}
\end{aligned}
$$

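The distribution of $x_{t-1}$ given $x_t,x_0$ can be further simplified to:

$$
q(x_{t-1}|x_t,x_0) = N(x_{t-1}; \overline{\mu_t}(x_t,x_0), \overline{\beta_t} I)
$$

The mean $\overline{\mu_t}(x_t,x_0)$ is written out in the next section, and $\overline {\beta_t}$ denotes:

$$
\overline{\beta_t} = \frac{1-\overline{\alpha_{t-1}}}{1-\overline{\alpha_t}} \cdot \beta_t
$$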

$p_{\theta}(x_{t-1}|x_t)$

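The loss term $L_{t-1}$ is the KL divergence between $q(x_{t-1}|x_t,x_0)$ and $p_{\theta}(x_{t-1}|x_t)$. $q(x_{t-1}|x_t,x_0)$ is Gaussian, and $p_{\theta}(x_{t-1}|x_t)$ is defined as $p_{\theta}(x_{t-1}|x_t):=N(x_{t-1};\mu_{\theta}(x_t,t),\Sigma_{\theta}(x_t,t)I)$. DDPM chooses specific forms for $\mu_{\theta}(x_t,t),\Sigma_{\theta}(x_t,t)$, which yields a closed-form expression and establishes the correspondence with denoising score matching, as described below.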

Recall the definition of the reverse process given earlier:

$$
p_{\theta}(x_{t-1}|x_t) := N(x_{t-1}; \mu_{\theta}(x_t,t), \Sigma_{\theta}(x_t,t) I)
$$

The terms of the loss that involve it are, for each $t > 1$ (they are summed over $t$ in $L$):

$$
L_{t-1} = D_{KL}(q(x_{t-1}|x_t,x_0)\,\|\,p_{\theta}(x_{t-1}|x_t))
$$

$q(x_{t-1}|x_t,x_0)$ is abbreviated as:

$$
\begin{aligned}
q(x_{t-1}|x_t,x_0) &:= N(x_{t-1}; \overline{\mu_t}(x_t,x_0), \overline{\beta_t} I) \\
\text{where } \overline{\mu_t}(x_t,x_0) &= \frac{\sqrt{\overline{\alpha_{t-1}}} \cdot \beta_t}{1-\overline{\alpha_t}} \cdot x_0 + \frac{\sqrt{\alpha_t} \cdot (1-\overline{\alpha_{t-1}})}{1-\overline{\alpha_t}} \cdot x_t
\end{aligned}
$$
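The posterior parameters can be cross-checked against a direct product of Gaussians. In the scalar case, multiplying $q(x_t|x_{t-1})$ (precision $\alpha_t/\beta_t$ in $x_{t-1}$) by $q(x_{t-1}|x_0)$ (precision $1/(1-\overline{\alpha_{t-1}})$) should reproduce the coefficients above. The sketch below assumes a linear schedule from 1e-4 to 0.02 with T = 1000; the function names are illustrative.

```python
import math

def alpha_bars(betas):
    """Cumulative products alpha_bar_t; index 0 holds alpha_bar_0 = 1."""
    out = [1.0]
    for beta in betas:
        out.append(out[-1] * (1.0 - beta))
    return out

def posterior_params(betas, t):
    """(coefficient of x_0, coefficient of x_t, variance) from the closed form; t is 1-indexed."""
    abar = alpha_bars(betas)
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    coef_x0 = math.sqrt(abar[t - 1]) * beta_t / (1.0 - abar[t])
    coef_xt = math.sqrt(alpha_t) * (1.0 - abar[t - 1]) / (1.0 - abar[t])
    var = (1.0 - abar[t - 1]) / (1.0 - abar[t]) * beta_t
    return coef_x0, coef_xt, var

def posterior_by_bayes(betas, t):
    """Same quantities from multiplying the two Gaussians; requires t >= 2."""
    abar = alpha_bars(betas)
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    prior_var = 1.0 - abar[t - 1]                    # variance of q(x_{t-1} | x_0)
    precision = alpha_t / beta_t + 1.0 / prior_var   # precisions add
    var = 1.0 / precision
    coef_xt = var * math.sqrt(alpha_t) / beta_t
    coef_x0 = var * math.sqrt(abar[t - 1]) / prior_var
    return coef_x0, coef_xt, var

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]  # assumed schedule
for t in (2, 10, 500, 1000):
    print(t, posterior_params(betas, t), posterior_by_bayes(betas, t))
```

The Bayes version is undefined at $t=1$, where $q(x_0|x_0)$ has zero variance; the closed form then gives variance 0 and mean $x_0$.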

First, the std

Ho et al. experimented with two choices:

$\Sigma_{\theta}(x_t,t)=\sigma_t^2=\beta_t$, $\beta_t \to 1$
$\Sigma_{\theta}(x_t,t)=\sigma_t^2=\overline {\beta_t}$, $\overline {\beta_t}<\beta_t$

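The first choice is optimal for $x_0 \sim N(0,I)$, and the second is optimal for $x_0$ deterministically set to one point. These are the two extreme choices corresponding to upper and lower bounds on reverse process entropy for data with coordinatewise unit variance [1].

With either choice, $\Sigma_{\theta}(x_t,t)$ does not depend on $\theta$, and $q(x_{t-1}|x_t,x_0)$ does not depend on $\theta$ either, so the std terms of the two distributions contribute only a constant $C$ when substituted into the KL divergence.

Therefore, the second choice is used in the experiments.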
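To see how far apart the two variance choices are, both can be tabulated for a concrete schedule. This is a small sketch; the linear schedule from 1e-4 to 0.02 with T = 1000 and the name `variance_choices` are assumptions for illustration.

```python
def variance_choices(betas):
    """Return (beta_t, beta_bar_t) pairs for t = 1..T, where
    beta_bar_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t."""
    pairs = []
    abar_prev = 1.0  # alpha_bar_0
    for beta_t in betas:
        abar_t = abar_prev * (1.0 - beta_t)
        beta_bar = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
        pairs.append((beta_t, beta_bar))
        abar_prev = abar_t
    return pairs

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]  # assumed schedule
pairs = variance_choices(betas)
for t in (1, 2, 10, 100, 1000):
    beta_t, beta_bar = pairs[t - 1]
    print(t, beta_t, beta_bar, beta_bar / beta_t)
```

Because $\overline{\alpha_t} < \overline{\alpha_{t-1}}$, the ratio $\overline{\beta_t}/\beta_t$ is always below 1. It is exactly 0 at $t=1$ and approaches 1 for large $t$, so the two choices differ mostly near the data end of the chain.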

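Next, the mean

Substituting $C$ into $L_{t-1}$ gives:

$$
L_{t-1} = E_q \left[ \frac{1}{2\sigma_t^2} \left\| \overline{\mu_t}(x_t,x_0) - \mu_{\theta}(x_t,t) \right\|^2 \right] + C
$$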

From the results above:

$$
\begin{aligned}
q(x_t|x_0) := N(x_t; \sqrt{\overline{\alpha_t}} \cdot x_0, (1-\overline{\alpha_t}) I) &\to x_t = \sqrt{\overline{\alpha_t}} \cdot x_0 + \sqrt{1-\overline{\alpha_t}} \cdot \epsilon \to x_0 = \frac{1}{\sqrt{\overline{\alpha_t}}} \cdot (x_t - \sqrt{1-\overline{\alpha_t}} \cdot \epsilon) \\
\overline{\mu_t}(x_t,x_0) &= \frac{\sqrt{\overline{\alpha_{t-1}}} \cdot \beta_t}{1-\overline{\alpha_t}} \cdot x_0 + \frac{\sqrt{\alpha_t} \cdot (1-\overline{\alpha_{t-1}})}{1-\overline{\alpha_t}} \cdot x_t
\end{aligned}
$$

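Substituting these into the expression for $L_{t-1}$ gives:

$$
\begin{aligned}
L_{t-1} - C &= E_{x_0,\epsilon} \left[ \frac{1}{2\sigma_t^2} \left\| \overline{\mu_t}\left(x_t(x_0,\epsilon), \frac{1}{\sqrt{\overline{\alpha_t}}}(x_t(x_0,\epsilon) - \sqrt{1-\overline{\alpha_t}}\,\epsilon)\right) - \mu_{\theta}(x_t(x_0,\epsilon),t) \right\|^2 \right] \\
&= E_{x_0,\epsilon} \left[ \frac{1}{2\sigma_t^2} \left\| \frac{1}{\sqrt{\alpha_t}}\left(x_t(x_0,\epsilon) - \frac{\beta_t}{\sqrt{1-\overline{\alpha_t}}}\,\epsilon\right) - \mu_{\theta}(x_t(x_0,\epsilon),t) \right\|^2 \right]
\end{aligned}
$$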

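This shows that $\mu_{\theta}$ must predict $\frac {1}{\sqrt{\alpha_t}}(x_t(x_0,\epsilon)-\frac {\beta_t}{\sqrt {1-\overline {\alpha_t}}}\epsilon)$ given $x_t(x_0,\epsilon)$. (The only quantity in this target that $\mu_{\theta}$ does not have access to is $\epsilon$. During training both $x_0$ and $\epsilon$ are known, so the prediction of $\epsilon$ can be parameterized by $\theta$ and optimized in the training process. $x_t$ is already an input of $\mu_{\theta}$, which is a function of $x_t,t$, so no new independent variable is introduced.) Ho et al. therefore define the mean as:

$$
\mu_{\theta}(x_t,t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\overline{\alpha_t}}}\,\epsilon_{\theta}(x_t,t) \right)
$$

where $\epsilon_{\theta}$ is a function approximator that predicts $\epsilon$, i.e. the Gaussian noise, from $x_t$.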
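The substitution of $x_0 = \frac{1}{\sqrt{\overline{\alpha_t}}}(x_t - \sqrt{1-\overline{\alpha_t}}\,\epsilon)$ into $\overline{\mu_t}(x_t,x_0)$ can be verified numerically. The sketch below evaluates the posterior mean once as a function of $(x_t, x_0)$ and once as a function of $(x_t, \epsilon)$ for scalar inputs; the function name and the test values are illustrative assumptions.

```python
import math

def posterior_mean_two_ways(betas, t, x0, eps):
    """Return the posterior mean computed from (x_t, x_0) and from (x_t, eps); t is 1-indexed."""
    abar_prev = 1.0
    for beta in betas[:t - 1]:
        abar_prev *= 1.0 - beta
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    abar_t = abar_prev * alpha_t

    # x_t(x_0, eps) from the closed form of q(x_t | x_0)
    x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps

    # Form 1: posterior mean written in terms of x_0 and x_t
    mean_from_x0 = (
        math.sqrt(abar_prev) * beta_t / (1.0 - abar_t) * x0
        + math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t) * x_t
    )
    # Form 2: the same mean written in terms of x_t and eps
    mean_from_eps = (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps) / math.sqrt(alpha_t)
    return mean_from_x0, mean_from_eps

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]  # assumed schedule
for t in (1, 2, 250, 1000):
    print(t, posterior_mean_two_ways(betas, t, x0=0.7, eps=-1.3))
```

The two forms agree up to floating-point error, which is why a network that predicts $\epsilon$ is enough to define $\mu_{\theta}$.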

Putting it together

The closed-form expression for $x_{t-1}$:

$$
\begin{aligned}
x_{t-1} &= mean + std \cdot z = \frac{1}{\sqrt{\alpha_t}} \cdot \left( x_t - \frac{\beta_t}{\sqrt{1-\overline{\alpha_t}}}\,\epsilon_{\theta}(x_t,t) \right) + \sigma_t z \\
&\text{where } z \sim N(0, I)
\end{aligned}
$$
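The update for $x_{t-1}$ translates into a short sampling loop. The sketch below is for scalar data and makes several assumptions that are not in the text above: a linear schedule from 1e-4 to 0.02 with T = 1000, the variance choice $\sigma_t^2=\overline{\beta_t}$, no noise at the last step ($t=1$), and a toy `eps_model` that is exact for a data set whose every sample equals the constant `c`. In practice `eps_model` would be a trained network.

```python
import math
import random

def reverse_step(x_t, t, betas, alpha_bars, eps_model, rng):
    """One sampling step x_t -> x_{t-1} for a scalar x_t (t is 1-indexed)."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    abar_t = alpha_bars[t]
    abar_prev = alpha_bars[t - 1]
    mean = (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps_model(x_t, t)) / math.sqrt(alpha_t)
    sigma_t = math.sqrt((1.0 - abar_prev) / (1.0 - abar_t) * beta_t)  # sigma_t^2 = beta_bar_t
    z = rng.gauss(0.0, 1.0) if t > 1 else 0.0  # assumed: no noise at the final step
    return mean + sigma_t * z

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]  # assumed schedule
alpha_bars = [1.0]  # alpha_bars[t] = alpha_bar_t, with alpha_bar_0 = 1
for beta in betas:
    alpha_bars.append(alpha_bars[-1] * (1.0 - beta))

c = 2.0  # toy data set: every x_0 equals c

def eps_model(x_t, t):
    """Exact noise predictor for the toy point-mass data (stand-in for a trained network)."""
    return (x_t - math.sqrt(alpha_bars[t]) * c) / math.sqrt(1.0 - alpha_bars[t])

rng = random.Random(0)
x = rng.gauss(0.0, 1.0)  # x_T ~ N(0, 1)
for t in range(T, 0, -1):
    x = reverse_step(x, t, betas, alpha_bars, eps_model, rng)
print(x)
```

With the exact predictor, each step samples from $q(x_{t-1}|x_t,x_0=c)$, so the chain ends at a value very close to `c` regardless of the starting noise.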

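When sampling $x_{t-1}$, $\epsilon_{\theta}$ acts on the diffusion variable.

The sampling procedure resembles Langevin dynamics with $\epsilon_{\theta}$ as a learned gradient of the data density. In addition, Eq. 10 of [3] simplifies to:

$$
E_{x_0,\epsilon} \left[ \frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\overline{\alpha_t})} \left\| \epsilon - \epsilon_{\theta}(\sqrt{\overline{\alpha_t}} \cdot x_0 + \sqrt{1-\overline{\alpha_t}} \cdot \epsilon, t) \right\|^2 \right]
$$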

which resembles denoising score matching over multiple noise scales indexed by $t$. Since the expression above is equal to (one term of) the variational bound for the Langevin-like reverse process, optimizing an objective resembling denoising score matching is equivalent to using variational inference to fit the finite-time marginal of a sampling chain resembling Langevin dynamics.

Given $x_t$, $\mu_{\theta}$ could also be parameterized to predict $x_0$ directly, but the authors found experimentally that this produces lower-quality images.
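The weighted noise-prediction objective can be written down for a single scalar sample as follows. This is a sketch under stated assumptions: the linear schedule, the choice $\sigma_t^2=\beta_t$ in the demo (which avoids dividing by $\overline{\beta_1}=0$), and the two toy predictors are illustrative only.

```python
import math

def loss_weight(betas, alpha_bars, t, sigma2_t):
    """Weight beta_t^2 / (2 sigma_t^2 alpha_t (1 - alpha_bar_t)); t is 1-indexed."""
    beta_t = betas[t - 1]
    return beta_t ** 2 / (2.0 * sigma2_t * (1.0 - beta_t) * (1.0 - alpha_bars[t]))

def noise_prediction_loss(x0, t, eps, eps_model, betas, alpha_bars, sigma2_t):
    """Single-sample value of the weighted squared error between eps and its prediction."""
    x_t = math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1.0 - alpha_bars[t]) * eps
    err = eps - eps_model(x_t, t)
    return loss_weight(betas, alpha_bars, t, sigma2_t) * err ** 2

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]  # assumed schedule
alpha_bars = [1.0]
for beta in betas:
    alpha_bars.append(alpha_bars[-1] * (1.0 - beta))

c = 2.0  # toy data point

def zero_model(x_t, t):
    """Untrained stand-in that always predicts zero noise."""
    return 0.0

def exact_model(x_t, t):
    """Exact predictor when every x_0 equals c (stand-in for a trained network)."""
    return (x_t - math.sqrt(alpha_bars[t]) * c) / math.sqrt(1.0 - alpha_bars[t])

t, eps = 500, 0.8
sigma2_t = betas[t - 1]  # first variance choice, sigma_t^2 = beta_t
loss_zero = noise_prediction_loss(c, t, eps, zero_model, betas, alpha_bars, sigma2_t)
loss_exact = noise_prediction_loss(c, t, eps, exact_model, betas, alpha_bars, sigma2_t)
print(loss_zero, loss_exact)
```

In training, the expectation over $x_0$ and $\epsilon$ is approximated by averaging this quantity over a mini-batch.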

Therefore, the diffusion process and reverse process of DDPM can be summarized as follows:

[Figure: ddpmalg, the DDPM training and sampling algorithms]

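In training, $t \sim \mathrm{Uniform}(\{1,\dots,T\})$ determines which $\beta$ is used. In actual implementations, different samples within one iteration get different values of $t$. For the same sample, $t$ is redrawn at random in every loop, which amounts to training a Markov chain of a different length each time. After enough iterations the whole chain has been traversed and trained many times, and chains of every length are each optimized toward minimal loss.
E.g., in the DDPM paper $T=1000$, i.e. the chain has length 1000.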
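The per-sample choice of $t$ can be sketched as follows: every sample in a batch draws its own $t$ and its own noise, and $x_t$ is formed in one shot from the closed form of $q(x_t|x_0)$. The batch contents, the linear schedule, and the name `make_training_batch` are assumptions for illustration; the network update itself is omitted.

```python
import math
import random

def make_training_batch(x0_batch, alpha_bars, T, rng):
    """For each scalar sample, draw t ~ Uniform{1..T} and noise eps, then form x_t."""
    batch = []
    for x0 in x0_batch:
        t = rng.randint(1, T)  # inclusive on both ends
        eps = rng.gauss(0.0, 1.0)
        x_t = math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1.0 - alpha_bars[t]) * eps
        batch.append((x_t, t, eps))  # network input (x_t, t), regression target eps
    return batch

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]  # assumed schedule
alpha_bars = [1.0]
for beta in betas:
    alpha_bars.append(alpha_bars[-1] * (1.0 - beta))

rng = random.Random(0)
x0_batch = [0.5 * i for i in range(16)]  # toy data
visited = set()
for iteration in range(2000):
    batch = make_training_batch(x0_batch, alpha_bars, T, rng)
    visited.update(t for _, t, _ in batch)
print(len(visited), "of", T, "timesteps visited")
```

Because $x_t$ is sampled directly from $x_0$, no chain has to be simulated during training; over many iterations every timestep is visited many times.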

Reference

[1]. Deep Unsupervised Learning using Nonequilibrium Thermodynamics, 2015.

[2]. DiffWave: A Versatile Diffusion Model for Audio Synthesis, 2021.

[3]. Denoising Diffusion Probabilistic Models, 2020.

[4]. Generative Modeling by Estimating Gradients of the Data Distribution, 2019.