DDPM: A Complete Walkthrough

Terminology

[figure: ddpm — directed graphical model of the diffusion (forward) and reverse processes]

Note:

  • diffusion process: fixed to a Markov chain that gradually adds Gaussian noise to the data according to a variance schedule $\beta_1, \dots, \beta_T$; that is, each transition is Gaussian and the current state depends only on the previous one. The next subsection gives the predefined form of these distributions, where the variance schedule is a set of hand-chosen constants. As the figure above shows, the diffusion process itself does not involve $\theta$; $\theta$ only enters when computing the loss, where it is applied to the forward-process variables, as detailed later.
  • reverse process: defined as a Markov chain with learned Gaussian transitions starting at $p(x_T) := \mathcal{N}(x_T; 0, I)$. Here the mean and std of each transition are functions of $\theta$. To make $p_\theta(x_0)$ as close as possible to $q_{\mathrm{data}}(x_0)$, we need to find the definitions of the mean and std that maximize the likelihood of $p_\theta(x_0)$; this is exactly how the diffusion model's loss is defined, as detailed later.

Predefined Distributions

Diffusion Process

Following the definition in the original diffusion model paper [1], the diffusion process is fixed to a Markov chain that gradually adds Gaussian noise to the data according to a variance schedule $\beta_1, \dots, \beta_T$:

$$
q(x_t|x_{t-1}) := \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\,x_{t-1},\, \beta_t I\big),\qquad
q(x_{1:T}|x_0) := \prod_{t=1}^{T} q(x_t|x_{t-1})
$$

where $\beta_t$ is the variance schedule, also called the diffusion rate.

Note that the $\prod$ here is not a factorial; it denotes the chained product of conditionals, similar to the notation used in normalizing flows.
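To make the forward step concrete, here is a minimal NumPy sketch of sampling $x_t \sim q(x_t|x_{t-1})$. The linear schedule values ($\beta_1 = 10^{-4}$ to $\beta_T = 0.02$, $T = 1000$) are the ones reported in [3]; the shapes and the helper name q_step are illustrative.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear variance schedule beta_1 ... beta_T, as in [3]
rng = np.random.default_rng(0)

def q_step(x_prev, t):
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    beta_t = betas[t - 1]             # betas is 0-indexed, timesteps are 1-indexed
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * rng.standard_normal(x_prev.shape)

x = rng.standard_normal((3, 32, 32)) # stand-in for a data sample x_0
for t in range(1, T + 1):            # running the whole chain gradually destroys the signal
    x = q_step(x, t)
```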

Reverse Process

By definition, the sampling/reverse process is a Markov chain with learned Gaussian transitions, starting at $p(x_T) := \mathcal{N}(x_T; 0, I)$:

$$
p(x_T) := \mathcal{N}(x_T; 0, I),\qquad
p_\theta(x_{t-1}|x_t) := \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\, I\big)
$$

$$
p_\theta(x_{0:T}) = p_\theta(x_0, x_1, \dots, x_T) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1}|x_t)
$$

Again, the $\prod$ here is not a factorial; it denotes the chained product of conditionals, as in normalizing flows.

Loss

The reverse process is parameterized by $\theta$. For the reverse process to produce high-fidelity results, we need parameters $\theta$ that bring $p_\theta(x_0)$ as close as possible to the data distribution $q_{\mathrm{data}}(x_0)$, i.e., we maximize the log-likelihood, which is equivalent to minimizing the negative log-likelihood.

The marginal $p_\theta(x_0)$ below is usually intractable to evaluate exactly. Since the sampling process starts from $p_{\mathrm{latent}}(x_T) := \mathcal{N}(x_T; 0, I)$, we have:

$$
p_\theta(x_0, x_1, \dots, x_{T-1}|x_T)\, p(x_T) = p_\theta(x_0, x_1, \dots, x_T)
$$

$$
p_\theta(x_0) = \int p_\theta(x_0, x_1, \dots, x_T)\, dx_1\, dx_2 \cdots dx_T = \int p_\theta(x_{0:T})\, dx_{1:T}
$$

$\theta$ is obtained by optimizing a variational bound on the log-likelihood, equivalently by minimizing the variational bound on the negative log-likelihood (in the SDE view, this forward diffusion corresponds to what is known as the variance-preserving forward SDE). The log-likelihood bound is derived as:

$$
\mathbb{E}_{q_{\mathrm{data}}(x_0)}\log p_\theta(x_0)
= \mathbb{E}_{q_{\mathrm{data}}(x_0)}\left(\log \mathbb{E}_{q(x_1,\dots,x_T|x_0)}\left[\frac{p_\theta(x_0,x_1,\dots,x_{T-1}|x_T)\,p(x_T)}{q(x_1,\dots,x_T|x_0)}\right]\right)
$$

Optimizing the usual variational bound (Jensen's inequality):

$$
\mathbb{E}_{q_{\mathrm{data}}(x_0)}\log p_\theta(x_0)
\ge \mathbb{E}_{q(x_0,\dots,x_T)}\log\left[\frac{p_\theta(x_0,x_1,\dots,x_{T-1}|x_T)\,p(x_T)}{q(x_1,\dots,x_T|x_0)}\right]
$$

$$
\mathbb{E}_{q_{\mathrm{data}}(x_0)}\log p_\theta(x_0)
\ge \mathbb{E}_{q(x_{0:T})}\log\left[\frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right] =: L
$$

The bound $L$ above can be further simplified to:

[figure: loss — the simplified form of the variational bound L]
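For reference, with the negative sign added (i.e. bounding the negative log-likelihood, the convention used in [3]), this bound decomposes into the familiar $L_T$, $L_{t-1}$ and $L_0$ terms:

$$
\mathbb{E}_q\Big[\underbrace{D_{\mathrm{KL}}\big(q(x_T|x_0)\,\|\,p(x_T)\big)}_{L_T}
+ \sum_{t>1}\underbrace{D_{\mathrm{KL}}\big(q(x_{t-1}|x_t, x_0)\,\|\,p_\theta(x_{t-1}|x_t)\big)}_{L_{t-1}}
\underbrace{-\log p_\theta(x_0|x_1)}_{L_0}\Big]
$$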

The detailed derivation is as follows (note: the derivation below does not include the negative sign, following [2]; for the version with the negative sign see [3], Extra information, Section A Extended derivations):

[figure: proof — step-by-step derivation of the bound L into per-timestep terms]

Note that computing the loss requires the following distributions:

$q(x_T|x_0)$, $q(x_{t-1}|x_t, x_0)$, $p_\theta(x_{t-1}|x_t)$, $p_\theta(x_0|x_1)$

The first two distributions are derived below; the third is the distribution proposed by DDPM [3], which makes DDPM resemble denoising score matching [4].

Derivations

$q(x_T|x_0)$

From the predefined diffusion process, for any timestep $t$ from 1 to $T$, the conditional distribution of $x_t$ given $x_0$ follows from $q(x_{1:T}|x_0) := \prod_{t=1}^{T} q(x_t|x_{t-1})$.

q(x3|x0)为例,即t=3时:

$$
\begin{aligned}
q(x_1|x_0) &:= \mathcal{N}\big(x_1;\, \sqrt{1-\beta_1}\,x_0,\, \beta_1 I\big), &\text{so } x_1 &= \sqrt{1-\beta_1}\,x_0 + \sqrt{\beta_1}\,\epsilon_1\\
q(x_2|x_1) &:= \mathcal{N}\big(x_2;\, \sqrt{1-\beta_2}\,x_1,\, \beta_2 I\big), &\text{so } x_2 &= \sqrt{1-\beta_2}\big(\sqrt{1-\beta_1}\,x_0 + \sqrt{\beta_1}\,\epsilon_1\big) + \sqrt{\beta_2}\,\epsilon_2\\
q(x_3|x_2) &:= \mathcal{N}\big(x_3;\, \sqrt{1-\beta_3}\,x_2,\, \beta_3 I\big), &\text{so } x_3 &= \sqrt{1-\beta_3}\Big[\sqrt{1-\beta_2}\big(\sqrt{1-\beta_1}\,x_0 + \sqrt{\beta_1}\,\epsilon_1\big) + \sqrt{\beta_2}\,\epsilon_2\Big] + \sqrt{\beta_3}\,\epsilon_3
\end{aligned}
$$

where $\epsilon_i \sim \mathcal{N}(0, I)$.

αt=1βt,αt=s=1tαs,则q(x3|x0)的均值,方差为

$$
\begin{aligned}
\text{mean} &= \sqrt{1-\beta_3}\,\sqrt{1-\beta_2}\,\sqrt{1-\beta_1}\;x_0 = \sqrt{\bar{\alpha}_3}\,x_0\\
\text{std}^2 &= \big(\sqrt{1-\beta_3}\,\sqrt{1-\beta_2}\,\sqrt{\beta_1}\big)^2 + \big(\sqrt{1-\beta_3}\,\sqrt{\beta_2}\big)^2 + \big(\sqrt{\beta_3}\big)^2\\
&= \alpha_3\alpha_2(1-\alpha_1) + \alpha_3(1-\alpha_2) + (1-\alpha_3)\\
&= 1-\alpha_1\alpha_2\alpha_3 = 1-\bar{\alpha}_3
\end{aligned}
$$

Therefore $q(x_3|x_0) := \mathcal{N}\big(x_3;\, \sqrt{\bar{\alpha}_3}\,x_0,\, (1-\bar{\alpha}_3)I\big)$. By the same argument, for all $t$:

$$
q(x_t|x_0) := \mathcal{N}\big(x_t;\, \sqrt{\bar{\alpha}_t}\,x_0,\, (1-\bar{\alpha}_t)I\big)
$$

Alternatively, the general iterative derivation goes as follows:

[figure: qt0 — general iterative derivation of q(x_t|x_0)]
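As a quick sanity check of this closed form, the NumPy sketch below (using the same illustrative schedule as before) compares the empirical mean and variance of many step-by-step chains against $\sqrt{\bar{\alpha}_t}\,x_0$ and $1-\bar{\alpha}_t$:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)        # alpha_bar[k-1] = alpha_1 * ... * alpha_k

t, x0 = 500, 2.0                      # timestep and a scalar "data" point
x = np.full(200_000, x0)
for s in range(t):                    # apply q(x_s | x_{s-1}) for s = 1 .. t
    x = np.sqrt(1.0 - betas[s]) * x + np.sqrt(betas[s]) * rng.standard_normal(x.shape)

print(x.mean(), np.sqrt(alpha_bar[t - 1]) * x0)   # both ~ sqrt(alpha_bar_t) * x0
print(x.var(),  1.0 - alpha_bar[t - 1])           # both ~ 1 - alpha_bar_t
```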

$q(x_{t-1}|x_t, x_0)$

Using Bayes' rule and the chain rule of probability (together with the Markov property), we get:

$$
q(x_{t-1}|x_t, x_0)
= \frac{q(x_{t-1}, x_t, x_0)}{q(x_t, x_0)}
= \frac{q(x_t|x_{t-1}, x_0)\, q(x_0, x_{t-1})}{q(x_t, x_0)}
= \frac{q(x_t|x_{t-1}, x_0)\, q(x_{t-1}|x_0)\, q(x_0)}{q(x_t|x_0)\, q(x_0)}
= \frac{q(x_t|x_{t-1}, x_0)\, q(x_{t-1}|x_0)}{q(x_t|x_0)}
$$

Writing out the (coordinate-wise independent) Gaussian densities and simplifying, the distribution of $x_{t-1}$ given $x_t, x_0$ becomes:

[figure: proof2 — simplification of q(x_{t-1}|x_t, x_0) by completing the square]

where $\tilde{\beta}_t$ denotes:

$$
\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t
$$
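Putting the pieces together, the result of this simplification (the standard posterior from [3], which the figure above works out and which is quoted again below) is:

$$
q(x_{t-1}|x_t, x_0) = \mathcal{N}\big(x_{t-1};\, \tilde{\mu}_t(x_t, x_0),\, \tilde{\beta}_t I\big),\qquad
\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t
$$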

$p_\theta(x_{t-1}|x_t)$

Since $L_{t-1}$ is the KL divergence between $q(x_{t-1}|x_t, x_0)$ and $p_\theta(x_{t-1}|x_t)$, and $q(x_{t-1}|x_t, x_0)$ has already been derived, we now need to pin down $p_\theta(x_{t-1}|x_t)$ so that the loss admits a closed-form calculation. DDPM [3] therefore specifies the forms of $\mu_\theta(x_t, t)$ and $\Sigma_\theta(x_t, t)$ in $p_\theta(x_{t-1}|x_t) := \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\, I)$, which yields a closed-form expression and connects DDPM to denoising score matching, as detailed below.

From the reverse-process definition above,

$$
p_\theta(x_{t-1}|x_t) := \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\, I\big)
$$

The terms of the loss that involve it are:

$$
L_{t-1} = D_{\mathrm{KL}}\big(q(x_{t-1}|x_t, x_0)\,\|\,p_\theta(x_{t-1}|x_t)\big),\qquad t > 1
$$

where, from the derivation above, $q(x_{t-1}|x_t, x_0)$ can be written compactly as:

$$
q(x_{t-1}|x_t, x_0) := \mathcal{N}\big(x_{t-1};\, \tilde{\mu}_t(x_t, x_0),\, \tilde{\beta}_t I\big)
\quad\text{where}\quad
\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t
$$
First, consider the std.

Ho found experimentally that setting $\Sigma_\theta(x_t, t) = \sigma_t^2 I$, with either $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde{\beta}_t$, gives similar results. The first choice is optimal for $x_0 \sim \mathcal{N}(0, I)$, and the second is optimal for $x_0$ deterministically set to one point. These are the two extreme choices, corresponding to upper and lower bounds on the reverse-process entropy for data with coordinatewise unit variance [1].

With either choice, $\Sigma_\theta(x_t, t)$ does not actually depend on $\theta$. Since the variance of $q(x_{t-1}|x_t, x_0)$ is also independent of $\theta$, the std terms of the two distributions contribute only a constant $C$ when plugged into the KL divergence.

In the experiments, the second choice is adopted.

Next, consider the mean.

Substituting the constant $C$ into $L_{t-1}$, it simplifies to:

[figure: lt-1 — L_{t-1} after substituting the constant C]
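For reference, the simplified form shown in the figure (the corresponding form in [3]) is the following: since both covariances are fixed, the KL between the two Gaussians reduces to a scaled squared distance between their means, up to the constant $C$:

$$
L_{t-1} = \mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\,\big\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\big\|^2\right] + C
$$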

From the results above:

$$
q(x_t|x_0) := \mathcal{N}\big(x_t;\, \sqrt{\bar{\alpha}_t}\,x_0,\, (1-\bar{\alpha}_t)I\big)
\;\Rightarrow\; x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon
\;\Rightarrow\; x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\big(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon\big)
$$

$$
\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t
$$

Substituting into the expression above gives:

[figure: lt-1-c — L_{t-1} − C rewritten in terms of ε via the reparameterization of x_t]

The expression above shows that $\mu_\theta$ must predict (approximate as closely as possible) $\frac{1}{\sqrt{\alpha_t}}\big(x_t(x_0,\epsilon) - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\big)$. Since $x_t(x_0,\epsilon)$ is the input of the reverse step and therefore already known (a noised sample), the proposed parameterization of $\mu_\theta$ only needs to estimate $\epsilon$; conveniently, the diffusion process is exactly the map that takes $x_0$ to (approximately) Gaussian noise $\epsilon$ in $t$ steps, which neatly embeds $\theta$ into the training procedure. $x_t$ is also taken as an input (since $\mu_\theta$ is a function of $x_t$ and $t$, this introduces no new free variable). Ho therefore defines the mean as:

[figure: l-1mean — Ho's parameterization of μ_θ(x_t, t)]
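For reference, the parameterization shown in the figure, i.e. the mean proposed in [3], is:

$$
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\Big)
$$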

where $\epsilon_\theta$ is a function approximator used to estimate $\epsilon$, i.e. the Gaussian noise, from $x_t$.

Putting It Together

With the std and mean defined, $x_{t-1}$ can be written explicitly as:

$$
x_{t-1} = \text{mean} + \text{std}\cdot z
= \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\Big) + \sigma_t z,
\qquad\text{where } z \sim \mathcal{N}(0, I)
$$

$x_{t-1}$ can be viewed either as a sample from the reverse process or as a sample produced by diffusion, so during training $\epsilon_\theta$ can be applied directly to the diffusion variables.
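A minimal PyTorch sketch of this reverse step. Here eps_model stands for the trained $\epsilon_\theta$ network and its calling convention is an assumption; betas, alphas, alpha_bar are 1-D tensors of the schedule quantities.

```python
import torch

@torch.no_grad()
def p_sample(eps_model, x_t, t, betas, alphas, alpha_bar):
    """One reverse step: sample x_{t-1} ~ p_theta(x_{t-1} | x_t). Timesteps are 1-indexed."""
    eps = eps_model(x_t, t)                                            # predicted noise epsilon_theta(x_t, t)
    mean = (x_t - betas[t - 1] / torch.sqrt(1 - alpha_bar[t - 1]) * eps) / torch.sqrt(alphas[t - 1])
    if t == 1:                                                         # the final step (t = 1) adds no noise
        return mean
    sigma_t = torch.sqrt(betas[t - 1])                                 # sigma_t^2 = beta_t for brevity; beta_tilde_t (the text's choice) also works
    return mean + sigma_t * torch.randn_like(x_t)
```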

The complete sampling procedure resembles Langevin dynamics with $\epsilon_\theta$ as a learned gradient of the data density. Meanwhile, Eq. 10 can be simplified to:

$$
\mathbb{E}_{x_0,\epsilon}\left[\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)}\,
\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t\big)\big\|^2\right]
$$

which resembles denoising score matching over multiple noise scales indexed by $t$. Since the expression above equals (one term of) the variational bound for the Langevin-like reverse process, we see that optimizing an objective resembling denoising score matching is equivalent to using variational inference to fit the finite-time marginal of a sampling chain resembling Langevin dynamics.
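In practice [3] drops the time-dependent weight in front of the norm and trains on the unweighted ("simplified") objective. A minimal PyTorch sketch of that training loss, drawing a fresh $t \sim \mathrm{Uniform}\{1,\dots,T\}$ per sample as noted after the algorithm figure below; eps_model and its signature are assumptions.

```python
import torch

def ddpm_loss(eps_model, x0, alpha_bar):
    """Simplified DDPM objective: predict the noise added to x_0 at a random timestep."""
    b = x0.shape[0]
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)              # a different t for every sample in the batch
    a_bar = alpha_bar[t].view(b, *([1] * (x0.dim() - 1)))        # reshape for broadcasting over image dims
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * eps   # closed-form q(x_t | x_0)
    return torch.mean((eps - eps_model(x_t, t)) ** 2)            # || eps - eps_theta(x_t, t) ||^2
```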

Alternatively, with a different simplification one can cancel $x_t$ and parameterize $\mu_\theta$ in terms of (a prediction of) $x_0$ instead, but the authors found experimentally that this produces lower-quality images.

In summary, DDPM's diffusion (training) and reverse (sampling) processes can be stated as follows:

[figure: ddpmalg — Algorithm 1 (Training) and Algorithm 2 (Sampling) of DDPM]

Note that during training, $t \sim \mathrm{Uniform}(\{1, \dots, T\})$, and different $t$ correspond to different $\beta_t$. In practice, within each iteration every sample in the batch gets its own $t$, and for the same sample a new random $t$ is drawn in every loop. This amounts to training Markov chains of different lengths each time; after enough iterations the whole chain has been traversed and trained many times over, with chains of every length pushed toward minimal loss.

E.g., in the DDPM paper, T = 1000, i.e. the chain has length 1000.

References

[1]. Deep Unsupervised Learning using Nonequilibrium Thermodynamics, 2015.

[2]. DiffWave: A Versatile Diffusion Model for Audio Synthesis, 2021.

[3]. Denoising Diffusion Probabilistic Models, 2020.

[4]. Generative Modeling by Estimating Gradients of the Data Distribution, 2019.