Diffusion

DDPM

The Ornstein–Uhlenbeck process (OUP) is essentially the only diffusion process that is both Gaussian and stationary. Originating as a model for the Brownian motion of a particle, it has found wide application in biology and elsewhere.
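For intuition, here is a minimal Euler–Maruyama simulation of the OUP (a sketch; the parameter values are illustrative):

```python
import numpy as np

def simulate_ou(theta=1.0, mu=0.0, sigma=0.5, x0=2.0, T=5.0, n_steps=1000, seed=0):
    """Euler-Maruyama simulation of dX_t = theta*(mu - X_t) dt + sigma dW_t."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.empty(n_steps + 1)
    x[0] = x0
    for i in range(n_steps):
        # Drift pulls the state back toward mu; the diffusion term adds Gaussian noise.
        x[i + 1] = x[i] + theta * (mu - x[i]) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return x

path = simulate_ou()
print(path[-1])  # hovers near mu once the process has mixed
```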
A Complete Walkthrough of DDPM (2021-09-14, by Yue Zhou)
Covers the terminology and preliminaries, the diffusion and reverse processes, and the loss derivation through q(x_T|x_0), q(x_{t-1}|x_t,x_0), and p_θ(x_{t-1}|x_t), handling the std first, then the mean, then combining the two.
https://yuezhou-oh.github.io/blog/paperreading/Understanding_diffusion_model.html
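For reference, the closed-form forward step $q(x_t\mid x_0)$ and the simplified DDPM training loss can be sketched as follows (the linear beta schedule follows the DDPM paper; `eps_model` is a placeholder for a real UNet, which would also condition on $t$):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear schedule from the DDPM paper
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)       # \bar{alpha}_t

def q_sample(x0, t, noise):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    a = alpha_bar[t].sqrt().view(-1, 1)
    s = (1.0 - alpha_bar[t]).sqrt().view(-1, 1)
    return a * x0 + s * noise

# Simplified DDPM objective: predict the injected noise with an MSE loss.
eps_model = torch.nn.Linear(2, 2)              # placeholder; a real model also takes t
x0 = torch.randn(16, 2)
t = torch.randint(0, T, (16,))
noise = torch.randn_like(x0)
x_t = q_sample(x0, t, noise)
loss = torch.nn.functional.mse_loss(eps_model(x_t), noise)
loss.backward()
```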

Evaluation:

DDIM

Denoising Diffusion Implicit Models (DDIM): a deterministic sampling method

💡
DDIM keeps the target noising/diffusion distribution identical to DDPM's, $q(x_t\mid x_0)=\mathcal N\big(x_t;\sqrt{\bar\alpha_t}\,x_0,\,(1-\bar\alpha_t)I\big)$, but constructs a new sampling distribution $q(x_{t-1}\mid x_t,x_0)$ to replace the Markov chain. This removes the dependence on a long Markov chain, lets the sampling trajectory control the reverse generation path, and speeds up generation, trading computation against sample quality.
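A minimal sketch of one DDIM update, assuming a trained noise predictor `eps_model` and the cumulative schedule `alpha_bar`; setting `sigma=0` gives the fully deterministic sampler:

```python
import torch

def ddim_step(x_t, t, t_prev, eps_model, alpha_bar, sigma=0.0):
    """One DDIM update x_t -> x_{t_prev}; sigma=0.0 is deterministic."""
    eps = eps_model(x_t, t)
    # Predict x_0 from the noise estimate (inverting q(x_t | x_0)).
    x0_pred = (x_t - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
    # Re-noise toward step t_prev along the chosen trajectory.
    dir_xt = (1 - alpha_bar[t_prev] - sigma**2).sqrt() * eps
    noise = sigma * torch.randn_like(x_t) if sigma > 0 else 0.0
    return alpha_bar[t_prev].sqrt() * x0_pred + dir_xt + noise
```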
Denoising Diffusion Implicit Models
Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a...
https://arxiv.org/abs/2010.02502

Classifier Guidance Diffusion

The paper opens with an accessible review and theoretical illustration of the background and previous work. Note that most papers now write the log Gaussian density in terms of the covariance matrix: $\log p(x) = -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) + \text{const}$.
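As a quick sanity check (a sketch using illustrative values), the quadratic form above matches scipy's log-density once the normalizing constant is added back:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.5, -1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([1.0, 0.0])

quad = -0.5 * (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
const = -0.5 * (len(mu) * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma)))
assert np.isclose(quad + const, multivariate_normal(mu, Sigma).logpdf(x))
```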

During training, a separate classifier $p_\phi(y\mid x_t)$ is trained on noised images; during inference, its gradient $\nabla_{x_t}\log p_\phi(y\mid x_t)$ shifts the reverse-step mean toward the desired class.

Diffusion Models Beat GANs on Image Synthesis
We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models. We achieve this on unconditional image synthesis by finding a better...
https://arxiv.org/abs/2105.05233

Classifier-Free Diffusion Guidance

https://openreview.net/pdf?id=qw8AKxfYbI
💡
The core of Classifier-Free Guidance is to replace the explicit classifier with an implicit one, so that neither the explicit classifier nor its gradient has to be computed directly.

Classifier Guidance, which uses an explicit classifier to steer conditional generation, has several problems:

In initial experiments with unconditional ImageNet models, we found it necessary to scale the classifier gradients by a constant factor larger than 1. When using a scale of 1, we observed that the classifier assigned reasonable probabilities (around 50%) to the desired classes for the final samples, but these samples did not match the intended classes upon visual inspection. Scaling up the classifier gradients remedied this problem, and the class probabilities from the classifier increased to nearly 100%.
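A sketch of the guided step described above: the classifier gradient, scaled by a factor `s` larger than 1, shifts the reverse-step mean toward the target class (`classifier` is a placeholder for a noise-aware classifier; `mu` and `Sigma` come from the unguided reverse step):

```python
import torch

def classifier_guided_mean(x_t, t, y, mu, Sigma, classifier, s=10.0):
    """Shift the reverse-step mean by s * Sigma * grad_x log p(y | x_t)."""
    x_t = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x_t, t), dim=-1)
    selected = log_probs[torch.arange(len(y)), y].sum()
    grad = torch.autograd.grad(selected, x_t)[0]
    # Sigma is the (diagonal) variance of the unguided reverse step.
    return mu + s * Sigma * grad
```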

In 2022 Google proposed the Classifier-Free Guidance diffusion scheme, which sidesteps the problems above; by tuning the guidance weight, one can control the balance between fidelity and diversity of the generated images. Models such as DALL·E 2 and Imagen are trained and run on top of it.
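The guidance itself is a single extrapolation between a conditional and an unconditional noise prediction (a sketch; the `eps_model` signature and the `null_token` placeholder are assumptions, and `w` is the guidance weight):

```python
import torch

def cfg_eps(x_t, t, cond, eps_model, null_token, w=7.5):
    """Classifier-free guidance: extrapolate conditional vs. unconditional noise."""
    eps_cond = eps_model(x_t, t, cond)          # conditioned on the prompt/class
    eps_uncond = eps_model(x_t, t, null_token)  # same model, condition dropped
    # Larger w -> closer adherence to the condition, less sample diversity.
    return eps_uncond + w * (eps_cond - eps_uncond)
```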

Efficient sampling

Generation speed is dominated by the number of sampling steps and can be improved from several directions:

https://arxiv.org/pdf/2209.00796.pdf
Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models
One of the key components within diffusion models is the UNet for noise prediction. While several works have explored basic properties of the UNet decoder, its encoder largely remains unexplored....
https://arxiv.org/abs/2312.09608

Transformer-based Diffusion

Stable Diffusion

The Illustrated Stable Diffusion
Translations: Chinese, Vietnamese. (V2 Nov 2022: updated images for a more precise description of forward diffusion.) A gentle introduction to how Stable Diffusion works. Stable Diffusion is versatile: the post focuses first on image generation from text alone (text2img), then on altering existing images (inputs of text + image).
https://jalammar.github.io/illustrated-stable-diffusion/
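With Hugging Face diffusers, the text2img path described above reduces to a few lines (a usage sketch; the checkpoint name is one common choice among several):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Text encoder -> latent diffusion in the VAE latent space -> VAE decode.
image = pipe("a photograph of an astronaut riding a horse",
             num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("astronaut.png")
```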

DALL.E 2

OpenAI's DALL·E 2 is a language-driven AI image generator that creates high-quality images and artwork from text prompts.

CLIP+Diffusion

https://cdn.openai.com/papers/dall-e-2.pdf

Imagen

Google Text-to-Image Diffusion Model

https://arxiv.org/pdf/2205.11487.pdf

Sora

Sora is a generalist model of visual data—it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high definition video.

Video generation models as world simulators
We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
https://openai.com/research/video-generation-models-as-world-simulators
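A toy illustration of the spacetime-patch idea (a sketch only; Sora's actual latent shapes and patch sizes are not public):

```python
import torch

def spacetime_patchify(latents, pt=2, ph=4, pw=4):
    """Cut a video latent (B, T, H, W, C) into a sequence of spacetime patches."""
    B, T, H, W, C = latents.shape
    x = latents.reshape(B, T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7)            # group the patch dims together
    return x.reshape(B, -1, pt * ph * pw * C)        # (B, num_patches, patch_dim)

tokens = spacetime_patchify(torch.randn(1, 8, 32, 32, 4))
print(tokens.shape)  # torch.Size([1, 256, 128])
```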

An Overview of Diffusion Models: Applications, Guided Generation,...
Diffusion models, a powerful and universal generative AI technology, have achieved tremendous success in computer vision, audio, reinforcement learning, and computational biology. In these...
https://arxiv.org/abs/2404.07771