Diffusion Model Notes

Generative Model

Four families of deep-learning-based generative models:

- Variational Autoencoders (VAE)
- Flow-based models
- Generative Adversarial Networks (GAN)
- Diffusion models

Each of the four has its own drawbacks, as the table below shows:

| | VAE | Flow | GAN | Diffusion |
|---|---|---|---|---|
| Pros | Fast sampling; diverse samples | Fast sampling; diverse samples | Fast sampling; high sample quality | High sample quality; diverse samples |
| Cons | Low sample quality | Requires a specialized architecture; low sample quality | Unstable training; low sample diversity (mode collapse) | Slow sampling |

Diffusion Model

Core idea

…systematically and slowly destroy the structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data. This approach allows us to rapidly learn, sample from, and evaluate probabilities in deep generative models …

Forward process (data to noise):
- Start from the original image $x_0$
- Add noise $T$ times, $x_{t-1} + \epsilon_{t-1} \to x_t$, ending at $x_T$
- No model is involved
Backward process (noise to data):
- A model $\theta$ is involved: $\theta : (x_t, t) \to \hat{\epsilon}_{t-1}$

DDPM

NOTE

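DDPM is a subclass of diffusion models, characterized by:

Predicting noise rather than the image: A general diffusion model could try to predict, at each step, the slightly cleaner previous image. DDPM found that having the neural network (typically a U-Net) directly predict the noise ($\epsilon$) that has been added to the current image works surprisingly well and is also more stable to optimize.

A fixed noising process: DDPM uses a Markov chain setup in which the amount of noise added at each step (the variance schedule) is fixed in advance rather than learned, which greatly reduces training difficulty.

A simplified loss function: Through some careful algebra, the DDPM authors dropped the hard-to-compute terms of the variational lower bound (VLB) and arrived at a very simple mean squared error (MSE) loss. The model only needs to compare the "true noise" with the "predicted noise".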

Forward process

$$
\begin{aligned}
q(x_1, x_2, \dots, x_T \mid x_0) &= \prod_{t=1}^{T} q(x_t \mid x_{t-1}) \\
q(x_t \mid x_{t-1}) &= \mathcal{N}\big(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I\big)
\end{aligned}
$$

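How is the coefficient $\sqrt{1-\beta_t}$ determined?

It is chosen to preserve the variance. Reparameterize $x_t$ as $x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, z,\ z \sim \mathcal{N}(0, I)$; the variance of $x_t$ is then:
$\mathrm{Var}(x_t) = \mathrm{Var}(\sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, z)$
Assume that $x_{t-1}$ already has variance approximately $I$ and that it is independent of the newly added noise $z$. By the properties of variance:
$\mathrm{Var}(x_t) = (\sqrt{1-\beta_t})^{2}\, \mathrm{Var}(x_{t-1}) + (\sqrt{\beta_t})^{2}\, \mathrm{Var}(z) = (1-\beta_t) \cdot I + \beta_t \cdot I = I$
This is a necessary condition for $x_t$ to eventually converge to $\mathcal{N}(0, I)$.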
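The variance argument can be checked numerically. The sketch below is only an illustration: the function name `noising_step`, the sample size, and the constant `beta = 0.02` are assumptions chosen for the demo, not values from the note.

```python
import numpy as np

def noising_step(x_prev, beta, rng):
    """One forward step: x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * z."""
    z = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta) * x_prev + np.sqrt(beta) * z

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)      # stand-in for x_{t-1} with unit variance
for _ in range(50):                   # apply 50 noising steps with a constant beta
    x = noising_step(x, beta=0.02, rng=rng)

var_after = float(x.var())
print(var_after)                      # stays close to 1 up to sampling error
```

If the coefficient were $1-\beta_t$ instead of $\sqrt{1-\beta_t}$, the same loop would shrink the variance step by step rather than hold it at 1.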

In practice, the DDPM authors let $\beta_t$ range over $[10^{-4}, 0.02]$ with a total of $T = 1000$ steps, increasing linearly, as shown in the figure below:
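A minimal sketch of such a linear schedule, assuming the endpoints $10^{-4}$ and $0.02$ reported in the DDPM paper. The variable names are illustrative and do not come from the note.

```python
import numpy as np

T = 1000
# Assumed endpoints (DDPM paper): beta_1 = 1e-4, beta_T = 0.02, linearly spaced.
betas = np.linspace(1e-4, 0.02, T)       # betas[t - 1] holds beta_t

# Fraction of the original signal variance left after t steps: prod_{i<=t} (1 - beta_i).
signal_kept = np.cumprod(1.0 - betas)
print(betas[0], betas[-1], signal_kept[-1])
```

Under these assumptions the product at $t = T$ is on the order of $10^{-5}$, so almost none of the original signal survives and $x_T$ can be treated as pure noise.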

To avoid having to run the forward process $t$ times just to obtain $x_t$, we can derive $q(x_t \mid x_0)$ in closed form:

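$$
\begin{aligned}
x_1 &= \sqrt{1-\beta_1}\, x_0 + \sqrt{\beta_1}\, \epsilon_0 \\
x_2 &= \sqrt{1-\beta_2}\, x_1 + \sqrt{\beta_2}\, \epsilon_1 \\
&= \sqrt{(1-\beta_1)(1-\beta_2)}\, x_0 + \left(\sqrt{\beta_1 (1-\beta_2)}\, \epsilon_0 + \sqrt{\beta_2}\, \epsilon_1\right)
\end{aligned}
$$

Here $\epsilon_0, \epsilon_1 \sim \mathcal{N}(0, I)$ are independent, so the two noise terms merge into a single Gaussian:

$$
\sqrt{\beta_1 (1-\beta_2)}\, \epsilon_0 + \sqrt{\beta_2}\, \epsilon_1 \sim \mathcal{N}\big(0, (\beta_1 (1-\beta_2) + \beta_2) I\big)
$$

Writing $\alpha_t = 1 - \beta_t$, we have

$$
\begin{aligned}
\beta_1 (1-\beta_2) + \beta_2 &= (1-\alpha_1)\alpha_2 + (1-\alpha_2) = 1 - \alpha_1 \alpha_2 \\
(1-\beta_1)(1-\beta_2) &= \alpha_1 \alpha_2
\end{aligned}
$$

Further writing $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$, it follows that

$$
x_2 = \sqrt{\bar{\alpha}_2}\, x_0 + \sqrt{1-\bar{\alpha}_2}\, \epsilon_1', \qquad \epsilon_1' \sim \mathcal{N}(0, I)
$$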

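In the same way, for general $t$:

$$
\begin{aligned}
x_t &= \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \\
\Longrightarrow\quad q(x_t \mid x_0) &= \mathcal{N}\big(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t) I\big)
\end{aligned}
$$

This gives $x_t$ in a single step, without repeatedly running the forward process.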
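The closed form can be compared against the step-by-step forward process. This is a sketch under stated assumptions: a linear schedule with endpoints $10^{-4}$ and $0.02$, a toy constant "image", and the hypothetical helper names `q_sample` and `q_sample_iterative`.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # assumed linear schedule; betas[t - 1] is beta_t
alpha_bars = np.cumprod(1.0 - betas)      # alpha_bars[t - 1] is \bar{alpha}_t

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in one step, for t in 1..T. Also returns the noise used."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps, eps

def q_sample_iterative(x0, t, rng):
    """Reference implementation: apply the single-step transition t times."""
    x = x0
    for beta in betas[:t]:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
x0 = np.full(100_000, 2.0)                # toy data: a constant signal
t = 300
x_direct, _ = q_sample(x0, t, rng)
x_iter = q_sample_iterative(x0, t, rng)
print(x_direct.mean(), x_iter.mean())     # both estimate sqrt(alpha_bar_t) * 2
print(x_direct.var(), x_iter.var())       # both estimate 1 - alpha_bar_t
```

The two routes draw from the same distribution, but the one-step version costs a single noise draw, which is what makes training on a randomly chosen $t$ cheap.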

Backward process

(The expression for $q(x_t \mid x_0)$ in the figure is incorrect.)

In 1949, W. Feller showed that, for Gaussian (and binomial) distributions, the diffusion process’s reversal has the same functional form as the forward process. This means we can assume that the backward process $(x_t, t) \to x_{t-1}$ is also Gaussian.

We need a model $\theta$, i.e. $p_\theta(x_{t-1} \mid x_t)$. Based on the statement above, we can assume

$$
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\big)
$$
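Once a mean and a covariance are available, generation amounts to sampling the chain from $x_T$ down to $x_0$. The sketch below assumes a diagonal covariance $\sigma_t^2 I$ and, following the sampling procedure in the DDPM paper, adds no noise on the final step. `mean_fn` and `std_fn` are hypothetical callables standing in for a trained model. For the check, the data is taken to be a single constant `c`, so the exact reverse mean is the forward-process posterior mean evaluated at $x_0 = c$.

```python
import numpy as np

def ancestral_sample(mean_fn, std_fn, shape, T, rng):
    """Run x_T -> x_0 using p(x_{t-1} | x_t) = N(mean_fn(x_t, t), std_fn(t)^2 I)."""
    x = rng.standard_normal(shape)                 # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        # Assumption: no noise is added on the final step (t = 1).
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mean_fn(x, t) + std_fn(t) * z
    return x

# Toy setup (all values below are assumptions for the demo).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
alpha_bars_prev = np.concatenate([[1.0], alpha_bars[:-1]])   # \bar{alpha}_0 = 1
c = 3.0

def oracle_mean(xt, t):
    """Posterior mean of q(x_{t-1} | x_t, x_0 = c)."""
    i = t - 1
    denom = 1.0 - alpha_bars[i]
    return (np.sqrt(alpha_bars_prev[i]) * betas[i] / denom * c
            + np.sqrt(alphas[i]) * (1.0 - alpha_bars_prev[i]) / denom * xt)

def oracle_std(t):
    """Posterior standard deviation of q(x_{t-1} | x_t, x_0)."""
    i = t - 1
    return np.sqrt((1.0 - alpha_bars_prev[i]) / (1.0 - alpha_bars[i]) * betas[i])

samples = ancestral_sample(oracle_mean, oracle_std, (5,), T, np.random.default_rng(0))
print(samples)   # every entry should land on c
```

With a learned model, `mean_fn` would wrap the network's prediction instead of this oracle.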

Loss Function

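The insight behind the loss is to maximize the log-likelihood that the reverse process assigns to $x_0$ drawn from the original data distribution, i.e. to minimize

$$
L = \mathbb{E}_{q(x_0)}\big[-\log p_\theta(x_0)\big]
$$

This is the same objective as in VAE, so we can follow VAE and use variational inference, introducing

$$
D_{KL}\big(q(x_{1:T} \mid x_0) \,\|\, p_\theta(x_{1:T} \mid x_0)\big)
$$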

NOTE

Recall that the VAE loss derivation introduces $D_{KL}\big(q(z \mid x) \,\|\, p_\theta(z \mid x)\big)$. In a diffusion model, $x_{1:T}$ plays the role of the VAE latent variable $z$.

$$
\begin{aligned}
0 &\le D_{KL}\big(q(x_{1:T} \mid x_0) \,\|\, p_\theta(x_{1:T} \mid x_0)\big) \\
&= \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{1:T} \mid x_0)}\right] \\
&= \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} + \log p_\theta(x_0)\right] \\
&= \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right] + \log p_\theta(x_0) \\
\Longrightarrow\quad -\log p_\theta(x_0) &\le \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right] \\
\Longrightarrow\quad L &\le \mathbb{E}_{q(x_0)}\left[\mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right]\right] = \mathbb{E}_{q(x_{0:T})}\left[\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right]
\end{aligned}
$$

After further expansion, the loss can be summarized as:
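For reference, the decomposition given in the DDPM paper, which is where the names $L_T$, $L_{t-1}$ and $L_0$ come from, can be written in the notation of this note as follows. It is a restatement of that paper's result, assuming the prior $p(x_T) = \mathcal{N}(0, I)$, and not a new derivation:

$$
\mathbb{E}_{q}\Big[
\underbrace{D_{KL}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big)}_{L_T}
+ \sum_{t > 1} \underbrace{D_{KL}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}}
\underbrace{{} - \log p_\theta(x_0 \mid x_1)}_{L_0}
\Big]
$$

Each $L_{t-1}$ compares two Gaussians, so it can be evaluated in closed form.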

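This looks fairly complicated, so the original authors simplified it as follows:

- Drop $L_0$, which gave better results in their experiments
- Drop $L_T$, since it does not involve the network parameters

That leaves only $L_{t-1}$, whose target is the forward-process posterior conditioned on $x_0$:

$$
\begin{aligned}
q(x_{t-1} \mid x_t, x_0) &= \mathcal{N}\big(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I\big), \\
\text{where}\quad \tilde{\mu}_t(x_t, x_0) &:= \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\, (1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t \\
\text{and}\quad \tilde{\beta}_t &:= \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\, \beta_t
\end{aligned}
$$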
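The posterior mean and variance translate directly into code. The sketch below assumes a linear schedule with endpoints $10^{-4}$ and $0.02$ and the convention $\bar{\alpha}_0 = 1$ (an empty product). `q_posterior` is a hypothetical helper name.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # assumed linear schedule; betas[t - 1] is beta_t
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
# \bar{alpha}_{t-1}, using the convention \bar{alpha}_0 = 1.
alpha_bars_prev = np.concatenate([[1.0], alpha_bars[:-1]])

def q_posterior(x0, xt, t):
    """Mean and (scalar) variance of q(x_{t-1} | x_t, x_0) for t in 1..T."""
    i = t - 1
    denom = 1.0 - alpha_bars[i]
    coef_x0 = np.sqrt(alpha_bars_prev[i]) * betas[i] / denom
    coef_xt = np.sqrt(alphas[i]) * (1.0 - alpha_bars_prev[i]) / denom
    mean = coef_x0 * x0 + coef_xt * xt
    var = (1.0 - alpha_bars_prev[i]) / denom * betas[i]
    return mean, var

x0 = np.array([1.0, -2.0])
xt = np.array([0.3, 0.7])
mean_1, var_1 = q_posterior(x0, xt, 1)       # at t = 1 the posterior collapses onto x_0
mean_500, var_500 = q_posterior(x0, xt, 500)
print(mean_1, var_1, var_500, betas[499])
```

Two sanity checks follow from the formulas: at $t = 1$ the mean equals $x_0$ with zero variance, and for every $t$ the posterior variance $\tilde{\beta}_t$ is no larger than $\beta_t$.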

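In experiments, the authors found that fixing the variance in $p_\theta$ has little effect on the results, i.e.

$$
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^{2} I\big)
$$

from which one can derive

$$
L_{t-1} = \mathbb{E}_{q}\left[\frac{1}{2\sigma_t^{2}} \big\| \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \big\|^{2}\right] + C
$$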

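Expressing $x_0$ in terms of $x_t$ and $\epsilon$, i.e. $x_0 = (x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon) / \sqrt{\bar{\alpha}_t}$, turns the expression above into a per-step objective on the noise $\epsilon$ (the coefficient in front of the norm is dropped here):

$$
\mathbb{E}_{t,\, x_0 \sim q,\, \epsilon \sim \mathcal{N}(0, I)}\left[\big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t\big) \big\|^{2}\right]
$$

that is,

$$
\mathbb{E}_{t,\, x_0,\, \epsilon}\left[\big\| \epsilon - \epsilon_\theta(x_t, t) \big\|^{2}\right]
$$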
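As a rough illustration, the simplified objective can be estimated by Monte Carlo over a batch. The sketch assumes a linear schedule with endpoints $10^{-4}$ and $0.02$ and $t$ drawn uniformly from $1..T$. `eps_model` is a hypothetical placeholder that always predicts zero, where a real setup would use a trained network such as a U-Net.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)         # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)       # alpha_bars[t - 1] is \bar{alpha}_t

def eps_model(xt, t):
    """Hypothetical stand-in for eps_theta(x_t, t): always predicts zero noise."""
    return np.zeros_like(xt)

def simple_loss(x0_batch, rng, model=eps_model):
    """Monte Carlo estimate of E_{t, x0, eps} || eps - eps_theta(x_t, t) ||^2."""
    n = x0_batch.shape[0]
    t = rng.integers(1, T + 1, size=n)                  # one timestep per sample
    eps = rng.standard_normal(x0_batch.shape)           # the "true" noise
    # Broadcast \bar{alpha}_t over the non-batch dimensions.
    a_bar = alpha_bars[t - 1].reshape(n, *([1] * (x0_batch.ndim - 1)))
    xt = np.sqrt(a_bar) * x0_batch + np.sqrt(1.0 - a_bar) * eps
    sq_err = (eps - model(xt, t)) ** 2
    return float(sq_err.reshape(n, -1).sum(axis=1).mean())

rng = np.random.default_rng(0)
x0_batch = rng.standard_normal((4096, 8))  # toy batch: 4096 samples of dimension 8
loss = simple_loss(x0_batch, rng)
print(loss)   # for the zero predictor this estimates E||eps||^2, the data dimension
```

A training loop would compute this quantity with a differentiable model and take a gradient step on it. The zero predictor here only serves to show the shape of the computation.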


Kinnari
