|
290 | 290 | " * **DownBlock**: Downsampling using stride-2 convolution.\n", |
291 | 291 | " * **UpBlock**: Upsampling using bilinear interpolation followed by convolution.\n", |
292 | 292 | "\n", |
293 | | - "3. **U-Net in `02.vae-without-encoder.ipynb`**:\n", |
| 293 | + "3. **U-Net in [`02.vae-without-encoder.ipynb`](./02.vae-without-encoder.ipynb)**:\n", |
294 | 294 | " * The `Unet` class implements this straightforward architecture.\n", |
295 | 295 | " * Crucially, for this single-step model, the U-Net **does not use time embeddings**. The corruption is fixed (a single $\\alpha$ value), so the network doesn't need to adapt to different noise levels. Its task is to denoise $x_1$, whose noise level is always the same, determined by the chosen $\\alpha$.\n", |
296 | 296 | "\n", |
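The blocks described in this cell map directly to a few lines of PyTorch. Below is a minimal sketch under stated assumptions: the class names mirror the ones mentioned above, but the channel sizes, activations, and exact layer arguments are illustrative rather than the notebook's actual definitions.

```python
# Minimal sketch (not the notebook's exact classes) of the building blocks
# described above, assuming PyTorch and illustrative channel sizes.
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """Downsample by 2x using a stride-2 convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.conv(x))

class UpBlock(nn.Module):
    """Upsample by 2x with bilinear interpolation, then refine with a convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.conv(self.up(x)))

# Because the corruption level (a single fixed alpha) never changes, a U-Net
# built from these blocks needs no time-embedding input: every forward pass
# sees the same noise level.
```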
|
4140 | 4140 | }, |
4141 | 4141 | { |
4142 | 4142 | "cell_type": "markdown", |
4143 | | - "id": "1dbe51d7", |
| 4143 | + "id": "28e0614a", |
4144 | 4144 | "metadata": {}, |
4145 | 4145 | "source": [ |
4146 | | - "As you can see, it kinda learns how to generate the digits, but not really. One step diffusion is not easy. Even with a better neural network (U-Net vs. Conv) that is often used in diffusion, we are not getting better results.\n", |
| 4146 | + "As you can see, the model learns *some* structure, but struggles to generate realistic digits — even when using a strong architecture like U-Net.\n", |
4147 | 4147 | "\n", |
4148 | | - "That's why multi step diffusion models are more common. Also the problem is that corrupting $x_{0}$ with noise to make $x_{1}$, won't make it a normal Gaussian, unless $\\alpha$ is 0. But if it's set to 0, then the decoder can't learn anything since there is no signal in $x_{1}$, but only noise." |
| 4148 | + "This highlights a fundamental challenge: **one-step denoising is hard**.\n", |
| 4149 | + "\n", |
| 4150 | + "In multi-step diffusion models like DDPM, the model solves a **sequence of easier sub-problems** — gradually denoising from a high-noise image to a clean one. But in this one-step setup, the model has to learn to **jump all the way from noise to signal in a single step**.\n", |
| 4151 | + "\n", |
| 4152 | + "Another challenge is that, for the corrupted input $x_1 = \\sqrt{\\alpha} x_0 + \\sqrt{1 - \\alpha} \\epsilon$, the distribution of $x_1$ only resembles a standard Gaussian $\\mathcal{N}(0, I)$ **when $\\alpha \\to 0$**. But when $\\alpha$ is near 0, the model sees **almost no signal** from $x_0$ — it’s all noise. On the other hand, if $\\alpha$ is too high, the latent $x_1$ carries more signal but **deviates from the Gaussian prior**, which can hurt generation quality.\n", |
| 4153 | + "\n", |
| 4154 | + "This tension makes one-step models hard to train and sample from. **Multi-step diffusion models strike a better balance**: they allow the model to progressively refine the sample, without needing to generate clean images from scratch in one step.\n" |
4149 | 4155 | ] |
4150 | 4156 | } |
4151 | 4157 | ], |
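To make the $\alpha$ trade-off discussed in that cell concrete, here is a hedged sketch of the fixed-$\alpha$ corruption and the one-step sampling it implies. The function names, the `model` argument, and the particular `alpha` value are assumptions for illustration, not names taken from the notebook.

```python
# Sketch of the fixed-alpha corruption x1 = sqrt(alpha)*x0 + sqrt(1-alpha)*eps
# and single-step sampling; names and the alpha value are illustrative.
import torch

alpha = 0.3  # hypothetical fixed corruption level

def corrupt(x0, alpha):
    """Corrupt a clean batch x0 with Gaussian noise at a fixed alpha."""
    eps = torch.randn_like(x0)
    return alpha**0.5 * x0 + (1 - alpha)**0.5 * eps

@torch.no_grad()
def sample_one_step(model, shape, device="cpu"):
    """Sampling pretends x1 ~ N(0, I), which only holds approximately unless
    alpha is close to 0 -- the tension described above."""
    x1 = torch.randn(shape, device=device)
    return model(x1)  # single jump from (approximate) noise to image
```

The sketch makes the tension visible: training sees `corrupt(x0, alpha)`, which is only Gaussian-like when `alpha` is small, while sampling starts from pure Gaussian noise; the mismatch grows with `alpha`, yet small `alpha` leaves the model almost no signal to learn from.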
|