
Commit b05b889

Update README.md to enhance clarity and detail on VAE and diffusion models
- Improved formatting for better readability in the implementation sections.
- Added a comparison of classical CNN and U-Net architectures.
- Clarified the beta schedule description in the diffusion model section.
- Expanded the "Next" section to include more specific future directions for research.
1 parent 5b1434a commit b05b889


3 files changed: +1087 -989 lines


02.vae-without-encoder.ipynb

Lines changed: 10 additions & 4 deletions
@@ -290,7 +290,7 @@
 " * **DownBlock**: Downsampling using stride-2 convolution.\n",
 " * **UpBlock**: Upsampling using bilinear interpolation followed by convolution.\n",
 "\n",
-"3. **U-Net in `02.vae-without-encoder.ipynb`**:\n",
+"3. **U-Net in [`02.vae-without-encoder.ipynb`](./02.vae-without-encoder.ipynb)**:\n",
 " * The `Unet` class implements this straightforward architecture.\n",
 " * Crucially, for this single-step model, the U-Net **does not use time embeddings**. The corruption is fixed (a single $\\alpha$ value), so the network doesn't need to adapt to different noise levels. Its task is to denoise $x_1$ which always has similar noise characteristics defined by the chosen $\\alpha$.\n",
 "\n",
@@ -4140,12 +4140,18 @@
 },
 {
 "cell_type": "markdown",
-"id": "1dbe51d7",
+"id": "28e0614a",
 "metadata": {},
 "source": [
-"As you can see, it kinda learns how to generate the digits, but not really. One step diffusion is not easy. Even with a better neural network (U-Net vs. Conv) that is often used in diffusion, we are not getting better results.\n",
+"As you can see, the model learns *some* structure, but struggles to generate realistic digits — even when using a strong architecture like U-Net.\n",
 "\n",
-"That's why multi step diffusion models are more common. Also the problem is that corrupting $x_{0}$ with noise to make $x_{1}$, won't make it a normal Gaussian, unless $\\alpha$ is 0. But if it's set to 0, then the decoder can't learn anything since there is no signal in $x_{1}$, but only noise."
+"This highlights a fundamental challenge: **one-step denoising is hard**.\n",
+"\n",
+"In multi-step diffusion models like DDPM, the model solves a **sequence of easier sub-problems** — gradually denoising from a high-noise image to a clean one. But in this one-step setup, the model has to learn to **jump all the way from noise to signal in a single step**.\n",
+"\n",
+"Another challenge is that, for the corrupted input $x_1 = \\sqrt{\\alpha} x_0 + \\sqrt{1 - \\alpha} \\epsilon$, the distribution of $x_1$ only resembles a standard Gaussian $\\mathcal{N}(0, I)$ **when $\\alpha \\to 0$**. But when $\\alpha$ is near 0, the model sees **almost no signal** from $x_0$ — it’s all noise. On the other hand, if $\\alpha$ is too high, the latent $x_1$ carries more signal but **deviates from the Gaussian prior**, which can hurt generation quality.\n",
+"\n",
+"This tension makes one-step models hard to train and sample from. **Multi-step diffusion models strike a better balance**: they allow the model to progressively refine the sample, without needing to generate clean images from scratch in one step.\n"
 ]
 }
],
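The signal-versus-prior tension described in the new cell can be observed numerically. A minimal sketch (assuming PyTorch; the `corrupt` helper and the batch shape are illustrative, not taken from the notebook):

```python
import torch


def corrupt(x0: torch.Tensor, alpha: float) -> torch.Tensor:
    """One-step corruption: x1 = sqrt(alpha) * x0 + sqrt(1 - alpha) * eps."""
    eps = torch.randn_like(x0)
    return alpha ** 0.5 * x0 + (1.0 - alpha) ** 0.5 * eps


# Stand-in batch of "images" with pixel values in [0, 1].
x0 = torch.rand(64, 1, 28, 28)

for alpha in (0.01, 0.3, 0.9):
    x1 = corrupt(x0, alpha)
    # alpha -> 0: x1 approaches N(0, I) but carries almost no signal from x0.
    # alpha -> 1: x1 keeps the signal but drifts away from the Gaussian prior.
    print(f"alpha={alpha:.2f}  mean={x1.mean().item():+.3f}  std={x1.std().item():.3f}")
```

Running the loop shows the statistics of $x_1$ drifting away from mean 0 and standard deviation 1 as $\alpha$ grows, which is exactly the deviation from the Gaussian prior that the cell describes.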
