
Commit 1cf0d07

Update blog
1 parent 36ffb9c commit 1cf0d07

File tree

1 file changed: +3 -3 lines changed


content/blog/2024-09-04-1725463249.md

Lines changed: 3 additions & 3 deletions
@@ -14,7 +14,7 @@ And that felt odd - a few hundred MBs being used on a 12 GB graphics card. Would
The summary is that, strangely enough, that optimization did not result in a real improvement. Sometimes it was barely any faster. So, IMO not worth the added complexity. Quantization probably has better ROI.

-## Idea:
+## Idea

The way a diffusion pipeline usually works is - it first runs the `text encoder` module(s) once, and then runs the `vae` module once (for encoding), and then loops over the `unet`/`transformer` module several times (i.e. `inference steps`), and then finally runs the `vae` module once again (for decoding).
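To make that call order concrete, here is a minimal sketch of the pipeline shape described above, with tiny `nn.Linear` dummies standing in for the real text encoder, VAE, and UNet. This is purely illustrative (not the post's code and not the diffusers API); the point is that only the `unet` call sits inside the per-step loop.

```python
# Minimal sketch of the call order: text encoder once, VAE encode once,
# UNet/transformer once per inference step, VAE decode once.
# Dummy nn.Linear modules stand in for the real (much larger) ones.
import torch
import torch.nn as nn

text_encoder = nn.Linear(8, 8)   # runs once
vae_encode   = nn.Linear(8, 8)   # runs once (encoding)
unet         = nn.Linear(8, 8)   # runs every inference step
vae_decode   = nn.Linear(8, 8)   # runs once (decoding)

def run_pipeline(prompt_emb: torch.Tensor, image: torch.Tensor, steps: int = 4) -> torch.Tensor:
    cond = text_encoder(prompt_emb)        # 1. text encoder(s), once
    latents = vae_encode(image)            # 2. VAE encode, once
    for _ in range(steps):                 # 3. UNet/transformer, `steps` times
        latents = unet(latents + cond)
    return vae_decode(latents)             # 4. VAE decode, once

out = run_pipeline(torch.randn(1, 8), torch.randn(1, 8))
print(out.shape)
```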

@@ -28,7 +28,7 @@ For deciding which modules to "pin" to the GPU, I tried both orders - sorting by
Neither approach seemed to change the result.


-## Results:
+## Results

Unfortunately, the performance gain is non-existent, to very marginal. I ran each test twice, to ensure that the OS would have the page files warmed up equally.
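For reference, here is a sketch of what picking modules to pin "by size" could look like: measure each module's parameter footprint and greedily keep modules resident on the GPU until a VRAM budget is used up. The helper names, the budget value, and the largest-first default are assumptions for illustration, not the post's actual code.

```python
# Greedy "pin by size" selection sketch: sort modules by parameter size and
# keep as many as fit within a VRAM budget; everything else would be moved
# CPU<->GPU on demand each step. Illustrative only.
import torch.nn as nn

def param_bytes(module: nn.Module) -> int:
    # Total parameter size of a module, in bytes.
    return sum(p.numel() * p.element_size() for p in module.parameters())

def pick_pinned(modules: dict, budget_bytes: int, largest_first: bool = True) -> set:
    order = sorted(modules, key=lambda name: param_bytes(modules[name]), reverse=largest_first)
    pinned, used = set(), 0
    for name in order:
        size = param_bytes(modules[name])
        if used + size <= budget_bytes:
            pinned.add(name)
            used += size
    return pinned

# Dummy modules with very different sizes (stand-ins for the real pipeline parts).
modules = {
    "text_encoder": nn.Linear(1024, 1024),
    "vae": nn.Linear(512, 512),
    "unet": nn.Linear(4096, 4096),
}
print(pick_pinned(modules, budget_bytes=70 * 1024 * 1024))  # e.g. {'unet', 'vae'}
```

Flipping `largest_first` gives the other ordering mentioned above; as the diff notes, neither changed the result.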

@@ -41,7 +41,7 @@ In other runs with 4 steps, the optimization was sometimes faster by 5-10 second
With increased steps (e.g. 10 steps), the optimization is usually better by 15-20 seconds (i.e. 160 seconds total vs 180 seconds).


-## Possible Explanation (of why it didn't work):
+## Possible Explanation (of why it didn't work)

OS Paging or Driver caching or PyTorch caching. The first loop iteration would obviously be very slow, since it would read everything (including the "pinned" modules) from the CPU to the GPU.
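One way to sanity-check the caching explanation (assuming a machine with a CUDA GPU) is to time repeated CPU-to-GPU uploads of the same module and compare the first pass against later ones. This is an illustrative measurement sketch, not something from the post.

```python
# Rough check of warm-up effects: the first CPU->GPU upload of a module tends
# to be much slower (CUDA context init, allocations, cold OS/driver caches)
# than later uploads of the same weights. Requires a CUDA-capable GPU.
import time
import torch
import torch.nn as nn

module = nn.Linear(4096, 4096)   # stand-in for one large pipeline module

for i in range(3):
    torch.cuda.synchronize()
    start = time.perf_counter()
    module.to("cuda")            # upload CPU -> GPU
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    module.to("cpu")             # move back, as an offloading loop would
    print(f"pass {i}: {elapsed * 1000:.1f} ms")
```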
