
Commit 1cf0d07

Update blog
1 parent 36ffb9c commit 1cf0d07

File tree

1 file changed: +3 -3 lines changed


content/blog/2024-09-04-1725463249.md

Lines changed: 3 additions & 3 deletions
@@ -14,7 +14,7 @@ And that felt odd - a few hundred MBs being used on a 12 GB graphics card. Would
The summary is that, strangely enough, that optimization did not result in a real improvement. Sometimes it was barely any faster. So, IMO not worth the added complexity. Quantization probably has better ROI.

-## Idea:
+## Idea

The way a diffusion pipeline usually works is - it first runs the `text encoder` module(s) once, and then runs the `vae` module once (for encoding), and then loops over the `unet`/`transformer` module several times (i.e. `inference steps`), and then finally runs the `vae` module once again (for decoding).
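To make that call order concrete, here is a minimal sketch of the pipeline shape described above, with tiny `nn.Linear` dummies standing in for the real text encoder, VAE, and UNet. This is purely illustrative (not the post's code and not the diffusers API); the point is that only the `unet` call sits inside the per-step loop.

```python
# Minimal sketch of the call order: text encoder once, VAE encode once,
# UNet/transformer once per inference step, VAE decode once.
# Dummy nn.Linear modules stand in for the real (much larger) ones.
import torch
import torch.nn as nn

text_encoder = nn.Linear(8, 8)   # runs once
vae_encode   = nn.Linear(8, 8)   # runs once (encoding)
unet         = nn.Linear(8, 8)   # runs every inference step
vae_decode   = nn.Linear(8, 8)   # runs once (decoding)

def run_pipeline(prompt_emb: torch.Tensor, image: torch.Tensor, steps: int = 4) -> torch.Tensor:
    cond = text_encoder(prompt_emb)        # 1. text encoder(s), once
    latents = vae_encode(image)            # 2. VAE encode, once
    for _ in range(steps):                 # 3. UNet/transformer, `steps` times
        latents = unet(latents + cond)
    return vae_decode(latents)             # 4. VAE decode, once

out = run_pipeline(torch.randn(1, 8), torch.randn(1, 8))
print(out.shape)
```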

@@ -28,7 +28,7 @@ For deciding which modules to "pin" to the GPU, I tried both orders - sorting by
Neither approach seemed to change the result.


-## Results:
+## Results

Unfortunately, the performance gain is non-existent, to very marginal. I ran each test twice, to ensure that the OS would have the page files warmed up equally.
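For reference, here is a sketch of what picking modules to pin "by size" could look like: measure each module's parameter footprint and greedily keep modules resident on the GPU until a VRAM budget is used up. The helper names, the budget value, and the largest-first default are assumptions for illustration, not the post's actual code.

```python
# Greedy "pin by size" selection sketch: sort modules by parameter size and
# keep as many as fit within a VRAM budget; everything else would be moved
# CPU<->GPU on demand each step. Illustrative only.
import torch.nn as nn

def param_bytes(module: nn.Module) -> int:
    # Total parameter size of a module, in bytes.
    return sum(p.numel() * p.element_size() for p in module.parameters())

def pick_pinned(modules: dict, budget_bytes: int, largest_first: bool = True) -> set:
    order = sorted(modules, key=lambda name: param_bytes(modules[name]), reverse=largest_first)
    pinned, used = set(), 0
    for name in order:
        size = param_bytes(modules[name])
        if used + size <= budget_bytes:
            pinned.add(name)
            used += size
    return pinned

# Dummy modules with very different sizes (stand-ins for the real pipeline parts).
modules = {
    "text_encoder": nn.Linear(1024, 1024),
    "vae": nn.Linear(512, 512),
    "unet": nn.Linear(4096, 4096),
}
print(pick_pinned(modules, budget_bytes=70 * 1024 * 1024))  # e.g. {'unet', 'vae'}
```

Flipping `largest_first` gives the other ordering mentioned above; as the diff notes, neither changed the result.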

@@ -41,7 +41,7 @@ In other runs with 4 steps, the optimization was sometimes faster by 5-10 second
With increased steps (e.g. 10 steps), the optimization is usually better by 15-20 seconds (i.e. 160 seconds total vs 180 seconds).


-## Possible Explanation (of why it didn't work):
+## Possible Explanation (of why it didn't work)

OS Paging or Driver caching or PyTorch caching. The first loop iteration would obviously be very slow, since it would read everything (including the "pinned" modules) from the CPU to the GPU.
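One way to sanity-check the caching explanation (assuming a machine with a CUDA GPU) is to time repeated CPU-to-GPU uploads of the same module and compare the first pass against later ones. This is an illustrative measurement sketch, not something from the post.

```python
# Rough check of warm-up effects: the first CPU->GPU upload of a module tends
# to be much slower (CUDA context init, allocations, cold OS/driver caches)
# than later uploads of the same weights. Requires a CUDA-capable GPU.
import time
import torch
import torch.nn as nn

module = nn.Linear(4096, 4096)   # stand-in for one large pipeline module

for i in range(3):
    torch.cuda.synchronize()
    start = time.perf_counter()
    module.to("cuda")            # upload CPU -> GPU
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    module.to("cpu")             # move back, as an offloading loop would
    print(f"pass {i}: {elapsed * 1000:.1f} ms")
```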
