Woo Hoo! 0.496 seconds per 512x512 image, 20 steps #6405
Replies: 6 comments 3 replies
-
Amazing. I dream of getting these optimizations up and running, but just haven't had success with the necessary steps. Looking forward to seeing this go more mainstream.
-
Have you tried the wheel here? #5962
-
You might be right about CUDA. Before I got the compile to work, I tried downloading the Torch 2/CUDA-11.7 nightly build and seem to recall seeing a good perf boost.
Regarding CUDA 12: NVIDIA's announcement claims "Support for new NVIDIA Hopper and NVIDIA Ada Lovelace" architectures, and my 4090 is Ada Lovelace. There are other optimizations in CUDA 12, but they may have to be explicitly used to gain the advantage.
I'll be doing more testing today to clarify things.
…On Fri, Jan 6, 2023 at 3:07 AM Billy Cao ***@***.***> wrote:
Have you tried the wheel here? #5962
I am interested to see if CUDA 12 brings any improvements after all, as I am seeing big speed-ups from just building xformers with torch 2.0 (and some even reported a double speed boost).
-
As Billy mentioned, CUDA 12 might not help; most of the gain is from PyTorch 2.0, plus the other changes I made to processing.py.
Give me a bit more time to flesh out the details.
…On Fri, Jan 6, 2023 at 4:31 AM hippopotamus1000 ***@***.***> wrote:
Impressive speed. Could you write a short tutorial on how you built PyTorch 2.0/xformers with CUDA 12?
-
Waiting...
-
Started discussion #6932
-
I managed to build PyTorch 2.0 with CUDA 12 and then build xformers against it. I also reverted the decode_first_stage change, which was causing a perf regression. With this I got 0.538 seconds per image. Then I added my change to overlap the remaining CPU processing of images with the GPU processing for the next batch. That got me to 0.496! I finally broke 1/2 second.
Ubuntu, a 4090, Euler_a, 20 steps, v2-1_512-ema-pruned, 64 images with batch count 4 and batch size 16. Batch size 16 seems optimal.
With 768x768 images and the matching v2.1 ckpt file I average 1.296 seconds.
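The batch-overlap idea can be sketched with a stdlib thread pool. This is a minimal illustration, not the actual processing.py patch; `gpu_generate` and `cpu_postprocess` are hypothetical stand-ins for the sampler step and the CPU-side decode/save work:

```python
from concurrent.futures import ThreadPoolExecutor

def gpu_generate(batch_idx):
    # Stand-in for the GPU sampling step (hypothetical).
    return [f"latents-{batch_idx}-{i}" for i in range(4)]

def cpu_postprocess(latents):
    # Stand-in for CPU-side image conversion/saving (hypothetical).
    return [x.replace("latents", "image") for x in latents]

def run_batches(n_batches):
    images = []
    # A single worker thread post-processes the previous batch while
    # the main thread immediately starts the GPU work for the next one.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for b in range(n_batches):
            latents = gpu_generate(b)            # GPU busy here
            if pending is not None:
                images.extend(pending.result())  # collect previous batch
            pending = pool.submit(cpu_postprocess, latents)
        images.extend(pending.result())          # drain the last batch
    return images

print(len(run_batches(4)))  # 4 batches x 4 images = 16
```

A single worker thread can be enough here because PyTorch generally releases the GIL while CUDA kernels run, so the CPU post-processing of batch N can proceed while batch N+1 is sampling on the GPU.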