Woo Hoo! 0.496 seconds per 512x512 image, 20 steps #6405
Replies: 6 comments 3 replies
-
Amazing. I dream of getting these optimizations up and running, but just haven't had success with the necessary steps. Looking forward to seeing this go more mainstream.
-
Have you tried the wheel here? #5962
-
You might be right about CUDA. Before I got the compile to work, I tried downloading the Torch 2/CUDA-11.7 nightly build and seem to recall seeing a good perf boost.
Regarding CUDA 12: NVIDIA's announcement claims "Support for new NVIDIA Hopper and NVIDIA Ada Lovelace" architectures, and my 4090 is Ada Lovelace. There are other optimizations in CUDA 12, but they may have to be explicitly used to gain the advantage.
I'll be doing more testing today to clarify things.
…On Fri, Jan 6, 2023 at 3:07 AM Billy Cao ***@***.***> wrote:
Have you tried the wheel here? #5962
I am interested to see if CUDA 12 brings any improvements after all, as I am seeing big speed-ups from just building xformers with torch 2.0 (and some even reported a double speed boost).
-
As Billy mentioned, CUDA 12 might not help; most of the gain is from PyTorch 2.0, plus the other changes I made to processing.py.
Give me a bit more time to flesh out the details.
…On Fri, Jan 6, 2023 at 4:31 AM hippopotamus1000 ***@***.***> wrote:
Impressive speed. Could you write a short tutorial on how you built PyTorch 2.0/xformers with CUDA 12?
-
Waiting...
-
Started discussion #6932
-
I managed to build PyTorch 2.0 with CUDA 12 and then build xformers against it. I also reverted the decode_first_stage change, which was causing a perf regression. With this I got 0.538 seconds per image. Then I added my change to overlap the remaining CPU processing of images with the GPU processing for the next batch. That got me to 0.496! I finally broke 1/2 second.
Ubuntu, a 4090, Euler_a, 20 steps, v2-1_512-ema-pruned, 64 images with batch count 4 and batch size 16. Batch size 16 seems optimal.
With 768x768 images and the matching v2.1 ckpt file I average 1.296 seconds.
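The batch-overlap idea can be sketched with a stdlib thread pool. This is a minimal illustration, not the actual processing.py patch; `gpu_generate` and `cpu_postprocess` are hypothetical stand-ins for the sampler step and the CPU-side decode/save work:

```python
from concurrent.futures import ThreadPoolExecutor

def gpu_generate(batch_idx):
    # Stand-in for the GPU sampling step (hypothetical).
    return [f"latents-{batch_idx}-{i}" for i in range(4)]

def cpu_postprocess(latents):
    # Stand-in for CPU-side image conversion/saving (hypothetical).
    return [x.replace("latents", "image") for x in latents]

def run_batches(n_batches):
    images = []
    # A single worker thread post-processes the previous batch while
    # the main thread immediately starts the GPU work for the next one.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for b in range(n_batches):
            latents = gpu_generate(b)            # GPU busy here
            if pending is not None:
                images.extend(pending.result())  # collect previous batch
            pending = pool.submit(cpu_postprocess, latents)
        images.extend(pending.result())          # drain the last batch
    return images

print(len(run_batches(4)))  # 4 batches x 4 images = 16
```

A single worker thread can be enough here because PyTorch generally releases the GIL while CUDA kernels run, so the CPU post-processing of batch N can proceed while batch N+1 is sampling on the GPU.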