12.5 percent improvement in image generation throughput. #6621

aifartist · 2023-01-10T23:28:07Z

Sometimes you want to generate 100 images and pick the best. I've tested two changes to improve throughput. One fixes #5409 and the other implements an idea of mine. I posted some research on memory usage 2 days ago. #5409 is about the performance drop that came with commit 67efee3.
If you have enough memory there is no reason not to use the faster version of decode_first_stage(). So I added an option "--hivram" for when you don't care about memory usage (within reason) and want performance.
The second change moves the post-GPU image processing, like image save, onto a second thread so the main thread can move on to the next batch without waiting. I put this under a "--pllcpugpu" option.
With a batch size of 16 and 6 batches I got a 12.5 percent improvement in image generation throughput.
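
To make the second idea concrete, here is a minimal sketch of handing finished batches to a background thread. It is not the actual patch: generate_batch(), save_worker() and run() are hypothetical names, and the sleeps are stand-ins for the GPU work and for image encoding and saving.

```python
import queue
import threading
import time


def generate_batch(index: int, size: int) -> list:
    """Hypothetical stand-in for the GPU sampling and decode step."""
    time.sleep(0.01)
    return [f"batch{index}-img{i}" for i in range(size)]


def save_worker(jobs: queue.Queue, saved: list) -> None:
    """Take finished batches off the queue and do the CPU-side work."""
    while True:
        batch = jobs.get()
        if batch is None:  # sentinel: no more batches are coming
            break
        for image in batch:
            time.sleep(0.001)  # hypothetical stand-in for encode + write to disk
            saved.append(image)


def run(n_batches: int = 6, batch_size: int = 16) -> list:
    jobs = queue.Queue(maxsize=2)  # bounded, so unsaved batches cannot pile up
    saved = []
    worker = threading.Thread(target=save_worker, args=(jobs, saved))
    worker.start()
    for i in range(n_batches):
        # Hand the batch off and immediately start generating the next one.
        jobs.put(generate_batch(i, batch_size))
    jobs.put(None)
    worker.join()  # wait until every image has been saved
    return saved


if __name__ == "__main__":
    print(len(run()), "images saved")
```

A single worker and a FIFO queue keep the images in generation order. In CPython a second thread helps most when the saving work spends its time in file I/O or in native code that releases the GIL.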

Given that I have never made a GitHub contribution before, I need to know what the first couple of steps are.
Should I create a feature request to track the work? Should it be a bug?
I know how to clone, create a branch, make changes, and commit, but I don't know the GitHub side of things.

BASELINE
Generated 96 (6 X 16) images in 57.048285 seconds
Time per image 0.59425296875 seconds

HIVRAM
Generated 96 (6 X 16) images in 53.527927 seconds
Time per image 0.5575825729166667 seconds

HIVRAM + PLLCPUGPU
Generated 96 (6 X 16) images in 49.773048 seconds
Time per image 0.51846925 seconds
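
For anyone who wants to check the arithmetic, here is a small sketch that recomputes the figures from the three timings above. The summarize() helper is just an illustrative name. By this calculation the combined run takes roughly 12.8 percent less time than the baseline, which is roughly 14.6 percent more images per second.

```python
# Timings reported above: each run generated 96 images (6 batches of 16).
IMAGES = 96
RUNS = {
    "BASELINE": 57.048285,
    "HIVRAM": 53.527927,
    "HIVRAM + PLLCPUGPU": 49.773048,
}


def summarize(seconds: float, baseline_seconds: float, images: int = IMAGES) -> dict:
    """Compare one run against the baseline run."""
    return {
        "sec_per_image": seconds / images,
        # Share of the baseline wall-clock time that was saved.
        "time_saved_pct": 100 * (baseline_seconds - seconds) / baseline_seconds,
        # Increase in images per second relative to the baseline.
        "throughput_gain_pct": 100 * (baseline_seconds / seconds - 1),
    }


if __name__ == "__main__":
    baseline = RUNS["BASELINE"]
    for name, seconds in RUNS.items():
        s = summarize(seconds, baseline)
        print(
            f"{name}: {s['sec_per_image']:.4f} s/image, "
            f"{s['time_saved_pct']:.2f}% less time, "
            f"{s['throughput_gain_pct']:.2f}% more images/s"
        )
```

The two percentages differ because one is measured against the baseline time and the other against the baseline rate, so it is worth saying which one a headline number refers to.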

AlUlkesh · 2023-01-10T23:37:48Z

In general, first fork this repository.

Then you can make the changes in your fork and, when you're done, submit a pull request from there.

There's some more good info here:
https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Contributing

0 replies

user2734283 · 2023-01-11T00:30:02Z

I've seen other posts from you on Reddit and I think you are a genius. I hope A1111 takes some advice from you.

You said you didn't want to integrate voltaML into A1111 so as not to duplicate efforts. Here we are weeks later and nothing has been done; there were only performance-degrading updates to the A1111 repo. You gave the impression that you are able to do it. It seems like the voltaML engineers don't care about integrating it into other UIs anymore, so once you have figured out how to do GitHub contributions, would you be able to? The voltaML license would allow it, but only if the creators are credited.

I'm just asking because progress on SD seems to have slowed to a crawl since SD 2.1, and some fresh wind would be very much appreciated by all of us.

Eagerly awaiting the distilled diffusion models, and hoping they make one for 1.5 too.

11 replies

aliencaocao Jan 12, 2023

That should not be possible; otherwise NVIDIA would be doing it already. The reason is that each time a model is executed, it is compiled into a single CUDA graph internally. Such graphs cannot be separated without introducing significant overhead.

vladmandic Jan 12, 2023
Collaborator

But even if a monolithic graph cannot be separated, there are quite a few models within the automatic1111 repo being used in an end-to-end workflow, so there is a big opportunity for early completion and hand-off to a second model that executes differently.

btw, the TensorRT prototype at https://www.photoroom.com/tech/stable-diffusion-25-percent-faster-and-save-seconds/ looks promising, and that was over 3 months ago.

aliencaocao Jan 13, 2023

Yeah, I actually posted about this TensorRT work in Discussions a long time ago. There is a repo that already has SD + TensorRT implemented.

aliencaocao Jan 13, 2023

#4161

aifartist Jan 13, 2023
Author

"nvidia would be doing it already"

I'm not even an amateur on NVIDIA hardware, but having worked at IBM, Sybase, Oracle, Salesforce, Amazon, and Microsoft, to name a few, I know that more complex ideas take resources and executives willing to prioritize them. For those who understand virtualization: Intel didn't get there overnight. It worked from the beginning, but trapping CPU instructions to the hypervisor(?) added overhead for many workloads. People came up with ideas and more ideas, and we now have near-bare-metal performance when running in a VM.

Think outside the box. I can talk about cache coherency protocols, NUMA, and other deep Intel subjects, but GPUs I can barely spell. I'll get there. Just give me time.
