Replies: 3 comments 1 reply
-
Indeed it works! Enabled it and went from ~6it/s to 6.4it/s on 512x640 generation, with --xformers turned on. Interesting thing is, I didn't get those pauses you described, or they were too small to notice.
-
I'm impressed that this went unnoticed for so long.
-
3090: first run was slower by 450%, subsequent runs took the same time as before
-
(I originally posted this as an issue, but it probably works better as a discussion; hopefully more people will see it here.)
With my GTX 1080, enabling cuDNN benchmarking seems to reliably give a small speed boost: training goes from ~1.27s/it to ~1.05s/it (saving about 2 hours on the training estimate), while txt2img went from 5.30s/it to 5.07s/it.
I've been hearing good things from others who tested it as well; across pretty much all NVIDIA GPUs it seems to give around a 10-25% improvement.
To enable it I just edited modules/sd_models.py and, underneath def setup_model():, added a single line, like so:
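The line is just PyTorch's cuDNN autotuner flag; a minimal sketch below, assuming torch is already imported at the top of sd_models.py:

```python
def setup_model():
    # let cuDNN benchmark conv algorithms and cache the fastest one for this GPU
    torch.backends.cudnn.benchmark = True
    ...  # rest of setup_model() unchanged
```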
One issue with this is that the first txt2img run seems to take a little while to start up, and then also stays stuck at 100% for a little longer, nullifying the speed boost from the reduced s/it...
That only seems to happen on the first txt2img run, though; any runs after that don't have that issue (and "time taken" becomes ~30 seconds faster than without cuDNN).
I guess this is because cuDNN benchmarks each new operation the first time it's run, but I'm not sure.
Would be happy to hear how it works for others too - though make sure to ignore the first run/generation if you do try it out (since that run is used to benchmark/test different algos for your HW); the runs after that should hopefully be an improvement over the vanilla webui.
(I think changing the image size and other params might also cause it to re-benchmark, but I'm not sure exactly what triggers it...)
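For anyone curious what the flag is doing at the PyTorch level, here's a tiny standalone demo (nothing webui-specific; all of the names below are just my toy example): the first call at each new input shape pays the algorithm-search cost, while repeat calls at an already-seen shape are fast.

```python
import time
import torch

assert torch.cuda.is_available(), "cuDNN benchmarking only applies on CUDA"
torch.backends.cudnn.benchmark = True  # the same flag the webui edit sets

conv = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1).cuda()

# first call at each *new* input shape includes cuDNN's algorithm search;
# repeat calls at a seen shape reuse the cached pick and run fast
for shape in [(1, 4, 64, 80), (1, 4, 64, 80), (1, 4, 96, 96)]:
    x = torch.randn(*shape, device="cuda")
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    conv(x)
    torch.cuda.synchronize()
    print(shape, f"{time.perf_counter() - t0:.4f}s")
```

This would also explain the rebench-on-size-change behavior: a new height/width means new conv input shapes, so cuDNN searches again.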
Wonder if there's some way to work around the first-run benchmarking slowdown - maybe some way to call the txt2img PyTorch operations during startup so cuDNN could benchmark them (or just calling txt2img directly during startup with a single iteration?)
E: you can get it to run txt2img during startup by editing modules/ui.py and, underneath sd_hijack.model_hijack.embedding_db.load_textual_inversion_embeddings() (line ~1245), adding a warm-up call (sketched below). This cuts the first-run time down to slightly faster than vanilla's first run, though for me the second run onwards still seems to have a slight speed boost (but it looks like that happens even with cuDNN disabled...)
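Roughly like the following; this is only a sketch - the processing API and argument names vary between webui versions, so treat every name here as an assumption to check against your copy:

```python
from modules import processing, shared

# hypothetical warm-up: one throwaway 1-step generation at startup so cuDNN
# benchmarks its kernels before the user's first real txt2img run
warmup = processing.StableDiffusionProcessingTxt2Img(
    sd_model=shared.sd_model,
    prompt="warmup",
    steps=1,
    width=512,
    height=512,
)
processing.process_images(warmup)
```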