-
Yup, I pretty much agree with all the notes.
-
Executive summary:
With a fast GPU like a 4090 and a slower CPU, these options don't help and can even hurt. With a 5.8 GHz i9-13900K, using both options together, I get an 8.5% improvement. That said, there are GPUs where they seem to hurt no matter how fast the CPU is. Given the differences between GPUs and which CPU drives them, these options produce a confusing variety of results.
I have data for runs on both a 5.8 GHz P-core and a 4.3 GHz E-core. "BASE" uses neither option, OCL is --opt-channelslast, BENCH is torch.backends.cudnn.benchmark=True, and xformers is enabled in all cases. The test run is a batch count of 11. I throw out the first run, discard the lowest and highest it/s, and average the remaining 8 generations. The model is v1-5-pruned-emaonly.ckpt. The percentage performance difference is relative to the BASE number.
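The averaging described above (drop the warm-up run, then drop the low and high outliers and average the rest) can be sketched as a small helper; the function name and sample numbers are illustrative, not from the original measurements:

```python
# Hypothetical helper mirroring the methodology described above:
# discard the first (warm-up) generation, then discard the lowest and
# highest it/s values and average the remaining runs.
def trimmed_mean_its(its_per_run):
    runs = its_per_run[1:]        # drop warm-up generation
    runs = sorted(runs)[1:-1]     # drop low and high outliers
    return sum(runs) / len(runs)

# Batch count of 11 -> 10 runs after warm-up -> 8 runs averaged.
sample = [30.0, 39.5, 40.1, 39.8, 40.0, 39.9, 40.2, 38.0, 41.0, 40.0, 39.7]
print(trimmed_mean_its(sample))
```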
At 5.8 GHz there is still enough headroom on the CPU to push the GPU faster when these options are used; that is not true on slower CPUs. Also, some GPUs have architectural differences where these options might not help. Even 5.8 GHz is not fast enough to keep the GPU at 100% if I use torch.compile on the model: there I get about 45 it/s but can no longer keep the GPU fully busy, which means compile could be even faster if the CPU could push it harder.
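For reference, the three knobs discussed above map onto public PyTorch APIs as in this minimal sketch (assumes PyTorch ≥ 2.0; the toy model and tensor shapes are illustrative, not the Stable Diffusion UNet):

```python
# Sketch of the BENCH and OCL options, plus torch.compile, using
# standard PyTorch APIs. Runs on CPU so no GPU is required.
import torch
import torch.nn as nn

# BENCH: let cuDNN autotune the fastest conv algorithm per input shape
# (only takes effect on CUDA, but the flag itself is harmless on CPU).
torch.backends.cudnn.benchmark = True

# Illustrative stand-in for the real model.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())

# OCL (--opt-channelslast): NHWC memory layout, which can be faster
# on tensor-core GPUs for convolution-heavy models.
model = model.to(memory_format=torch.channels_last)
x = torch.randn(1, 3, 64, 64).to(memory_format=torch.channels_last)

out = model(x)
print(out.shape, out.is_contiguous(memory_format=torch.channels_last))

# torch.compile can raise throughput further, but as noted above the
# CPU-side launch overhead then becomes the bottleneck:
# compiled = torch.compile(model)
```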
@vladmandic