-
Yup, I pretty much agree with all the notes.
-
Executive summary:
With a fast GPU like a 4090 and a slower CPU, these options don't help and can even hurt. With a 5.8 GHz i9-13900K, using both options together, I get an 8.5% improvement. That said, there are GPUs where they seem to hurt no matter how fast the CPU is. Given the differences between GPUs and which CPU drives them, these options produce a confusing variety of results.
I have data for runs on both a 5.8 GHz P-core and a 4.3 GHz E-core. "BASE" uses neither option, OCL is --opt-channelslast, BENCH is torch.backends.cudnn.benchmark=True, and xformers is enabled in all cases. The test run is a batch count of 11. I throw out the first run, discard the lowest and highest it/s, and average the remaining 8 generations. The model is v1-5-pruned-emaonly.ckpt. The percentage performance difference is relative to the BASE number.
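The averaging described above (drop the warm-up run, then drop the low and high outliers and average the rest) can be sketched as a small helper; the function name and sample numbers are illustrative, not from the original measurements:

```python
# Hypothetical helper mirroring the methodology described above:
# discard the first (warm-up) generation, then discard the lowest and
# highest it/s values and average the remaining runs.
def trimmed_mean_its(its_per_run):
    runs = its_per_run[1:]        # drop warm-up generation
    runs = sorted(runs)[1:-1]     # drop low and high outliers
    return sum(runs) / len(runs)

# Batch count of 11 -> 10 runs after warm-up -> 8 runs averaged.
sample = [30.0, 39.5, 40.1, 39.8, 40.0, 39.9, 40.2, 38.0, 41.0, 40.0, 39.7]
print(trimmed_mean_its(sample))
```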
At 5.8 GHz there is still enough headroom on the CPU to push the GPU faster when these options are used; that is not true on slower CPUs. Also, some GPUs have architectural differences where these options might not help. Even 5.8 GHz is not fast enough to keep the GPU at 100% if I use torch.compile on the model: there I get about 45 it/s but can no longer keep the GPU fully busy, which means compile could be even faster if the CPU could push it harder.
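For reference, the three knobs discussed above map onto public PyTorch APIs as in this minimal sketch (assumes PyTorch ≥ 2.0; the toy model and tensor shapes are illustrative, not the Stable Diffusion UNet):

```python
# Sketch of the BENCH and OCL options, plus torch.compile, using
# standard PyTorch APIs. Runs on CPU so no GPU is required.
import torch
import torch.nn as nn

# BENCH: let cuDNN autotune the fastest conv algorithm per input shape
# (only takes effect on CUDA, but the flag itself is harmless on CPU).
torch.backends.cudnn.benchmark = True

# Illustrative stand-in for the real model.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())

# OCL (--opt-channelslast): NHWC memory layout, which can be faster
# on tensor-core GPUs for convolution-heavy models.
model = model.to(memory_format=torch.channels_last)
x = torch.randn(1, 3, 64, 64).to(memory_format=torch.channels_last)

out = model(x)
print(out.shape, out.is_contiguous(memory_format=torch.channels_last))

# torch.compile can raise throughput further, but as noted above the
# CPU-side launch overhead then becomes the bottleneck:
# compiled = torch.compile(model)
```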
@vladmandic