-
Known issue even for the 3080 10GB; it has something to do with commit 67efee3.
-
On a 4090 there's a noticeable delay between each image generation.
Instrumentation shows the delay occurs just after these calls:
samples_ddim = p.sample(conditioning=c, ...
x_samples_ddim = [decode_first_stage(p.sd_model, ...
in processing.py, where I believe x_samples_ddim is now back on the CPU for the remaining steps, including save_image, until we are done and can start the next image generation.
I see perhaps a 7.5% potential improvement, and that is with a fast image save on a Samsung 990 Pro. Furthermore, as inference gets faster (I'm testing voltaML fast SD now), this ratio will only increase.
I propose having a second thread take the results from the GPU and do all the post-processing there, allowing the main thread to continue with the next batch. Obviously I'd need to be careful with synchronization. If I did this, I'd probably also improve the time reporting to include milliseconds, given the 4090 and even better hardware in the future.
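The producer/consumer split could look roughly like this. This is only a sketch, not the actual processing.py code: `run_inference` and `postprocess_and_save` are hypothetical stand-ins for the sampling/decode step and the CPU-side save path, and the bounded queue keeps results from piling up if saving falls behind.

```python
import queue
import threading
import time

saved = []  # stand-in for images written to disk

def run_inference(batch):
    # Stand-in for p.sample(...) + decode_first_stage(...) (the GPU work).
    time.sleep(0.01)
    return [x * 2 for x in batch]

def postprocess_and_save(index, samples):
    # Stand-in for the CPU-side steps (save_image, metadata, etc.).
    saved.append((index, samples))

def worker(q):
    """Consume finished batches and run CPU-side post-processing."""
    while True:
        item = q.get()
        if item is None:          # sentinel: no more batches coming
            q.task_done()
            break
        index, samples = item
        postprocess_and_save(index, samples)
        q.task_done()

def generate_all(batches):
    q = queue.Queue(maxsize=2)    # bound the backlog of pending saves
    t = threading.Thread(target=worker, args=(q,), daemon=True)
    t.start()
    for i, batch in enumerate(batches):
        samples = run_inference(batch)  # GPU work for batch i
        q.put((i, samples))             # hand off; main thread starts batch i+1
    q.put(None)
    q.join()                            # wait for post-processing to drain
```

The main caveat is the synchronization mentioned above: the tensors must actually be copied off the GPU before being handed to the worker, or both threads end up touching device memory.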
BTW, how the heck do you turn off the darn tqdm output to the console? The closer things get to 1 second per image, the less important it is to watch the console show progress. The GUI progress bar is fine. On the console I just want to see:
image 1: .879
image 2: .834
...
Ave time per image .851
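The millisecond-resolution reporting described above is straightforward with `time.perf_counter`. A minimal sketch, where `generate` is a hypothetical callable standing in for one image generation:

```python
import time

def time_images(batches, generate):
    """Run generate() per batch, timing each, and return report lines."""
    times = []
    lines = []
    for i, batch in enumerate(batches):
        start = time.perf_counter()
        generate(batch)
        times.append(time.perf_counter() - start)
        lines.append(f"image {i + 1}: {times[-1]:.3f}")
    lines.append(f"Avg time per image {sum(times) / len(times):.3f}")
    return lines
```

`perf_counter` is monotonic and has sub-millisecond resolution, so three decimal places of seconds is safe.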