-
Conclusion and solution?
-
Still researching. I posted something about this to reddit r/pytorch. It looks like others want the same thing: https://discuss.pytorch.org/t/choose-a-different-conv-algorithm/27518
-
torch.cuda.set_per_process_memory_fraction() can control how much memory is used; I found it while reading the torch code that selects a convolution algorithm. I tried it, and it reduced the maximum memory used for a batch size of 18. But because I had set a hard limit, I couldn't use a batch size like 60 without OOM'ing, since that might need more memory even when a memory-efficient algorithm is used. So it may be too inflexible.
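For reference, here is a minimal sketch of how that limit can be set; the 0.8 fraction and the device index are arbitrary assumptions, not values from my tests:

```python
import torch

if torch.cuda.is_available():
    # Cap this process at ~80% of the GPU's total memory (the fraction is an
    # arbitrary example). Allocations beyond the cap raise an out-of-memory
    # error instead of letting the caching allocator keep growing.
    torch.cuda.set_per_process_memory_fraction(0.8, device=0)
```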
-
Bug 5409 complains about a performance regression caused by commit https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/67efee33a6c65e58b3f6c788993d0e68a33e4fd0
I've conducted a detailed analysis of the maximum tensor memory used at different batch sizes for both the original code and the new code (which is slower). First, the current state with the fix. I show the batch size, the maximum memory used for p.sample() and decode_first_stage(), and the per-image generation time. After the '=' I also show the times for sample, decode_first_stage, and miscellaneous work; again, these are per-image times. The change that was made affected decode_first_stage().
As can be seen, the memory use for the decode stays low, but it is slower than without the fix. NOTE the anomaly in sample() memory use at batch size 18 and higher. The (apparently) same anomaly occurs for the decode without the fix at a lower batch size, which is why the fix was done.
As can be seen, at batch size 4 the decode memory use skyrockets.
This is why the change was made, although I'm unclear whether it was understood that they could have divided the work up into 3 images at a time and not had a problem. Of course, without testing 768x768 models and some other models, some more investigation is needed to do it right.
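As a rough illustration of the "a few images at a time" idea, here is a minimal sketch of chunking a latent batch through a decoder. decode_batched, first_stage_model, and the chunk size of 3 are illustrative assumptions, not the webui's actual decode_first_stage() code:

```python
import torch

def decode_batched(first_stage_model, latents, chunk_size=3):
    # Decode a batch of latents a few images at a time to cap peak memory.
    # first_stage_model and chunk_size are illustrative assumptions; the real
    # decode_first_stage() in the webui wraps more logic than this.
    outputs = []
    with torch.no_grad():
        for chunk in torch.split(latents, chunk_size, dim=0):
            outputs.append(first_stage_model.decode(chunk))
    return torch.cat(outputs, dim=0)
```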
I did debug down to the (apparent) root cause of the huge increase in memory usage for p.sample(): it occurs in conv.py:_conv_forward, in the F.conv2d() call.
I captured the maximum allocated memory before and after this call, and there was a huge jump for the larger batch sizes (a minimal sketch of that measurement follows after the link). So far I've found:
https://discuss.pytorch.org/t/memory-usage-suddenly-increase-with-specific-input-shape-on-torch-nn-conv2d/99681
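Here is a minimal sketch of the kind of before/after measurement described above, assuming torch.cuda.max_memory_allocated() is the counter being read; the layer shapes and batch size are arbitrary:

```python
import torch
import torch.nn.functional as F

conv = torch.nn.Conv2d(512, 512, kernel_size=3, padding=1).cuda()
x = torch.randn(18, 512, 64, 64, device="cuda")  # shapes/batch are arbitrary assumptions

torch.cuda.synchronize()
torch.cuda.reset_peak_memory_stats()
before = torch.cuda.max_memory_allocated()

with torch.no_grad():
    y = F.conv2d(x, conv.weight, conv.bias, padding=1)

torch.cuda.synchronize()
after = torch.cuda.max_memory_allocated()
print(f"peak memory growth across conv2d: {(after - before) / 2**20:.1f} MiB")
```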
A 6X jump in memory usage is hardly justified for just one more image in the batch. I tried both torch.backends.cudnn.benchmark and torch.backends.cudnn.deterministic without luck. But I don't know what I'm doing, so there may be more to preventing cudnn from switching conv2d algorithms just because it thinks one will be more efficient. OOM'ing is never efficient! :-)
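For completeness, the flags mentioned above are set like this; whether they constrain cudnn's algorithm choice enough to avoid the memory spike is exactly what's still unclear:

```python
import torch

# Don't let cudnn benchmark candidate algorithms per input shape and pick a
# (possibly memory-hungry) winner.
torch.backends.cudnn.benchmark = False

# Restrict cudnn to deterministic algorithms, which also narrows the choice.
torch.backends.cudnn.deterministic = True

# A stronger, framework-wide version of the same idea; warn_only avoids
# errors for ops that have no deterministic implementation.
torch.use_deterministic_algorithms(True, warn_only=True)
```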