Fix group offloading with block_level and use_stream=True #11375
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
LGTM! Thank you.
I did some testing and we get the following numbers:

No record_stream
=== System Memory Stats (Before encode prompt) ===
Total system memory: 1999.99 GB
Available system memory:1942.53 GB
=== CUDA Memory Stats Before encode prompt ===
Current allocated: 0.00 GB
Max allocated: 0.00 GB
Current reserved: 0.00 GB
Max reserved: 0.00 GB
=== System Memory Stats (After encode prompt) ===
Total system memory: 1999.99 GB
Available system memory:1932.83 GB
=== CUDA Memory Stats After encode prompt ===
Current allocated: 15.05 GB
Max allocated: 15.05 GB
Current reserved: 15.29 GB
Max reserved: 15.29 GB
=== System Memory Stats (Before transformer.) ===
Total system memory: 1999.99 GB
Available system memory:1917.84 GB
=== CUDA Memory Stats Before transformer. ===
Current allocated: 0.10 GB
Max allocated: 0.10 GB
Current reserved: 0.10 GB
Max reserved: 0.10 GB
=== System Memory Stats (After loading transformer.) ===
Total system memory: 1999.99 GB
Available system memory:1880.56 GB
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [03:30<00:00, 4.20s/it]
latents.shape=torch.Size([1, 16, 128, 128])
=== CUDA Memory Stats After inference with transformer. ===
Current allocated: 0.10 GB
Max allocated: 0.10 GB
Current reserved: 5.68 GB
Max reserved: 5.68 GB
record_stream
=== System Memory Stats (start) ===
Total system memory: 1999.99 GB
Available system memory:1941.94 GB
=== CUDA Memory Stats start ===
Current allocated: 0.00 GB
Max allocated: 0.00 GB
Current reserved: 0.00 GB
Max reserved: 0.00 GB
=== System Memory Stats (Before encode prompt) ===
Total system memory: 1999.99 GB
Available system memory:1940.32 GB
=== CUDA Memory Stats Before encode prompt ===
Current allocated: 0.00 GB
Max allocated: 0.00 GB
Current reserved: 0.00 GB
Max reserved: 0.00 GB
=== System Memory Stats (After encode prompt) ===
Total system memory: 1999.99 GB
Available system memory:1930.62 GB
=== CUDA Memory Stats After encode prompt ===
Current allocated: 15.05 GB
Max allocated: 15.05 GB
Current reserved: 15.29 GB
Max reserved: 15.29 GB
=== System Memory Stats (Before transformer.) ===
Total system memory: 1999.99 GB
Available system memory:1915.65 GB
=== CUDA Memory Stats Before transformer. ===
Current allocated: 0.10 GB
Max allocated: 0.10 GB
Current reserved: 0.10 GB
Max reserved: 0.10 GB
=== System Memory Stats (After loading transformer.) ===
Total system memory: 1999.99 GB
Available system memory:1883.74 GB
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [03:14<00:00, 3.89s/it]
latents.shape=torch.Size([1, 16, 128, 128])
=== CUDA Memory Stats After inference with transformer. ===
Current allocated: 0.10 GB
Max allocated: 0.10 GB
Current reserved: 4.30 GB
Max reserved: 4.30 GB
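For context, a minimal sketch of how stats like the ones above can be collected (assuming `psutil` is installed; this is not the exact script used for these numbers):

```python
# Hedged sketch: print system + CUDA memory stats in the same shape as the logs above.
import psutil
import torch

GB = 1024**3

def print_memory_stats(label: str) -> None:
    vm = psutil.virtual_memory()
    print(f"=== System Memory Stats ({label}) ===")
    print(f"Total system memory: {vm.total / GB:.2f} GB")
    print(f"Available system memory: {vm.available / GB:.2f} GB")
    print(f"=== CUDA Memory Stats {label} ===")
    print(f"Current allocated: {torch.cuda.memory_allocated() / GB:.2f} GB")
    print(f"Max allocated: {torch.cuda.max_memory_allocated() / GB:.2f} GB")
    print(f"Current reserved: {torch.cuda.memory_reserved() / GB:.2f} GB")
    print(f"Max reserved: {torch.cuda.max_memory_reserved() / GB:.2f} GB")

print_memory_stats("Before encode prompt")
```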
- 🤗 Diffusers version: 0.34.0.dev0
- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.10.14
- PyTorch version (GPU?): 2.8.0.dev20250417+cu126 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.30.2
- Transformers version: 4.52.0.dev0
- Accelerate version: 1.4.0.dev0
- PEFT version: 0.15.2.dev0
- Bitsandbytes version: 0.45.3
- Safetensors version: 0.4.5
- xFormers version: not installed
- Accelerator: NVIDIA H100 80GB HBM3, 81559 MiB
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
Thanks for adding the test! Just two comments.
Failing tests seem unrelated
option only matters when using streamed CPU offloading (i.e. `use_stream=True`). This can be useful when
the CPU memory is a bottleneck but may counteract the benefits of using streams.
"""
if stream is not None and num_blocks_per_group != 1:
This is potentially breaking, no? What if there is existing code with `num_blocks_per_group > 1` and `use_stream=True`? If so, it might be better to raise a warning and set `num_blocks_per_group` to 1 when a stream is used?
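Roughly, the non-breaking alternative being suggested (a sketch only; the helper name is illustrative, not actual diffusers code):

```python
import warnings

def _resolve_num_blocks_per_group(num_blocks_per_group: int, use_stream: bool) -> int:
    # Sketch of the suggested fallback: warn and clamp instead of raising.
    if use_stream and num_blocks_per_group != 1:
        warnings.warn(
            f"`use_stream=True` only supports `num_blocks_per_group=1`, but got "
            f"{num_blocks_per_group}. Falling back to `num_blocks_per_group=1`."
        )
        return 1
    return num_blocks_per_group
```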
Cc: @a-r-r-o-w
Has been addressed in #11425
Fixes #11307
The previous implementation assumed that the layers were instantiated in order of invocation. This is not true for HiDream (caption projection layers are instantiated after transformer layers).
The new implementation makes sure to first capture the invocation order and then apply group offloading. In the case of `use_stream=True`, it does not really make sense to onload more than one block at a time, so we also now raise an error if `num_blocks_per_group != 1` when `use_stream=True`. Another possible fix is to simply move the initialization of the caption layers above the transformer blocks.
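To illustrate the idea (a standalone sketch, not the actual implementation): attach forward pre-hooks, run one forward pass to record the call order, and only then form the offloading groups.

```python
import torch
import torch.nn as nn

def capture_invocation_order(model: nn.Module, modules: dict, *inputs) -> list:
    # Record the order in which the given submodules are first invoked.
    order, handles = [], []
    for name, module in modules.items():
        def hook(mod, args, _name=name):
            if _name not in order:
                order.append(_name)
        handles.append(module.register_forward_pre_hook(hook))
    with torch.no_grad():
        model(*inputs)
    for handle in handles:
        handle.remove()
    return order

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(4, 4)  # instantiated first...
        self.b = nn.Linear(4, 4)  # ...but called first in forward()
    def forward(self, x):
        return self.a(self.b(x))

toy = Toy()
print(capture_invocation_order(toy, dict(toy.named_children()), torch.randn(1, 4)))
# ['b', 'a'] -> groups must follow call order, not instantiation order
```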
@sayakpaul @asomoza Could you verify if this fixes it for you?
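If it helps with verification, a minimal usage sketch of block-level group offloading with streams (the checkpoint id is illustrative, and HiDream may require its extra text encoder components to be loaded and passed separately):

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.hooks import apply_group_offloading

# Illustrative checkpoint; substitute the pipeline you are actually testing.
pipe = DiffusionPipeline.from_pretrained(
    "HiDream-ai/HiDream-I1-Full", torch_dtype=torch.bfloat16
)

apply_group_offloading(
    pipe.transformer,
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=1,  # with use_stream=True, this must now be 1
    use_stream=True,
)

image = pipe("a photo of an astronaut riding a horse", num_inference_steps=50).images[0]
image.save("out.png")
```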