
Conversation

@stevhliu (Member)

From discussion in #10840, this PR adds an example of group offloading to the Flux docs as well as a note on memory requirements.
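
For context, a minimal sketch of the kind of example being discussed; the `apply_group_offloading` import path and argument names (`onload_device`, `offload_device`, `offload_type`) are written from memory and may not match the merged docs exactly:

```python
# Sketch only -- not necessarily the exact snippet added in this PR.
import torch
from diffusers import FluxPipeline
from diffusers.hooks import apply_group_offloading

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Group-offload the large components at the leaf (module) level: each leaf is
# kept on the CPU and moved to the GPU only while its forward runs.
for component in (pipe.text_encoder, pipe.text_encoder_2, pipe.transformer):
    apply_group_offloading(
        component,
        onload_device=torch.device("cuda"),
        offload_device=torch.device("cpu"),
        offload_type="leaf_level",
    )

# The VAE is comparatively small, so it is simply placed on the GPU
# (see the discussion about decode() and hooks further down).
pipe.vae.to("cuda")

image = pipe(
    "a tiny astronaut hatching from an egg on the moon",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("flux.png")
```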

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

## Optimize

## Running FP16 inference
Flux is a very large model and requires ~50GB of RAM. Enable some of the optimizations below to lower the memory requirements.
Member

Nice! But the 50 GB of RAM is used when using group offloading, not before. Also, @a-r-r-o-w was going to check whether this is the real number or not; I get this too, but maybe there's something in my env that makes it go that high. In theory it should use around 20 GB for the transformer model.

Member Author

Ah ok, I'll update this number once we get a clearer value from @a-r-r-o-w!

@nitinmukesh

May I request adding all 3 examples, as they cover different aspects of this feature (a combined sketch follows this list).
1st example covers leaf_level.
3rd example covers block_level + leaf_level (demonstrates that mixing is allowed for different components).
2nd example explains that you need to apply it to all components, otherwise there will be an error: "Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0!". Another solution is to use .to() for any component you don't want to apply apply_group_offloading to. This scenario/issue is covered in #10797.

*component means text_encoder(s), transformer, vae
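
A hedged sketch of scenarios 2 and 3 combined (same caveat as above: argument names such as `offload_type` and `num_blocks_per_group` are assumptions and may differ from the actual API):

```python
# Sketch: mix block_level and leaf_level across components, and use .to() for
# any component that is not group-offloaded to avoid the device-mismatch error.
import torch
from diffusers import FluxPipeline
from diffusers.hooks import apply_group_offloading

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")

# block_level for the transformer (groups of blocks move together, optionally
# overlapped with compute via a CUDA stream)...
apply_group_offloading(
    pipe.transformer,
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=2,
    use_stream=True,
)
# ...and leaf_level for the text encoders -- mixing is allowed per component.
for text_encoder in (pipe.text_encoder, pipe.text_encoder_2):
    apply_group_offloading(
        text_encoder,
        onload_device=onload_device,
        offload_device=offload_device,
        offload_type="leaf_level",
    )

# Components without group offloading must be moved to the GPU explicitly,
# otherwise: "Expected all tensors to be on the same device, ... cpu and cuda:0!"
pipe.vae.to(onload_device)

image = pipe("a cat holding a sign that says hello", num_inference_steps=28).images[0]
```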

@a-r-r-o-w (Contributor)

@asomoza @stevhliu So, I looked into the CPU memory usage, and it is indeed higher for group offloading compared to model/sequential offloading. To improve this, I'll need @SunMarc's help in understanding what's going on, since it looks like a component remains on disk (is not loaded onto the CPU) when not required -- otherwise memory usage should have been much higher from just loading the model weights.

Code: https://gist.github.com/a-r-r-o-w/f5c9fb5c515d24f9a06001adb5c6cf18

| Configuration | Time (s) | CUDA Model Memory (GB) | CUDA Inference Memory (GB) | CPU Offload Memory (GB) | CPU Inference Memory (GB) |
|---|---|---|---|---|---|
| full_cuda | 25.77 | 31.45 | 36.07 | 0.86 | 1.41 |
| model_offload | 230.95 | 0.0 | 23.22 | 0.8 | 10.51 |
| sequential_offload | 2660.6 | 0.0 | 2.4 | 0.92 | 32.69 |
| group_offload_block_1 | 306.01 | 0.17 | 13.41 | 0.91 | 37.55 |
| group_offload_leaf | 375.29 | 0.17 | 4.56 | 0.92 | 37.41 |
| group_offload_block_1_stream | 58.79 | 0.17 | 14.49 | 47.99 | 57.84 |
| group_offload_leaf_stream | 55.26 | 0.17 | 5.52 | 47.99 | 48.54 |

For group offloading, we do not offload the VAE, since its forward is never invoked (which is required to trigger the hooks) because we call decode. So the numbers are not really a fair comparison, but I can benchmark properly again if someone needs it.
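
The full benchmark lives in the gist above; as a rough idea of how such numbers can be collected (hypothetical helper, assuming psutil is installed; process RSS is only a coarse proxy for peak host memory):

```python
# Rough measurement sketch: peak CUDA memory via torch, host RSS via psutil.
import psutil
import torch

def measure(label, fn):
    process = psutil.Process()
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    rss_before = process.memory_info().rss

    result = fn()
    torch.cuda.synchronize()

    cuda_peak = torch.cuda.max_memory_allocated() / 1024**3
    rss_after = process.memory_info().rss / 1024**3
    print(f"{label}: CUDA peak {cuda_peak:.2f} GB, "
          f"CPU RSS {rss_before / 1024**3:.2f} -> {rss_after:.2f} GB")
    return result

# Usage (hypothetical): measure("group_offload_leaf", lambda: pipe(prompt))
```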

So, as Alvaro pointed out, we do require a lot of RAM. There are a few ways we can reduce the RAM requirements in the near future:

  • Applying disk offloading in combination with group offloading or whatever sorcery is happening in model_offload 👀
  • Allowing loading in torch.float8_* types (a rough illustration of the idea is sketched right after this list)
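
As a rough illustration of that second bullet (plain PyTorch, not an existing diffusers feature): weights can be stored in torch.float8_e4m3fn and upcast just-in-time for compute, roughly halving storage relative to bf16. Requires PyTorch 2.1+.

```python
# Illustration only: keep a weight in float8 for storage, upcast for compute.
import torch
import torch.nn.functional as F

linear = torch.nn.Linear(4096, 4096, bias=False, dtype=torch.bfloat16)
w_fp8 = linear.weight.data.to(torch.float8_e4m3fn)   # 1 byte/param instead of 2

x = torch.randn(8, 4096, dtype=torch.bfloat16)
# float8 matmuls are not generally available, so upcast the weight on the fly.
y = F.linear(x, w_fp8.to(torch.bfloat16))

print(w_fp8.element_size(), linear.weight.element_size())  # 1 vs 2 bytes/element
```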

We should definitely mention the limitations for now, but revisit once improvements have been made. There is slightly higher CPU usage compared to sequential offloading because we require pinned memory tensors to allow streams to be used. The PyTorch docs explain this (a small sketch follows the quote):

> In general, the transfer is blocking on the device side (even if it isn’t on the host side): the copy on the device cannot occur while another operation is being executed. However, in some advanced scenarios, a copy and a kernel execution can be done simultaneously on the GPU side. As the following example will show, three requirements must be met to enable this:
>
> The device must have at least one free DMA (Direct Memory Access) engine. Modern GPU architectures such as Volterra, Tesla, or H100 devices have more than one DMA engine.
>
> The transfer must be done on a separate, non-default cuda stream. In PyTorch, cuda streams can be handled using Stream.
>
> The source data must be in pinned memory.
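
To illustrate those three requirements in isolation (plain PyTorch, not diffusers' actual hook code): a pinned source tensor, a non-default stream for the copy, and a non_blocking transfer.

```python
# Overlap a host-to-device copy with GPU compute: pinned memory + a side stream.
import torch

compute_input = torch.randn(4096, 4096, device="cuda")
cpu_weights = torch.randn(4096, 4096).pin_memory()   # requirement: pinned memory

copy_stream = torch.cuda.Stream()                    # requirement: non-default stream

with torch.cuda.stream(copy_stream):
    # Issued on the side stream; can use a free DMA engine while the default
    # stream keeps computing (requirement: non_blocking copy).
    gpu_weights = cpu_weights.to("cuda", non_blocking=True)

# Kernels on the default stream can overlap with the transfer above.
out = compute_input @ compute_input

# Before consuming gpu_weights on the default stream, wait for the copy.
torch.cuda.current_stream().wait_stream(copy_stream)
out = out @ gpu_weights
```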

@a-r-r-o-w (Contributor) left a comment

Thanks @stevhliu! Just some thoughts

@stevhliu (Member Author)

> May I request adding all 3 examples, as they cover different aspects of this feature.

I added a note that should cover the other 2 examples (mixing block- and leaf-level offloading, and applying apply_group_offloading to all components). I think adding the full code examples for the other 2 might be a bit much when you're only making small changes to the code.

I'll update the exact RAM/VRAM requirements later in a separate PR pending @a-r-r-o-w's investigation :)

@a-r-r-o-w (Contributor)

@stevhliu It is indeed 50 GB at the moment for group offloading, as mentioned in the table in my previous comment :( I'll work with Marc on improving this by understanding what accelerate does for model offloading; we might also support this directly in accelerate.

@stevhliu merged commit db21c97 into huggingface:main on Feb 24, 2025 (1 check passed).
@stevhliu deleted the flux branch on February 24, 2025 at 16:47.
## Running FP16 inference
Flux is a very large model and requires ~50GB of RAM/VRAM to load all the modeling components. Enable some of the optimizations below to lower the memory requirements.

### Group offloading
Member

@stevhliu it might be a good idea to consider making it more generally available to all major pipelines with high usage.
