
Conversation

@stevhliu (Member)

From discussion in #10840, this PR adds an example of group offloading to the Flux docs as well as a note on memory requirements.
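
For context, a minimal sketch of the kind of example being discussed; the `apply_group_offloading` import path and argument names (`onload_device`, `offload_device`, `offload_type`) are written from memory and may not match the merged docs exactly:

```python
# Sketch only -- not necessarily the exact snippet added in this PR.
import torch
from diffusers import FluxPipeline
from diffusers.hooks import apply_group_offloading

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Group-offload the large components at the leaf (module) level: each leaf is
# kept on the CPU and moved to the GPU only while its forward runs.
for component in (pipe.text_encoder, pipe.text_encoder_2, pipe.transformer):
    apply_group_offloading(
        component,
        onload_device=torch.device("cuda"),
        offload_device=torch.device("cpu"),
        offload_type="leaf_level",
    )

# The VAE is comparatively small, so it is simply placed on the GPU
# (see the discussion about decode() and hooks further down).
pipe.vae.to("cuda")

image = pipe(
    "a tiny astronaut hatching from an egg on the moon",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("flux.png")
```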

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

## Optimize

## Running FP16 inference
Flux is a very large model and requires ~50GB of RAM. Enable some of the optimizations below to lower the memory requirements.
Member

Nice! But the 50 GB of RAM is used when using group offloading, not before. Also, @a-r-r-o-w was going to check whether this is the real number or not; I get this too, but maybe there's something in my env that makes it go that high. In theory it should use around 20 GB for the transformer model.

Member Author

Ah ok, I'll update this number once we get a clearer value from @a-r-r-o-w!

@nitinmukesh

May I request adding all 3 examples, as they cover different aspects of this feature (a combined sketch follows this list).
1st example covers leaf_level.
3rd example covers block_level + leaf_level (demonstrates that mixing is allowed for different components).
2nd example explains that you need to apply it to all components, otherwise there will be an error: "Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0!". Another solution is to use .to() for any component you don't want to apply apply_group_offloading to. This scenario/issue is covered in #10797.

*component means text_encoder(s), transformer, vae
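
A hedged sketch of scenarios 2 and 3 combined (same caveat as above: argument names such as `offload_type` and `num_blocks_per_group` are assumptions and may differ from the actual API):

```python
# Sketch: mix block_level and leaf_level across components, and use .to() for
# any component that is not group-offloaded to avoid the device-mismatch error.
import torch
from diffusers import FluxPipeline
from diffusers.hooks import apply_group_offloading

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")

# block_level for the transformer (groups of blocks move together, optionally
# overlapped with compute via a CUDA stream)...
apply_group_offloading(
    pipe.transformer,
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=2,
    use_stream=True,
)
# ...and leaf_level for the text encoders -- mixing is allowed per component.
for text_encoder in (pipe.text_encoder, pipe.text_encoder_2):
    apply_group_offloading(
        text_encoder,
        onload_device=onload_device,
        offload_device=offload_device,
        offload_type="leaf_level",
    )

# Components without group offloading must be moved to the GPU explicitly,
# otherwise: "Expected all tensors to be on the same device, ... cpu and cuda:0!"
pipe.vae.to(onload_device)

image = pipe("a cat holding a sign that says hello", num_inference_steps=28).images[0]
```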

@a-r-r-o-w (Contributor)

@asomoza @stevhliu So, I looked into the CPU memory usage, and it is indeed higher for group offloading compared to model/sequential offloading. To improve this, I'll need @SunMarc's help in understanding what's going on, since it looks like a component remains on disk (is not loaded onto the CPU) when not required -- otherwise memory usage should have been much higher from just loading the model weights.

Code: https://gist.github.com/a-r-r-o-w/f5c9fb5c515d24f9a06001adb5c6cf18

| Configuration | Time (s) | CUDA Model Memory (GB) | CUDA Inference Memory (GB) | CPU Offload Memory (GB) | CPU Inference Memory (GB) |
|---|---|---|---|---|---|
| full_cuda | 25.77 | 31.45 | 36.07 | 0.86 | 1.41 |
| model_offload | 230.95 | 0.0 | 23.22 | 0.8 | 10.51 |
| sequential_offload | 2660.6 | 0.0 | 2.4 | 0.92 | 32.69 |
| group_offload_block_1 | 306.01 | 0.17 | 13.41 | 0.91 | 37.55 |
| group_offload_leaf | 375.29 | 0.17 | 4.56 | 0.92 | 37.41 |
| group_offload_block_1_stream | 58.79 | 0.17 | 14.49 | 47.99 | 57.84 |
| group_offload_leaf_stream | 55.26 | 0.17 | 5.52 | 47.99 | 48.54 |

For group offloading, we do not offload the VAE, since its forward is never invoked (which is required to trigger the hooks) because we call decode. So the numbers are not really a fair comparison, but I can benchmark properly again if someone needs it.
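
The full benchmark lives in the gist above; as a rough idea of how such numbers can be collected (hypothetical helper, assuming psutil is installed; process RSS is only a coarse proxy for peak host memory):

```python
# Rough measurement sketch: peak CUDA memory via torch, host RSS via psutil.
import psutil
import torch

def measure(label, fn):
    process = psutil.Process()
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    rss_before = process.memory_info().rss

    result = fn()
    torch.cuda.synchronize()

    cuda_peak = torch.cuda.max_memory_allocated() / 1024**3
    rss_after = process.memory_info().rss / 1024**3
    print(f"{label}: CUDA peak {cuda_peak:.2f} GB, "
          f"CPU RSS {rss_before / 1024**3:.2f} -> {rss_after:.2f} GB")
    return result

# Usage (hypothetical): measure("group_offload_leaf", lambda: pipe(prompt))
```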

So, as Alvaro pointed out, we do require a lot of RAM. There are a few ways we can reduce the RAM requirements in the near future:

  • Applying disk offloading in combination with group offloading or whatever sorcery is happening in model_offload 👀
  • Allowing loading in torch.float8_* types (a rough illustration of the idea is sketched right after this list)
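
As a rough illustration of that second bullet (plain PyTorch, not an existing diffusers feature): weights can be stored in torch.float8_e4m3fn and upcast just-in-time for compute, roughly halving storage relative to bf16. Requires PyTorch 2.1+.

```python
# Illustration only: keep a weight in float8 for storage, upcast for compute.
import torch
import torch.nn.functional as F

linear = torch.nn.Linear(4096, 4096, bias=False, dtype=torch.bfloat16)
w_fp8 = linear.weight.data.to(torch.float8_e4m3fn)   # 1 byte/param instead of 2

x = torch.randn(8, 4096, dtype=torch.bfloat16)
# float8 matmuls are not generally available, so upcast the weight on the fly.
y = F.linear(x, w_fp8.to(torch.bfloat16))

print(w_fp8.element_size(), linear.weight.element_size())  # 1 vs 2 bytes/element
```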

We should definitely mention the limitations for now, but revisit once improvements have been made. There is slightly higher CPU usage compared to sequential offloading because we require pinned memory tensors to allow streams to be used. The PyTorch docs explain this (a small sketch follows the quote):

> In general, the transfer is blocking on the device side (even if it isn’t on the host side): the copy on the device cannot occur while another operation is being executed. However, in some advanced scenarios, a copy and a kernel execution can be done simultaneously on the GPU side. As the following example will show, three requirements must be met to enable this:
>
> The device must have at least one free DMA (Direct Memory Access) engine. Modern GPU architectures such as Volterra, Tesla, or H100 devices have more than one DMA engine.
>
> The transfer must be done on a separate, non-default cuda stream. In PyTorch, cuda streams can be handled using Stream.
>
> The source data must be in pinned memory.
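
To illustrate those three requirements in isolation (plain PyTorch, not diffusers' actual hook code): a pinned source tensor, a non-default stream for the copy, and a non_blocking transfer.

```python
# Overlap a host-to-device copy with GPU compute: pinned memory + a side stream.
import torch

compute_input = torch.randn(4096, 4096, device="cuda")
cpu_weights = torch.randn(4096, 4096).pin_memory()   # requirement: pinned memory

copy_stream = torch.cuda.Stream()                    # requirement: non-default stream

with torch.cuda.stream(copy_stream):
    # Issued on the side stream; can use a free DMA engine while the default
    # stream keeps computing (requirement: non_blocking copy).
    gpu_weights = cpu_weights.to("cuda", non_blocking=True)

# Kernels on the default stream can overlap with the transfer above.
out = compute_input @ compute_input

# Before consuming gpu_weights on the default stream, wait for the copy.
torch.cuda.current_stream().wait_stream(copy_stream)
out = out @ gpu_weights
```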

@a-r-r-o-w (Contributor) left a comment

Thanks @stevhliu! Just some thoughts

@stevhliu (Member Author)

> May I request adding all 3 examples, as they cover different aspects of this feature.

I added a note that should cover the other 2 examples (mixing block- and leaf-level offloading, and applying apply_group_offloading to all components). I think adding the full code examples for the other 2 might be a bit much when you're only making small changes to the code.

I'll update the exact RAM/VRAM requirements later in a separate PR pending @a-r-r-o-w's investigation :)

@a-r-r-o-w (Contributor)

@stevhliu It is indeed 50 GB at the moment for group offloading, as mentioned in the table in my previous comment :( I'll work with Marc on improving this by understanding what accelerate does for model offloading; we might also support this directly in accelerate.

@stevhliu merged commit db21c97 into huggingface:main on Feb 24, 2025 (1 check passed).
@stevhliu deleted the flux branch on February 24, 2025 at 16:47.
## Running FP16 inference
Flux is a very large model and requires ~50GB of RAM/VRAM to load all the modeling components. Enable some of the optimizations below to lower the memory requirements.

### Group offloading
Member

@stevhliu it might be a good idea to consider making it more generally available to all major pipelines with high usage.
