[docs] Flux group offload #10847
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
docs/source/en/api/pipelines/flux.md (Outdated)

> ## Optimize
>
> ## Running FP16 inference
>
> Flux is a very large model and requires ~50GB of RAM. Enable some of the optimizations below to lower the memory requirements.
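For reference, a minimal sketch of the unoptimized path this section describes; the checkpoint name, dtype, and generation settings below are illustrative, not taken from the PR:

```python
import torch
from diffusers import FluxPipeline

# Load every component (text encoders, transformer, VAE) in half precision.
# Without any offloading, all of them must fit in memory at the same time.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.float16
)
pipe.to("cuda")

image = pipe(
    "a photo of an astronaut riding a horse on the moon",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("flux_fp16.png")
```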
Nice! But the 50 GB of RAM is used when group offloading is enabled, not before. Also, @a-r-r-o-w was going to check whether this is the real number. I get the same figure, but maybe something in my environment makes it go that high; in theory it should use around 20GB for the transformer model.
Ah ok, I'll update this number once we get a clearer value from @a-r-r-o-w!
May I request adding all 3 examples, since they cover different aspects of this feature? (*component* means text_encoder(s), transformer, and vae.)
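To illustrate what per-component examples could look like, here is a rough sketch. It assumes the `apply_group_offloading` helper from `diffusers.hooks` and its `onload_device`/`offload_device`/`offload_type`/`use_stream` parameters, so the exact API and settings may differ from what the PR ends up documenting:

```python
import torch
from diffusers import FluxPipeline
from diffusers.hooks import apply_group_offloading

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

onload_device = torch.device("cuda")
offload_device = torch.device("cpu")

# Apply group offloading to each large component separately so that only the
# group currently being executed sits on the GPU.
apply_group_offloading(
    pipe.transformer,
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True,
)
apply_group_offloading(
    pipe.text_encoder,
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=2,
)
apply_group_offloading(
    pipe.text_encoder_2,
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=2,
)

# Keep the comparatively small VAE on the GPU (see the discussion below).
pipe.vae.to(onload_device)
```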
@asomoza @stevhliu So, I looked into the CPU memory usage, and it is indeed higher for group offloading compared to model/sequential offloading. To improve this, I'll need @SunMarc's help to understand what's going on, since it looks like a component remains on disk (is not loaded onto the CPU) when not required; otherwise memory usage would have been much higher from just loading the model weights. Code: https://gist.github.com/a-r-r-o-w/f5c9fb5c515d24f9a06001adb5c6cf18
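For the record, a rough sketch of how such a comparison can be measured (a hypothetical helper, not the code in the linked gist); it uses `psutil` for resident CPU memory and PyTorch's CUDA statistics for peak VRAM:

```python
import psutil
import torch

def report_memory(tag: str) -> None:
    # Resident CPU memory of this process plus peak GPU memory allocated by PyTorch.
    cpu_gb = psutil.Process().memory_info().rss / 1024**3
    gpu_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{tag}: CPU {cpu_gb:.2f} GB, peak GPU {gpu_gb:.2f} GB")

torch.cuda.reset_peak_memory_stats()
# ... run the pipeline here with model, sequential, or group offloading enabled ...
report_memory("group offloading")
```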
For group offloading, we do not offload the VAE […]. So, as Alvaro pointed out, we do require a lot of RAM. There are a few ways we can reduce the RAM requirements in the near future.
We should definitely mention the limitations for now, but revisit once improvements have been made. There is slightly higher CPU usage compared to sequential offloading because we require pinned memory tensors so that streams can be used; the docs explain this.
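As a sketch of the knob involved (same assumed `apply_group_offloading` signature as above, continuing from the `pipe` created in the earlier snippet):

```python
import torch
from diffusers.hooks import apply_group_offloading

# use_stream=True overlaps weight transfers with compute on a separate CUDA
# stream, but it needs page-locked (pinned) CPU copies of the weights, which
# is where the extra CPU RAM goes. Disabling it trades speed for lower RAM.
apply_group_offloading(
    pipe.transformer,
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=False,  # lower CPU memory, slower onloading
)
```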
Thanks @stevhliu! Just some thoughts.
I added a note that should cover the other 2 examples (mixing […]). I'll update the exact RAM/VRAM requirements later in a separate PR, pending @a-r-r-o-w's investigation :)
@stevhliu It is indeed 50 GB at the moment for group offloading, as mentioned in the table in my previous comment :( I'll work with Marc on improving this by understanding what accelerate does for model offloading, and we might end up supporting this directly in accelerate.
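For context, the accelerate-backed offloading paths being compared against are the existing pipeline methods, shown here as a sketch (this is the baseline, not the improvement being discussed):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Model offloading: whole components are moved between CPU and GPU as needed.
pipe.enable_model_cpu_offload()

# Sequential offloading: individual submodules are moved instead, which is
# slower but keeps less on the GPU at any one time.
# pipe.enable_sequential_cpu_offload()
```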
> ## Running FP16 inference
>
> Flux is a very large model and requires ~50GB of RAM/VRAM to load all the modeling components. Enable some of the optimizations below to lower the memory requirements.
>
> ### Group offloading
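The new section presumably pairs that heading with a runnable snippet along these lines (a sketch under the same API assumptions as above, not the exact code added by the PR):

```python
import torch
from diffusers import FluxPipeline
from diffusers.hooks import apply_group_offloading

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Offload the large components in groups; keep the small VAE on the GPU.
for module in (pipe.transformer, pipe.text_encoder, pipe.text_encoder_2):
    apply_group_offloading(
        module,
        onload_device=torch.device("cuda"),
        offload_device=torch.device("cpu"),
        offload_type="leaf_level",
        use_stream=True,
    )
pipe.vae.to("cuda")

image = pipe(
    "a tiny robot watering a bonsai tree", num_inference_steps=28, guidance_scale=3.5
).images[0]
image.save("flux_group_offload.png")
```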
@stevhliu it might be a good idea to consider making it more generally available to all major pipelines with high usage.
Following the discussion in #10840, this PR adds an example of group offloading to the Flux docs as well as a note on memory requirements.