Add sequential_weight_offload_device to skip unnecessary weight offloading #2478
changjonathanc wants to merge 2 commits into vllm-project:main
Conversation
Separate weight offloading from activation offloading in the sequential pipeline. The new `sequential_weight_offload_device` parameter controls where model weights are stored between layer forward passes, while the existing `sequential_offload_device` continues to control activation offloading. Setting `sequential_weight_offload_device="none"` disables weight offloading entirely, keeping the model on the main device. This avoids unnecessary CPU↔GPU transfers when the model fits in GPU memory, yielding ~3x faster calibration on small models while still benefiting from the sequential pipeline's memory-efficient activation caching. The default remains `"cpu"` for backward compatibility.

Signed-off-by: Jonathan Chang <changjonathanc@users.noreply.github.com>
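Assuming the new parameter is surfaced alongside the existing `sequential_offload_device` option (the surrounding call shape is a sketch, not the PR's exact API; only the two parameter names come from this PR), the two configurations compare as follows:

```python
# Hedged sketch: these dicts stand in for however the sequential pipeline
# receives its options; only the two parameter names come from this PR.

# Default, backward-compatible behavior: weights parked on CPU between layers.
offloaded = dict(
    sequential_offload_device="cpu",         # where activations are cached
    sequential_weight_offload_device="cpu",  # where weights rest between layers
)

# New fast path: activations still cached on CPU, weights stay on the GPU.
resident = dict(
    sequential_offload_device="cpu",
    sequential_weight_offload_device="none",  # disables weight offloading
)
```

The point of keeping the two knobs separate is that activation caching is what makes the sequential pipeline memory-efficient, while weight offloading is only needed when the model itself does not fit on the GPU.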
Summary of Changes: This pull request optimizes the sequential pipeline by decoupling weight offloading from activation offloading. It introduces a new parameter that lets users prevent model weights from being unnecessarily transferred to the CPU, improving performance for models that fit entirely on the GPU. The change reduces redundant data movement while preserving the benefits of sequential activation caching.
Activity
Code Review
This pull request introduces a `sequential_weight_offload_device` parameter to allow disabling weight offloading in the sequential pipeline, which can lead to significant performance improvements. The changes are well implemented, separating weight offloading from activation offloading and including a new unit test. My feedback includes a minor suggestion to improve the robustness of user input handling for the new parameter.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Jonathan Chang <31893406+changjonathanc@users.noreply.github.com>
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.
kylesayrs
left a comment
Hi Jonathan.
Please see the Big Model Loading Guide.
As I mentioned in a previous comment, the intended design is that users will load models on the offload device directly, rather than unnecessarily moving model weights after loading.
@Etelis is working on #2480, which seems to be the feature that you're interested in?
@kylesayrs I don't think that's what this PR is getting at; we currently don't support enabling activation offloading while disabling weight offloading in oneshot, which is what I think this is attempting to enable. @changjonathanc

Edit: talking with Kyle, this actually isn't the case. If the model is loaded with device_map="auto" then the offloading will maintain that, so this is already doable. I also see another PR in compressed-tensors, so I'm not sure what you're trying to do exactly.
Hi @changjonathanc, feel free to join the LLM Compressor Slack and we can sync about what you're trying to achieve!
Thanks @kylesayrs
SUMMARY:
Separate weight offloading from activation offloading in the sequential pipeline.
Today `dispatch_for_sequential` always offloads weights to CPU, even when the model fits on the GPU. The new `sequential_weight_offload_device` parameter (default `"cpu"`, backward-compatible) lets users set `"none"` to disable weight offloading, eliminating unnecessary CPU↔GPU transfers while still benefiting from sequential activation caching.
Benchmark (TinyLlama-1.1B, W8A8, 32 samples, seq_len=512):
`sequential_weight_offload_device="cpu"`: 17.1s
`sequential_weight_offload_device="none"`: 14.5s (~17% faster)

TEST PLAN:
Unit test covering `dispatch_for_sequential` with `None`, `"none"`, and `"None"`; end-to-end calibration verified in both `cpu` and `none` modes.
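The placement decision the summary describes can be sketched as a toy planner (pure Python; the real `dispatch_for_sequential` operates on torch modules, so the function name, arguments, and return shape below are illustrative only, not the PR's code):

```python
def plan_layer_devices(num_layers, exec_device="cuda:0", weight_offload_device="cpu"):
    """Return, per layer, where weights rest between forward passes.

    With offloading enabled, each layer parks on weight_offload_device and is
    moved to exec_device only for its own forward pass. With the offload device
    set to None/"none"/"None", weights simply stay resident on the execution
    device, skipping the per-layer CPU<->GPU round trips.
    """
    offload_disabled = weight_offload_device in (None, "none", "None")
    resting = exec_device if offload_disabled else weight_offload_device
    return [
        {"layer": i, "resting": resting, "compute": exec_device}
        for i in range(num_layers)
    ]
```

For example, `plan_layer_devices(22)` (the default, matching today's behavior) parks every layer on `"cpu"`, while `plan_layer_devices(22, weight_offload_device="none")` leaves all layers resident on `"cuda:0"`, which is where the benchmarked speedup comes from.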