Add sequential_weight_offload_device to skip unnecessary weight offloading#2478

Closed
changjonathanc wants to merge 2 commits into vllm-project:main from changjonathanc:feat/sequential-weight-offload-device

Conversation

@changjonathanc

SUMMARY:
Separate weight offloading from activation offloading in the sequential pipeline.
Today dispatch_for_sequential always offloads weights to CPU, even when the model
fits on the GPU. The new sequential_weight_offload_device parameter (default "cpu",
backward-compatible) lets users set "none" to disable weight offloading, eliminating
unnecessary CPU↔GPU transfers while still benefiting from sequential activation caching.

Benchmark (TinyLlama-1.1B, W8A8, 32 samples, seq_len=512):

  • sequential_weight_offload_device="cpu": 17.1s
  • sequential_weight_offload_device="none": 14.5s (~17% faster)

TEST PLAN:

  • Added parametrized unit test for dispatch_for_sequential with None, "none", and "None".
  • Verified quantized output is identical between cpu and none modes.
  • All existing tests pass.
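
The test plan above exercises three spellings that should all disable weight offloading. A minimal, self-contained sketch of that normalization (the helper name is hypothetical; the real check lives inside dispatch_for_sequential):

```python
def weight_offload_disabled(offload_device) -> bool:
    """Hypothetical helper: treat None, "none", and "None" as
    'do not offload weights', matching the parametrized test cases."""
    if offload_device is None:
        return True
    return isinstance(offload_device, str) and offload_device.lower() == "none"


# The three spellings from the test plan all disable offloading;
# the backward-compatible default "cpu" does not.
assert weight_offload_disabled(None)
assert weight_offload_disabled("none")
assert weight_offload_disabled("None")
assert not weight_offload_disabled("cpu")
```

Accepting both `None` and the string `"none"` keeps the CLI/config path (which passes strings) and the Python API path (which can pass `None`) consistent.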

Separate weight offloading from activation offloading in the sequential
pipeline. The new `sequential_weight_offload_device` parameter controls
where model weights are stored between layer forward passes, while the
existing `sequential_offload_device` continues to control activation
offloading.

Setting `sequential_weight_offload_device="none"` disables weight
offloading entirely, keeping the model on the main device. This avoids
unnecessary CPU↔GPU transfers when the model fits in GPU memory,
yielding ~3x faster calibration on small models while still benefiting
from the sequential pipeline's memory-efficient activation caching.

Default remains "cpu" for backward compatibility.

Signed-off-by: Jonathan Chang <changjonathanc@users.noreply.github.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request optimizes the sequential pipeline by decoupling weight offloading from activation offloading. It introduces a new parameter that allows users to prevent model weights from being unnecessarily transferred to the CPU, leading to improved performance, especially for models that can fully reside on the GPU. This change enhances efficiency by reducing redundant data movement while maintaining the benefits of sequential activation caching.

Highlights

  • New Configuration Parameter: Introduced sequential_weight_offload_device to provide granular control over weight offloading in the sequential pipeline, separate from activation offloading.
  • Performance Improvement: Allows users to disable weight offloading by setting sequential_weight_offload_device to "none", eliminating unnecessary CPU↔GPU transfers when the model fits on the GPU. Benchmarks show a ~17% speedup for TinyLlama-1.1B.
  • Backward Compatibility: The default value for sequential_weight_offload_device is "cpu", ensuring backward compatibility with existing configurations.


Changelog
  • src/llmcompressor/args/dataset_arguments.py
    • Added sequential_weight_offload_device argument to DatasetArguments.
  • src/llmcompressor/entrypoints/oneshot.py
    • Added sequential_weight_offload_device parameter to the oneshot function signature.
    • Updated the docstring for the oneshot function to describe the new sequential_weight_offload_device parameter.
  • src/llmcompressor/pipelines/sequential/helpers.py
    • Modified dispatch_for_sequential to accept onload_device and offload_device parameters.
    • Implemented logic within dispatch_for_sequential to disable weight offloading if offload_device is "none" or None.
  • src/llmcompressor/pipelines/sequential/pipeline.py
    • Updated the __call__ method to pass the new sequential_weight_offload_device from dataset_args to dispatch_for_sequential.
    • Renamed offload_device to activation_offload_device for clarity in the __call__ method.
  • tests/llmcompressor/pipelines/sequential/test_helpers.py
    • Imported pytest and dispatch_for_sequential.
    • Added a new parametrized unit test test_dispatch_for_sequential_no_offload to verify that the model remains on the onload_device when offload_device is set to None, "none", or "None".
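
Pieced together from the helpers.py bullets above, the control flow might look roughly like this pure-Python sketch (function and parameter names mirror the changelog, but the body is an assumption, not the PR's actual code):

```python
def dispatch_for_sequential_sketch(layer_names,
                                   onload_device="cuda:0",
                                   offload_device="cpu"):
    """Sketch of the described behavior: if offload_device is None or
    (case-insensitively) "none", weights stay on the onload device
    between layer forward passes; otherwise they are parked on
    offload_device and onloaded one layer at a time."""
    disabled = offload_device is None or str(offload_device).lower() == "none"
    resting_device = onload_device if disabled else offload_device
    # Map each layer to where its weights live between forward passes.
    return {name: resting_device for name in layer_names}
```

With `offload_device="none"` every layer rests on the onload device, which is what eliminates the per-layer CPU↔GPU round trips; with the default `"cpu"` the existing offloading behavior is preserved.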
Activity
  • The author has implemented a new feature to optimize weight offloading.
  • Unit tests have been added to cover the new functionality and ensure correct behavior when offloading is disabled.
  • Performance benchmarks were provided, demonstrating a significant speedup (~17%) with the new configuration.

@gemini-code-assist bot left a comment (Contributor)

Code Review

This pull request introduces a sequential_weight_offload_device parameter to allow disabling weight offloading in the sequential pipeline, which can lead to significant performance improvements. The changes are well-implemented, separating weight offloading from activation offloading and including a new unit test. My feedback includes a minor suggestion to improve the robustness of user input handling for the new parameter.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Jonathan Chang <31893406+changjonathanc@users.noreply.github.com>
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@kylesayrs left a comment (Collaborator)

Hi Jonathan.

Please see the Big Model Loading Guide.

As I mentioned in a previous comment, the intended design is that users will load models on the offload directly, rather than unnecessarily moving model weights after loading.

@Etelis is working on #2480, which seems to be the feature you're interested in?

@HDCharles
Collaborator

HDCharles commented Mar 17, 2026

@kylesayrs I don't think that's what this PR is getting at; we currently don't support enabling activation offloading while disabling weight offloading in oneshot, which is what I think this is attempting to enable.

@changjonathanc
I think it's reasonable to want to do this. However, the sequential pipeline is fundamentally about sequentially onloading/offloading the model, and in general the way to run the whole model on GPU is the basic pipeline. The basic pipeline doesn't have activation offloading, though I think it would make sense to add it as an option there, rather than removing sequential offloading from the sequential pipeline. While it may be easier in some cases to just delete the hooks and skip offloading in the sequential pipeline, sequential onloading/offloading is under heavy development between the DDP work and various offloading contexts (e.g. the big-model loading work Kyle mentioned earlier), so adding more responsibilities to this path seems like a bad idea.

Edit: after talking with Kyle, this actually isn't the case: if the model is loaded with device_map="auto", the offloading will maintain that placement, so this is already doable. I also see another PR in compressed-tensors, so I'm not sure what you're trying to do exactly.

@kylesayrs
Collaborator

Hi @changjonathanc, feel free to join the LLM Compressor Slack and we can sync about what you're trying to achieve!

@changjonathanc
Author

Thanks @kylesayrs, I see the issue now. I updated my code to put the model on the correct devices before it touches llm-compressor code, and it works. I think this PR is not needed anymore.
