Add sequential_weight_offload_device to skip unnecessary weight offloading#2478

Closed
changjonathanc wants to merge 2 commits into vllm-project:main from changjonathanc:feat/sequential-weight-offload-device

Conversation

@changjonathanc

SUMMARY:
Separate weight offloading from activation offloading in the sequential pipeline.
Today dispatch_for_sequential always offloads weights to CPU, even when the model
fits on the GPU. The new sequential_weight_offload_device parameter (default "cpu",
backward-compatible) lets users set "none" to disable weight offloading, eliminating
unnecessary CPU↔GPU transfers while still benefiting from sequential activation caching.

Benchmark (TinyLlama-1.1B, W8A8, 32 samples, seq_len=512):

  • sequential_weight_offload_device="cpu": 17.1s
  • sequential_weight_offload_device="none": 14.5s (~17% faster)

TEST PLAN:

  • Added parametrized unit test for dispatch_for_sequential with None, "none", and "None".
  • Verified quantized output is identical between cpu and none modes.
  • All existing tests pass.
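
The test plan above exercises three spellings that should all disable weight offloading. A minimal, self-contained sketch of that normalization (the helper name is hypothetical; the real check lives inside dispatch_for_sequential):

```python
def weight_offload_disabled(offload_device) -> bool:
    """Hypothetical helper: treat None, "none", and "None" as
    'do not offload weights', matching the parametrized test cases."""
    if offload_device is None:
        return True
    return isinstance(offload_device, str) and offload_device.lower() == "none"


# The three spellings from the test plan all disable offloading;
# the backward-compatible default "cpu" does not.
assert weight_offload_disabled(None)
assert weight_offload_disabled("none")
assert weight_offload_disabled("None")
assert not weight_offload_disabled("cpu")
```

Accepting both `None` and the string `"none"` keeps the CLI/config path (which passes strings) and the Python API path (which can pass `None`) consistent.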

Separate weight offloading from activation offloading in the sequential
pipeline. The new `sequential_weight_offload_device` parameter controls
where model weights are stored between layer forward passes, while the
existing `sequential_offload_device` continues to control activation
offloading.

Setting `sequential_weight_offload_device="none"` disables weight
offloading entirely, keeping the model on the main device. This avoids
unnecessary CPU↔GPU transfers when the model fits in GPU memory,
yielding ~3x faster calibration on small models while still benefiting
from the sequential pipeline's memory-efficient activation caching.

Default remains "cpu" for backward compatibility.

Signed-off-by: Jonathan Chang <changjonathanc@users.noreply.github.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request optimizes the sequential pipeline by decoupling weight offloading from activation offloading. It introduces a new parameter that allows users to prevent model weights from being unnecessarily transferred to the CPU, leading to improved performance, especially for models that can fully reside on the GPU. This change enhances efficiency by reducing redundant data movement while maintaining the benefits of sequential activation caching.

Highlights

  • New Configuration Parameter: Introduced sequential_weight_offload_device to provide granular control over weight offloading in the sequential pipeline, separate from activation offloading.
  • Performance Improvement: Allows users to disable weight offloading by setting sequential_weight_offload_device to "none", eliminating unnecessary CPU↔GPU transfers when the model fits on the GPU. Benchmarks show a ~17% speedup for TinyLlama-1.1B.
  • Backward Compatibility: The default value for sequential_weight_offload_device is "cpu", ensuring backward compatibility with existing configurations.


Changelog
  • src/llmcompressor/args/dataset_arguments.py
    • Added sequential_weight_offload_device argument to DatasetArguments.
  • src/llmcompressor/entrypoints/oneshot.py
    • Added sequential_weight_offload_device parameter to the oneshot function signature.
    • Updated the docstring for the oneshot function to describe the new sequential_weight_offload_device parameter.
  • src/llmcompressor/pipelines/sequential/helpers.py
    • Modified dispatch_for_sequential to accept onload_device and offload_device parameters.
    • Implemented logic within dispatch_for_sequential to disable weight offloading if offload_device is "none" or None.
  • src/llmcompressor/pipelines/sequential/pipeline.py
    • Updated the __call__ method to pass the new sequential_weight_offload_device from dataset_args to dispatch_for_sequential.
    • Renamed offload_device to activation_offload_device for clarity in the __call__ method.
  • tests/llmcompressor/pipelines/sequential/test_helpers.py
    • Imported pytest and dispatch_for_sequential.
    • Added a new parametrized unit test test_dispatch_for_sequential_no_offload to verify that the model remains on the onload_device when offload_device is set to None, "none", or "None".
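
Pieced together from the helpers.py bullets above, the control flow might look roughly like this pure-Python sketch (function and parameter names mirror the changelog, but the body is an assumption, not the PR's actual code):

```python
def dispatch_for_sequential_sketch(layer_names,
                                   onload_device="cuda:0",
                                   offload_device="cpu"):
    """Sketch of the described behavior: if offload_device is None or
    (case-insensitively) "none", weights stay on the onload device
    between layer forward passes; otherwise they are parked on
    offload_device and onloaded one layer at a time."""
    disabled = offload_device is None or str(offload_device).lower() == "none"
    resting_device = onload_device if disabled else offload_device
    # Map each layer to where its weights live between forward passes.
    return {name: resting_device for name in layer_names}
```

With `offload_device="none"` every layer rests on the onload device, which is what eliminates the per-layer CPU↔GPU round trips; with the default `"cpu"` the existing offloading behavior is preserved.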
Activity
  • The author has implemented a new feature to optimize weight offloading.
  • Unit tests have been added to cover the new functionality and ensure correct behavior when offloading is disabled.
  • Performance benchmarks were provided, demonstrating a significant speedup (~17%) with the new configuration.

@gemini-code-assist bot left a comment (Contributor)

Code Review

This pull request introduces a sequential_weight_offload_device parameter to allow disabling weight offloading in the sequential pipeline, which can lead to significant performance improvements. The changes are well-implemented, separating weight offloading from activation offloading and including a new unit test. My feedback includes a minor suggestion to improve the robustness of user input handling for the new parameter.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Jonathan Chang <31893406+changjonathanc@users.noreply.github.com>
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@kylesayrs left a comment (Collaborator)

Hi Jonathan.

Please see the Big Model Loading Guide.

As I mentioned in a previous comment, the intended design is that users will load models on the offload directly, rather than unnecessarily moving model weights after loading.

@Etelis is working on #2480, which seems to be the feature you're interested in?

@HDCharles
Collaborator

HDCharles commented Mar 17, 2026

@kylesayrs I don't think that's what this PR is getting at; we currently don't support enabling activation offloading while disabling weight offloading in oneshot, which is what I think this is attempting to enable.

@changjonathanc
I think it's reasonable to want to do this. However, the sequential pipeline is fundamentally about sequentially onloading/offloading the model, and in general the way to run the whole model on GPU is the basic pipeline. The basic pipeline doesn't have activation offloading, though I think it would make sense to add it as an option there, rather than removing sequential offloading from the sequential pipeline. While it may be easier in some cases to just delete the hooks and skip offloading in the sequential pipeline, sequential onloading/offloading is under heavy development between the DDP work and various offloading contexts (e.g. the big-model loading work Kyle mentioned earlier), so adding more responsibilities to this path seems like a bad idea.

Edit: after talking with Kyle, this actually isn't the case: if the model is loaded with device_map="auto", the offloading will maintain that placement, so this is already doable. I also see another PR in compressed-tensors, so I'm not sure what you're trying to do exactly.

@kylesayrs
Collaborator

Hi @changjonathanc, feel free to join the LLM Compressor Slack and we can sync about what you're trying to achieve!

@changjonathanc
Author

Thanks @kylesayrs, I see the issue now. I updated my code to put the model on the correct devices before it touches llm-compressor code, and it works. I think this PR is not needed anymore.
