Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this label is required to run the full testing suite, so please add it only once the PR is code complete and local testing has been performed.
Summary of Changes
Hello @HDCharles, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the DDP (Distributed Data Parallel) quantization capabilities by introducing a suite of new examples for various large language models. It addresses a critical …
Code Review
This pull request introduces several new example scripts for DDP quantization and a corresponding test execution script. The additions are valuable for demonstrating distributed quantization capabilities. However, I've identified several areas for improvement across the new files, including typos, misleading naming conventions, inconsistent process group cleanup in DDP scripts, and minor style issues. The run.sh script also contains a commented-out safety feature and a misleading log message. My review includes specific suggestions to address these points and enhance the overall quality and consistency of the new examples.
I am having trouble creating individual review comments, so my feedback is listed below.
examples/quantization_w4a16/llama4_gptq_nvfp4_ddp_example.py (42-52)
There is a typo in the variable name messgages. It should be messages. This typo appears in the variable's declaration and its subsequent uses within the preprocess_function.
```python
messages = []
for message in example["messages"]:
    messages.append(
        {
            "role": message["role"],
            "content": [{"type": "text", "text": message["content"]}],
        }
    )
return processor.apply_chat_template(
    messages,
```
examples/quantization_w4a16/llama4_gptq_int4_ddp_example.py (42-52)
There is a typo in the variable name messgages. It should be messages. This typo appears in the variable's declaration and its subsequent uses within the preprocess_function.
```python
messages = []
for message in example["messages"]:
    messages.append(
        {
            "role": message["role"],
            "content": [{"type": "text", "text": message["content"]}],
        }
    )
return processor.apply_chat_template(
    messages,
```
examples/quantization_w4a16/run.sh (2)
It's a good practice to enable set -e in shell scripts. This will cause the script to exit immediately if a command exits with a non-zero status, preventing potential issues from cascading. Please consider uncommenting this line.
```shell
set -e  # Exit on error
```
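The effect of `set -e` can be seen in a minimal sketch (the commands below are hypothetical stand-ins, not taken from run.sh):

```shell
set -e                  # any unguarded command failure aborts the script here

ok="script started"
true                    # succeeds, execution continues
# an unguarded `false` on its own line here would abort the script immediately,
# before any later quantization or evaluation steps could run
false || ok="guarded failure handled"   # failures handled in a || list do not abort
echo "$ok"
```

This is why uncommenting `set -e` is a cheap safety net: a failed quantization step can no longer cascade into a misleading evaluation run.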
examples/quantization_w4a16/qwen3_30b_moe_gptq_nvfp4_ddp_example.py (113)
For consistency with other DDP examples and to ensure proper cleanup of distributed processes, it's good practice to explicitly call torch.distributed.destroy_process_group() at the end of the script.
```python
tokenizer.save_pretrained(SAVE_DIR)
torch.distributed.destroy_process_group()
```
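The init/teardown pairing this suggestion asks for can be sketched with a hypothetical stand-in for `torch.distributed` (a real script would call `torch.distributed.init_process_group()` and `torch.distributed.destroy_process_group()` instead):

```python
# FakeDist is a hypothetical stand-in used only to show the pairing; it is not
# part of torch or llm-compressor.
class FakeDist:
    def __init__(self):
        self.initialized = False

    def init_process_group(self):
        self.initialized = True

    def destroy_process_group(self):
        self.initialized = False


dist = FakeDist()
dist.init_process_group()
try:
    pass  # quantize, calibrate, save_pretrained(SAVE_DIR), ...
finally:
    dist.destroy_process_group()  # cleanup runs even if quantization raises
```

Wrapping the work in `try`/`finally` guarantees the process group is torn down even on error, which avoids dangling distributed resources.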
examples/quantization_w4a16/llama4_gptq_nvfp4_ddp_example.py (123)
For consistency with other DDP examples and to ensure proper cleanup of distributed processes, it's good practice to explicitly call torch.distributed.destroy_process_group() at the end of the script.
```python
processor.save_pretrained(SAVE_DIR)
torch.distributed.destroy_process_group()
```
examples/quantization_w4a16/qwen3_30b_moe_gptq_int4_ddp_example.py (107)
This line is quite long. For better readability, consider wrapping the SAVE_DIR definition in parentheses, as done in other example scripts.
```python
SAVE_DIR = (
    model_id.rstrip("/").split("/")[-1]
    + "-GPTQ-W4A16-G128-DDP"
    + str(torch.distributed.get_world_size())
)
```
examples/quantization_w4a16/llama3_nvfp4.py (76)
The SAVE_DIR name is misleading. The quantization scheme used is NVFP4A16, but the directory name is constructed with W4A16-G128. To avoid confusion, the directory name should accurately reflect the quantization scheme.
```python
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-NVFP4A16"
```
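The suggested naming can be illustrated with an example value (the model id below is hypothetical, not taken from the PR): the directory suffix should match the scheme actually applied (NVFP4A16), not W4A16-G128.

```python
# Example model id for illustration only.
model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Take the last path component of the model id and append the scheme tag.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-NVFP4A16"
print(SAVE_DIR)  # Llama-3.1-8B-Instruct-NVFP4A16
```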
examples/quantization_w4a16/qwen3_vl_235b_moe_gptq_int4_ddp_example.py (42)
It's a good practice to place all imports at the top of the file, following PEP 8 guidelines. The import torch statement should be moved to the top with other imports.
examples/quantization_w4a16/qwen3_vl_235b_moe_gptq_int4_ddp_example.py (51)
For consistency with other DDP examples and to ensure proper cleanup of distributed processes, it's good practice to explicitly call torch.distributed.destroy_process_group() at the end of the script.
```python
processor.save_pretrained(SAVE_DIR)
torch.distributed.destroy_process_group()
```
examples/quantization_w4a16/qwen3_vl_235b_moe_nvfp4_ddp_example.py (42)
It's a good practice to place all imports at the top of the file, following PEP 8 guidelines. The import torch statement should be moved to the top with other imports.
examples/quantization_w4a16/qwen3_vl_235b_moe_nvfp4_ddp_example.py (51)
For consistency with other DDP examples and to ensure proper cleanup of distributed processes, it's good practice to explicitly call torch.distributed.destroy_process_group() at the end of the script.
```python
processor.save_pretrained(SAVE_DIR)
torch.distributed.destroy_process_group()
```
examples/quantization_w4a16/qwen3_vl_8b_gptq_int4_ddp_example.py (157)
For consistency with other DDP examples and to ensure proper cleanup of distributed processes, it's good practice to explicitly call torch.distributed.destroy_process_group() at the end of the script.
```python
processor.save_pretrained(SAVE_DIR)
torch.distributed.destroy_process_group()
```
examples/quantization_w4a16/llama4_gptq_int4_ddp_example.py (123)
For consistency with other DDP examples and to ensure proper cleanup of distributed processes, it's good practice to explicitly call torch.distributed.destroy_process_group() at the end of the script.
```python
processor.save_pretrained(SAVE_DIR)
torch.distributed.destroy_process_group()
```
examples/quantization_w4a16/run.sh (55)
This log message appears to be a copy-paste error from the previous block. It should indicate that the evaluation is being retried with hf since the vllm evaluation failed.
```shell
echo "Evaluation with vllm failed, retrying with hf..."
```
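The intended fallback shape can be sketched as follows; `evaluate` is a hypothetical stand-in for the real vllm/hf evaluation commands that run.sh invokes:

```shell
# Stand-in that pretends the vllm backend fails and the hf backend succeeds.
evaluate() {
    [ "$1" = "hf" ]
}

if evaluate "vllm"; then
    msg="Evaluation with vllm succeeded"
else
    msg="Evaluation with vllm failed, retrying with hf..."
    evaluate "hf" && msg="$msg done"
fi
echo "$msg"
```

Keeping the log message in sync with the branch it lives in makes failed CI runs much easier to read.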
The quality checks have failed. Please run …
kylesayrs
left a comment
We should definitely put these in a separate folder, right?
The quality checks have failed. Please run …
This pull request has merge conflicts that must be resolved before it can be merged.
dsikka
left a comment
FYI, NVFP4 examples should go here:
https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w4a4_fp4
NVFP4A16 examples should go here:
https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w4a16_fp4/nvfp4
Force-pushed dc55eb0 to 596c6b7
Summary Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Force-pushed 8df7650 to a1f1335
Force-pushed 1b9e6b3 to c541dff
brian-dellabetta
left a comment
LGTM, one question about also adding to the new norm calibration context as well
```python
    calibrate_all_experts=calibrate_all_experts,
)
# Apply the same offloading settings as the original module
_apply_offloading_to_replacement(module, replacement)
```
do we need to do this to the (just merged) norm context as well?
kylesayrs
left a comment
FYI when we transition to vllm-project/compressed-tensors#624, this will be unnecessary as the algorithm optimally handles applying offloads directly

Big models were failing with DDP for a few reasons, primarily related to overloading shared memory or having too many mmaps.
This was mainly an issue with DDP + CPU offloading, but even with disk offloading, the MoE context would not use the same offloading as the original module; it would revert to CPU offloading and cause problems.
Additionally, storing all the original modules could still cause mmap issues, so those are now stored only when needed.
Finally, when saving, I saw situations where one thread would do the saving while another thread would run past it and then time out, so I added a barrier there.
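The save-time race described above is the classic reason for a barrier: the saving rank writes the checkpoint while the other ranks must not run ahead into teardown. A minimal stdlib sketch of the pattern, using `threading.Barrier` and threads as stand-ins for `torch.distributed.barrier()` and DDP ranks:

```python
import threading

WORLD_SIZE = 2          # stand-in for torch.distributed.get_world_size()
saved = []
past_barrier = []
barrier = threading.Barrier(WORLD_SIZE)  # stand-in for torch.distributed.barrier()

def worker(rank):
    if rank == 0:
        saved.append(rank)        # stand-in for processor.save_pretrained(...)
    barrier.wait()                # no rank can run past the save and time out
    past_barrier.append(rank)     # teardown is safe only after this point

threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every rank blocks at `barrier.wait()` until rank 0 arrives (after saving), no rank can reach teardown while the checkpoint is still being written.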
This PR depends on vllm-project/compressed-tensors#650.
Test plan: 596c6b7