
[DDP][GPTQ] Fixes for big models#2400

Open
HDCharles wants to merge 11 commits into main from
92_ddp_fixes_and_testing

Conversation

@HDCharles
Collaborator

@HDCharles HDCharles commented Feb 24, 2026

Big models were failing with DDP for a few reasons, primarily related to overloading shared memory or having too many mmaps.

This was mostly an issue with DDP + CPU offloading, but even with disk offloading, the MoE context would not use the same offloading as the original module; it would revert to CPU offloading and cause problems.

Additionally, storing all the original modules could still cause mmap issues, so those are now only stored when needed.

Finally, when saving, I saw situations where one rank would do the saving while another rank went past it and then timed out, so I added a barrier there.

This PR depends on vllm-project/compressed-tensors#650.

Test plan: 596c6b7
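The barrier fix described above follows this shape (a single-process sketch using the gloo backend so it runs anywhere; the actual change lives in the PR diff, and the save call is only indicated in a comment):

```python
import os

import torch.distributed as dist

# Single-process stand-in for a multi-rank job; in the real run every rank
# executes this code and world_size > 1.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29611")
dist.init_process_group("gloo", rank=0, world_size=1)

if dist.get_rank() == 0:
    # rank 0 would call model.save_pretrained(SAVE_DIR) here
    pass

# Every rank waits here, so no rank can race past the save and hit a
# collective timeout while rank 0 is still writing shards to disk.
dist.barrier()
dist.destroy_process_group()
```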

@HDCharles HDCharles marked this pull request as draft February 24, 2026 18:26
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @HDCharles, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the DDP (Distributed Data Parallel) quantization capabilities by introducing a suite of new examples for various large language models. It addresses a critical nccl fp8 broadcasting issue that previously hindered DDP with NVFP4 quantization, ensuring robust distributed training and inference. The added examples cover both INT4 and NVFP4 schemes, and an accompanying automation script simplifies their execution and evaluation, making it easier to test and validate these advanced quantization techniques.

Highlights

  • Expanded DDP Quantization Examples: Introduced new Distributed Data Parallel (DDP) examples for GPTQ quantization across a variety of large language models, including Llama4, Qwen3-30B-MoE, Qwen3-VL-235B-MoE, and Qwen3-VL-8B.
  • Support for INT4 and NVFP4 Schemes: Implemented DDP examples for both INT4 (W4A16) and NVFP4 quantization schemes, demonstrating broad compatibility.
  • NCCL FP8 Broadcasting Workaround: Included a workaround to address the nccl fp8 broadcasting requirement, enabling DDP with NVFP4 quantization.
  • Automated Testing and Evaluation Script: Added a run.sh script to automate the execution and evaluation of the new DDP quantization examples, streamlining testing workflows.
  • DDP Integration: Integrated DDP setup components such as init_dist, load_offloaded_model, and get_rank_partition into the quantization examples for efficient distributed processing.
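The per-rank data partitioning mentioned in the last highlight can be sketched as follows (a hypothetical stand-in for `get_rank_partition`; the real helper's signature is in the PR diff):

```python
def get_rank_partition(dataset, rank, world_size):
    """Give each rank a disjoint contiguous slice of the calibration set.

    Hypothetical sketch, not llm-compressor's actual implementation.
    """
    per_rank = len(dataset) // world_size
    start = rank * per_rank
    # the last rank absorbs any remainder so every sample is covered
    end = start + per_rank if rank < world_size - 1 else len(dataset)
    return dataset[start:end]
```

Each rank then calibrates only on its own slice, which is what makes the GPTQ pass data-parallel.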


Changelog
  • examples/quantization_w4a16/llama3_nvfp4.py
    • Added a new example script for Llama3 NVFP4 quantization.
  • examples/quantization_w4a16/llama4_gptq_int4_ddp_example.py
    • Added a DDP example for Llama4 GPTQ INT4 quantization.
  • examples/quantization_w4a16/llama4_gptq_nvfp4_ddp_example.py
    • Added a DDP example for Llama4 GPTQ NVFP4 quantization.
  • examples/quantization_w4a16/qwen3_30b_moe_gptq_int4_ddp_example.py
    • Added a DDP example for Qwen3-30B-MoE GPTQ INT4 quantization, including specific ignore patterns for MoE gate layers.
  • examples/quantization_w4a16/qwen3_30b_moe_gptq_nvfp4_ddp_example.py
    • Added a DDP example for Qwen3-30B-MoE GPTQ NVFP4 quantization, including specific ignore patterns for MoE gate layers.
  • examples/quantization_w4a16/qwen3_vl_235b_moe_gptq_int4_ddp_example.py
    • Added a DDP example for data-free Qwen3-VL-235B-MoE GPTQ INT4 quantization, with ignore patterns for visual and MoE layers.
  • examples/quantization_w4a16/qwen3_vl_235b_moe_nvfp4_ddp_example.py
    • Added a DDP example for data-free Qwen3-VL-235B-MoE GPTQ NVFP4 quantization, with ignore patterns for visual and MoE layers.
  • examples/quantization_w4a16/qwen3_vl_8b_gptq_int4_ddp_example.py
    • Added a DDP example for Qwen3-VL-8B GPTQ INT4 quantization, including multimodal data handling and sequential target processing.
  • examples/quantization_w4a16/qwen3_vl_8b_gptq_nvfp4_ddp_example.py
    • Added a DDP example for Qwen3-VL-8B GPTQ NVFP4 quantization, including multimodal data handling and sequential target processing.
  • examples/quantization_w4a16/run.sh
    • Added a shell script to automate the execution and evaluation of the quantization examples, supporting both single and multi-GPU setups.
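Several of the MoE examples listed above pass ignore patterns for gate layers. The matching convention those patterns use can be sketched as follows (the `re:` prefix marks a regex; the helper name and list here are illustrative, not llm-compressor internals):

```python
import re

# Illustrative ignore list, in the style of the MoE example recipes
IGNORE = ["lm_head", "re:.*mlp.gate$"]

def is_ignored(name: str) -> bool:
    """Return True if a module name matches an ignore entry (sketch)."""
    for pattern in IGNORE:
        if pattern.startswith("re:"):
            # "re:"-prefixed entries are matched as regular expressions
            if re.match(pattern[3:], name):
                return True
        elif name == pattern:
            # plain entries are exact-name matches
            return True
    return False
```

Under this scheme the MoE router (`...mlp.gate`) is skipped while expert projections like `...mlp.experts.0.gate_proj` are still quantized.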
Activity
  • The author is currently testing DDP functionality.
  • A workaround was added to address nccl's requirement for fp8 broadcasting to enable DDP with nvfp4 quantization.

@mergify mergify bot added the documentation Improvements or additions to documentation label Feb 24, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces several new example scripts for DDP quantization and a corresponding test execution script. The additions are valuable for demonstrating distributed quantization capabilities. However, I've identified several areas for improvement across the new files, including typos, misleading naming conventions, inconsistent process group cleanup in DDP scripts, and minor style issues. The run.sh script also contains a commented-out safety feature and a misleading log message. My review includes specific suggestions to address these points and enhance the overall quality and consistency of the new examples.

I was unable to create individual review comments, so my feedback is listed below.

examples/quantization_w4a16/llama4_gptq_nvfp4_ddp_example.py (42-52)

high

There is a typo in the variable name messgages. It should be messages. This typo appears in the variable's declaration and its subsequent uses within the preprocess_function.

    messages = []
    for message in example["messages"]:
        messages.append(
            {
                "role": message["role"],
                "content": [{"type": "text", "text": message["content"]}],
            }
        )

    return processor.apply_chat_template(
        messages,

examples/quantization_w4a16/llama4_gptq_int4_ddp_example.py (42-52)

high

There is a typo in the variable name messgages. It should be messages. This typo appears in the variable's declaration and its subsequent uses within the preprocess_function.

    messages = []
    for message in example["messages"]:
        messages.append(
            {
                "role": message["role"],
                "content": [{"type": "text", "text": message["content"]}],
            }
        )

    return processor.apply_chat_template(
        messages,

examples/quantization_w4a16/run.sh (2)

high

It's a good practice to enable set -e in shell scripts. This will cause the script to exit immediately if a command exits with a non-zero status, preventing potential issues from cascading. Please consider uncommenting this line.

set -e  # Exit on error

examples/quantization_w4a16/qwen3_30b_moe_gptq_nvfp4_ddp_example.py (113)

medium

For consistency with other DDP examples and to ensure proper cleanup of distributed processes, it's good practice to explicitly call torch.distributed.destroy_process_group() at the end of the script.

tokenizer.save_pretrained(SAVE_DIR)
torch.distributed.destroy_process_group()

examples/quantization_w4a16/llama4_gptq_nvfp4_ddp_example.py (123)

medium

For consistency with other DDP examples and to ensure proper cleanup of distributed processes, it's good practice to explicitly call torch.distributed.destroy_process_group() at the end of the script.

processor.save_pretrained(SAVE_DIR)
torch.distributed.destroy_process_group()

examples/quantization_w4a16/qwen3_30b_moe_gptq_int4_ddp_example.py (107)

medium

This line is quite long. For better readability, consider wrapping the SAVE_DIR definition in parentheses, as done in other example scripts.

SAVE_DIR = (
    model_id.rstrip("/").split("/")[-1]
    + "-GPTQ-W4A16-G128-DDP"
    + str(torch.distributed.get_world_size())
)

examples/quantization_w4a16/llama3_nvfp4.py (76)

medium

The SAVE_DIR name is misleading. The quantization scheme used is NVFP4A16, but the directory name is constructed with W4A16-G128. To avoid confusion, the directory name should accurately reflect the quantization scheme.

SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-NVFP4A16"

examples/quantization_w4a16/qwen3_vl_235b_moe_gptq_int4_ddp_example.py (42)

medium

It's a good practice to place all imports at the top of the file, following PEP 8 guidelines. The import torch statement should be moved to the top with other imports.

examples/quantization_w4a16/qwen3_vl_235b_moe_gptq_int4_ddp_example.py (51)

medium

For consistency with other DDP examples and to ensure proper cleanup of distributed processes, it's good practice to explicitly call torch.distributed.destroy_process_group() at the end of the script.

processor.save_pretrained(SAVE_DIR)
torch.distributed.destroy_process_group()

examples/quantization_w4a16/qwen3_vl_235b_moe_nvfp4_ddp_example.py (42)

medium

It's a good practice to place all imports at the top of the file, following PEP 8 guidelines. The import torch statement should be moved to the top with other imports.

examples/quantization_w4a16/qwen3_vl_235b_moe_nvfp4_ddp_example.py (51)

medium

For consistency with other DDP examples and to ensure proper cleanup of distributed processes, it's good practice to explicitly call torch.distributed.destroy_process_group() at the end of the script.

processor.save_pretrained(SAVE_DIR)
torch.distributed.destroy_process_group()

examples/quantization_w4a16/qwen3_vl_8b_gptq_int4_ddp_example.py (157)

medium

For consistency with other DDP examples and to ensure proper cleanup of distributed processes, it's good practice to explicitly call torch.distributed.destroy_process_group() at the end of the script.

processor.save_pretrained(SAVE_DIR)
torch.distributed.destroy_process_group()

examples/quantization_w4a16/llama4_gptq_int4_ddp_example.py (123)

medium

For consistency with other DDP examples and to ensure proper cleanup of distributed processes, it's good practice to explicitly call torch.distributed.destroy_process_group() at the end of the script.

processor.save_pretrained(SAVE_DIR)
torch.distributed.destroy_process_group()

examples/quantization_w4a16/run.sh (55)

medium

This log message appears to be a copy-paste error from the previous block. It should indicate that the evaluation is being retried with hf since the vllm evaluation failed.

            echo "Evaluation with vllm failed, retrying with hf..."

@mergify
Contributor

mergify bot commented Feb 24, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

Collaborator

@kylesayrs kylesayrs left a comment


We should definitely put these in a separate folder, right?

@HDCharles
Collaborator Author

it's a draft!

[image]

@mergify mergify bot removed the quality-failed label Feb 25, 2026
@mergify
Contributor

mergify bot commented Feb 25, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@mergify
Contributor

mergify bot commented Feb 25, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @HDCharles.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 25, 2026
Collaborator

@dsikka dsikka left a comment


@HDCharles HDCharles force-pushed the 92_ddp_fixes_and_testing branch from dc55eb0 to 596c6b7 Compare March 24, 2026 21:11
@mergify mergify bot removed the quality-failed label Mar 24, 2026
@HDCharles HDCharles changed the title [DDP][GPTQ] Fixes and Testing [DDP][GPTQ] Fixes for big models Mar 25, 2026
Summary

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
@HDCharles HDCharles force-pushed the 92_ddp_fixes_and_testing branch from 8df7650 to a1f1335 Compare March 25, 2026 18:06
@HDCharles HDCharles marked this pull request as ready for review March 25, 2026 18:07
@mergify mergify bot removed the needs-rebase label Mar 25, 2026
@HDCharles HDCharles force-pushed the 92_ddp_fixes_and_testing branch from 1b9e6b3 to c541dff Compare March 25, 2026 18:29
@HDCharles HDCharles added ready When a PR is ready for review dist Work pertaining to distributed work and removed documentation Improvements or additions to documentation labels Mar 25, 2026
Collaborator

@brian-dellabetta brian-dellabetta left a comment


LGTM, one question about also adding to the new norm calibration context as well

calibrate_all_experts=calibrate_all_experts,
)
# Apply the same offloading settings as the original module
_apply_offloading_to_replacement(module, replacement)
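For context, what the quoted `_apply_offloading_to_replacement` call conceptually does can be sketched in plain Python (the hook class and the `_hf_hook` attribute name mirror accelerate's offload-hook convention, but both are stand-ins here, not the PR's implementation):

```python
class FakeOffloadHook:
    """Stand-in for accelerate's AlignDevicesHook (assumption for this sketch)."""

    def __init__(self, offload: bool, execution_device: str):
        self.offload = offload
        self.execution_device = execution_device


def _apply_offloading_to_replacement(original, replacement):
    # Carry the original module's offload hook over to its MoE-context
    # replacement, so the replacement keeps the same (e.g. disk) offloading
    # instead of silently reverting to cpu offloading.
    hook = getattr(original, "_hf_hook", None)
    if hook is not None:
        replacement._hf_hook = hook


class Module:  # trivial stand-in for torch.nn.Module
    pass


orig_module, repl_module = Module(), Module()
orig_module._hf_hook = FakeOffloadHook(offload=True, execution_device="cuda:0")
_apply_offloading_to_replacement(orig_module, repl_module)
```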
Collaborator


do we need to do this to the (just merged) norm context as well?

Collaborator

@kylesayrs kylesayrs left a comment


FYI when we transition to vllm-project/compressed-tensors#624, this will be unnecessary as the algorithm optimally handles applying offloads directly
