Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this label is required to run the full testing suite, so please add it only once the PR is code complete and local testing has been performed.
Summary of Changes
Hello @HDCharles, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the DDP (Distributed Data Parallel) quantization capabilities by introducing a suite of new examples for various large language models. It addresses a critical …
Code Review
This pull request introduces several new example scripts for DDP quantization and a corresponding test execution script. The additions are valuable for demonstrating distributed quantization capabilities. However, I've identified several areas for improvement across the new files, including typos, misleading naming conventions, inconsistent process group cleanup in DDP scripts, and minor style issues. The run.sh script also contains a commented-out safety feature and a misleading log message. My review includes specific suggestions to address these points and enhance the overall quality and consistency of the new examples.
I am having trouble creating individual review comments, so my feedback is listed below.
examples/quantization_w4a16/llama4_gptq_nvfp4_ddp_example.py (42-52)
There is a typo in the variable name messgages. It should be messages. This typo appears in the variable's declaration and its subsequent uses within the preprocess_function.
```python
messages = []
for message in example["messages"]:
    messages.append(
        {
            "role": message["role"],
            "content": [{"type": "text", "text": message["content"]}],
        }
    )
return processor.apply_chat_template(
    messages,
```
examples/quantization_w4a16/llama4_gptq_int4_ddp_example.py (42-52)
There is a typo in the variable name messgages. It should be messages. This typo appears in the variable's declaration and its subsequent uses within the preprocess_function.
```python
messages = []
for message in example["messages"]:
    messages.append(
        {
            "role": message["role"],
            "content": [{"type": "text", "text": message["content"]}],
        }
    )
return processor.apply_chat_template(
    messages,
```
examples/quantization_w4a16/run.sh (2)
It's a good practice to enable set -e in shell scripts. This will cause the script to exit immediately if a command exits with a non-zero status, preventing potential issues from cascading. Please consider uncommenting this line.
```shell
set -e  # Exit on error
```
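The effect of `set -e` can be seen in a minimal sketch (the commands below are hypothetical stand-ins, not taken from run.sh):

```shell
set -e                  # any unguarded command failure aborts the script here

ok="script started"
true                    # succeeds, execution continues
# an unguarded `false` on its own line here would abort the script immediately,
# before any later quantization or evaluation steps could run
false || ok="guarded failure handled"   # failures handled in a || list do not abort
echo "$ok"
```

This is why uncommenting `set -e` is a cheap safety net: a failed quantization step can no longer cascade into a misleading evaluation run.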
examples/quantization_w4a16/qwen3_30b_moe_gptq_nvfp4_ddp_example.py (113)
For consistency with other DDP examples and to ensure proper cleanup of distributed processes, it's good practice to explicitly call torch.distributed.destroy_process_group() at the end of the script.
```python
tokenizer.save_pretrained(SAVE_DIR)
torch.distributed.destroy_process_group()
```
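The init/teardown pairing this suggestion asks for can be sketched with a hypothetical stand-in for `torch.distributed` (a real script would call `torch.distributed.init_process_group()` and `torch.distributed.destroy_process_group()` instead):

```python
# FakeDist is a hypothetical stand-in used only to show the pairing; it is not
# part of torch or llm-compressor.
class FakeDist:
    def __init__(self):
        self.initialized = False

    def init_process_group(self):
        self.initialized = True

    def destroy_process_group(self):
        self.initialized = False


dist = FakeDist()
dist.init_process_group()
try:
    pass  # quantize, calibrate, save_pretrained(SAVE_DIR), ...
finally:
    dist.destroy_process_group()  # cleanup runs even if quantization raises
```

Wrapping the work in `try`/`finally` guarantees the process group is torn down even on error, which avoids dangling distributed resources.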
examples/quantization_w4a16/llama4_gptq_nvfp4_ddp_example.py (123)
For consistency with other DDP examples and to ensure proper cleanup of distributed processes, it's good practice to explicitly call torch.distributed.destroy_process_group() at the end of the script.
```python
processor.save_pretrained(SAVE_DIR)
torch.distributed.destroy_process_group()
```
examples/quantization_w4a16/qwen3_30b_moe_gptq_int4_ddp_example.py (107)
This line is quite long. For better readability, consider wrapping the SAVE_DIR definition in parentheses, as done in other example scripts.
```python
SAVE_DIR = (
    model_id.rstrip("/").split("/")[-1]
    + "-GPTQ-W4A16-G128-DDP"
    + str(torch.distributed.get_world_size())
)
```
examples/quantization_w4a16/llama3_nvfp4.py (76)
The SAVE_DIR name is misleading. The quantization scheme used is NVFP4A16, but the directory name is constructed with W4A16-G128. To avoid confusion, the directory name should accurately reflect the quantization scheme.
```python
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-NVFP4A16"
```
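The suggested naming can be illustrated with an example value (the model id below is hypothetical, not taken from the PR): the directory suffix should match the scheme actually applied (NVFP4A16), not W4A16-G128.

```python
# Example model id for illustration only.
model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Take the last path component of the model id and append the scheme tag.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-NVFP4A16"
print(SAVE_DIR)  # Llama-3.1-8B-Instruct-NVFP4A16
```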
examples/quantization_w4a16/qwen3_vl_235b_moe_gptq_int4_ddp_example.py (42)
It's a good practice to place all imports at the top of the file, following PEP 8 guidelines. The import torch statement should be moved to the top with other imports.
examples/quantization_w4a16/qwen3_vl_235b_moe_gptq_int4_ddp_example.py (51)
For consistency with other DDP examples and to ensure proper cleanup of distributed processes, it's good practice to explicitly call torch.distributed.destroy_process_group() at the end of the script.
```python
processor.save_pretrained(SAVE_DIR)
torch.distributed.destroy_process_group()
```
examples/quantization_w4a16/qwen3_vl_235b_moe_nvfp4_ddp_example.py (42)
It's a good practice to place all imports at the top of the file, following PEP 8 guidelines. The import torch statement should be moved to the top with other imports.
examples/quantization_w4a16/qwen3_vl_235b_moe_nvfp4_ddp_example.py (51)
For consistency with other DDP examples and to ensure proper cleanup of distributed processes, it's good practice to explicitly call torch.distributed.destroy_process_group() at the end of the script.
```python
processor.save_pretrained(SAVE_DIR)
torch.distributed.destroy_process_group()
```
examples/quantization_w4a16/qwen3_vl_8b_gptq_int4_ddp_example.py (157)
For consistency with other DDP examples and to ensure proper cleanup of distributed processes, it's good practice to explicitly call torch.distributed.destroy_process_group() at the end of the script.
```python
processor.save_pretrained(SAVE_DIR)
torch.distributed.destroy_process_group()
```
examples/quantization_w4a16/llama4_gptq_int4_ddp_example.py (123)
For consistency with other DDP examples and to ensure proper cleanup of distributed processes, it's good practice to explicitly call torch.distributed.destroy_process_group() at the end of the script.
```python
processor.save_pretrained(SAVE_DIR)
torch.distributed.destroy_process_group()
```
examples/quantization_w4a16/run.sh (55)
This log message appears to be a copy-paste error from the previous block. It should indicate that the evaluation is being retried with hf since the vllm evaluation failed.
```shell
echo "Evaluation with vllm failed, retrying with hf..."
```
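The intended fallback shape can be sketched as follows; `evaluate` is a hypothetical stand-in for the real vllm/hf evaluation commands that run.sh invokes:

```shell
# Stand-in that pretends the vllm backend fails and the hf backend succeeds.
evaluate() {
    [ "$1" = "hf" ]
}

if evaluate "vllm"; then
    msg="Evaluation with vllm succeeded"
else
    msg="Evaluation with vllm failed, retrying with hf..."
    evaluate "hf" && msg="$msg done"
fi
echo "$msg"
```

Keeping the log message in sync with the branch it lives in makes failed CI runs much easier to read.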
The quality checks have failed. Please run …
kylesayrs
left a comment
We should definitely put these in a separate folder, right?
The quality checks have failed. Please run …
This pull request has merge conflicts that must be resolved before it can be merged.
dsikka
left a comment
FYI, NVFP4 examples should go here:
https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w4a4_fp4
NVFP4A16 examples should go here:
https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w4a16_fp4/nvfp4
Force-pushed dc55eb0 to 596c6b7
Summary Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Force-pushed 8df7650 to a1f1335
Force-pushed 1b9e6b3 to c541dff
brian-dellabetta
left a comment
LGTM, one question about also adding to the new norm calibration context as well
```python
    calibrate_all_experts=calibrate_all_experts,
)
# Apply the same offloading settings as the original module
_apply_offloading_to_replacement(module, replacement)
```
do we need to do this to the (just merged) norm context as well?
kylesayrs
left a comment
FYI when we transition to vllm-project/compressed-tensors#624, this will be unnecessary as the algorithm optimally handles applying offloads directly

Big models were failing with DDP for a few reasons, primarily related to overloading shared memory or having too many mmaps.
This was mainly an issue with DDP + CPU offloading, but even with disk offloading, the MoE context would not use the same offloading as the original module; it would revert to CPU offloading and cause problems.
Additionally, storing all the original modules could still cause mmap issues, so those are now stored only when needed.
Finally, when saving, I saw situations where one thread would do the saving while another thread would run past it and then time out, so I added a barrier there.
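The save-time race described above is the classic reason for a barrier: the saving rank writes the checkpoint while the other ranks must not run ahead into teardown. A minimal stdlib sketch of the pattern, using `threading.Barrier` and threads as stand-ins for `torch.distributed.barrier()` and DDP ranks:

```python
import threading

WORLD_SIZE = 2          # stand-in for torch.distributed.get_world_size()
saved = []
past_barrier = []
barrier = threading.Barrier(WORLD_SIZE)  # stand-in for torch.distributed.barrier()

def worker(rank):
    if rank == 0:
        saved.append(rank)        # stand-in for processor.save_pretrained(...)
    barrier.wait()                # no rank can run past the save and time out
    past_barrier.append(rank)     # teardown is safe only after this point

threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every rank blocks at `barrier.wait()` until rank 0 arrives (after saving), no rank can reach teardown while the checkpoint is still being written.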
This PR depends on vllm-project/compressed-tensors#650.
Test plan: 596c6b7