@brian-dellabetta brian-dellabetta commented Sep 26, 2025

SUMMARY:
#1772 introduced a bug when running NVFP4 quantization schemes. update_fused_layer_weight_global_scales must be called on the Attention and MLP parent layers, but targets contains only the quantizable layers inside Attention/MLP, so the update was never applied. This PR fixes that by running update_fused_layer_weight_global_scales on every module rather than only the targeted ones (a sketch of the change follows), which is safe because the call is idempotent and only modifies modules with NVFP4 schemes. This only affects QuantizationModifier; AWQ cannot be used with NVFP4.
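
A minimal sketch of the shape of the fix, not the verbatim diff; the helper is passed in as an argument because its exact import path inside llm-compressor is not shown here:

```python
# Minimal sketch of the fix, not the verbatim diff: visit every module so that
# Attention/MLP parent layers are reached, relying on the helper being an
# idempotent no-op for modules without an NVFP4 scheme.
import torch
from tqdm import tqdm

def update_all_fused_layer_global_scales(model: torch.nn.Module, update_fn) -> None:
    # Previously only the targeted quantizable leaves (e.g. q/k/v projections)
    # were visited; model.modules() also yields their Attention/MLP parents.
    for module in tqdm(model.modules(), desc="Updating fused-layer global scales"):
        update_fn(module)
```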

TEST PLAN:
Confirmed that the global scales are mismatched when the update is never run (working vs. broken values below; a hypothetical spot-check is sketched after the table):

|Parameter|Working|Broken|
|---------|------:|-----:|
|model.layers.0.self_attn.k_proj.weight_global_scale|9600.0|12992.0|
|model.layers.0.self_attn.q_proj.weight_global_scale|9600.0|9600.0|
|model.layers.0.self_attn.v_proj.weight_global_scale|9600.0|12160.0|
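
A hypothetical spot-check along these lines (module layout and attribute names are assumptions inferred from the parameter names above):

```python
# Hypothetical spot-check, not part of the PR: after the fused-layer update,
# the q/k/v projections of one attention block should share a single
# weight_global_scale, as in the "working" column above.
def check_fused_qkv_scales(model, layer_idx: int = 0) -> dict:
    attn = model.model.layers[layer_idx].self_attn  # layout assumed from the names above
    scales = {
        name: getattr(attn, name).weight_global_scale.item()
        for name in ("q_proj", "k_proj", "v_proj")
    }
    assert len(set(scales.values())) == 1, f"q/k/v global scales differ: {scales}"
    return scales
```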

And these changes resolve the regression:
Before

vllm (pretrained=/home/dsikka/llm-compressor/examples/quantization_w4a4_fp4/Qwen3-30B-A3B-NVFP4,dtype=auto,max_model_len=4096,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8135|±  |0.0107|
|     |       |strict-match    |     5|exact_match|↑  |0.8097|±  |0.0108|

After

vllm (pretrained=/home/brian-dellabetta/projects/llm-compressor/Qwen3-30B-A3B-NVFP4,dtype=auto,max_model_len=4096,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8620|±  |0.0095|
|     |       |strict-match    |     5|exact_match|↑  |0.8575|±  |0.0096|
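
For anyone reproducing these numbers, a hedged sketch via the lm-eval Python API; the local model path is a placeholder, and this mirrors the header line above rather than the exact command that was run:

```python
# Sketch of reproducing the GSM8K eval above with lm-eval's Python API.
# The pretrained path is a placeholder; point it at the local NVFP4 checkpoint.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=./Qwen3-30B-A3B-NVFP4,"
        "dtype=auto,max_model_len=4096,add_bos_token=True"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```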


👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.


Summary of Changes

Hello @brian-dellabetta, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug within the QuantizationModifier that impacted NVFP4 quantization schemes. The issue stemmed from an incorrect application scope for updating fused layer weight global scales, which led to performance degradation in models using NVFP4. By broadening the application of this update function to all model modules, the PR ensures accurate global scale calibration, thereby restoring and improving the quantization quality and model performance, as evidenced by improved GSM8K benchmark scores.

Highlights

  • Bug Fix for NVFP4 Quantization: Addresses a regression introduced in [Multi-modifier] Support scoped application of quantization config/status #1772 where the update_fused_layer_weight_global_scales function was not correctly applied to Attention and MLP layers during NVFP4 quantization, leading to incorrect global scales and degraded model performance.
  • Expanded Scope of Global Scale Update: The update_fused_layer_weight_global_scales function is now applied to all modules in the model, rather than only targeted quantizable layers. This ensures correct application to Attention and MLP layers, leveraging the function's idempotent nature to safely update scales across the entire model.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a bug in QuantizationModifier related to NVFP4 quantization schemes. The change ensures update_fused_layer_weight_global_scales is called on all modules, not just the targeted ones, which is necessary for Attention and MLP layers. The fix is sound, as the function is idempotent. My review includes a minor suggestion to add a progress bar for better user experience, consistent with the surrounding code.

@brian-dellabetta brian-dellabetta added the ready When a PR is ready for review label Sep 26, 2025
@brian-dellabetta brian-dellabetta changed the title [QuantizationModifier] bugfix -- fused layer update on all modules [QuantizationModifier] NVFP4 bugfix -- fused layer update on all modules Sep 26, 2025
Signed-off-by: Brian Dellabetta <[email protected]>

@shanjiaz shanjiaz left a comment


Looks good to me! Thanks for fixing!

@brian-dellabetta brian-dellabetta merged commit a824136 into main Sep 26, 2025
8 checks passed
@brian-dellabetta brian-dellabetta deleted the bdellabe/quant-modifier-nvfp4-fix branch September 26, 2025 17:27