
Commit a824136

[QuantizationModifier] NVFP4 bugfix -- fused layer update on all modules (#1869)
SUMMARY:

#1772 introduced a bug when running NVFP4 quantization schemes. The call to `update_fused_layer_weight_global_scales` needs to run on Attention and MLP layers, which are not included in `targets` (the quantizable layers inside Attention/MLP). This PR fixes that by running `update_fused_layer_weight_global_scales` on every module instead of only the targeted ones. That is safe because the call is idempotent and only modifies modules with NVFP4 schemes. The problem is limited to `QuantizationModifier`; AWQ cannot be used with NVFP4.

TEST PLAN:

Confirmed that the working vs. broken global scales are mismatched because the update is never run:

```
model.layers.0.self_attn.k_proj.weight_global_scale -- working 9600.0, broken 12992.0
model.layers.0.self_attn.q_proj.weight_global_scale -- working 9600.0, broken 9600.0
model.layers.0.self_attn.v_proj.weight_global_scale -- working 9600.0, broken 12160.0
```

These changes resolve the regression.

Before:

```
vllm (pretrained=/home/dsikka/llm-compressor/examples/quantization_w4a4_fp4/Qwen3-30B-A3B-NVFP4,dtype=auto,max_model_len=4096,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8135|±  |0.0107|
|     |       |strict-match    |     5|exact_match|↑  |0.8097|±  |0.0108|
```

After:

```
vllm (pretrained=/home/brian-dellabetta/projects/llm-compressor/Qwen3-30B-A3B-NVFP4,dtype=auto,max_model_len=4096,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8620|±  |0.0095|
|     |       |strict-match    |     5|exact_match|↑  |0.8575|±  |0.0096|
```

---------

Signed-off-by: Brian Dellabetta <[email protected]>
1 parent 832bce7 commit a824136
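For illustration, a minimal toy sketch of why the iteration set matters, assuming a simplified attention block. `ToyAttention` and `toy_fuse_global_scales` are hypothetical stand-ins, not the real llm-compressor / compressed-tensors implementations:

```python
import torch
import torch.nn as nn

# Toy stand-ins; module names and the simplified fusing logic are illustrative
# only and do not reproduce the real compressed-tensors implementation.
class ToyAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(8, 8)
        self.k_proj = nn.Linear(8, 8)
        self.v_proj = nn.Linear(8, 8)
        # Per-projection global scales, as if already set by update_weight_global_scale
        for proj, scale in ((self.q_proj, 9600.0),
                            (self.k_proj, 12992.0),
                            (self.v_proj, 12160.0)):
            proj.register_buffer("weight_global_scale", torch.tensor(scale))


def toy_fuse_global_scales(module: nn.Module) -> None:
    """Simplified stand-in for update_fused_layer_weight_global_scales:
    acts only on the attention parent, fusing q/k/v scales to their minimum."""
    if isinstance(module, ToyAttention):
        projs = (module.q_proj, module.k_proj, module.v_proj)
        fused = min(float(p.weight_global_scale) for p in projs)
        for p in projs:
            p.weight_global_scale.fill_(fused)


model = nn.Sequential(ToyAttention())
targets = [m for m in model.modules() if isinstance(m, nn.Linear)]

# Buggy iteration: only the targeted Linear layers are visited, so the
# attention parent is never seen and the q/k/v scales stay mismatched.
for module in targets:
    toy_fuse_global_scales(module)  # no-op for Linear modules

# Fixed iteration: every module is visited, including the attention parent,
# so the q/k/v scales are fused to their minimum (here 9600.0).
for module in model.modules():
    toy_fuse_global_scales(module)
```

In the toy version, iterating only over the targeted `nn.Linear` layers never reaches the parent attention module, so the fusing step silently does nothing; iterating over all modules, as this PR does, is what actually applies it.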

src/llmcompressor/modifiers/quantization/quantization/base.py

Lines changed: 9 additions & 2 deletions
```diff
@@ -76,11 +76,18 @@ def on_start(self, state: State, event: Event, **kwargs):
         # TODO: this step can be combined with update_weight_zp_scale
         # once update_fused_layer_weight_global_scales is removed
         # and not required by vLLM
-        for _, module in tqdm.tqdm(named_modules):
+        for _, module in tqdm.tqdm(named_modules, desc="Updating global scales"):
             update_weight_global_scale(module)
 
-        for _, module in tqdm.tqdm(named_modules, desc="Calibrating weights"):
+        # NOTE: update_fused_layer_weight_global_scales operates on Attention
+        # and MLP layers, not quantizable Linear layers. Rather than running
+        # on targeted modules, we need to run on all modules.
+        # Because this call is idempotent, setting all global_scales to the
+        # min value, it is ok to run potentially multiple times for all modules
+        for module in tqdm.tqdm(state.model.modules(), desc="Fusing global scales"):
             update_fused_layer_weight_global_scales(module)
+
+        for _, module in tqdm.tqdm(named_modules, desc="Calibrating weights"):
             update_weight_zp_scale(module)
 
     def on_event(self, state: State, event: Event, **kwargs):
```
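As a standalone illustration of the NOTE comment's idempotency argument, a minimal sketch; the fusing logic is paraphrased rather than the real compressed-tensors implementation, and the numeric values mirror the mismatched scales from the test plan:

```python
# Paraphrased fusing step: set every global scale in a fused group (e.g. the
# q/k/v projections of one attention layer) to the group minimum.
def fuse_to_min(scales: dict[str, float]) -> dict[str, float]:
    fused = min(scales.values())
    return {name: fused for name in scales}

# Hypothetical per-projection scales, mirroring the broken values above.
scales = {"q_proj": 9600.0, "k_proj": 12992.0, "v_proj": 12160.0}
once = fuse_to_min(scales)
twice = fuse_to_min(once)

# Fusing is idempotent: the minimum of already-fused (identical) values is
# unchanged, so running the pass over all modules multiple times is harmless.
assert once == twice == {"q_proj": 9600.0, "k_proj": 9600.0, "v_proj": 9600.0}
```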
