
Commit a824136

[QuantizationModifier] NVFP4 bugfix -- fused layer update on all modules (#1869)
SUMMARY:

#1772 introduced a bug when running NVFP4 quantization schemes. The call to `update_fused_layer_weight_global_scales` needs to run on Attention and MLP layers, which are not included in `targets` (the quantizable layers inside Attention/MLP). This PR fixes that by running `update_fused_layer_weight_global_scales` on every module instead of only the targeted ones. That is safe because the call is idempotent and only modifies modules with NVFP4 schemes. The problem is limited to `QuantizationModifier`; AWQ cannot be used with NVFP4.

TEST PLAN:

Confirmed that the working vs. broken global scales are mismatched because the update is never run:

```
model.layers.0.self_attn.k_proj.weight_global_scale -- working 9600.0, broken 12992.0
model.layers.0.self_attn.q_proj.weight_global_scale -- working 9600.0, broken 9600.0
model.layers.0.self_attn.v_proj.weight_global_scale -- working 9600.0, broken 12160.0
```

These changes resolve the regression.

Before:

```
vllm (pretrained=/home/dsikka/llm-compressor/examples/quantization_w4a4_fp4/Qwen3-30B-A3B-NVFP4,dtype=auto,max_model_len=4096,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8135|±  |0.0107|
|     |       |strict-match    |     5|exact_match|↑  |0.8097|±  |0.0108|
```

After:

```
vllm (pretrained=/home/brian-dellabetta/projects/llm-compressor/Qwen3-30B-A3B-NVFP4,dtype=auto,max_model_len=4096,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8620|±  |0.0095|
|     |       |strict-match    |     5|exact_match|↑  |0.8575|±  |0.0096|
```

---------

Signed-off-by: Brian Dellabetta <[email protected]>
1 parent 832bce7 commit a824136
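For illustration, a minimal toy sketch of why the iteration set matters, assuming a simplified attention block. `ToyAttention` and `toy_fuse_global_scales` are hypothetical stand-ins, not the real llm-compressor / compressed-tensors implementations:

```python
import torch
import torch.nn as nn

# Toy stand-ins; module names and the simplified fusing logic are illustrative
# only and do not reproduce the real compressed-tensors implementation.
class ToyAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(8, 8)
        self.k_proj = nn.Linear(8, 8)
        self.v_proj = nn.Linear(8, 8)
        # Per-projection global scales, as if already set by update_weight_global_scale
        for proj, scale in ((self.q_proj, 9600.0),
                            (self.k_proj, 12992.0),
                            (self.v_proj, 12160.0)):
            proj.register_buffer("weight_global_scale", torch.tensor(scale))


def toy_fuse_global_scales(module: nn.Module) -> None:
    """Simplified stand-in for update_fused_layer_weight_global_scales:
    acts only on the attention parent, fusing q/k/v scales to their minimum."""
    if isinstance(module, ToyAttention):
        projs = (module.q_proj, module.k_proj, module.v_proj)
        fused = min(float(p.weight_global_scale) for p in projs)
        for p in projs:
            p.weight_global_scale.fill_(fused)


model = nn.Sequential(ToyAttention())
targets = [m for m in model.modules() if isinstance(m, nn.Linear)]

# Buggy iteration: only the targeted Linear layers are visited, so the
# attention parent is never seen and the q/k/v scales stay mismatched.
for module in targets:
    toy_fuse_global_scales(module)  # no-op for Linear modules

# Fixed iteration: every module is visited, including the attention parent,
# so the q/k/v scales are fused to their minimum (here 9600.0).
for module in model.modules():
    toy_fuse_global_scales(module)
```

In the toy version, iterating only over the targeted `nn.Linear` layers never reaches the parent attention module, so the fusing step silently does nothing; iterating over all modules, as this PR does, is what actually applies it.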

src/llmcompressor/modifiers/quantization/quantization/base.py

Lines changed: 9 additions & 2 deletions
```diff
@@ -76,11 +76,18 @@ def on_start(self, state: State, event: Event, **kwargs):
         # TODO: this step can be combined with update_weight_zp_scale
         # once update_fused_layer_weight_global_scales is removed
         # and not required by vLLM
-        for _, module in tqdm.tqdm(named_modules):
+        for _, module in tqdm.tqdm(named_modules, desc="Updating global scales"):
             update_weight_global_scale(module)
 
-        for _, module in tqdm.tqdm(named_modules, desc="Calibrating weights"):
+        # NOTE: update_fused_layer_weight_global_scales operates on Attention
+        # and MLP layers, not quantizable Linear layers. Rather than running
+        # on targeted modules, we need to run on all modules.
+        # Because this call is idempotent, setting all global_scales to the
+        # min value, it is ok to run potentially multiple times for all modules
+        for module in tqdm.tqdm(state.model.modules(), desc="Fusing global scales"):
             update_fused_layer_weight_global_scales(module)
+
+        for _, module in tqdm.tqdm(named_modules, desc="Calibrating weights"):
             update_weight_zp_scale(module)
 
     def on_event(self, state: State, event: Event, **kwargs):
```
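As a standalone illustration of the NOTE comment's idempotency argument, a minimal sketch; the fusing logic is paraphrased rather than the real compressed-tensors implementation, and the numeric values mirror the mismatched scales from the test plan:

```python
# Paraphrased fusing step: set every global scale in a fused group (e.g. the
# q/k/v projections of one attention layer) to the group minimum.
def fuse_to_min(scales: dict[str, float]) -> dict[str, float]:
    fused = min(scales.values())
    return {name: fused for name in scales}

# Hypothetical per-projection scales, mirroring the broken values above.
scales = {"q_proj": 9600.0, "k_proj": 12992.0, "v_proj": 12160.0}
once = fuse_to_min(scales)
twice = fuse_to_min(once)

# Fusing is idempotent: the minimum of already-fused (identical) values is
# unchanged, so running the pass over all modules multiple times is harmless.
assert once == twice == {"q_proj": 9600.0, "k_proj": 9600.0, "v_proj": 9600.0}
```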
