Feature/calibrate weights dfs fused modules #2394
GOavi101 wants to merge 5 commits into vllm-project:main
Conversation
- Add calibrate_weights(model) with a single DFS traversal: replaces three separate loops (update_weight_global_scale, update_fused_layer_weight_global_scales, update_weight_zp_scale) for better cache locality and fewer CPU->GPU onloads.
- API: named_modules or (targets, ignore), update_zp_scale, optional progress; DDP-friendly via a named_modules subset (vllm-project#2220).
- Use calibrate_weights in QuantizationModifier, AWQModifier, and GPTQModifier.
- Add fused_modules.py as the central source of truth for vLLM-aligned fused layouts: attention (traditional q/k/v, fused qkv, MLA q_a + kv_a), MLP (gate/up, fused gate_up).
- Add MLA (Multi-head Latent Attention) fused global-scale support: q_a_proj (or q_proj) + kv_a_proj_with_mqa.
- Refactor update_fused_layer_weight_global_scales to use get_fused_*_linears().

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
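The single-traversal idea in the commit message can be sketched as a stack-based DFS with a pre-order hook (global scale for target modules) and a post-order hook (fused scales / zp-scale once children are done). The function and the toy tree below are illustrative, not the PR's exact API:

```python
def calibrate_weights_dfs(root, children, pre_order, post_order, targets):
    """One pass over the module tree: pre-order work on the way down,
    post-order work once a node's children have all been processed."""
    stack = [(root, False)]
    while stack:
        node, children_done = stack.pop()
        if not children_done:
            if node in targets:
                pre_order(node)          # e.g. update_weight_global_scale
            stack.append((node, True))   # revisit after the children
            for child in reversed(children(node)):
                stack.append((child, False))
        else:
            post_order(node)             # e.g. fused global scales, zp/scale

# toy tree standing in for a module hierarchy
tree = {"a": ["b", "c"], "b": [], "c": []}
pre, post = [], []
calibrate_weights_dfs(
    "a", lambda n: tree[n],
    pre_order=pre.append, post_order=post.append,
    targets={"a", "b", "c"},
)
```

A single traversal like this touches each module once, which is what enables the cache-locality and onload savings the commit message claims.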
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes

This pull request refactors the weight calibration mechanism by consolidating separate calibration loops into a single, efficient depth-first-search traversal. The change is designed to improve performance by optimizing cache usage and minimizing CPU-to-GPU data movement during offloading. It also establishes a standardized framework for identifying and managing fused vLLM modules, ensuring uniform handling of global scales across diverse attention and MLP architectures, including the newly integrated Multi-head Latent Attention.
Activity
Code Review
The pull request introduces a significant improvement by consolidating the weight calibration process into a single DFS traversal function, calibrate_weights. This change enhances cache locality and reduces CPU-GPU overhead, especially beneficial for offloading scenarios. The introduction of fused_modules.py centralizes the definitions for various fused attention and MLP modules, which is a good step towards better maintainability and consistency. The modifications to AWQModifier, GPTQModifier, and QuantizationModifier to utilize this new unified calibration function are well-implemented. The handling of MLA fused global-scale support is also a valuable addition. Overall, the changes are well-structured and improve the efficiency and clarity of the calibration workflow.
Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
…/github.com/GOavi101/llm-compressor into feature/calibrate-weights-dfs-fused-modules
kylesayrs left a comment
I think this looks nice; I especially like the fused module utilities. However, there are still a few things to consider.
- Weights are still onloaded twice: once to calculate update_weight_global_scale and a second time to calculate update_weight_zp_scale. Ideally, we'd be able to onload the weight once and see significant performance benefits for NVFP4.

      weight = module.weight  # onload once
      weight_global_scale = calculate_gparam(weight)
      weight_scale, weight_zero_point = calculate_qparams(weight)

- It's still unclear how this algorithm could be used to perform parallel weight calibration. Answering this question is definitely tricky, and we can leave it aside for now, but it is something to consider.
    try:
        import tqdm
    except ImportError:
        tqdm = None


    if show_progress and desc is not None and tqdm is not None and total_targets > 0:
        pbar = tqdm.tqdm(total=total_targets, desc=desc)
    else:
        pbar = None
Assume that tqdm is available
    else:
        named_modules = list(named_modules)

    target_set = {id(m) for _, m in named_modules}
A set of modules is equivalent to a set of ids of modules (more or less). You don't need the id call.
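The reviewer's point rests on the fact that Python objects hash by identity unless they override `__hash__` (torch.nn.Module does not), so a set of modules and a set of their ids give the same membership answers. A minimal torch-free illustration:

```python
class Module:
    """Stand-in for torch.nn.Module, which also hashes by object identity."""

a, b = Module(), Module()

id_set = {id(m) for m in (a, b)}   # what the quoted code builds
obj_set = {m for m in (a, b)}      # equivalent, without the id() call

# membership agrees for every module, so the id() indirection is redundant
same_membership = all((id(m) in id_set) == (m in obj_set) for m in (a, b))
```

The object set is also safer: an id can be reused after an object is garbage collected, while a set of modules keeps its members alive.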
    # Pre-order: global scale for target modules (FP4 / TENSOR_GROUP)
    if id(module) in target_set:
        update_weight_global_scale(module)
    stack.append((module, True))
Please test this traversal with models which have shared modules. You may have to memoize modules to avoid traversing twice.
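The shared-module concern, sketched: models with tied submodules expose the same object under multiple parents, so a naive traversal calibrates it twice. Memoizing by identity (as the reviewer suggests) visits each module once. This is a hypothetical sketch, not the PR's code:

```python
class Node:
    """Toy module with children; a real model would be an nn.Module tree."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

shared = Node("shared")                       # tied submodule, two parents
root = Node("root", [Node("a", [shared]), Node("b", [shared])])

visited = []
def traverse_once(node, seen=None):
    seen = set() if seen is None else seen
    if id(node) in seen:                      # memoize: skip shared modules
        return
    seen.add(id(node))
    visited.append(node.name)
    for child in node.children:
        traverse_once(child, seen)

traverse_once(root)
```

Without the `seen` check, "shared" would be calibrated twice, doubling the onload cost for tied weights and risking double-applied scales.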
        "calibrate_weights requires either named_modules or both "
        "targets and ignore"
    )
    from compressed_tensors.utils import match_named_modules
Please put imports at the top
    if targets is None or ignore is None:
        raise ValueError(
            "calibrate_weights requires either named_modules or both "
            "targets and ignore"
        )
Can you avoid this check by using default tuple values?
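One way to read the suggestion: default `targets` and `ignore` to empty tuples, so "no filter given" is a valid no-match instead of an error path. The signature and toy model below are hypothetical, not the PR's final API:

```python
def calibrate_weights(model, named_modules=None, targets=(), ignore=()):
    """Empty-tuple defaults remove the explicit None/ValueError check:
    with no targets, the filter simply matches nothing."""
    if named_modules is None:
        named_modules = [
            (name, m) for name, m in model
            if name in targets and name not in ignore
        ]
    return list(named_modules)

# toy "model": (name, module) pairs standing in for model.named_modules()
model = [("q_proj", "modA"), ("k_proj", "modB"), ("lm_head", "modC")]
selected = calibrate_weights(
    model, targets={"q_proj", "k_proj", "lm_head"}, ignore={"lm_head"}
)
```

Whether a silent empty result is preferable to a loud ValueError is a design choice; the tuple defaults at least make the "nothing specified" case well-defined.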
    module, children_done = stack.pop()

    if not children_done:
        # Pre-order: global scale for target modules (FP4 / TENSOR_GROUP)
Really great job with defining pre-order and post-order functions.
    call_observer(module=module, base_name="weight")


def calibrate_weights(
I think the pre-order and post-order framing is very elegant, but it may get in the way of sharing weight offloading between both calculate_gparam and calculate_qparam.
The pre/post structure was getting in the way of a single onload. I've added _update_weight_calibration_once(module, update_zp_scale), which onloads module.weight once and calls call_observer(..., value=value, should_calculate_gparam=..., should_calculate_qparams=...) so both calculations use the same tensor. The sequential DFS now uses this in pre-order for target modules and no longer calls update_weight_zp_scale in post-order for them, so we get one onload per module for NVFP4.
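The single-onload idea can be sketched as follows. The function names and the max/min stand-ins are illustrative; the real code routes both calculations through call_observer on the same tensor:

```python
onloads = {"count": 0}

def onload_weight(module):
    """Stand-in for moving an offloaded weight onto the device."""
    onloads["count"] += 1
    return module["weight"]

def update_weight_calibration_once(module, update_zp_scale=True):
    # One onload feeds both the global-scale and the qparam calculations
    weight = onload_weight(module)
    gparam = max(abs(w) for w in weight)              # stand-in for calculate_gparam
    qparams = (min(weight), max(weight)) if update_zp_scale else None
    return gparam, qparams

module = {"weight": [0.5, -1.0, 2.0]}
gparam, qparams = update_weight_calibration_once(module)
```

The point of the counter: under the old structure the same weight would be onloaded once per calculation; here both results come from a single transfer.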
Force-pushed fae990b to 6e7a4f5
…, add parallel path

- Move fused_modules.py to modeling/ and update imports
- In update_fused_layer_weight_global_scales, rescale weight_scale s' = s*g'/g when applying the fused global scale so q is unchanged
- Add calibrate_weights(..., parallel=True, max_workers=N) for two-phase parallel weight calibration

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
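The rescaling step in the commit message can be checked numerically. Under the convention the commit message implies (effective dequantization scale s/g, so quantized values stay fixed when s/g does), setting s' = s·g'/g leaves that ratio invariant when the fused global scale g' replaces a module's own g. A hedged numeric sketch:

```python
def apply_fused_global_scale(s, g, g_prime):
    """s' = s * g' / g keeps the effective scale s/g unchanged when the
    global scale moves from g to the fused g' (per the commit message)."""
    return s * g_prime / g

s, g, g_prime = 0.25, 8.0, 4.0
s_prime = apply_fused_global_scale(s, g, g_prime)
```

This is why fusing global scales across q/k/v (or gate/up) linears does not change their quantized weights: only the factorization between per-tensor scale and global scale moves.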
Force-pushed 6e7a4f5 to b3d2de9
This pull request has merge conflicts that must be resolved before it can be merged.
ok, taking a look
    def _apply_fused_global_scale(lin: Linear, g_prime: torch.Tensor) -> None:
        """Set weight_global_scale to g'; rescale weight_scale so q = x/(s*g) unchanged."""
        old_g = lin.weight_global_scale.data
        update_parameter_data(lin, g_prime, "weight_global_scale")
this way is deprecated
Suggested change:
- update_parameter_data(lin, g_prime, "weight_global_scale")
+ update_offload_parameter(lin, "weight_global_scale", g_prime)
you have it correct below
    update_fused_layer_weight_global_scales(module)
    if update_zp_scale and module in target_set and module not in seen_post:
        seen_post.add(module)
        if pbar is not None:
I think this progress bar update should be unindented, or else it only updates when zp_scale is hit.
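The fix the reviewer is pointing at, sketched with stand-in names: the pbar tick belongs at the post-order level so it fires for every module, not only when the zp/scale branch runs:

```python
def postorder_step(module, target_set, update_zp_scale, seen_post, events, pbar):
    events.append(("fused", module))              # always runs
    if update_zp_scale and module in target_set and module not in seen_post:
        seen_post.add(module)
        events.append(("zp_scale", module))
    if pbar is not None:                          # unindented: ticks per module,
        pbar.append(module)                       # not only when zp_scale runs

pbar, events, seen = [], [], set()
for m in ("q_proj", "k_proj"):
    postorder_step(m, {"q_proj", "k_proj"}, update_zp_scale=False,
                   seen_post=seen, events=events, pbar=pbar)
```

With the update nested under the zp_scale branch, this run (update_zp_scale=False) would leave the bar frozen at zero even though both modules were processed.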
    item_weight_fn=lambda m: m.weight.numel(); see GPTQ DDP #2333 which uses
    hessian shape for the same idea).

    Benchmark: See tests/benchmark_calibrate_weights.py for onload count and
    update_offload_parameter(submodule.up_proj, "weight_global_scale", global_scale)


    del global_scale


def _apply_fused_global_scale(lin: Linear, g_prime: torch.Tensor) -> None:
Probably needs unit tests, for this and the other new functions that are going to be in the hot path.
    (order-independent, parallelizable). Phase 2 applies fused global scales
    and rescales the per-tensor scale s' = s * (g' / g).

    DDP: Works with distributed setups. Pass named_modules as this rank's
I'm not seeing any of this functionality
Summary

- Unified weight calibration (calibrate_weights(model)) for better cache locality and fewer CPU→GPU onloads when using offloading.
- Central fused-module definitions (fused_modules.py): traditional attention (q/k/v), fused QKV, MLA (q_a + kv_a_proj_with_mqa), and MLP (gate/up, fused gate_up).

Changes

- calibrate_weights(model, named_modules=..., update_zp_scale=..., desc=..., show_progress=...) in calibration.py: single stack-based DFS; pre-order update_weight_global_scale for targets, post-order update_fused_layer_weight_global_scales and update_weight_zp_scale for targets. The API supports DDP via a named_modules subset ([Performance Refactor] Extend modifiers to support weight-parallel optimization - QuantizationModifier #2220).
- fused_modules.py: get_fused_attention_linears(), get_fused_mlp_linears(), plus is_fused_attention_module / is_fused_mlp_module. Defines traditional, fused QKV, MLA, and gate/up layouts.
- helpers.py: update_fused_layer_weight_global_scales() refactored to use get_fused_*_linears(); MLA handled in the same path as traditional attention.
- QuantizationModifier, AWQModifier, and GPTQModifier now call calibrate_weights() instead of the three loops.
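A hedged sketch of the layout lookup that fused_modules.py centralizes. The attribute names follow vLLM conventions from the PR description; the layout table, the matching-by-hasattr strategy, and the list return type are illustrative, and the real helpers return the Linear submodules themselves:

```python
# Candidate attention layouts, most specific first (names follow vLLM):
ATTENTION_LAYOUTS = (
    ("q_a_proj", "kv_a_proj_with_mqa"),    # MLA with q LoRA
    ("q_proj", "kv_a_proj_with_mqa"),      # MLA without q LoRA
    ("qkv_proj",),                         # fused QKV
    ("q_proj", "k_proj", "v_proj"),        # traditional attention
)

def get_fused_attention_linears(module):
    """Return the attention linears for the first layout the module matches."""
    for layout in ATTENTION_LAYOUTS:
        if all(hasattr(module, name) for name in layout):
            return [getattr(module, name) for name in layout]
    return []

class Attn:
    pass

attn = Attn()
attn.qkv_proj = "fused-qkv-linear"
linears = get_fused_attention_linears(attn)
```

Ordering the table most-specific-first matters: an MLA module that also has q_proj must match the MLA layout before falling through to the traditional q/k/v one.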