
Feature/calibrate weights dfs fused modules #2394

Open
GOavi101 wants to merge 5 commits into vllm-project:main from
GOavi101:feature/calibrate-weights-dfs-fused-modules

Conversation

@GOavi101
Contributor

Summary

  • Replaces the three separate calibration loops (global scale, fused scales, zp/scale) with a single DFS traversal (calibrate_weights(model)) for better cache locality and fewer CPU→GPU onloads when using offloading.
  • Introduces a centralized source of truth for fused vLLM modules (fused_modules.py): traditional attention (q/k/v), fused QKV, MLA (q_a + kv_a_proj_with_mqa), and MLP (gate/up, fused gate_up).
  • MLA (Multi-head Latent Attention) fused global-scale support is added (previously not handled).

Changes

  • calibrate_weights(model, named_modules=..., update_zp_scale=..., desc=..., show_progress=...) in calibration.py: single stack-based DFS; pre-order update_weight_global_scale for targets, post-order update_fused_layer_weight_global_scales and update_weight_zp_scale for targets. API supports DDP via named_modules subset ([Performance Refactor] Extend modifiers to support weight-parallel optimization - QuantizationModifier #2220).
  • fused_modules.py: get_fused_attention_linears(), get_fused_mlp_linears(), plus is_fused_attention_module / is_fused_mlp_module. Defines traditional, fused QKV, MLA, and gate/up layouts.
  • helpers.py: update_fused_layer_weight_global_scales() refactored to use get_fused_*_linears(); MLA handled in the same path as traditional attention.
  • QuantizationModifier, AWQModifier, GPTQModifier: all use calibrate_weights() instead of the three loops.
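The single-traversal idea above can be sketched as follows. This is an illustrative stand-in, not the PR's actual code: `Node` plays the role of `nn.Module`, and the `pre_order` / `post_order` callables stand in for `update_weight_global_scale` and the fused-scale / `update_weight_zp_scale` steps.

```python
class Node:
    """Stand-in for nn.Module: a named node with a list of children."""
    def __init__(self, name, children=()):
        self.name = name
        self._children = list(children)

    def children(self):
        return iter(self._children)


def dfs_calibrate(root, targets, pre_order, post_order):
    """Single stack-based DFS: pre-order hook on the way down, post-order
    hook once all of a module's children have been processed."""
    target_set = set(targets)              # identity-hashed, like nn.Module
    stack = [(root, False)]                # (module, children_done)
    while stack:
        module, children_done = stack.pop()
        if not children_done:
            if module in target_set:
                pre_order(module)          # e.g. update_weight_global_scale
            stack.append((module, True))   # revisit once children are done
            for child in module.children():
                stack.append((child, False))
        elif module in target_set:
            post_order(module)             # e.g. update_weight_zp_scale


# Toy model: root -> [q, k]; both leaves are calibration targets
q, k = Node("q"), Node("k")
root = Node("root", [q, k])
events = []
dfs_calibrate(root, [q, k],
              pre_order=lambda m: events.append("pre:" + m.name),
              post_order=lambda m: events.append("post:" + m.name))
# events: ['pre:k', 'post:k', 'pre:q', 'post:q']
```

Because the post-order hook fires right after a module's subtree is finished, each module's weight is still warm in cache when its zp/scale step runs, which is the locality benefit claimed above.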

Avishek Goswami added 2 commits February 24, 2026 01:14
- Add calibrate_weights(model) with single DFS traversal: replaces three
  separate loops (update_weight_global_scale, update_fused_layer_weight_global_scales,
  update_weight_zp_scale) for better cache locality and fewer CPU->GPU onloads.
- API: named_modules or (targets, ignore), update_zp_scale, optional progress;
  DDP-friendly via named_modules subset (vllm-project#2220).
- Use calibrate_weights in QuantizationModifier, AWQModifier, GPTQModifier.
- Add fused_modules.py as central source of truth for vLLM-aligned fused layouts:
  attention (traditional q/k/v, fused qkv, MLA q_a + kv_a), MLP (gate/up, fused gate_up).
- Add MLA (Multi-head Latent Attention) fused global-scale support:
  q_a_proj (or q_proj) + kv_a_proj_with_mqa.
- Refactor update_fused_layer_weight_global_scales to use get_fused_*_linears().

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @GOavi101, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the weight calibration mechanism by consolidating disparate calibration loops into a single, efficient Depth-First Search traversal. This change is designed to boost performance by optimizing cache usage and minimizing CPU-to-GPU data movement during offloading. Concurrently, it establishes a standardized framework for identifying and managing various fused vLLM modules, ensuring uniform handling of global scales across diverse attention and MLP architectures, including the newly integrated Multi-head Latent Attention.

Highlights

  • Unified Calibration Process: Replaced three separate calibration loops (global scale, fused scales, zero-point/scale) with a single Depth-First Search (DFS) traversal function, calibrate_weights(model), for improved cache locality and reduced CPU→GPU data transfers, especially beneficial with offloading.
  • Centralized Fused Module Definitions: Introduced fused_modules.py to provide a centralized and consistent definition for various vLLM-aligned fused modules, including traditional attention (q/k/v), fused QKV, Multi-head Latent Attention (MLA), and MLP (gate/up, fused gate_up).
  • MLA Global-Scale Support: Added explicit support for Multi-head Latent Attention (MLA) fused global-scale, which was not previously handled in the calibration process.
  • Modifier Integration: Updated QuantizationModifier, AWQModifier, and GPTQModifier to utilize the new calibrate_weights() function, streamlining their calibration workflows.


Changelog
  • src/llmcompressor/modifiers/awq/base.py
    • Removed direct imports for update_weight_global_scale and update_weight_zp_scale.
    • Imported the new calibrate_weights function.
    • Replaced the three sequential calibration loops with a single call to calibrate_weights.
  • src/llmcompressor/modifiers/quantization/calibration.py
    • Updated type hints to include Iterable, Set, and Tuple.
    • Imported match_named_modules and update_fused_layer_weight_global_scales.
    • Added calibrate_weights to the module's __all__ export list.
    • Implemented the new calibrate_weights function, which performs a stack-based DFS traversal to apply global scale updates (pre-order) and fused layer global scale/zero-point updates (post-order).
  • src/llmcompressor/modifiers/quantization/gptq/base.py
    • Removed imports for update_weight_global_scale and update_fused_layer_weight_global_scales.
    • Imported calibrate_weights.
    • Replaced separate global scale and fused layer updates with a call to calibrate_weights, specifically disabling update_zp_scale as GPTQ handles this in forward hooks.
  • src/llmcompressor/modifiers/quantization/quantization/base.py
    • Removed the tqdm import.
    • Removed direct imports of update_weight_global_scale, update_weight_zp_scale, and update_fused_layer_weight_global_scales.
    • Imported calibrate_weights.
    • Replaced the three sequential calibration loops with a single call to calibrate_weights.
  • src/llmcompressor/modifiers/utils/__init__.py
    • Added imports for get_fused_attention_linears, get_fused_mlp_linears, is_fused_attention_module, and is_fused_mlp_module from the new fused_modules.
  • src/llmcompressor/modifiers/utils/fused_modules.py
    • Added a new file defining functions to identify and retrieve linear submodules for vLLM-aligned fused attention (traditional, Fused QKV, MLA) and MLP (Gate/Up, Fused Gate-Up) groups.
    • Included helper functions is_fused_attention_module and is_fused_mlp_module for easy checking.
  • src/llmcompressor/modifiers/utils/helpers.py
    • Updated imports to use update_parameter_data instead of update_offload_parameter and removed is_attention_module.
    • Imported get_fused_attention_linears and get_fused_mlp_linears from fused_modules.
    • Refactored update_fused_layer_weight_global_scales to leverage the new fused_modules functions for identifying and processing fused attention and MLP layers, including MLA support.
Activity
  • No human activity (comments, reviews, etc.) has occurred on this pull request yet.

@gemini-code-assist bot left a comment

Code Review

The pull request introduces a significant improvement by consolidating the weight calibration process into a single DFS traversal function, calibrate_weights. This change enhances cache locality and reduces CPU-GPU overhead, especially beneficial for offloading scenarios. The introduction of fused_modules.py centralizes the definitions for various fused attention and MLP modules, which is a good step towards better maintainability and consistency. The modifications to AWQModifier, GPTQModifier, and QuantizationModifier to utilize this new unified calibration function are well-implemented. The handling of MLA fused global-scale support is also a valuable addition. Overall, the changes are well-structured and improve the efficiency and clarity of the calibration workflow.

Avishek Goswami added 2 commits February 24, 2026 01:47
@kylesayrs (Collaborator) left a comment

I think this looks nice, I especially like the fused module utilities. However, there are still a few things to consider.

  1. Weights are still onloaded twice, once to calculate update_weight_global_scale and a second time to calculate update_weight_zp_scale. Ideally, we'd be able to onload the weight once and see significant performance benefits for NVFP4.

         weight = module.weight  # onload once
         weight_global_scale = calculate_gparam(weight)
         weight_scale, weight_zero_point = calculate_qparams(weight)

  2. It's still unclear how this algorithm could be used to perform parallel weight calibration. Answering this problem is definitely tricky, and we can leave it aside for now, but something to consider.

Comment on lines +213 to +221
    try:
        import tqdm
    except ImportError:
        tqdm = None

    if show_progress and desc is not None and tqdm is not None and total_targets > 0:
        pbar = tqdm.tqdm(total=total_targets, desc=desc)
    else:
        pbar = None

Assume that tqdm is available

    else:
        named_modules = list(named_modules)

    target_set = {id(m) for _, m in named_modules}

A set of modules is equivalent to a set of ids of modules (more or less). You don't need the id call.
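A quick illustration of the reviewer's point: classes that don't define `__eq__`/`__hash__` (which includes `nn.Module`, shown here with a plain stand-in class) hash by identity, so a set of modules behaves the same as a set of their ids.

```python
class Module:
    """Any class without __eq__/__hash__ hashes by identity, as nn.Module does."""


a, b = Module(), Module()

target_set = {a}                 # set of modules directly; no id() call
assert a in target_set
assert b not in target_set

# Equivalent to the id-based version, since default hashing is identity-based
id_set = {id(a)}
assert (id(a) in id_set) == (a in target_set)
```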

            # Pre-order: global scale for target modules (FP4 / TENSOR_GROUP)
            if id(module) in target_set:
                update_weight_global_scale(module)
            stack.append((module, True))

Please test this traversal with models which have shared modules. You may have to memoize modules to avoid traversing twice.
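A minimal sketch of the memoization the reviewer suggests, again with a stand-in `Node` class in place of `nn.Module`: a `seen` set keeps shared submodules (e.g. a weight-tied module reachable from two parents) from being visited twice.

```python
class Node:
    """Stand-in for nn.Module: a named node with a list of children."""
    def __init__(self, name, children=()):
        self.name = name
        self._children = list(children)

    def children(self):
        return iter(self._children)


def visit_once(root, visit):
    """Traversal memoized on module identity so shared submodules
    are calibrated exactly once."""
    seen = set()
    stack = [root]
    while stack:
        module = stack.pop()
        if module in seen:               # already reached via another parent
            continue
        seen.add(module)
        visit(module)
        stack.extend(module.children())


# `shared` is reachable from two parents (like a tied embedding/lm_head)
shared = Node("shared")
root = Node("root", [Node("a", [shared]), Node("b", [shared])])
visited = []
visit_once(root, lambda m: visited.append(m.name))
# 'shared' appears once despite two paths to it
```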

            "calibrate_weights requires either named_modules or both "
            "targets and ignore"
        )
    from compressed_tensors.utils import match_named_modules

Please put imports at the top

Comment on lines +199 to +203
    if targets is None or ignore is None:
        raise ValueError(
            "calibrate_weights requires either named_modules or both "
            "targets and ignore"
        )

Can you avoid this check by using default tuple values?

        module, children_done = stack.pop()

        if not children_done:
            # Pre-order: global scale for target modules (FP4 / TENSOR_GROUP)

Really great job with defining pre-order and post-order functions.

    call_observer(module=module, base_name="weight")


def calibrate_weights(
@kylesayrs commented Feb 24, 2026

I think the pre-order and post-order framing is very elegant, but it may get in the way of sharing weight offloading between both calculate_gparam and calculate_qparam.

@GOavi101 (Contributor, Author)

The pre/post structure was getting in the way of a single onload. I've added _update_weight_calibration_once(module, update_zp_scale) which onloads module.weight once and calls call_observer(..., value=value, should_calculate_gparam=..., should_calculate_qparams=...) so both use the same tensor. The sequential DFS now uses this in pre-order for target modules and no longer calls update_weight_zp_scale in post-order for them, so we get one onload per module for NVFP4.
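A hedged sketch of what such a single-onload helper could look like. The names here (`update_weight_calibration_once`, `onload`, `observer`, the dict-based "module") are illustrative stand-ins for the PR's actual `_update_weight_calibration_once` / `call_observer`; a counter shows the weight is transferred exactly once even though both gparam and qparams are computed.

```python
def update_weight_calibration_once(module, update_zp_scale, onload, observer):
    """Sketch of the single-onload path: the weight is onloaded once and
    the same tensor feeds both the global-scale (gparam) and the
    zp/scale (qparams) calculations."""
    weight = onload(module)                          # one CPU->GPU onload
    observer(weight,
             should_calculate_gparam=True,
             should_calculate_qparams=update_zp_scale)


# Count onloads to show both calculations share one transfer
calls = {"onload": 0}

def onload(module):
    calls["onload"] += 1
    return module["weight"]

results = []

def observer(value, should_calculate_gparam, should_calculate_qparams):
    amax = max(abs(v) for v in value)   # toy stand-in for observer stats
    if should_calculate_gparam:
        results.append(("gparam", amax))
    if should_calculate_qparams:
        results.append(("qparams", amax))

module = {"weight": [0.5, -2.0, 1.5]}
update_weight_calibration_once(module, True, onload, observer)
# calls["onload"] == 1, and both gparam and qparams were computed
```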

…, add parallel path

- Move fused_modules.py to modeling/ and update imports
- In update_fused_layer_weight_global_scales, rescale weight_scale s' = s*g'/g
  when applying fused global scale so q unchanged
- Add calibrate_weights(..., parallel=True, max_workers=N) for two-phase
  parallel weight calibration

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
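A quick numeric check of the rescale described in this commit. The quantization convention is an assumption here: it takes the NVFP4-style form q = w * g / s (i.e. dequantization w ≈ q * s / g), which is the convention under which the stated rescale s' = s * g' / g leaves quantized values unchanged.

```python
# Assumed convention: q = w * g / s, i.e. dequantization is w ≈ q * s / g.
w = 0.75             # one weight element
s, g = 0.5, 8.0      # per-tensor scale and original global scale
q = w * g / s        # quantized value before fusing: 12.0

g_prime = 4.0                 # fused (shared) global scale
s_prime = s * g_prime / g     # s' = s * g' / g, as in the commit message
q_after = w * g_prime / s_prime

assert abs(q - q_after) < 1e-12   # quantized values are unchanged
```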
@GOavi101 force-pushed the feature/calibrate-weights-dfs-fused-modules branch from 6e7a4f5 to b3d2de9 on February 27, 2026 08:18
@mergify
Contributor

mergify bot commented Mar 2, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @GOavi101.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 2, 2026
@HDCharles HDCharles self-requested a review March 5, 2026 15:07
@HDCharles HDCharles self-assigned this Mar 5, 2026
@HDCharles
Collaborator

ok, taking a look

def _apply_fused_global_scale(lin: Linear, g_prime: torch.Tensor) -> None:
    """Set weight_global_scale to g'; rescale weight_scale so quantized values are unchanged."""
    old_g = lin.weight_global_scale.data
    update_parameter_data(lin, g_prime, "weight_global_scale")
this way is deprecated

Suggested change:

-    update_parameter_data(lin, g_prime, "weight_global_scale")
+    update_offload_parameter(lin, "weight_global_scale", g_prime)


you have it correct below

        update_fused_layer_weight_global_scales(module)
    if update_zp_scale and module in target_set and module not in seen_post:
        seen_post.add(module)
        if pbar is not None:
@HDCharles commented Mar 5, 2026

i think this progress bar update should be unindented or else it only updates when zp_scale is hit

item_weight_fn=lambda m: m.weight.numel(); see GPTQ DDP #2333 which uses
hessian shape for the same idea).

Benchmark: See tests/benchmark_calibrate_weights.py for onload count and

is this a real file?

    update_offload_parameter(submodule.up_proj, "weight_global_scale", global_scale)

    del global_scale

def _apply_fused_global_scale(lin: Linear, g_prime: torch.Tensor) -> None:
@HDCharles commented Mar 5, 2026

probably need unit test for this and other new functions that are going to be in the hotpath

(order-independent, parallelizable). Phase 2 applies fused global scales
and rescales per-tensor scale s' = s * (g' / g).

DDP: Works with distributed setups. Pass named_modules as this rank's

I'm not seeing any of this functionality

@HDCharles (Collaborator) left a comment

see comments
