fix: generalize fused layer global scales for DeepSeek MLA attention #2437
NJX-njx wants to merge 1 commit into vllm-project:main
Conversation
Fixes vllm-project#2360

The `update_fused_layer_weight_global_scales` function previously assumed all attention modules use standard `q_proj`/`k_proj`/`v_proj` projections. This caused an `AttributeError` for DeepSeek V2/V3 models that use Multi-head Latent Attention (MLA) with different projection names (`q_a_proj`, `kv_a_proj_with_mqa`, `q_b_proj`, `kv_b_proj`).

Changes:

- Introduce a configurable `_ATTENTION_FUSED_GROUPS` registry that lists all known attention projection groups (standard QKV, fused QKV, MLA compressed, MLA decompressed)
- Refactor the attention branch to iterate over groups and match the first complete group, gracefully skipping unrecognized modules
- Extract a `_fuse_global_scales` helper to remove code duplication
- Move `_valid_tensor_group_quant` to module level for reuse
- Remove the unused `Linear` import (use `Module` instead)
- Add a comprehensive test suite covering standard, fused, MLA, and edge-case scenarios
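The registry-and-iteration approach described above can be sketched as follows. This is a simplified stand-in, not the PR's actual helper: real projections are `torch.nn.Linear` modules carrying NVFP4 global-scale parameters, while here plain objects with a `weight_global_scale` float are used, and the fusion rule is assumed to take the minimum scale across the group.

```python
from types import SimpleNamespace

# Known attention projection groups, checked in order (from the PR description).
_ATTENTION_FUSED_GROUPS = [
    ["qkv_proj"],                        # already fused (e.g. GPT-NeoX, Falcon)
    ["q_proj", "k_proj", "v_proj"],      # standard separate projections
    ["q_a_proj", "kv_a_proj_with_mqa"],  # DeepSeek MLA compressed
    ["q_b_proj", "kv_b_proj"],           # DeepSeek MLA decompressed
]

def fuse_attention_global_scales(module) -> bool:
    """Fuse weight_global_scale across the first complete projection group.

    Returns True if a group was fused; unrecognized modules are skipped
    without raising, which is the behavior the PR adds.
    """
    for group in _ATTENTION_FUSED_GROUPS:
        layers = [getattr(module, n) for n in group if hasattr(module, n)]
        if len(layers) != len(group):
            continue  # incomplete group: try the next known pattern
        if len(layers) <= 1:
            return False  # a single projection is already fused
        # Assumed fusion rule: every layer adopts the smallest global scale.
        fused = min(layer.weight_global_scale for layer in layers)
        for layer in layers:
            layer.weight_global_scale = fused
        return True  # only the first matching group applies
    return False
```

For a DeepSeek MLA module exposing only `q_a_proj` and `kv_a_proj_with_mqa`, the third group matches and the standard QKV group is skipped without touching missing attributes.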
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Code Review
This pull request effectively addresses the AttributeError by generalizing the fused layer global scales for DeepSeek MLA attention. The introduction of the _ATTENTION_FUSED_GROUPS registry provides a flexible and extensible way to define different attention projection groups, ensuring compatibility with various model architectures. The refactoring into _fuse_global_scales() and moving _valid_tensor_group_quant() to a module level significantly improves code readability and reduces duplication. The comprehensive test suite added in test_fused_global_scales.py thoroughly validates the new logic across standard, already-fused, and DeepSeek MLA projections, as well as MLP fusion and no-op cases. Overall, this is a well-implemented and thoroughly tested fix.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1728b91398
```python
                return

            _fuse_global_scales(layers)
            return  # only the first matching group applies
```
Fuse every matching attention projection group
When an MLA attention module contains both q_a_proj/kv_a_proj_with_mqa and q_b_proj/kv_b_proj, this early return stops after fusing the first complete group, so the second pair is never synchronized even though it is explicitly listed in _ATTENTION_FUSED_GROUPS. In the calibration flows I checked (quantization/base.py, awq/base.py, gptq/base.py), this helper is invoked once per module via for module in state.model.modules(), so there is no second call on the same attention block to fuse the remaining group. That leaves part of DeepSeek MLA attention with inconsistent NVFP4 global scales.
Pull request overview
This PR fixes update_fused_layer_weight_global_scales crashing on DeepSeek V2/V3 MLA attention modules by generalizing attention projection handling beyond the standard q_proj/k_proj/v_proj layout.
Changes:

- Add an `_ATTENTION_FUSED_GROUPS` registry and iterate groups to select projection sets to fuse.
- Refactor duplicated fusion logic into `_fuse_global_scales()` and move `_valid_tensor_group_quant()` to module scope.
- Add unit tests covering standard QKV, fused QKV, DeepSeek MLA, no-op cases, and MLP gate/up fusion.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/llmcompressor/modifiers/utils/helpers.py` | Generalizes attention projection fusion via a registry and refactors shared fusion/validation helpers. |
| `tests/llmcompressor/modifiers/utils/test_fused_global_scales.py` | Adds tests for fused global scale behavior across attention (incl. MLA) and MLP patterns. |
```python
class _FakeMLAAttention(nn.Module):
    """Mock DeepSeek V2/V3 MLA-style attention with q_a_proj + kv_a_proj_with_mqa."""

    def __init__(self, dim: int = 64, scales=(4.0, 8.0)):
        super().__init__()
        self.q_a_proj = _make_linear_with_global_scale(dim, dim, scales[0])
        self.kv_a_proj_with_mqa = _make_linear_with_global_scale(dim, dim, scales[1])
```
The test suite doesn't cover the DeepSeek MLA naming variant where the compressed Q projection is q_proj (see repo note in src/llmcompressor/modifiers/awq/mappings.py about q_proj vs q_a_proj). Adding a test case for q_proj + kv_a_proj_with_mqa would help ensure the helper handles both forms and prevents regressions.
```python
_ATTENTION_FUSED_GROUPS: list[list[str]] = [
    # Already-fused QKV (e.g. GPT-NeoX, Falcon)
    ["qkv_proj"],
    # Standard separate Q/K/V projections (Llama, Mistral, Qwen, etc.)
    ["q_proj", "k_proj", "v_proj"],
    # DeepSeek V2/V3 MLA: compressed Q + compressed KV
    ["q_a_proj", "kv_a_proj_with_mqa"],
    # DeepSeek V2/V3 MLA: decompressed Q + decompressed KV
    ["q_b_proj", "kv_b_proj"],
]
```
_ATTENTION_FUSED_GROUPS doesn't account for the DeepSeek variant where the compressed Q projection is named q_proj (repo already notes some DeepSeek models use q_proj instead of q_a_proj). In that case an attention module with q_proj + kv_a_proj_with_mqa will match no group and scales won't be fused. Consider adding an additional group for ["q_proj", "kv_a_proj_with_mqa"] (ordered before the standard [q_proj,k_proj,v_proj] group, since it won't fully match anyway) or otherwise handling this alias explicitly.
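The ordering the reviewer suggests can be checked with a small sketch. The alias group below is hypothetical, proposed by the review rather than present in the PR:

```python
# Registry with the hypothetical alias group inserted before the standard
# Q/K/V group, as the review suggests.
_ATTENTION_FUSED_GROUPS = [
    ["qkv_proj"],
    ["q_proj", "kv_a_proj_with_mqa"],  # proposed DeepSeek alias (compressed Q named q_proj)
    ["q_proj", "k_proj", "v_proj"],
    ["q_a_proj", "kv_a_proj_with_mqa"],
    ["q_b_proj", "kv_b_proj"],
]

def first_complete_group(module):
    """Return the first group whose projections are all present on the module."""
    for group in _ATTENTION_FUSED_GROUPS:
        if all(hasattr(module, name) for name in group):
            return group
    return None
```

A standard Llama-style module still resolves to `["q_proj", "k_proj", "v_proj"]`, because the alias group also requires `kv_a_proj_with_mqa`, which such modules lack.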
```python
    if is_attention_module(submodule):
        for group in _ATTENTION_FUSED_GROUPS:
            layers = [
                getattr(submodule, name)
                for name in group
                if hasattr(submodule, name)
            ]
            # Only fuse when ALL names in the group are present
            if len(layers) != len(group):
                continue

            # Skip single-projection groups (already fused, e.g. qkv_proj)
            if len(layers) <= 1:
                return

            if not _valid_tensor_group_quant(layers):
                return

            _fuse_global_scales(layers)
            return  # only the first matching group applies
```
The attention branch returns after fusing the first matching group. For DeepSeek MLA modules that expose both compressed (q_a_proj/kv_a_proj_with_mqa) and decompressed (q_b_proj/kv_b_proj) projections on the same attention module, the second group will never be fused because update_fused_layer_weight_global_scales is only called once per module during model.modules() traversal. Consider fusing all complete, disjoint groups present on the module (or at least not returning after the first successful fusion), and similarly continue (not return) when a matched group isn't TENSOR_GROUP so later groups can still be handled.
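A hypothetical version following this suggestion, fusing every complete, disjoint group in one call instead of returning after the first, might look like this (simplified: plain objects with a `weight_global_scale` float stand in for `Linear` modules, and fusion is assumed to take the group minimum):

```python
_ATTENTION_FUSED_GROUPS = [
    ["qkv_proj"],
    ["q_proj", "k_proj", "v_proj"],
    ["q_a_proj", "kv_a_proj_with_mqa"],
    ["q_b_proj", "kv_b_proj"],
]

def fuse_all_attention_groups(module) -> int:
    """Fuse every complete multi-projection group; return how many were fused.

    Unlike the PR's first-match-wins loop, this continues past a fused group,
    so both MLA groups on one module are synchronized in a single call.
    """
    fused_count = 0
    claimed = set()  # projection names already fused by an earlier group
    for group in _ATTENTION_FUSED_GROUPS:
        if len(group) <= 1:
            continue  # already-fused layouts need no work
        if any(name in claimed for name in group):
            continue  # keep the fused groups disjoint
        if not all(hasattr(module, name) for name in group):
            continue  # incomplete group: try the next pattern
        layers = [getattr(module, name) for name in group]
        # Assumed fusion rule: every layer adopts the smallest global scale.
        scale = min(layer.weight_global_scale for layer in layers)
        for layer in layers:
            layer.weight_global_scale = scale
        claimed.update(group)
        fused_count += 1
    return fused_count
```

On a module exposing both the compressed and decompressed MLA projections, this fuses two groups: `q_a_proj`/`kv_a_proj_with_mqa` together, and `q_b_proj`/`kv_b_proj` together, without mixing scales across the pairs.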
```python
import pytest
import torch
import torch.nn as nn
from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme, QuantizationStrategy
```
This import line exceeds the repo's configured formatting constraints (ruff line-length 88) and will fail `make quality` (`ruff format --check`). Run `ruff format` or wrap this import into a parenthesized multi-line import.
```diff
-from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme, QuantizationStrategy
+from compressed_tensors.quantization import (
+    QuantizationArgs,
+    QuantizationScheme,
+    QuantizationStrategy,
+)
```
```python
    def test_deepseek_mla_full_only_first_group_matches(self, mock_is_attn):
        """When both compressed and decompressed MLA projections exist,
        only the first matching group (compressed) is fused per call.

        In practice the module tree is traversed and the function is called
        for each sub-module, so both groups will eventually be handled.
        But per single call, only the first match should apply."""
```
The comment here is misleading: update_fused_layer_weight_global_scales is invoked once per module during model.modules() traversal, and the projection submodules (q_a_proj, q_b_proj, etc.) are Linear modules that won't hit the attention branch. So if the intent is to fuse both compressed and decompressed MLA groups, it needs to happen in a single call on the attention module (or the comment/test should be updated to reflect the actual behavior).
```diff
-    def test_deepseek_mla_full_only_first_group_matches(self, mock_is_attn):
-        """When both compressed and decompressed MLA projections exist,
-        only the first matching group (compressed) is fused per call.
-
-        In practice the module tree is traversed and the function is called
-        for each sub-module, so both groups will eventually be handled.
-        But per single call, only the first match should apply."""
+    def test_deepseek_mla_full_only_compressed_group_fused(self, mock_is_attn):
+        """When both compressed and decompressed MLA projections exist in a
+        single attention module, only the compressed group
+        (q_a_proj + kv_a_proj_with_mqa) is fused by this call.
+        The decompressed projections in the same module are not updated by
+        update_fused_layer_weight_global_scales in this scenario."""
```
Summary

Fixes #2360

The `update_fused_layer_weight_global_scales` function previously assumed all attention modules use standard `q_proj`/`k_proj`/`v_proj` projections. After compressed-tensors PR vllm-project/compressed-tensors#533 updated `is_attention_module` to also return `True` for DeepSeek V2/V3 MLA attention modules, this caused an `AttributeError` because MLA uses different projection names.

Changes

Core fix

- Introduce a `_ATTENTION_FUSED_GROUPS` registry that lists all known attention projection groups:
  - `["qkv_proj"]`: already-fused (e.g. GPT-NeoX, Falcon)
  - `["q_proj", "k_proj", "v_proj"]`: standard separate projections
  - `["q_a_proj", "kv_a_proj_with_mqa"]`: DeepSeek MLA compressed
  - `["q_b_proj", "kv_b_proj"]`: DeepSeek MLA decompressed

Refactoring

- Extract a `_fuse_global_scales()` helper to remove duplicated scale-fusion logic between the attention and MLP branches
- Move `_valid_tensor_group_quant()` to module level for reuse
- Remove the unused `Linear` import (use `Module` instead)

Tests

- Add a test suite (`test_fused_global_scales.py`) covering standard, fused, MLA, and edge-case scenarios