
fix: generalize fused layer global scales for DeepSeek MLA attention #2437

Closed
NJX-njx wants to merge 1 commit into vllm-project:main from
NJX-njx:fix/deepseek-fused-layer-global-scales

Conversation


@NJX-njx NJX-njx commented Mar 4, 2026

Summary

Fixes #2360

The update_fused_layer_weight_global_scales function previously assumed all attention modules use standard q_proj/k_proj/v_proj projections. After compressed-tensors PR vllm-project/compressed-tensors#533 updated is_attention_module to also return True for DeepSeek V2/V3 MLA attention modules, this caused an AttributeError because MLA uses different projection names.

Changes

Core fix

  • Introduce a configurable _ATTENTION_FUSED_GROUPS registry that lists all known attention projection groups:
    • [qkv_proj] already-fused (e.g. GPT-NeoX, Falcon)
    • [q_proj, k_proj, v_proj] standard separate projections
    • [q_a_proj, kv_a_proj_with_mqa] DeepSeek MLA compressed
    • [q_b_proj, kv_b_proj] DeepSeek MLA decompressed
  • The attention branch now iterates over groups and matches the first complete group, gracefully skipping unrecognized modules instead of crashing.
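The registry-plus-matching idea above can be sketched minimally, using plain objects in place of torch modules (`first_complete_group` is a hypothetical name; the real logic lives in src/llmcompressor/modifiers/utils/helpers.py):

```python
from types import SimpleNamespace

# Registry of known attention projection layouts, as listed above.
_ATTENTION_FUSED_GROUPS = [
    ["qkv_proj"],                        # already fused (GPT-NeoX, Falcon)
    ["q_proj", "k_proj", "v_proj"],      # standard separate projections
    ["q_a_proj", "kv_a_proj_with_mqa"],  # DeepSeek MLA compressed
    ["q_b_proj", "kv_b_proj"],           # DeepSeek MLA decompressed
]

def first_complete_group(module):
    """Return the first group whose projections all exist on `module`,
    or None so unrecognized layouts are skipped instead of crashing."""
    for group in _ATTENTION_FUSED_GROUPS:
        if all(hasattr(module, name) for name in group):
            return group
    return None

# An MLA-style module matches the compressed group, not q/k/v:
mla = SimpleNamespace(q_a_proj=object(), kv_a_proj_with_mqa=object())
print(first_complete_group(mla))  # ['q_a_proj', 'kv_a_proj_with_mqa']
```

The `hasattr` check is what removes the hard dependency on q_proj/k_proj/v_proj that caused the original AttributeError.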

Refactoring

  • Extract _fuse_global_scales() helper to remove duplicated scale-fusion logic between attention and MLP branches
  • Move _valid_tensor_group_quant() to module level for reuse
  • Remove unused Linear import (use Module instead)
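As a rough illustration of the extracted helper: fusing means giving every projection in a group the same global scale, here sketched with plain floats where the real helper operates on tensor-valued scales (the `weight_global_scale` attribute and the min-reduction follow the PR description, not necessarily the exact library code):

```python
def _fuse_global_scales(layers):
    # Give every projection in a fused group the same global scale so
    # their NVFP4 quantization ranges stay consistent; the minimum
    # covers the widest dynamic range across the group.
    fused = min(layer.weight_global_scale for layer in layers)
    for layer in layers:
        layer.weight_global_scale = fused

class _Proj:
    """Stand-in for an nn.Linear carrying a weight_global_scale."""
    def __init__(self, scale: float):
        self.weight_global_scale = scale

q, k, v = _Proj(4.0), _Proj(8.0), _Proj(2.0)
_fuse_global_scales([q, k, v])
print(q.weight_global_scale, k.weight_global_scale, v.weight_global_scale)  # 2.0 2.0 2.0
```

Because the helper only sees a list of layers, the same code path serves both the attention groups and the MLP gate/up pair, which is what removes the duplication.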

Tests

  • Add comprehensive test suite (test_fused_global_scales.py) covering:
    • Standard Q/K/V fusion
    • Already-fused QKV (no-op)
    • DeepSeek MLA compressed projections
    • MLA with both compressed + decompressed projections
    • Unrecognized attention projections (no-op)
    • MLP gate/up fusion
    • Non-attention/non-MLP modules (no-op)
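A hedged sketch of how the "unrecognized attention projections (no-op)" case might be exercised; the mock objects and function names below are illustrative, not the actual test file:

```python
from types import SimpleNamespace

KNOWN_GROUPS = [
    ["qkv_proj"],
    ["q_proj", "k_proj", "v_proj"],
    ["q_a_proj", "kv_a_proj_with_mqa"],
    ["q_b_proj", "kv_b_proj"],
]

def fuse_attention_scales(module):
    """Fuse the first complete projection group; silently no-op otherwise."""
    for group in KNOWN_GROUPS:
        if not all(hasattr(module, n) for n in group):
            continue
        layers = [getattr(module, n) for n in group]
        if len(layers) <= 1:
            return  # already-fused qkv_proj: nothing to do
        fused = min(l.weight_global_scale for l in layers)
        for l in layers:
            l.weight_global_scale = fused
        return

def test_unrecognized_projections_are_a_noop():
    odd = SimpleNamespace(w_qkv=SimpleNamespace(weight_global_scale=3.0))
    fuse_attention_scales(odd)  # must not raise, and scales stay untouched
    assert odd.w_qkv.weight_global_scale == 3.0

test_unrecognized_projections_are_a_noop()
```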

Copilot AI review requested due to automatic review settings March 4, 2026 10:25

github-actions bot commented Mar 4, 2026

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: this is required to complete the testing suite; please add the label only once the PR is code complete and local testing has been performed.

@gemini-code-assist

Summary of Changes


This pull request resolves an AttributeError that occurred during NVFP4 quantization when processing DeepSeek MLA attention modules. It enhances the update_fused_layer_weight_global_scales function by generalizing the mechanism for fusing global scales across different projection layers. This is achieved through a flexible registry of attention projection groups, ensuring that diverse attention architectures, such as DeepSeek MLA, are correctly supported, thereby improving the robustness of the quantization process.

Highlights

  • Generalized Attention Global Scale Fusion: Introduced a configurable _ATTENTION_FUSED_GROUPS registry to support various attention projection naming conventions, including DeepSeek MLA, preventing AttributeError when fusing global scales for NVFP4 quantization.
  • Refactored Global Scale Logic: Extracted common global scale fusion logic into a new helper function _fuse_global_scales and moved _valid_tensor_group_quant to the module level for improved reusability and clarity.
  • Comprehensive Test Coverage: Added a new, extensive test suite to validate global scale fusion across standard QKV, already-fused QKV, DeepSeek MLA, and MLP modules, ensuring correct behavior and no-op scenarios.


Changelog
  • src/llmcompressor/modifiers/utils/helpers.py
    • Added a _ATTENTION_FUSED_GROUPS list to define various attention projection groups for global scale fusion.
    • Extracted global scale fusion logic into a new private helper function _fuse_global_scales.
    • Updated update_fused_layer_weight_global_scales to iterate through _ATTENTION_FUSED_GROUPS for attention modules, handling different projection naming conventions.
    • Moved the _valid_tensor_group_quant function to the module level for broader accessibility.
    • Removed an unused Linear import and added a logging import.
  • tests/llmcompressor/modifiers/utils/test_fused_global_scales.py
    • Added a new test file test_fused_global_scales.py.
    • Implemented mock attention and MLP modules to simulate different projection configurations.
    • Included tests for standard QKV, already-fused QKV, DeepSeek MLA compressed and full projections, and MLP gate/up fusion.
    • Added test cases to ensure no-op behavior for unrecognized attention projections and non-attention/non-MLP modules.
Activity
  • No human activity (comments, reviews, or progress updates) has been recorded for this pull request yet.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses the AttributeError by generalizing the fused layer global scales for DeepSeek MLA attention. The introduction of the _ATTENTION_FUSED_GROUPS registry provides a flexible and extensible way to define different attention projection groups, ensuring compatibility with various model architectures. The refactoring into _fuse_global_scales() and moving _valid_tensor_group_quant() to a module level significantly improves code readability and reduces duplication. The comprehensive test suite added in test_fused_global_scales.py thoroughly validates the new logic across standard, already-fused, and DeepSeek MLA projections, as well as MLP fusion and no-op cases. Overall, this is a well-implemented and thoroughly tested fix.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1728b91398


Referenced diff excerpt (attention branch in helpers.py):

    return

    _fuse_global_scales(layers)
    return  # only the first matching group applies


P1: Fuse every matching attention projection group

When an MLA attention module contains both q_a_proj/kv_a_proj_with_mqa and q_b_proj/kv_b_proj, this early return stops after fusing the first complete group, so the second pair is never synchronized even though it is explicitly listed in _ATTENTION_FUSED_GROUPS. In the calibration flows I checked (quantization/base.py, awq/base.py, gptq/base.py), this helper is invoked once per module via for module in state.model.modules(), so there is no second call on the same attention block to fuse the remaining group. That leaves part of DeepSeek MLA attention with inconsistent NVFP4 global scales.
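A minimal sketch of the behavior Codex asks for, assuming the group registry and attribute names from the PR description (illustrative, not the actual helpers.py code): iterate all groups and drop the early return, so both MLA groups on one module get fused in a single call.

```python
from types import SimpleNamespace

_ATTENTION_FUSED_GROUPS = [
    ["qkv_proj"],
    ["q_proj", "k_proj", "v_proj"],
    ["q_a_proj", "kv_a_proj_with_mqa"],
    ["q_b_proj", "kv_b_proj"],
]

def fuse_all_groups(module):
    for group in _ATTENTION_FUSED_GROUPS:
        layers = [getattr(module, n) for n in group if hasattr(module, n)]
        if len(layers) != len(group) or len(layers) <= 1:
            continue  # incomplete or already-fused group: try the next one
        fused = min(l.weight_global_scale for l in layers)
        for l in layers:
            l.weight_global_scale = fused
        # no early return: an MLA module exposing both the compressed and
        # decompressed pairs has both groups synchronized here

mla = SimpleNamespace(
    q_a_proj=SimpleNamespace(weight_global_scale=4.0),
    kv_a_proj_with_mqa=SimpleNamespace(weight_global_scale=8.0),
    q_b_proj=SimpleNamespace(weight_global_scale=2.0),
    kv_b_proj=SimpleNamespace(weight_global_scale=6.0),
)
fuse_all_groups(mla)
print(mla.kv_a_proj_with_mqa.weight_global_scale)  # 4.0 (compressed pair)
print(mla.kv_b_proj.weight_global_scale)           # 2.0 (decompressed pair)
```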



Copilot AI left a comment


Pull request overview

This PR fixes update_fused_layer_weight_global_scales crashing on DeepSeek V2/V3 MLA attention modules by generalizing attention projection handling beyond the standard q_proj/k_proj/v_proj layout.

Changes:

  • Add an _ATTENTION_FUSED_GROUPS registry and iterate groups to select projection sets to fuse.
  • Refactor duplicated fusion logic into _fuse_global_scales() and move _valid_tensor_group_quant() to module scope.
  • Add unit tests covering standard QKV, fused QKV, DeepSeek MLA, no-op cases, and MLP gate/up fusion.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
src/llmcompressor/modifiers/utils/helpers.py Generalizes attention projection fusion via a registry and refactors shared fusion/validation helpers.
tests/llmcompressor/modifiers/utils/test_fused_global_scales.py Adds tests for fused global scale behavior across attention (incl. MLA) and MLP patterns.


Comment on lines +70 to +77

    class _FakeMLAAttention(nn.Module):
        """Mock DeepSeek V2/V3 MLA-style attention with q_a_proj + kv_a_proj_with_mqa."""

        def __init__(self, dim: int = 64, scales=(4.0, 8.0)):
            super().__init__()
            self.q_a_proj = _make_linear_with_global_scale(dim, dim, scales[0])
            self.kv_a_proj_with_mqa = _make_linear_with_global_scale(dim, dim, scales[1])


Copilot AI Mar 4, 2026


The test suite doesn't cover the DeepSeek MLA naming variant where the compressed Q projection is q_proj (see repo note in src/llmcompressor/modifiers/awq/mappings.py about q_proj vs q_a_proj). Adding a test case for q_proj + kv_a_proj_with_mqa would help ensure the helper handles both forms and prevents regressions.

Comment on lines +24 to +32

    _ATTENTION_FUSED_GROUPS: list[list[str]] = [
        # Already-fused QKV (e.g. GPT-NeoX, Falcon)
        ["qkv_proj"],
        # Standard separate Q/K/V projections (Llama, Mistral, Qwen, etc.)
        ["q_proj", "k_proj", "v_proj"],
        # DeepSeek V2/V3 MLA: compressed Q + compressed KV
        ["q_a_proj", "kv_a_proj_with_mqa"],
        # DeepSeek V2/V3 MLA: decompressed Q + decompressed KV
        ["q_b_proj", "kv_b_proj"],

Copilot AI Mar 4, 2026


_ATTENTION_FUSED_GROUPS doesn't account for the DeepSeek variant where the compressed Q projection is named q_proj (repo already notes some DeepSeek models use q_proj instead of q_a_proj). In that case an attention module with q_proj + kv_a_proj_with_mqa will match no group and scales won't be fused. Consider adding an additional group for ["q_proj", "kv_a_proj_with_mqa"] (ordered before the standard [q_proj,k_proj,v_proj] group, since it won't fully match anyway) or otherwise handling this alias explicitly.

Comment on lines +93 to +113

    if is_attention_module(submodule):
        for group in _ATTENTION_FUSED_GROUPS:
            layers = [
                getattr(submodule, name)
                for name in group
                if hasattr(submodule, name)
            ]
            # Only fuse when ALL names in the group are present
            if len(layers) != len(group):
                continue

            # Skip single-projection groups (already fused, e.g. qkv_proj)
            if len(layers) <= 1:
                return

            if not _valid_tensor_group_quant(layers):
                return

            _fuse_global_scales(layers)
            return  # only the first matching group applies


Copilot AI Mar 4, 2026


The attention branch returns after fusing the first matching group. For DeepSeek MLA modules that expose both compressed (q_a_proj/kv_a_proj_with_mqa) and decompressed (q_b_proj/kv_b_proj) projections on the same attention module, the second group will never be fused because update_fused_layer_weight_global_scales is only called once per module during model.modules() traversal. Consider fusing all complete, disjoint groups present on the module (or at least not returning after the first successful fusion), and similarly continue (not return) when a matched group isn't TENSOR_GROUP so later groups can still be handled.

import pytest
import torch
import torch.nn as nn
from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme, QuantizationStrategy

Copilot AI Mar 4, 2026


This import line exceeds the repo's configured formatting constraints (ruff line-length 88) and will fail make quality (ruff format --check). Run ruff format or wrap this import into a parenthesized multi-line import.

Suggested change

    - from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme, QuantizationStrategy
    + from compressed_tensors.quantization import (
    +     QuantizationArgs,
    +     QuantizationScheme,
    +     QuantizationStrategy,
    + )

Comment on lines +163 to +169

    def test_deepseek_mla_full_only_first_group_matches(self, mock_is_attn):
        """When both compressed and decompressed MLA projections exist,
        only the first matching group (compressed) is fused per call.

        In practice the module tree is traversed and the function is called
        for each sub-module, so both groups will eventually be handled.
        But per single call, only the first match should apply."""

Copilot AI Mar 4, 2026


The comment here is misleading: update_fused_layer_weight_global_scales is invoked once per module during model.modules() traversal, and the projection submodules (q_a_proj, q_b_proj, etc.) are Linear modules that won't hit the attention branch. So if the intent is to fuse both compressed and decompressed MLA groups, it needs to happen in a single call on the attention module (or the comment/test should be updated to reflect the actual behavior).

Suggested change

    - def test_deepseek_mla_full_only_first_group_matches(self, mock_is_attn):
    -     """When both compressed and decompressed MLA projections exist,
    -     only the first matching group (compressed) is fused per call.
    -     In practice the module tree is traversed and the function is called
    -     for each sub-module, so both groups will eventually be handled.
    -     But per single call, only the first match should apply."""
    + def test_deepseek_mla_full_only_compressed_group_fused(self, mock_is_attn):
    +     """When both compressed and decompressed MLA projections exist in a
    +     single attention module, only the compressed group
    +     (q_a_proj + kv_a_proj_with_mqa) is fused by this call.
    +     The decompressed projections in the same module are not updated by
    +     update_fused_layer_weight_global_scales in this scenario."""

@dsikka dsikka closed this Mar 4, 2026


Development

Successfully merging this pull request may close these issues.

[Bug]: update_fused_layer_weight_global_scales fails for DeepSeekV2Attention

3 participants