fix: generalize fused layer global scales for DeepSeek MLA attention #2437
NJX-njx wants to merge 1 commit into vllm-project:main
Conversation
Fixes vllm-project#2360

The `update_fused_layer_weight_global_scales` function previously assumed all attention modules use standard `q_proj`/`k_proj`/`v_proj` projections. This caused an `AttributeError` for DeepSeek V2/V3 models that use Multi-head Latent Attention (MLA) with different projection names (`q_a_proj`, `kv_a_proj_with_mqa`, `q_b_proj`, `kv_b_proj`).

Changes:

- Introduce a configurable `_ATTENTION_FUSED_GROUPS` registry that lists all known attention projection groups (standard QKV, fused QKV, MLA compressed, MLA decompressed)
- Refactor the attention branch to iterate over groups and match the first complete group, gracefully skipping unrecognized modules
- Extract a `_fuse_global_scales` helper to remove code duplication
- Move `_valid_tensor_group_quant` to module level for reuse
- Remove the unused `Linear` import (use `Module` instead)
- Add a comprehensive test suite covering standard, fused, MLA, and edge-case scenarios
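The registry-and-iteration approach described above can be sketched as follows. This is a simplified stand-in, not the PR's actual helper: real projections are `torch.nn.Linear` modules carrying NVFP4 global-scale parameters, while here plain objects with a `weight_global_scale` float are used, and the fusion rule is assumed to take the minimum scale across the group.

```python
from types import SimpleNamespace

# Known attention projection groups, checked in order (from the PR description).
_ATTENTION_FUSED_GROUPS = [
    ["qkv_proj"],                        # already fused (e.g. GPT-NeoX, Falcon)
    ["q_proj", "k_proj", "v_proj"],      # standard separate projections
    ["q_a_proj", "kv_a_proj_with_mqa"],  # DeepSeek MLA compressed
    ["q_b_proj", "kv_b_proj"],           # DeepSeek MLA decompressed
]

def fuse_attention_global_scales(module) -> bool:
    """Fuse weight_global_scale across the first complete projection group.

    Returns True if a group was fused; unrecognized modules are skipped
    without raising, which is the behavior the PR adds.
    """
    for group in _ATTENTION_FUSED_GROUPS:
        layers = [getattr(module, n) for n in group if hasattr(module, n)]
        if len(layers) != len(group):
            continue  # incomplete group: try the next known pattern
        if len(layers) <= 1:
            return False  # a single projection is already fused
        # Assumed fusion rule: every layer adopts the smallest global scale.
        fused = min(layer.weight_global_scale for layer in layers)
        for layer in layers:
            layer.weight_global_scale = fused
        return True  # only the first matching group applies
    return False
```

For a DeepSeek MLA module exposing only `q_a_proj` and `kv_a_proj_with_mqa`, the third group matches and the standard QKV group is skipped without touching missing attributes.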
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Code Review
This pull request effectively addresses the AttributeError by generalizing the fused layer global scales for DeepSeek MLA attention. The introduction of the _ATTENTION_FUSED_GROUPS registry provides a flexible and extensible way to define different attention projection groups, ensuring compatibility with various model architectures. The refactoring into _fuse_global_scales() and moving _valid_tensor_group_quant() to a module level significantly improves code readability and reduces duplication. The comprehensive test suite added in test_fused_global_scales.py thoroughly validates the new logic across standard, already-fused, and DeepSeek MLA projections, as well as MLP fusion and no-op cases. Overall, this is a well-implemented and thoroughly tested fix.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1728b91398
```python
                return

            _fuse_global_scales(layers)
            return  # only the first matching group applies
```
Fuse every matching attention projection group
When an MLA attention module contains both q_a_proj/kv_a_proj_with_mqa and q_b_proj/kv_b_proj, this early return stops after fusing the first complete group, so the second pair is never synchronized even though it is explicitly listed in _ATTENTION_FUSED_GROUPS. In the calibration flows I checked (quantization/base.py, awq/base.py, gptq/base.py), this helper is invoked once per module via for module in state.model.modules(), so there is no second call on the same attention block to fuse the remaining group. That leaves part of DeepSeek MLA attention with inconsistent NVFP4 global scales.
Pull request overview
This PR fixes update_fused_layer_weight_global_scales crashing on DeepSeek V2/V3 MLA attention modules by generalizing attention projection handling beyond the standard q_proj/k_proj/v_proj layout.
Changes:

- Add an `_ATTENTION_FUSED_GROUPS` registry and iterate groups to select projection sets to fuse.
- Refactor duplicated fusion logic into `_fuse_global_scales()` and move `_valid_tensor_group_quant()` to module scope.
- Add unit tests covering standard QKV, fused QKV, DeepSeek MLA, no-op cases, and MLP gate/up fusion.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/llmcompressor/modifiers/utils/helpers.py` | Generalizes attention projection fusion via a registry and refactors shared fusion/validation helpers. |
| `tests/llmcompressor/modifiers/utils/test_fused_global_scales.py` | Adds tests for fused global scale behavior across attention (incl. MLA) and MLP patterns. |
```python
class _FakeMLAAttention(nn.Module):
    """Mock DeepSeek V2/V3 MLA-style attention with q_a_proj + kv_a_proj_with_mqa."""

    def __init__(self, dim: int = 64, scales=(4.0, 8.0)):
        super().__init__()
        self.q_a_proj = _make_linear_with_global_scale(dim, dim, scales[0])
        self.kv_a_proj_with_mqa = _make_linear_with_global_scale(dim, dim, scales[1])
```
The test suite doesn't cover the DeepSeek MLA naming variant where the compressed Q projection is q_proj (see repo note in src/llmcompressor/modifiers/awq/mappings.py about q_proj vs q_a_proj). Adding a test case for q_proj + kv_a_proj_with_mqa would help ensure the helper handles both forms and prevents regressions.
```python
_ATTENTION_FUSED_GROUPS: list[list[str]] = [
    # Already-fused QKV (e.g. GPT-NeoX, Falcon)
    ["qkv_proj"],
    # Standard separate Q/K/V projections (Llama, Mistral, Qwen, etc.)
    ["q_proj", "k_proj", "v_proj"],
    # DeepSeek V2/V3 MLA: compressed Q + compressed KV
    ["q_a_proj", "kv_a_proj_with_mqa"],
    # DeepSeek V2/V3 MLA: decompressed Q + decompressed KV
    ["q_b_proj", "kv_b_proj"],
]
```
_ATTENTION_FUSED_GROUPS doesn't account for the DeepSeek variant where the compressed Q projection is named q_proj (repo already notes some DeepSeek models use q_proj instead of q_a_proj). In that case an attention module with q_proj + kv_a_proj_with_mqa will match no group and scales won't be fused. Consider adding an additional group for ["q_proj", "kv_a_proj_with_mqa"] (ordered before the standard [q_proj,k_proj,v_proj] group, since it won't fully match anyway) or otherwise handling this alias explicitly.
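The ordering the reviewer suggests can be checked with a small sketch. The alias group below is hypothetical, proposed by the review rather than present in the PR:

```python
# Registry with the hypothetical alias group inserted before the standard
# Q/K/V group, as the review suggests.
_ATTENTION_FUSED_GROUPS = [
    ["qkv_proj"],
    ["q_proj", "kv_a_proj_with_mqa"],  # proposed DeepSeek alias (compressed Q named q_proj)
    ["q_proj", "k_proj", "v_proj"],
    ["q_a_proj", "kv_a_proj_with_mqa"],
    ["q_b_proj", "kv_b_proj"],
]

def first_complete_group(module):
    """Return the first group whose projections are all present on the module."""
    for group in _ATTENTION_FUSED_GROUPS:
        if all(hasattr(module, name) for name in group):
            return group
    return None
```

A standard Llama-style module still resolves to `["q_proj", "k_proj", "v_proj"]`, because the alias group also requires `kv_a_proj_with_mqa`, which such modules lack.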
```python
    if is_attention_module(submodule):
        for group in _ATTENTION_FUSED_GROUPS:
            layers = [
                getattr(submodule, name)
                for name in group
                if hasattr(submodule, name)
            ]
            # Only fuse when ALL names in the group are present
            if len(layers) != len(group):
                continue

            # Skip single-projection groups (already fused, e.g. qkv_proj)
            if len(layers) <= 1:
                return

            if not _valid_tensor_group_quant(layers):
                return

            _fuse_global_scales(layers)
            return  # only the first matching group applies
```
The attention branch returns after fusing the first matching group. For DeepSeek MLA modules that expose both compressed (q_a_proj/kv_a_proj_with_mqa) and decompressed (q_b_proj/kv_b_proj) projections on the same attention module, the second group will never be fused because update_fused_layer_weight_global_scales is only called once per module during model.modules() traversal. Consider fusing all complete, disjoint groups present on the module (or at least not returning after the first successful fusion), and similarly continue (not return) when a matched group isn't TENSOR_GROUP so later groups can still be handled.
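A hypothetical version following this suggestion, fusing every complete, disjoint group in one call instead of returning after the first, might look like this (simplified: plain objects with a `weight_global_scale` float stand in for `Linear` modules, and fusion is assumed to take the group minimum):

```python
_ATTENTION_FUSED_GROUPS = [
    ["qkv_proj"],
    ["q_proj", "k_proj", "v_proj"],
    ["q_a_proj", "kv_a_proj_with_mqa"],
    ["q_b_proj", "kv_b_proj"],
]

def fuse_all_attention_groups(module) -> int:
    """Fuse every complete multi-projection group; return how many were fused.

    Unlike the PR's first-match-wins loop, this continues past a fused group,
    so both MLA groups on one module are synchronized in a single call.
    """
    fused_count = 0
    claimed = set()  # projection names already fused by an earlier group
    for group in _ATTENTION_FUSED_GROUPS:
        if len(group) <= 1:
            continue  # already-fused layouts need no work
        if any(name in claimed for name in group):
            continue  # keep the fused groups disjoint
        if not all(hasattr(module, name) for name in group):
            continue  # incomplete group: try the next pattern
        layers = [getattr(module, name) for name in group]
        # Assumed fusion rule: every layer adopts the smallest global scale.
        scale = min(layer.weight_global_scale for layer in layers)
        for layer in layers:
            layer.weight_global_scale = scale
        claimed.update(group)
        fused_count += 1
    return fused_count
```

On a module exposing both the compressed and decompressed MLA projections, this fuses two groups: `q_a_proj`/`kv_a_proj_with_mqa` together, and `q_b_proj`/`kv_b_proj` together, without mixing scales across the pairs.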
```python
import pytest
import torch
import torch.nn as nn
from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme, QuantizationStrategy
```
This import line exceeds the repo's configured formatting constraints (ruff line-length 88) and will fail `make quality` (`ruff format --check`). Run `ruff format` or wrap this import into a parenthesized multi-line import.
```diff
-from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme, QuantizationStrategy
+from compressed_tensors.quantization import (
+    QuantizationArgs,
+    QuantizationScheme,
+    QuantizationStrategy,
+)
```
```python
    def test_deepseek_mla_full_only_first_group_matches(self, mock_is_attn):
        """When both compressed and decompressed MLA projections exist,
        only the first matching group (compressed) is fused per call.

        In practice the module tree is traversed and the function is called
        for each sub-module, so both groups will eventually be handled.
        But per single call, only the first match should apply."""
```
The comment here is misleading: update_fused_layer_weight_global_scales is invoked once per module during model.modules() traversal, and the projection submodules (q_a_proj, q_b_proj, etc.) are Linear modules that won't hit the attention branch. So if the intent is to fuse both compressed and decompressed MLA groups, it needs to happen in a single call on the attention module (or the comment/test should be updated to reflect the actual behavior).
```diff
-    def test_deepseek_mla_full_only_first_group_matches(self, mock_is_attn):
-        """When both compressed and decompressed MLA projections exist,
-        only the first matching group (compressed) is fused per call.
-
-        In practice the module tree is traversed and the function is called
-        for each sub-module, so both groups will eventually be handled.
-        But per single call, only the first match should apply."""
+    def test_deepseek_mla_full_only_compressed_group_fused(self, mock_is_attn):
+        """When both compressed and decompressed MLA projections exist in a
+        single attention module, only the compressed group
+        (q_a_proj + kv_a_proj_with_mqa) is fused by this call.
+        The decompressed projections in the same module are not updated by
+        update_fused_layer_weight_global_scales in this scenario."""
```
Summary

Fixes #2360

The `update_fused_layer_weight_global_scales` function previously assumed all attention modules use standard `q_proj`/`k_proj`/`v_proj` projections. After compressed-tensors PR vllm-project/compressed-tensors#533 updated `is_attention_module` to also return `True` for DeepSeek V2/V3 MLA attention modules, this caused an `AttributeError` because MLA uses different projection names.

Changes

Core fix

- Introduce a `_ATTENTION_FUSED_GROUPS` registry that lists all known attention projection groups:
  - `["qkv_proj"]`: already-fused (e.g. GPT-NeoX, Falcon)
  - `["q_proj", "k_proj", "v_proj"]`: standard separate projections
  - `["q_a_proj", "kv_a_proj_with_mqa"]`: DeepSeek MLA compressed
  - `["q_b_proj", "kv_b_proj"]`: DeepSeek MLA decompressed

Refactoring

- Extract a `_fuse_global_scales()` helper to remove duplicated scale-fusion logic between the attention and MLP branches
- Move `_valid_tensor_group_quant()` to module level for reuse
- Remove the unused `Linear` import (use `Module` instead)

Tests

- Add a test suite (`test_fused_global_scales.py`) covering standard, fused, MLA, and edge-case scenarios