
[Bugfix] Fix deepseek 3.2 C8 precision by revert quantization layers #7628

Open

Yaphets24 wants to merge 18 commits into vllm-project:main from Yaphets24:revert_c8

Conversation

@Yaphets24
Contributor

@Yaphets24 Yaphets24 commented Mar 25, 2026

What this PR does / why we need it?

Support reverting DeepSeek 3.2 C8 quantization layers from int8 to float.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

mayumeng added 14 commits March 25, 2026 11:08
Signed-off-by: mayumeng <m30059191@china.huawei.com>
@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Fill out the PR description in the commit message to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the sparse C8 quantization mechanism by introducing layer-specific control. Previously, sparse C8 was likely a global setting, but these changes allow for more granular configuration, enabling specific attention layers to utilize sparse C8 based on detailed quantization descriptions. This refinement improves flexibility and potentially optimization for models using sparse C8.

Highlights

  • Introduced Layer-Specific Sparse C8 Configuration: The system now supports defining which specific layers should use sparse C8 quantization by parsing detailed quantization configurations from the VllmConfig.
  • Updated KV Cache Management for Layer-Wise Sparse C8: The KV cache allocation and reshaping logic has been updated to dynamically apply sparse C8 settings based on whether an individual attention layer is configured for it, rather than a global setting.
  • Refined Quantization Type Resolution: The process for determining the quantization type for attention layers has been improved to prioritize layer-specific configurations, allowing for more precise control.
  • Enhanced KV Cache Specification Merging: Assertions were added to ensure consistency in sparse C8 settings when merging KV cache specifications for grouped attention layers, preventing mismatches.
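The layer-specific configuration described in these highlights can be sketched as a small parser over the quant description. Everything below is an assumption for illustration: the function name, the key layout (`model.layers.<id>.self_attn.fa_quant_type`), and the `"C8"` marker are not taken from the PR's actual code.

```python
def parse_sparse_c8_layers(quant_description: dict) -> set:
    """Collect indices of layers whose attention quant type is C8."""
    layers = set()
    for key, quant_type in quant_description.items():
        if not key.endswith(".fa_quant_type") or quant_type != "C8":
            continue
        # Keys are assumed to look like "model.layers.<id>.self_attn...".
        for part in key.split("."):
            if part.isdigit():
                layers.add(int(part))
    return layers

desc = {
    "model.layers.0.self_attn.fa_quant_type": "C8",
    "model.layers.1.self_attn.fa_quant_type": "FLOAT",   # reverted layer
    "model.layers.2.self_attn.fa_quant_type": "C8",
}
print(parse_sparse_c8_layers(desc))  # {0, 2}
```

With such a set in hand, the KV cache manager can check membership per layer instead of relying on one global sparse C8 flag.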


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@Yaphets24 Yaphets24 changed the title Revert c8 [Bugfix]Fix deepseek 3.2 C8 precision by revert quantization layers Mar 25, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces layer-wise sparse C8 quantization support for Ascend devices. It refactors the AscendConfig to manage sparse C8 layer identification, updates attention modules to use this layer-specific configuration, and modifies KV cache handling to correctly allocate and reshape tensors based on the sparse C8 status. A robustness improvement was noted in modelslim_config.py to ensure that layer IDs are valid digits before parsing.

Comment on lines +127 to +129
```python
cache_sparse_c8_set = set(spec.cache_sparse_c8 for spec in specs)
assert len(cache_sparse_c8_set) == 1, (
    "All attention layers in the same KV cache group must use the same sparse C8 setting.")
```
Contributor

critical

This new assertion ensures that all attention layers within the same KV cache group consistently use the same sparse C8 setting. This is crucial for maintaining data integrity and preventing unexpected behavior when merging KV cache specifications.
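The set-based consistency check above can be exercised in isolation. The sketch below is illustrative only: `FakeSpec` and `merged_sparse_c8` are stand-ins for the PR's actual spec types, not its real code.

```python
from dataclasses import dataclass

@dataclass
class FakeSpec:
    # Stand-in for a KV cache spec carrying a per-layer sparse C8 flag.
    cache_sparse_c8: bool

def merged_sparse_c8(specs):
    # Collapse the per-spec flags into a set: one element means all agree.
    cache_sparse_c8_set = {spec.cache_sparse_c8 for spec in specs}
    assert len(cache_sparse_c8_set) == 1, (
        "All attention layers in the same KV cache group must use the "
        "same sparse C8 setting.")
    return cache_sparse_c8_set.pop()

print(merged_sparse_c8([FakeSpec(True), FakeSpec(True)]))  # True
try:
    merged_sparse_c8([FakeSpec(True), FakeSpec(False)])
except AssertionError:
    print("mixed group rejected")
```

A mixed group fails fast at merge time instead of producing silently inconsistent cache layouts later.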

```python
if self.use_sparse:
    # for deepseek v3.2, we split the kv cache according to the corresponding ratio
    kv_cache_spec = layer_kv_cache_spec[layer_name]
    current_sparse_c8 = self._kv_cache_spec_uses_sparse_c8(kv_cache_spec)
```
Contributor

critical

Introducing current_sparse_c8 to conditionally set dsa_k_scale_tensor_split_factor is a critical correctness improvement. This ensures that the scale tensor split factor is only considered when sparse C8 is actively enabled for the current KV cache specification, preventing potential errors or incorrect memory allocation.

```diff
 if "attn" in layer_name_inner and "linear_attn" not in layer_name_inner:
     if self.use_sparse:
-        if self.use_sparse_c8_indexer:
+        if current_sparse_c8:
```
Contributor

critical

Updating the condition to if current_sparse_c8: for assigning kv_cache_raw_tensors ensures that the 4-element tuple (including dsa_k_scale_tensor) is only used when sparse C8 is enabled. This prevents potential TypeError or ValueError if sparse_kv_cache_ratio[3] is None or if the tuple structure is not expected, which is a critical correctness fix.

Comment on lines +2873 to +2874
```python
current_sparse_c8 = self._kv_cache_spec_uses_sparse_c8(current_kv_cache_spec)
if current_sparse_c8:
```
Contributor

critical

The introduction of current_sparse_c8 and its use in the conditional if current_sparse_c8: statement is critical. This ensures that raw_dsa_k_scale_tensor is only unpacked when sparse C8 is enabled for the current KV cache specification. Without this, an attempt to unpack a 3-element tuple as 4 elements would result in a ValueError, leading to a crash.
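The failure mode described above is easy to reproduce with plain tuples. This is an illustrative sketch, not the PR's code; plain strings stand in for the cache tensors.

```python
# Raw tensor tuples as the comment describes them: 4 elements when the
# layer uses sparse C8 (including the scale tensor), 3 otherwise.
raw_tensors_c8 = ("k_cache", "v_cache", "dsa_k_cache", "dsa_k_scale")
raw_tensors_plain = ("k_cache", "v_cache", "dsa_k_cache")

def unpack(raw, current_sparse_c8):
    if current_sparse_c8:
        # Unpacking requires exactly 4 elements; a 3-element tuple here
        # would raise ValueError, which is the crash the guard prevents.
        k, v, dsa_k, dsa_k_scale = raw
    else:
        k, v, dsa_k = raw
        dsa_k_scale = None
    return k, v, dsa_k, dsa_k_scale

print(unpack(raw_tensors_c8, True)[3])      # dsa_k_scale
print(unpack(raw_tensors_plain, False)[3])  # None
```

Calling `unpack(raw_tensors_plain, True)` raises `ValueError: not enough values to unpack`, which is exactly why the unpack must be gated on the per-spec flag.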

Comment on lines +137 to +138
```python
cache_dtype_str=cache_dtype_str_set.pop(),
cache_sparse_c8=cache_sparse_c8_set.pop(),
```
Contributor

high

The cache_sparse_c8 parameter in the MLAAttentionSpec constructor is now correctly derived from the cache_sparse_c8_set. This ensures that the merged KV cache specification accurately reflects the sparse C8 configuration across all layers in the group, which is vital for the correct operation of the sparse C8 feature.

Comment on lines +381 to +383
```python
layer_indexer_quant_type = quant_description.get(f"{prefix}.indexer.quant_type")
if layer_indexer_quant_type is not None:
    return layer_indexer_quant_type
```
Contributor

high

Prioritizing layer_indexer_quant_type over general fa_quant_type or indexer_quant_type allows for more granular and layer-specific quantization configurations. This improves flexibility and precision in applying quantization settings.
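The priority order described above can be sketched as a simple fallback chain. The per-layer key mirrors the snippet; the fallback key names and the example dict are assumptions for illustration, not the PR's actual config schema.

```python
def resolve_indexer_quant_type(quant_description, prefix):
    # A layer-specific "<prefix>.indexer.quant_type" entry wins outright.
    layer_specific = quant_description.get(f"{prefix}.indexer.quant_type")
    if layer_specific is not None:
        return layer_specific
    # Otherwise fall back to module-wide defaults (assumed key names).
    return (quant_description.get("fa_quant_type")
            or quant_description.get("indexer_quant_type"))

desc = {
    "fa_quant_type": "C8",                          # global default
    "model.layers.5.indexer.quant_type": "FLOAT",   # per-layer override
}
print(resolve_indexer_quant_type(desc, "model.layers.5"))  # FLOAT
print(resolve_indexer_quant_type(desc, "model.layers.6"))  # C8
```

Layer 5 is reverted to float while every other layer keeps the global C8 setting, which is the granularity this PR is after.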

Comment on lines +804 to +805
```python
if _id.isdigit():
    self.indexer_quant_layers.append(int(_id))
```
Contributor

high

Adding a check if _id.isdigit(): before converting to an integer and appending to self.indexer_quant_layers enhances the robustness of the parsing logic. This prevents potential runtime errors if _id is not a valid digit, improving the stability of the quantization configuration.
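A minimal sketch of why the `isdigit()` guard helps, assuming keys shaped like `model.layers.<id>....` (the helper name and key layout are illustrative, not the PR's actual parsing code):

```python
def collect_indexer_quant_layers(keys):
    indexer_quant_layers = []
    for key in keys:
        parts = key.split(".")
        # Third path segment is the candidate layer id in this layout.
        _id = parts[2] if len(parts) > 2 else ""
        # Without this guard, int("norm") would raise ValueError.
        if _id.isdigit():
            indexer_quant_layers.append(int(_id))
    return indexer_quant_layers

keys = ["model.layers.3.indexer", "model.layers.12.indexer", "model.norm.weight"]
print(collect_indexer_quant_layers(keys))  # [3, 12]
```

Non-layer entries such as `model.norm.weight` are now skipped instead of crashing config parsing.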

```diff
 if self.use_sparse:
     dsa_k_tensor_size = int(kv_cache_tensor.size // dsa_k_tensor_split_factor)
-    if self.use_sparse_c8_indexer:
+    if self.use_sparse and current_sparse_c8:
```
Contributor

high

The condition for calculating dsa_k_scale_tensor_size has been updated to if self.use_sparse and current_sparse_c8:. This change correctly links the calculation to the current_sparse_c8 flag, ensuring that dsa_k_scale_tensor_size is only computed when sparse C8 is enabled and relevant, which is essential for accurate memory management.

```diff
     self.model_config.hf_text_config.index_head_dim,
 )
-if self.use_sparse_c8_indexer:
+if current_sparse_c8:
```
Contributor

high

Changing the condition to if current_sparse_c8: ensures that dsa_k_cache and dsa_k_scale_cache are only initialized when sparse C8 is active for the current KV cache specification. This maintains consistency with the new sparse C8 logic and prevents unnecessary resource allocation or incorrect data handling.

```diff
 dtype=self.kv_cache_dtype,
 cache_dtype_str=self.vllm_config.cache_config.cache_dtype,
-cache_sparse_c8=self.use_sparse_c8_indexer,
+cache_sparse_c8=self._is_sparse_c8_layer(layer_name),
```
Contributor

high

Updating cache_sparse_c8 to dynamically use self._is_sparse_c8_layer(layer_name) ensures that the MLAAttentionSpec accurately reflects the sparse C8 status for each specific layer. This is a crucial functional change for the correct implementation of layer-wise sparse C8 quantization.

mayumeng added 3 commits March 25, 2026 15:18
Signed-off-by: mayumeng <m30059191@china.huawei.com>
```python
return True
return False
```

Collaborator

remove this

Comment on lines +202 to +203
```python
def _extract_layer_ids(layer_name: str) -> set[int]:
    return {int(match) for match in re.findall(r"(?:^|\.)(\d+)(?:\.|$)", layer_name)}
```
Collaborator

plz use extract_layer_index instead

```python
    return layer_kv_cache_spec

def _is_sparse_c8_layer(self, layer_name: str) -> bool:
    return bool(self.use_sparse and self.ascend_config.is_sparse_c8_layer(layer_name))
```
Collaborator

Just self.ascend_config.is_sparse_c8_layer(layer_name) is enough.

Signed-off-by: mayumeng <m30059191@china.huawei.com>
```python
self._sparse_c8_layer_ids, self._sparse_c8_layer_names = self._parse_sparse_c8_layers_from_quant_config(
    quant_config
)
self._sparse_c8_layer_filter_enabled = self._has_sparse_c8_layer_config(quant_config)
```
Collaborator

Plz add e2e test in tests/e2e/multicard/2-cards/test_offline_inference_distributed.py
