[Bugfix] Fix DeepSeek 3.2 C8 precision by reverting quantization layers #7628
Yaphets24 wants to merge 18 commits into vllm-project:main
Conversation
Signed-off-by: mayumeng <m30059191@china.huawei.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the sparse C8 quantization mechanism by introducing layer-specific control. Previously, sparse C8 was likely a global setting; these changes allow more granular configuration, enabling specific attention layers to use sparse C8 based on detailed quantization descriptions. This refinement improves flexibility and potentially optimization for models using sparse C8.
Highlights
Code Review
This pull request introduces layer-wise sparse C8 quantization support for Ascend devices. It refactors the AscendConfig to manage sparse C8 layer identification, updates attention modules to use this layer-specific configuration, and modifies KV cache handling to correctly allocate and reshape tensors based on the sparse C8 status. A robustness improvement was noted in modelslim_config.py to ensure that layer IDs are valid digits before parsing.
```python
cache_sparse_c8_set = set(spec.cache_sparse_c8 for spec in specs)
assert len(cache_sparse_c8_set) == 1, (
    "All attention layers in the same KV cache group must use the same sparse C8 setting.")
```
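The uniformity check above can be illustrated with a minimal, self-contained sketch; the `FakeSpec` class and `check_group` helper below are hypothetical stand-ins for the real KV cache spec objects:

```python
from dataclasses import dataclass

@dataclass
class FakeSpec:
    # Stand-in for a KV cache spec carrying the sparse C8 flag.
    cache_sparse_c8: bool

def check_group(specs):
    # Collect the sparse C8 flags of every layer in the group; the
    # assertion fires if the group mixes C8 and non-C8 layers.
    flags = {spec.cache_sparse_c8 for spec in specs}
    assert len(flags) == 1, (
        "All attention layers in the same KV cache group must use "
        "the same sparse C8 setting.")
    return flags.pop()

print(check_group([FakeSpec(True), FakeSpec(True)]))  # True
```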
```python
if self.use_sparse:
    # for deepseek v3.2, we split the kv cache according to the corresponding ratio
    kv_cache_spec = layer_kv_cache_spec[layer_name]
    current_sparse_c8 = self._kv_cache_spec_uses_sparse_c8(kv_cache_spec)
```
Introducing current_sparse_c8 to conditionally set dsa_k_scale_tensor_split_factor is a critical correctness improvement. This ensures that the scale tensor split factor is only considered when sparse C8 is actively enabled for the current KV cache specification, preventing potential errors or incorrect memory allocation.
```diff
 if "attn" in layer_name_inner and "linear_attn" not in layer_name_inner:
-    if self.use_sparse:
-        if self.use_sparse_c8_indexer:
+    if current_sparse_c8:
```
Updating the condition to if current_sparse_c8: for assigning kv_cache_raw_tensors ensures that the 4-element tuple (including dsa_k_scale_tensor) is only used when sparse C8 is enabled. This prevents potential TypeError or ValueError if sparse_kv_cache_ratio[3] is None or if the tuple structure is not expected, which is a critical correctness fix.
```python
current_sparse_c8 = self._kv_cache_spec_uses_sparse_c8(current_kv_cache_spec)
if current_sparse_c8:
```
The introduction of current_sparse_c8 and its use in the conditional if current_sparse_c8: statement is critical. This ensures that raw_dsa_k_scale_tensor is only unpacked when sparse C8 is enabled for the current KV cache specification. Without this, an attempt to unpack a 3-element tuple as 4 elements would result in a ValueError, leading to a crash.
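The unpacking hazard described above can be shown with a small toy function; the names and tuple contents are made up for illustration, not the PR's actual code:

```python
def unpack_raw_tensors(raw, sparse_c8):
    # The raw tensor tuple carries a 4th element (the DSA k-scale
    # tensor) only when sparse C8 is enabled; unpacking a 3-element
    # tuple into 4 names would raise ValueError without this guard.
    if sparse_c8:
        k, v, dsa_k, dsa_k_scale = raw
        return k, v, dsa_k, dsa_k_scale
    k, v, dsa_k = raw
    return k, v, dsa_k, None

print(unpack_raw_tensors(("k", "v", "dsa_k"), sparse_c8=False))
# → ('k', 'v', 'dsa_k', None)
```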
```python
cache_dtype_str=cache_dtype_str_set.pop(),
cache_sparse_c8=cache_sparse_c8_set.pop(),
```
The cache_sparse_c8 parameter in the MLAAttentionSpec constructor is now correctly derived from the cache_sparse_c8_set. This ensures that the merged KV cache specification accurately reflects the sparse C8 configuration across all layers in the group, which is vital for the correct operation of the sparse C8 feature.
```python
layer_indexer_quant_type = quant_description.get(f"{prefix}.indexer.quant_type")
if layer_indexer_quant_type is not None:
    return layer_indexer_quant_type
```

```python
if _id.isdigit():
    self.indexer_quant_layers.append(int(_id))
```
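The `isdigit()` guard matters because key paths in a quantization description contain non-numeric segments. A self-contained sketch of that parsing pattern (the key layout and function name are assumptions, not the PR's exact code):

```python
def parse_indexer_quant_layers(quant_description):
    # Walk every dotted key (e.g. "model.layers.3.indexer") and keep
    # only the path components that are pure digits, so segments like
    # "model" or "indexer" never reach int() and raise ValueError.
    layers = []
    for key in quant_description:
        for _id in key.split("."):
            if _id.isdigit():
                layers.append(int(_id))
    return layers

print(parse_indexer_quant_layers({"model.layers.3.indexer": "C8"}))  # → [3]
```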
```diff
 if self.use_sparse:
     dsa_k_tensor_size = int(kv_cache_tensor.size // dsa_k_tensor_split_factor)
-if self.use_sparse_c8_indexer:
+if self.use_sparse and current_sparse_c8:
```
The condition for calculating dsa_k_scale_tensor_size has been updated to if self.use_sparse and current_sparse_c8:. This change correctly links the calculation to the current_sparse_c8 flag, ensuring that dsa_k_scale_tensor_size is only computed when sparse C8 is enabled and relevant, which is essential for accurate memory management.
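The guarded size computation can be sketched in isolation; the function name and the `split_factor` default are placeholders invented for this example:

```python
def compute_scale_tensor_size(tensor_size, use_sparse, current_sparse_c8,
                              split_factor=64):
    # The DSA k-scale tensor only exists when sparse C8 is active for
    # the current KV cache spec; otherwise no bytes are reserved.
    if use_sparse and current_sparse_c8:
        return int(tensor_size // split_factor)
    return 0

print(compute_scale_tensor_size(4096, True, True))   # → 64
print(compute_scale_tensor_size(4096, True, False))  # → 0
```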
```diff
     self.model_config.hf_text_config.index_head_dim,
 )
-if self.use_sparse_c8_indexer:
+if current_sparse_c8:
```
Changing the condition to if current_sparse_c8: ensures that dsa_k_cache and dsa_k_scale_cache are only initialized when sparse C8 is active for the current KV cache specification. This maintains consistency with the new sparse C8 logic and prevents unnecessary resource allocation or incorrect data handling.
```diff
 dtype=self.kv_cache_dtype,
 cache_dtype_str=self.vllm_config.cache_config.cache_dtype,
-cache_sparse_c8=self.use_sparse_c8_indexer,
+cache_sparse_c8=self._is_sparse_c8_layer(layer_name),
```
```python
return True
return False
```
vllm_ascend/ascend_config.py
Outdated
```python
def _extract_layer_ids(layer_name: str) -> set[int]:
    return {int(match) for match in re.findall(r"(?:^|\.)(\d+)(?:\.|$)", layer_name)}
```
Please use extract_layer_index instead.
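For context, the regex approach being replaced works like this self-contained sketch (a stand-in for the PR's helper, not vLLM's `extract_layer_index`):

```python
import re

def extract_layer_ids(layer_name: str) -> set[int]:
    # Capture digit runs that appear as whole dotted path segments,
    # so "model.layers.7.self_attn" yields {7} but "fc1" yields nothing.
    return {int(m) for m in re.findall(r"(?:^|\.)(\d+)(?:\.|$)", layer_name)}

print(extract_layer_ids("model.layers.7.self_attn.attn"))  # → {7}
```

Reusing an existing shared helper instead of a local regex keeps layer-name parsing consistent across the codebase, which is presumably the point of the review suggestion.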
```python
return layer_kv_cache_spec
```
```python
def _is_sparse_c8_layer(self, layer_name: str) -> bool:
    return bool(self.use_sparse and self.ascend_config.is_sparse_c8_layer(layer_name))
```
Just self.ascend_config.is_sparse_c8_layer(layer_name) is enough.
```python
self._sparse_c8_layer_ids, self._sparse_c8_layer_names = self._parse_sparse_c8_layers_from_quant_config(
    quant_config
)
self._sparse_c8_layer_filter_enabled = self._has_sparse_c8_layer_config(quant_config)
```
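A plausible shape for such a parser, sketched from scratch: the key layout (`….quant_type` entries) and the function name are assumptions made for illustration, not the PR's actual config format.

```python
def parse_sparse_c8_layers(quant_description):
    # Collect the IDs and prefixes of layers whose quant_type is "C8";
    # every other layer is treated as reverted to float.
    ids, names = set(), set()
    for key, qtype in quant_description.items():
        if not key.endswith(".quant_type"):
            continue
        if qtype == "C8":
            names.add(key.rsplit(".quant_type", 1)[0])
            for part in key.split("."):
                if part.isdigit():
                    ids.add(int(part))
    return ids, names
```

Keeping both the numeric IDs and the full layer names lets later lookups match either a layer index or a fully qualified module path.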
Please add an e2e test in tests/e2e/multicard/2-cards/test_offline_inference_distributed.py.
What this PR does / why we need it?
Support reverting quantization layers from int8 to float for DeepSeek 3.2 C8.
Does this PR introduce any user-facing change?
No.
How was this patch tested?