[Bugfix] Fix DeepSeek 3.2 C8 precision by reverting quantization layers #7628
Yaphets24 wants to merge 18 commits into vllm-project:main
Conversation
Signed-off-by: mayumeng <m30059191@china.huawei.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the sparse C8 quantization mechanism by introducing layer-specific control. Previously, sparse C8 was likely a global setting; these changes allow more granular configuration, enabling specific attention layers to use sparse C8 based on detailed quantization descriptions. This refinement improves flexibility and potentially optimization for models using sparse C8.
Highlights
Code Review
This pull request introduces layer-wise sparse C8 quantization support for Ascend devices. It refactors the AscendConfig to manage sparse C8 layer identification, updates attention modules to use this layer-specific configuration, and modifies KV cache handling to correctly allocate and reshape tensors based on the sparse C8 status. A robustness improvement was noted in modelslim_config.py to ensure that layer IDs are valid digits before parsing.
```python
cache_sparse_c8_set = set(spec.cache_sparse_c8 for spec in specs)
assert len(cache_sparse_c8_set) == 1, (
    "All attention layers in the same KV cache group must use the same sparse C8 setting.")
```
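The uniformity check above can be illustrated with a minimal, self-contained sketch; the `FakeSpec` class and `check_group` helper below are hypothetical stand-ins for the real KV cache spec objects:

```python
from dataclasses import dataclass

@dataclass
class FakeSpec:
    # Stand-in for a KV cache spec carrying the sparse C8 flag.
    cache_sparse_c8: bool

def check_group(specs):
    # Collect the sparse C8 flags of every layer in the group; the
    # assertion fires if the group mixes C8 and non-C8 layers.
    flags = {spec.cache_sparse_c8 for spec in specs}
    assert len(flags) == 1, (
        "All attention layers in the same KV cache group must use "
        "the same sparse C8 setting.")
    return flags.pop()

print(check_group([FakeSpec(True), FakeSpec(True)]))  # True
```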
```python
if self.use_sparse:
    # for deepseek v3.2, we split the kv cache according to the corresponding ratio
    kv_cache_spec = layer_kv_cache_spec[layer_name]
    current_sparse_c8 = self._kv_cache_spec_uses_sparse_c8(kv_cache_spec)
```
Introducing current_sparse_c8 to conditionally set dsa_k_scale_tensor_split_factor is a critical correctness improvement. This ensures that the scale tensor split factor is only considered when sparse C8 is actively enabled for the current KV cache specification, preventing potential errors or incorrect memory allocation.
```diff
 if "attn" in layer_name_inner and "linear_attn" not in layer_name_inner:
-    if self.use_sparse:
-        if self.use_sparse_c8_indexer:
+    if current_sparse_c8:
```
Updating the condition to if current_sparse_c8: for assigning kv_cache_raw_tensors ensures that the 4-element tuple (including dsa_k_scale_tensor) is only used when sparse C8 is enabled. This prevents potential TypeError or ValueError if sparse_kv_cache_ratio[3] is None or if the tuple structure is not expected, which is a critical correctness fix.
```python
current_sparse_c8 = self._kv_cache_spec_uses_sparse_c8(current_kv_cache_spec)
if current_sparse_c8:
```
The introduction of current_sparse_c8 and its use in the conditional if current_sparse_c8: statement is critical. This ensures that raw_dsa_k_scale_tensor is only unpacked when sparse C8 is enabled for the current KV cache specification. Without this, an attempt to unpack a 3-element tuple as 4 elements would result in a ValueError, leading to a crash.
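The unpacking hazard described above can be shown with a small toy function; the names and tuple contents are made up for illustration, not the PR's actual code:

```python
def unpack_raw_tensors(raw, sparse_c8):
    # The raw tensor tuple carries a 4th element (the DSA k-scale
    # tensor) only when sparse C8 is enabled; unpacking a 3-element
    # tuple into 4 names would raise ValueError without this guard.
    if sparse_c8:
        k, v, dsa_k, dsa_k_scale = raw
        return k, v, dsa_k, dsa_k_scale
    k, v, dsa_k = raw
    return k, v, dsa_k, None

print(unpack_raw_tensors(("k", "v", "dsa_k"), sparse_c8=False))
# → ('k', 'v', 'dsa_k', None)
```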
```python
cache_dtype_str=cache_dtype_str_set.pop(),
cache_sparse_c8=cache_sparse_c8_set.pop(),
```
The cache_sparse_c8 parameter in the MLAAttentionSpec constructor is now correctly derived from the cache_sparse_c8_set. This ensures that the merged KV cache specification accurately reflects the sparse C8 configuration across all layers in the group, which is vital for the correct operation of the sparse C8 feature.
```python
layer_indexer_quant_type = quant_description.get(f"{prefix}.indexer.quant_type")
if layer_indexer_quant_type is not None:
    return layer_indexer_quant_type
```

```python
if _id.isdigit():
    self.indexer_quant_layers.append(int(_id))
```
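The `isdigit()` guard matters because key paths in a quantization description contain non-numeric segments. A self-contained sketch of that parsing pattern (the key layout and function name are assumptions, not the PR's exact code):

```python
def parse_indexer_quant_layers(quant_description):
    # Walk every dotted key (e.g. "model.layers.3.indexer") and keep
    # only the path components that are pure digits, so segments like
    # "model" or "indexer" never reach int() and raise ValueError.
    layers = []
    for key in quant_description:
        for _id in key.split("."):
            if _id.isdigit():
                layers.append(int(_id))
    return layers

print(parse_indexer_quant_layers({"model.layers.3.indexer": "C8"}))  # → [3]
```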
```diff
 if self.use_sparse:
     dsa_k_tensor_size = int(kv_cache_tensor.size // dsa_k_tensor_split_factor)
-if self.use_sparse_c8_indexer:
+if self.use_sparse and current_sparse_c8:
```
The condition for calculating dsa_k_scale_tensor_size has been updated to if self.use_sparse and current_sparse_c8:. This change correctly links the calculation to the current_sparse_c8 flag, ensuring that dsa_k_scale_tensor_size is only computed when sparse C8 is enabled and relevant, which is essential for accurate memory management.
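The guarded size computation can be sketched in isolation; the function name and the `split_factor` default are placeholders invented for this example:

```python
def compute_scale_tensor_size(tensor_size, use_sparse, current_sparse_c8,
                              split_factor=64):
    # The DSA k-scale tensor only exists when sparse C8 is active for
    # the current KV cache spec; otherwise no bytes are reserved.
    if use_sparse and current_sparse_c8:
        return int(tensor_size // split_factor)
    return 0

print(compute_scale_tensor_size(4096, True, True))   # → 64
print(compute_scale_tensor_size(4096, True, False))  # → 0
```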
```diff
     self.model_config.hf_text_config.index_head_dim,
 )
-if self.use_sparse_c8_indexer:
+if current_sparse_c8:
```
Changing the condition to if current_sparse_c8: ensures that dsa_k_cache and dsa_k_scale_cache are only initialized when sparse C8 is active for the current KV cache specification. This maintains consistency with the new sparse C8 logic and prevents unnecessary resource allocation or incorrect data handling.
```diff
 dtype=self.kv_cache_dtype,
 cache_dtype_str=self.vllm_config.cache_config.cache_dtype,
-cache_sparse_c8=self.use_sparse_c8_indexer,
+cache_sparse_c8=self._is_sparse_c8_layer(layer_name),
```
```python
return True
return False
```
vllm_ascend/ascend_config.py
Outdated
```python
def _extract_layer_ids(layer_name: str) -> set[int]:
    return {int(match) for match in re.findall(r"(?:^|\.)(\d+)(?:\.|$)", layer_name)}
```
Please use extract_layer_index instead.
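For context, the regex approach being replaced works like this self-contained sketch (a stand-in for the PR's helper, not vLLM's `extract_layer_index`):

```python
import re

def extract_layer_ids(layer_name: str) -> set[int]:
    # Capture digit runs that appear as whole dotted path segments,
    # so "model.layers.7.self_attn" yields {7} but "fc1" yields nothing.
    return {int(m) for m in re.findall(r"(?:^|\.)(\d+)(?:\.|$)", layer_name)}

print(extract_layer_ids("model.layers.7.self_attn.attn"))  # → {7}
```

Reusing an existing shared helper instead of a local regex keeps layer-name parsing consistent across the codebase, which is presumably the point of the review suggestion.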
```python
return layer_kv_cache_spec
```
```python
def _is_sparse_c8_layer(self, layer_name: str) -> bool:
    return bool(self.use_sparse and self.ascend_config.is_sparse_c8_layer(layer_name))
```
Just self.ascend_config.is_sparse_c8_layer(layer_name) is enough.
```python
self._sparse_c8_layer_ids, self._sparse_c8_layer_names = self._parse_sparse_c8_layers_from_quant_config(
    quant_config
)
self._sparse_c8_layer_filter_enabled = self._has_sparse_c8_layer_config(quant_config)
```
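A plausible shape for such a parser, sketched from scratch: the key layout (`….quant_type` entries) and the function name are assumptions made for illustration, not the PR's actual config format.

```python
def parse_sparse_c8_layers(quant_description):
    # Collect the IDs and prefixes of layers whose quant_type is "C8";
    # every other layer is treated as reverted to float.
    ids, names = set(), set()
    for key, qtype in quant_description.items():
        if not key.endswith(".quant_type"):
            continue
        if qtype == "C8":
            names.add(key.rsplit(".quant_type", 1)[0])
            for part in key.split("."):
                if part.isdigit():
                    ids.add(int(part))
    return ids, names
```

Keeping both the numeric IDs and the full layer names lets later lookups match either a layer index or a fully qualified module path.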
Please add an e2e test in tests/e2e/multicard/2-cards/test_offline_inference_distributed.py.
What this PR does / why we need it?
Support reverting quantization layers from int8 to float for DeepSeek 3.2 C8.
Does this PR introduce any user-facing change?
No.
How was this patch tested?