
Conversation

@SunnyLee151064 (Contributor) commented Aug 12, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

@gemini-code-assist (Contributor, bot) left a comment


Code Review

This pull request refactors the _forward_prefill and _forward_decode methods in AscendMLAImpl to simplify the logic by removing several conditional paths based on attention states. While the intent to simplify is good, the refactoring has introduced critical issues in the function signatures of both methods. Incorrect type hints and mismatched arguments will lead to runtime TypeErrors. These need to be fixed to ensure the code runs correctly.

Comment on lines 934 to 942
```python
self,
q_nope: torch.Tensor,
ql_nope: torch.Tensor,
q_pe: torch.Tensor,
k_nope: torch.Tensor,
k_pe: torch.Tensor,
kv_c_and_k_pe_cache: Tuple[torch.Tensor],
block_size: int,
attn_metadata: AscendMLAMetadata,
enable_multistream_mla: bool = False,
) -> torch.Tensor:
```

critical

The signature of _forward_decode was changed to take block_size: int instead of kv_c_and_k_pe_cache. However, the call sites in the forward method are not updated and still pass kv_cache (a tuple of tensors), which will cause a TypeError. To fix this, you should revert the signature to accept kv_c_and_k_pe_cache and derive block_size inside this function, as it was done previously (block_size = kv_c_and_k_pe_cache[0].shape[1]).

Suggested change:

```diff
 self,
-q_nope: torch.Tensor,
 ql_nope: torch.Tensor,
 q_pe: torch.Tensor,
 k_nope: torch.Tensor,
 k_pe: torch.Tensor,
-kv_c_and_k_pe_cache: Tuple[torch.Tensor],
-block_size: int,
+kv_c_and_k_pe_cache: Tuple[torch.Tensor, ...],
 attn_metadata: AscendMLAMetadata,
 enable_multistream_mla: bool = False,
 ) -> torch.Tensor:
```
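For reference, a minimal sketch (not code from the PR) of how `_forward_decode` could derive `block_size` internally after reverting to the cache-tuple parameter, as the comment suggests. Everything beyond the signature and the `block_size` line is elided or assumed; `AscendMLAMetadata` and the class context come from the module under review.

```python
from typing import Tuple

import torch


# Sketch of a method on AscendMLAImpl; AscendMLAMetadata is only
# referenced by name here and is defined in the module under review.
def _forward_decode(
    self,
    ql_nope: torch.Tensor,
    q_pe: torch.Tensor,
    k_nope: torch.Tensor,
    k_pe: torch.Tensor,
    kv_c_and_k_pe_cache: Tuple[torch.Tensor, ...],
    attn_metadata: "AscendMLAMetadata",
    enable_multistream_mla: bool = False,
) -> torch.Tensor:
    # Accept the cache tuple again and derive the block size from the
    # first cache tensor, as the previous implementation did. This keeps
    # the existing call sites in forward (which pass kv_cache) working.
    block_size = kv_c_and_k_pe_cache[0].shape[1]
    # ... remainder of the decode path (unchanged) would use block_size here ...
    raise NotImplementedError("body elided in this sketch")
```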


👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.


This pull request has conflicts; please resolve them before we can evaluate it.
