[Feat] Make full graph mode compatible with MTP #3276
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request aims to make the Full Graph mode compatible with MTP (Multi-Token Prediction). The changes involve enabling graph support for the MLA attention backend, adding logic to handle rotary embeddings in graph mode, and updating the model runner to prepare inputs and execute the model correctly for MTP in graph mode.
I've found two critical bugs. One is in vllm_ascend/attention/mla_v1.py, where local variables for rotary embeddings are not handled correctly, leading to a potential UnboundLocalError and incorrect behavior in graph mode. The other is a typo in vllm_ascend/worker/model_runner_v1.py that prevents the correct attention state from being set for MTP graph capture. Both issues need to be addressed to ensure the feature works as intended.
    # TODO: After the fullgraph supports MTP, the if branch needs to deleted
    assert self.cos_cache is not None
    assert self.sin_cache is not None
    if cos is None and sin is not None:
        cos = self.cos_cache[
            input_positions].unsqueeze(  # type: ignore
                1).unsqueeze(2)
        sin = self.sin_cache[
            input_positions].unsqueeze(  # type: ignore
                1).unsqueeze(2)

        decode_metadata = AscendMLADecodeMetadata(
            input_positions=input_positions,
            block_table=block_table,
            seq_lens=seq_lens,
            seq_lens_list=seq_lens_list,
            max_seq_lens=max_seq_lens,
            attn_mask=common_attn_metadata.spec_attn_mask,
            actual_seq_lengths_q=actual_seq_lengths_q,
            sin=sin,
            cos=cos)
    else:
        cos[:num_decode_tokens,
            ...] = self.cos_cache[input_positions].unsqueeze(
                1).unsqueeze(2)
        sin[:num_decode_tokens,
            ...] = self.sin_cache[input_positions].unsqueeze(
                1).unsqueeze(2)

        decode_metadata = AscendMLADecodeMetadata(
            input_positions=input_positions,
            block_table=block_table,
            seq_lens=seq_lens,
            seq_lens_list=seq_lens_list,
            max_seq_lens=max_seq_lens,
            attn_mask=common_attn_metadata.spec_attn_mask,
            actual_seq_lengths_q=actual_seq_lengths_q,
            sin=sin[:num_decode_tokens, ...],
            cos=cos[:num_decode_tokens, ...])
This block has a critical bug. The local variables cos and sin are not guaranteed to be initialized, which will lead to an UnboundLocalError if num_prefills is 0. Additionally, the condition if cos is None and sin is not None: seems incorrect. It appears the intention is to use the pre-allocated cos and sin tensors from common_attn_metadata for graph mode, but they are not being used. This will cause a failure during graph capture for decode-only batches.
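The failure mode flagged above can be reproduced in isolation. Below is a minimal, hypothetical sketch (the function names and scalar stand-ins are illustrative, not the actual vLLM Ascend code): a variable bound only inside a conditional branch raises UnboundLocalError when that branch is skipped, and accepting the pre-allocated values as parameters avoids it.

```python
# Minimal illustration of the UnboundLocalError pattern described in this
# review. Names and values are stand-ins, not the real vLLM Ascend code.

def build_metadata_buggy(num_prefills: int):
    if num_prefills > 0:
        cos, sin = 1.0, 0.0  # stand-ins for the rotary embedding tensors
    # Bug: when num_prefills == 0, cos and sin were never bound,
    # so reading them below raises UnboundLocalError.
    return cos, sin

def build_metadata_fixed(num_prefills: int, cos=None, sin=None):
    # Accept pre-allocated values (as graph mode would pass in);
    # otherwise fall back to computing them locally.
    if cos is None or sin is None:
        cos, sin = 1.0, 0.0
    return cos, sin

try:
    build_metadata_buggy(0)
except UnboundLocalError as exc:
    print("buggy path:", type(exc).__name__)

print("fixed path:", build_metadata_fixed(0))
```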
    attn_state = AscendAttentionState.DecodeOnly
    if self.speculative_config and \
            self.speculative_config.method == "deepseek_mtp":
        attn_states = AscendAttentionState.SpecDecoding

    for attn_group in self.attn_groups[kv_cache_group_id]:
        if vllm_version_is("0.10.2"):
            builder = attn_group.metadata_builder
        else:
            builder = attn_group.get_metadata_builder()
        attn_metadata_i = builder.build_for_graph_capture(
            common_attn_metadata,
            self.get_model())
There's a typo here. attn_states is assigned but never used. It should be attn_state. Because of this, attn_state is not updated to AscendAttentionState.SpecDecoding for MTP, and the default DecodeOnly is used when calling build_for_graph_capture. This will cause incorrect behavior in MTP graph mode.
attn_state = AscendAttentionState.DecodeOnly
if self.speculative_config and \
self.speculative_config.method == "deepseek_mtp":
attn_state = AscendAttentionState.SpecDecoding
for attn_group in self.attn_groups[kv_cache_group_id]:
if vllm_version_is("0.10.2"):
builder = attn_group.metadata_builder
else:
builder = attn_group.get_metadata_builder()
attn_metadata_i = builder.build_for_graph_capture(
common_attn_metadata,
attn_state,
self.get_model())
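This class of bug is easy to demonstrate standalone. Below is a minimal, hypothetical sketch (AttnState is a stand-in for AscendAttentionState, and the function names are illustrative): assigning to a misspelled name in Python silently creates a new local variable rather than updating the intended one, so the default value leaks through.

```python
# Illustration of the typo bug discussed above: attn_states vs. attn_state.
# The enum and function names are stand-ins, not the real vLLM Ascend code.
from enum import Enum, auto

class AttnState(Enum):  # stand-in for AscendAttentionState
    DecodeOnly = auto()
    SpecDecoding = auto()

def pick_state_buggy(is_mtp: bool) -> AttnState:
    attn_state = AttnState.DecodeOnly
    if is_mtp:
        attn_states = AttnState.SpecDecoding  # typo: new variable, never read
    return attn_state  # always DecodeOnly

def pick_state_fixed(is_mtp: bool) -> AttnState:
    attn_state = AttnState.DecodeOnly
    if is_mtp:
        attn_state = AttnState.SpecDecoding
    return attn_state

print(pick_state_buggy(True))   # AttnState.DecodeOnly -- the bug
print(pick_state_fixed(True))   # AttnState.SpecDecoding
```

No exception is raised in the buggy version, which is why this kind of typo survives until runtime behavior is inspected.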
vllm_ascend/compilation/acl_graph.py (Outdated)
    ):
        (q_nope, k_nope, q_pe, k_pe, num_heads, num_kv_heads, input_layout,
         spec_attn_mask, sparse_mode, scale, block_table, block_size,
         seq_lens_list, actual_seq_lengths, worlspace, attn_output,
Typo here, worlspace.
        block_size=block_size,
        actual_seq_lengths_kv=seq_lens_list,
        actual_seq_lengths=actual_seq_lengths,
        workspace=workspace,
Should retrieve workspace from graph params.
Signed-off-by: anon189Ty <[email protected]>
What this PR does / why we need it?
Make Full Graph mode able to run with MTP.
Does this PR introduce any user-facing change?
How was this patch tested?