
[RFC]: Support MTP running in full_decode_only graph mode. #4300

@zouyida2052


Motivation.

Currently, vllm-ascend supports MTP>1 (more than one speculative token) in piecewise graph mode. However, MTP cannot yet run in full_decode_only graph mode, which results in significant pipeline bubbles between NPUs and considerable wasted compute.

Enabling MTP in full_decode_only mode would allow each decoding round to be submitted as a whole graph in a single pass. This reduces the frequent operator dispatch from the CPU side, shrinks the bubbles between operators, improves NPU utilization, and better meets practical deployment requirements.

Proposed Change.

Add _mtp_graph_params in acl_graph.py to isolate the main model's graph data from the MTP model's graph data.
Pad some metadata in mla_v1.py when running in full-graph mode.
Keep the addresses of the essential data used in model.forward fixed.
Adapt to the aclgraph capture framework:
1). Rebuild the MTP model with ACLGraphWrapper.
2). Add common attn metadata when capture starts in the MTP dummy_run.
3). Add a common attn metadata update in MTP.
4). Adapt the data update for num_speculative_tokens > 1.
Add an MTP patch to adapt to vllm v0.11.0.
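
To illustrate the first three items, here is a minimal pure-Python sketch of the general idea, not the vllm-ascend implementation. It assumes that a captured graph keeps referring to the buffers it saw at capture time, so later inputs must be written into those same buffers in place and padded up to the captured size. The `GraphParams` class, its methods, and the sequence-length values are hypothetical; only the idea of keeping a separate parameter store for MTP (as `_mtp_graph_params` does) comes from the proposal.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GraphParams:
    """Hypothetical per-model store of graph input buffers, keyed by captured size."""

    buffers: dict[int, list[int]] = field(default_factory=dict)

    def allocate(self, capture_size: int) -> list[int]:
        # Allocate once at capture time; the captured graph keeps using this object.
        return self.buffers.setdefault(capture_size, [0] * capture_size)

    def update(self, capture_size: int, seq_lens: list[int], pad_value: int = 0) -> list[int]:
        buf = self.buffers[capture_size]
        if len(seq_lens) > capture_size:
            raise ValueError("batch is larger than the captured graph size")
        # Write in place so the buffer identity (its "address") never changes,
        # then pad the unused tail so the shape matches what was captured.
        n = len(seq_lens)
        buf[:n] = seq_lens
        buf[n:] = [pad_value] * (capture_size - n)
        return buf


# Separate stores keep main-model data and MTP data from overwriting each other.
main_graph_params = GraphParams()
mtp_graph_params = GraphParams()

main_buf = main_graph_params.allocate(4)
mtp_buf = mtp_graph_params.allocate(4)

main_out = main_graph_params.update(4, [7, 9])     # 2 real requests, padded to 4
mtp_out = mtp_graph_params.update(4, [8, 10, 3])   # 3 real requests, padded to 4
```

In this sketch each update returns the same list object that was allocated at capture time, which stands in for a device tensor whose address must stay stable across graph replays.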

Feedback Period.

No response

CC List.

No response

Any Other Things.

Aclgraph full_decode_only support
#3892
