
Commit b72e332

bugfix for mtp>1 (#3174)
### What this PR does / why we need it?
Fix bugs when mtp > 1, and reorder the input batch when MTP draft tokens are not accepted.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By CI.

- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@52d0cb8

---------

Signed-off-by: zouyida2052 <[email protected]>
1 parent 69509bc commit b72e332

2 files changed (+3, -1)

vllm_ascend/attention/mla_v1.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -202,6 +202,8 @@ def __init__(self,
                 npu_fused_infer_attention_score TND layout's limit of 16, \
                 got {self.decode_threshold}"

+        self.reorder_batch_threshold = self.decode_threshold
+
         if self.chunked_prefill_enabled:
             self.chunked_prefill_workspace_size = min(
                 # Max sure there is enough for 8 full length request or at least
```
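For context, a minimal sketch of how a reorder threshold like this is typically applied: requests that schedule at most `reorder_batch_threshold` tokens this step are treated as decodes and moved ahead of prefills, so tying the threshold to `decode_threshold` (presumably one query token plus the speculative tokens when mtp > 1) keeps multi-token MTP decodes in the decode group. The `Request` dataclass and `reorder_batch` helper below are hypothetical illustrations, not the vllm-ascend implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    # Hypothetical stand-in for a scheduled request: an id plus the number of
    # query tokens scheduled for it this step (a handful for an MTP decode,
    # many for a prefill).
    req_id: str
    num_scheduled_tokens: int


def reorder_batch(requests: List[Request], threshold: int) -> List[Request]:
    """Place decode-like requests (<= threshold scheduled tokens) before prefills."""
    decodes = [r for r in requests if r.num_scheduled_tokens <= threshold]
    prefills = [r for r in requests if r.num_scheduled_tokens > threshold]
    return decodes + prefills


# With 2 speculative tokens the decode threshold would be 3, so a decode
# request carrying draft tokens is still grouped with the decodes.
batch = [Request("prefill-0", 512), Request("decode-0", 3), Request("decode-1", 1)]
print([r.req_id for r in reorder_batch(batch, threshold=3)])
# -> ['decode-0', 'decode-1', 'prefill-0']
```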

vllm_ascend/spec_decode/mtp_proposer.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -555,7 +555,7 @@ def _propose(
         # copy inputs to buffer for cudagraph
         self.input_ids[:batch_size] = input_ids
         self.positions[:batch_size] = clamped_positions
-        self.hidden_states[:batch_size] = hidden_states
+        self.hidden_states[:hidden_states.shape[0]] = hidden_states
         attn_metadata_i.slot_mapping[:batch_size] = slot_mapping

         if attn_metadata_i.prefill is not None:
```
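A minimal sketch of the shape issue the one-line change above guards against, assuming (as the fix suggests) that with mtp > 1 the `hidden_states` passed into `_propose` can have a row count different from `batch_size`. The buffer names and sizes below are illustrative only, not the proposer's actual state.

```python
import torch

# Illustrative persistent buffers, sized for the padded CUDA-graph batch.
max_num_tokens, hidden_size = 8, 4
input_ids_buf = torch.zeros(max_num_tokens, dtype=torch.long)
hidden_states_buf = torch.zeros(max_num_tokens, hidden_size)

# With mtp > 1 the flat token batch and the hidden states handed to the
# drafter need not have the same length.
batch_size = 3                                # number of input token ids this step
hidden_states = torch.randn(5, hidden_size)   # a different number of rows

input_ids_buf[:batch_size] = torch.tensor([11, 12, 13])

# Buggy copy: the slice length comes from batch_size rather than from the
# tensor being copied, so it fails with a shape mismatch when they differ.
#   hidden_states_buf[:batch_size] = hidden_states   # RuntimeError
# Fixed copy, as in the patch: slice by the tensor's own row count.
hidden_states_buf[:hidden_states.shape[0]] = hidden_states
```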
