Commit a5ca6a5
[0.9.1][BUGFIX] FIX FIA input when mtp is enabled in pd Disaggregation scenario (#2509)
### What this PR does / why we need it?
This bug is triggered when a decode node receives more than 16 requests at a time from
the prefill node, since
`torch_npu.npu_fused_infer_attention_score` can only accept 16 query sequence
lengths in one batch.
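The 16-sequence limit above means an oversized decode batch has to be processed in chunks. A minimal illustrative sketch (a hypothetical helper, not the actual patch in this PR) of splitting incoming requests into sub-batches that respect the kernel's limit:

```python
def split_batch(request_ids, max_seqs_per_call=16):
    """Split a list of request ids into sub-batches no larger than the
    16-sequence limit of torch_npu.npu_fused_infer_attention_score."""
    return [request_ids[i:i + max_seqs_per_call]
            for i in range(0, len(request_ids), max_seqs_per_call)]

# e.g. 40 requests arriving at once from the prefill node
chunks = split_batch(list(range(40)))
```

Each sub-batch can then be fed to the fused attention kernel separately; the real fix in this PR lives in the attention input preparation path.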
### How was this patch tested?
4P1D:
P:
```
vllm serve /mnt/nfs/levis/DeepSeek-R1_w8a8_vllm \
--host 0.0.0.0 \
--port 20002 \
--data-parallel-size 2 \
--data-parallel-size-local 2 \
--data-parallel-address 141.61.39.149 \
--data-parallel-rpc-port 13348 \
--tensor-parallel-size 8 \
--max-num-seqs 512 \
--seed 1024 \
--served-model-name ds_r1 \
--max-model-len 17000 \
--max-num-batched-tokens 16384 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--enable-expert-parallel \
--enforce-eager \
--kv-transfer-config \
'{"kv_connector": "LLMDataDistCMgrConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_producer",
"kv_parallel_size": 1,
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}' \
--additional-config \
'{"ascend_scheduler_config":{"enabled":false}, "torchair_graph_config":{"enabled":false,"enable_multistream_shared_expert":false},"chunked_prefill_for_mla":true,"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'
```
D:
```
vllm serve /mnt/nfs/levis/DeepSeek-R1_w8a8_vllm \
--host 0.0.0.0 \
--port 20002 \
--data-parallel-size 64 \
--data-parallel-size-local 16 \
--data-parallel-address 141.61.39.165 \
--data-parallel-rpc-port 13348 \
--tensor-parallel-size 1 \
--seed 1024 \
--served-model-name ds_r1 \
--max-model-len 17000 \
--max-num-batched-tokens 256 \
--max-num-seqs 28 \
--quantization ascend \
--trust-remote-code \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--gpu-memory-utilization 0.9 \
--enable-expert-parallel \
--kv-transfer-config \
'{"kv_connector": "LLMDataDistCMgrConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_consumer",
"kv_parallel_size": 1,
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}' \
--additional-config \
'{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":true,"enable_multistream_mla":true,"enable_multistream_moe":true,"graph_batch_sizes":[28], "enable_super_kernel":true, "use_cached_graph":true},"chunked_prefill_for_mla":true,"enable_weight_nz_layout":true}'
```
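Since the inline `--additional-config` and `--kv-transfer-config` arguments are parsed as JSON, a quick sanity check that the decode-side config above is well-formed can be done in Python (illustrative only; vLLM performs its own validation):

```python
import json

# The decode node's --additional-config string from the command above.
additional_config = json.loads(
    '{"ascend_scheduler_config":{"enabled":false},'
    '"torchair_graph_config":{"enabled":true,"enable_multistream_mla":true,'
    '"enable_multistream_moe":true,"graph_batch_sizes":[28],'
    '"enable_super_kernel":true,"use_cached_graph":true},'
    '"chunked_prefill_for_mla":true,"enable_weight_nz_layout":true}'
)

# Note the graph batch size matches --max-num-seqs 28 on the decode node.
print(additional_config["torchair_graph_config"]["graph_batch_sizes"])
```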
Signed-off-by: xuyexiong <[email protected]>

1 parent: 8aadcb7
1 file changed: +27, -2 lines