
[Usage]: DeepSeek-R1-0528 + A2 servers + PD disaggregation: large performance gap compared with MindIE #4295

@zhanghw0354

Description


Your current environment

npu-smi info

(screenshot of npu-smi info output)

cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info

(screenshot of ascend_toolkit_install.info)

Problem

On 8 A2 servers we run the deepseek-r1-0528 w8a8 model with vllm-ascend (0.11.0-dev branch, code up to commit id 7cc6208029743309f497603aab1d1fde045f7f3d) in a PD-disaggregated setup (4 prefill nodes, 4 decode nodes). In a benchmark with 7K input tokens, 2K output tokens, 8 concurrent requests and 40 requests in total, TTFT is 3.68 s (sometimes even 4-5 s) and TPOT is 0.0422 s, while MindIE in the same scenario reaches a TTFT of 1.83 s and a TPOT of 0.031 s. We would like guidance on a reasonable deployment and parameter configuration that brings performance on par with MindIE.
Test screenshots:
vllm-ascend performance test: (screenshot)
MindIE performance test: (screenshot)
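For reference, the workload could be driven roughly as follows, assuming vLLM's benchmarks/benchmark_serving.py with the random dataset (endpoint, served model name and tokenizer path are taken from the launch scripts below; substitute the address of whatever front end / proxy actually receives the requests):

# Hypothetical benchmark invocation for the 7K-in / 2K-out, 8-concurrency, 40-request run
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --host 172.20.113.109 --port 8004 \
  --model deepseek_r1 \
  --tokenizer /workspace/models/deepseek-r1-0528 \
  --dataset-name random \
  --random-input-len 7000 \
  --random-output-len 2000 \
  --num-prompts 40 \
  --max-concurrency 8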

Deployment

The overall deployment follows the Distributed DP Server With Large Scale Expert Parallelism (Deepseek) guide: Prefill is deployed as DP4 + TP8 + EP32 and Decode as DP32 + TP1 + EP32.
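For clarity, the rank layout implied by these settings (assuming 8 NPUs per A2 node):

# Prefill: data-parallel-size 4 x tensor-parallel-size 8 = 32 ranks = 4 nodes x 8 NPUs; expert parallelism spans all 32 ranks (EP32)
# Decode:  data-parallel-size 32 x tensor-parallel-size 1 = 32 ranks = 4 nodes x 8 NPUs (data-parallel-size-local 8 per node); EP32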
The detailed launch scripts are below.
Prefill node 0. VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED is not set to enable DP_LB because, from reading the code, this environment variable no longer exists on 0.11.0-dev:

#!/bin/sh
# obtained via ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="eth0"
local_ip="172.20.113.109"

#export VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP=1
#export VLLM_WORKER_MULTIPROC_METHOD="fork"
#export VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED=1
export VLLM_TORCH_PROFILER_DIR=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/profile
#export VLLM_LOGGING_LEVEL=DEBUG

export VLLM_USE_MODELSCOPE=True
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE="AIV"

# MC2 hierarchical communication
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0

export DISAGGREGATED_PREFILL_RANK_TABLE_PATH="/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json"
# The w8a8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-W8A8
# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve /workspace/models/deepseek-r1-0528 \
--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 4 \
--data-parallel-size-local 1 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--seed 1024 \
--served-model-name deepseek_r1 \
--enable-expert-parallel \
--max-num-seqs 2 \
--max-model-len 17000 \
--quantization ascend \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--enforce-eager \
--max-num-batched-tokens 16384 \
--no-enable-prefix-caching \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
 '{"kv_connector": "LLMDataDistCMgrConnector",
 "kv_buffer_device": "npu",
 "kv_role": "kv_producer",
 "kv_parallel_size": "1",
 "kv_port": "20001",
 "engine_id": "0",
 "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
 }' \
--additional-config '{"enable_shared_expert_dp":true}' \

The scripts for the other Prefill nodes are not repeated here.
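They differ only in a few flags. A sketch for Prefill node 1, assuming the usual vLLM multi-node data-parallel pattern with --headless and --data-parallel-start-rank (please verify these flags against the 0.11.0-dev CLI):

#!/bin/sh
# Same environment variables as Prefill node 0, with local_ip set to this node's own address.
# --data-parallel-address still points at Prefill node 0; only the start rank changes per node.
vllm serve /workspace/models/deepseek-r1-0528 \
--headless \
--data-parallel-size 4 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address 172.20.113.109 \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--enable-expert-parallel
# remaining flags (quantization, speculative-config, kv-transfer-config, ...) identical to Prefill node 0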
Decode node 0:

#!/bin/sh
# obtained via ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="eth0"
local_ip="172.20.210.108"

export VLLM_TORCH_PROFILER_DIR=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/profile
#export VLLM_LOGGING_LEVEL=DEBUG


export VLLM_USE_MODELSCOPE=True
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export HCCL_CONNECT_TIMEOUT=7200

export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

export DISAGGREGATED_PREFILL_RANK_TABLE_PATH="/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json"


# MC2 hierarchical communication on the A2 machines
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0

# The w8a8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-W8A8
# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve /workspace/models/deepseek-r1-0528 \
--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 32 \
--data-parallel-size-local 8 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 5964 \
--tensor-parallel-size 1 \
--seed 1024 \
--served-model-name deepseek_r1 \
--enable-expert-parallel \
--max-num-seqs 4 \
--max-model-len 17000  \
--quantization ascend \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--max-num-batched-tokens 256 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
 '{"kv_connector": "LLMDataDistCMgrConnector",
 "kv_buffer_device": "npu",
 "kv_role": "kv_consumer",
 "kv_parallel_size": 1,
 "kv_port": "20001",
 "engine_id": "0",
 "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
 }' \
--additional-config '{"torchair_graph_config":{"enabled":true,"enable_multistream_mla":true,"graph_batch_sizes":[4], "use_cached_graph":true},"multistream_overlap_shared_expert":true}'

The scripts for the other Decode nodes are not repeated here.
We have also tried other deployments and parameter settings, e.g. splitting the 4 prefill nodes into two groups each deployed as DP2 + TP8 + EP16, using aclgraph for Prefill, and using Mooncake for KV cache transfer. None of them improved performance, and some settings actually made it worse.
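For completeness, a quick smoke test of the deployment through the OpenAI-compatible API (addresses and served model name taken from the scripts above; point it at whichever node or proxy actually serves requests):

curl -s http://172.20.113.109:8004/v1/models
curl -s http://172.20.113.109:8004/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek_r1", "prompt": "Hello", "max_tokens": 16}'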

Profile analysis

We collected profiles and tried to analyze one NPU profile file from a Prefill node.
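The profiles were presumably collected with vLLM's built-in torch profiler (VLLM_TORCH_PROFILER_DIR is set in the launch scripts above); a sketch of the collection flow, assuming the standard /start_profile and /stop_profile endpoints:

# start profiling on the instance, run the benchmark traffic, then stop profiling
curl -X POST http://172.20.113.109:8004/start_profile
# ... send benchmark requests ...
curl -X POST http://172.20.113.109:8004/stop_profile
# traces end up in $VLLM_TORCH_PROFILER_DIR on each node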
After the worker's execute_model there are 20+ execute_dummy_batch calls.

(profile screenshots)

Within the worker's execute_model, in quite a few layers a tensor-to-CPU copy accounts for a noticeably large share of the time:

(profile screenshot of the tensor-to-CPU operation)

This corresponds to the following code in the torchair_fused_experts_with_all2all method of vllm_ascend/torchair/quantization/torchair_w8a8_dynamic.py:

(screenshot of the code in question)

The profile analysis above may not be correct and is for reference only; again, we hope to get a reasonable deployment and parameter configuration that matches MindIE's performance.
If any other information is needed, I will provide it as soon as possible. Thanks!
