
[Usage]: DeepSeek-R1-0528 + A2 servers + PD disaggregation: large performance gap compared with MindIE #4295

@zhanghw0354

Description


Your current environment

npu-smi info

(screenshot of npu-smi info output)

cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info

(screenshot of ascend_toolkit_install.info)

Problem

On 8 A2 servers we run the deepseek-r1-0528 w8a8 model with vllm-ascend (0.11.0-dev branch, code up to commit id 7cc6208029743309f497603aab1d1fde045f7f3d) in a PD-disaggregated setup (4 prefill nodes, 4 decode nodes). In a benchmark with 7K input tokens, 2K output tokens, 8 concurrent requests and 40 requests in total, TTFT is 3.68 s (sometimes even 4-5 s) and TPOT is 0.0422 s, while MindIE in the same scenario reaches a TTFT of 1.83 s and a TPOT of 0.031 s. We would like guidance on a reasonable deployment and parameter configuration that brings performance on par with MindIE.
Test screenshots:
vllm-ascend performance test: (screenshot)
MindIE performance test: (screenshot)
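For reference, the workload could be driven roughly as follows, assuming vLLM's benchmarks/benchmark_serving.py with the random dataset (endpoint, served model name and tokenizer path are taken from the launch scripts below; substitute the address of whatever front end / proxy actually receives the requests):

# Hypothetical benchmark invocation for the 7K-in / 2K-out, 8-concurrency, 40-request run
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --host 172.20.113.109 --port 8004 \
  --model deepseek_r1 \
  --tokenizer /workspace/models/deepseek-r1-0528 \
  --dataset-name random \
  --random-input-len 7000 \
  --random-output-len 2000 \
  --num-prompts 40 \
  --max-concurrency 8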

Deployment

The overall deployment follows the Distributed DP Server With Large Scale Expert Parallelism (Deepseek) guide: Prefill is deployed as DP4 + TP8 + EP32 and Decode as DP32 + TP1 + EP32.
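For clarity, the rank layout implied by these settings (assuming 8 NPUs per A2 node):

# Prefill: data-parallel-size 4 x tensor-parallel-size 8 = 32 ranks = 4 nodes x 8 NPUs; expert parallelism spans all 32 ranks (EP32)
# Decode:  data-parallel-size 32 x tensor-parallel-size 1 = 32 ranks = 4 nodes x 8 NPUs (data-parallel-size-local 8 per node); EP32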
The detailed launch scripts are below.
Prefill node 0. VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED is not set to enable DP_LB because, from reading the code, this environment variable no longer exists on 0.11.0-dev:

#!/bin/sh
# obtained via ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="eth0"
local_ip="172.20.113.109"

#export VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP=1
#export VLLM_WORKER_MULTIPROC_METHOD="fork"
#export VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED=1
export VLLM_TORCH_PROFILER_DIR=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/profile
#export VLLM_LOGGING_LEVEL=DEBUG

export VLLM_USE_MODELSCOPE=True
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE="AIV"

# MC2 hierarchical communication
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0

export DISAGGREGATED_PREFILL_RANK_TABLE_PATH="/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json"
# The w8a8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-W8A8
# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve /workspace/models/deepseek-r1-0528 \
--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 4 \
--data-parallel-size-local 1 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--seed 1024 \
--served-model-name deepseek_r1 \
--enable-expert-parallel \
--max-num-seqs 2 \
--max-model-len 17000 \
--quantization ascend \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--enforce-eager \
--max-num-batched-tokens 16384 \
--no-enable-prefix-caching \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
 '{"kv_connector": "LLMDataDistCMgrConnector",
 "kv_buffer_device": "npu",
 "kv_role": "kv_producer",
 "kv_parallel_size": "1",
 "kv_port": "20001",
 "engine_id": "0",
 "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
 }' \
--additional-config '{"enable_shared_expert_dp":true}' \

The scripts for the other Prefill nodes are not repeated here.
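They differ only in a few flags. A sketch for Prefill node 1, assuming the usual vLLM multi-node data-parallel pattern with --headless and --data-parallel-start-rank (please verify these flags against the 0.11.0-dev CLI):

#!/bin/sh
# Same environment variables as Prefill node 0, with local_ip set to this node's own address.
# --data-parallel-address still points at Prefill node 0; only the start rank changes per node.
vllm serve /workspace/models/deepseek-r1-0528 \
--headless \
--data-parallel-size 4 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address 172.20.113.109 \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--enable-expert-parallel
# remaining flags (quantization, speculative-config, kv-transfer-config, ...) identical to Prefill node 0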
Decode node 0:

#!/bin/sh
# obtained via ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="eth0"
local_ip="172.20.210.108"

export VLLM_TORCH_PROFILER_DIR=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/profile
#export VLLM_LOGGING_LEVEL=DEBUG


export VLLM_USE_MODELSCOPE=True
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export HCCL_CONNECT_TIMEOUT=7200

export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

export DISAGGREGATED_PREFILL_RANK_TABLE_PATH="/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json"


# MC2 hierarchical communication on the A2 machines
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0

# The w8a8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-W8A8
# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve /workspace/models/deepseek-r1-0528 \
--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 32 \
--data-parallel-size-local 8 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 5964 \
--tensor-parallel-size 1 \
--seed 1024 \
--served-model-name deepseek_r1 \
--enable-expert-parallel \
--max-num-seqs 4 \
--max-model-len 17000  \
--quantization ascend \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--max-num-batched-tokens 256 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
 '{"kv_connector": "LLMDataDistCMgrConnector",
 "kv_buffer_device": "npu",
 "kv_role": "kv_consumer",
 "kv_parallel_size": 1,
 "kv_port": "20001",
 "engine_id": "0",
 "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
 }' \
--additional-config '{"torchair_graph_config":{"enabled":true,"enable_multistream_mla":true,"graph_batch_sizes":[4], "use_cached_graph":true},"multistream_overlap_shared_expert":true}'

The scripts for the other Decode nodes are not repeated here.
We have also tried other deployments and parameter settings, e.g. splitting the 4 prefill nodes into two groups each deployed as DP2 + TP8 + EP16, using aclgraph for Prefill, and using Mooncake for KV cache transfer. None of them improved performance, and some settings actually made it worse.
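For completeness, a quick smoke test of the deployment through the OpenAI-compatible API (addresses and served model name taken from the scripts above; point it at whichever node or proxy actually serves requests):

curl -s http://172.20.113.109:8004/v1/models
curl -s http://172.20.113.109:8004/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek_r1", "prompt": "Hello", "max_tokens": 16}'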

Profile analysis

We collected profiles and tried to analyze one NPU profile file from a Prefill node.
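The profiles were presumably collected with vLLM's built-in torch profiler (VLLM_TORCH_PROFILER_DIR is set in the launch scripts above); a sketch of the collection flow, assuming the standard /start_profile and /stop_profile endpoints:

# start profiling on the instance, run the benchmark traffic, then stop profiling
curl -X POST http://172.20.113.109:8004/start_profile
# ... send benchmark requests ...
curl -X POST http://172.20.113.109:8004/stop_profile
# traces end up in $VLLM_TORCH_PROFILER_DIR on each node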
After the worker's execute_model there are 20+ execute_dummy_batch calls.

(profile screenshots)

Within the worker's execute_model, in quite a few layers a tensor-to-CPU copy accounts for a noticeably large share of the time:

(profile screenshot of the tensor-to-CPU operation)

This corresponds to the following code in the torchair_fused_experts_with_all2all method of vllm_ascend/torchair/quantization/torchair_w8a8_dynamic.py:

(screenshot of the code in question)

The profile analysis above may not be correct and is for reference only; again, we hope to get a reasonable deployment and parameter configuration that matches MindIE's performance.
If any other information is needed, I will provide it as soon as possible. Thanks!
