Description
Your current environment
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
Problem
On 8 A2 servers we run the deepseek-r1-0528 w8a8 model with vllm-ascend (0.11.0-dev branch, code up to commit 7cc6208029743309f497603aab1d1fde045f7f3d) in a disaggregated prefill/decode (PD) setup, with 4 servers for prefill and 4 for decode. In a performance test with 7K-token inputs, 2K-token outputs, 8-way concurrency and 40 requests in total, TTFT is 3.68 s (sometimes even 4-5 s) and TPOT is 0.0422 s, whereas MindIE in the same scenario reaches a TTFT of 1.83 s and a TPOT of 0.031 s. We would like a reasonable deployment and parameter configuration that matches MindIE's performance.
Test screenshots are below:
vllm-ascend performance test:
MindIE performance test:
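For reference, the workload can be reproduced with vLLM's bundled serving benchmark roughly as follows (a sketch, not taken from the original report; PROXY_URL is whatever endpoint fronts the disaggregated P/D groups in your setup and is an assumption here):
# Sketch: reproduce the 7K-in / 2K-out, 8-concurrency, 40-request workload
# with vLLM's benchmark_serving.py. Adjust PROXY_URL to your own frontend.
PROXY_URL="http://127.0.0.1:8004"
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --base-url "$PROXY_URL" \
  --model deepseek_r1 \
  --tokenizer /workspace/models/deepseek-r1-0528 \
  --dataset-name random \
  --random-input-len 7000 \
  --random-output-len 2000 \
  --num-prompts 40 \
  --max-concurrency 8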
Deployment
The deployment broadly follows the Distributed DP Server With Large Scale Expert Parallelism (DeepSeek) guide: prefill is deployed as DP4 + TP8 + EP32 and decode as DP32 + TP1 + EP32, so each role spans 4 nodes × 8 NPUs = 32 NPUs.
The detailed startup scripts follow.
Prefill node 0. VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED is not set to enable DP_LB here, because from looking at the code this environment variable no longer exists on 0.11.0-dev:
#!/bin/sh
# this is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="eth0"
local_ip="172.20.113.109"
#export VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP=1
#export VLLM_WORKER_MULTIPROC_METHOD="fork"
#export VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED=1
export VLLM_TORCH_PROFILER_DIR=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/profile
#export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_USE_MODELSCOPE=True
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE="AIV"
# MC2 hierarchical communication
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH="/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json"
# The w8a8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-W8A8
# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve /workspace/models/deepseek-r1-0528 \
--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 4 \
--data-parallel-size-local 1 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--seed 1024 \
--served-model-name deepseek_r1 \
--enable-expert-parallel \
--max-num-seqs 2 \
--max-model-len 17000 \
--quantization ascend \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--enforce-eager \
--max-num-batched-tokens 16384 \
--no-enable-prefix-caching \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "LLMDataDistCMgrConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_producer",
"kv_parallel_size": "1",
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}' \
--additional-config '{"enable_shared_expert_dp":true}'
The scripts for the other prefill nodes are omitted here; a sketch of what changes on them is given below.
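For completeness, here is a sketch (not from the original report) of how a non-head prefill node could be launched, assuming vLLM's headless multi-node data-parallel launch. Only the arguments that differ from Prefill-0 are shown; the environment-variable block is the same apart from each node's own local_ip/nic_name:
#!/bin/sh
# Sketch for prefill node k (k = 1, 2, 3). Env vars as on Prefill-0 but with
# this node's own local_ip/nic_name. Serve flags not listed here are assumed
# identical to Prefill-0; the node runs headless and registers its DP rank
# with the head node at 172.20.113.109.
vllm serve /workspace/models/deepseek-r1-0528 \
    --headless \
    --data-parallel-size 4 \
    --data-parallel-size-local 1 \
    --data-parallel-start-rank 1 \
    --data-parallel-address 172.20.113.109 \
    --data-parallel-rpc-port 13389 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel
# Use --data-parallel-start-rank 2 and 3 on the remaining two prefill nodes.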
Decode node 0:
#!/bin/sh
# this is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="eth0"
local_ip="172.20.210.108"
export VLLM_TORCH_PROFILER_DIR=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/profile
#export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_USE_MODELSCOPE=True
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export HCCL_CONNECT_TIMEOUT=7200
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH="/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json"
# MC2 hierarchical communication on A2 machines
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
# The w8a8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-W8A8
# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve /workspace/models/deepseek-r1-0528 \
--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 32 \
--data-parallel-size-local 8 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 5964 \
--tensor-parallel-size 1 \
--seed 1024 \
--served-model-name deepseek_r1 \
--enable-expert-parallel \
--max-num-seqs 4 \
--max-model-len 17000 \
--quantization ascend \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--max-num-batched-tokens 256 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "LLMDataDistCMgrConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_consumer",
"kv_parallel_size": 1,
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}' \
--additional-config '{"torchair_graph_config":{"enabled":true,"enable_multistream_mla":true,"graph_batch_sizes":[4], "use_cached_graph":true},"multistream_overlap_shared_expert":true}'
The scripts for the other decode nodes are likewise omitted.
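Assuming the same headless data-parallel launch pattern sketched above for prefill, the remaining decode nodes would differ only in their own local_ip/nic_name and in --data-parallel-start-rank (8, 16 and 24, each with --data-parallel-size-local 8).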
We have also tried other deployments and parameter settings, for example splitting the 4 prefill nodes into two groups, each deployed as DP2 + TP8 + EP16; using aclgraph for prefill; and using Mooncake for the KV-cache transfer. None of these improved performance, and some settings made it worse.
Profile analysis
We collected profiles and tried to analyze one NPU profile file from a prefill node.
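For reference, with VLLM_TORCH_PROFILER_DIR exported as in the scripts above, a profile window can be captured through vLLM's profiler start/stop endpoints, roughly as below (a sketch, not necessarily how the trace discussed here was taken; host and port are Prefill-0's serve arguments):
# Start profiling on the prefill head node, run some requests, then stop.
curl -X POST http://172.20.113.109:8004/start_profile
# ... send a few benchmark requests ...
curl -X POST http://172.20.113.109:8004/stop_profile
# Trace files are written under the exported VLLM_TORCH_PROFILER_DIR:
# /vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/profile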
After the worker's execute_model there are 20+ execute_dummy_batch calls.
Inside the worker's execute_model, in quite a few layers a tensor-to-CPU copy accounts for a noticeably large share of the time.
This corresponds to the following code in the torchair_fused_experts_with_all2all method of vllm_ascend/torchair/quantization/torchair_w8a8_dynamic.py:
The profile analysis above is not necessarily correct and is only for reference; we hope to get a reasonable deployment and parameter configuration that matches MindIE's performance.
If any further information is needed, I will provide it as soon as possible. Thanks!