[Doc] Refactor the DeepSeek-V3.1 tutorial. #4399
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.
Code Review
This pull request adds a comprehensive tutorial for deploying the DeepSeek-V3.1 model. While the document covers various deployment scenarios, I've found several critical errors in the provided code snippets and configurations, particularly for multi-node and prefill-decode disaggregation setups. These issues, including Python syntax errors, incorrect data parallel configurations, and inconsistent model naming, would likely prevent users from successfully following the instructions. My review provides specific corrections to address these critical problems and improve the tutorial's accuracy and usability.
def run_command(visiable_devices. dp_rank, vllm_engine_port):
    command = [
        "bash",
        "./run_dp_template.sh",
        visiable_devices,
        str(vllm_engine_port),
        str(dp_size),
        str(dp_rank),
        dp_address,
        dp_rpc_port,
        str(tp_size),
    ]
    subprocess.run(command, check=True)

if __name__ == "__main__":
    template_path = "./run_dp_template.sh"
    if not os.path.exists(template_path):
        print(f"Template file {template_path} does not exist.")
        sys.exit(1)

    processes = []
    num_cards = dp_size_local * tp_size
    for i in range(dp_size_local):
        dp_rank = dp_rank_start + i
        vllm_engine_port = vllm_start_port + i
        visiable_devices = ",".join(str(x) for x in range(i * tp_size, (i + 1) * tp_size))
        process = multiprocessing.Process(target=run_command,
                                          args=(visiable_devices, dp_rank,
                                                vllm_engine_port))
This Python script has a syntax error and a recurring typo. The function definition on line 302 uses a period (.) instead of a comma (,). Additionally, the variable visiable_devices is misspelled throughout the script and should be visible_devices. These errors will prevent the script from running.
def run_command(visible_devices, dp_rank, vllm_engine_port):
    command = [
        "bash",
        "./run_dp_template.sh",
        visible_devices,
        str(vllm_engine_port),
        str(dp_size),
        str(dp_rank),
        dp_address,
        dp_rpc_port,
        str(tp_size),
    ]
    subprocess.run(command, check=True)

if __name__ == "__main__":
    template_path = "./run_dp_template.sh"
    if not os.path.exists(template_path):
        print(f"Template file {template_path} does not exist.")
        sys.exit(1)

    processes = []
    num_cards = dp_size_local * tp_size
    for i in range(dp_size_local):
        dp_rank = dp_rank_start + i
        vllm_engine_port = vllm_start_port + i
        visible_devices = ",".join(str(x) for x in range(i * tp_size, (i + 1) * tp_size))
        process = multiprocessing.Process(target=run_command,
                                          args=(visible_devices, dp_rank,
                                                vllm_engine_port))

# d0
python launch_dp_program.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 0 --dp-address 141.xx.xx.3 --dp-rpc-port 12321 --vllm-start-port 7100
# d1
python launch_dp_program.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 16 --dp-address 141.xx.xx.4 --dp-rpc-port 12321 --vllm-start-port 7100
The --dp-address for the second decode node (d1) is incorrect. In a distributed data-parallel setup, all worker nodes must point to the same master address. Here, it's set to its own IP (141.xx.xx.4), but it should point to the master node's IP, which is 141.xx.xx.3 as configured for d0.
Suggested change:
python launch_dp_program.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 16 --dp-address 141.xx.xx.3 --dp-rpc-port 12321 --vllm-start-port 7100
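For readers following along, the globals referenced inside the launcher script (dp_size, tp_size, dp_size_local, dp_rank_start, dp_address, dp_rpc_port, vllm_start_port) are expected to be populated from the command-line flags shown in these d0/d1 commands. Below is a minimal sketch of how such an argument parser might look; the flag names mirror the commands above, but the parser itself is an illustration and not the tutorial's actual code.

import argparse

parser = argparse.ArgumentParser(description="Launch one vLLM engine per local DP rank")
parser.add_argument("--dp-size", type=int, required=True)          # total data-parallel world size
parser.add_argument("--tp-size", type=int, required=True)          # tensor-parallel size per engine
parser.add_argument("--dp-size-local", type=int, required=True)    # DP ranks hosted on this node
parser.add_argument("--dp-rank-start", type=int, required=True)    # first DP rank owned by this node
parser.add_argument("--dp-address", type=str, required=True)       # master node IP, identical on every node
parser.add_argument("--dp-rpc-port", type=str, required=True)      # master RPC port, identical on every node
parser.add_argument("--vllm-start-port", type=int, required=True)  # first engine HTTP port on this node

args = parser.parse_args()
dp_size, tp_size = args.dp_size, args.tp_size
dp_size_local, dp_rank_start = args.dp_size_local, args.dp_rank_start
dp_address, dp_rpc_port = args.dp_address, args.dp_rpc_port
vllm_start_port = args.vllm_start_port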
--tensor-parallel-size 4 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3 \
The served-model-name for Node 1 (deepseek_v3) is inconsistent with the name used for Node 0 (deepseek_v3.1 on line 159). All nodes in a multi-node deployment must use the exact same served-model-name to function correctly.
Suggested change:
--served-model-name deepseek_v3.1 \
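To make the consequence concrete: clients address the deployment by this name, so a request like the sketch below only succeeds if every node registered the model under the same name, here assumed to be deepseek_v3.1. The host and port are placeholders, not values taken from the tutorial.

import requests

# Placeholder endpoint; substitute the actual service address from the tutorial.
url = "http://127.0.0.1:8000/v1/chat/completions"

payload = {
    # Must match the --served-model-name passed to every node; with a mismatched
    # name the OpenAI-compatible server typically rejects the request as "model not found".
    "model": "deepseek_v3.1",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}

response = requests.post(url, json=payload, timeout=60)
print(response.json())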
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?