
Conversation

@1092626063 (Contributor) commented Nov 24, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions (bot) commented

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 24, 2025
@gemini-code-assist (Contributor) left a comment


Code Review

This pull request adds a comprehensive tutorial for deploying the DeepSeek-V3.1 model. While the document covers various deployment scenarios, I've found several critical errors in the provided code snippets and configurations, particularly for multi-node and prefill-decode disaggregation setups. These issues, including Python syntax errors, incorrect data parallel configurations, and inconsistent model naming, would likely prevent users from successfully following the instructions. My review provides specific corrections to address these critical problems and improve the tutorial's accuracy and usability.

Comment on lines 302 to 333
def run_command(visiable_devices. dp_rank, vllm_engine_port):
    command = [
        "bash",
        "./run_dp_template.sh",
        visiable_devices,
        str(vllm_engine_port),
        str(dp_size),
        str(dp_rank),
        dp_address,
        dp_rpc_port,
        str(tp_size),
    ]
    subprocess.run(command, check=True)

if __name__ == "__main__":
    template_path = "./run_dp_template.sh"
    if not os.path.exists(template_path):
        print(f"Template file {template_path} does not exist.")
        sys.exit(1)

    processes = []
    num_cards = dp_size_local * tp_size
    for i in range(dp_size_local):
        dp_rank = dp_rank_start + i
        vllm_engine_port = vllm_start_port + i
        visiable_devices = ",".join(str(x) for x in range(i * tp_size, (i + 1) * tp_size))
        process = multiprocessing.Process(target=run_command,
                                          args=(visiable_devices, dp_rank,
                                                vllm_engine_port))

critical

This Python script has a syntax error and a recurring typo. The function definition on line 302 uses a period . instead of a comma ,. Additionally, the variable visiable_devices is misspelled throughout the script and should be visible_devices. These errors will prevent the script from running.

def run_command(visible_devices, dp_rank, vllm_engine_port):
    command = [
        "bash",
        "./run_dp_template.sh",
        visible_devices,
        str(vllm_engine_port),
        str(dp_size),
        str(dp_rank),
        dp_address,
        dp_rpc_port,
        str(tp_size),
    ]
    subprocess.run(command, check=True)

if __name__ == "__main__":
    template_path = "./run_dp_template.sh"
    if not os.path.exists(template_path):
        print(f"Template file {template_path} does not exist.")
        sys.exit(1)
    
    processes = []
    num_cards = dp_size_local * tp_size
    for i in range(dp_size_local):
        dp_rank = dp_rank_start + i
        vllm_engine_port = vllm_start_port + i
        visible_devices = ",".join(str(x) for x in range(i * tp_size, (i + 1) * tp_size))
        process = multiprocessing.Process(target=run_command,
                                        args=(visible_devices, dp_rank,
                                                vllm_engine_port))
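The corrected snippet still relies on module-level variables (dp_size, dp_address, dp_rpc_port, tp_size, dp_size_local, dp_rank_start, vllm_start_port) that the full script presumably derives from command-line flags. A minimal hedged sketch of that parsing, with flag names inferred from the launch commands in the tutorial (the real launch_dp_program.py may differ; the IP below is a placeholder):

```python
import argparse

def parse_args(argv=None):
    # Flag names are inferred from the tutorial's launch commands;
    # the actual script may use different names or defaults.
    parser = argparse.ArgumentParser(description="Launch local DP engine processes")
    parser.add_argument("--dp-size", type=int, required=True)
    parser.add_argument("--tp-size", type=int, required=True)
    parser.add_argument("--dp-size-local", type=int, required=True)
    parser.add_argument("--dp-rank-start", type=int, required=True)
    parser.add_argument("--dp-address", required=True)
    parser.add_argument("--dp-rpc-port", required=True)
    parser.add_argument("--vllm-start-port", type=int, required=True)
    return parser.parse_args(argv)

args = parse_args(
    "--dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 0 "
    "--dp-address 141.0.0.3 --dp-rpc-port 12321 --vllm-start-port 7100".split()
)
print(args.dp_size, args.dp_rank_start, args.vllm_start_port)
```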

# d0
python launch_dp_program.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 0 --dp-address 141.xx.xx.3 --dp-rpc-port 12321 --vllm-start-port 7100
# d1
python launch_dp_program.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 16 --dp-address 141.xx.xx.4 --dp-rpc-port 12321 --vllm-start-port 7100

critical

The --dp-address for the second decode node (d1) is incorrect. In a distributed data-parallel setup, all worker nodes must point to the same master address. Here, it's set to its own IP (141.xx.xx.4), but it should point to the master node's IP, which is 141.xx.xx.3 as configured for d0.

Suggested change
python launch_dp_program.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 16 --dp-address 141.xx.xx.4 --dp-rpc-port 12321 --vllm-start-port 7100
python launch_dp_program.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 16 --dp-address 141.xx.xx.3 --dp-rpc-port 12321 --vllm-start-port 7100
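With these flags, each node spawns dp-size-local engine processes whose ranks, ports, and device lists follow the loop in the launcher script above. A small sketch reproducing that mapping for node d1 (this mirrors the quoted loop; it is illustrative, not part of the tutorial's scripts):

```python
def local_rank_layout(dp_rank_start, dp_size_local, tp_size, vllm_start_port):
    # Mirrors the per-process mapping in launch_dp_program.py: local process i
    # gets DP rank dp_rank_start + i, engine port vllm_start_port + i, and
    # tp_size consecutive device IDs.
    layout = []
    for i in range(dp_size_local):
        dp_rank = dp_rank_start + i
        port = vllm_start_port + i
        devices = ",".join(str(x) for x in range(i * tp_size, (i + 1) * tp_size))
        layout.append((dp_rank, port, devices))
    return layout

# Node d1: --dp-rank-start 16 --dp-size-local 16 --tp-size 1 --vllm-start-port 7100
layout = local_rank_layout(16, 16, 1, 7100)
print(layout[0])   # first local process: (16, 7100, '0')
print(layout[-1])  # last local process:  (31, 7115, '15')
```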

--tensor-parallel-size 4 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3 \

high

The served-model-name for Node 1 (deepseek_v3) is inconsistent with the name used for Node 0 (deepseek_v3.1 on line 159). All nodes in a multi-node deployment must use the exact same served-model-name to function correctly.

Suggested change
--served-model-name deepseek_v3 \
--served-model-name deepseek_v3.1 \
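Clients address the deployment by this name in the "model" field of their requests, which is why every node must serve the identical name. A hedged sketch of a minimal OpenAI-compatible request body (endpoint, port, and parameters are illustrative assumptions, not from the tutorial):

```python
import json

# Must match --served-model-name on every node in the deployment.
SERVED_MODEL_NAME = "deepseek_v3.1"

# Minimal OpenAI-compatible completion payload; field values are examples.
payload = {
    "model": SERVED_MODEL_NAME,
    "prompt": "Hello",
    "max_tokens": 16,
}
print(json.dumps(payload))
```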

@1092626063 force-pushed the DeepSeek3.1 branch 2 times, most recently from c2ec2e3 to e29fb14 on November 24, 2025 at 12:26
Signed-off-by: 1092626063 <[email protected]>