[DP] support torchrun external launcher with Data Parallelism #24899

luccafong · 2025-09-15T18:19:04Z

Purpose

Support torchrun DP/EP with MOE models
Add CI tests for MOE models on torchrun

Test Plan/Results

Simple Example

torchrun --nproc-per-node=2 examples/offline_inference/torchrun_dp_example.py
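As a rough illustration of what a script launched this way has to do, the sketch below derives a data-parallel rank from the `RANK` and `WORLD_SIZE` environment variables that torchrun sets, then gives each DP rank its own slice of the prompts. It is not the code in `torchrun_dp_example.py`. The helper names `dp_rank_from_env` and `shard_prompts` are hypothetical, and both the rank layout (`dp_rank = rank // tp_size`) and the round-robin split are assumptions made for the example.

```python
import os


def dp_rank_from_env(tp_size: int = 1) -> tuple[int, int]:
    """Return (dp_rank, dp_size) from the variables torchrun exports.

    Assumption: tensor-parallel ranks are contiguous, so every block of
    `tp_size` global ranks forms one data-parallel replica.
    """
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    if world_size % tp_size != 0:
        raise ValueError("WORLD_SIZE must be a multiple of tp_size")
    return rank // tp_size, world_size // tp_size


def shard_prompts(prompts: list[str], dp_rank: int, dp_size: int) -> list[str]:
    """Round-robin split so each DP rank gets a distinct subset of prompts."""
    return prompts[dp_rank::dp_size]
```

With `--nproc-per-node=2` and `tp_size=1`, rank 0 would take prompts 0, 2, 4, ... and rank 1 would take prompts 1, 3, 5, ..., so the slices can differ in length by one.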

lm_eval

Requires patching lm_eval with PR EleutherAI/lm-evaluation-harness#3304.

torchrun --nproc-per-node=8 --no-python lm_eval --model vllm --model_args "pretrained=/data/local/models/oss/DeepSeek-R1-0528,max_model_len=20000,gpu_memory_utilization=0.9,tensor_parallel_size=1,data_parallel_size=8,enable_expert_parallel=true,max_num_seqs=256,distributed_executor_backend=external_launcher" --batch_size 256 --task gsm8k

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.956|±  |0.0056|
|     |       |strict-match    |     5|exact_match|↑  |0.953|±  |0.0058|

Baseline (non-torchrun version)

lm_eval --model vllm --model_args "pretrained=/data/local/models/oss/DeepSeek-R1-0528,max_model_len=20000,gpu_memory_utilization=0.9,tensor_parallel_size=8,data_parallel_size=1,enable_expert_parallel=true,max_num_seqs=256" --batch_size 256 --task gsm8k

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9568|±  |0.0056|
|     |       |strict-match    |     5|exact_match|↑  |0.9545|±  |0.0057|
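One way to read the two tables together is to check whether the gap between the torchrun run and the baseline is small relative to the reported standard errors. The sketch below does that arithmetic on the numbers above. The function name `compare_scores` is hypothetical, and treating the two runs as independent samples with a 1.96 threshold is a simplifying assumption, not something the PR states.

```python
import math


def compare_scores(a: float, se_a: float, b: float, se_b: float, z: float = 1.96):
    """Return (|a - b|, combined stderr, whether the gap is within z * stderr).

    Assumption: the two runs are treated as independent, so their standard
    errors are combined in quadrature.
    """
    diff = abs(a - b)
    combined = math.sqrt(se_a**2 + se_b**2)
    return diff, combined, diff <= z * combined


# (torchrun value, stderr, baseline value, stderr), copied from the tables.
RESULTS = {
    "flexible-extract": (0.956, 0.0056, 0.9568, 0.0056),
    "strict-match": (0.953, 0.0058, 0.9545, 0.0057),
}

for name, row in RESULTS.items():
    diff, combined, ok = compare_scores(*row)
    print(f"{name}: diff={diff:.4f} combined_stderr={combined:.4f} within_noise={ok}")
```

Under these assumptions both filters differ by well under one combined standard error.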

Added CI tests

  - TP_SIZE=4 torchrun --nproc-per-node=4 distributed/test_torchrun_example_moe.py
  # test with torchrun tp=2, pp=2 and dp=1
  - PP_SIZE=2 TP_SIZE=2 torchrun --nproc-per-node=4 distributed/test_torchrun_example_moe.py
  # test with torchrun tp=1 and dp=4 with ep
  - DP_SIZE=4 ENABLE_EP=1 torchrun --nproc-per-node=4 distributed/test_torchrun_example_moe.py
  # test with torchrun tp=2 and dp=2 with ep
  - TP_SIZE=2 DP_SIZE=2 ENABLE_EP=1 torchrun --nproc-per-node=4 distributed/test_torchrun_example_moe.py
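Each of the four commands above launches 4 processes, which matches the product of the tensor-, pipeline-, and data-parallel sizes it sets. The sketch below checks that relationship. The helper `expected_world_size` is hypothetical, and defaulting an unset variable to 1 is an assumption about how the test script reads its environment.

```python
def expected_world_size(env: dict[str, str]) -> int:
    """Number of processes implied by the parallelism settings.

    Assumption: any size variable that is not set defaults to 1.
    """
    tp = int(env.get("TP_SIZE", "1"))
    pp = int(env.get("PP_SIZE", "1"))
    dp = int(env.get("DP_SIZE", "1"))
    return tp * pp * dp


# Environment of each CI command, paired with its --nproc-per-node value.
CI_CASES = [
    ({"TP_SIZE": "4"}, 4),
    ({"PP_SIZE": "2", "TP_SIZE": "2"}, 4),
    ({"DP_SIZE": "4", "ENABLE_EP": "1"}, 4),
    ({"TP_SIZE": "2", "DP_SIZE": "2", "ENABLE_EP": "1"}, 4),
]

for env, nproc in CI_CASES:
    assert expected_world_size(env) == nproc, (env, nproc)
```

`ENABLE_EP` does not enter the product here because it is a flag rather than a size and appears only on the two DP commands.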

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

mergify · 2025-09-17T05:43:05Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @luccafong.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

zhuohan123 · 2025-09-22T17:55:36Z

Will force merge since the CI failure is not caused by this PR and is being fixed by #25396

facebook-github-bot · 2025-09-22T20:55:28Z

@kingsmad has imported this pull request. If you are a Meta employee, you can view this in D82998295.

facebook-github-bot · 2025-09-22T21:27:45Z

This pull request has been imported. If you are a Meta employee, you can view this in D82998295.

mergify bot added the documentation and v1 labels Sep 15, 2025

luccafong force-pushed the torchrun_dp branch 2 times, most recently from 3f09d97 to 521eeab on September 15, 2025 19:54

zhuohan123 reviewed Sep 15, 2025

vllm/config/parallel.py (outdated, resolved)

mergify bot added the needs-rebase label Sep 17, 2025

luccafong force-pushed the torchrun_dp branch from 569485c to 43fd309 on September 18, 2025 00:05

luccafong requested a review from zou3519 September 18, 2025 00:45

zou3519 approved these changes Sep 18, 2025

examples/offline_inference/torchrun_dp_example.py (resolved)

vllm/compilation/decorators.py (outdated, resolved)

BoyuanFeng reviewed Sep 18, 2025

vllm/compilation/backends.py (outdated, resolved)

BoyuanFeng approved these changes Sep 18, 2025

luccafong force-pushed the torchrun_dp branch from 43fd309 to 56f3d37 on September 19, 2025 01:35

luccafong mentioned this pull request Sep 19, 2025

Support torchrun vllm DP EleutherAI/lm-evaluation-harness#3304

Open

luccafong force-pushed the torchrun_dp branch from 56f3d37 to 33881ce on September 19, 2025 20:14

mergify bot added ci/build and removed needs-rebase labels Sep 19, 2025

luccafong marked this pull request as ready for review September 19, 2025 20:28

luccafong requested review from WoosukKwon, alexm-redhat, comaniac, hmellor, houseroad, mgoin, njhill, robertgshaw2-redhat, simon-mo, tlrmchlsmth, youkaichao and ywang96 as code owners September 19, 2025 20:28

luccafong and others added 6 commits September 20, 2025 19:25

ensure all dp rank step when there is remaining requests on other dp rank

9977530

Signed-off-by: Lu Fang <[email protected]>

add ci tests and safe cleanup

cedff2f

Signed-off-by: Lu Fang <[email protected]>

address commmen and safer cleanup

4fe81d7

Signed-off-by: Lu Fang <[email protected]>

fix lint

bf70583

Signed-off-by: Lu Fang <[email protected]>

minor fix

39f66e5

Signed-off-by: Zhuohan Li <[email protected]>

cleanup

04a487d

Signed-off-by: Lu Fang <[email protected]>

luccafong force-pushed the torchrun_dp branch from 93aca33 to 04a487d on September 21, 2025 02:26

mergify bot removed the needs-rebase label Sep 21, 2025

luccafong and others added 5 commits September 20, 2025 20:51

Merge branch 'main' into torchrun_dp

f710310

deprecate v0 test

5106a65

Signed-off-by: Lu Fang <[email protected]>

Merge branch 'main' into torchrun_dp

2d88de3

Merge branch 'main' into torchrun_dp

84de3f1

Merge branch 'main' into torchrun_dp

b4ab007

zhuohan123 disabled auto-merge September 22, 2025 17:54

zhuohan123 merged commit 922979b into vllm-project:main Sep 22, 2025
76 of 78 checks passed

LucasWilkinson mentioned this pull request Oct 21, 2025

[Bug]: Hang issue for offline inference when using DP #27269

Closed

1 task

markmc mentioned this pull request Dec 3, 2025

Fix LLMEngine.del dp_group cleanup condition #29954

Merged

5 tasks
