[train][Fix] Fix process group creation with vLLM for `run_engines_locally=false` #789

SumanthRH · 2025-12-17T16:49:27Z

What does this PR do?

Fixes process group creation with vLLM for run_engines_locally=false. Currently, weight syncing with run_engines_locally=false is broken, and the remote engine example https://github.com/NovaSky-AI/SkyRL/blob/main/skyrl-train/examples/remote_inference_engine/run_remote.sh fails with NCCL error: invalid argument

This is caused by our custom process group creation logic; I suspect it broke after a PyTorch upgrade. The recommended approach is to use vLLM's StatelessProcessGroup abstraction, which creates multiple process groups without polluting global state.

This PR only modifies the vLLM + NCCL backend codepath to use the new process group creation logic. Fixes for SGLang will come in a follow-up PR.
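For readers unfamiliar with the abstraction, here is a minimal sketch of what a `stateless_init_process_group` helper can look like. It follows the pattern used in vLLM's own RLHF examples and is not necessarily the code in this PR. The `plan_sync_group` helper and the `SyncGroupLayout` type are hypothetical, and the rank layout (trainer at rank 0, engine workers after it) is an assumption made for illustration. The vLLM import paths may differ between vLLM versions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncGroupLayout:
    """Hypothetical description of a weight-sync group (not from the PR)."""
    world_size: int
    trainer_rank: int
    engine_ranks: list  # engine_ranks[e][t] is the rank of TP worker t in engine e


def plan_sync_group(num_engines: int, tp_size: int) -> SyncGroupLayout:
    # Assumption: a single trainer rank (0) broadcasts weights to every
    # tensor-parallel worker of every inference engine.
    engine_ranks = [
        [1 + e * tp_size + t for t in range(tp_size)]
        for e in range(num_engines)
    ]
    return SyncGroupLayout(
        world_size=1 + num_engines * tp_size,
        trainer_rank=0,
        engine_ranks=engine_ranks,
    )


def stateless_init_process_group(master_address, master_port, rank, world_size, device):
    # Imported lazily so the layout helper above works without vLLM installed.
    from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
    from vllm.distributed.utils import StatelessProcessGroup

    # The stateless group does not touch torch.distributed's default group,
    # so several independent groups can coexist in one process.
    pg = StatelessProcessGroup.create(
        host=master_address, port=master_port, rank=rank, world_size=world_size
    )
    # NCCL collectives (e.g. broadcast) are issued through this communicator.
    return PyNcclCommunicator(pg, device=device)
```

Every participant (the trainer and each engine worker) would call `stateless_init_process_group` with the same address, port, and world size but its own rank; the trainer then broadcasts updated weights through the returned communicator.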

Signed-off-by: SumanthRH <[email protected]>

gemini-code-assist

Code Review

This pull request introduces a fix for creating process groups with vLLM when run_engines_locally=false. A new utility stateless_init_process_group is added to leverage vLLM's StatelessProcessGroup, avoiding conflicts with the default torch distributed process group. This is conditionally used when the weight_sync_backend is nccl and the generator backend is vllm. The changes are consistently applied across the vLLM inference engine and the DeepSpeed/FSDP workers. The overall logic is sound, but I found a critical typo in the FSDP worker that needs to be addressed.
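The condition described above can be sketched as a small predicate. The function name is hypothetical; only the `nccl` and `vllm` values come from the review text.

```python
def should_use_stateless_group(weight_sync_backend: str, generator_backend: str) -> bool:
    # Only the vLLM + NCCL codepath switches to the stateless process group;
    # other combinations keep the existing creation logic.
    return weight_sync_backend == "nccl" and generator_backend == "vllm"
```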

skyrl-train/skyrl_train/workers/fsdp/fsdp_worker.py

SumanthRH added 2 commits December 17, 2025 15:13

use vllm's stateless init process group

e07d5a7

Signed-off-by: SumanthRH <[email protected]>

x

ce66888

Signed-off-by: SumanthRH <[email protected]>

gemini-code-assist bot reviewed Dec 17, 2025

skyrl-train/skyrl_train/workers/fsdp/fsdp_worker.py (outdated)

x

0041f38

Signed-off-by: SumanthRH <[email protected]>

SumanthRH marked this pull request as ready for review December 17, 2025 17:13

SumanthRH marked this pull request as draft December 17, 2025 17:51

x

fa7927e

Signed-off-by: SumanthRH <[email protected]>
