Skip to content

Conversation

@SumanthRH
Copy link
Member

@SumanthRH SumanthRH commented Dec 17, 2025

What does this PR do?

Fixes process group creation with vLLM for run_engines_locally=false. Currently, weight syncing with run_engines_locally is broken, and the remote engine example https://github.com/NovaSky-AI/SkyRL/blob/main/skyrl-train/examples/remote_inference_engine/run_remote.sh fails with NCCL error : invalid argument

This is because of our custom process group creation logic - I suspect this broke after a PyTorch upgrade. It is recommended to use vLLM's StatelessProcessGroup abstraction for creating multiple process groups without polluting global state.

This PR only modifies the vLLM + NCCL backend codepath to use the new process group creation logic. Fixes for Sglang will follow in a follow-up PR

x
Signed-off-by: SumanthRH <[email protected]>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a fix for creating process groups with vLLM when run_engines_locally=false. A new utility stateless_init_process_group is added to leverage vLLM's StatelessProcessGroup, avoiding conflicts with the default torch distributed process group. This is conditionally used when the weight_sync_backend is nccl and the generator backend is vllm. The changes are consistently applied across the vLLM inference engine and the DeepSpeed/FSDP workers. The overall logic is sound, but I found a critical typo in the FSDP worker that needs to be addressed.

x
Signed-off-by: SumanthRH <[email protected]>
@SumanthRH SumanthRH marked this pull request as ready for review December 17, 2025 17:13
@SumanthRH SumanthRH marked this pull request as draft December 17, 2025 17:51
x
Signed-off-by: SumanthRH <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant