Fix Ray placement group allocation not respecting env VLLM_RAY_PER_WORKER_GPUS (fractional GPU) #22577
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request aims to respect the `VLLM_RAY_PER_WORKER_GPUS` environment variable when creating Ray placement groups, which is crucial for fractional GPU allocation. The changes correctly replace the hardcoded GPU request of `1.0` with the value from the environment variable. My review focuses on a potential side effect for users with multi-GPU setups. While the change is valid for your use case, it could lead to multiple workers being co-located on the same GPU in a multi-GPU scenario, which is generally unsupported and may cause failures. I've suggested adding a warning to alert users to this possibility.
```diff
@@ -338,6 +339,7 @@ def initialize_ray_cluster(
     else:
         logger.info("No current placement group found. "
                     "Creating a new placement group.")
+        device_resource_request = envs.VLLM_RAY_PER_WORKER_GPUS
```
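To make the effect of this one-line change concrete, here is a minimal standalone sketch (not vLLM's actual executor code) of how the placement group bundles would be built from the environment variable instead of a hardcoded `1.0`; the `world_size` value, the bundle shape, and the `num_gpus=1` cluster setting are illustrative assumptions.

```python
# Illustrative sketch only -- not vLLM's real implementation. It shows the idea
# behind the diff: each worker bundle requests the (possibly fractional) GPU
# amount from VLLM_RAY_PER_WORKER_GPUS rather than a hardcoded 1.0.
import os

import ray
from ray.util.placement_group import placement_group

# Advertise one logical GPU so the sketch runs even without physical hardware.
ray.init(num_gpus=1)

world_size = 1  # assumed number of workers, for illustration
per_worker_gpus = float(os.getenv("VLLM_RAY_PER_WORKER_GPUS", "1.0"))

# Previously every bundle was effectively {"GPU": 1.0}; with this change the
# request follows the environment variable, e.g. {"GPU": 0.9}.
bundles = [{"GPU": per_worker_gpus} for _ in range(world_size)]

pg = placement_group(bundles, strategy="PACK")
ray.get(pg.ready())  # blocks until Ray can satisfy every bundle request
print("placement group ready with bundles:", bundles)
```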
Using a fractional `VLLM_RAY_PER_WORKER_GPUS` when `world_size > 1` can lead to multiple workers from the same tensor-parallel group being scheduled on the same GPU. This is generally not supported and can cause failures.

While you've noted this is for a specific use case, this change could unintentionally affect users with multi-GPU setups. The previous implementation requested a full GPU (`1.0`) for each worker in the placement group, which prevented this co-location scenario. This PR changes that behavior.

To mitigate this risk for other users, please add a warning when `world_size > 1` and a fractional GPU value is used.
```python
device_resource_request = envs.VLLM_RAY_PER_WORKER_GPUS
if parallel_config.world_size > 1 and device_resource_request < 1.0:
    logger.warning(
        "VLLM_RAY_PER_WORKER_GPUS is set to %f, which is less than 1.0. "
        "When using multi-GPU inference (world_size > 1), this can "
        "cause multiple workers to be placed on the same GPU, which "
        "is not supported and may lead to unexpected behavior or "
        "failures. Please ensure that each worker is placed on a "
        "separate GPU.", device_resource_request)
```
Signed-off-by: eric-higgins-ai <[email protected]>
Purpose
This is pretty much the same change as #14521, but with the `STRICT_SPREAD` change removed. Just to add some context about our use case and address the comments raised on that PR (a sketch of the setup follows this list):

- We need to either create the actor with `num_gpus>0` or create it in a placement group with GPUs, and we don't want to do the latter for a reason that's not important here. Since vLLM requires 1 worker on the same node as the engine, we need to set e.g. `num_gpus=0.1` and `VLLM_RAY_PER_WORKER_GPUS=0.9`, but since this placement group validation doesn't respect `VLLM_RAY_PER_WORKER_GPUS`, this is impossible right now.
- One of the comments raised allowing a `placement_group` to be passed into `LLMEngine.from_engine_args`.
- If `VLLM_RAY_PER_WORKER_GPUS <= 0.5`, then it's already possible that multiple workers could share a GPU.
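As a sketch of the setup described above (all concrete values here, including the model name, `num_gpus=0.1`, and `VLLM_RAY_PER_WORKER_GPUS=0.9`, are illustrative rather than prescriptive): the engine lives inside a Ray actor that holds a small slice of a GPU, and the vLLM Ray worker is expected to request the remaining fraction, which only works once the placement group validation honours the environment variable.

```python
# Hedged sketch of the use case, not a supported recipe: run the vLLM engine
# inside a Ray actor that reserves a small GPU fraction, while the Ray worker
# spawned by vLLM requests the rest via VLLM_RAY_PER_WORKER_GPUS.
import os

import ray


@ray.remote(num_gpus=0.1)  # the engine actor itself holds a sliver of the GPU
class EngineActor:
    def __init__(self):
        # Must be set before vLLM initializes its Ray executor so the
        # placement group request uses the fractional value.
        os.environ["VLLM_RAY_PER_WORKER_GPUS"] = "0.9"
        from vllm import LLM

        self.llm = LLM(
            model="facebook/opt-125m",  # placeholder model
            tensor_parallel_size=1,
            distributed_executor_backend="ray",
        )

    def generate(self, prompt: str):
        return [out.outputs[0].text for out in self.llm.generate([prompt])]


ray.init()
engine = EngineActor.remote()
print(ray.get(engine.generate.remote("Hello, my name is")))
```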
Test Plan
We applied the patch internally and ran some jobs, and checked that they all started up successfully.