Fix Ray placement group allocation is not respecting env VLLM_RAY_PER_WORKER_GPUS (fractional gpu) #22577

Open: eric-higgins-ai wants to merge 4 commits into main

Conversation

@eric-higgins-ai commented Aug 10, 2025

Purpose

This is essentially the same change as #14521, but with the STRICT_SPREAD change removed. To add some context about our use case and address the comments raised on that PR:

  • We run vLLM as a step in a Ray Data pipeline, where our Ray cluster has one CPU-only head node, some GPU worker nodes with 1 GPU each, and some CPU-only worker nodes. vLLM requires the engine to run on a GPU node, and to do that we need to either run the engine actor with num_gpus>0 or create it in a placement group with GPUs; we don't want to do the latter, for reasons that aren't important here. Since vLLM requires one worker on the same node as the engine, we need to set e.g. num_gpus=0.1 and VLLM_RAY_PER_WORKER_GPUS=0.9, but because the placement group validation doesn't respect VLLM_RAY_PER_WORKER_GPUS, that is currently impossible. (A sketch of this setup is included after this list.)
    • Let me know if there's a better solution, though. My main constraint is that I want to avoid scheduling the engine actor in a placement group, and this seems like it should be supported (considering that the code to create a new placement group exists), but I'm open to other solutions if I've missed anything. Another possible solution would be to allow a placement_group to be passed into LLMEngine.from_engine_args.
  • Addressing this comment from the previous PR: "With this change, workers from the same TP or PP group of a vLLM instance may colocate with each other, which may cause failures". This seems to me like an existing issue rather than one introduced by this PR. vLLM creates placement group bundles that request 1 GPU, and Ray allows scheduling multiple actors to one bundle, so if VLLM_RAY_PER_WORKER_GPUS <= 0.5 it is already possible for multiple workers to share a GPU.
    • I'm not sure whether Ray will spread tasks among bundles when possible, i.e. whether, given 5 bundles with 1 GPU each and 5 workers each requesting 0.5 GPU, Ray will always prefer to use all 5 bundles. Its exact behavior in this case isn't documented, so it's better not to rely on it.
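
To make the setup concrete, here's a minimal sketch of the kind of Ray Data pipeline described above. It's an illustration under assumptions rather than our actual pipeline code: the model, the predictor class, the batch size, and the map_batches arguments are placeholders; only the 0.1 / 0.9 split between the engine actor's num_gpus and VLLM_RAY_PER_WORKER_GPUS comes from the description.

import os

import numpy as np
import ray


class VLLMPredictor:
    def __init__(self):
        # The env var must be visible in the process that creates the engine
        # (this actor), since that is where the placement group gets built.
        os.environ["VLLM_RAY_PER_WORKER_GPUS"] = "0.9"
        from vllm import LLM

        # With the Ray executor backend, vLLM launches its worker(s) as Ray
        # actors in a placement group; with this PR the worker bundles request
        # 0.9 GPU instead of a hardcoded 1.0, so a worker can share a 1-GPU
        # node with this 0.1-GPU engine actor.
        self.llm = LLM(model="facebook/opt-125m",
                       distributed_executor_backend="ray")

    def __call__(self, batch):
        prompts = [str(p) for p in batch["prompt"]]
        outputs = self.llm.generate(prompts)
        batch["generated"] = np.array([o.outputs[0].text for o in outputs])
        return batch


ds = ray.data.from_items([{"prompt": "Hello, my name is"}] * 8)
# num_gpus=0.1 puts the engine actor on a GPU node while leaving 0.9 GPU free
# for the co-located vLLM worker.
ds = ds.map_batches(VLLMPredictor, concurrency=1, num_gpus=0.1, batch_size=4)
ds.show()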

Test Plan

We applied the patch internally and ran some jobs, and checked that they all started up successfully.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs will not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request aims to respect the VLLM_RAY_PER_WORKER_GPUS environment variable when creating Ray placement groups, which is crucial for fractional GPU allocation. The changes correctly replace the hardcoded GPU request of 1.0 with the value from the environment variable. My review focuses on a potential side effect for users with multi-GPU setups. While the change is valid for your use case, it could lead to multiple workers being co-located on the same GPU in a multi-GPU scenario, which is generally unsupported and may cause failures. I've suggested adding a warning to alert users to this possibility.
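
For context, a minimal sketch, not the exact vLLM source, of how a fractional per-worker GPU request could feed into the bundles of the newly created placement group; the helper name, the PACK strategy, and the absence of CPU requests are illustrative assumptions:

import ray
from ray.util.placement_group import placement_group


def create_worker_placement_group(world_size: int, per_worker_gpus: float):
    """One bundle per worker, each asking for a fraction of a GPU."""
    # Each bundle requests the configured fraction instead of a hardcoded 1.0,
    # so e.g. a 0.9-GPU worker bundle can fit on a 1-GPU node that already
    # hosts a 0.1-GPU engine actor.
    bundles = [{"GPU": per_worker_gpus} for _ in range(world_size)]
    pg = placement_group(bundles, strategy="PACK")
    ray.get(pg.ready())  # block until Ray has reserved the resources
    return pg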

@@ -338,6 +339,7 @@ def initialize_ray_cluster(
     else:
         logger.info("No current placement group found. "
                     "Creating a new placement group.")
+        device_resource_request = envs.VLLM_RAY_PER_WORKER_GPUS
Severity: high

Using a fractional VLLM_RAY_PER_WORKER_GPUS when world_size > 1 can lead to multiple workers from the same tensor-parallel group being scheduled on the same GPU. This is generally not supported and can cause failures.

While you've noted this is for a specific use case, this change could unintentionally affect users with multi-GPU setups. The previous implementation requested a full GPU (1.0) for each worker in the placement group, which prevented this co-location scenario. This PR changes that behavior.

To mitigate this risk for other users, please add a warning when world_size > 1 and a fractional GPU value is used.

        device_resource_request = envs.VLLM_RAY_PER_WORKER_GPUS
        if parallel_config.world_size > 1 and device_resource_request < 1.0:
            logger.warning(
                "VLLM_RAY_PER_WORKER_GPUS is set to %f, which is less than 1.0. "
                "When using multi-GPU inference (world_size > 1), this can "
                "cause multiple workers to be placed on the same GPU, which "
                "is not supported and may lead to unexpected behavior or "
                "failures. Please ensure that each worker is placed on a "
                "separate GPU.", device_resource_request)
