
[Feature] Support multi-host worker indexing and atomic operations #3902

@andrewsykim

Search before asking

  • I searched the issues and found no similar feature request.

Description

When enabling multi-host worker groups with KubeRay (numOfHosts > 1), it would be useful to have the following labels on worker pods:

  1. A unique label per replica, representing a group of hosts. This label can be used for anti-affinity scheduling and label selectors to group pods within the same replica.
  2. An ordered index label that uniquely identifies each worker within a slice.

The labels should be generic so they also cover GPUs, TPUs, and any future accelerators that support multi-host. Maybe something like ray.io/replica-index=worker-group-name-<hash> and ray.io/host-index=<int>.
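To make the proposal concrete, here is a rough sketch of what the label assignment could look like for a worker group. It is plain Python rather than KubeRay code; the function name, the hash scheme, and the "tpu-group" name are assumptions for illustration only. Only the two label keys come from the suggestion above.

```python
import hashlib

# Label keys as suggested in this issue (not an existing KubeRay API).
REPLICA_LABEL = "ray.io/replica-index"
HOST_LABEL = "ray.io/host-index"


def multi_host_labels(group_name: str, replicas: int, num_of_hosts: int) -> list[dict[str, str]]:
    """Return one label dict per pod: replicas * num_of_hosts dicts in total."""
    pods = []
    for replica in range(replicas):
        # Hypothetical naming scheme: a short deterministic hash keeps the
        # value unique per replica and stable across reconciles.
        suffix = hashlib.sha256(f"{group_name}/{replica}".encode()).hexdigest()[:8]
        replica_id = f"{group_name}-{suffix}"
        for host in range(num_of_hosts):
            # Host indices are ordered 0..num_of_hosts-1 within each replica.
            pods.append({REPLICA_LABEL: replica_id, HOST_LABEL: str(host)})
    return pods


if __name__ == "__main__":
    for labels in multi_host_labels("tpu-group", replicas=2, num_of_hosts=4):
        print(labels)
```

Every pod in a replica shares the first label, so a label selector or pod anti-affinity term on ray.io/replica-index would pick out exactly one group of hosts.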

Use case

The first label can be used for atomic operations on worker groups. For example, if one worker in a multi-host group is deleted, all the workers in that group should be recreated. This is a common operation required when managing multi-host deployments of GPU or TPU workers.
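A minimal sketch of that atomic behavior, assuming pods are represented as simple dicts with a name and labels (this is not the KubeRay controller logic, and pods_to_recreate is a hypothetical helper): group the surviving pods by the replica label and flag every pod in any replica that has lost a host.

```python
from collections import defaultdict

REPLICA_LABEL = "ray.io/replica-index"  # label key proposed in this issue


def pods_to_recreate(pods: list[dict], num_of_hosts: int) -> list[str]:
    """Return names of surviving pods whose replica is missing at least one host."""
    by_replica = defaultdict(list)
    for pod in pods:
        by_replica[pod["labels"][REPLICA_LABEL]].append(pod["name"])

    incomplete = []
    for names in by_replica.values():
        # A replica with fewer than num_of_hosts pods is treated as broken,
        # so all of its remaining pods are replaced together.
        if len(names) < num_of_hosts:
            incomplete.extend(names)
    return sorted(incomplete)


if __name__ == "__main__":
    pods = [
        {"name": "w-a-0", "labels": {REPLICA_LABEL: "group-a"}},
        {"name": "w-a-1", "labels": {REPLICA_LABEL: "group-a"}},
        {"name": "w-b-0", "labels": {REPLICA_LABEL: "group-b"}},  # w-b-1 was deleted
    ]
    print(pods_to_recreate(pods, num_of_hosts=2))
```

Without a per-replica label there is no simple way to tell which surviving pods belonged to the same group as the deleted one.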

The second label is useful for use cases like JAX with TPUs, where each TPU worker needs to be given a unique worker ID in a slice.
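As a sketch of how the index label could turn into a worker ID, assuming the label value is a zero-based integer string: the helper below validates it and derives environment variables for the container. The variable names WORKER_ID and NUM_WORKERS are placeholders, not names any runtime is confirmed to read.

```python
HOST_LABEL = "ray.io/host-index"  # label key proposed in this issue


def worker_env(labels: dict[str, str], num_of_hosts: int) -> dict[str, str]:
    """Derive per-worker environment variables from the host index label."""
    index = int(labels[HOST_LABEL])
    if not 0 <= index < num_of_hosts:
        raise ValueError(f"host index {index} outside [0, {num_of_hosts})")
    # Placeholder variable names; a real integration would use whatever
    # the accelerator runtime expects.
    return {"WORKER_ID": str(index), "NUM_WORKERS": str(num_of_hosts)}


if __name__ == "__main__":
    print(worker_env({HOST_LABEL: "2"}, num_of_hosts=4))
```

In a pod spec, the label value could also be exposed to the container directly through the Kubernetes downward API, which avoids any lookup at runtime.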

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
