Search before asking
- I searched the issues and found no similar feature request.
Description
When enabling multi-host worker groups with KubeRay (numOfHosts > 1), it would be really useful to have labels for the following:
- A unique label per replica, representing a group of hosts. This label can be used for anti-affinity scheduling and label selectors to group pods within the same replica.
- An ordered index label that uniquely identifies each worker (host) within a slice.
The labels should be generic to enable future use-cases for GPUs, TPUs and other future accelerators that support multi-host. Maybe something like ray.io/replica-index=worker-group-name-<hash> and ray.io/host-index=<int>.
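To make the proposal concrete, here is a rough Python sketch of how the two labels could be assigned to the pods of a multi-host worker group. The label keys are the ones suggested above and are not existing KubeRay labels. The function name, the hashing scheme, and the 8-character suffix are assumptions for illustration only; an actual implementation would live in the KubeRay operator.

```python
import hashlib

# Label keys proposed in this issue; hypothetical, not an existing KubeRay API.
REPLICA_LABEL = "ray.io/replica-index"
HOST_LABEL = "ray.io/host-index"


def multi_host_labels(group_name, replicas, num_of_hosts):
    """Return one label dict per pod for a multi-host worker group.

    Every pod in the same replica shares a replica label value, and each
    pod gets an ordered host index in the range [0, num_of_hosts).
    """
    pods = []
    for replica in range(replicas):
        # Assumed scheme: a short, stable hash of the group name and replica number.
        digest = hashlib.sha256(f"{group_name}/{replica}".encode()).hexdigest()[:8]
        replica_id = f"{group_name}-{digest}"
        for host in range(num_of_hosts):
            # Kubernetes label values are strings, so the index is stringified.
            pods.append({REPLICA_LABEL: replica_id, HOST_LABEL: str(host)})
    return pods


labels = multi_host_labels("tpu-group", replicas=2, num_of_hosts=4)
```

With these labels, a label selector on ray.io/replica-index would select exactly the hosts of one replica, which is what anti-affinity rules and grouping need.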
Use case
The first label can be used for atomic operation of worker groups. For example, if one worker in a multi-host group is deleted, all the workers should be recreated. This is a common operation required when managing multi-host deployments of GPU or TPU workers.
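As a sketch of the atomic behavior described above, the following Python function groups pods by the proposed replica label and returns every pod belonging to a replica that is missing at least one host. The pod representation (a dict with "name" and "labels") and the function name are assumptions made for this example, not KubeRay code.

```python
from collections import defaultdict

# Label keys proposed in this issue; hypothetical, not an existing KubeRay API.
REPLICA_LABEL = "ray.io/replica-index"
HOST_LABEL = "ray.io/host-index"


def pods_to_recreate(pods, num_of_hosts):
    """Return the names of pods whose replica is incomplete.

    `pods` is a list of dicts with "name" and "labels" keys. A replica is
    complete only if it has exactly the host indices 0..num_of_hosts-1.
    """
    by_replica = defaultdict(list)
    for pod in pods:
        by_replica[pod["labels"][REPLICA_LABEL]].append(pod)

    expected = set(range(num_of_hosts))
    doomed = []
    for members in by_replica.values():
        present = {int(p["labels"][HOST_LABEL]) for p in members}
        if present != expected:
            # One host is gone, so the surviving hosts must be replaced too.
            doomed.extend(p["name"] for p in members)
    return sorted(doomed)


pods = [
    {"name": "a-0", "labels": {REPLICA_LABEL: "grp-a", HOST_LABEL: "0"}},
    {"name": "a-1", "labels": {REPLICA_LABEL: "grp-a", HOST_LABEL: "1"}},
    {"name": "b-0", "labels": {REPLICA_LABEL: "grp-b", HOST_LABEL: "0"}},
]
stale = pods_to_recreate(pods, num_of_hosts=2)  # only "b-0" is returned
```

Note that a replica whose pods have all disappeared has no surviving labels to inspect, so a controller would still need to compare against the desired replica count.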
The second label is useful for use-cases like JAX with TPUs where each TPU worker needs to be given a unique worker ID in a slice.
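For example, if the host index label were exposed to the container as an environment variable (for instance through the Kubernetes Downward API), a worker could derive its JAX process ID from it. The sketch below only builds the arguments; the environment variable name RAY_HOST_INDEX and the helper name are assumptions, while coordinator_address, num_processes and process_id are parameters of jax.distributed.initialize.

```python
import os


def jax_distributed_kwargs(coordinator_address, num_of_hosts, env=None):
    """Build keyword arguments for jax.distributed.initialize.

    Reads the host index from RAY_HOST_INDEX, a hypothetical variable that
    would be populated from the proposed ray.io/host-index label.
    """
    env = os.environ if env is None else env
    host_index = int(env["RAY_HOST_INDEX"])
    if not 0 <= host_index < num_of_hosts:
        raise ValueError(
            f"host index {host_index} is outside [0, {num_of_hosts})"
        )
    return {
        "coordinator_address": coordinator_address,
        "num_processes": num_of_hosts,
        "process_id": host_index,
    }


kwargs = jax_distributed_kwargs(
    "10.0.0.1:8476", num_of_hosts=4, env={"RAY_HOST_INDEX": "2"}
)
# On a real worker this would be followed by:
#   import jax
#   jax.distributed.initialize(**kwargs)
```

The helper is kept free of a JAX import so the index handling can be checked without accelerator hardware.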
Related issues
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!