
Integration: KAI Scheduler #3886

Open
wants to merge 10 commits into master

Conversation

EkinKarabulut

Why are these changes needed?

This adds an integration with the KAI Scheduler.

What has been done?

  • Integrate KAI Scheduler as a supported batch scheduler.
  • Add the KAI Scheduler plugin integration.
  • Provide sample YAMLs demonstrating KAI integration, including GPU sharing and gang scheduling (a minimal sketch of the KAI-specific fields follows this list).
  • Update Helm chart values and documentation.
  • Add tests for KAI Scheduler functionality.
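
For illustration only, here is a minimal sketch of the KAI-specific fields, assuming the team-a queue and the shared-gpu worker group that appear later in this thread; the sample YAMLs in this PR are the authoritative reference and may arrange these fields differently:

workerGroupSpecs:
  - groupName: shared-gpu
    replicas: 2
    template:
      metadata:
        labels:
          # KAI Scheduler queue that the Ray worker pods are submitted to
          kai.scheduler/queue: team-a
        annotations:
          # KAI GPU sharing: each worker pod is granted half a GPU via time slicing
          gpu-fraction: "0.5"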

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests

@kevin85421
Member

@EkinKarabulut would you mind fixing the conflict? Thanks.

Contributor

@troychiu troychiu left a comment

This is awesome. The example works great. Thank you for the contribution!
Just one small comment: do you mind running pre-commit (pre-commit run)?

@troychiu
Contributor

@kevin85421 PTAL!

Member

@kevin85421 kevin85421 left a comment

Overall LG!

template:
  metadata:
    annotations:
      gpu-fraction: "0.5"
Member

What does this mean? Are you using DRA to mount the same GPU to two different Pods?

Additionally, do we need to specify GPUs in the resource requests and limits? If not, KubeRay won’t pass GPU information to Ray, and Ray will be unable to map physical GPU resources in Kubernetes to logical resources within Ray.

Member

Can you add comments for the KAI Scheduler–specific configuration so that users can understand what this YAML is for?

Author

The example uses KAI Scheduler's native GPU sharing feature, which works through time slicing. I made this clear in the comments in the changes I just pushed.

We do not need to specify it: when using gpu-fraction, KAI Scheduler manages the GPU allocation internally.

I have now added comments to the YAML files explaining the KAI-specific configuration. Let me know what you think.
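
As a sketch of that point (field values taken from this example, illustrative only), the worker container carries no nvidia.com/gpu request or limit; the gpu-fraction annotation is the only GPU request KAI Scheduler sees, and the device is exposed to the container through NVIDIA_VISIBLE_DEVICES from a ConfigMap, as the kubectl describe output later in this thread shows:

template:
  metadata:
    annotations:
      gpu-fraction: "0.5"   # KAI allocates half a GPU to this pod
  spec:
    containers:
      - name: worker
        resources:
          limits:
            cpu: "1"
            memory: 2Gi     # note: no nvidia.com/gpu entry
          requests:
            cpu: "1"
            memory: 2Gi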

Member

> We do not need to specify it: when using gpu-fraction, KAI Scheduler manages the GPU allocation internally.

Can you try testing whether Ray tasks or actors are actually using the GPUs? Since the CR doesn't specify nvidia.com/gpu, KubeRay doesn't automatically map physical resources to Ray's logical resources. You may need to specify num-gpus in rayStartParams.

import ray

@ray.remote(num_gpus=0.5)
def f():
    # Return the GPU IDs Ray assigned to this task to confirm it can see the GPU.
    return ray.get_gpu_ids()

ref = f.remote()
print(ray.get(ref))
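
For reference, a hypothetical sketch of where that setting would live in the RayCluster spec; the rayStartParams key mirrors the ray start flag, and the right value to use with a shared GPU is left open in this thread:

workerGroupSpecs:
  - groupName: shared-gpu
    rayStartParams:
      # Hypothetical: KubeRay would forward this as "ray start --num-gpus=1",
      # so each worker advertises one logical GPU to Ray.
      num-gpus: "1"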

Author

Thanks for the pointer. My tests showed it getting allocated, but the test cluster has since been removed, so I have no quick way to try it again right now. I will try again later. If you have a way to test it in the meantime, please let me know if you find anything.

Author

I tested this now.

It's working fine. Ray tasks can access GPUs and the workers are sharing the GPU correctly.
Both worker pods see the same Tesla T4:

=== Worker Pod 1 GPU Visibility ===
Defaulted container "worker" out of: worker, wait-gcs-ready (init)
GPU 0: Tesla T4 (UUID: GPU-82b4fb2e-d25b-0fb8-480c-7e61e49760f3)

=== Worker Pod 2 GPU Visibility ===
Defaulted container "worker" out of: worker, wait-gcs-ready (init)
GPU 0: Tesla T4 (UUID: GPU-82b4fb2e-d25b-0fb8-480c-7e61e49760f3)

The test pattern you suggested also works.

One quirk I noticed: Ray shows 2.0 GPUs in total (1 per worker) instead of recognizing the fractional allocations. I think this happens because, as you pointed out, without nvidia.com/gpu in the resource limits, Ray just auto-detects the physical GPU on each worker node; correct me if I am wrong. But the actual sharing works fine since KAI Scheduler handles it.
I think this is fine to document as expected behavior; users need to manage GPU memory anyway: if they request 0.5 of a GPU from KAI in their deployment, they should not use more than that in their Ray workloads. Let me know what you think.

Contributor

If GPU sharing is achieved by time slicing, does that mean each worker sees the entire GPU as its own? Is that why Ray shows 2.0 GPUs?

@kevin85421
Member

@EkinKarabulut would you mind fixing the lint issue? Thanks! You can install pre-commit.

memory: "2Gi"
# ---- Two workers share one GPU (0.5 each) ----
workerGroupSpecs:
- groupName: shared-gpu
Member

Could you share what the Pod looks like after it's created, using kubectl describe pod ...?

Author

Here we go:



Name:             raycluster-half-gpu-shared-gpu-worker-6sx5d
Namespace:        default
Priority:         0
Service Account:  default
Node:             ip-xxxxx
Start Time:       Wed, 06 Aug 2025 21:01:54 +0200
Labels:           app.kubernetes.io/created-by=kuberay-operator
                  app.kubernetes.io/name=kuberay
                  kai.scheduler/queue=team-a
                  ray.io/cluster=raycluster-half-gpu
                  ray.io/group=shared-gpu
                  ray.io/identifier=raycluster-half-gpu-worker
                  ray.io/is-ray-node=yes
                  ray.io/node-type=worker
                  runai-gpu-group=518b1881-bd3c-4593-9bf3-2e59e98d6cb9
Annotations:      gpu-fraction: 0.5
                  pod-group-name: pg-raycluster-half-gpu-b1ee6048-1369-4ee3-a5a5-a66a377e769f
                  received-resource-type: Fraction
                  runai/shared-gpu-configmap: raycluster-half-gpu-6hl5xvs-shared-gpu
Status:           Running
IP:               xxxxxx
IPs:
  IP:           xxxxxx
Controlled By:  RayCluster/raycluster-half-gpu
Init Containers:
  wait-gcs-ready:
    Container ID:  containerd://27bf1b6c4f5723594b77658697c8a9be3bf9f72579e2f230e2e8ae28d2d74459
    Image:         rayproject/ray:2.46.0
    Image ID:      docker.io/rayproject/ray@sha256:764d7d4bf276143fac2fe322fe41593bb36bbd4dbe7fe9a2d94b67acb736eae3
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      --
    Args:

                            SECONDS=0
                            while true; do
                              if (( SECONDS <= 120 )); then
                                if ray health-check --address raycluster-half-gpu-head-svc.default.svc.cluster.local:6379 > /dev/null 2>&1; then
                                  echo "GCS is ready."
                                  break
                                fi
                                echo "$SECONDS seconds elapsed: Waiting for GCS to be ready."
                              else
                                if ray health-check --address raycluster-half-gpu-head-svc.default.svc.cluster.local:6379; then
                                  echo "GCS is ready. Any error messages above can be safely ignored."
                                  break
                                fi
                                echo "$SECONDS seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md."
                              fi
                              sleep 5
                            done

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 06 Aug 2025 21:01:55 +0200
      Finished:     Wed, 06 Aug 2025 21:02:20 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     200m
      memory:  256Mi
    Requests:
      cpu:     200m
      memory:  256Mi
    Environment:
      FQ_RAY_IP:  raycluster-half-gpu-head-svc.default.svc.cluster.local
      RAY_IP:     raycluster-half-gpu-head-svc
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5wz6g (ro)
Containers:
  worker:
    Container ID:  containerd://ae2297c4dd07bfb89a4a2795915cd0b7d8aadd0aacd84f8859c448ab95927f86
    Image:         rayproject/ray:2.46.0
    Image ID:      docker.io/rayproject/ray@sha256:764d7d4bf276143fac2fe322fe41593bb36bbd4dbe7fe9a2d94b67acb736eae3
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
      -c
      --
    Args:
      ulimit -n 65536; ray start  --address=raycluster-half-gpu-head-svc.default.svc.cluster.local:6379  --block  --dashboard-agent-listen-port=52365  --memory=2147483648  --metrics-export-port=8080  --num-cpus=1
    State:          Running
      Started:      Wed, 06 Aug 2025 21:02:21 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:      1
      memory:   2Gi
    Liveness:   exec [bash -c wget --tries 1 -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=30s timeout=2s period=5s #success=1 #failure=120
    Readiness:  exec [bash -c wget --tries 1 -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=10s timeout=2s period=5s #success=1 #failure=10
    Environment Variables from:
      raycluster-half-gpu-6hl5xvs-shared-gpu-0-evar  ConfigMap  Optional: false
    Environment:
      FQ_RAY_IP:                            raycluster-half-gpu-head-svc.default.svc.cluster.local
      RAY_IP:                               raycluster-half-gpu-head-svc
      RAY_CLUSTER_NAME:                      (v1:metadata.labels['ray.io/cluster'])
      RAY_CLOUD_INSTANCE_ID:                raycluster-half-gpu-shared-gpu-worker-6sx5d (v1:metadata.name)
      RAY_NODE_TYPE_NAME:                    (v1:metadata.labels['ray.io/group'])
      KUBERAY_GEN_RAY_START_CMD:            ray start  --address=raycluster-half-gpu-head-svc.default.svc.cluster.local:6379  --block  --dashboard-agent-listen-port=52365  --memory=2147483648  --metrics-export-port=8080  --num-cpus=1
      RAY_PORT:                             6379
      RAY_ADDRESS:                          raycluster-half-gpu-head-svc.default.svc.cluster.local:6379
      RAY_USAGE_STATS_KUBERAY_IN_USE:       1
      RAY_DASHBOARD_ENABLE_K8S_DISK_USAGE:  1
      NVIDIA_VISIBLE_DEVICES:               <set to the key 'NVIDIA_VISIBLE_DEVICES' of config map 'raycluster-half-gpu-6hl5xvs-shared-gpu-0'>  Optional: false
      RUNAI_NUM_OF_GPUS:                    <set to the key 'RUNAI_NUM_OF_GPUS' of config map 'raycluster-half-gpu-6hl5xvs-shared-gpu-0'>       Optional: false
    Mounts:
      /dev/shm from shared-mem (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5wz6g (ro)
Conditions:
  Type              Status
  PodBound          True
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  shared-mem:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  2Gi
  kube-api-access-5wz6g:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
  raycluster-half-gpu-6hl5xvs-shared-gpu-0-vol:
    Type:        ConfigMap (a volume populated by a ConfigMap)
    Name:        raycluster-half-gpu-6hl5xvs-shared-gpu-0
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From           Message
  ----    ------     ----  ----           -------
  Normal  Scheduled  100s  kai-scheduler  Successfully assigned pod default/raycluster-half-gpu-shared-gpu-worker-6sx5d to node ip-xxxxxxx at node-pool default
  Normal  Bound      100s  binder         Pod bound successfully to node ip-xxxxxxx
  Normal  Pulled     99s   kubelet        Container image "rayproject/ray:2.46.0" already present on machine
  Normal  Created    99s   kubelet        Created container wait-gcs-ready
  Normal  Started    99s   kubelet        Started container wait-gcs-ready
  Normal  Pulled     73s   kubelet        Container image "rayproject/ray:2.46.0" already present on machine
  Normal  Created    73s   kubelet        Created container worker
  Normal  Started    73s   kubelet        Started container worker

Contributor

@owenowenisme owenowenisme left a comment

LGTM!

Contributor

@400Ping 400Ping left a comment

LGTM!
