
Integration: KAI Scheduler #3886

Open
wants to merge 10 commits into master

Conversation

EkinKarabulut

Why are these changes needed?

This adds an integration with the KAI Scheduler.

What has been done?

  • Integrate KAI Scheduler as a supported batch scheduler.
  • Add the KAI Scheduler plugin integration.
  • Provide sample YAMLs demonstrating KAI integration, including GPU sharing and gang scheduling (a minimal sketch of the KAI-specific fields follows this list).
  • Update Helm chart values and documentation.
  • Add tests for KAI Scheduler functionality.
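
For illustration only, here is a minimal sketch of the KAI-specific fields, assuming the team-a queue and the shared-gpu worker group that appear later in this thread; the sample YAMLs in this PR are the authoritative reference and may arrange these fields differently:

workerGroupSpecs:
  - groupName: shared-gpu
    replicas: 2
    template:
      metadata:
        labels:
          # KAI Scheduler queue that the Ray worker pods are submitted to
          kai.scheduler/queue: team-a
        annotations:
          # KAI GPU sharing: each worker pod is granted half a GPU via time slicing
          gpu-fraction: "0.5"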

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests

@kevin85421
Member

@EkinKarabulut would you mind fixing the conflict? Thanks.

Contributor

@troychiu troychiu left a comment

This is awesome. The example works great. Thank you for the contribution!
Just one small comment: do you mind running pre-commit (pre-commit run)?

@troychiu
Contributor

@kevin85421 PTAL!

Member

@kevin85421 kevin85421 left a comment

Overall LG!

template:
  metadata:
    annotations:
      gpu-fraction: "0.5"
Member

What does this mean? Are you using DRA to mount the same GPU to two different Pods?

Additionally, do we need to specify GPUs in the resource requests and limits? If not, KubeRay won’t pass GPU information to Ray, and Ray will be unable to map physical GPU resources in Kubernetes to logical resources within Ray.

Member

Can you add comments for the KAI Scheduler–specific configuration so that users can understand what this YAML is for?

Author

The example uses KAI Scheduler's native GPU sharing feature, which works through time slicing. I made this clear in the comments in the changes I just pushed.

We do not need to specify it: when using gpu-fraction, KAI Scheduler manages the GPU allocation internally.

I have now added comments to the YAML files explaining the KAI-specific configuration. Let me know what you think.
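
As a sketch of that point (field values taken from this example, illustrative only), the worker container carries no nvidia.com/gpu request or limit; the gpu-fraction annotation is the only GPU request KAI Scheduler sees, and the device is exposed to the container through NVIDIA_VISIBLE_DEVICES from a ConfigMap, as the kubectl describe output later in this thread shows:

template:
  metadata:
    annotations:
      gpu-fraction: "0.5"   # KAI allocates half a GPU to this pod
  spec:
    containers:
      - name: worker
        resources:
          limits:
            cpu: "1"
            memory: 2Gi     # note: no nvidia.com/gpu entry
          requests:
            cpu: "1"
            memory: 2Gi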

Member

> We do not need to specify it: when using gpu-fraction, KAI Scheduler manages the GPU allocation internally.

Can you try testing whether Ray tasks or actors are actually using the GPUs? Since the CR doesn't specify nvidia.com/gpu, KubeRay doesn't automatically map physical resources to Ray's logical resources. You may need to specify num-gpus in rayStartParams.

import ray

@ray.remote(num_gpus=0.5)
def f():
    # Return the GPU IDs Ray assigned to this task to confirm it can see the GPU.
    return ray.get_gpu_ids()

ref = f.remote()
print(ray.get(ref))
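
For reference, a hypothetical sketch of where that setting would live in the RayCluster spec; the rayStartParams key mirrors the ray start flag, and the right value to use with a shared GPU is left open in this thread:

workerGroupSpecs:
  - groupName: shared-gpu
    rayStartParams:
      # Hypothetical: KubeRay would forward this as "ray start --num-gpus=1",
      # so each worker advertises one logical GPU to Ray.
      num-gpus: "1"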

Author

Thanks for the pointer. My tests showed it getting allocated, but the test cluster has since been removed, so I have no quick way to try it again right now. I will try again later. If you have a way to test it in the meantime, please let me know if you find anything.

Author

I tested this now.

It's working fine. Ray tasks can access GPUs and the workers are sharing the GPU correctly.
Both worker pods see the same Tesla T4:

=== Worker Pod 1 GPU Visibility ===
Defaulted container "worker" out of: worker, wait-gcs-ready (init)
GPU 0: Tesla T4 (UUID: GPU-82b4fb2e-d25b-0fb8-480c-7e61e49760f3)

=== Worker Pod 2 GPU Visibility ===
Defaulted container "worker" out of: worker, wait-gcs-ready (init)
GPU 0: Tesla T4 (UUID: GPU-82b4fb2e-d25b-0fb8-480c-7e61e49760f3)

The test pattern you suggested also works.

One quirk I noticed: Ray shows 2.0 GPUs in total (1 per worker) instead of recognizing the fractional allocations. I think this happens because, as you pointed out, without nvidia.com/gpu in the resource limits, Ray just auto-detects the physical GPU on each worker node; correct me if I am wrong. But the actual sharing works fine since KAI Scheduler handles it.
I think this is fine to document as expected behavior; users need to manage GPU memory anyway: if they request 0.5 of a GPU from KAI in their deployment, they should not use more than that in their Ray workloads. Let me know what you think.

Contributor

If GPU sharing is achieved by time slicing, does that mean each worker sees the entire GPU as its own? Is that why Ray shows 2.0 GPUs?

@kevin85421
Member

@EkinKarabulut would you mind fixing the lint issue? Thanks! You can install pre-commit.

memory: "2Gi"
# ---- Two workers share one GPU (0.5 each) ----
workerGroupSpecs:
- groupName: shared-gpu
Member

Could you share what the Pod looks like after it's created, using kubectl describe pod ...?

Author

Here we go:



Name:             raycluster-half-gpu-shared-gpu-worker-6sx5d
Namespace:        default
Priority:         0
Service Account:  default
Node:             ip-xxxxx
Start Time:       Wed, 06 Aug 2025 21:01:54 +0200
Labels:           app.kubernetes.io/created-by=kuberay-operator
                  app.kubernetes.io/name=kuberay
                  kai.scheduler/queue=team-a
                  ray.io/cluster=raycluster-half-gpu
                  ray.io/group=shared-gpu
                  ray.io/identifier=raycluster-half-gpu-worker
                  ray.io/is-ray-node=yes
                  ray.io/node-type=worker
                  runai-gpu-group=518b1881-bd3c-4593-9bf3-2e59e98d6cb9
Annotations:      gpu-fraction: 0.5
                  pod-group-name: pg-raycluster-half-gpu-b1ee6048-1369-4ee3-a5a5-a66a377e769f
                  received-resource-type: Fraction
                  runai/shared-gpu-configmap: raycluster-half-gpu-6hl5xvs-shared-gpu
Status:           Running
IP:               xxxxxx
IPs:
  IP:           xxxxxx
Controlled By:  RayCluster/raycluster-half-gpu
Init Containers:
  wait-gcs-ready:
    Container ID:  containerd://27bf1b6c4f5723594b77658697c8a9be3bf9f72579e2f230e2e8ae28d2d74459
    Image:         rayproject/ray:2.46.0
    Image ID:      docker.io/rayproject/ray@sha256:764d7d4bf276143fac2fe322fe41593bb36bbd4dbe7fe9a2d94b67acb736eae3
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      --
    Args:

                            SECONDS=0
                            while true; do
                              if (( SECONDS <= 120 )); then
                                if ray health-check --address raycluster-half-gpu-head-svc.default.svc.cluster.local:6379 > /dev/null 2>&1; then
                                  echo "GCS is ready."
                                  break
                                fi
                                echo "$SECONDS seconds elapsed: Waiting for GCS to be ready."
                              else
                                if ray health-check --address raycluster-half-gpu-head-svc.default.svc.cluster.local:6379; then
                                  echo "GCS is ready. Any error messages above can be safely ignored."
                                  break
                                fi
                                echo "$SECONDS seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md."
                              fi
                              sleep 5
                            done

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 06 Aug 2025 21:01:55 +0200
      Finished:     Wed, 06 Aug 2025 21:02:20 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     200m
      memory:  256Mi
    Requests:
      cpu:     200m
      memory:  256Mi
    Environment:
      FQ_RAY_IP:  raycluster-half-gpu-head-svc.default.svc.cluster.local
      RAY_IP:     raycluster-half-gpu-head-svc
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5wz6g (ro)
Containers:
  worker:
    Container ID:  containerd://ae2297c4dd07bfb89a4a2795915cd0b7d8aadd0aacd84f8859c448ab95927f86
    Image:         rayproject/ray:2.46.0
    Image ID:      docker.io/rayproject/ray@sha256:764d7d4bf276143fac2fe322fe41593bb36bbd4dbe7fe9a2d94b67acb736eae3
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
      -c
      --
    Args:
      ulimit -n 65536; ray start  --address=raycluster-half-gpu-head-svc.default.svc.cluster.local:6379  --block  --dashboard-agent-listen-port=52365  --memory=2147483648  --metrics-export-port=8080  --num-cpus=1
    State:          Running
      Started:      Wed, 06 Aug 2025 21:02:21 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:      1
      memory:   2Gi
    Liveness:   exec [bash -c wget --tries 1 -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=30s timeout=2s period=5s #success=1 #failure=120
    Readiness:  exec [bash -c wget --tries 1 -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=10s timeout=2s period=5s #success=1 #failure=10
    Environment Variables from:
      raycluster-half-gpu-6hl5xvs-shared-gpu-0-evar  ConfigMap  Optional: false
    Environment:
      FQ_RAY_IP:                            raycluster-half-gpu-head-svc.default.svc.cluster.local
      RAY_IP:                               raycluster-half-gpu-head-svc
      RAY_CLUSTER_NAME:                      (v1:metadata.labels['ray.io/cluster'])
      RAY_CLOUD_INSTANCE_ID:                raycluster-half-gpu-shared-gpu-worker-6sx5d (v1:metadata.name)
      RAY_NODE_TYPE_NAME:                    (v1:metadata.labels['ray.io/group'])
      KUBERAY_GEN_RAY_START_CMD:            ray start  --address=raycluster-half-gpu-head-svc.default.svc.cluster.local:6379  --block  --dashboard-agent-listen-port=52365  --memory=2147483648  --metrics-export-port=8080  --num-cpus=1
      RAY_PORT:                             6379
      RAY_ADDRESS:                          raycluster-half-gpu-head-svc.default.svc.cluster.local:6379
      RAY_USAGE_STATS_KUBERAY_IN_USE:       1
      RAY_DASHBOARD_ENABLE_K8S_DISK_USAGE:  1
      NVIDIA_VISIBLE_DEVICES:               <set to the key 'NVIDIA_VISIBLE_DEVICES' of config map 'raycluster-half-gpu-6hl5xvs-shared-gpu-0'>  Optional: false
      RUNAI_NUM_OF_GPUS:                    <set to the key 'RUNAI_NUM_OF_GPUS' of config map 'raycluster-half-gpu-6hl5xvs-shared-gpu-0'>       Optional: false
    Mounts:
      /dev/shm from shared-mem (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5wz6g (ro)
Conditions:
  Type              Status
  PodBound          True
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  shared-mem:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  2Gi
  kube-api-access-5wz6g:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
  raycluster-half-gpu-6hl5xvs-shared-gpu-0-vol:
    Type:        ConfigMap (a volume populated by a ConfigMap)
    Name:        raycluster-half-gpu-6hl5xvs-shared-gpu-0
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From           Message
  ----    ------     ----  ----           -------
  Normal  Scheduled  100s  kai-scheduler  Successfully assigned pod default/raycluster-half-gpu-shared-gpu-worker-6sx5d to node ip-xxxxxxx at node-pool default
  Normal  Bound      100s  binder         Pod bound successfully to node ip-xxxxxxx
  Normal  Pulled     99s   kubelet        Container image "rayproject/ray:2.46.0" already present on machine
  Normal  Created    99s   kubelet        Created container wait-gcs-ready
  Normal  Started    99s   kubelet        Started container wait-gcs-ready
  Normal  Pulled     73s   kubelet        Container image "rayproject/ray:2.46.0" already present on machine
  Normal  Created    73s   kubelet        Created container worker
  Normal  Started    73s   kubelet        Started container worker

Contributor

@owenowenisme owenowenisme left a comment

LGTM!

Contributor

@400Ping 400Ping left a comment

LGTM!
