Integration: KAI Scheduler #3886
base: master
Conversation
@EkinKarabulut would you mind fixing the conflict? Thanks.
…f explanation on docs (compare 05fda72 to 1a63219)
ray-operator/config/samples/ray-cluster.kai-scheduler-queues.yaml (outdated review thread, resolved)
This is awesome. The example works great. Thank you for the contribution!
Just one small comment: do you mind running pre-commit (pre-commit run)?
@kevin85421 PTAL!
Overall LG!
template:
  metadata:
    annotations:
      gpu-fraction: "0.5"
What does this mean? Are you using DRA to mount the same GPU to two different Pods?
Additionally, do we need to specify GPUs in the resource requests and limits? If not, KubeRay won’t pass GPU information to Ray, and Ray will be unable to map physical GPU resources in Kubernetes to logical resources within Ray.
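For reference, this is roughly the usual whole-GPU pattern that KubeRay translates into Ray's logical resources (the group and container names here are illustrative, not taken from this PR):
workerGroupSpecs:
  - groupName: gpu-workers
    template:
      spec:
        containers:
          - name: ray-worker
            image: rayproject/ray:2.46.0
            resources:
              limits:
                nvidia.com/gpu: 1   # KubeRay reads this limit and sets the worker's logical GPU count in Ray
Without an nvidia.com/gpu limit like this, that mapping does not happen automatically.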
Can you add comments for the KAI Scheduler–specific configuration so that users can understand what this YAML is for?
The example uses KAI Scheduler's native GPU sharing feature, which works through time slicing. I made this clear in the comments in the new changes I pushed.
We do not need to specify it - when using gpu-fraction, KAI Scheduler manages the GPU allocation internally.
I added comments to the YAML files to explain the KAI-specific configuration. Let me know what you think.
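Concretely, the KAI-specific part of the shared-GPU worker group looks roughly like this (replica count and exact field layout are approximate; in the sample the queue label may instead be set on the RayCluster and propagated to the Pods):
workerGroupSpecs:
  - groupName: shared-gpu
    replicas: 2
    template:
      metadata:
        labels:
          kai.scheduler/queue: team-a    # KAI queue the workers are submitted to
        annotations:
          gpu-fraction: "0.5"            # KAI time-slices one physical GPU between the two workers
      spec:
        containers:
          - name: worker
            resources:
              limits:
                cpu: "1"
                memory: "2Gi"            # no nvidia.com/gpu limit; KAI handles the GPU allocation itself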
We do not need to specify it - when using gpu-fraction, KAI Scheduler manages the GPU allocation internally
Can you try testing whether Ray tasks or actors are actually using the GPUs? Since the CR doesn't specify nvidia.com/gpu, KubeRay doesn't automatically map physical resources to Ray's logical resources. You may need to specify num-gpus in rayStartParams.
import ray

@ray.remote(num_gpus=0.5)
def f():
    # check which GPU Ray assigned to this task
    return ray.get_gpu_ids()

ref = f.remote()
print(ray.get(ref))
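If the mapping turns out to be needed, a rough sketch of that rayStartParams override on the shared-gpu group could look like the following (whether to advertise the whole device or a fraction here, and what ray start accepts, is worth verifying against the Ray version in use):
workerGroupSpecs:
  - groupName: shared-gpu
    rayStartParams:
      num-gpus: "1"   # advertise the time-sliced GPU to Ray so tasks can request fractions, e.g. num_gpus=0.5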
Thanks for the pointer. My tests showed it getting allocated, but the test cluster has since been removed, so I have no quick way to try it again right now. I will try again later. If you have a way to test it in the meantime, please let me know if you find anything.
I tested this now.
It's working fine. Ray tasks can access GPUs and the workers are sharing the GPU correctly.
Both worker pods see the same Tesla T4:
=== Worker Pod 1 GPU Visibility ===
Defaulted container "worker" out of: worker, wait-gcs-ready (init)
GPU 0: Tesla T4 (UUID: GPU-82b4fb2e-d25b-0fb8-480c-7e61e49760f3)
=== Worker Pod 2 GPU Visibility ===
Defaulted container "worker" out of: worker, wait-gcs-ready (init)
GPU 0: Tesla T4 (UUID: GPU-82b4fb2e-d25b-0fb8-480c-7e61e49760f3)
The test pattern you suggested also works.
One quirk I noticed: Ray shows 2.0 GPUs total (1 per worker) instead of recognizing the fractional allocations. I think this happens because, as you pointed out, without nvidia.com/gpu in the resource limits, Ray just auto-detects the physical GPU on each worker node - correct me if I am wrong. But the actual sharing works fine since KAI Scheduler handles it.
I think this is fine to document as expected behavior - users need to manage their memory anyway: if they request 0.5 of a GPU from KAI in their deployment, they should not use more in their Ray workloads. Let me know what you think.
If GPU sharing is achieved through time slicing, does that mean each worker thinks it owns the entire GPU? Is that why Ray shows 2.0 GPUs?
ray-operator/config/samples/ray-cluster.kai-scheduler-queues.yaml (outdated review thread, resolved)
ray-operator/config/samples/ray-cluster.kai-scheduler-queues.yaml (outdated review thread, resolved)
ray-operator/controllers/ray/batchscheduler/kai-scheduler/kai_scheduler.go (review thread, resolved)
@EkinKarabulut would you mind fixing the lint issue? Thanks! You can install …
ray-operator/controllers/ray/batchscheduler/kai-scheduler/kai_scheduler.go (outdated review thread, resolved)
ray-operator/controllers/ray/batchscheduler/kai-scheduler/kai_scheduler.go (review thread, resolved)
memory: "2Gi" | ||
# ---- Two workers share one GPU (0.5 each) ---- | ||
workerGroupSpecs: | ||
- groupName: shared-gpu |
Could you share what the Pod looks like after it's created, using kubectl describe pod ...?
Here we go:
Name: raycluster-half-gpu-shared-gpu-worker-6sx5d
Namespace: default
Priority: 0
Service Account: default
Node: ip-xxxxx
Start Time: Wed, 06 Aug 2025 21:01:54 +0200
Labels: app.kubernetes.io/created-by=kuberay-operator
app.kubernetes.io/name=kuberay
kai.scheduler/queue=team-a
ray.io/cluster=raycluster-half-gpu
ray.io/group=shared-gpu
ray.io/identifier=raycluster-half-gpu-worker
ray.io/is-ray-node=yes
ray.io/node-type=worker
runai-gpu-group=518b1881-bd3c-4593-9bf3-2e59e98d6cb9
Annotations: gpu-fraction: 0.5
pod-group-name: pg-raycluster-half-gpu-b1ee6048-1369-4ee3-a5a5-a66a377e769f
received-resource-type: Fraction
runai/shared-gpu-configmap: raycluster-half-gpu-6hl5xvs-shared-gpu
Status: Running
IP: xxxxxx
IPs:
IP: xxxxxx
Controlled By: RayCluster/raycluster-half-gpu
Init Containers:
wait-gcs-ready:
Container ID: containerd://27bf1b6c4f5723594b77658697c8a9be3bf9f72579e2f230e2e8ae28d2d74459
Image: rayproject/ray:2.46.0
Image ID: docker.io/rayproject/ray@sha256:764d7d4bf276143fac2fe322fe41593bb36bbd4dbe7fe9a2d94b67acb736eae3
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
--
Args:
SECONDS=0
while true; do
if (( SECONDS <= 120 )); then
if ray health-check --address raycluster-half-gpu-head-svc.default.svc.cluster.local:6379 > /dev/null 2>&1; then
echo "GCS is ready."
break
fi
echo "$SECONDS seconds elapsed: Waiting for GCS to be ready."
else
if ray health-check --address raycluster-half-gpu-head-svc.default.svc.cluster.local:6379; then
echo "GCS is ready. Any error messages above can be safely ignored."
break
fi
echo "$SECONDS seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md."
fi
sleep 5
done
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 06 Aug 2025 21:01:55 +0200
Finished: Wed, 06 Aug 2025 21:02:20 +0200
Ready: True
Restart Count: 0
Limits:
cpu: 200m
memory: 256Mi
Requests:
cpu: 200m
memory: 256Mi
Environment:
FQ_RAY_IP: raycluster-half-gpu-head-svc.default.svc.cluster.local
RAY_IP: raycluster-half-gpu-head-svc
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5wz6g (ro)
Containers:
worker:
Container ID: containerd://ae2297c4dd07bfb89a4a2795915cd0b7d8aadd0aacd84f8859c448ab95927f86
Image: rayproject/ray:2.46.0
Image ID: docker.io/rayproject/ray@sha256:764d7d4bf276143fac2fe322fe41593bb36bbd4dbe7fe9a2d94b67acb736eae3
Port: 8080/TCP
Host Port: 0/TCP
Command:
/bin/bash
-c
--
Args:
ulimit -n 65536; ray start --address=raycluster-half-gpu-head-svc.default.svc.cluster.local:6379 --block --dashboard-agent-listen-port=52365 --memory=2147483648 --metrics-export-port=8080 --num-cpus=1
State: Running
Started: Wed, 06 Aug 2025 21:02:21 +0200
Ready: True
Restart Count: 0
Limits:
cpu: 1
memory: 2Gi
Requests:
cpu: 1
memory: 2Gi
Liveness: exec [bash -c wget --tries 1 -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=30s timeout=2s period=5s #success=1 #failure=120
Readiness: exec [bash -c wget --tries 1 -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=10s timeout=2s period=5s #success=1 #failure=10
Environment Variables from:
raycluster-half-gpu-6hl5xvs-shared-gpu-0-evar ConfigMap Optional: false
Environment:
FQ_RAY_IP: raycluster-half-gpu-head-svc.default.svc.cluster.local
RAY_IP: raycluster-half-gpu-head-svc
RAY_CLUSTER_NAME: (v1:metadata.labels['ray.io/cluster'])
RAY_CLOUD_INSTANCE_ID: raycluster-half-gpu-shared-gpu-worker-6sx5d (v1:metadata.name)
RAY_NODE_TYPE_NAME: (v1:metadata.labels['ray.io/group'])
KUBERAY_GEN_RAY_START_CMD: ray start --address=raycluster-half-gpu-head-svc.default.svc.cluster.local:6379 --block --dashboard-agent-listen-port=52365 --memory=2147483648 --metrics-export-port=8080 --num-cpus=1
RAY_PORT: 6379
RAY_ADDRESS: raycluster-half-gpu-head-svc.default.svc.cluster.local:6379
RAY_USAGE_STATS_KUBERAY_IN_USE: 1
RAY_DASHBOARD_ENABLE_K8S_DISK_USAGE: 1
NVIDIA_VISIBLE_DEVICES: <set to the key 'NVIDIA_VISIBLE_DEVICES' of config map 'raycluster-half-gpu-6hl5xvs-shared-gpu-0'> Optional: false
RUNAI_NUM_OF_GPUS: <set to the key 'RUNAI_NUM_OF_GPUS' of config map 'raycluster-half-gpu-6hl5xvs-shared-gpu-0'> Optional: false
Mounts:
/dev/shm from shared-mem (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5wz6g (ro)
Conditions:
Type Status
PodBound True
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
shared-mem:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: 2Gi
kube-api-access-5wz6g:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
raycluster-half-gpu-6hl5xvs-shared-gpu-0-vol:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: raycluster-half-gpu-6hl5xvs-shared-gpu-0
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 100s kai-scheduler Successfully assigned pod default/raycluster-half-gpu-shared-gpu-worker-6sx5d to node ip-xxxxxxx at node-pool default
Normal Bound 100s binder Pod bound successfully to node ip-xxxxxxx
Normal Pulled 99s kubelet Container image "rayproject/ray:2.46.0" already present on machine
Normal Created 99s kubelet Created container wait-gcs-ready
Normal Started 99s kubelet Started container wait-gcs-ready
Normal Pulled 73s kubelet Container image "rayproject/ray:2.46.0" already present on machine
Normal Created 73s kubelet Created container worker
Normal Started 73s kubelet Started container worker
LGTM!
LGTM!
Why are these changes needed?
This adds the KAI Scheduler integration to KubeRay.
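At a high level, a RayCluster opts into KAI scheduling by carrying the KAI queue label, which ends up on the head and worker Pods (visible in the kubectl describe output above). The cluster name and queue below are taken from the sample discussed in this PR; the operator-side switch that enables the KAI batch-scheduler plugin is not shown here.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-half-gpu
  labels:
    kai.scheduler/queue: team-a   # KAI Scheduler schedules the cluster's Pods against this queue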
What has been done?
Checks