11# Cluster Setup
22
33The cluster setup installs and configures the following components:
4- + Coscheduler
4+ + Scheduler Plugins
55+ Kubeflow Training Operator
66+ KubeRay
77+ Kueue
@@ -16,7 +16,13 @@ Create `default-priority`, `high-priority`, and `low-priority` priority classes:
1616kubectl apply -f setup.k8s/mlbatch-priorities.yaml
1717```
1818
19- ## Coscheduler
19+ ## Scheduler Plugins
20+
21+ MLBatch uses Kubernetes Scheduler Plugins to ensure gang scheduling of
22+ multi-Pod workloads and to pack `Pods` onto `Nodes` to reduce GPU fragmentation.
23+ Two options are described below: Coscheduler and Sakkara. You should pick and install one of them
24+ as a secondary scheduler for your cluster.
25+ ### Coscheduler
2026
2127Install Coscheduler v0.28.9 as a secondary scheduler and configure packing:
2228``` sh
@@ -30,6 +36,17 @@ kubectl patch deployment -n scheduler-plugins --type=json --patch-file setup.k8s
3036kubectl patch deployment -n scheduler-plugins --type=json --patch-file setup.k8s/coscheduler-priority-patch.yaml scheduler-plugins-scheduler
3137```
3238
39+ ### Sakkara
40+
41+ [Sakkara](https://github.com/atantawi/scheduler-plugins/tree/sakkara) is an experimental
42+ new scheduler plugin with advanced support for topology-aware scheduling.
43+
44+ Install Sakkara as a secondary scheduler:
45+ ``` sh
46+ helm install sakkara-scheduler --namespace sakkara-scheduler --create-namespace mlbatch/sakkara-scheduler
47+ ```
48+ Optionally, create a config map capturing your cluster's topology as described in the [Sakkara documentation](https://github.com/atantawi/sakkara-deploy/tree/main?tab=readme-ov-file#cluster-topology). This step is recommended for production clusters. If the config map is not present, Sakkara defaults to a single-level hierarchy containing the Nodes of the cluster.
49+
3350## Install Operators
3451
3552Create the mlbatch-system namespace
@@ -38,8 +55,14 @@ kubectl create namespace mlbatch-system
3855```
3956
4057Install the Kubeflow Training Operator
58+
59+ If you are using Coscheduler, run:
60+ ``` sh
61+ kubectl apply --server-side -k setup.k8s/training-operator/coscheduler
62+ ```
63+ If you are using Sakkara, run:
4164``` sh
42- kubectl apply --server-side -k setup.k8s/training-operator
65+ kubectl apply --server-side -k setup.k8s/training-operator/sakkara
4366```
4467
4568Install the KubeRay Operator
@@ -53,13 +76,19 @@ kubectl apply --server-side -k setup.k8s/kueue
5376```
5477
5578Install the AppWrapper Operator
79+ If you are using Coscheduler, run:
5680``` sh
57- kubectl apply --server-side -k setup.k8s/appwrapper
81+ kubectl apply --server-side -k setup.k8s/appwrapper/coscheduler
5882```
83+ If you are using Sakkara, run:
84+ ``` sh
85+ kubectl apply --server-side -k setup.k8s/appwrapper/sakkara
86+ ```
87+
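The Training Operator and AppWrapper overlays above differ only in the scheduler-specific subdirectory, so the choice can be made once and reused. The following is a minimal sketch, not part of MLBatch: the `SCHEDULER` variable and the `overlay_path` helper are hypothetical names introduced for illustration, and the script only prints the `kubectl` commands rather than running them.

```sh
#!/bin/sh
# Sketch: map the chosen secondary scheduler to the matching kustomize overlays.
# SCHEDULER and overlay_path are hypothetical names, not part of MLBatch.
SCHEDULER="${SCHEDULER:-coscheduler}"

overlay_path() {
  # $1 = component (training-operator or appwrapper), $2 = scheduler choice
  case "$2" in
    coscheduler|sakkara) printf 'setup.k8s/%s/%s\n' "$1" "$2" ;;
    *) echo "unknown scheduler: $2" >&2; return 1 ;;
  esac
}

for component in training-operator appwrapper; do
  path="$(overlay_path "$component" "$SCHEDULER")" || exit 1
  # Print the command instead of running it; drop the echo to apply for real.
  echo kubectl apply --server-side -k "$path"
done
```

Running the script with `SCHEDULER=sakkara` prints the two Sakkara variants of the commands shown above, which keeps both operators pointed at the same scheduler.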
5988The provided configuration differs from the default configuration of the
6089operators as follows:
6190- Kubeflow Training Operator:
62- - `gang-scheduler-name` is set to `scheduler-plugins-scheduler`,
91+ - `gang-scheduler-name` is set to either `scheduler-plugins-scheduler` or `sakkara-scheduler`,
6392- Kueue:
6493 - `batch/job` integration is disabled,
6594 - `manageJobsWithoutQueueName` is enabled and configured via `managedJobsNamespaceSelector` to be
@@ -70,7 +99,7 @@ operators as follows:
7099 - `enableClusterQueueResources` metrics is enabled,
71100- AppWrapper operator:
72101 - `userRBACAdmissionCheck` is disabled,
73- - `schedulerName` is set to `scheduler-plugins-scheduler`,
102+ - `schedulerName` is set to `scheduler-plugins-scheduler` or `sakkara-scheduler`,
74103 - `queueName` is set to `default-queue`,
75104- pod priorities, resource requests and limits have been adjusted.
76105
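The scheduler names in the list above are the values a Pod carries in the standard Kubernetes `spec.schedulerName` field when it is placed by a secondary scheduler. With the provided configuration the operators set this field, so the sketch below is only an illustration of what the setting amounts to. The helpers `scheduler_name` and `pod_manifest`, the Pod name `sample-pod`, and the choice of the `pause` image are assumptions made for the example.

```sh
#!/bin/sh
# Sketch: print a Pod manifest that targets the chosen secondary scheduler.
# scheduler_name, pod_manifest, and the Pod name "sample-pod" are hypothetical.

scheduler_name() {
  # Scheduler names as listed in the configuration differences above.
  case "$1" in
    coscheduler) echo "scheduler-plugins-scheduler" ;;
    sakkara)     echo "sakkara-scheduler" ;;
    *) echo "unknown scheduler: $1" >&2; return 1 ;;
  esac
}

pod_manifest() {
  name="$(scheduler_name "$1")" || return 1
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: sample-pod
spec:
  schedulerName: ${name}
  containers:
  - name: main
    image: registry.k8s.io/pause:3.9
EOF
}

# Print the manifest for the selected scheduler (defaults to coscheduler).
pod_manifest "${SCHEDULER:-coscheduler}"
```

The output can be piped to `kubectl apply --dry-run=client -f -` to validate the manifest without creating the Pod.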