 # MLBatch Tutorial
 
+MLBatch is the software stack we developed at IBM Research to facilitate the
+setup, administration, and use of Kubernetes clusters dedicated to batch AI/ML
+workloads. It leverages a number of community projects such as
+[Kueue](https://kueue.sigs.k8s.io), [Kubeflow
+Trainer](https://www.kubeflow.org/docs/components/training/),
+[KubeRay](https://docs.ray.io/en/latest/cluster/kubernetes/index.html), and
+[vLLM](https://docs.vllm.ai/en/latest/). It complements them with several
+open-source components born at IBM Research, including
+[AutoPilot](https://github.com/IBM/autopilot),
+[AppWrappers](https://project-codeflare.github.io/appwrapper/), and
+[Sakkara](https://github.com/atantawi/4986-kep-sakkara). MLBatch manages teams,
+queues, quotas, and resource allocation. It monitors key cluster components,
+detecting faults and, to a degree, automating fault recovery.
+
 In this tutorial, we walk through all the steps necessary to set up MLBatch on a
 Kubernetes cluster and run a few example workloads.
 - We configure persistent storage using
@@ -130,26 +144,26 @@ helm install scheduler-plugins -n scheduler-plugins --create-namespace \
     scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ \
     --set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"},{"args":{"permitWaitingTimeSeconds":300},"name":"Coscheduling"}]'
 
+# Patch scheduler-plugins pod priorities
+kubectl patch deployment -n scheduler-plugins --type=json \
+  --patch-file setup.k8s/scheduler-priority-patch.yaml scheduler-plugins-controller
+kubectl patch deployment -n scheduler-plugins --type=json \
+  --patch-file setup.k8s/scheduler-priority-patch.yaml scheduler-plugins-scheduler
+
 # Wait for scheduler-plugins pods to be ready
 while [[ $(kubectl get pods -n scheduler-plugins -o 'jsonpath={..status.conditions[?(@.type=="Ready")].status}' | tr ' ' '\n' | sort -u) != "True" ]]
 do
   echo -n "." && sleep 1;
 done
 echo ""
 
-# Patch scheduler-plugins pod priorities
-kubectl patch deployment -n scheduler-plugins --type=json \
-  --patch-file setup.k8s/scheduler-priority-patch.yaml scheduler-plugins-controller
-kubectl patch deployment -n scheduler-plugins --type=json \
-  --patch-file setup.k8s/scheduler-priority-patch.yaml scheduler-plugins-scheduler
-
 # Create mlbatch-system namespace
 kubectl create namespace mlbatch-system
 
 # Deploy Kubeflow training operator
 kubectl apply --server-side -k setup.k8s/training-operator/coscheduling
 
-# Deploy Kuberay
+# Deploy KubeRay
 kubectl apply --server-side -k setup.k8s/kuberay
 
 # Deploy Kueue
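The contents of `setup.k8s/scheduler-priority-patch.yaml` are not shown in this diff. Since `kubectl patch --type=json` takes a list of JSON-patch operations, the file plausibly looks something like the following sketch; the exact path and priority class are assumptions, not taken from the repository.

```yaml
# Hypothetical sketch of setup.k8s/scheduler-priority-patch.yaml (actual file
# not shown here): a JSON patch raising the scheduler pods to a system
# priority class so they are not evicted when the cluster is saturated.
- op: add
  path: /spec/template/spec/priorityClassName
  value: system-node-critical
```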
@@ -176,8 +190,9 @@ kubectl apply -f setup.k8s/default-flavor.yaml
 
 # Setup mlbatch-edit-role
 kubectl apply -f setup.k8s/mlbatch-edit-role.yaml
-
-# Create slack cluster queue with 8 GPUs
+```
+We reserve 8 GPUs out of 24 for MLBatch's slack queue.
+```sh
 kubectl apply -f- << EOF
 apiVersion: kueue.x-k8s.io/v1beta1
 kind: ClusterQueue
@@ -206,7 +221,6 @@ spec:
         nominalQuota: 100
 EOF
 ```
-We reserve 8 GPUs out of 24 for MLBatch's slack queue.
 
 </details>
 
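The point of the slack ClusterQueue is that GPUs parked there remain borrowable by team queues in the same cohort. As a hedged illustration only (the cohort and queue names below are assumptions; the actual specs are elided from this diff), a team's ClusterQueue that can dip into slack capacity might look like:

```yaml
# Hypothetical sketch (names are assumptions, not from the diff): a team
# ClusterQueue in the same cohort as the slack queue can borrow idle GPUs
# beyond its nominal quota, up to borrowingLimit.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-blue
spec:
  cohort: default-cohort
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 8
        borrowingLimit: 8
```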
@@ -218,7 +232,7 @@ GPUs.
 
 <details>
 
 ```sh
 # Create namespaces
 kubectl create ns blue
 kubectl create ns red
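For workloads in the `blue` and `red` namespaces to reach Kueue, each team namespace also needs a LocalQueue bound to that team's ClusterQueue. A minimal sketch, with queue names that are assumptions rather than values from this diff:

```yaml
# Hypothetical sketch (queue names are assumptions): a LocalQueue in a team
# namespace routes submitted workloads to that team's ClusterQueue.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: blue
  name: default-queue
spec:
  clusterQueue: team-blue
```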
@@ -503,8 +517,8 @@ kubectl label servicemonitors.monitoring.coreos.com -n nvidia-gpu-operator nvidi
 
 ## Workload Management
 
-We will now demonstrate the queueing, quota management, and fault recovery capabilities
-of MLBatch using synthetic workloads.
+We will now demonstrate the queueing, quota management, and fault recovery
+capabilities of MLBatch using synthetic workloads.
 
 <details>
 
 
 ## Example Workloads
 
-We now will now run some sample workloads that are representative of what is run on
-an AI GPU Cluster.
+We will now run some sample workloads that are representative of what is run
+on an AI GPU cluster.
 
 ### Batch Inference with vLLM
 
@@ -636,8 +650,9 @@ The two containers are synchronized as follows: `load-generator` waits for
 
 ### Pre-Training with PyTorch
 
-In this example, `alice` uses the [Kubeflow Training Operator](https://github.com/kubeflow/training-operator)
-to run a job that uses [PyTorch](https://pytorch.org) to train a machine learning model.
+In this example, `alice` uses the [Kubeflow Training
+Operator](https://github.com/kubeflow/training-operator) to run a job that uses
+[PyTorch](https://pytorch.org) to train a machine learning model.
 
 <details>
 
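The PyTorchJob manifest for this example is elided from the diff. As a rough sketch of what such a job looks like under the Kubeflow training operator (the job name, image, command, and queue label below are all assumptions):

```yaml
# Hypothetical sketch (name, image, command, and queue label are assumptions;
# the actual manifest is elided from this diff): a two-worker PyTorchJob
# submitted through the team's local queue.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: sample-training-job
  labels:
    kueue.x-k8s.io/queue-name: default-queue
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:latest
            command: ["python", "train.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
```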
 
 ### Fine-Tuning with Ray
 
-In this example, `alice` uses [KubeRay](https://github.com/ray-project/kuberay) to run a job that
-uses [Ray](https://github.com/ray-project/ray) to fine tune a machine learning model.
+In this example, `alice` uses [KubeRay](https://github.com/ray-project/kuberay)
+to run a job that uses [Ray](https://github.com/ray-project/ray) to fine-tune a
+machine learning model.
 
 <details>
 
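The Ray manifest for this example is likewise elided. A minimal sketch of a KubeRay RayJob, with the name, image, entrypoint, and queue label all assumptions rather than values from this diff:

```yaml
# Hypothetical sketch (name, image, entrypoint, and queue label are
# assumptions; the actual manifest is elided from this diff): a RayJob
# with a head node and one GPU worker.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: sample-finetune-job
  labels:
    kueue.x-k8s.io/queue-name: default-queue
spec:
  entrypoint: python finetune.py
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:latest
    workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 1
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:latest
            resources:
              limits:
                nvidia.com/gpu: 1
```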