@@ -121,10 +121,10 @@ cd mlbatch
 # Setup priority classes
 kubectl apply -f setup.k8s/mlbatch-priorities.yaml
 
-# Deploy scheduler plugins
+# Deploy scheduler-plugins
 helm install scheduler-plugins --namespace scheduler-plugins --create-namespace scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ --set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"},{"args":{"permitWaitingTimeSeconds":300},"name":"Coscheduling"}]'
 
-# Wait for scheduler-plugins pods to be running
+# Wait for scheduler-plugins pods to be ready
 while [[ $(kubectl get pods -n scheduler-plugins -o 'jsonpath={..status.conditions[?(@.type=="Ready")].status}' | tr ' ' '\n' | sort -u) != "True" ]]
 do
     echo -n "." && sleep 1;
@@ -154,8 +154,6 @@
 done
 echo ""
 
-kubectl get pods -n mlbatch-system
-
 # Deploy AppWrapper
 kubectl apply --server-side -k setup.k8s/appwrapper/coscheduling
 
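The pluginConfig above enables the Coscheduling plugin, so workloads can be gang-scheduled all-or-nothing by the second scheduler. As a hedged illustration of what this buys a workload, here is a minimal sketch of a pod joining a PodGroup; the names `sample-group` and `sample-group-worker-0` are hypothetical, and `scheduler-plugins-scheduler` is assumed to be the chart's default scheduler name (verify against your release).

```sh
# Minimal sketch: a PodGroup plus one pod that opts into the second
# scheduler deployed above. Names are hypothetical; the scheduler name
# is the helm chart's default and may differ in your installation.
kubectl apply -f - <<'EOF'
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: sample-group
  namespace: default
spec:
  minMember: 2        # schedule only when 2 member pods can start together
EOF
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: sample-group-worker-0
  namespace: default
  labels:
    scheduling.x-k8s.io/pod-group: sample-group   # join the gang
spec:
  schedulerName: scheduler-plugins-scheduler      # use the second scheduler
  containers:
  - name: worker
    image: registry.k8s.io/pause:3.9
EOF
```

With the `permitWaitingTimeSeconds: 300` setting above, pods of an incomplete gang wait up to 300 seconds for the rest of the group before being rejected.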
@@ -496,7 +494,8 @@ kubectl label servicemonitors.monitoring.coreos.com -n nvidia-gpu-operator nvidi
 
 ## Workload Management
 
-TODO
+We will now demonstrate the queueing, quota management, and fault recovery capabilities
+of MLBatch using synthetic workloads.
 
 <details>
 
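To make "synthetic workload" concrete, here is a hedged sketch of the kind of placeholder job these demonstrations could use: a sleep Job wrapped in an AppWrapper and submitted to a team's local queue. The namespace `blue`, the queue name `default-queue`, and the job name are illustrative assumptions, not fixed by this document.

```sh
# Sketch of a synthetic workload: a GPU-requesting sleep Job wrapped in
# an AppWrapper so it is queued and quota-managed by Kueue. Namespace,
# queue name, and job name are assumptions for illustration.
kubectl apply -n blue -f - <<'EOF'
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  name: synthetic-sleep
  labels:
    kueue.x-k8s.io/queue-name: default-queue   # submit to the team's local queue
spec:
  components:
  - template:
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: synthetic-sleep
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: sleep
              image: busybox
              command: ["sleep", "600"]   # hold the quota for 10 minutes
              resources:
                requests:
                  nvidia.com/gpu: 1
                limits:
                  nvidia.com/gpu: 1
EOF
```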
@@ -506,7 +505,8 @@
 
 ## Example Workloads
 
-We now run a few example workloads.
+We will now run some sample workloads that are representative of what is run on
+an AI GPU cluster.
 
 ### Batch Inference with vLLM
 
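The vLLM example pairs an inference server with a `load-generator` container in one pod. As a hedged sketch of that two-container pattern (the image tags, model, and waiting logic below are illustrative assumptions, not the document's exact workload):

```sh
# Sketch of the two-container pattern: a vLLM server and a load
# generator that polls the server's health endpoint before sending a
# request. Images, model, and request are assumptions for illustration.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: vllm-batch-inference
spec:
  restartPolicy: Never
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest           # assumed image tag
    args: ["--model", "facebook/opt-125m"]   # small model for illustration
    resources:
      limits:
        nvidia.com/gpu: 1
  - name: load-generator
    image: curlimages/curl
    command:
    - sh
    - -c
    - |
      # wait until the vLLM server reports healthy, then issue one request
      until curl -sf localhost:8000/health; do sleep 5; done
      curl -s localhost:8000/v1/completions \
        -H 'Content-Type: application/json' \
        -d '{"model":"facebook/opt-125m","prompt":"Hello","max_tokens":16}'
EOF
```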
@@ -627,7 +627,8 @@ The two containers are synchronized as follows: `load-generator` waits for
 
 ### Pre-Training with PyTorch
 
-TODO
+In this example, `alice` uses the [Kubeflow Training Operator](https://github.com/kubeflow/training-operator)
+to run a job that uses [PyTorch](https://pytorch.org) to train a machine learning model.
 
 <details>
 
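For orientation, here is a hedged sketch of the kind of PyTorchJob `alice` might submit; the name, image, and training script are illustrative assumptions, and a real MLBatch workload would typically be wrapped in an AppWrapper as in the earlier example.

```sh
# Sketch of a distributed PyTorchJob with one master and three workers.
# Image and script path are assumptions; the container must be named
# `pytorch` for the Training Operator to wire up distributed training.
kubectl apply -n blue -f - <<'EOF'
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: sample-pretraining
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
          - name: pytorch
            image: docker.io/pytorch/pytorch:latest    # assumed image
            command: ["python", "/workspace/train.py"] # hypothetical script
            resources:
              limits:
                nvidia.com/gpu: 1
    Worker:
      replicas: 3
      restartPolicy: Never
      template:
        spec:
          containers:
          - name: pytorch
            image: docker.io/pytorch/pytorch:latest
            command: ["python", "/workspace/train.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
EOF
```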
@@ -637,7 +638,8 @@
 
 ### Fine-Tuning with Ray
 
-TODO
+In this example, `alice` uses [KubeRay](https://github.com/ray-project/kuberay) to run a job that
+uses [Ray](https://github.com/ray-project/ray) to fine-tune a machine learning model.
 
 <details>
 
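Similarly, here is a hedged sketch of the kind of RayJob `alice` might submit with KubeRay; the Ray version, image, and entrypoint script below are illustrative assumptions.

```sh
# Sketch of a RayJob: KubeRay provisions a Ray cluster (one head, two
# GPU workers) and runs the entrypoint to completion. Version, image,
# and script path are assumptions for illustration.
kubectl apply -n blue -f - <<'EOF'
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: sample-finetuning
spec:
  entrypoint: python /home/ray/samples/finetune.py   # hypothetical script
  rayClusterSpec:
    rayVersion: "2.9.0"                              # assumed version
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
    workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 2
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.9.0
            resources:
              limits:
                nvidia.com/gpu: 1
EOF
```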