## Workload Management

We will now demonstrate the queuing, quota management, and fault recovery capabilities of MLBatch
using synthetic workloads.

<details>
For this portion of the tutorial, we will use variations on the simple batch/v1 Job shown below.
All variations create multiple pods, each requesting some number of GPUs and sleeping for
a specified interval before completing successfully.
```yaml
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  generateName: <jobtype>
  labels:
    kueue.x-k8s.io/queue-name: default-queue
spec:
  components:
  - template:
      apiVersion: batch/v1
      kind: Job
      metadata:
        generateName: <jobtype>
      spec:
        completions: <number of pods>
        parallelism: <number of pods>
        template:
          spec:
            restartPolicy: Never
            terminationGracePeriodSeconds: 0
            priorityClassName: <priority class>
            containers:
            - name: busybox
              image: quay.io/project-codeflare/busybox:1.36
              command: ["sh", "-c", "sleep 600"]
              resources:
                limits:
                  nvidia.com/gpu: 4
```
We will use four types of jobs:

| Job Type  | Priority | Duration | Number of Pods | GPU Usage  |
|-----------|----------|----------|----------------|------------|
| short     | normal   | 30s      | 2              | 2 x 4 = 8  |
| normal    | normal   | 600s     | 2              | 2 x 4 = 8  |
| important | high     | 600s     | 2              | 2 x 4 = 8  |
| large     | normal   | 600s     | 4              | 4 x 4 = 16 |

### Queuing

First, Alice will submit a burst of short-running jobs that exceeds
the number of available GPUs in the cluster. The excess jobs will be
suspended by Kueue and admitted in turn as resources become available.
```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
```

Since no one else is using the cluster, Alice is able to use her blue team's quota of 8 GPUs,
borrow all 8 GPUs of the red team's quota, and borrow the 8 GPUs allocated to the slack cluster
queue. During this part of the demo, we will start with 3 admitted jobs and 4 pending jobs on the
blue cluster queue. Over the next two minutes, the queue will drain as the short-running jobs
complete and the next pending job is admitted.
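
To follow the queue as it drains, we can watch the AppWrappers in Alice's namespace. This is a
minimal sketch, assuming the default printer columns of the AppWrapper and Kueue CRDs; the
generated job names will differ from run to run.
```sh
# Watch AppWrapper status in the blue namespace as jobs are admitted and complete
kubectl get appwrappers -n blue --as alice -w

# The underlying Kueue Workload objects also report admission status
kubectl get workloads -n blue --as alice
```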

### Borrowing and Preemption

Alice will now submit four normal jobs. Again, thanks to borrowing, three of these jobs
can run immediately, while the fourth is queued.
```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
```
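
To see the borrowing in action, we can inspect the cluster queues. A minimal sketch that assumes
nothing about the cluster queue names; substitute the names reported by the first command:
```sh
# List all cluster queues with their admitted and pending workload counts
kubectl get clusterqueues

# Show detailed quota usage, including resources borrowed from other queues
# (replace <blue-cluster-queue> with the actual name from the previous command)
kubectl describe clusterqueue <blue-cluster-queue>
```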

Alice can use priorities to ensure important jobs run quickly.
```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/important.yaml -n blue --as alice
```
One of Alice's normal jobs is automatically suspended and put back on the queue of
waiting jobs to make resources available for her high-priority job.

Bob on the red team arrives at work and submits two jobs.

```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n red --as bob
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n red --as bob
```
To allow Bob to utilize his quota, which Alice's jobs had been borrowing, one of Alice's
jobs is quickly preempted and returned to the queue of pending jobs.
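
The preemption can be observed from either namespace. Another minimal sketch, again assuming the
default printer columns:
```sh
# Bob's two jobs are admitted against the red team's quota
kubectl get appwrappers -n red --as bob

# One of Alice's borrowing jobs is suspended and queued again
kubectl get appwrappers -n blue --as alice
```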

### Fault Tolerance

In this scenario, we will start fresh with an empty cluster. Alice will submit
a single large job:
```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/large.yaml -n blue --as alice
```
After the job is running, we will simulate Autopilot detecting a serious GPU failure
on one of the cluster's nodes by labeling that node:
```sh
kubectl label node <node-name> autopilot.ibm.com/gpuhealth=EVICT --overwrite
```
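
To confirm the simulated failure was recorded, you can print the node's health label; the `-L`
flag of `kubectl get` adds the label value as an output column:
```sh
# Show the gpuhealth label on the impacted node
kubectl get node <node-name> -L autopilot.ibm.com/gpuhealth
```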
MLBatch will automatically trigger a reset of all running jobs with Pods on
the impacted node. The reset first cleanly removes all of the job's Pods and
then creates fresh versions of them. Since MLBatch automatically injects
the Kubernetes node affinity shown below into all Pods it creates for user workloads,
the Kubernetes scheduler will avoid scheduling the new Pods on the impacted node.
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: autopilot.ibm.com/gpuhealth
          operator: NotIn
          values:
          - ERR
          - TESTING
          - EVICT
```
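
To watch the recovery, and to clean up afterwards, something along these lines can be used.
Removing the label is just a simple way to end the simulated failure; in a real cluster the
label is managed by Autopilot itself:
```sh
# Watch the job's Pods being deleted and recreated; the new Pods avoid the impacted node
kubectl get pods -n blue --as alice -o wide -w

# Once done, clear the simulated failure by removing the label from the node
kubectl label node <node-name> autopilot.ibm.com/gpuhealth-
```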

</details>