## Workload Management

We will now demonstrate the queuing, quota management, and fault recovery capabilities of MLBatch
using synthetic workloads.

<details>

For this portion of the tutorial, we will use variations on the simple batch/v1 Job shown below.
All variations will create multiple pods, each requesting some number of GPUs, and sleep for
a specified interval before completing successfully.

```yaml
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  generateName: <jobtype>
  labels:
    kueue.x-k8s.io/queue-name: default-queue
spec:
  components:
  - template:
      apiVersion: batch/v1
      kind: Job
      metadata:
        generateName: <jobtype>
      spec:
        completions: <number of pods>
        parallelism: <number of pods>
        template:
          spec:
            restartPolicy: Never
            terminationGracePeriodSeconds: 0
            priorityClassName: <priority class>
            containers:
            - name: busybox
              image: quay.io/project-codeflare/busybox:1.36
              command: ["sh", "-c", "sleep 600"]
              resources:
                limits:
                  nvidia.com/gpu: 4
```

We will use four types of jobs:

| Job Type  | Priority | Duration | Number of Pods | GPU Usage  |
|-----------|----------|----------|----------------|------------|
| short     | normal   | 30s      | 2              | 2 x 4 = 8  |
| normal    | normal   | 600s     | 2              | 2 x 4 = 8  |
| important | high     | 600s     | 2              | 2 x 4 = 8  |
| large     | normal   | 600s     | 4              | 4 x 4 = 16 |
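
The four variants are provided as ready-made AppWrapper specs in the tutorial's `sample-jobs`
directory (the same files referenced by the `kubectl create` commands below); you can list the
directory to confirm they are present:

```sh
# List the sample job specs used in the rest of this section.
ls ./setup.KubeConEU25/sample-jobs/
```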

### Queuing

First, Alice will submit a burst of short-running jobs that exceeds
the number of available GPUs in the cluster. The excess jobs will be
suspended by Kueue and admitted in turn as resources become available.

```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
```

Since no one else is using the cluster, Alice is able to use both her blue team's quota of
8 GPUs and, by borrowing, the 8 GPUs of the red team's quota and the 8 GPUs allocated to the
slack cluster queue. During this part of the demo, we will start with 3 admitted jobs and the
remaining 4 jobs pending on the blue cluster queue. Over the next two minutes, the queue will
drain as the short-running jobs complete and the pending jobs are admitted in turn.
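
To watch the queue drain, you can list Alice's AppWrappers and follow their status as jobs
complete and pending jobs are admitted. A minimal sketch, reusing the same impersonation as
the submissions and assuming your AppWrapper version prints a status column:

```sh
# Watch the AppWrappers in the blue namespace; suspended jobs are admitted as GPUs free up.
kubectl get appwrappers -n blue --as alice -w
```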

### Borrowing and Preemption

Alice will now submit 4 normal jobs. Again, with borrowing, three of these jobs
will be able to run immediately and the fourth will be queued.

```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
```

Alice can use priorities to ensure her important jobs run quickly.

```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/important.yaml -n blue --as alice
```

One of Alice's normal jobs is automatically suspended and put back on the queue of
waiting jobs to make its resources available for her high-priority job.
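
To see which job was preempted, list the AppWrappers again. A sketch, assuming the same
status column as above:

```sh
# One of the normal AppWrappers should now be Suspended while the important job runs.
kubectl get appwrappers -n blue --as alice
```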

Finally, Bob on the red team arrives at work and submits two jobs.

```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n red --as bob
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n red --as bob
```

Kueue ensures that Bob has immediate access to his team's allocated quota by evicting
jobs that borrowed from it. One of Alice's running jobs is quickly suspended and
returned to her team's queue of pending jobs.
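
The same story can be observed at the cluster queue level. The sketch below assumes
cluster-admin access; the exact columns printed depend on your Kueue and AppWrapper versions:

```sh
# Compare pending and admitted workloads across the teams' cluster queues.
kubectl get clusterqueues

# List AppWrappers in all namespaces: Bob's jobs are running and one of Alice's is suspended.
kubectl get appwrappers -A
```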

### Fault Tolerance

In this scenario, we will start fresh with an empty cluster. Alice will submit
a single large job:

```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/large.yaml -n blue --as alice
```
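
Once the job's pods are running, note which nodes they were scheduled on; we will need one
of these node names in the next step. A minimal sketch, reusing the impersonation from the
submission:

```sh
# The NODE column shows where each of the job's pods landed.
kubectl get pods -n blue -o wide --as alice
```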

After the job is running, we will simulate Autopilot detecting a serious GPU failure
on one of the nodes running the job by labeling that node:

```sh
kubectl label node <node-name> autopilot.ibm.com/gpuhealth=EVICT --overwrite
```

MLBatch will automatically trigger a reset of all running jobs with Pods on
the impacted node. This reset first cleanly removes all of the job's Pods and
then creates fresh replacements. Since MLBatch automatically injects the
Kubernetes affinities shown below into all Pods it creates for user workloads,
the Kubernetes scheduler will avoid placing the new Pods on the impacted node.

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: autopilot.ibm.com/gpuhealth
          operator: NotIn
          values:
          - ERR
          - TESTING
          - EVICT
```
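
You can watch the replacement pods avoid the impacted node and, once the demo is done,
remove the simulated failure label so the node becomes schedulable for user workloads again.
A sketch using standard kubectl commands:

```sh
# Watch the job's pods being recreated and scheduled onto other nodes.
kubectl get pods -n blue -o wide --as alice -w

# Clean up: remove the simulated failure label from the node.
kubectl label node <node-name> autopilot.ibm.com/gpuhealth-
```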
546677
547678</details>
548679