@@ -544,6 +544,7 @@ using synthetic workloads.
For this portion of the tutorial, we will use variations on the simple batch/v1 Job shown below.
All variations will create multiple pods, each requesting some number of GPUs, and sleep for
a specified interval before completing successfully.
+
```yaml
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
@@ -574,6 +575,7 @@ spec:
  limits:
    nvidia.com/gpu: 4
```
+
We will use four types of jobs:
| Job Type | Priority | Duration | Number of Pods | GPU Usage |
|----------|----------|----------|----------------|-----------|
@@ -609,6 +611,7 @@ next pending job is admitted.

Alice will now submit 4 normal jobs. Again, with borrowing, three of these jobs
will be able to run immediately and the 4th job will be queued.
+
```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
@@ -617,9 +620,11 @@ kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
```

Alice can use priorities to ensure important jobs run quickly.
+
```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/important.yaml -n blue --as alice
```
+
One of Alice's normal jobs is automatically suspended and put back on the queue of
waiting jobs to make resources available for her high-priority job.

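
To confirm which jobs are running and which are queued, it can help to tally the AppWrappers in a namespace by phase. The sketch below is an illustration rather than part of the tutorial's scripts: the helper names `list_phases` and `tally_phases` are hypothetical, and it assumes the AppWrapper status exposes its phase at `.status.phase`.

```sh
# Print "name<TAB>phase" for every AppWrapper in a namespace.
# Assumption: the AppWrapper status exposes .status.phase.
list_phases() {
  ns=$1
  user=$2
  kubectl get appwrappers -n "$ns" --as "$user" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
}

# Read "name<TAB>phase" lines on stdin and print one "phase count" line
# per phase, sorted by phase name.
tally_phases() {
  awk -F '\t' 'NF >= 2 { count[$2]++ } END { for (p in count) print p, count[p] }' | sort
}
```

For example, `list_phases blue alice | tally_phases` would summarize Alice's jobs after the important job is submitted. Keeping the tally separate from the `kubectl` call means it can be checked against canned input without a cluster.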
@@ -629,21 +634,26 @@ Bob on the red team arrives at work and submits two jobs.
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n red --as bob
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n red --as bob
```
+
To allow Bob to utilize his quota, which Alice's jobs had been borrowing, one of Alice's
jobs is quickly preempted and returned to the queue of pending jobs.

### Fault Tolerance

In this scenario, we will start fresh with an empty cluster. Alice will submit
a single large job:
+
```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/large.yaml -n blue --as alice
```
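
The fault-injection step in this scenario needs the name of a node that hosts at least one of the job's Pods. A minimal sketch for listing candidate nodes follows; the helper names `unique_lines` and `nodes_in_namespace` are hypothetical, and it assumes the large job's Pods are the only Pods in the namespace.

```sh
# Drop empty lines (Pods not yet scheduled have no node name) and
# print each remaining line once, sorted.
unique_lines() {
  grep -v '^$' | sort -u
}

# Print the distinct nodes hosting Pods in the given namespace.
nodes_in_namespace() {
  kubectl get pods -n "$1" \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | unique_lines
}
```

Running `nodes_in_namespace blue` once the job is running prints one node name per line; any of them can be substituted for the `<node-name>` placeholder.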
+
After the job is running, we will simulate Autopilot detecting a serious GPU failure
by labeling a Node:
+
```sh
kubectl label node <node-name> autopilot.ibm.com/gpuhealth=EVICT --overwrite
```
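
When repeating this experiment, it may be convenient to wrap the labeling step and its cleanup in small helpers. The sketch below is hypothetical: the function names and the `KUBECTL` override variable are not part of the tutorial, and removing the label with the standard `key-` syntax is only an assumption about how to undo the simulated failure, since Autopilot may manage this label itself.

```sh
# KUBECTL can be overridden (for example KUBECTL=echo) to print the
# commands instead of running them.
KUBECTL=${KUBECTL:-kubectl}

# Simulate a serious GPU failure on the given node.
mark_gpu_evict() {
  $KUBECTL label node "$1" autopilot.ibm.com/gpuhealth=EVICT --overwrite
}

# Remove the label again (assumption: this is an acceptable cleanup).
clear_gpu_health() {
  $KUBECTL label node "$1" autopilot.ibm.com/gpuhealth-
}
```

Setting `KUBECTL=echo` before calling `mark_gpu_evict <node-name>` shows the exact command without touching the cluster.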
+
MLBatch will automatically trigger a reset of all running jobs with Pods on
the impacted node. This reset first does a clean removal of all of the job's
Pods and then creates fresh versions of them. Since MLBatch automatically injects