diff --git a/setup.KubeConEU25/README.md b/setup.KubeConEU25/README.md index b39340b..54e2a21 100644 --- a/setup.KubeConEU25/README.md +++ b/setup.KubeConEU25/README.md @@ -537,12 +537,143 @@ kubectl label servicemonitors.monitoring.coreos.com -n nvidia-GPU-operator nvidi ## Workload Management -We will now demonstrate the queueing, quota management, and fault recovery -capabilities of MLBatch using synthetic workloads. +We will now demonstrate the queuing, quota management, and fault recovery capabilities of MLBatch +using synthetic workloads.
+For this portion of the tutorial, we will use variations on the simple batch/v1 Job shown below. +All variations will create multiple pods, each requesting some number of GPUs, and sleep for +a specified interval before completing successfully. -TODO +```yaml +apiVersion: workload.codeflare.dev/v1beta2 +kind: AppWrapper +metadata: + generateName: + labels: + kueue.x-k8s.io/queue-name: default-queue +spec: + components: + - template: + apiVersion: batch/v1 + kind: Job + metadata: + generateName: + spec: + completions: + parallelism: + template: + spec: + restartPolicy: Never + terminationGracePeriodSeconds: 0 + priorityClassName: + containers: + - name: busybox + image: quay.io/project-codeflare/busybox:1.36 + command: ["sh", "-c", "sleep 600"] + resources: + limits: + nvidia.com/gpu: 4 +``` + +We will use four types of jobs: + +| Job Type | Priority | Duration | Number of Pods | GPU Usage | +|----------|----------|----------|----------------|------------| +| short | normal | 30s | 2 | 2 X 4 = 8 | +| normal | normal | 600s | 2 | 2 X 4 = 8 | +| important| high | 600s | 2 | 2 x 4 = 8 | +| large | normal | 600s | 4 | 4 x 4 = 16 | + +### Queuing + +First, Alice will submit a burst of short running jobs that exceeds +the number of available GPUs in the cluster. The excess jobs will +suspended by Kueue and admitted in turn as resources become available. + +```sh +kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice +kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice +kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice +kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice +kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice +kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice +kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice +``` + +Since no one else is using the cluster, Alice is able to utilize +both her blue team's quota of 8 GPUs and to borrow all 8 GPUs from the red team's quota +and the 8 GPUs allocated to the slack cluster queue. During this part of the demo, +we will start with 3 admitted jobs and 5 pending jobs on the blue cluster queue. Over +the next two minutes, the queue will drain as the short running jobs complete and the +next pending job is admitted. + +### Borrowing and Preemption + +Alice will now submit 4 normal jobs. Again, with borrowing, three of these jobs +will be able to run immediately and the 4th job will be queued. + +```sh +kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice +kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice +kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice +kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice +``` + +Alice can use priorities to ensure her important jobs run quickly. + +```sh +kubectl create -f ./setup.KubeConEU25/sample-jobs/important.yaml -n blue --as alice +``` + +One of Alice's normal jobs is automatically suspended and put back on the queue of +waiting jobs to make its resource available for her high priority job. + +Finally Bob on the red team arrives at work and submits two jobs. + +```sh +kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n red --as bob +kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n red --as bob +``` + +Kueue ensures that Bob has immediate access to his team's allocated quota +by evicting borrowing jobs. One of Alice's running +jobs is quickly suspended and returned to her team's queue of pending jobs. + +### Fault Tolerance + +In this scenario, we will start fresh with an empty cluster. Alice will submit +a single large job: + +```sh +kubectl create -f ./setup.KubeConEU25/sample-jobs/large.yaml -n blue --as alice +``` + +After the job is running, we will simulate Autopilot detecting a serious GPU failure +on by labeling a Node: + +```sh + kubectl label node autopilot.ibm.com/gpuhealth=EVICT --overwrite +``` + +MLBatch will automatically trigger a reset of all running jobs with Pods on +the impacted node. This reset first does a clean removal of all of the job's +Pods and then creates fresh versions of them. Since MLBatch automatically injects +the Kubernetes affinities shown below into all Pods it creates for user workloads, +the Kubernetes scheduler will avoid scheduling the new Pods on the impacted Node. +```yaml + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: autopilot.ibm.com/gpuhealth + operator: NotIn + values: + - ERR + - TESTING + - EVICT +```
diff --git a/setup.KubeConEU25/sample-jobs/important.yaml b/setup.KubeConEU25/sample-jobs/important.yaml new file mode 100644 index 0000000..2e95d2f --- /dev/null +++ b/setup.KubeConEU25/sample-jobs/important.yaml @@ -0,0 +1,28 @@ +apiVersion: workload.codeflare.dev/v1beta2 +kind: AppWrapper +metadata: + generateName: important + labels: + kueue.x-k8s.io/queue-name: default-queue +spec: + components: + - template: + apiVersion: batch/v1 + kind: Job + metadata: + generateName: important + spec: + completions: 2 + parallelism: 2 + template: + spec: + restartPolicy: Never + terminationGracePeriodSeconds: 0 + priorityClassName: high-priority + containers: + - name: busybox + image: quay.io/project-codeflare/busybox:1.36 + command: ["sh", "-c", "sleep 600"] + resources: + limits: + nvidia.com/gpu: 4 diff --git a/setup.KubeConEU25/sample-jobs/large.yaml b/setup.KubeConEU25/sample-jobs/large.yaml new file mode 100644 index 0000000..bd5255a --- /dev/null +++ b/setup.KubeConEU25/sample-jobs/large.yaml @@ -0,0 +1,29 @@ +apiVersion: workload.codeflare.dev/v1beta2 +kind: AppWrapper +metadata: + generateName: large + labels: + kueue.x-k8s.io/queue-name: default-queue + annotations: + workload.codeflare.dev.appwrapper/retryPausePeriodDuration: 5s +spec: + components: + - template: + apiVersion: batch/v1 + kind: Job + metadata: + generateName: large + spec: + completions: 4 + parallelism: 4 + template: + spec: + restartPolicy: Never + terminationGracePeriodSeconds: 0 + containers: + - name: busybox + image: quay.io/project-codeflare/busybox:1.36 + command: ["sh", "-c", "sleep 600"] + resources: + limits: + nvidia.com/gpu: 4 diff --git a/setup.KubeConEU25/sample-jobs/normal.yaml b/setup.KubeConEU25/sample-jobs/normal.yaml new file mode 100644 index 0000000..b02f64d --- /dev/null +++ b/setup.KubeConEU25/sample-jobs/normal.yaml @@ -0,0 +1,27 @@ +apiVersion: workload.codeflare.dev/v1beta2 +kind: AppWrapper +metadata: + generateName: normal + labels: + kueue.x-k8s.io/queue-name: default-queue +spec: + components: + - template: + apiVersion: batch/v1 + kind: Job + metadata: + generateName: normal + spec: + completions: 2 + parallelism: 2 + template: + spec: + restartPolicy: Never + terminationGracePeriodSeconds: 0 + containers: + - name: busybox + image: quay.io/project-codeflare/busybox:1.36 + command: ["sh", "-c", "sleep 600"] + resources: + limits: + nvidia.com/gpu: 4 diff --git a/setup.KubeConEU25/sample-jobs/short.yaml b/setup.KubeConEU25/sample-jobs/short.yaml new file mode 100644 index 0000000..bef54fd --- /dev/null +++ b/setup.KubeConEU25/sample-jobs/short.yaml @@ -0,0 +1,27 @@ +apiVersion: workload.codeflare.dev/v1beta2 +kind: AppWrapper +metadata: + generateName: short + labels: + kueue.x-k8s.io/queue-name: default-queue +spec: + components: + - template: + apiVersion: batch/v1 + kind: Job + metadata: + generateName: short + spec: + completions: 2 + parallelism: 2 + template: + spec: + restartPolicy: Never + terminationGracePeriodSeconds: 0 + containers: + - name: busybox + image: quay.io/project-codeflare/busybox:1.36 + command: ["sh", "-c", "sleep 30"] + resources: + limits: + nvidia.com/gpu: 4