Commit 5250144

Enhance multi-team workshop with dual GPU flavors and add kueue architecture diagram
1 parent 87a025b commit 5250144

8 files changed: 205 additions and 127 deletions

examples/kfto-sft-llm/README.md

Lines changed: 1 addition & 1 deletion
@@ -58,7 +58,7 @@ It uses HuggingFace SFTTrainer, with PEFT for LoRA and qLoRA, and PyTorch FSDP t
 > * Follow the [Configure Kueue (Optional)](#configure-kueue-optional) section to set up required resources
 > * Add the local-queue name label to your job configuration to enforce workload management
 > * You can skip Kueue usage by:
-> > Note: Kueue Enablement via Validating Admission Policy was introduced in RHOAI-2.21. You can skip this section if using an earlier RHOAI release version.
+> > Note: Kueue Enablement via Validating Admission Policy was introduced in RHOAI 2.21. You can skip this section if using an earlier RHOAI release version.
 > * Disabling the existing `kueue-validating-admission-policy-binding`
 > * Omitting the local-queue-name label in your job configuration
 

workshops/kueue/README.md

Lines changed: 155 additions & 109 deletions
Large diffs are not rendered by default.

workshops/kueue/resources/resource_flavor.yaml

Lines changed: 0 additions & 7 deletions
This file was deleted.

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ResourceFlavor
+metadata:
+  name: nvidia-a100-80gb
+spec:
+  nodeLabels:
+    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
+  tolerations:
+    - key: nvidia.com/gpu
+      operator: Exists
+      effect: NoSchedule
+---
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ResourceFlavor
+metadata:
+  name: nvidia-h100-80gb
+spec:
+  nodeLabels:
+    nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
+  tolerations:
+    - key: nvidia.com/gpu
+      operator: Exists
+      effect: NoSchedule

workshops/kueue/resources/team1_cluster_queue.yaml

Lines changed: 12 additions & 4 deletions
@@ -17,12 +17,20 @@ spec:
         - memory
         - nvidia.com/gpu
       flavors:
-        - name: nvidia-a100
+        - name: nvidia-h100-80gb
           resources:
             - name: cpu
-              nominalQuota: '256'
+              nominalQuota: '16'
             - name: memory
-              nominalQuota: 2000Gi
+              nominalQuota: 256Gi
             - name: nvidia.com/gpu
-              nominalQuota: '5'
+              nominalQuota: '2'
+        - name: nvidia-a100-80gb
+          resources:
+            - name: cpu
+              nominalQuota: '64'
+            - name: memory
+              nominalQuota: 1024Gi
+            - name: nvidia.com/gpu
+              nominalQuota: '8'
   stopPolicy: None
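
The hunk above starts at line 17, so the top of the manifest is not visible. For orientation, here is a minimal sketch of what a complete ClusterQueue carrying these two flavors might look like. Only the flavor names and quotas come from the diff; the `metadata.name` is inferred from the LocalQueue's `clusterQueue: team1` reference, and the `namespaceSelector` and the leading `cpu` entry under `coveredResources` are assumptions about the unseen lines. Kueue tries the flavors of a resource group in the order they are listed, so with this ordering a team1 workload would be placed on the H100 quota first and fall back to the A100 quota when it does not fit.

```yaml
# Sketch only: lines outside the diff hunk are assumed, not taken from the commit.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team1                      # inferred from the LocalQueue's clusterQueue field
spec:
  namespaceSelector: {}            # assumption: accept workloads from any namespace
  resourceGroups:
    - coveredResources:
        - cpu                      # assumption: not visible in the hunk, but the flavors set a cpu quota
        - memory
        - nvidia.com/gpu
      flavors:
        - name: nvidia-h100-80gb   # listed first, so tried first
          resources:
            - name: cpu
              nominalQuota: '16'
            - name: memory
              nominalQuota: 256Gi
            - name: nvidia.com/gpu
              nominalQuota: '2'
        - name: nvidia-a100-80gb   # fallback when the H100 quota cannot fit the workload
          resources:
            - name: cpu
              nominalQuota: '64'
            - name: memory
              nominalQuota: 1024Gi
            - name: nvidia.com/gpu
              nominalQuota: '8'
  stopPolicy: None
```

Each flavor name has to match a `ResourceFlavor` object in the cluster, which is why the quota entries use `nvidia-h100-80gb` and `nvidia-a100-80gb` rather than the old `nvidia-a100`.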

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 apiVersion: kueue.x-k8s.io/v1beta1
 kind: LocalQueue
 metadata:
-  name: team1-alpha
+  name: team1
 spec:
   clusterQueue: team1
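
Because the LocalQueue is renamed from `team1-alpha` to `team1`, any workload that still carries the old queue name in its label would need updating. As an illustration, a minimal sketch of a batch Job submitted through the renamed queue could look like the following. The Job name, namespace, image, command, and resource sizes are all hypothetical; only the queue name `team1` comes from the diff, and `kueue.x-k8s.io/queue-name` is the standard Kueue label for selecting a LocalQueue.

```yaml
# Sketch only: everything except the queue name is a placeholder.
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-gpu-job             # hypothetical name
  namespace: team1                 # assumption: the namespace that holds the team1 LocalQueue
  labels:
    kueue.x-k8s.io/queue-name: team1   # must match the LocalQueue name after the rename
spec:
  suspend: true                    # Kueue unsuspends the Job once quota is admitted
  parallelism: 1
  completions: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox:1.36      # placeholder image
          command: ["sh", "-c", "sleep 30"]
          resources:
            requests:
              cpu: "1"
              memory: 1Gi
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"  # extended resources need requests equal to limits
```

On admission, Kueue adds the node labels of the chosen ResourceFlavor to the pod template's node selector, so the Job itself does not need to name a GPU model.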

workshops/kueue/resources/team2_cluster_queue.yaml

Lines changed: 12 additions & 4 deletions
@@ -17,12 +17,20 @@ spec:
         - memory
         - nvidia.com/gpu
       flavors:
-        - name: nvidia-a100
+        - name: nvidia-h100-80gb
           resources:
             - name: cpu
-              nominalQuota: '256'
+              nominalQuota: '48'
             - name: memory
-              nominalQuota: 2000Gi
+              nominalQuota: 768Gi
             - name: nvidia.com/gpu
-              nominalQuota: '5'
+              nominalQuota: '6'
+        - name: nvidia-a100-80gb
+          resources:
+            - name: cpu
+              nominalQuota: '32'
+            - name: memory
+              nominalQuota: 512Gi
+            - name: nvidia.com/gpu
+              nominalQuota: '4'
   stopPolicy: None

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 apiVersion: kueue.x-k8s.io/v1beta1
 kind: LocalQueue
 metadata:
-  name: team2-alpha
+  name: team2
 spec:
   clusterQueue: team2
