
Commit 87a025b

Add kueue workshop and adjust kfto-sft-llm example to leverage Kueue-managed scheduling by using multiple team-specific scenarios
1 parent 72f04b0 commit 87a025b

23 files changed (+576, −523 lines)

examples/kfto-sft-llm/README.md

Lines changed: 39 additions & 1 deletion
@@ -3,6 +3,9 @@
 This example demonstrates how to fine-tune LLMs with the Kubeflow Training operator on OpenShift AI.
 It uses HuggingFace SFTTrainer, with PEFT for LoRA and QLoRA, and PyTorch FSDP to distribute the training on multiple GPUs / nodes.

+> [!TIP]
+> **Multi-Team Resource Management**: For enterprise scenarios with multiple teams sharing GPU resources, see the [**Kueue Multi-Team Resource Management Workshop**](../../workshops/kueue/README.md). It demonstrates how to use this LLM fine-tuning example with Kueue for fair resource allocation, borrowing policies, and workload scheduling across teams.
+
 > [!IMPORTANT]
 > This example has been tested with the configurations listed in the [validation](#validation) section.
 > Its configuration space is high-dimensional, and tightly coupled to runtime / hardware configuration.
@@ -17,6 +20,8 @@ It uses HuggingFace SFTTrainer, with PEFT for LoRA and qLoRA, and PyTorch FSDP t

 ## Setup

+### Setup Workbench
+
 * Access the OpenShift AI dashboard, for example from the top navigation bar menu:
 ![](./docs/01.png)
 * Log in, then go to _Data Science Projects_ and create a project:
@@ -43,8 +48,41 @@ It uses HuggingFace SFTTrainer, with PEFT for LoRA and qLoRA, and PyTorch FSDP t
 ![](./docs/06.png)
 * Navigate to the `distributed-workloads/examples/kfto-sft-llm` directory and open the `sft` notebook

+> [!IMPORTANT]
+> * You will need a Hugging Face token if using gated models:
+>   * The examples use gated Llama models that require a token (e.g., https://huggingface.co/meta-llama/Llama-3.1-8B)
+>   * Set the `HF_TOKEN` environment variable in your job configuration
+>   * Note: You can skip the token if switching to non-gated models
+> * If using RHOAI 2.21+, the example supports Kueue integration for workload management:
+>   * When using Kueue:
+>     * Follow the [Configure Kueue (Optional)](#configure-kueue-optional) section to set up the required resources
+>     * Add the local-queue name label to your job configuration to enforce workload management
+>   * You can skip Kueue usage by:
+>     * Disabling the existing `kueue-validating-admission-policy-binding`
+>     * Omitting the local-queue-name label in your job configuration
+>   > Note: Kueue enablement via Validating Admission Policy was introduced in RHOAI 2.21. You can skip this section if using an earlier RHOAI release.
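The `HF_TOKEN` requirement above can be wired into the job's environment variables. A minimal sketch of that wiring, assuming the token is already exported in the workbench environment (how you source the token is up to your environment's secret handling):

```python
import os

# Read the Hugging Face token from the workbench environment (illustrative;
# an empty default keeps the sketch runnable when no token is set).
hf_token = os.environ.get("HF_TOKEN", "")

# Environment variables to pass to the training job; HF_TOKEN gates
# downloads of gated models such as meta-llama/Llama-3.1-8B.
env_vars = {
    "HF_TOKEN": hf_token,
    # NCCL debug logging, as used elsewhere in the sft notebook
    "NCCL_DEBUG": "INFO",
}
print(sorted(env_vars))
```

With a non-gated model, `HF_TOKEN` can simply be omitted from the dictionary.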
 You can now proceed with the instructions from the notebook. Enjoy!

+### Configure Kueue (Optional)
+
+> [!NOTE]
+> This section is only required if you plan to use Kueue for workload management (RHOAI 2.21+) and Kueue is not already configured in your cluster.
+
+* Update the `nodeLabels` in the `workshops/kueue/resources/resource_flavor.yaml` file to match your AI worker nodes
+* Create the ResourceFlavor:
+```console
+oc apply -f workshops/kueue/resources/resource_flavor.yaml
+```
+* Create the ClusterQueue:
+```console
+oc apply -f workshops/kueue/resources/team1_cluster_queue.yaml
+```
+* Create a LocalQueue in your namespace:
+```console
+oc apply -f workshops/kueue/resources/team1_local_queue.yaml -n <your-namespace>
+```
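For illustration, a ResourceFlavor and LocalQueue along the lines of the files referenced above might look like the following sketch (the resource names and node labels here are assumptions; the authoritative manifests live in `workshops/kueue/resources/`):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor              # hypothetical name
spec:
  nodeLabels:
    nvidia.com/gpu.present: "true"  # adjust to match your AI worker nodes
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team1-queue                 # hypothetical name
  namespace: <your-namespace>
spec:
  clusterQueue: team1-cluster-queue # must reference the ClusterQueue created above
```

The LocalQueue name is what jobs reference via the `kueue.x-k8s.io/queue-name` label.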

 ## Validation

 This example has been validated with the following configurations:
@@ -176,7 +214,7 @@ This example has been validated with the following configurations:
 num_workers: 16
 num_procs_per_worker: 1
 resources_per_worker:
-  "amd.com/gpu": 1
+  "nvidia.com/gpu": 1
   "memory": 192Gi
   "cpu": 4
 base_image: quay.io/modh/training:py311-cuda121-torch241

examples/kfto-sft-llm/sft.ipynb

Lines changed: 12 additions & 0 deletions
@@ -287,6 +287,17 @@
    "Configure the SDK client by providing the authentication token:"
   ]
  },
+ {
+  "cell_type": "code",
+  "execution_count": null,
+  "id": "4e8ac3ef",
+  "metadata": {},
+  "outputs": [],
+  "source": [
+   "# IMPORTANT: Labels and annotations support in the create_job() method requires kubeflow-training v1.9.2+. Skip this cell if using RHOAI 2.21 or later.\n",
+   "%pip install -U kubeflow-training"
+  ]
+ },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -356,6 +367,7 @@
   " # NCCL / RCCL\n",
   " \"NCCL_DEBUG\": \"INFO\",\n",
   " },\n",
+  " # labels={\"kueue.x-k8s.io/queue-name\": \"<LOCAL_QUEUE_NAME>\"}, # Optional: add the local queue name and uncomment this line if using Kueue for resource management\n",
   " parameters=parameters,\n",
   " volumes=[\n",
   " V1Volume(name=\"shared\",\n",
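The commented `labels` line above is the only notebook change needed to route the job through Kueue. A sketch of the relevant job configuration with that label enabled, using values from the validated configuration in this commit (the queue name `team1-queue` is a hypothetical LocalQueue; this only builds the keyword arguments, it does not submit a job):

```python
# Kueue routes a job to a LocalQueue via this well-known label key.
KUEUE_QUEUE_LABEL = "kueue.x-k8s.io/queue-name"

job_kwargs = dict(
    num_workers=16,
    num_procs_per_worker=1,
    resources_per_worker={"nvidia.com/gpu": 1, "memory": "192Gi", "cpu": 4},
    base_image="quay.io/modh/training:py311-cuda121-torch241",
    # Requires kubeflow-training v1.9.2+ (labels support in create_job()).
    labels={KUEUE_QUEUE_LABEL: "team1-queue"},
)
print(job_kwargs["labels"])
```

Omitting the `labels` entry (and disabling the admission policy binding, as described in the README) opts the job out of Kueue management.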

workshops/kueue/README.md

Lines changed: 511 additions & 0 deletions (large diff not rendered by default)

workshops/kueue/docs/01.png (345 KB, binary)
workshops/kueue/docs/02.png (259 KB, binary)
workshops/kueue/docs/03.png (263 KB, binary)
workshops/kueue/docs/04.png (137 KB, binary)
workshops/kueue/docs/05.png (328 KB, binary)
workshops/kueue/docs/06.png (357 KB, binary)
workshops/kueue/docs/07.png (361 KB, binary)
