`examples/kfto-sft-llm/README.md` (+39 −1)
@@ -3,6 +3,9 @@
 This example demonstrates how to fine-tune LLMs with the Kubeflow Training operator on OpenShift AI.
 It uses HuggingFace SFTTrainer, with PEFT for LoRA and qLoRA, and PyTorch FSDP to distribute the training on multiple GPUs / nodes.
 
+> [!TIP]
+> **Multi-Team Resource Management**: For enterprise scenarios with multiple teams sharing GPU resources, see the [**Kueue Multi-Team Resource Management Workshop**](../../workshops/kueue/README.md). It demonstrates how to use this LLM fine-tuning example with Kueue for fair resource allocation, borrowing policies, and workload scheduling across teams.
+
 > [!IMPORTANT]
 > This example has been tested with the configurations listed in the [validation](#validation) section.
 > Its configuration space is highly dimensional, and tightly coupled to runtime / hardware configuration.
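Under the hood, the Training operator runs this kind of multi-GPU / multi-node distribution through a `PyTorchJob` custom resource. As a minimal sketch only (the name, image, and GPU counts below are placeholders, not the values this example actually uses):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: sft                           # placeholder job name
spec:
  nprocPerNode: "4"                   # one training process per GPU on each node
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: <training-runtime-image>   # placeholder
            resources:
              limits:
                nvidia.com/gpu: 4
    Worker:
      replicas: 1                     # scale out for more nodes
      template:
        spec:
          containers:
          - name: pytorch
            image: <training-runtime-image>   # placeholder
            resources:
              limits:
                nvidia.com/gpu: 4
```

The notebook builds an equivalent resource for you via the kubeflow-training SDK, so you normally do not write this YAML by hand.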
@@ -17,6 +20,8 @@ It uses HuggingFace SFTTrainer, with PEFT for LoRA and qLoRA, and PyTorch FSDP t
 
 ## Setup
 
+### Setup Workbench
+
 * Access the OpenShift AI dashboard, for example from the top navigation bar menu:
 
 * Log in, then go to _Data Science Projects_ and create a project:
@@ -43,8 +48,41 @@ It uses HuggingFace SFTTrainer, with PEFT for LoRA and qLoRA, and PyTorch FSDP t
 
 * Navigate to the `distributed-workloads/examples/kfto-sft-llm` directory and open the `sft` notebook
 
+> [!IMPORTANT]
+> * You will need a Hugging Face token if using gated models:
+>   * The examples use gated Llama models that require a token (e.g., https://huggingface.co/meta-llama/Llama-3.1-8B)
+>   * Set the `HF_TOKEN` environment variable in your job configuration
+>   * Note: You can skip the token if switching to non-gated models
+> * If using RHOAI 2.21+, the example supports Kueue integration for workload management:
+>   * When using Kueue:
+>     * Follow the [Configure Kueue (Optional)](#configure-kueue-optional) section to set up required resources
+>     * Add the local-queue name label to your job configuration to enforce workload management
+>   * You can skip Kueue usage by:
+>     > Note: Kueue Enablement via Validating Admission Policy was introduced in RHOAI 2.21. You can skip this section if using an earlier RHOAI release version.
+>     * Disabling the existing `kueue-validating-admission-policy-binding`
+>     * Omitting the local-queue-name label in your job configuration
+
 You can now proceed with the instructions from the notebook. Enjoy!
 
+### Configure Kueue (Optional)
+
+> [!NOTE]
+> This section is only required if you plan to use Kueue for workload management (RHOAI 2.21+) and Kueue is not already configured in your cluster.
+
+* Update the `nodeLabels` in the `workshops/kueue/resources/resource_flavor.yaml` file to match your AI worker nodes
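For reference, Kueue admission is wired through three resources: a `ResourceFlavor` (which carries the `nodeLabels` mentioned above), a `ClusterQueue` that holds quota, and a namespaced `LocalQueue` whose name goes into the job label. A minimal sketch, with placeholder names, node labels, and quotas (the actual manifests live under `workshops/kueue/resources/`):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
spec:
  nodeLabels:
    nvidia.com/gpu.present: "true"   # placeholder: match your AI worker nodes
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-cluster-queue           # placeholder
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 64
      - name: "memory"
        nominalQuota: 512Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: local-queue                  # this name goes in the kueue.x-k8s.io/queue-name label
spec:
  clusterQueue: team-cluster-queue
```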
`examples/kfto-sft-llm/sft.ipynb` (+12 −0)
@@ -287,6 +287,17 @@
     "Configure the SDK client by providing the authentication token:"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4e8ac3ef",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# IMPORTANT: Labels and annotations support in create_job() method requires kubeflow-training v1.9.2+. Skip this cell if using RHOAI 2.21 or later.\n",
+    "%pip install -U kubeflow-training "
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -356,6 +367,7 @@
 "    # NCCL / RCCL\n",
 "    \"NCCL_DEBUG\": \"INFO\",\n",
 "  },\n",
+"  # labels={\"kueue.x-k8s.io/queue-name\": \"<LOCAL_QUEUE_NAME>\"}, # Optional: Add local queue name and uncomment these lines if using Kueue for resource management\n",
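The added line above attaches the Kueue local-queue label to the job the notebook submits. To show how the label and the `HF_TOKEN` environment variable fit together, here is a hypothetical helper (not code from the notebook) that assembles keyword arguments in the shape `TrainingClient.create_job()` expects; parameter names like `env_vars` and `labels` follow the kubeflow-training SDK, and labels support requires v1.9.2+:

```python
def build_job_kwargs(name, hf_token, local_queue=None):
    """Assemble keyword arguments for TrainingClient.create_job() (illustrative only)."""
    kwargs = {
        "name": name,
        "num_workers": 2,
        "env_vars": {
            "HF_TOKEN": hf_token,   # required for gated models; omit for non-gated ones
            "NCCL_DEBUG": "INFO",
        },
    }
    if local_queue is not None:
        # Kueue admission label; needs kubeflow-training v1.9.2+ in create_job()
        kwargs["labels"] = {"kueue.x-k8s.io/queue-name": local_queue}
    return kwargs

job = build_job_kwargs("sft", "hf-token-placeholder", local_queue="local-queue")
print(job["labels"])  # {'kueue.x-k8s.io/queue-name': 'local-queue'}
```

Omitting `local_queue` simply leaves out the label, which matches the "skip Kueue usage" path described in the README.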