Commit e469a4b

Merge remote-tracking branch 'upstream/main'
2 parents: 055a251 + 2c79b90

25 files changed: +666 −557 lines

examples/kfto-sft-llm/README.md
39 additions, 1 deletion
@@ -3,6 +3,9 @@
 This example demonstrates how to fine-tune LLMs with the Kubeflow Training operator on OpenShift AI.
 It uses HuggingFace SFTTrainer, with PEFT for LoRA and qLoRA, and PyTorch FSDP to distribute the training across multiple GPUs / nodes.
 
+> [!TIP]
+> **Multi-Team Resource Management**: For enterprise scenarios where multiple teams share GPU resources, see the [**Kueue Multi-Team Resource Management Workshop**](../../workshops/kueue/README.md). It demonstrates how to use this LLM fine-tuning example with Kueue for fair resource allocation, borrowing policies, and workload scheduling across teams.
+
 > [!IMPORTANT]
 > This example has been tested with the configurations listed in the [validation](#validation) section.
 > Its configuration space is high-dimensional and tightly coupled to the runtime / hardware configuration.
@@ -17,6 +20,8 @@ It uses HuggingFace SFTTrainer, with PEFT for LoRA and qLoRA, and PyTorch FSDP t
 
 ## Setup
 
+### Setup Workbench
+
 * Access the OpenShift AI dashboard, for example from the top navigation bar menu:
 ![](./docs/01.png)
 * Log in, then go to _Data Science Projects_ and create a project:
@@ -43,8 +48,41 @@ It uses HuggingFace SFTTrainer, with PEFT for LoRA and qLoRA, and PyTorch FSDP t
 ![](./docs/06.png)
 * Navigate to the `distributed-workloads/examples/kfto-sft-llm` directory and open the `sft` notebook
 
+> [!IMPORTANT]
+> * You will need a Hugging Face token if using gated models:
+>   * The examples use gated Llama models that require a token (e.g., https://huggingface.co/meta-llama/Llama-3.1-8B)
+>   * Set the `HF_TOKEN` environment variable in your job configuration
+>   * Note: you can skip the token if switching to non-gated models
+> * If using RHOAI 2.21+, the example supports Kueue integration for workload management:
+>   * When using Kueue:
+>     * Follow the [Configure Kueue (Optional)](#configure-kueue-optional) section to set up the required resources
+>     * Add the local-queue name label to your job configuration to enforce workload management
+>   * You can skip Kueue usage by:
+>     > Note: Kueue enablement via Validating Admission Policy was introduced in RHOAI 2.21. You can skip this if using an earlier RHOAI release.
+>     * Disabling the existing `kueue-validating-admission-policy-binding`
+>     * Omitting the local-queue-name label in your job configuration
+
 You can now proceed with the instructions from the notebook. Enjoy!
 
+### Configure Kueue (Optional)
+
+> [!NOTE]
+> This section is only required if you plan to use Kueue for workload management (RHOAI 2.21+) and Kueue is not already configured in your cluster.
+
+* Update the `nodeLabels` in the `workshops/kueue/resources/resource_flavor.yaml` file to match your AI worker nodes
+* Create the ResourceFlavor:
+```console
+oc apply -f workshops/kueue/resources/resource_flavor.yaml
+```
+* Create the ClusterQueue:
+```console
+oc apply -f workshops/kueue/resources/team1_cluster_queue.yaml
+```
+* Create a LocalQueue in your namespace:
+```console
+oc apply -f workshops/kueue/resources/team1_local_queue.yaml -n <your-namespace>
+```
+
 ## Validation
 
 This example has been validated with the following configurations:
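The Kueue objects created in the hunk above can also be sanity-checked from Python. A minimal sketch, assuming the `kubernetes` client package, a logged-in kubeconfig, and the `kueue.x-k8s.io/v1beta1` API; `<your-namespace>` is the same placeholder as in the `oc apply` step:

```python
# Sketch: list LocalQueues in the namespace to confirm the Kueue setup.
# Assumes `pip install kubernetes` and a valid kubeconfig context.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

queues = api.list_namespaced_custom_object(
    group="kueue.x-k8s.io",
    version="v1beta1",
    namespace="<your-namespace>",  # placeholder, as in the oc apply step above
    plural="localqueues",
)
for q in queues["items"]:
    print(q["metadata"]["name"], "->", q["spec"]["clusterQueue"])
```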
@@ -176,7 +214,7 @@
 num_workers: 16
 num_procs_per_worker: 1
 resources_per_worker:
-  "amd.com/gpu": 1
+  "nvidia.com/gpu": 1
   "memory": 192Gi
   "cpu": 4
 base_image: quay.io/modh/training:py311-cuda121-torch241
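For orientation, the validated settings in this hunk are the arguments passed to the SDK's `create_job()` call in the notebook. Below is a hedged sketch of how they fit together; `train_func` is a stand-in for the notebook's actual training function, and the job name and token value are placeholders, not values from this commit:

```python
# Sketch: submit the fine-tuning job with the validated configuration.
# Assumes kubeflow-training v1.9.2+ (labels support, per the notebook diff below).
from kubeflow.training import TrainingClient

def train_func():
    # placeholder for the SFT training function defined in sft.ipynb
    ...

client = TrainingClient()
client.create_job(
    name="sft",  # hypothetical job name
    train_func=train_func,
    num_workers=16,
    num_procs_per_worker=1,
    resources_per_worker={
        "nvidia.com/gpu": 1,
        "memory": "192Gi",
        "cpu": 4,
    },
    base_image="quay.io/modh/training:py311-cuda121-torch241",
    env_vars={
        "HF_TOKEN": "<YOUR_HF_TOKEN>",  # required for gated Llama models
        "NCCL_DEBUG": "INFO",
    },
    # labels={"kueue.x-k8s.io/queue-name": "<LOCAL_QUEUE_NAME>"},  # uncomment when using Kueue
)
```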

examples/kfto-sft-llm/sft.ipynb
12 additions, 0 deletions
@@ -287,6 +287,17 @@
     "Configure the SDK client by providing the authentication token:"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4e8ac3ef",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# IMPORTANT: Labels and annotations support in the create_job() method requires kubeflow-training v1.9.2+. Skip this cell if using RHOAI 2.21 or later.\n",
+    "%pip install -U kubeflow-training"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -356,6 +367,7 @@
     "        # NCCL / RCCL\n",
     "        \"NCCL_DEBUG\": \"INFO\",\n",
     "    },\n",
+    "    # labels={\"kueue.x-k8s.io/queue-name\": \"<LOCAL_QUEUE_NAME>\"},  # Optional: add the local queue name and uncomment this line if using Kueue for resource management\n",
     "    parameters=parameters,\n",
     "    volumes=[\n",
     "        V1Volume(name=\"shared\",\n",

tests/kfto/kfto_training_test.go
0 additions, 22 deletions
@@ -19,7 +19,6 @@ package kfto
1919
import (
2020
"fmt"
2121
"testing"
22-
"time"
2322

2423
kftov1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
2524
. "github.com/onsi/gomega"
@@ -183,27 +182,6 @@ func runKFTOPyTorchJob(t *testing.T, image string, gpu Accelerator, numGpus, num
 	test.Eventually(PyTorchJob(test, namespace, tuningJob.Name), TestTimeoutDouble).
 		Should(WithTransform(PyTorchJobConditionRunning, Equal(corev1.ConditionTrue)))
 
-	// Verify GPU utilization
-	if IsOpenShift(test) && gpu == NVIDIA {
-		trainingPods := GetPods(test, namespace, metav1.ListOptions{LabelSelector: "training.kubeflow.org/job-name=" + tuningJob.GetName()})
-		test.Expect(trainingPods).To(HaveLen(numberOfWorkerNodes + 1)) // +1 is a master node
-
-		for _, trainingPod := range trainingPods {
-			// Check that GPUs for training pods were utilized recently
-			test.Eventually(OpenShiftPrometheusGpuUtil(test, trainingPod, gpu), 10*time.Minute).
-				Should(
-					And(
-						HaveLen(numGpus),
-						ContainElement(
-							// Check that at least some GPU was utilized on more than 10%
-							HaveField("Value", BeNumerically(">", 10)),
-						),
-					),
-				)
-		}
-		test.T().Log("All GPUs were successfully utilized")
-	}
-
 	// Make sure the PyTorch job succeeded
 	test.Eventually(PyTorchJob(test, namespace, tuningJob.Name), TestTimeoutLong).Should(WithTransform(PyTorchJobConditionSucceeded, Equal(corev1.ConditionTrue)))
 	test.T().Logf("PytorchJob %s/%s ran successfully", tuningJob.Namespace, tuningJob.Name)
