
Commit 14e0354

Merge remote-tracking branch 'upstream/main'
2 parents ca5ec1d + 1720bf6 commit 14e0354

4 files changed: +78 -1 lines changed

examples/kfto-dreambooth/README.md

Lines changed: 35 additions & 1 deletion
@@ -5,6 +5,8 @@ The finetuning is performed on OpenShift environment using Kubeflow Training ope

This example is based on HuggingFace DreamBooth Hackathon example - https://huggingface.co/learn/diffusion-course/en/hackathon/dreambooth

> [!TIP]
> **Multi-Team Resource Management**: For enterprise scenarios with multiple teams sharing GPU resources, see the [**Kueue Multi-Team Resource Management Workshop**](../../workshops/kueue/README.md). It demonstrates how to use this fine-tuning example with Kueue for fair resource allocation, borrowing policies, and workload scheduling across teams.

## Requirements

@@ -45,4 +47,36 @@ This example is based on HuggingFace DreamBooth Hackathon example - https://hugg
* From the workbench, clone this repository, i.e., `https://github.com/opendatahub-io/distributed-workloads.git`
* Navigate to the `distributed-workloads/examples/kfto-dreambooth` directory and open the `dreambooth` notebook

You can now proceed with the instructions from the notebook. Enjoy!

> [!IMPORTANT]
> **Kueue Integration (RHOAI 2.21+):**
> * If using RHOAI 2.21+, the example supports Kueue integration for workload management:
>   * When using Kueue:
>     * Follow the [Configure Kueue (Optional)](#configure-kueue-optional) section to set up the required resources
>     * Add the local-queue name label to your job configuration to enforce workload management
>   * You can skip Kueue usage by:
>     * Disabling the existing `kueue-validating-admission-policy-binding` (see the sketch below)
>     * Omitting the local-queue-name label in your job configuration
>
> **Note:** Kueue Enablement via Validating Admission Policy was introduced in RHOAI 2.21. You can skip this section if using an earlier RHOAI release version.
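
If you decide to skip Kueue enforcement, the commands below are a minimal sketch of how the binding could be inspected and removed. They are not part of the original instructions: they assume cluster-admin access and the default binding name mentioned above, and deleting the binding disables Kueue enforcement cluster-wide.

```console
# Inspect the ValidatingAdmissionPolicyBinding that enforces the queue-name label
oc get validatingadmissionpolicybinding kueue-validating-admission-policy-binding

# One possible way to skip enforcement: remove the binding (cluster-wide effect)
oc delete validatingadmissionpolicybinding kueue-validating-admission-policy-binding
```
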
### Configure Kueue (Optional)

> [!NOTE]
> This section is only required if you plan to use Kueue for workload management (RHOAI 2.21+) and Kueue is not already configured in your cluster.
> The Kueue resource YAML files referenced below are located in the [Kueue workshop directory](../../workshops/kueue), specifically in `workshops/kueue/resources/`. You can use these files as templates for your own setup or copy them into your project as needed.

* Update the `nodeLabels` in the `workshops/kueue/resources/resource_flavor.yaml` file to match your AI worker nodes
* Create the ResourceFlavor:
  ```console
  oc apply -f workshops/kueue/resources/resource_flavor.yaml
  ```
* Create the ClusterQueue:
  ```console
  oc apply -f workshops/kueue/resources/team1_cluster_queue.yaml
  ```
* Create a LocalQueue in your namespace:
  ```console
  oc apply -f workshops/kueue/resources/team1_local_queue.yaml -n <your-namespace>
  ```
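
As an optional sanity check (not part of the original instructions), you can inspect the worker node labels before editing the ResourceFlavor and confirm the Kueue objects exist after applying the files. The commands below are a minimal sketch and assume the Kueue CRDs are installed:

```console
# List worker nodes with their labels to pick suitable nodeLabels for the ResourceFlavor
oc get nodes -l node-role.kubernetes.io/worker --show-labels

# Confirm the Kueue resources were created
oc get resourceflavors
oc get clusterqueues
oc get localqueues -n <your-namespace>
```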

examples/kfto-dreambooth/dreambooth.ipynb

Lines changed: 1 addition & 0 deletions
@@ -394,6 +394,7 @@
" resources_per_worker={\"gpu\": 2},\n",
" base_image=\"quay.io/modh/training:py311-cuda121-torch241\",\n",
" parameters=parameters,\n",
" # labels={\"kueue.x-k8s.io/queue-name\": \"<LOCAL_QUEUE_NAME>\"}, # Optional: Add local queue name and uncomment this line if using Kueue for resource management\n",
" env_vars=[\n",
" V1EnvVar(name=\"AWS_ACCESS_KEY_ID\", value_from=V1EnvVarSource(secret_key_ref=V1SecretKeySelector(key=\"AWS_ACCESS_KEY_ID\", name=aws_connection_name))),\n",
" V1EnvVar(name=\"AWS_S3_BUCKET\", value_from=V1EnvVarSource(secret_key_ref=V1SecretKeySelector(key=\"AWS_S3_BUCKET\", name=aws_connection_name))),\n",

examples/kfto-feast/README.md

Lines changed: 41 additions & 0 deletions
@@ -35,6 +35,9 @@ By integrating Feast into the fine-tuning pipeline, we ensure that the training

---

> [!TIP]
> **Multi-Team Resource Management**: For enterprise scenarios with multiple teams sharing GPU resources, see the [**Kueue Multi-Team Resource Management Workshop**](../../workshops/kueue/README.md). It demonstrates how to use this LLM fine-tuning example with Kueue for fair resource allocation, borrowing policies, and workload scheduling across teams.

## Requirements

* An OpenShift cluster with OpenShift AI (RHOAI) 2.17+ installed:
@@ -109,4 +112,42 @@ By following this notebook, you'll gain hands-on experience in setting up a **fe

You can now proceed with the instructions from the notebook. Enjoy!

> [!IMPORTANT]
> **Hugging Face Token Requirements:**
> * You will need a Hugging Face token if using gated models:
>   * The examples use gated Llama models that require a token (e.g., https://huggingface.co/meta-llama/Llama-3.1-8B)
>   * Set the `HF_TOKEN` environment variable in your job configuration (a secret-based sketch follows this note)
>   * Note: You can skip the token if switching to non-gated models
>
> **Kueue Integration (RHOAI 2.21+):**
> * If using RHOAI 2.21+, the example supports Kueue integration for workload management:
>   * When using Kueue:
>     * Follow the [Configure Kueue (Optional)](#configure-kueue-optional) section to set up the required resources
>     * Add the local-queue name label to your job configuration to enforce workload management
>   * You can skip Kueue usage by:
>     * Disabling the existing `kueue-validating-admission-policy-binding`
>     * Omitting the local-queue-name label in your job configuration
>
> **Note:** Kueue Enablement via Validating Admission Policy was introduced in RHOAI 2.21. You can skip this section if using an earlier RHOAI release version.
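
For the gated Llama models, the command below is a minimal sketch of keeping the token out of the notebook by storing it in a Secret; the secret name `hf-token` is a placeholder rather than something the notebook prescribes, so align it with however the notebook actually reads `HF_TOKEN`.

```console
# Hypothetical secret name; reference it from the job configuration, or export HF_TOKEN directly in the workbench instead
oc create secret generic hf-token \
  --from-literal=HF_TOKEN=<your-hugging-face-token> \
  -n <your-namespace>
```
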
### Configure Kueue (Optional)

> [!NOTE]
> This section is only required if you plan to use Kueue for workload management (RHOAI 2.21+) and Kueue is not already configured in your cluster.
> The Kueue resource YAML files referenced below are located in the [Kueue workshop directory](../../workshops/kueue), specifically in `workshops/kueue/resources/`. You can use these files as templates for your own setup or copy them into your project as needed.

* Update the `nodeLabels` in the `workshops/kueue/resources/resource_flavor.yaml` file to match your AI worker nodes
* Create the ResourceFlavor:
  ```console
  oc apply -f workshops/kueue/resources/resource_flavor.yaml
  ```
* Create the ClusterQueue:
  ```console
  oc apply -f workshops/kueue/resources/team1_cluster_queue.yaml
  ```
* Create a LocalQueue in your namespace:
  ```console
  oc apply -f workshops/kueue/resources/team1_local_queue.yaml -n <your-namespace>
  ```
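
After submitting the training job with the local-queue label, an optional way to confirm that Kueue picked it up is to look at the generated Workload object. This is a sketch, not part of the original instructions, and assumes the Kueue CRDs are installed:

```console
# Each labeled job should produce a Workload in the namespace; check its Admitted condition
oc get workloads -n <your-namespace>
oc describe workload <workload-name> -n <your-namespace>
```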

examples/kfto-feast/kfto_feast.ipynb

Lines changed: 1 addition & 0 deletions
@@ -1328,6 +1328,7 @@
" \"USE_LORA\": \"true\", # Whether to apply LoRA adapters in the standard (full‑precision) mode.\n",
" \"USE_QLORA\":\"false\", # Whether to apply QLoRA, which loads the model in 4‑bit quantized mode and then applies LoRA adapters.\n",
" }, \n",
" # labels={\"kueue.x-k8s.io/queue-name\": \"<LOCAL_QUEUE_NAME>\"}, # Optional: Add local queue name and uncomment this line if using Kueue for resource management\n",
" volume_mounts=[\n",
" {\n",
" \"name\": \"config-volume\",\n",
