190 changes: 190 additions & 0 deletions examples/kft-v2/README.md
@@ -0,0 +1,190 @@
# 🚀 Kubeflow Training V2: Advanced ML Training with Distributed Computing

This directory contains comprehensive examples demonstrating **Kubeflow Training V2** capabilities for distributed training using the Kubeflow Trainer SDK.

## 🎯 **What This Directory Demonstrates**

- **Kubeflow Trainer SDK**: Programmatic TrainJob creation and management
- **Checkpointing**: Checkpoint save and resume that stays compatible with controller-managed suspend/resume of TrainJobs
- **Distributed Training**: Multi-node, multi-CPU/GPU coordination with NCCL/GLOO backends
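The suspend/resume-friendly checkpointing boils down to: write an epoch-tagged checkpoint at each epoch boundary, and on startup load the newest one if present. A minimal, framework-free sketch (file names and layout here are illustrative, not the scripts' actual on-disk format):

```python
import json
import os

def save_checkpoint(checkpoint_dir, epoch, state):
    """Write an epoch-tagged checkpoint so a resumed job can pick up where it left off."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, f"checkpoint-epoch-{epoch}.json")
    with open(path, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    return path

def load_latest_checkpoint(checkpoint_dir):
    """Return (epoch, state) of the newest checkpoint, or (0, None) on a fresh start."""
    if not os.path.isdir(checkpoint_dir):
        return 0, None
    epochs = []
    for name in os.listdir(checkpoint_dir):
        if name.startswith("checkpoint-epoch-") and name.endswith(".json"):
            epochs.append(int(name[len("checkpoint-epoch-"):-len(".json")]))
    if not epochs:
        return 0, None
    path = os.path.join(checkpoint_dir, f"checkpoint-epoch-{max(epochs)}.json")
    with open(path) as f:
        data = json.load(f)
    return data["epoch"], data["state"]
```

A real training loop would call `save_checkpoint` after each epoch and start from `load_latest_checkpoint(...)[0] + 1`, so a TrainJob that is suspended and later resumed repeats no completed work.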

---
### **TRL (Transformer Reinforcement Learning) Integration**
- **SFTTrainer**: Supervised fine-tuning with instruction following
- **PEFT-LoRA**: Parameter-efficient fine-tuning with Low-Rank Adaptation
- **Model Support**: GPT-2, Llama, and other transformer models
- **Dataset Integration**: Alpaca dataset for instruction-following tasks

### **Distributed Training Capabilities**
- **Multi-Node Support**: Scale training across multiple nodes
- **Multi-GPU Coordination**: NCCL backend (CUDA for NVIDIA GPUs, ROCm for AMD GPUs)
- **CPU Training**: GLOO backend for CPU-based training
- **Resource Flexibility**: Configurable compute resources per node
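The backend choice above follows one rule (NCCL whenever GPUs drive the collectives, GLOO otherwise), and every node joins the process group via torchrun-style environment variables. A plain-Python sketch of both pieces (the helper names and defaults are illustrative, not the scripts' actual code):

```python
import os

def select_backend(cuda_available: bool) -> str:
    """NCCL handles GPU collectives (CUDA on NVIDIA, ROCm builds on AMD); GLOO covers CPU-only runs."""
    return "nccl" if cuda_available else "gloo"

def rendezvous_config(env=os.environ):
    """Collect the torchrun-style settings each node needs to join the process group."""
    return {
        "rank": int(env.get("RANK", "0")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "master_addr": env.get("MASTER_ADDR", "localhost"),
        "master_port": int(env.get("MASTER_PORT", "29500")),
    }
```

In a training script, `torch.distributed.init_process_group(select_backend(torch.cuda.is_available()))` would consume these values, which the Trainer controller injects into each pod.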

---

## 📋 **Prerequisites**

### **Cluster Requirements**
- **OpenShift Cluster**: With OpenShift AI (RHOAI) 2.17+ installed
- **Required Components**: `dashboard`, `trainingoperator`, and `workbenches` enabled
> **Reviewer (Contributor):** The trainer v2 component should probably be `trainer`.
>
> **Author:** Yeah, I will update it accordingly, thanks! Thinking whether it is OK to put it that way, as we don't have this utility available yet 🤔
>
> **Reviewer (Contributor):** Will this example be available before v2 is in RHOAI?
>
> **Reviewer (Contributor):** It would likely confuse people to have this example available using the v1 component, so it may be better to hold this PR and wait until v2 is in RHOAI, WDYT?
>
> **Author:** @Fiona-Waters @astefanutti Yeah that makes sense, I will keep it as a draft for now! There are some upcoming changes in the Kubeflow SDK which will further simplify the overall workflow here, especially the TrainJob Options implementation including the PodSpecOverrides capability, which will allow mounting volumes without customising the default ClusterTrainingRuntimes 👍 That will in turn reduce the `oc` dependencies expected from the user ✅

- **Storage**: A persistent volume claim named `workspace` of at least 50GB with RWX (ReadWriteMany) access mode

---

## 🛠️ **Setup Instructions**

### **1. Repository Setup**

Clone the repository and navigate to the kft-v2 directory:

```bash
git clone https://github.com/opendatahub-io/distributed-workloads.git
cd distributed-workloads/examples/kft-v2
```

### **2. Persistent Volume Setup**

Create a shared persistent volume for checkpoint storage:

```bash
oc apply -f manifests/shared_pvc.yaml
```

### **3. Cluster Training Runtime Setup**

Apply the cluster training runtime configuration:

```bash
oc apply -f manifests/cluster_training_runtime.yaml
```
**Comment on lines +48 to +60**

> **Reviewer (Contributor):** We should aim at avoiding any extra `oc` commands.
>
> **Author:** Then the ClusterTrainingRuntime can also be created, I guess using the Kubernetes custom resource API, is that OK?
>
> **Reviewer (Contributor):** Ideally the examples would use the pre-installed ClusterTrainingRuntimes and users would not have to create one for each example.

This creates the necessary ClusterTrainingRuntime resources for PyTorch training.
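With the runtime in place, a TrainJob can be submitted programmatically. The sketch below targets the Kubeflow SDK ~0.1.0 surface; the `TrainerClient`/`CustomTrainer` names and signatures may shift between SDK releases, the resource sizing is illustrative, and the import is deferred so the snippet parses without cluster access:

```python
# Illustrative sizing; mirrors the scale of the runtime's requests/limits.
RESOURCES_PER_NODE = {"cpu": 2, "memory": "4Gi"}

def submit_trainjob(train_func, runtime_name="torch-cuda-custom", num_nodes=2):
    """Submit `train_func` as a distributed TrainJob against the ClusterTrainingRuntime above."""
    # Requires the Kubeflow SDK and access to the cluster.
    from kubeflow.trainer import TrainerClient, CustomTrainer

    client = TrainerClient()
    return client.train(
        runtime=client.get_runtime(runtime_name),
        trainer=CustomTrainer(
            func=train_func,
            num_nodes=num_nodes,
            resources_per_node=RESOURCES_PER_NODE,
        ),
    )
```

The returned job name can then be passed to the client's log and status helpers to follow training progress.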


## 🖥️ **Workbench Setup**

* Access the OpenShift AI dashboard, for example from the top navigation bar menu:

![](./docs/01.png)
> **Review (bot):** ⚠️ Potential issue | 🟡 Minor — Add descriptive alt text to images for accessibility. Lines 69, 73, 77, 85, 87, 99, and 144 have images without alt text (markdownlint MD045), which impacts screen reader users. For example, `![](./docs/01.png)` could become `![OpenShift AI dashboard navigation](./docs/01.png)`.


* Log in, then go to _Data Science Projects_ and create a project:

![](./docs/02.png)

* Once the project is created, click on _Create a workbench_:

![](./docs/03.png)

* Then create a workbench with the following settings:

* Select the `PyTorch` (or the `ROCm-PyTorch`) notebook image:

* Select the _Medium_ container size and a sufficient persistent storage volume.

![](./docs/04.png)

![](./docs/05.png)

> [!NOTE]
>
> * An accelerator is only needed to test the fine-tuned model from within the workbench, so you can skip it if none is available.
> * Keep the default 20GB workbench storage; it is enough to run the inference from within the workbench.


* Review the configuration and click _Create workbench_

* From the _Workbenches_ page, click _Open_ once the workbench you've just created becomes ready:

![](./docs/06.png)

---

## 🚀 **Quick Start Examples**

### **Example 1: Fashion-MNIST Training**

Run the Fashion-MNIST training example:

```python
from scripts.mnist import train_fashion_mnist

# Configure training parameters
config = {
"epochs": 10,
"batch_size": 64,
"learning_rate": 0.001,
"checkpoint_dir": "/mnt/shared/checkpoints"
}

# Start training
train_fashion_mnist(config)
```

### **Example 2: TRL GPT-2 Fine-tuning**

Run the TRL training example:

```python
from scripts.trl_training import trl_train

# Configure TRL parameters
config = {
"model_name": "gpt2",
"dataset_name": "alpaca",
"lora_r": 16,
"lora_alpha": 32,
"max_seq_length": 512
}

# Start TRL training
trl_train(config)
```
**Comment on lines +109 to +142**

> **Review (bot):** ⚠️ Potential issue | 🟠 Major — Incorrect API usage in quick start examples. Both example code blocks pass a config dict to the training functions, but `train_fashion_mnist()` (mnist.py) and `trl_train()` (trl_training.py) take no parameters; both read their configuration from environment variables. The examples should set environment variables instead:
>
> ```python
> # Example 1: Fashion-MNIST Training
> import os
> os.environ['NUM_EPOCHS'] = '10'
> os.environ['BATCH_SIZE'] = '64'
> os.environ['LEARNING_RATE'] = '0.001'
> os.environ['CHECKPOINT_DIR'] = '/mnt/shared/checkpoints'
>
> from scripts.mnist import train_fashion_mnist
> train_fashion_mnist()
> ```
>
> ```python
> # Example 2: TRL GPT-2 Fine-tuning
> import os
> os.environ['MODEL_NAME'] = 'gpt2'
> os.environ['DATASET_NAME'] = 'tatsu-lab/alpaca'
> os.environ['LORA_R'] = '16'
> os.environ['LORA_ALPHA'] = '32'
>
> from scripts.trl_training import trl_train
> trl_train()
> ```


![](./docs/07.png)

---

## 📊 **Training Examples**

### **Fashion-MNIST Classification**

The `mnist.py` script demonstrates:

- **Distributed Training**: Multi-GPU Fashion-MNIST classification
- **Checkpointing**: Automatic checkpoint creation and resumption
- **Progress Tracking**: Real-time training progress monitoring
- **Error Handling**: Robust error handling and recovery

**Key Features:**
- CNN architecture for image classification
- Distributed data loading with DistributedSampler
- Automatic mixed precision (AMP) training
- Comprehensive logging and metrics
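The `DistributedSampler` behavior listed above amounts to giving each rank an interleaved, equally sized slice of the dataset. A simplified sketch (the padding mirrors the sampler's default `drop_last=False` behavior; details here are illustrative):

```python
def shard_indices(num_samples, rank, world_size):
    """Return the interleaved index slice rank `rank` trains on; every rank gets the same count."""
    indices = list(range(num_samples))
    pad = (-num_samples) % world_size
    indices += indices[:pad]          # wrap around, like DistributedSampler's default padding
    return indices[rank::world_size]  # interleaved slice for this rank
```

With 10 samples across 4 ranks, each rank receives 3 indices and the 4 shards jointly cover the whole dataset, which is why per-rank gradients can simply be all-reduced each step.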

### **TRL GPT-2 Fine-tuning**

The `trl_training.py` script demonstrates:

- **Instruction Following**: Fine-tuning GPT-2 on Alpaca dataset
- **PEFT-LoRA**: Parameter-efficient fine-tuning
- **Checkpoint Management**: TRL-compatible checkpointing
- **Distributed Coordination**: Multi-node training coordination

**Key Features:**
- SFTTrainer for supervised fine-tuning
- LoRA adapters for efficient parameter updates
- Instruction-following dataset processing
- Hugging Face model integration
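The "efficient parameter updates" claim is easy to quantify: a LoRA adapter for a `d_out x d_in` weight trains two low-rank factors instead of the full matrix. A quick check (the 768x2304 shape is GPT-2's fused attention projection `c_attn`; the r=16 setting matches the example config above):

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters for one LoRA adapter: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r

full = 768 * 2304              # full fine-tuning of GPT-2's c_attn weight
lora = lora_params(768, 2304, r=16)
print(lora, f"{100 * lora / full:.2f}%")  # -> 49152 2.78%
```

At rank 16 the adapter trains under 3% of that layer's parameters, which is what makes LoRA fine-tuning fit modest GPU memory budgets.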

---

## 📚 **References and Documentation**

- **[Kubeflow Trainer SDK](https://github.com/kubeflow/sdk)**: Official SDK documentation
- **[TRL Documentation](https://huggingface.co/docs/trl/)**: Transformer Reinforcement Learning
- **[PEFT Documentation](https://huggingface.co/docs/peft/)**: Parameter-Efficient Fine-Tuning
- **[PyTorch Distributed Training](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)**: Distributed training guide
- **[OpenShift AI Documentation](https://access.redhat.com/documentation/en-us/red_hat_openshift_ai)**: RHOAI documentation

---
Binary file added examples/kft-v2/docs/01.png
Binary file added examples/kft-v2/docs/02.png
Binary file added examples/kft-v2/docs/03.png
Binary file added examples/kft-v2/docs/04.png
Binary file added examples/kft-v2/docs/05.png
Binary file added examples/kft-v2/docs/06.png
Binary file added examples/kft-v2/docs/07.png
Binary file added examples/kft-v2/docs/jobs.png
Binary file added examples/kft-v2/docs/trainjob_pods.png
Binary file added examples/kft-v2/docs/trainjobs_jobsets.png
141 changes: 141 additions & 0 deletions examples/kft-v2/manifests/cluster_training_runtime.yaml
@@ -0,0 +1,141 @@
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
> **Reviewer (Contributor):** We should aim to use one of the pre-installed ClusterTrainingRuntimes.
>
> **Author:** @astefanutti Yes, that was the plan, but at the moment the SDK doesn't allow providing volume mount specs for a TrainJob, so I had to explicitly add all the needed configs, env variables, and volume mounts in the ClusterTrainingRuntime itself. Do you mean I should add all the needed config to the default `torch-cuda-251` runtime and then reference it from the `client.train` method when creating a TrainJob?
>
> **Reviewer (@astefanutti, Oct 7, 2025):** Right, that was my understanding. No, I mean we should try to fill the gaps and see how we can improve the SDK flexibility. Could that be one of the options we are adding to the SDK?
>
> **Author:** Yes, once we have this options fix merged for the SDK (kubeflow/sdk#91), I think we will be able to provide volume mounts as well as other configuration via PodSpecOverrides.
>
> **Author:** I'm thinking of adjusting this PR again promptly and will share the results soon. This demo is fully adjusted to the latest Kubeflow SDK version 0.1.0, so are we good to keep the ClusterTrainingRuntime separate for now? I will update the PVC creation flow to use the Kubernetes API.
>
> **Reviewer (Contributor):** I'd be inclined to keep this PR open until we close this gap.
metadata:
  name: torch-cuda-custom
  labels:
    trainer.kubeflow.org/framework: torch
spec:
  mlPolicy:
    numNodes: 2
    torch:
      numProcPerNode: 1
  template:
    metadata: {}
    spec:
      replicatedJobs:
        - name: dataset-initializer
          replicas: 1
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: dataset-initializer
            spec:
              template:
                spec:
                  containers:
                    - env:
                        - name: HF_HOME
                          value: /workspace/cache
                        - name: DATASET_NAME
                          value: tatsu-lab/alpaca
                        - name: DATASET_CONFIG
                          value: main
                        - name: DATASET_SPLIT
                          value: 'train[:500]'
                        - name: DATASET_FORMAT
                          value: json
                        - name: WORKSPACE_PATH
                          value: /workspace
                      image: 'ghcr.io/kubeflow/trainer/dataset-initializer:v2.0.0'
                      name: dataset-initializer
                      resources:
                        limits:
                          cpu: '2'
                          memory: 4Gi
                        requests:
                          cpu: '1'
                          memory: 2Gi
                      volumeMounts:
                        - mountPath: /workspace
                          name: workspace
                  restartPolicy: Never
                  volumes:
                    - name: workspace
                      persistentVolumeClaim:
                        claimName: workspace
        - dependsOn:
            - name: dataset-initializer
              status: Complete
          name: model-initializer
          replicas: 1
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: model-initializer
            spec:
              template:
                spec:
                  containers:
                    - env:
                        - name: HF_HOME
                          value: /workspace/cache
                        - name: MODEL_NAME
                          value: gpt2
                        - name: MODEL_REVISION
                          value: main
                        - name: DOWNLOAD_MODE
                          value: force_redownload
                        - name: WORKSPACE_PATH
                          value: /workspace
                      image: 'ghcr.io/kubeflow/trainer/model-initializer:v2.0.0'
                      name: model-initializer
                      resources:
                        limits:
                          cpu: '2'
                          memory: 4Gi
                        requests:
                          cpu: '1'
                          memory: 2Gi
                      volumeMounts:
                        - mountPath: /workspace
                          name: workspace
                  restartPolicy: Never
                  volumes:
                    - name: workspace
                      persistentVolumeClaim:
                        claimName: workspace
        - dependsOn:
            - name: model-initializer
              status: Complete
          name: node
          replicas: 1
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: trainer
            spec:
              template:
                metadata: {}
                spec:
                  containers:
                    - env:
                        - name: PYTHONUNBUFFERED
                          value: '1'
                        - name: NCCL_DEBUG
                          value: INFO
                        - name: NCCL_SOCKET_IFNAME
                          value: eth0
                        - name: NCCL_IB_DISABLE
                          value: '1'
                        - name: NCCL_P2P_DISABLE
                          value: '1'
                        - name: TRAINJOB_PROGRESSION_FILE_PATH
                          value: /tmp/training_progression.json
                        - name: CHECKPOINT_DIR
                          value: /workspace/checkpoints
                      image: 'quay.io/modh/training:py311-cuda124-torch251'
                      name: node
                      resources:
                        limits:
                          cpu: '2'
                          memory: 4Gi
                        requests:
                          cpu: '1'
                          memory: 2Gi
                      volumeMounts:
                        - mountPath: /workspace
                          name: workspace
                  volumes:
                    - name: workspace
                      persistentVolumeClaim:
                        claimName: workspace
12 changes: 12 additions & 0 deletions examples/kft-v2/manifests/shared_pvc.yaml
@@ -0,0 +1,12 @@
apiVersion: v1
> **Reviewer (Contributor):** Could the Kubernetes Python SDK be used in the notebook to create that PVC?
>
> **Author:** Yeah, that will work too, on it!
>
> **Author:** Then the ClusterTrainingRuntime can also be created, I guess using the Kubernetes custom resource API, is that OK?

kind: PersistentVolumeClaim
metadata:
  name: workspace
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: nfs-csi
  volumeMode: Filesystem
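As discussed in the review thread, the same PVC can be created from the workbench with the Kubernetes Python client instead of `oc`. A sketch (the manifest mirrors `manifests/shared_pvc.yaml`; the apply step assumes the `kubernetes` package and in-cluster or kubeconfig credentials, so its import is deferred):

```python
def workspace_pvc_manifest(name="workspace", size="50Gi", storage_class="nfs-csi"):
    """Build the same PVC as manifests/shared_pvc.yaml as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            "resources": {"requests": {"storage": size}},
            "storageClassName": storage_class,
            "volumeMode": "Filesystem",
        },
    }

def create_workspace_pvc(namespace):
    """Apply the manifest in-cluster (requires the `kubernetes` package and credentials)."""
    from kubernetes import client, config

    config.load_incluster_config()  # or config.load_kube_config() outside a pod
    client.CoreV1Api().create_namespaced_persistent_volume_claim(
        namespace=namespace, body=workspace_pvc_manifest()
    )
```

Calling `create_workspace_pvc("my-project")` from the notebook removes one of the `oc apply` steps listed earlier.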