Commit c36f120

Add Kubeflow Trainer V2 based finetuning demo example
1 parent 18dbc41 commit c36f120

File tree

16 files changed: +2130 -0 lines changed

examples/kft-v2/README.md

Lines changed: 190 additions & 0 deletions
@@ -0,0 +1,190 @@
# 🚀 Kubeflow Trainer V2: Advanced ML Training with Distributed Computing

This directory contains comprehensive examples demonstrating **Kubeflow Trainer V2** capabilities for distributed training using the Kubeflow Trainer SDK.

## 🎯 **What This Directory Demonstrates**

- **Kubeflow Trainer SDK**: Programmatic TrainJob creation and management (see the SDK sketch after this list)
- **Checkpointing**: Model checkpoints that survive controller-managed suspend and resume of TrainJobs
- **Distributed Training**: Multi-node, multi-CPU/GPU coordination with the NCCL and GLOO backends
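A minimal sketch of what programmatic TrainJob creation can look like from a workbench notebook; the runtime name, node count, and resource values below are illustrative assumptions rather than values taken from this example, and the method names should be checked against the [Kubeflow Trainer SDK](https://github.com/kubeflow/sdk) docs:

```python
# Hedged sketch of TrainJob creation with the Kubeflow Trainer SDK.
# Assumptions: the "torch-distributed" runtime name, 2 nodes, and the resource values.
from kubeflow.trainer import CustomTrainer, TrainerClient


def train_func():
    # Placeholder training function that runs on every node of the TrainJob.
    import torch
    print(f"CUDA available: {torch.cuda.is_available()}")


client = TrainerClient()

job_name = client.train(
    runtime=client.get_runtime("torch-distributed"),  # assumed runtime name
    trainer=CustomTrainer(
        func=train_func,
        num_nodes=2,                                    # scale out across two nodes
        resources_per_node={"cpu": 4, "memory": "16Gi"},
    ),
)
print(f"Created TrainJob: {job_name}")
```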
---
### **TRL (Transformer Reinforcement Learning) Integration**

- **SFTTrainer**: Supervised fine-tuning with instruction following
- **PEFT-LoRA**: Parameter-efficient fine-tuning with Low-Rank Adaptation
- **Model Support**: GPT-2, Llama, and other transformer models
- **Dataset Integration**: Alpaca dataset for instruction-following tasks

### **Distributed Training Capabilities**

- **Multi-Node Support**: Scale training across multiple nodes
- **Multi-GPU Coordination**: NCCL backend on CUDA for NVIDIA GPUs and on ROCm for AMD GPUs
- **CPU Training**: GLOO backend for CPU-based training (a backend-selection sketch follows this list)
- **Resource Flexibility**: Configurable compute resources per node
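A minimal sketch of how a training function can pick the process-group backend at runtime; this is illustrative and not code copied from the scripts in this directory:

```python
# Illustrative sketch: use NCCL when GPUs are visible (CUDA and ROCm builds of
# PyTorch both expose them via torch.cuda), otherwise fall back to GLOO on CPU.
import os

import torch
import torch.distributed as dist


def init_distributed() -> str:
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)

    # The launcher (e.g. torchrun) injects RANK, WORLD_SIZE, and LOCAL_RANK.
    if backend == "nccl":
        torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
    return backend
```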
---
## 📋 **Prerequisites**

### **Cluster Requirements**

- **OpenShift Cluster**: With OpenShift AI (RHOAI) 2.17+ installed
- **Required Components**: `dashboard`, `trainingoperator`, and `workbenches` enabled
- **Storage**: A persistent volume claim named `workspace` of at least 50GB with RWX (ReadWriteMany) access mode
---
## 🛠️ **Setup Instructions**

### **1. Repository Setup**

Clone the repository and navigate to the kft-v2 directory:

```bash
git clone https://github.com/opendatahub-io/distributed-workloads.git
cd distributed-workloads/examples/kft-v2
```

### **2. Persistent Volume Setup**

Create a shared persistent volume for checkpoint storage:

```bash
oc apply -f manifests/shared_pvc.yaml
```

### **3. Cluster Training Runtime Setup**

Apply the cluster training runtime configuration:

```bash
oc apply -f manifests/cluster_training_runtime.yaml
```

This creates the necessary ClusterTrainingRuntime resources for PyTorch training.
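To confirm the runtime is visible before submitting a TrainJob, a short, hedged sketch using the SDK client from a workbench (the attribute names on the returned runtime objects may differ between SDK versions):

```python
# Hedged sketch: list the training runtimes the SDK can see from the workbench.
from kubeflow.trainer import TrainerClient

client = TrainerClient()
for runtime in client.list_runtimes():
    print(runtime.name)
```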
### **4. Workbench Setup**

* Access the OpenShift AI dashboard, for example from the top navigation bar menu:

![](./docs/01.png)

* Log in, then go to _Data Science Projects_ and create a project:

![](./docs/02.png)

* Once the project is created, click on _Create a workbench_:

![](./docs/03.png)

* Then create a workbench with the following settings:

  * Select the `PyTorch` (or the `ROCm-PyTorch`) notebook image:

  * Select the _Medium_ container size and a sufficient persistent storage volume.

![](./docs/04.png)

![](./docs/05.png)

> [!NOTE]
>
> * Adding an accelerator is only needed to test the fine-tuned model from within the workbench, so you can spare an accelerator if needed.
> * Keep the default 20GB workbench storage; it is enough to run inference from within the workbench.

* Review the configuration and click _Create workbench_

* From the _Workbenches_ page, click on _Open_ when the workbench you've just created becomes ready:

![](./docs/06.png)
---
## 🚀 **Quick Start Examples**

### **Example 1: Fashion-MNIST Training**

Run the Fashion-MNIST training example:

```python
from scripts.mnist import train_fashion_mnist

# Configure training parameters
config = {
    "epochs": 10,
    "batch_size": 64,
    "learning_rate": 0.001,
    "checkpoint_dir": "/mnt/shared/checkpoints"
}

# Start training
train_fashion_mnist(config)
```

### **Example 2: TRL GPT-2 Fine-tuning**

Run the TRL training example:

```python
from scripts.trl_training import trl_train

# Configure TRL parameters
config = {
    "model_name": "gpt2",
    "dataset_name": "alpaca",
    "lora_r": 16,
    "lora_alpha": 32,
    "max_seq_length": 512
}

# Start TRL training
trl_train(config)
```

![](./docs/07.png)
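Once an example has created a TrainJob, it can also be followed from the workbench through the SDK client; a hedged sketch ("my-trainjob" is a placeholder name, and the exact method signatures should be verified against the SDK docs):

```python
# Hedged sketch: inspect a TrainJob and print its logs from the workbench.
# "my-trainjob" is a placeholder for the name returned by TrainerClient.train().
from kubeflow.trainer import TrainerClient

client = TrainerClient()

job = client.get_job(name="my-trainjob")
print(job.status)

for line in client.get_job_logs(name="my-trainjob"):
    print(line)
```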
---

## 📊 **Training Examples**

### **Fashion-MNIST Classification**

The `mnist.py` script demonstrates:

- **Distributed Training**: Multi-GPU Fashion-MNIST classification
- **Checkpointing**: Automatic checkpoint creation and resumption
- **Progress Tracking**: Real-time training progress monitoring
- **Error Handling**: Robust error handling and recovery

**Key Features** (a sketch of how these pieces typically fit together follows the list):
- CNN architecture for image classification
- Distributed data loading with DistributedSampler
- Automatic mixed precision (AMP) training
- Comprehensive logging and metrics
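A minimal sketch of how DistributedSampler, AMP, and checkpoint resumption typically combine in a PyTorch DDP loop; it is illustrative and not the contents of `mnist.py` (the model, dataset, and checkpoint layout are placeholders):

```python
# Illustrative DDP loop, not a copy of mnist.py: DistributedSampler + AMP + resume.
import os

import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def train(model, dataset, epochs, checkpoint_dir="/mnt/shared/checkpoints"):
    use_cuda = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if use_cuda else "cpu")
    model = DDP(model.to(device), device_ids=[local_rank] if use_cuda else None)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

    # Resume from the last checkpoint on the shared volume, if one exists.
    start_epoch = 0
    ckpt_path = os.path.join(checkpoint_dir, "last.pt")
    if os.path.exists(ckpt_path):
        state = torch.load(ckpt_path, map_location=device)
        model.module.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_epoch = state["epoch"] + 1

    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    for epoch in range(start_epoch, epochs):
        sampler.set_epoch(epoch)  # reshuffle shards across ranks every epoch
        for images, labels in loader:
            optimizer.zero_grad()
            with torch.autocast(device_type=device.type, enabled=use_cuda):
                loss = F.cross_entropy(model(images.to(device)), labels.to(device))
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        if dist.get_rank() == 0:  # only rank 0 writes the shared checkpoint
            torch.save({"model": model.module.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "epoch": epoch}, ckpt_path)

    dist.destroy_process_group()
```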
### **TRL GPT-2 Fine-tuning**

The `trl_training.py` script demonstrates:

- **Instruction Following**: Fine-tuning GPT-2 on the Alpaca dataset
- **PEFT-LoRA**: Parameter-efficient fine-tuning
- **Checkpoint Management**: TRL-compatible checkpointing
- **Distributed Coordination**: Multi-node training coordination

**Key Features** (a sketch of the SFTTrainer + LoRA wiring follows the list):
- SFTTrainer for supervised fine-tuning
- LoRA adapters for efficient parameter updates
- Instruction-following dataset processing
- Hugging Face model integration
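A minimal sketch of wiring SFTTrainer with a LoRA config; the dataset id, output path, and hyperparameters are illustrative assumptions and do not necessarily match `trl_training.py`, and some `SFTConfig` field names vary between TRL versions:

```python
# Illustrative sketch, not a copy of trl_training.py. Assumptions: the
# "tatsu-lab/alpaca" dataset id, the output path, and all hyperparameters.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("tatsu-lab/alpaca", split="train")

peft_config = LoraConfig(
    r=16,                 # LoRA rank
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="gpt2",                # TRL can load the model from its Hub id
    train_dataset=dataset,
    peft_config=peft_config,     # only the LoRA adapters are trained
    args=SFTConfig(
        output_dir="/mnt/shared/checkpoints/trl",  # shared volume => resumable
        max_seq_length=512,
        per_device_train_batch_size=4,
        num_train_epochs=1,
    ),
)
trainer.train()
trainer.save_model()
```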
---
## 📚 **References and Documentation**

- **[Kubeflow Trainer SDK](https://github.com/kubeflow/sdk)**: Official SDK documentation
- **[TRL Documentation](https://huggingface.co/docs/trl/)**: Transformer Reinforcement Learning
- **[PEFT Documentation](https://huggingface.co/docs/peft/)**: Parameter-Efficient Fine-Tuning
- **[PyTorch Distributed Training](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)**: Distributed training guide
- **[OpenShift AI Documentation](https://access.redhat.com/documentation/en-us/red_hat_openshift_ai)**: RHOAI documentation
---

examples/kft-v2/docs/01.png (284 KB)

examples/kft-v2/docs/02.png (262 KB)

examples/kft-v2/docs/03.png (165 KB)

examples/kft-v2/docs/04.png (179 KB)

examples/kft-v2/docs/05.png (330 KB)

examples/kft-v2/docs/06.png (364 KB)

examples/kft-v2/docs/07.png (291 KB)

examples/kft-v2/docs/jobs.png (182 KB)
