# Add Kubeflow Trainer V2 demo #454
@@ -0,0 +1,190 @@
# 🚀 Kubeflow Training V2: Advanced ML Training with Distributed Computing

This directory contains comprehensive examples demonstrating **Kubeflow Training V2** capabilities for distributed training using the Kubeflow Trainer SDK.

## 🎯 **What This Directory Demonstrates**

- **Kubeflow Trainer SDK**: Programmatic TrainJob creation and management
- **Checkpointing**: Checkpoint saving and resumption compatible with controller-managed suspend and resume
- **Distributed Training**: Multi-node, multi-CPU/GPU coordination with the NCCL and GLOO backends

---
### **TRL (Transformer Reinforcement Learning) Integration**
- **SFTTrainer**: Supervised fine-tuning with instruction following
- **PEFT-LoRA**: Parameter-efficient fine-tuning with Low-Rank Adaptation
- **Model Support**: GPT-2, Llama, and other transformer models
- **Dataset Integration**: Alpaca dataset for instruction-following tasks

### **Distributed Training Capabilities**
- **Multi-Node Support**: Scale training across multiple nodes
- **Multi-GPU Coordination**: NCCL backend with CUDA for NVIDIA GPUs and ROCm for AMD GPUs
- **CPU Training**: GLOO backend for CPU-based training (backend selection is sketched below)
- **Resource Flexibility**: Configurable compute resources per node
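The backend choice above maps directly onto `torch.distributed`. The following is a rough, illustrative sketch (not a verbatim excerpt from the scripts in this directory) of how a training entry point could initialize its process group:

```python
import os

import torch
import torch.distributed as dist


def init_distributed() -> str:
    """Initialize torch.distributed using the env:// rendezvous set up by the launcher.

    NCCL is used when GPUs are visible (CUDA for NVIDIA; ROCm also surfaces
    through torch.cuda), otherwise GLOO is used for CPU-only training.
    """
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)

    if torch.cuda.is_available():
        torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))

    print(f"rank={dist.get_rank()} world_size={dist.get_world_size()} backend={backend}")
    return backend
```

This relies on the standard `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`, and `LOCAL_RANK` environment variables that the torch launcher provides to each process, so `env://` initialization needs no extra arguments.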

---

## 📋 **Prerequisites**

### **Cluster Requirements**
- **OpenShift Cluster**: With OpenShift AI (RHOAI) 2.17+ installed
- **Required Components**: `dashboard`, `trainingoperator`, and `workbenches` enabled
- **Storage**: A persistent volume claim named `workspace` of at least 50GB with the RWX (ReadWriteMany) access mode

---

## 🛠️ **Setup Instructions**

### **1. Repository Setup**

Clone the repository and navigate to the kft-v2 directory:

```bash
git clone https://github.com/opendatahub-io/distributed-workloads.git
cd distributed-workloads/examples/kft-v2
```

### **2. Persistent Volume Setup**

Create a shared persistent volume claim for checkpoint storage:

```bash
oc apply -f manifests/shared_pvc.yaml
```
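As an alternative to the `oc` command (the review discussion later in this PR suggests avoiding extra `oc` steps), the same PVC can be created from the workbench with the Kubernetes Python client. This is only a sketch: it assumes the `kubernetes` package is installed and that the notebook's credentials can create PVCs in your project namespace (`my-project` is a placeholder).

```python
from kubernetes import client, config

# Use the workbench service account when running in-cluster,
# otherwise fall back to the local kubeconfig.
try:
    config.load_incluster_config()
except config.ConfigException:
    config.load_kube_config()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="workspace"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],
        resources=client.V1ResourceRequirements(requests={"storage": "50Gi"}),
        storage_class_name="nfs-csi",  # assumption: an RWX-capable storage class
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="my-project",  # replace with your Data Science Project namespace
    body=pvc,
)
```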

### **3. Cluster Training Runtime Setup**

Apply the cluster training runtime configuration:

```bash
oc apply -f manifests/cluster_training_runtime.yaml
```

> **Review comment on lines +48 to +60**
>
> **Contributor:** We should aim at avoiding any extra `oc` commands.
>
> **Contributor (author):** Then the ClusterTrainingRuntime can also be created, I guess using the Kubernetes custom_resource API, is it ok?
>
> **Contributor:** Ideally the examples would use the pre-installed ClusterTrainingRuntimes and users would not have to create one for each example.

This creates the necessary ClusterTrainingRuntime resources for PyTorch training.
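The ClusterTrainingRuntime can likewise be created without `oc` by loading the manifest and sending it through the Kubernetes custom-objects API (the `custom_resource` approach mentioned in the review thread above). A sketch, assuming cluster-scoped create permissions and that `clustertrainingruntimes` is the CRD's plural name:

```python
import yaml
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a workbench

# Load the manifest shipped with this example.
with open("manifests/cluster_training_runtime.yaml") as f:
    runtime = yaml.safe_load(f)

client.CustomObjectsApi().create_cluster_custom_object(
    group="trainer.kubeflow.org",
    version="v1alpha1",
    plural="clustertrainingruntimes",  # assumption: the CRD's plural name
    body=runtime,
)
```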

### **4. Workbench Setup**

* Access the OpenShift AI dashboard, for example from the top navigation bar menu:

> **Review comment:** Add descriptive alt text to images for accessibility. Lines 69, 73, 77, 85, 87, 99, and 144 have images without alt text, which impacts accessibility for screen reader users.
>
> 🪛 markdownlint-cli2 (0.18.1) — MD045, no-alt-text: Images should have alternate text (alt text).

* Log in, then go to _Data Science Projects_ and create a project:

* Once the project is created, click on _Create a workbench_:

* Then create a workbench with the following settings:

  * Select the `PyTorch` (or the `ROCm-PyTorch`) notebook image:

  * Select the _Medium_ container size and a sufficient persistent storage volume.

> [!NOTE]
>
> * Adding an accelerator is only needed to test the fine-tuned model from within the workbench, so you can spare an accelerator if needed.
> * Keep the default 20GB workbench storage; it is enough to run the inference from within the workbench.

* Review the configuration and click _Create workbench_.

* From the _Workbenches_ page, click on _Open_ when the workbench you've just created becomes ready:

---

## 🚀 **Quick Start Examples**

### **Example 1: Fashion-MNIST Training**

Run the Fashion-MNIST training example:

```python
from scripts.mnist import train_fashion_mnist

# Configure training parameters
config = {
    "epochs": 10,
    "batch_size": 64,
    "learning_rate": 0.001,
    "checkpoint_dir": "/mnt/shared/checkpoints"
}

# Start training
train_fashion_mnist(config)
```

### **Example 2: TRL GPT-2 Fine-tuning**

Run the TRL training example:

```python
from scripts.trl_training import trl_train

# Configure TRL parameters
config = {
    "model_name": "gpt2",
    "dataset_name": "alpaca",
    "lora_r": 16,
    "lora_alpha": 32,
    "max_seq_length": 512
}

# Start TRL training
trl_train(config)
```

> **Review comment on lines +109 to +142:** Incorrect API usage in the quick start examples. Both example code blocks show passing a `config` dict to the training functions, but both functions read their configuration from environment variables. Update the examples to show setting environment variables instead:
>
> ```python
> # Example 1: Fashion-MNIST Training
> import os
> os.environ['NUM_EPOCHS'] = '10'
> os.environ['BATCH_SIZE'] = '64'
> os.environ['LEARNING_RATE'] = '0.001'
> os.environ['CHECKPOINT_DIR'] = '/mnt/shared/checkpoints'
>
> from scripts.mnist import train_fashion_mnist
> train_fashion_mnist()
> ```
>
> ```python
> # Example 2: TRL GPT-2 Fine-tuning
> import os
> os.environ['MODEL_NAME'] = 'gpt2'
> os.environ['DATASET_NAME'] = 'tatsu-lab/alpaca'
> os.environ['LORA_R'] = '16'
> os.environ['LORA_ALPHA'] = '32'
>
> from scripts.trl_training import trl_train
> trl_train()
> ```

---

## 📊 **Training Examples**

### **Fashion-MNIST Classification**

The `mnist.py` script demonstrates:

- **Distributed Training**: Multi-GPU Fashion-MNIST classification
- **Checkpointing**: Automatic checkpoint creation and resumption
- **Progress Tracking**: Real-time training progress monitoring
- **Error Handling**: Robust error handling and recovery

**Key Features:**
- CNN architecture for image classification
- Distributed data loading with DistributedSampler
- Automatic mixed precision (AMP) training (see the illustrative sketch below)
- Comprehensive logging and metrics
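To make the DistributedSampler/AMP/checkpoint combination concrete, here is a compact, illustrative sketch of the pattern. It is not the actual `mnist.py`: the tiny random dataset, the linear model, and the two epochs are stand-ins, and it assumes the process is started by the torch launcher so that `env://` initialization works.

```python
import os

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
device = torch.device("cuda", local_rank) if torch.cuda.is_available() else torch.device("cpu")

# Stand-in data and model: the real script uses torchvision.datasets.FashionMNIST and a CNN.
dataset = TensorDataset(torch.randn(512, 1, 28, 28), torch.randint(0, 10, (512,)))
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10)).to(device)
model = torch.nn.parallel.DistributedDataParallel(model)

sampler = DistributedSampler(dataset)            # each rank trains on its own shard
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())
checkpoint_dir = os.environ.get("CHECKPOINT_DIR", "/workspace/checkpoints")

for epoch in range(2):
    sampler.set_epoch(epoch)                     # keep shuffling consistent across ranks
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):
            loss = torch.nn.functional.cross_entropy(model(images), labels)
        scaler.scale(loss).backward()            # AMP-aware backward/step
        scaler.step(optimizer)
        scaler.update()
    if dist.get_rank() == 0:                     # rank 0 writes the shared checkpoint
        os.makedirs(checkpoint_dir, exist_ok=True)
        torch.save({"epoch": epoch, "model": model.module.state_dict()},
                   os.path.join(checkpoint_dir, "last.pt"))
```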

### **TRL GPT-2 Fine-tuning**

The `trl_training.py` script demonstrates:

- **Instruction Following**: Fine-tuning GPT-2 on the Alpaca dataset
- **PEFT-LoRA**: Parameter-efficient fine-tuning
- **Checkpoint Management**: TRL-compatible checkpointing
- **Distributed Coordination**: Multi-node training coordination

**Key Features:**
- SFTTrainer for supervised fine-tuning
- LoRA adapters for efficient parameter updates (see the sketch below)
- Instruction-following dataset processing
- Hugging Face model integration
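For orientation, this is a hedged sketch of the SFTTrainer + LoRA pattern described above, not the actual `trl_training.py`. Exact `SFTTrainer`/`SFTConfig` argument names vary between TRL releases, so treat the parameter names and values here as assumptions.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Small slice of the Alpaca instruction-following dataset.
dataset = load_dataset("tatsu-lab/alpaca", split="train[:500]")

peft_config = LoraConfig(
    r=16,                      # LoRA rank
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="gpt2",              # model name or a preloaded AutoModelForCausalLM
    train_dataset=dataset,
    peft_config=peft_config,   # only the LoRA adapter weights are updated
    args=SFTConfig(
        output_dir="/workspace/checkpoints",
        per_device_train_batch_size=8,
        num_train_epochs=1,
    ),
)
trainer.train()
trainer.save_model("/workspace/checkpoints/final")
```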

---

## 📚 **References and Documentation**

- **[Kubeflow Trainer SDK](https://github.com/kubeflow/sdk)**: Official SDK documentation
- **[TRL Documentation](https://huggingface.co/docs/trl/)**: Transformer Reinforcement Learning
- **[PEFT Documentation](https://huggingface.co/docs/peft/)**: Parameter-Efficient Fine-Tuning
- **[PyTorch Distributed Training](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)**: Distributed training guide
- **[OpenShift AI Documentation](https://access.redhat.com/documentation/en-us/red_hat_openshift_ai)**: RHOAI documentation

---

**`manifests/cluster_training_runtime.yaml`** (new file)

@@ -0,0 +1,141 @@

> **Contributor:** We should aim to use one of the pre-installed ClusterTrainingRuntimes.
>
> **Contributor (author):** @astefanutti Yes, that was the plan, but at the moment the SDK doesn't allow providing volume mount specs for a TrainJob, so I had to explicitly add all the needed configs/env variables and volume mounts in the ClusterTrainingRuntime itself.
>
> **Contributor:** @abhijeet-dhumal right, that was my understanding. No, I mean we should try to fill the gaps and see how we can improve the SDK flexibility. Could that be one of the options we are adding to the SDK?
>
> **Contributor (author):** Yes, once we have this options fix merged for the SDK: kubeflow/sdk#91
>
> **Contributor (author):** I'm thinking of adjusting this PR again promptly, will definitely share the results soon.
>
> **Contributor:** I'd be inclined to keep that PR open until we close this gap.

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-cuda-custom
  labels:
    trainer.kubeflow.org/framework: torch
spec:
  mlPolicy:
    numNodes: 2
    torch:
      numProcPerNode: 1
  template:
    metadata: {}
    spec:
      replicatedJobs:
        - name: dataset-initializer
          replicas: 1
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: dataset-initializer
            spec:
              template:
                spec:
                  containers:
                    - env:
                        - name: HF_HOME
                          value: /workspace/cache
                        - name: DATASET_NAME
                          value: tatsu-lab/alpaca
                        - name: DATASET_CONFIG
                          value: main
                        - name: DATASET_SPLIT
                          value: 'train[:500]'
                        - name: DATASET_FORMAT
                          value: json
                        - name: WORKSPACE_PATH
                          value: /workspace
                      image: 'ghcr.io/kubeflow/trainer/dataset-initializer:v2.0.0'
                      name: dataset-initializer
                      resources:
                        limits:
                          cpu: '2'
                          memory: 4Gi
                        requests:
                          cpu: '1'
                          memory: 2Gi
                      volumeMounts:
                        - mountPath: /workspace
                          name: workspace
                  restartPolicy: Never
                  volumes:
                    - name: workspace
                      persistentVolumeClaim:
                        claimName: workspace
        - dependsOn:
            - name: dataset-initializer
              status: Complete
          name: model-initializer
          replicas: 1
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: model-initializer
            spec:
              template:
                spec:
                  containers:
                    - env:
                        - name: HF_HOME
                          value: /workspace/cache
                        - name: MODEL_NAME
                          value: gpt2
                        - name: MODEL_REVISION
                          value: main
                        - name: DOWNLOAD_MODE
                          value: force_redownload
                        - name: WORKSPACE_PATH
                          value: /workspace
                      image: 'ghcr.io/kubeflow/trainer/model-initializer:v2.0.0'
                      name: model-initializer
                      resources:
                        limits:
                          cpu: '2'
                          memory: 4Gi
                        requests:
                          cpu: '1'
                          memory: 2Gi
                      volumeMounts:
                        - mountPath: /workspace
                          name: workspace
                  restartPolicy: Never
                  volumes:
                    - name: workspace
                      persistentVolumeClaim:
                        claimName: workspace
        - dependsOn:
            - name: model-initializer
              status: Complete
          name: node
          replicas: 1
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: trainer
            spec:
              template:
                metadata: {}
                spec:
                  containers:
                    - env:
                        - name: PYTHONUNBUFFERED
                          value: '1'
                        - name: NCCL_DEBUG
                          value: INFO
                        - name: NCCL_SOCKET_IFNAME
                          value: eth0
                        - name: NCCL_IB_DISABLE
                          value: '1'
                        - name: NCCL_P2P_DISABLE
                          value: '1'
                        - name: TRAINJOB_PROGRESSION_FILE_PATH
                          value: /tmp/training_progression.json
                        - name: CHECKPOINT_DIR
                          value: /workspace/checkpoints
                      image: 'quay.io/modh/training:py311-cuda124-torch251'
                      name: node
                      resources:
                        limits:
                          cpu: '2'
                          memory: 4Gi
                        requests:
                          cpu: '1'
                          memory: 2Gi
                      volumeMounts:
                        - mountPath: /workspace
                          name: workspace
                  volumes:
                    - name: workspace
                      persistentVolumeClaim:
                        claimName: workspace
```
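With this runtime installed, a TrainJob that references it can be created from the workbench with the Kubeflow Trainer SDK rather than `oc`. The snippet below is only a sketch: the class and method names (`TrainerClient`, `CustomTrainer`, `get_runtime`, `get_job_logs`) reflect the SDK at the time of writing and may change as the TrainJob options work discussed in this PR (kubeflow/sdk#91, PodSpecOverrides) lands.

```python
from kubeflow.trainer import CustomTrainer, TrainerClient


def train_func():
    # Entry point executed on every node, e.g. the Fashion-MNIST or TRL
    # routine from the scripts/ directory.
    ...


client = TrainerClient()

job_name = client.train(
    runtime=client.get_runtime("torch-cuda-custom"),  # the ClusterTrainingRuntime above
    trainer=CustomTrainer(
        func=train_func,
        num_nodes=2,
        resources_per_node={"cpu": 2, "memory": "4Gi"},
    ),
)

# Stream the trainer logs until the job finishes.
for line in client.get_job_logs(job_name, follow=True):
    print(line)
```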

---

**`manifests/shared_pvc.yaml`** (new file)

@@ -0,0 +1,12 @@

> **Contributor:** Could the Kubernetes Python SDK be used in the notebook to create that PVC?
>
> **Contributor (author):** Yeah, that will work too, on it!
>
> **Contributor (author):** Then the ClusterTrainingRuntime can also be created, I guess using the Kubernetes custom_resource API, is it ok?

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: nfs-csi
  volumeMode: Filesystem
```

> **Contributor:** The trainer v2 component should probably be `trainer`.
>
> **Contributor (author):** Yeah, I will update it accordingly, thanks! Thinking is it ok to put it that way, as we don't have this utility available yet 🤔
>
> **Contributor:** Will this example be available before v2 is in RHOAI?
>
> **Contributor:** It would likely confuse people to have this example available using the v1 component, so it may be better to hold this PR and wait until v2 is in RHOAI, WDYT?
>
> **Contributor (author):** @Fiona-Waters @astefanutti Yeah, that makes sense, I will keep it as a draft for now! There are some upcoming changes in the Kubeflow SDK which will further simplify the overall workflow here, especially the TrainJob Options implementation, including the PodSpecOverrides capability, which will allow adding volume mounts without needing to customise the default ClusterTrainingRuntimes 👍 So it will in turn help reduce the `oc` dependencies expected from the user ✅