Commit 4e0fd36

Merge remote-tracking branch 'upstream/main'
2 parents 76ed73f + 9e9b173 commit 4e0fd36

11 files changed: +1557 additions, 0 deletions

examples/kfto-feast/README.md

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
# 🚀 Quickstart: Fine‑Tuning Granite Models with FSDP/DeepSpeed & LoRA/QLoRA Using a Simple Feast Store Example

This notebook guides you through fine-tuning **Large Language Models** using **Feast**, **Kubeflow Training**, and modern optimization strategies such as **FSDP**, **DeepSpeed**, and **LoRA** to boost training performance and efficiency.

In particular, this example demonstrates:

1. How to use **Fully Sharded Data Parallel (FSDP)** or **DeepSpeed** to distribute training across multiple GPUs, improving scalability and speed.
2. How to apply **Low-Rank Adaptation (LoRA)** or **Quantized Low-Rank Adaptation (QLoRA)** via the [PEFT library](https://github.com/huggingface/peft) for parameter-efficient fine-tuning, reducing computational and memory overhead (a minimal configuration sketch follows this list).
3. How to retrieve and manage **training features using Feast**, enabling consistent, scalable, and reproducible ML pipelines.
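
To make the LoRA/QLoRA idea concrete, here is a minimal sketch of a PEFT setup. The Granite checkpoint name, target modules, and hyperparameters below are illustrative assumptions, not the exact values used in the notebook.

```python
# Minimal LoRA/QLoRA sketch using PEFT: illustrative only; the notebook's
# actual model name, target modules, and hyperparameters may differ.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "ibm-granite/granite-3.1-2b-instruct"  # placeholder Granite checkpoint

# QLoRA: load the base model in 4-bit and freeze the quantized weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# LoRA: train only small low-rank adapter matrices on top of the frozen model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```
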
---

## 🍽️ What is Feast and How Are We Using It?

[Feast (Feature Store)](https://github.com/feast-dev/feast) is a powerful operational data system for machine learning that helps manage, store, and serve features consistently during training and inference. In this workflow, **Feast acts as the centralized source of truth for model features**, decoupling feature engineering from model training.

Specifically, we use Feast to:

- **Define and register feature definitions** for training data using a standardized interface.
- **Ingest and materialize features** from upstream sources (e.g., batch files, data warehouses).
- **Fetch training features** as PyTorch-friendly tensors, ensuring consistency across training and production.
- **Version-control feature sets** to improve reproducibility and traceability in experiments.

By integrating Feast into the fine-tuning pipeline, we ensure that the training process is not only scalable but also **robust, modular, and production-ready**. A minimal sketch of this workflow is shown below.
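
As a rough sketch of that flow, the snippet below defines a feature view over a local Parquet source and fetches point-in-time-correct training features. The entity, field names, and file paths are hypothetical and only illustrate the shape of the Feast API, not the feature repository used in this example.

```python
# Illustrative Feast sketch: entity, feature names, and paths are hypothetical.
from datetime import timedelta

import pandas as pd
from feast import Entity, FeatureStore, FeatureView, Field, FileSource
from feast.types import String

# A hypothetical entity and batch source for instruction-tuning records
sample = Entity(name="sample_id", join_keys=["sample_id"])
source = FileSource(path="data/training_samples.parquet", timestamp_field="event_timestamp")

training_features = FeatureView(
    name="training_features",
    entities=[sample],
    ttl=timedelta(days=365),
    schema=[Field(name="prompt", dtype=String), Field(name="response", dtype=String)],
    source=source,
)

store = FeatureStore(repo_path=".")        # assumes a feature_store.yaml in the repo
store.apply([sample, training_features])   # register the definitions

# Fetch point-in-time-correct training data for a set of sample IDs
entity_df = pd.DataFrame({
    "sample_id": [1, 2, 3],
    "event_timestamp": pd.Timestamp.now(tz="UTC"),
})
train_df = store.get_historical_features(
    entity_df=entity_df,
    features=["training_features:prompt", "training_features:response"],
).to_df()
```
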
---

## 💡 Why Use FSDP, DeepSpeed, LoRA, and Feast for Fine-Tuning?

- **Efficient Distributed Training:** Use FSDP or DeepSpeed to handle large models by distributing the training process across workers, enabling faster and more scalable training runs.
- **Parameter Efficiency with LoRA/QLoRA:** LoRA fine-tunes a model by updating only small low-rank adapter matrices instead of the full weight set, cutting compute and storage requirements dramatically (by ~90%) and speeding up training. QLoRA extends LoRA by loading the base model in 4-bit quantization, freezing the quantized weights, and training only the adapters, which makes it possible to fine-tune massive models on limited-memory GPUs.
- **Feature Management with Feast:** Fetch well-defined, version-controlled features seamlessly into your pipeline, boosting reproducibility and easing data integration.
- **Flexible Configuration Management:** Store DeepSpeed and LoRA settings in separate YAML files, allowing easy modification and experimentation without altering the core training script.
- **Mixed-Precision Training:** Leverage automatic mixed precision (AMP) to accelerate training and reduce memory usage by combining different numerical precisions.
- **Model Saving and Uploading:** Save the fine-tuned model and tokenizer locally and upload them to an S3 bucket for persistent storage and easy deployment (see the upload sketch after this list).
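
To make the last point concrete, here is a hedged sketch of uploading a locally saved model directory to an S3-compatible bucket with `boto3`. The environment variable names and the `models/` prefix are assumptions and may not match the notebook's actual upload logic.

```python
# Illustrative upload sketch: env var names and paths are assumptions,
# matching the variables a data connection typically injects.
import os
from pathlib import Path

import boto3

# Directory where the fine-tuned model and tokenizer were saved, e.g. via
# model.save_pretrained(output_dir) and tokenizer.save_pretrained(output_dir).
output_dir = Path("output/granite-lora")

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["AWS_S3_ENDPOINT"],          # S3-compatible endpoint
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
bucket = os.environ["AWS_S3_BUCKET"]

# Upload every file in the output directory under a models/ prefix
for path in output_dir.rglob("*"):
    if path.is_file():
        key = f"models/{path.relative_to(output_dir)}"
        s3.upload_file(str(path), bucket, key)
```
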
---

## Requirements

* An OpenShift cluster with OpenShift AI (RHOAI) 2.17+ installed:
  * The `dashboard`, `trainingoperator`, and `workbenches` components enabled
* Sufficient worker nodes for your configuration(s) with NVIDIA GPUs (Ampere-based or newer recommended)
  * If using the PEFT LoRA/QLoRA techniques, less powerful NVIDIA GPUs (e.g., AWS G4dn instances) can be used
* AWS S3 storage available

---

By following this notebook, you'll gain hands-on experience in setting up a **feature-rich, efficient, and scalable** fine-tuning pipeline for **Granite language models**, leveraging tooling across model training and feature engineering.

## Setup

* Access the OpenShift AI dashboard, for example from the top navigation bar menu:

  ![](./docs/01.png)

* Log in, then go to _Data Science Projects_ and create a project:

  ![](./docs/02.png)

* Once the project is created, click on _Create a workbench_:

  ![](./docs/03.png)

* Then create a workbench with the following settings:

  * Select the `PyTorch` (or the `ROCm-PyTorch`) notebook image:

    ![](./docs/04a.png)

  * Select the _Small_ container size and a sufficient persistent storage volume.

  * In the _Environment variables_ section, set the variable type to _Secret_ and provide a key/value pair to store your _HF-TOKEN_ as a Kubernetes secret:

    ![](./docs/04b.png)

* Click on _Create connection_ to create a workbench connection to your S3-compatible storage bucket:

  * Select the _S3 compatible object storage - v1_ option:

    ![](./docs/04ci.png)

  * Fill in all the required fields, including the _Bucket_ value (it is used in the workbench), then confirm:

    ![](./docs/04cii.png)

> [!NOTE]
>
> * Adding an accelerator is only needed to test the fine-tuned model from within the workbench, so you can skip it if accelerators are scarce.
> * Keep the default 20GB workbench storage; it is enough to run inference from within the workbench.
> * If you use a connection name other than _s3-data-connection_, adjust the _aws_connection_name_ variable in the notebook to refer to the new name.
> * A small environment sanity-check sketch is included at the end of this Setup section.

* Review the configuration and click _Create workbench_:

  ![](./docs/04d.png)

* From the _Workbenches_ page, click _Open_ when the workbench you've just created becomes ready:

  ![](./docs/05.png)

* From the workbench, clone this repository, i.e., `https://github.com/opendatahub-io/distributed-workloads.git`

* Navigate to the `distributed-workloads/examples/kfto-feast` directory and open the `kfto_feast` notebook

You can now proceed with the instructions from the notebook. Enjoy!
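
Optionally, before running through the notebook, you can sanity-check from a workbench cell that the secret and data connection created above are visible, and that a GPU is available if you attached one. The variable names below are assumptions based on how such secrets and connections are typically exposed; adjust them to the keys you actually configured.

```python
# Optional sanity check: adjust the variable names to the keys you configured
# for the HF token secret and the S3 data connection.
import os

import torch

for var in ["HF_TOKEN", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
            "AWS_S3_ENDPOINT", "AWS_S3_BUCKET"]:
    print(f"{var}: {'set' if os.environ.get(var) else 'MISSING'}")

# Only relevant if you added an accelerator to the workbench
print("CUDA available:", torch.cuda.is_available(), "| GPUs:", torch.cuda.device_count())
```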

examples/kfto-feast/docs/01.png (112 KB)
examples/kfto-feast/docs/02.png (64.7 KB)
examples/kfto-feast/docs/03.png (122 KB)
examples/kfto-feast/docs/04a.png (96.4 KB)
examples/kfto-feast/docs/04b.png (77.9 KB)
examples/kfto-feast/docs/04ci.png (124 KB)
examples/kfto-feast/docs/04cii.png (91 KB)
examples/kfto-feast/docs/04d.png (97 KB)
examples/kfto-feast/docs/05.png (118 KB)
