Commit 2330ef9

Merge remote-tracking branch 'upstream/main'
2 parents 25597f7 + a16a347

10 files changed: +708 -0 lines changed

Lines changed: 184 additions & 0 deletions

# LLM Fine-Tuning Workshop

## Requirements

* An OpenShift cluster with admin permissions (for the setup steps)
* The `oc`, `curl`, and `git` (or equivalent) binaries installed locally
* Enough worker nodes with NVIDIA GPUs (Ampere-based or newer recommended) or AMD GPUs (AMD Instinct MI300X)
* The NFD operator and the NVIDIA GPU operator or AMD GPU operator installed and configured
* A dynamic storage provisioner supporting RWX PVC provisioning (or see the NFS provisioner section)

## Setup

### Install OpenShift AI

* Log into your OpenShift Web console
* Go to "Operators" > "OperatorHub" > "AI/Machine Learning"
* Select the "Red Hat OpenShift AI" operator
* Install the latest version and create a default DataScienceCluster resource (a minimal sketch of such a resource follows below)
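
For reference, a minimal sketch of a DataScienceCluster resource; the resource name and the empty `components` block are assumptions, since creating the resource from the operator UI fills in defaults for every component:

```yaml
# Hypothetical minimal DataScienceCluster; creating one from the operator UI
# produces an equivalent resource with defaults for every component.
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc  # assumed name
spec:
  components: {}  # operator defaults; individual components can be toggled later
```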

### Check out the workshop

* Clone the following repository:
```console
git clone https://github.com/opendatahub-io/distributed-workloads.git
```
* Change directory:
```console
cd distributed-workloads/workshops/llm-fine-tuning
```

### NFS Provisioner (optional)

> [!NOTE]
> This is optional if your cluster already has a PVC dynamic provisioner with RWX support.

* Install the NFS CSI driver:
```console
curl -skSL https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/v4.9.0/deploy/install-driver.sh | bash -s v4.9.0 --
```
* Create a new project:
```console
oc new-project nfs
```
* Deploy the in-cluster NFS server:
```console
oc apply -f nfs/nfs_deployment.yaml
```
* Create the NFS StorageClass (a sketch of what it may contain follows below):
```console
oc apply -f nfs/nfs_storage_class.yaml
```
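
A minimal sketch of what `nfs/nfs_storage_class.yaml` might look like, assuming the in-cluster NFS server from the previous step is exposed through a `nfs-server.nfs.svc.cluster.local` Service (the server address and share path are assumptions; match them to the actual deployment):

```yaml
# Hypothetical sketch of nfs/nfs_storage_class.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
provisioner: nfs.csi.k8s.io  # CSI driver installed above
parameters:
  server: nfs-server.nfs.svc.cluster.local  # assumed Service DNS name
  share: /  # assumed exported path
reclaimPolicy: Delete
volumeBindingMode: Immediate
mountOptions:
  - nfsvers=4.1
```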

### Configure RHOAI

* Go to the OpenShift AI Dashboard (accessible from the applications menu in the top navigation bar)
* Go to "Settings" > "Storage classes"
* Check that the storage class supporting RWX PVC provisioning you plan to use, or the `nfs-csi` one created previously, is enabled

## Manage Quotas with Kueue

* Update the `nodeLabels` in the `kueue/resource_flavor.yaml` file to match those of your AI worker nodes (a sketch follows below)
* Create the ResourceFlavor:
```console
oc apply -f kueue/resource_flavor.yaml
```
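
A minimal sketch of what `kueue/resource_flavor.yaml` might look like; the `nodeLabels` shown are assumptions, so check your own nodes with `oc get nodes --show-labels`:

```yaml
# Hypothetical sketch of kueue/resource_flavor.yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
spec:
  nodeLabels:
    nvidia.com/gpu.present: "true"  # assumed GPU node label
```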

* Update the `team1` and `team2` ClusterQueues according to your cluster compute resources and the ResourceFlavor you've just created (a sketch follows below)
* Create the ClusterQueues:
```console
oc apply -f kueue/team1_cq.yaml -f kueue/team2_cq.yaml
```
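
A sketch of what `kueue/team1_cq.yaml` might look like (`team2_cq.yaml` would follow the same shape); the quota figures are assumptions and should be sized to your cluster:

```yaml
# Hypothetical sketch of kueue/team1_cq.yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team1
spec:
  namespaceSelector: {}  # admit workloads from any namespace
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor  # the ResourceFlavor created above
          resources:
            - name: cpu
              nominalQuota: 32      # assumed
            - name: memory
              nominalQuota: 256Gi   # assumed
            - name: nvidia.com/gpu
              nominalQuota: 4       # assumed
```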

## Fine-Tune Llama 3.1 with Ray

### Create a new project

* Go to the OpenShift AI Dashboard (accessible from the applications menu in the top navigation bar)
* Go to "Data Science Projects"
* Click "Create project"
* Choose a name and click "Create"

### Create a local queue

* From a terminal, create a LocalQueue pointing to your team ClusterQueue (a sketch follows below):
```console
oc apply -f kueue/local_queue.yaml
```
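
A sketch of what `kueue/local_queue.yaml` might look like; the namespace is an assumption and must be the Data Science Project you created above:

```yaml
# Hypothetical sketch of kueue/local_queue.yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: local-queue
  namespace: my-project  # assumed project name
  annotations:
    kueue.x-k8s.io/default-queue: "true"  # make it the namespace default queue
spec:
  clusterQueue: team1  # the ClusterQueue assigned to your team
```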

### Create a workbench

* In the project you've just created, click "Create workbench"
* Enter a name
* Select the "Standard Data Science" notebook image
* In "Cluster storage", click "Create storage"
* Enter `training-storage` as a name and select the storage class with RWX capability, or the `nfs-csi` one if you created it previously (a rough PVC equivalent of this storage follows the list below)
* Enter a mount directory under `/opt/app-root/src/`
* Click "Create workbench"
* Back on the project page, wait for the workbench to become ready, then open it
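
Behind the scenes, the dashboard provisions a PVC for the cluster storage; a rough YAML equivalent, assuming the `nfs-csi` storage class and an arbitrary 500Gi size, would be:

```yaml
# Rough equivalent of the "training-storage" cluster storage created by the
# dashboard (size and storage class are assumptions).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-storage
spec:
  accessModes:
    - ReadWriteMany  # RWX, so the workbench and the Ray pods can share it
  resources:
    requests:
      storage: 500Gi
  storageClassName: nfs-csi
```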

### Create a Ray cluster

* In the workbench you've just created, clone the https://github.com/opendatahub-io/distributed-workloads.git repository (you can click on "Git clone" under the top menu)
* Navigate to "distributed-workloads" / "examples" / "ray-finetune-llm-deepspeed"
* Open the "ray_finetune_llm_deepspeed.ipynb" notebook
* In the "Authenticate the CodeFlare SDK" cell, enter your cluster API server URL and your authorization token
  * The token can either be retrieved by running `oc whoami -t`,
  * Or from the OpenShift Web console: click on the user name at the right-hand side of the top navigation bar, then select "Copy login command"
* In the "Configure the Ray cluster" cell:
  * Add the following fields to the `ClusterConfiguration`:
    ```python
    volumes=[
        V1Volume(
            name="training-storage",
            persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(claim_name="training-storage"),
        ),
    ],
    volume_mounts=[
        V1VolumeMount(name="training-storage", mount_path="/opt/app-root/src/training/"),
    ],
    ```
  * Review the compute resources so they match those of your cluster

### Open the Ray cluster dashboard

* Wait until the Ray cluster becomes ready
* Once you've executed the `cluster.details()` cell, you can click on the Ray cluster dashboard URL printed in the output

### Submit the fine-tuning job

* In the "Storage configuration" cell, set the `storage_path` variable to `/opt/app-root/src/training`
* In the "Job submission" cell:
  * Add the `HF_HOME` environment variable and set it to `f'{storage_path}/.cache'`
  * Review the compute resources so they match those of the Ray cluster you've created

### Monitor training with TensorBoard

* Install TensorBoard in the Ray head node:
```console
oc exec `oc get pod -l ray.io/node-type=head -o name` -- pip install tensorboard
```
* Start the TensorBoard server:
```console
oc exec `oc get pod -l ray.io/node-type=head -o name` -- tensorboard --logdir /tmp/ray --bind_all --port 6006
```
* Port-forward the TensorBoard UI endpoint:
```console
oc port-forward `oc get pod -l ray.io/node-type=head -o name` 6006:6006
```
* Access TensorBoard at http://localhost:6006

## Fine-Tune Llama 3.1 with Kubeflow Training

### Enable the training operator

* In the OpenShift Web console, navigate to the default DataScienceCluster resource (under the installed "Red Hat OpenShift AI" operator)
* Set the `trainingoperator` component's `managementState` to `Managed` (a sketch follows below)
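
A sketch of the relevant part of the DataScienceCluster spec, assuming the component is exposed as `trainingoperator` (only the relevant field is shown; the resource name is an assumption):

```yaml
# Hypothetical sketch of the DataScienceCluster change that enables the
# Kubeflow training operator.
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc  # assumed name
spec:
  components:
    trainingoperator:
      managementState: Managed
```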

### Configure the fine-tuning job

* Review / edit the `kfto/config.yaml` configuration file
* Create the fine-tuning job ConfigMap by running the following command (the rough shape of the resulting ConfigMap is sketched below):
```console
oc create configmap llm-training --from-file=config.yaml=kfto/config.yaml --from-file=sft.py=kfto/sft.py
```
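
The command above packs both files into a single ConfigMap shaped roughly like this (file contents elided):

```yaml
# Rough shape of the ConfigMap created by the command above
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-training
data:
  config.yaml: |
    # contents of kfto/config.yaml
  sft.py: |
    # contents of kfto/sft.py
```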

### Create the fine-tuning job

* Review / edit the `kfto/job.yaml` file
* Set the value of the `HF_TOKEN` environment variable if needed
* Create the fine-tuning PyTorchJob by running:
```console
oc apply -f kfto/job.yaml
```

### Monitor training with TensorBoard

* Start the TensorBoard server:
```console
oc exec `oc get pod -l training.kubeflow.org/job-role=master -o name` -- tensorboard --logdir /mnt/runs --bind_all --port 6006
```
* Port-forward the TensorBoard UI endpoint:
```console
oc port-forward `oc get pod -l training.kubeflow.org/job-role=master -o name` 6006:6006
```
* Access TensorBoard at http://localhost:6006

Lines changed: 64 additions & 0 deletions

# Model
model_id_or_path: meta-llama/Meta-Llama-3.1-8B-Instruct
#tokenizer_name_or_path: meta-llama/Meta-Llama-3.1-8B-Instruct
#model_revision: main
torch_dtype: bfloat16
#attn_implementation: flash_attention_2
#use_liger: true
bf16: true # use bfloat16 precision
tf32: true # use tf32 precision

# Quantization / BitsAndBytes
use_bnb: false
load_in_4bit: true

# LoRA / PEFT
use_peft: true
lora_target_modules: "all-linear"
lora_modules_to_save: ["lm_head", "embed_tokens"]
lora_r: 16
lora_alpha: 8
lora_dropout: 0.05

# SFT
dataset_id_or_path: gsm8k # id or path to the dataset
dataset_config_name: main # name of the dataset configuration
#dataset_batch_size: 64 # mini batch size
max_seq_length: 512 # max sequence length for model and packing of the dataset
packing: true

# FSDP
fsdp: "full_shard auto_wrap offload" # remove offload if enough GPU memory
fsdp_config:
  backward_prefetch: "backward_pre"
  forward_prefetch: "false"
  use_orig_params: "false"

# Training
num_train_epochs: 20 # number of training epochs

per_device_train_batch_size: 8 # batch size per device during training
per_device_eval_batch_size: 8 # batch size for evaluation
evaluation_strategy: epoch # evaluate every epoch

max_grad_norm: 0.3 # max gradient norm
gradient_accumulation_steps: 1 # number of steps before performing a backward/update pass
learning_rate: 2.0e-4 # learning rate
lr_scheduler_type: constant # learning rate scheduler
optim: adamw_torch # use torch adamw optimizer
warmup_ratio: 0.1 # warmup ratio
seed: 42

# Checkpointing
gradient_checkpointing: true # use gradient checkpointing to save memory
gradient_checkpointing_kwargs:
  use_reentrant: false
save_strategy: "epoch" # save checkpoint every epoch

# Logging
logging_strategy: steps
logging_steps: 1 # log every step
report_to:
  - tensorboard # report metrics to tensorboard

output_dir: /mnt/nfs/runs/Meta-Llama-3.1-8B-Instruct

Lines changed: 74 additions & 0 deletions

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata: &metadata
          labels:
            app: llm-training
        spec:
          affinity: &affinity
          containers: &containers
            - command:
                - /bin/bash
                - -c
                - "pip install tensorboard && torchrun /etc/config/sft.py --config /etc/config/config.yaml"
              env:
                - name: HF_HOME
                  value: /mnt/.cache
                - name: HF_TOKEN
                  value: ""
                - name: TRITON_CACHE_DIR
                  value: /tmp/.triton
                - name: TOKENIZERS_PARALLELISM
                  value: "false"
                - name: PYTORCH_CUDA_ALLOC_CONF
                  value: "expandable_segments:True"
                - name: NCCL_DEBUG
                  value: INFO
              image: quay.io/modh/training:py311-cuda121-torch241
              imagePullPolicy: IfNotPresent
              name: pytorch
              resources:
                limits:
                  cpu: "4"
                  memory: 64Gi
                  nvidia.com/gpu: "1"
                requests:
                  cpu: "4"
                  memory: 64Gi
                  nvidia.com/gpu: "1"
              volumeMounts:
                - mountPath: /etc/config
                  name: config
                - mountPath: /tmp
                  name: tmp
                - mountPath: /mnt
                  name: training-storage
          tolerations: &tolerations
            - key: nvidia.com/gpu
              operator: Exists
          volumes: &volumes
            - configMap:
                name: llm-training
              name: config
            - emptyDir: {}
              name: tmp
            - name: training-storage
              persistentVolumeClaim:
                claimName: training-storage
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata: *metadata
        spec:
          affinity: *affinity
          containers: *containers
          tolerations: *tolerations
          volumes: *volumes
