# LLM Fine-Tuning Workshop

## Requirements

* An OpenShift cluster with admin permissions (for the setup steps)
* The `oc`, `curl`, and `git` (or equivalent) binaries installed locally
* Enough worker nodes with NVIDIA GPUs (Ampere-based or newer recommended) or AMD GPUs (AMD Instinct MI300X)
* The NFD operator and the NVIDIA GPU operator or AMD GPU operator installed and configured
* A dynamic storage provisioner supporting RWX PVC provisioning (or see the NFS provisioner section)
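
On a cluster with the NVIDIA GPU operator, you can sanity-check GPU availability with something like the following (the `nvidia.com/gpu.present` label is set by the operator's GPU feature discovery; AMD clusters use different labels):
```console
# List worker nodes advertising NVIDIA GPUs
oc get nodes -l nvidia.com/gpu.present=true
# Inspect the allocatable GPU count on each of them
oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```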

## Setup

### Install OpenShift AI

* Log into your OpenShift Web console
* Go to "Operators" > "OperatorHub" > "AI/Machine Learning"
* Select the "Red Hat OpenShift AI" operator
* Install the latest version and create a default DataScienceCluster resource
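
Once the operator is installed, you can check the DataScienceCluster from a terminal and wait for it to report Ready:
```console
# Lists the DataScienceCluster(s) with their readiness status
oc get datasciencecluster
```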

### Checkout the workshop

* Clone the following repository:
  ```console
  git clone https://github.com/opendatahub-io/distributed-workloads.git
  ```
* Change directory:
  ```console
  cd distributed-workloads/workshops/llm-fine-tuning
  ```

### NFS Provisioner (optional)

> [!NOTE]
> This is optional if your cluster already has a PVC dynamic provisioner with RWX support.

* Install the NFS CSI driver:
  ```console
  curl -skSL https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/v4.9.0/deploy/install-driver.sh | bash -s v4.9.0 --
  ```
* Create a new project:
  ```console
  oc new-project nfs
  ```
* Deploy the in-cluster NFS server:
  ```console
  oc apply -f nfs/nfs_deployment.yaml
  ```
* Create the NFS StorageClass:
  ```console
  oc apply -f nfs/nfs_storage_class.yaml
  ```
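
You can verify that RWX provisioning works with a throwaway PVC (the PVC name is illustrative):
```console
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwx-test
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-csi
  resources:
    requests:
      storage: 1Gi
EOF
# The PVC should reach the Bound phase (unless the class uses
# WaitForFirstConsumer binding, in which case it stays Pending until mounted)
oc get pvc rwx-test
oc delete pvc rwx-test
```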

### Configure RHOAI

* Go to the OpenShift AI dashboard (accessible from the applications menu in the top navigation bar)
* Go to "Settings" > "Storage classes"
* Check that the storage class you plan to use for RWX PVC provisioning, or the `nfs-csi` one created previously, is enabled

## Manage Quotas with Kueue

* Update the `nodeLabels` in the `kueue/resource_flavor.yaml` file to match those of your AI worker nodes (see the sketch after this list)
* Create the ResourceFlavor:
  ```console
  oc apply -f kueue/resource_flavor.yaml
  ```
* Update the `team1` and `team2` ClusterQueues according to your cluster compute resources and the ResourceFlavor you've just created
* Create the ClusterQueues:
  ```console
  oc apply -f kueue/team1_cq.yaml -f kueue/team2_cq.yaml
  ```
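
For reference, a ResourceFlavor targeting GPU nodes, and a ClusterQueue consuming it, look roughly like this (the label and quota values are illustrative and must match your cluster):
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-flavor
spec:
  nodeLabels:
    # Illustrative: use the labels actually carried by your AI worker nodes
    nvidia.com/gpu.present: "true"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team1
spec:
  namespaceSelector: {}  # admit workloads from any namespace with a matching LocalQueue
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: gpu-flavor
          resources:
            - name: cpu
              nominalQuota: 64
            - name: memory
              nominalQuota: 512Gi
            - name: nvidia.com/gpu
              nominalQuota: 8
```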

## Fine-Tune Llama 3.1 with Ray

### Create a new project

* Go to the OpenShift AI dashboard (accessible from the applications menu in the top navigation bar)
* Go to "Data Science Projects"
* Click "Create project"
* Choose a name and click "Create"

### Create a local queue

* From a terminal, create a LocalQueue pointing to your team ClusterQueue, making sure it lands in the namespace of the project you've just created (replace `<project-name>` accordingly):
  ```console
  oc apply -f kueue/local_queue.yaml -n <project-name>
  ```
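
The file should resemble the following sketch (the queue names are assumptions based on the ClusterQueues created above):
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: local-queue
spec:
  # Points workloads submitted in this namespace at the team's ClusterQueue
  clusterQueue: team1
```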

### Create a workbench

* In the project you've just created, click "Create workbench"
* Enter a name
* Select the "Standard Data Science" notebook image
* In "Cluster storage", click "Create storage"
* Enter `training-storage` as the name and select the storage class with RWX capability, or the `nfs-csi` one if you created it previously
* Enter a mount directory under `/opt/app-root/src/`
* Click "Create workbench"
* Back on the project page, wait for the workbench to become ready, then open it

### Create a Ray cluster

* In the workbench you've just created, clone the https://github.com/opendatahub-io/distributed-workloads.git repository (you can use the "Git Clone" button in the top menu)
* Navigate to "distributed-workloads" / "examples" / "ray-finetune-llm-deepspeed"
* Open the "ray_finetune_llm_deepspeed.ipynb" notebook
* In the "Authenticate the CodeFlare SDK" cell, enter your cluster API server URL and your authorization token:
  * The token can be retrieved by running `oc whoami -t`,
  * Or, from the OpenShift Web console, click your user name at the right-hand side of the top navigation bar and select "Copy login command"
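
  For reference, the authentication cell follows this pattern (the token and server values are placeholders):
  ```python
  from codeflare_sdk import TokenAuthentication

  # Log the SDK into the cluster; replace the placeholders with your own values
  auth = TokenAuthentication(
      token="sha256~<your-token>",
      server="https://api.<your-cluster>:6443",
      skip_tls=False,
  )
  auth.login()
  ```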
* In the "Configure the Ray cluster" cell:
  * Add the following fields to the `ClusterConfiguration`:
    ```python
    volumes=[
        V1Volume(
            name="training-storage",
            persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(claim_name="training-storage"),
        ),
    ],
    volume_mounts=[
        V1VolumeMount(name="training-storage", mount_path="/opt/app-root/src/training/"),
    ],
    ```
  * Review the compute resources so they match those of your cluster

### Open the Ray cluster dashboard

* Wait until the Ray cluster becomes ready
* Once you've executed the `cluster.details()` cell, you can click on the Ray cluster dashboard URL printed in the output

### Submit the fine-tuning job

* In the "Storage configuration" cell, set the `storage_path` variable to `/opt/app-root/src/training`
* In the "job submission" cell:
  * Add the `HF_HOME` environment variable and set it to `f'{storage_path}/.cache'`
  * Review the compute resources so they match those of the Ray cluster you've created (a submission sketch follows below)
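
  For reference, a submission through the Ray job client follows this pattern (the entrypoint script and flags below are illustrative; use those from the notebook):
  ```python
  # cluster is the CodeFlare SDK Cluster object created earlier;
  # its job_client is a Ray JobSubmissionClient bound to the Ray cluster.
  client = cluster.job_client

  submission_id = client.submit_job(
      entrypoint="python ray_finetune_llm_deepspeed.py",  # illustrative
      runtime_env={
          "working_dir": "./",
          "env_vars": {
              # Cache models and datasets on the shared RWX volume
              "HF_HOME": f"{storage_path}/.cache",
          },
      },
  )
  print(submission_id)
  ```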

### Monitor training with TensorBoard

* Install TensorBoard in the Ray head node:
  ```console
  oc exec `oc get pod -l ray.io/node-type=head -o name` -- pip install tensorboard
  ```
* Start the TensorBoard server:
  ```console
  oc exec `oc get pod -l ray.io/node-type=head -o name` -- tensorboard --logdir /tmp/ray --bind_all --port 6006
  ```
* Port-forward the TensorBoard UI endpoint:
  ```console
  oc port-forward `oc get pod -l ray.io/node-type=head -o name` 6006:6006
  ```
* Access TensorBoard at http://localhost:6006

## Fine-Tune Llama 3.1 with Kubeflow Training

### Enable the training operator

* In the OpenShift Web console, navigate to the installed "Red Hat OpenShift AI" operator and open its default DataScienceCluster resource
* Set the `trainingoperator` component's `managementState` to `Managed` and save
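
This can also be done from a terminal (the resource name below assumes the default DataScienceCluster):
```console
oc patch datasciencecluster default-dsc --type merge \
  -p '{"spec": {"components": {"trainingoperator": {"managementState": "Managed"}}}}'
```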

### Configure the fine-tuning job

* Review / edit the `kfto/config.yaml` configuration file
* Create the fine-tuning job ConfigMap by running:
  ```console
  oc create configmap llm-training --from-file=config.yaml=kfto/config.yaml --from-file=sft.py=kfto/sft.py
  ```
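
As an illustration only, an SFT configuration of this kind typically carries entries such as the following; the actual fields are defined by `kfto/sft.py` and the `kfto/config.yaml` file in this repository:
```yaml
# Illustrative values: review kfto/config.yaml for the real fields
model_name_or_path: meta-llama/Meta-Llama-3.1-8B
num_train_epochs: 1
per_device_train_batch_size: 1
output_dir: /mnt/output
```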

### Create the fine-tuning job

* Review / edit the `kfto/job.yaml` file
* Set the value of the `HF_TOKEN` environment variable if needed
* Create the fine-tuning PyTorchJob by running:
  ```console
  oc apply -f kfto/job.yaml
  ```
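
The job manifest is a standard Kubeflow PyTorchJob; its general shape is sketched below (names, labels, and values are illustrative; the real manifest is `kfto/job.yaml`):
```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-training
  labels:
    # Assumed: routes the job through the LocalQueue created for your team
    kueue.x-k8s.io/queue-name: local-queue
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              env:
                - name: HF_TOKEN
                  value: <your-hugging-face-token>
```
You can then follow the job status with `oc get pytorchjob -w`.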

### Monitor training with TensorBoard

* Start the TensorBoard server:
  ```console
  oc exec `oc get pod -l training.kubeflow.org/job-role=master -o name` -- tensorboard --logdir /mnt/runs --bind_all --port 6006
  ```
* Port-forward the TensorBoard UI endpoint:
  ```console
  oc port-forward `oc get pod -l training.kubeflow.org/job-role=master -o name` 6006:6006
  ```
* Access TensorBoard at http://localhost:6006