setup.KubeConEU25/README.md
In this example, `alice` uses [KubeRay](https://github.com/ray-project/kuberay) to run a job that uses [Ray](https://github.com/ray-project/ray) to fine-tune a machine learning model.

This workload is adapted from [this blog post by Red Hat](https://developers.redhat.com/articles/2024/09/30/fine-tune-llama-openshift-ai), which is in turn adapted from [an example in the Ray documentation](https://github.com/ray-project/ray/tree/master/doc/source/templates/04_finetuning_llms_with_deepspeed).

The example fine-tunes Llama 3.1 with Ray, using DeepSpeed and LoRA.

<details>

Let's set up the environment by installing Ray and cloning the repository. We are going to impersonate Alice in this example.

First, we create the PVC where we can download the model and save the checkpoints from the fine-tuning job. We call this PVC `finetuning-pvc`, and that name must match the Ray cluster YAML: if another name is used, update the `claimName` entry in the Ray cluster definition accordingly.

```bash
kubectl apply --as alice -n blue -f- << EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: finetuning-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: nfs-client-pokprod
EOF
```
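Before moving on, it can be useful to confirm that the PVC was created and bound. This is an optional check that requires access to the live cluster; `finetuning-pvc` and the `blue` namespace are the names used above:

```shell
# Check that the PVC exists; STATUS should eventually report "Bound".
kubectl get pvc finetuning-pvc -n blue --as alice
```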

Now, let's create an AppWrapper version of the Ray cluster. Notice that:

- We are using the container image `quay.io/rhoai/ray:2.35.0-py311-cu121-torch24-fa26` from Red Hat, but you can use the images from DockerHub if preferred.
- We are setting the number of worker replicas to `7`. Since we want to run on one GPU node, we assign one GPU to the Ray head pod and one to each of the 7 worker pods.
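
For orientation, the relevant part of the RayCluster spec inside the AppWrapper might look like the sketch below. This is not the full definition used in this example; the group name, volume name, and mount path are hypothetical, but the image, replica count, GPU request, and `claimName` correspond to the notes above.

```yaml
# Sketch only: relevant RayCluster worker-group fields, not the complete AppWrapper.
workerGroupSpecs:
- groupName: workers            # hypothetical name
  replicas: 7
  template:
    spec:
      containers:
      - name: ray-worker
        image: quay.io/rhoai/ray:2.35.0-py311-cu121-torch24-fa26
        resources:
          limits:
            nvidia.com/gpu: 1   # one GPU per worker pod
        volumeMounts:
        - name: model-storage   # hypothetical volume name
          mountPath: /model     # hypothetical mount path
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: finetuning-pvc  # must match the PVC created earlier
```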

Now we can submit it while impersonating Alice:

```bash
kubectl create -f ray-aw.yaml -n blue --as alice
```

Now that the Ray cluster is set up, we first need to expose the `ray-head` service, as that is the entrypoint for all job submissions. In another terminal, type:

```bash
kubectl port-forward svc/ray-head-svc 8265:8265 -n blue --as alice
```
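
As a quick sanity check (assuming the port-forward above is running and the cluster is reachable), the Ray dashboard should answer on the forwarded port; `/api/version` is an endpoint served by the Ray dashboard:

```shell
# Should return a small JSON document with the Ray version if the forward works.
curl -s http://localhost:8265/api/version
```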

Now we can download the git repository with the fine-tuning workload.

```
2025-03-24 16:37:53,029 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_21ddaa8b13d30deb.zip.
2025-03-24 16:37:53,030 INFO packaging.py:575 -- Creating a file package for local module './'.
Use the following command to follow this Job's logs:
ray job logs 'raysubmit_C6hVCvdhpmapgQB8' --address http://127.0.0.1:8265 --follow
```

We can now either follow the logs in the terminal with the `ray job logs` command, or open the Ray dashboard and follow them from there. The dashboard is available at `http://localhost:8265`, since we exposed the `ray-head` service earlier.
Once the job is completed, the checkpoint with the fine-tuned model is saved in the folder