In this example, `alice` uses [KubeRay](https://github.com/ray-project/kuberay) to run a job that uses [Ray](https://github.com/ray-project/ray) to fine-tune a machine learning model.

This workload is adapted from [this blog post by Red Hat](https://developers.redhat.com/articles/2024/09/30/fine-tune-llama-openshift-ai), in turn adapted from [an example in the Ray documentation](https://github.com/ray-project/ray/tree/master/doc/source/templates/04_finetuning_llms_with_deepspeed). The example fine-tunes Llama 3.1 with Ray, using DeepSpeed and LoRA.

<details>

Let's set up the environment by installing Ray and cloning the repository. We are going to impersonate Alice in this example.

First, we create the PVC where we download the model and save the checkpoints from the fine-tuning job. We call this PVC `finetuning-pvc`, and it is referenced from the Ray cluster YAML. If another name is used, update the `claimName` entry in the Ray cluster definition accordingly.

```bash
kubectl apply --as alice -n blue -f- <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: finetuning-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: nfs-client-pokprod
EOF
```
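
Before moving on, it is worth confirming that the claim binds (the exact output varies by cluster):

```bash
# Verify the PVC was created and is Bound.
kubectl get pvc finetuning-pvc -n blue --as alice
```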
Now, let's create an AppWrapper version of the Ray cluster. Notice that:

- We are using the container image `quay.io/rhoai/ray:2.35.0-py311-cu121-torch24-fa26` from Red Hat, but you can use the images from Docker Hub if preferred.
- We are setting the number of worker replicas to `7`. Since we want to run on a single GPU node, we assign one GPU to the Ray head pod and one to each of the 7 worker pods (8 GPUs in total). A minimal sketch of the resulting manifest follows below.
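
The full `ray-aw.yaml` is longer than what is reproduced here; the following is a minimal sketch of its shape, an AppWrapper wrapping a `RayCluster`. The image, replica count, GPU assignments, and `claimName` follow the notes above; the mount path and object names are illustrative assumptions, not the exact manifest:

```bash
# Minimal sketch of ray-aw.yaml: an AppWrapper wrapping a RayCluster.
# The mount path and object names are assumptions, not the exact manifest.
cat > ray-aw.yaml <<EOF
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  name: ray
spec:
  components:
  - template:
      apiVersion: ray.io/v1
      kind: RayCluster
      metadata:
        name: ray
      spec:
        rayVersion: '2.35.0'
        headGroupSpec:
          rayStartParams: {}
          template:
            spec:
              containers:
              - name: ray-head
                image: quay.io/rhoai/ray:2.35.0-py311-cu121-torch24-fa26
                resources:
                  limits:
                    nvidia.com/gpu: 1
                volumeMounts:
                - name: finetuning
                  mountPath: /model
              volumes:
              - name: finetuning
                persistentVolumeClaim:
                  claimName: finetuning-pvc
        workerGroupSpecs:
        - groupName: workers
          replicas: 7
          minReplicas: 7
          maxReplicas: 7
          rayStartParams: {}
          template:
            spec:
              containers:
              - name: ray-worker
                image: quay.io/rhoai/ray:2.35.0-py311-cu121-torch24-fa26
                resources:
                  limits:
                    nvidia.com/gpu: 1
                volumeMounts:
                - name: finetuning
                  mountPath: /model
              volumes:
              - name: finetuning
                persistentVolumeClaim:
                  claimName: finetuning-pvc
EOF
```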

Now we can submit the AppWrapper, which creates the Ray cluster, while impersonating Alice:

```bash
kubectl create -f ray-aw.yaml -n blue --as alice
```
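
To check that the AppWrapper has been admitted and the head and worker pods are starting, something like the following works (pod names will differ):

```bash
# Watch the AppWrapper and the Ray head/worker pods come up.
kubectl get appwrappers -n blue --as alice
kubectl get pods -n blue --as alice
```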
Now that the Ray cluster is set up, we need to expose the `ray-head` service, as it is the entrypoint for all job submissions. In another terminal, type:

```bash
kubectl port-forward svc/ray-head-svc 8265:8265 -n blue --as alice
```
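
To confirm the forward is working, the dashboard's version endpoint (part of Ray's job submission API) can be queried:

```bash
# The Ray dashboard should answer on the forwarded port.
curl -s http://127.0.0.1:8265/api/version
```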
Now we can download the Git repository with the fine-tuning workload.
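
With the repository cloned, the job is submitted through the forwarded port. A minimal sketch of the submission, assuming a hypothetical entrypoint `train.py` (the actual script and its arguments come from the cloned repository):

```bash
# Submit the fine-tuning job to the Ray cluster through the forwarded port.
# The entrypoint below is a placeholder for the repository's training script.
ray job submit \
  --address http://127.0.0.1:8265 \
  --working-dir . \
  -- python train.py
```

The submission uploads the working directory and prints output similar to:
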
```
2025-03-24 16:37:53,029 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_21ddaa8b13d30deb.zip.
2025-03-24 16:37:53,030 INFO packaging.py:575 -- Creating a file package for local module './'.
Use the following command to follow this Job's logs:
ray job logs 'raysubmit_C6hVCvdhpmapgQB8' --address http://127.0.0.1:8265 --follow
```
We can now either follow the logs in the terminal with the `ray job logs` command, or open the Ray dashboard and follow them there. The dashboard is reachable at `http://localhost:8265`, since we exposed the service earlier.

Once the job is completed, the checkpoint with the fine-tuned model is saved in the folder