# Ray Cluster Setup on Google Cloud Platform (GCP)

This tutorial covers the setup of Ray Clusters on GCP. Ray Clusters are a way to
run compute-intensive jobs (e.g., Autotuner) on a distributed set of nodes that are spawned
automatically. For more information, refer to the [Ray Cluster documentation](https://docs.ray.io/en/latest/cluster/getting-started.html).

To run Autotuner jobs on a Ray Cluster, we first have to install ORFS onto the
GCP nodes.

How does this differ from the previous Kubernetes approach?
- Support for autoscaling
- Faster startup time using Docker (no need for JIT rebuilds of runtime dependencies)
- Simplified architecture and codebase

There are two ways to set up ORFS on a Ray Cluster:
- [Public](#public-cluster-setup): Upload the Docker image to Docker Hub (or any public Docker registry).
- [Private](#private-cluster-setup): Upload the Docker image to a private registry. Authentication then needs to be handled for Kubernetes.

```note
Currently, it appears that the `autoscaler.yaml` file may only be used with `public.yaml`.
For private deployments, we may have to use KubeRay:
1. https://github.com/GoogleCloudPlatform/ai-on-gke/tree/main/ray-on-gke
2. https://www.paulsblog.dev/how-to-install-a-private-docker-container-registry-in-kubernetes/
```

## TODO

- Look up how to preserve the cache during `pip install`.
- Public flow, fixed: via the Autotuner script
  - Tune
  - Sweep
- Public flow, fixed: via the Ray API
- Public flow, autoscaling
- Test the same flow using a private registry on Docker Hub
- Scaling concerns
  - Increase storage of the head node
  - Object store memory: does it affect file transfer?

## Prerequisites

Make sure the Autotuner prerequisites are installed. To do so, refer to the installation script.

```bash
pip install "ray[default]" google-api-python-client cryptography cloudpathlib
```
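
As a quick optional sanity check (not part of the original flow), you can confirm the Ray CLI is on your `PATH` and that the Python packages import cleanly:

```bash
# Check the installed Ray version
ray --version

# Confirm the Python packages can be imported
python3 -c "import ray, googleapiclient, cryptography, cloudpathlib"
```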

## Public cluster setup

0a. Authenticate with a GCP account that has sufficient privileges to perform:
- `setIamPolicy`

```bash
gcloud auth application-default login
```
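
To double-check which account is active and whether it can modify IAM policy, something like the following may help (`<project_id>` is a placeholder you must substitute):

```bash
# Show the currently authenticated accounts
gcloud auth list

# Inspect the project's IAM policy; the active account needs a role that
# includes setIamPolicy (e.g. roles/owner or roles/resourcemanager.projectIamAdmin)
gcloud projects get-iam-policy <project_id>
```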

0b. Generate a service account key for `ray-autoscaler-sa-v1@<project_id>.iam.gserviceaccount.com`
and rename it to `service_account.json`.
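
If you prefer the CLI over the Cloud Console, a key can be created along these lines (a sketch assuming the `ray-autoscaler-sa-v1` service account already exists in your project):

```bash
# Create a JSON key for the Ray autoscaler service account and save it
# directly under the expected name; substitute <project_id> with your project
gcloud iam service-accounts keys create service_account.json \
  --iam-account=ray-autoscaler-sa-v1@<project_id>.iam.gserviceaccount.com
```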

1. Set up `.env` with your Docker registry username/password. Also, set up the `public.yaml`
file according to your desired specifications.

```bash
cp .env.sample .env
cp public.yaml.template public.yaml
```
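
The exact variables come from `.env.sample`; as an illustration only (the variable names below are assumptions, so follow the sample file), the registry credentials typically look like:

```bash
# .env -- example contents; variable names are hypothetical, check .env.sample
DOCKER_USERNAME=<your_registry_username>
DOCKER_PASSWORD=<your_registry_password>
```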

2. Run the following commands to build, tag, and upload the public image:

```bash
make clean
make base
make docker
make upload
```
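
If the upload step fails with an authentication error, logging in to the registry manually may help (an assumption; the Makefile may already log in using the `.env` credentials):

```bash
# Authenticate against Docker Hub (or your chosen registry) before pushing;
# you will be prompted for the password
docker login -u <your_registry_username>
```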

3. Launch your cluster as follows:

```bash
make up
```
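
The job submission examples below target `http://localhost:8265`, which assumes the Ray dashboard is forwarded from the head node. If `make up` wraps the Ray cluster launcher with `public.yaml` (an assumption; check the Makefile), the standard Ray CLI can set this up:

```bash
# Forward the head node's Ray dashboard to http://localhost:8265
ray dashboard public.yaml

# Open an SSH session on the head node (run `ray status` there to see cluster resources)
ray attach public.yaml
```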

4. Submit jobs through the Ray CLI API:

```bash
# Commands run from the local machine (assumes files/commands are already present on the cluster)
ray job submit --address http://localhost:8265 ls

# Case 1: 1 job
ray job submit --address http://localhost:8265 -- python3 -m autotuner.distributed --design gcd --platform asap7 --config ../../flow/designs/asap7/gcd/autotuner.json --cloud_dir gs://autotuner_test tune --samples 1

# Case 2A: 2 jobs, with resource spec
HEAD_SERVER=10.138.0.13
ray job submit --address http://localhost:8265 --entrypoint-num-cpus 2 -- python3 -m autotuner.distributed --design gcd --platform asap7 --server $HEAD_SERVER --config ../../flow/designs/asap7/gcd/autotuner.json --cloud_dir gs://autotuner_test tune --samples 1
ray job submit --address http://localhost:8265 --entrypoint-num-cpus 2 -- python3 -m autotuner.distributed --design gcd --platform asap7 --server $HEAD_SERVER --config ../../flow/designs/asap7/gcd/autotuner.json --cloud_dir gs://autotuner_test tune --samples 1

# Case 2B: 2 jobs, with resource spec (sweep)
HEAD_SERVER=10.138.0.13
ray job submit --address http://localhost:8265 --entrypoint-num-cpus 2 -- python3 -m autotuner.distributed --design gcd --platform asap7 --server $HEAD_SERVER --config ./src/autotuner/distributed-sweep-example.json --cloud_dir gs://autotuner_test sweep
ray job submit --address http://localhost:8265 --entrypoint-num-cpus 2 -- python3 -m autotuner.distributed --design gcd --platform asap7 --server $HEAD_SERVER --config ./src/autotuner/distributed-sweep-example.json --cloud_dir gs://autotuner_test sweep

# Case 3: Overprovisioned resource spec (should fail because the cluster cannot meet this demand)
HEAD_SERVER=10.138.0.13
ray job submit --address http://localhost:8265 --entrypoint-num-cpus 4 -- python3 -m autotuner.distributed --design gcd --platform asap7 --server $HEAD_SERVER --config ../../flow/designs/asap7/gcd/autotuner.json --cloud_dir gs://autotuner_test tune --samples 1

# Commands run from the local machine (syncs the local working dir; note it is staged under a /tmp directory on the cluster)
ray job submit --address http://localhost:8265 \
  --working-dir scripts -- python3 hello_world.py
```
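
After submission, a job's state and output can be followed with the Ray job CLI (standard commands; the submission ID below is a placeholder printed by `ray job submit`):

```bash
# List all jobs known to the cluster
ray job list --address http://localhost:8265

# Check status and stream logs for a specific submission
ray job status --address http://localhost:8265 <submission_id>
ray job logs --address http://localhost:8265 --follow <submission_id>
```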

## Useful commands

```bash
HEAD_SERVER=10.138.0.13
# The job CLI talks to the dashboard (port 8265), not the GCS port (6379)
ray job stop --address http://$HEAD_SERVER:8265 --no-wait {{ JOB_SUBMIT_ID }}
```
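
To tear the cluster down when you are done, the Ray cluster launcher can be used directly (assuming the cluster was started from `public.yaml`; check whether the Makefile provides an equivalent target):

```bash
# Terminate all head and worker nodes started from public.yaml
ray down public.yaml
```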

## Private cluster setup

Coming soon.