Commit c6cf7d1

initial commit for enabling distributed
Signed-off-by: Jack Luar <[email protected]>
1 parent 044183f commit c6cf7d1

File tree

16 files changed: +1819 -28 lines changed

tools/AutoTuner/.dockerignore

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
*
!src
!distributed
!requirements.txt
!requirements-dev.txt
!setup.sh
!pyproject.toml

tools/AutoTuner/.gitignore

Lines changed: 10 additions & 0 deletions
@@ -10,3 +10,13 @@ __pycache__/
# Autotuner env
autotuner_env
.env

# Ray distributed
public.yaml
private.yaml

# Docker build
docker-build.log

# GCP
service_account.json
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
DOCKERHUB_USERNAME={{DOCKERHUB_USERNAME}}
DOCKERHUB_PASSWORD={{DOCKERHUB_PASSWORD}}
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
ARG BASE_IMAGE
FROM ${BASE_IMAGE:-openroad/orfs:latest}

# Customize this based on the user's needs
ENV GOOGLE_APPLICATION_CREDENTIALS=/OpenROAD-flow-scripts/service_account.json

# Install AT required packages
WORKDIR /OpenROAD-flow-scripts/tools/AutoTuner
RUN pip3 install --no-cache-dir --upgrade pip
RUN pip3 install --no-cache-dir -r requirements.txt
RUN pip3 install --no-cache-dir -r requirements-dev.txt

# Install rsync (https://github.com/ray-project/ray/issues/40566)
# ssh: to communicate with other nodes within image.
# sudo: for grpc calls in ray.
RUN apt-get update && \
    apt-get install -y rsync openssh-client sudo && \
    rm -rf /var/lib/apt/lists/*

# Replace pre-existing AT files with local copy
RUN rm -rf /OpenROAD-flow-scripts/tools/AutoTuner
COPY . /OpenROAD-flow-scripts/tools/AutoTuner

# Install AT package in editable mode
RUN pip3 install --no-cache-dir -e .
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
include .env
export

BASE_TAG=$(shell cd ../../../ && ./etc/DockerTag.sh -dev)
ORFS_IMAGE := openroad/orfs:v3.0-2422-g3dc9e665 # update manually (do not use latest)
ORFS_AUTOTUNER_IMAGE := orfs-autotuner

.PHONY: init
init:
	@echo "Setting up environment..."
	@../installer.sh

.PHONY: clean
clean:
	@echo "Cleaning up old images"
	@docker rmi ${ORFS_AUTOTUNER_IMAGE}:latest || true

.PHONY: docker
docker: clean
	@echo "Building docker image..."
	@docker build -t ${ORFS_AUTOTUNER_IMAGE}:latest -f Dockerfile --build-arg BASE_IMAGE=${ORFS_IMAGE} .. | tee docker-build.log
	@docker tag ${ORFS_AUTOTUNER_IMAGE}:latest ${ORFS_AUTOTUNER_IMAGE}:$(BASE_TAG)

.PHONY: upload
upload: docker
	@echo "Uploading docker image..."
	@docker login -u $(DOCKERHUB_USERNAME) -p $(DOCKERHUB_PASSWORD)
	@echo "Base image: $(BASE_TAG)"
	@docker tag ${ORFS_AUTOTUNER_IMAGE}:latest ${DOCKERHUB_USERNAME}/${ORFS_AUTOTUNER_IMAGE}:$(BASE_TAG)
	@docker tag ${ORFS_AUTOTUNER_IMAGE}:latest ${DOCKERHUB_USERNAME}/${ORFS_AUTOTUNER_IMAGE}:latest
	@docker push ${DOCKERHUB_USERNAME}/${ORFS_AUTOTUNER_IMAGE}:$(BASE_TAG)
	@docker push ${DOCKERHUB_USERNAME}/${ORFS_AUTOTUNER_IMAGE}:latest
	@docker logout

.PHONY: up
up:
	@echo "Starting Ray cluster..."
	@ray disable-usage-stats
	@ray up -y public.yaml

.PHONY: down
down:
	@echo "Stopping Ray cluster..."
	@ray down -y public.yaml

.PHONY: dashboard
dashboard:
	@echo "Starting Ray dashboard..."
	@ray dashboard public.yaml

.PHONY: monitor
monitor:
	@echo "Monitoring Ray cluster..."
	@ray monitor public.yaml
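A sketch of one typical invocation sequence, inferred from the target dependencies above (`upload` already triggers `docker`, which in turn triggers `clean`):

```bash
make init      # set up the environment via ../installer.sh
make upload    # clean, rebuild, tag, and push the image
make up        # launch the Ray cluster described by public.yaml
make dashboard # forward the Ray dashboard to localhost:8265
```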
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
1) Set up two AT instances on the same internal network
2) Set up the requirements

```
sudo apt-get install -y python3-pip python3-venv
python3 -m venv .venv
source .venv/bin/activate && pip install "ray[tune]"
```

3) Common setup script
- `at_distributed.sh`

4) Worker script
- `at_worker.py`
- `mkdir -p /tmp/owo && touch /tmp/owo/abc`

5) Benchmark file transfers (do on worker)
- Observation: `sync_dir` just makes sure the files are synced, so a neat feature is that only file diffs are transferred.
- You do not have to create the `dest_dir`; `sync_dir` does that for you.
- `max_size_bytes` is limited to 1 GiB by default, so the restriction has to be lifted manually if needed.
- The bottleneck seems to start at transfers of 1 GiB and above.
- `dd if=/dev/zero of=/tmp/owo/owo bs=1M count=100` - creates a 100 MB file (time taken: 2.21 ± 0.56 s)
- `dd if=/dev/zero of=/tmp/owo/owo bs=1M count=1000` - creates a 1 GB file (time taken: 8.90 ± 0.65 s)
- `dd if=/dev/zero of=/tmp/owo/owo bs=1M count=5000` - creates a 5 GB file (time taken: 54.92 ± 1.05 s)
Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
# Ray Cluster Setup on Google Cloud Platform (GCP)

This tutorial covers the setup of Ray Clusters on GCP. Ray Clusters are a way to
run compute-intensive jobs (e.g. AutoTuner) on a distributed set of nodes that are
spawned automatically. For more information on Ray Clusters, see the
[Ray Cluster documentation](https://docs.ray.io/en/latest/cluster/getting-started.html).

To run AutoTuner jobs on a Ray Cluster, we first have to install ORFS onto the
GCP nodes.

How does this differ from the previous Kubernetes approach?
- Support for autoscaling
- Faster startup time using Docker (no need for JIT rebuilds of runtime dependencies)
- Simplified architecture and codebase

There are two ways to set up ORFS on a Ray Cluster:
- [Public](#public-cluster-setup): Upload the Docker image to Docker Hub (or any public Docker registry).
- [Private](#private-cluster-setup): Upload the Docker image to a private registry. Authentication then needs to be handled for Kubernetes.

```note
Currently it looks like the `autoscaler.yaml` file might only be used for `public.yaml`.
For private deployments, we might have to use KubeRay:
1. https://github.com/GoogleCloudPlatform/ai-on-gke/tree/main/ray-on-gke
2. https://www.paulsblog.dev/how-to-install-a-private-docker-container-registry-in-kubernetes/
```

## TODO

- Look up how to preserve the cache during pip install.
- Public flow, fixed: via the AutoTuner script
  - Tune
  - Sweep
- Public flow, fixed: via the Ray API
- Public flow, autoscaling
- Test the same flow using a private registry on Docker Hub
- Scaling concerns
  - Increase storage of the head node.
  - Object store memory: does that affect file transfer?

## Prerequisites

Make sure the AutoTuner prerequisites are installed. To do so, refer to the installation script.

```bash
pip install ray[default] google-api-python-client cryptography cloudpathlib
```

## Public cluster setup

0a. Authenticate the necessary GCP account with enough privileges to perform:
- `setIamPolicy`

```bash
gcloud auth application-default login
```

0b. Generate your service account keys for `ray-autoscaler-sa-v1@<project_id>.iam.gserviceaccount.com`
and rename the key file to `service_account.json`.
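One way to generate the key is with the gcloud CLI; a minimal sketch, where `<project_id>` is a placeholder for your own project:

```bash
# Writes the key directly to service_account.json; <project_id> is a placeholder.
gcloud iam service-accounts keys create service_account.json \
  --iam-account="ray-autoscaler-sa-v1@<project_id>.iam.gserviceaccount.com"
```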
1. Set up `.env` with your Docker registry username/password. Also, set up the `public.yaml`
file according to your desired specifications.

```bash
cp .env.sample .env
cp public.yaml.template public.yaml
```

2. Run the following commands to build, tag, and upload the public image:

```bash
make clean
make base
make docker
make upload
```

3. Launch your cluster as follows:

```bash
make up
```
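To sanity-check that the cluster is up, the Ray cluster launcher CLI can attach to the head node or run a one-off command there, for example:

```bash
# Open a shell on the head node defined in public.yaml.
ray attach public.yaml

# Or just query cluster resources from the head node.
ray exec public.yaml "ray status"
```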
4. Ray CLI API

```bash
# Commands on machine (assume files/commands are present on the cluster)
ray job submit --address http://localhost:8265 -- ls

# Case 1: 1 job
ray job submit --address http://localhost:8265 -- python3 -m autotuner.distributed --design gcd --platform asap7 --config ../../flow/designs/asap7/gcd/autotuner.json --cloud_dir gs://autotuner_test tune --samples 1

# Case 2A: 2 jobs, with resource spec
HEAD_SERVER=10.138.0.13
ray job submit --address http://localhost:8265 --entrypoint-num-cpus 2 -- python3 -m autotuner.distributed --design gcd --platform asap7 --server $HEAD_SERVER --config ../../flow/designs/asap7/gcd/autotuner.json --cloud_dir gs://autotuner_test tune --samples 1
ray job submit --address http://localhost:8265 --entrypoint-num-cpus 2 -- python3 -m autotuner.distributed --design gcd --platform asap7 --server $HEAD_SERVER --config ../../flow/designs/asap7/gcd/autotuner.json --cloud_dir gs://autotuner_test tune --samples 1

# Case 2B: 2 jobs, with resource spec (sweep)
HEAD_SERVER=10.138.0.13
ray job submit --address http://localhost:8265 --entrypoint-num-cpus 2 -- python3 -m autotuner.distributed --design gcd --platform asap7 --server $HEAD_SERVER --config ./src/autotuner/distributed-sweep-example.json --cloud_dir gs://autotuner_test sweep
ray job submit --address http://localhost:8265 --entrypoint-num-cpus 2 -- python3 -m autotuner.distributed --design gcd --platform asap7 --server $HEAD_SERVER --config ./src/autotuner/distributed-sweep-example.json --cloud_dir gs://autotuner_test sweep

# Case 3: Overprovisioned resource spec (should fail because the cluster cannot meet this demand)
HEAD_SERVER=10.138.0.13
ray job submit --address http://localhost:8265 --entrypoint-num-cpus 4 -- python3 -m autotuner.distributed --design gcd --platform asap7 --server $HEAD_SERVER --config ../../flow/designs/asap7/gcd/autotuner.json --cloud_dir gs://autotuner_test tune --samples 1

# Commands on machine (sync the local working dir; note it is stored under some /tmp dir)
ray job submit --address http://localhost:8265 \
  --working-dir scripts -- python3 hello_world.py
```

## Useful commands

```bash
HEAD_SERVER=10.138.0.13
ray job stop --address $HEAD_SERVER:6379 --no-wait {{ JOB_SUBMIT_ID }}
```
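The Ray Jobs CLI also provides status and log commands that are handy while a tune or sweep is running; for example, assuming the dashboard is reachable on `localhost:8265` and `raysubmit_XXXX` is a placeholder for the real submission ID:

```bash
# List submitted jobs and their current states.
ray job list --address http://localhost:8265

# Check a single job and tail its logs (substitute the real submission ID).
ray job status --address http://localhost:8265 raysubmit_XXXX
ray job logs --address http://localhost:8265 --follow raysubmit_XXXX
```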
## Private cluster setup

Coming soon.
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
#!/bin/bash -eu

# Ray script that starts the Ray HEAD/WORKER node based on the command line arg

IS_HEAD=${1:-false}
RAY_HEAD_IP_ADDRESS=${2:-10.129.0.4}


if [ "$IS_HEAD" = "true" ]; then
    echo "Starting Ray HEAD node"
    ray start --head --port=6379
else
    echo "Starting Ray WORKER node"
    ray start --address=$RAY_HEAD_IP_ADDRESS:6379
fi
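A usage sketch, assuming this is the `at_distributed.sh` script referenced in the notes above (the filename is not shown in this diff) and that `10.129.0.4` is the head node's internal IP:

```bash
# On the head node:
./at_distributed.sh true

# On each worker node, pointing at the head node's internal IP:
./at_distributed.sh false 10.129.0.4
```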
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
import ray
from ray.tune.utils.file_transfer import sync_dir_between_nodes
import time

ray.init()
# Raised from the 1 GiB default so the larger benchmark files can be transferred.
_DEFAULT_MAX_SIZE_BYTES = 5 * 1024 * 1024 * 1024  # 5 GiB

# Benchmark repeated directory transfers between two cluster nodes.
res = []
for i in range(5):
    start = time.time()
    sync_dir_between_nodes(
        source_ip="10.129.0.5",
        source_path="/tmp/owo",
        target_ip="10.129.0.4",
        target_path=f"/tmp/owo{i}",
        max_size_bytes=_DEFAULT_MAX_SIZE_BYTES,
    )
    elapsed = time.time() - start
    print(f"Time taken: {elapsed}")
    res.append(elapsed)

mean = sum(res) / len(res)
std = (sum((x - mean) ** 2 for x in res) / len(res)) ** 0.5
print(f"Time taken: {mean} ± {std}")
