
Commit 5fd6a81

Update README
1 parent bc0c8b5 commit 5fd6a81

2 files changed: +100 −13 lines changed

pathwaysutils/experimental/shared_pathways_service/README.md

Lines changed: 99 additions & 13 deletions
@@ -8,33 +8,111 @@ service that manages scheduling and error handling.
 
 ## Requirements
 
-Make sure that your GKE cluster is running the Resource Manager and Worker pods.
-You can follow the steps
-<a href="https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring" target="_blank">here</a>
-to confirm the status of these pods. If you haven't started the Pathways pods
-yet, you can use [pw-service-example.yaml](yamls/pw-service-example.yaml).
-Make sure to modify the following values to deploy these pods:
+1. You have a GKE cluster with at least 1 slice of `v6e-4` or `v6e-8`. Note that the Shared Pathways Service supports
+single-host Trillium slices only; this support will be extended soon.
+
+2. Start the Shared Pathways Service by using [pw-service-example.yaml](yamls/pw-service-example.yaml).
+Make sure to modify the following values to deploy the Pathways pods:
 
 - A unique Jobset name for the cluster's Pathways pods
 - GCS bucket path
 - TPU type and topology
 - Number of slices
 
-These fields are highlighted in the YAML file with trailing comments for easier
-understanding.
+3. Verify that the Shared Pathways Service components are started, specifically the Resource Manager (RM) and Worker
+pods.
+
+Check that the required pods are running:
+```
+# Set the environment variables.
+$ PROJECT=<your-project>
+$ CLUSTER_NAME=<your-cluster>
+$ REGION=<cluster-region>  # e.g., us-central2
+
+# Get credentials for your cluster.
+$ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project=$PROJECT && kubectl config view && kubectl config set-context --current --namespace=default
+
+# Check the status of the RM and Worker pods.
+$ kubectl get pods
+
+# Sample expected output
+NAME                                       READY   STATUS    RESTARTS   AGE
+pathways-cluster-pathways-head-0-0-zzmn2   2/2     Running   0          3m49s
+pathways-cluster-worker-0-0-bdzq4          1/1     Running   0          3m36s
+pathways-cluster-worker-1-0-km2rf          1/1     Running   0          3m36s
+```
+
+You can also verify the pod status by looking at the project logs. Look for the substrings below for the respective
+pod type.
+
+(Detailed instructions are <a href="https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring" target="_blank">here</a>.)
+
+```
+# Set the environment variables.
+$ HEAD_POD_NAME=pathways-cluster-pathways-head-0-0-zzmn2
+$ WORKER0_POD_NAME=pathways-cluster-worker-0-0-bdzq4
+$ WORKER1_POD_NAME=pathways-cluster-worker-1-0-km2rf
+```
+
+- RM
+```
+$ kubectl logs $HEAD_POD_NAME --container pathways-rm
+...
+I1208 20:10:04.992524 ...] Pathways Server serving on [::]:29001
+...
+I1208 20:10:23.848070 ...] *** 2/2 Pathways Slices Now Ready
+```
+
+- Worker
+```
+$ kubectl logs $WORKER0_POD_NAME --container pathways-worker
+...
+I1208 20:10:23.838022 ...] Pathways Server serving on [::]:29005
+...
+I1208 20:10:25.249167 ...] MegaScale transport initialized.
+I1208 20:10:25.249172 ...] MegaScale transport init succeeded.
+
+$ kubectl logs $WORKER1_POD_NAME --container pathways-worker
+...
+I1208 20:10:23.579361 ...] Pathways Server serving on [::]:29005
+I1208 20:10:24.994411 ...] MegaScale transport initialized.
+I1208 20:10:24.994416 ...] MegaScale transport init succeeded.
+...
+```
+
+<a name="find-pw-service"></a>
+4. Find the address of the Pathways service.
+```
+$ kubectl logs $WORKER0_POD_NAME --container pathways-worker | grep "\-\-resource_manager_address"
+I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001'
+```
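
To capture that address programmatically rather than copying it out of the log line by hand, a minimal Python sketch follows. It is illustrative only and not part of this commit; the pod name and log format are taken from the sample output above and may differ in your cluster.

```python
# Hypothetical helper (not part of this commit): extract the Pathways
# service address from a worker pod's logs, as shown in step 4 above.
import re
import subprocess

worker_pod = "pathways-cluster-worker-0-0-bdzq4"  # substitute your pod name
logs = subprocess.run(
    ["kubectl", "logs", worker_pod, "--container", "pathways-worker"],
    capture_output=True, text=True, check=True,
).stdout

# The worker logs its flags; pull the value of --resource_manager_address.
match = re.search(r"--resource_manager_address=([^']+)", logs)
if match:
    # e.g. pathways-cluster-pathways-head-0-0.pathways-cluster:29001
    print(match.group(1))
```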
 
 ## Instructions
 
 1. Clone `pathwaysutils`.
 
-`git clone https://github.com/AI-Hypercomputer/pathways-utils.git`
+```
+git clone https://github.com/AI-Hypercomputer/pathways-utils.git
+```
+
+2. Install `portpicker`.
 
-2. Install portpicker
+```
+pip install portpicker
+```
 
-`pip install portpicker`
+3. In your script:
 
-3. Import `isc_pathways` and move your workload under
-`with isc_pathways.connect()` statement. Refer to
+- Import `isc_pathways`.
+- Add a `with isc_pathways.connect(...)` statement. The function takes the following values:
+  - Cluster name
+  - Project name
+  - Region
+  - GCS bucket name
+  - Pathways service address (see the instructions for finding it [here](#find-pw-service))
+- Write your ML code under this `with` block to run it on the underlying TPUs.
+
+See
 [run_connect_example.py](run_connect_example.py) for reference. Example code:
 
 ```
@@ -59,3 +137,11 @@ understanding.
 
 The connect block will deploy a proxy pod dedicated to your client and connect
 your local runtime environment to the proxy pod via port-forwarding.
+
+4. You can start another client that uses the same `pathways_service` (similar to Step 3). If the Shared Pathways
+Service finds free TPU(s) that match your request, your workload will start running on the free resources. However,
+if all TPUs are occupied, you can expect your script to fail.
+
+## Troubleshooting
+Refer to [this guide](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways)
+if your Pathways pods do not come up!
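
Since the committed example code itself is elided from this diff, here is a minimal sketch of the `connect` usage that step 3 of the new Instructions describes. The import path and keyword-argument names (`cluster`, `project`, `region`, `gcs_bucket`, `pathways_service`) are hypothetical placeholders inferred from the bullet list; [run_connect_example.py](run_connect_example.py) is the authoritative example.

```python
# Minimal sketch; argument names and import path are assumptions, not the
# committed example. See run_connect_example.py for the real usage.
import jax
import jax.numpy as jnp

from pathwaysutils.experimental.shared_pathways_service import isc_pathways

with isc_pathways.connect(
    cluster="<your-cluster>",         # cluster name
    project="<your-project>",         # project name
    region="<cluster-region>",        # region
    gcs_bucket="gs://<your-bucket>",  # GCS bucket name
    # Pathways service address, found in step 4 of Requirements.
    pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
):
    # ML code inside the block runs on the shared TPUs via the proxy pod.
    print(jax.devices())
    x = jnp.ones((8, 128))
    print(jnp.sum(x))
```

A second client can open its own `connect` block against the same `pathways_service`; per step 4 above, it succeeds only while free TPUs remain.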

pathwaysutils/experimental/shared_pathways_service/validators.py

Lines changed: 1 addition & 0 deletions
@@ -47,6 +47,7 @@ def _validate_tpu_supported(tpu_instance_with_topology: str) -> None:
   Raises ValueError if the instance is not a valid TPU host.
   """
   # Mapping from Cloud TPU type prefix to max chips per host.
+  # Make sure to edit the project README if you update this mapping.
   single_host_max_chips = {
       "tpuv6e": 8,  # Cloud TPU v6e (2x4)
   }
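
For context on the comment added above, here is a simplified sketch of the kind of check `_validate_tpu_supported` performs. It is based only on the docstring and the mapping visible in this diff; the real parsing logic is not shown here, so treat the `<type>:<topology>` input format as an assumption.

```python
# Simplified illustration of the check described by the docstring; the
# actual implementation in validators.py is not shown in full by this diff.

# Mapping from Cloud TPU type prefix to max chips per host.
# Make sure to edit the project README if you update this mapping.
_SINGLE_HOST_MAX_CHIPS = {
    "tpuv6e": 8,  # Cloud TPU v6e (2x4)
}


def validate_tpu_supported(tpu_instance_with_topology: str) -> None:
  """Raises ValueError if the instance is not a supported single-host TPU."""
  # Assumed input format: "<type>:<topology>", e.g. "tpuv6e:2x4".
  tpu_type, _, topology = tpu_instance_with_topology.partition(":")
  if tpu_type not in _SINGLE_HOST_MAX_CHIPS:
    raise ValueError(f"Unsupported TPU type: {tpu_type!r}")
  if not topology:
    raise ValueError(f"Missing topology in {tpu_instance_with_topology!r}")
  chips = 1
  for dim in topology.split("x"):
    chips *= int(dim)
  if chips > _SINGLE_HOST_MAX_CHIPS[tpu_type]:
    raise ValueError(
        f"{tpu_instance_with_topology!r} needs {chips} chips, which exceeds"
        f" the single-host limit of {_SINGLE_HOST_MAX_CHIPS[tpu_type]}."
    )
```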
