@@ -8,33 +8,111 @@ service that manages scheduling and error handling.
 
 ## Requirements
 
-Make sure that your GKE cluster is running the Resource Manager and Worker pods.
-You can follow the steps
-<a href="https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring" target="_blank">here</a>
-to confirm the status of these pods. If you haven't started the Pathways pods
-yet, you can use [pw-service-example.yaml](yamls/pw-service-example.yaml).
-Make sure to modify the following values to deploy these pods:
+1. You have a GKE cluster with at least one slice of `v6e-4` or `v6e-8`. Note that the Shared Pathways Service
+currently supports single-host Trillium slices only; this support will be extended soon.
+
+2. Start the Shared Pathways Service by using [pw-service-example.yaml](yamls/pw-service-example.yaml).
+Make sure to modify the following values to deploy the Pathways pods:
 
 - A unique Jobset name for the cluster's Pathways pods
 - GCS bucket path
 - TPU type and topology
 - Number of slices
 
-These fields are highlighted in the YAML file with trailing comments for easier
-understanding.
+3. Verify that the Shared Pathways Service components have started, specifically the Resource Manager (RM) and Worker
+pods.
+
+Check that the required pods are running.
+```
+# Set the environment variables.
+$ PROJECT=<your-project>
+$ CLUSTER_NAME=<your-cluster>
+$ REGION=<cluster-region> # e.g., us-central2
+
+# Get credentials for your cluster.
+$ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project=$PROJECT && kubectl config view && kubectl config set-context --current --namespace=default
+
+# Check the status of RM and Worker pods.
+$ kubectl get pods
+
+# Sample expected output
+NAME                                       READY   STATUS    RESTARTS   AGE
+pathways-cluster-pathways-head-0-0-zzmn2   2/2     Running   0          3m49s
+pathways-cluster-worker-0-0-bdzq4          1/1     Running   0          3m36s
+pathways-cluster-worker-1-0-km2rf          1/1     Running   0          3m36s
+```
+
+You can also verify the pod status by looking at the project logs. Look for the substrings shown below for each pod
+type.
+
+(Detailed instructions are <a href="https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring" target="_blank">here</a>.)
+
+```
+# Set the environment variables.
+$ HEAD_POD_NAME=pathways-cluster-pathways-head-0-0-zzmn2
+$ WORKER0_POD_NAME=pathways-cluster-worker-0-0-bdzq4
+$ WORKER1_POD_NAME=pathways-cluster-worker-1-0-km2rf
+```
+
+- RM
+```
+$ kubectl logs $HEAD_POD_NAME --container pathways-rm
+...
+I1208 20:10:04.992524 ...] Pathways Server serving on [::]:29001
+...
+I1208 20:10:23.848070 ...] *** 2/2 Pathways Slices Now Ready
+```
+
+- Worker
+```
+$ kubectl logs $WORKER0_POD_NAME --container pathways-worker
+...
+I1208 20:10:23.838022 ...] Pathways Server serving on [::]:29005
+...
+I1208 20:10:25.249167 ...] MegaScale transport initialized.
+I1208 20:10:25.249172 ...] MegaScale transport init succeeded.
+
+$ kubectl logs $WORKER1_POD_NAME --container pathways-worker
+...
+I1208 20:10:23.579361 ...] Pathways Server serving on [::]:29005
+I1208 20:10:24.994411 ...] MegaScale transport initialized.
+I1208 20:10:24.994416 ...] MegaScale transport init succeeded.
+...
+```
+
+<a name="find-pw-service"></a>
+4. Find the address of the Pathways service.
+```
+$ kubectl logs $WORKER0_POD_NAME --container pathways-worker | grep "\-\-resource_manager_address"
+I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001'
+```
 
 ## Instructions
 
 1. Clone `pathwaysutils`.
 
-`git clone https://github.com/AI-Hypercomputer/pathways-utils.git`
+```
+git clone https://github.com/AI-Hypercomputer/pathways-utils.git
+```
+
+2. Install `portpicker`.
 
-2. Install portpicker
+```
+pip install portpicker
+```
 
-`pip install portpicker`
+3. In your script:
 
-3. Import `isc_pathways` and move your workload under
-`with isc_pathways.connect()` statement. Refer to
+- Import `isc_pathways`
+- Add a `with isc_pathways.connect(...)` statement. The function takes the following values:
+  - Cluster name
+  - Project name
+  - Region
+  - GCS bucket name
+  - Pathways service address (see the instructions for finding it [here](#find-pw-service))
+- Write your ML code under this `with` block to run it on the underlying TPUs.
+
+See
 [run_connect_example.py](run_connect_example.py) for reference. Example code:
 
 ```
@@ -59,3 +137,11 @@ understanding.
 
 The connect block will deploy a proxy pod dedicated to your client and connect
 your local runtime environment to the proxy pod via port-forwarding.
+
+4. You can start another client that uses the same `pathways_service` (similar to Step 3). If the Shared Pathways
+Service finds free TPUs that match your request, your workload will start running on those resources. However,
+if all TPUs are occupied, you can expect your script to fail.
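+
+For illustration, a second client could look like the minimal sketch below. The keyword-argument names passed to
+`isc_pathways.connect(...)` are assumptions based on the values listed in Step 3 (the exact signature is shown in
+[run_connect_example.py](run_connect_example.py)), the `pathways_service` address is the one found in
+[Requirements step 4](#find-pw-service), and a trivial JAX call stands in for your ML workload.
+
+```
+import jax
+import isc_pathways
+
+# Hypothetical keyword arguments: the names are illustrative, not the exact isc_pathways API.
+with isc_pathways.connect(
+    cluster="<your-cluster>",
+    project="<your-project>",
+    region="<cluster-region>",
+    gcs_bucket="<your-gcs-bucket>",
+    # Resource manager address discovered in step 4 of the Requirements.
+    pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
+):
+    # Any JAX code placed here runs on whichever TPUs the Shared Pathways Service has free;
+    # if none are free, expect this second client to fail as noted above.
+    print(jax.devices())
+```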
+
+## Troubleshooting
+Refer to [this guide](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways)
+if your Pathways pods do not come up!