@@ -8,22 +8,28 @@ service that manages scheduling and error handling.
88
99## Requirements
1010
11- 1 . You have a GKE cluster with at least 1 slice of ` v6e-4 ` or ` v6e-8 ` . Note that the Shared Pathways Service supports
11+ ### 1. Create a GKE cluster with TPUs
12+
13+ You have a GKE cluster with at least 1 slice of ` v6e-4 ` or ` v6e-8 ` . Note that the Shared Pathways Service currently supports
1214single-host Trillium slices only; this support will be extended soon.
1315
1416<a name =" pw-service-yaml " ></a >
15- 2 . Start the Shared Pathways Service by using [ pw-service-example.yaml] ( yamls/pw-service-example.yaml ) .
17+
18+ ### 2. Deploy the Pathways head pod
19+
20+ Start the Shared Pathways Service by using [ pw-service-example.yaml] ( yamls/pw-service-example.yaml ) .
1621Make sure to modify the following values to deploy the Pathways pods:
1722
1823- A unique Jobset name for the cluster's Pathways pods
1924- GCS bucket path
2025- TPU type and topology
2126- Number of slices
2227
23- 3 . Verify that the Shared Pathways Service components are started, specifically the Resource Manager (RM) and Worker
28+ ### 3. Verify that the pods created in Step #2 are running
29+
30+ Verify that the Shared Pathways Service components have started, specifically the Resource Manager (RM) and Worker
2431pods.
2532
26- Check that the required pods are running.
2733```
2834# Set the environment variables.
2935$ PROJECT=<your-project>
@@ -32,8 +38,11 @@ $ REGION=<cluster-region> # e.g., us-central2
3238
3339# Get credentials for your cluster.
3440$ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project=$PROJECT && kubectl config view && kubectl config set-context --current --namespace=default
41+ ```
42+
43+ #### Option 1: List all pods
3544
36- # Check the status of RM and Worker pods.
45+ ```
3746$ kubectl get pods
3847
3948# Sample expected output (1 Head pod and 1 or more Worker pods)
@@ -43,8 +52,7 @@ pathways-cluster-worker-0-0-bdzq4 1/1 Running 0 3m36s
4352pathways-cluster-worker-1-0-km2rf 1/1 Running 0 3m36s # WORKER 1
4453```
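
If you prefer to check the pod status from a script rather than by eye, the listing above can be scanned for pods that are not yet `Running`. The following is a minimal sketch, assuming the default `kubectl get pods` table layout (`NAME`, `READY`, `STATUS`, `RESTARTS`, `AGE`); the helper name `not_running_pods` is hypothetical and is not part of `pathwaysutils`.

```python
def not_running_pods(kubectl_output):
    """Return the names of pods whose STATUS column is not 'Running'.

    Assumes the default whitespace-separated `kubectl get pods` table.
    """
    pending = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and any line that is not a pod row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, _ready, status = fields[:3]
        if status != "Running":
            pending.append(name)
    return pending


# Sample rows taken from the expected output above.
sample = """\
NAME                                READY   STATUS    RESTARTS   AGE
pathways-cluster-worker-0-0-bdzq4   1/1     Running   0          3m36s
pathways-cluster-worker-1-0-km2rf   1/1     Running   0          3m36s
"""

print(not_running_pods(sample))
```

An empty list means every listed pod reports `Running`. In practice you would pass in the captured output of `kubectl get pods` instead of the sample string.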
4554
46- You can also verify the pod status by running below commands or by checking the project logs (Detailed instructions
47- for the logs are <a href =" https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring " target =" _blank " >here</a >).
55+ #### Option 2: Check the status of the specific pods that belong to your Pathways Service
4856
4957```
5058# e.g., pathways-cluster
@@ -57,55 +65,62 @@ $ HEAD_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${J
5765$ WORKER0_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep 'worker-0-0-')
5866```
5967
68+ #### Option 3: Check project logs
69+
70+ Find the detailed instructions
71+ <a href =" https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring " target =" _blank " >here</a >.
72+
6073<a name =" find-pw-service " ></a >
61- 4 . Find the address of the Pathways service from the logs. We check the worker pod logs in the below command.
74+ ### 4. Find the Pathways service address
75+ Find the address of the Pathways service in the logs. The command below checks the worker pod logs.
6276```
6377$ kubectl logs $WORKER0_POD_NAME --container pathways-worker | grep "\-\-resource_manager_address"
6478I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001'
6579```
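
The address is the value of the `--resource_manager_address` flag in that log line. If you want to extract it programmatically, for example to pass it on as the `pathways_service` value, a small parser is enough. This is a sketch only: the helper name `find_resource_manager_address` is hypothetical, and it assumes the log line has the same shape as the sample above.

```python
import re

# Matches the flag value up to the next whitespace or closing quote.
_ADDRESS_FLAG = re.compile(r"--resource_manager_address=([^\s']+)")


def find_resource_manager_address(log_text):
    """Return the first --resource_manager_address value in log_text, or None."""
    match = _ADDRESS_FLAG.search(log_text)
    return match.group(1) if match else None


# Sample log line copied from the output above.
sample_log = (
    "I1208 20:10:18.148825 ...] argv[2]: "
    "'--resource_manager_address="
    "pathways-cluster-pathways-head-0-0.pathways-cluster:29001'"
)

address = find_resource_manager_address(sample_log)
# Split on the last colon to separate the host from the port.
host, _, port = address.rpartition(":")
print(address)
```

The function returns `None` when the flag is absent, so a caller can tell "no address found yet" apart from a malformed address.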
6680
6781## Instructions
6882
69- 1 . Clone ` pathwaysutils ` .
83+ ### 1. Clone ` pathwaysutils ` .
7084
7185```
7286git clone https://github.com/AI-Hypercomputer/pathways-utils.git
7387```
7488
75- 2 . Install ` portpicker ` .
89+ ### 2. Install ` portpicker ` .
7690
7791```
7892pip install portpicker
7993```
8094
81- 3 . In your script,
95+ ### 3. Use the ` isc_pathways ` Context Manager
8296
83- - Import ` isc_pathways `
84- - Add ` with isc_pathways.connect(...) ` statement. The function takes the below values:
85- - Cluster name
86- - Project name
87- - Region
88- - GCS bucket name
89- - Pathways Service (See instructions to find the Pathways address [ here] ( #find-pw-service ) )
90- - Write your ML code under this ` with ` block to run it on the underlying TPUs.
97+ In your script,
98+
99+ 1 . Import ` isc_pathways `
100+ 2 . Add a ` with isc_pathways.connect(...) ` statement. The function takes the following values:
101+ - Cluster name
102+ - Project name
103+ - Region
104+ - GCS bucket name
105+ - Pathways Service (See instructions to find the Pathways address [ here] ( #find-pw-service ) )
106+ 3 . Write your ML code under this ` with ` block to run it on the underlying TPUs.
91107
92108See [ run_connect_example.py] ( run_connect_example.py ) for reference. Example code:
93109
94110```
95- from pathwaysutils.experimental.shared_pathways_service import isc_pathways
96-
97- with isc_pathways.connect(
98- cluster="my-cluster",
99- project="my-project",
100- region="region",
101- gcs_bucket="gs://user-bucket",
102- pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
111+ from pathwaysutils.experimental.shared_pathways_service import isc_pathways
112+ import jax.numpy as jnp
113+ import pathwaysutils
114+ import pprint
115+
116+ with isc_pathways.connect(
117+ cluster="my-cluster",
118+ project="my-project",
119+ region="region",
120+ gcs_bucket="gs://user-bucket",
121+ pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
103122 expected_tpu_instances={"tpuv6e:2x2": 2},
104- ) as tm:
105- import jax.numpy as jnp
106- import pathwaysutils
107- import pprint
108-
123+ ) as tm:
109124 pathwaysutils.initialize()
110125 orig_matrix = jnp.zeros(5)
111126 ...