AI-Hypercomputer · guptaaka · Dec 8, 2025 · Dec 10, 2025 · Dec 10, 2025 · lukebaumann
diff --git a/pathwaysutils/experimental/shared_pathways_service/README.md b/pathwaysutils/experimental/shared_pathways_service/README.md
@@ -8,54 +8,131 @@ service that manages scheduling and error handling.
 
 ## Requirements
 
-Make sure that your GKE cluster is running the Resource Manager and Worker pods.
-You can follow the steps
-<a href="https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring" target="_blank">here</a>
-to confirm the status of these pods. If you haven't started the Pathways pods
-yet, you can use [pw-service-example.yaml](yamls/pw-service-example.yaml).
-Make sure to modify the following values to deploy these pods:
+### 1. Create a GKE cluster with TPUs
+
+You have a GKE cluster with atleast 1 slice of `v6e-4` or `v6e-8`. Note that the Shared Pathways Service supports
+single-host Trillium slices only, this support will be extended soon.
+
+<a name="pw-service-yaml"></a>
+
+### 2. Deploy the Pathways head pod
+
+Start the Shared Pathways Service by using [pw-service-example.yaml](yamls/pw-service-example.yaml).
+Make sure to modify the following values to deploy the Pathways pods:
 
 - A unique Jobset name for the cluster's Pathways pods
 - GCS bucket path
 - TPU type and topology
 - Number of slices
 
-These fields are highlighted in the YAML file with trailing comments for easier
-understanding.
+### 3. Verify that the pods created in Step#2 are running
+
+Verify that the Shared Pathways Service components are started, specifically the Resource Manager (RM) and Worker
+pods.
+
+```
+# Set the environment variables.
+$ PROJECT=<your-project>
+$ CLUSTER_NAME=<your-cluster>
+$ REGION=<cluster-region>  # e.g., us-central2
+
+# Get credentials for your cluster.
+$ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project=$PROJECT && kubectl config view && kubectl config set-context --current --namespace=default
+```
+
+#### Option 1: List all pods
+
+```
+$ kubectl get pods
+
+# Sample expected output (1 Head pod and 1 or more Worker pods)
+NAME                                       READY   STATUS    RESTARTS   AGE
+pathways-cluster-pathways-head-0-0-zzmn2   2/2     Running   0          3m49s   # HEAD POD
+pathways-cluster-worker-0-0-bdzq4          1/1     Running   0          3m36s   # WORKER 0
+pathways-cluster-worker-1-0-km2rf          1/1     Running   0          3m36s   # WORKER 1
+```
+
+#### Option 2: Check the status of the specific pods that belong to your Pathways Service
+
+```
+# e.g., pathways-cluster
+$ JOBSET_NAME=<your-jobset-name>  # same as you used in [pw-service-example.yaml](#pw-service-yaml)
+
+# e.g., pathways-cluster-pathways-head-0-0-zzmn2
+$ HEAD_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep head)
+
+# e.g., pathways-cluster-worker-0-0-bdzq4
+$ WORKER0_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep 'worker-0-0-')
+```
+
+#### Option 3: Check project logs
+
+Find the detailed instructions
+<a href="https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring" target="_blank">here</a>).
+
+<a name="find-pw-service"></a>
+### 4. Find the Pathways service address
+Find the address of the Pathways service from the logs. We check the worker pod logs in the below command.
+```
+$ kubectl logs $WORKER0_POD_NAME --container pathways-worker | grep "\-\-resource_manager_address"
+I1208 20:10:18.148825       ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001'
+```
 
 ## Instructions
 
-1. Clone `pathwaysutils`.
+### 1. Clone `pathwaysutils`.
 
-`git clone https://github.com/AI-Hypercomputer/pathways-utils.git`
+```
+git clone https://github.com/AI-Hypercomputer/pathways-utils.git
+```
 
-2. Install portpicker
+### 2. Install `portpicker`.
 
-`pip install portpicker`
+```
+pip install portpicker
+```
+
+### 3. Use the `isc_pathways` Context Manager
+
+In your script,
 
-3. Import `isc_pathways` and move your workload under
-`with isc_pathways.connect()` statement. Refer to
-[run_connect_example.py](run_connect_example.py) for reference. Example code:
+1.  Import `isc_pathways`
+2. Add `with isc_pathways.connect(...)` statement. The function takes the below values:
+    - Cluster name
+    - Project name
+    - Region
+    - GCS bucket name
+    - Pathways Service (See instructions to find the Pathways address [here](#find-pw-service))
+3. Write your ML code under this `with` block to run it on the underlying TPUs.
+
+See [run_connect_example.py](run_connect_example.py) for reference. Example code:
 
 ```
- from pathwaysutils.experimental.shared_pathways_service import isc_pathways
+from pathwaysutils.experimental.shared_pathways_service import isc_pathways
+import jax.numpy as jnp
+import pathwaysutils
+import pprint
 
- with isc_pathways.connect(
-     cluster="my-cluster",
-     project="my-project",
-     region="region",
-     gcs_bucket="gs://user-bucket",
-     pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
+with isc_pathways.connect(
+    cluster="my-cluster",
+    project="my-project",
+    region="region",
+    gcs_bucket="gs://user-bucket",
+    pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
      expected_tpu_instances={"tpuv6e:2x2": 2},
- ) as tm:
-   import jax.numpy as jnp
-   import pathwaysutils
-   import pprint
-
+) as tm:
    pathwaysutils.initialize()
    orig_matrix = jnp.zeros(5)
    ...
 ```
 
 The connect block will deploy a proxy pod dedicated to your client and connect
 your local runtime environment to the proxy pod via port-forwarding.
+
+4. You can start another client that uses the same `pathways_service` (similar to Step#3). If the Shared Pathways
+Service finds free TPU(s) that match your request, your workload will start running on the free resources. However,
+if all TPUs are occupied, you can expect your script to fail.
+
+## Troubleshooting
+Refer to [this guide](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways)
+if your Pathways pods do not come up!