From 43fa0934377a6c533022941d72ee14aa74e3155d Mon Sep 17 00:00:00 2001 From: Akanksha Gupta Date: Mon, 8 Dec 2025 14:19:10 -0800 Subject: [PATCH 1/3] Update "Shared Pathways Service" README Add elaborate instructions to validate that the service components are running. --- .../shared_pathways_service/README.md | 113 +++++++++++++++--- 1 file changed, 99 insertions(+), 14 deletions(-) diff --git a/pathwaysutils/experimental/shared_pathways_service/README.md b/pathwaysutils/experimental/shared_pathways_service/README.md index a46dcd4..04e02c3 100644 --- a/pathwaysutils/experimental/shared_pathways_service/README.md +++ b/pathwaysutils/experimental/shared_pathways_service/README.md @@ -8,34 +8,111 @@ service that manages scheduling and error handling. ## Requirements -Make sure that your GKE cluster is running the Resource Manager and Worker pods. -You can follow the steps -here -to confirm the status of these pods. If you haven't started the Pathways pods -yet, you can use [pw-service-example.yaml](yamls/pw-service-example.yaml). -Make sure to modify the following values to deploy these pods: +1. You have a GKE cluster with atleast 1 slice of `v6e-4` or `v6e-8`. Note that the Shared Pathways Service supports +single-host Trillium slices only, this support will be extended soon. + +2. Start the Shared Pathways Service by using [pw-service-example.yaml](yamls/pw-service-example.yaml). +Make sure to modify the following values to deploy the Pathways pods: - A unique Jobset name for the cluster's Pathways pods - GCS bucket path - TPU type and topology - Number of slices -These fields are highlighted in the YAML file with trailing comments for easier -understanding. +3. Verify that the Shared Pathways Service components are started, specifically the Resource Manager (RM) and Worker +pods. + +Check that the required pods are running. +``` +# Set the environment variables. +$ PROJECT= +$ CLUSTER_NAME= +$ REGION= # e.g., us-central2 + +# Get credentials for your cluster. 
+$ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project=$PROJECT && kubectl config view && kubectl config set-context --current --namespace=default + +# Check the status of RM and Worker pods. +$ kubectl get pods + +# Sample expected output +NAME READY STATUS RESTARTS AGE +pathways-cluster-pathways-head-0-0-zzmn2 2/2 Running 0 3m49s +pathways-cluster-worker-0-0-bdzq4 1/1 Running 0 3m36s +pathways-cluster-worker-1-0-km2rf 1/1 Running 0 3m36s +``` + +You can also verify the pod status by looking at the project logs. Look for the below substring for the respective pod +type. + +(Detailed instructions are here) + +``` +# Set the environment variables +$ HEAD_POD_NAME=pathways-cluster-pathways-head-0-0-zzmn2 +$ WORKER0_POD_NAME=pathways-cluster-worker-0-0-bdzq4 +$ WORKER1_POD_NAME=pathways-cluster-worker-1-0-km2rf +``` + +- RM +``` +$ kubectl logs $HEAD_POD_NAME --container pathways-rm +... +I1208 20:10:04.992524 ...] Pathways Server serving on [::]:29001 +... +I1208 20:10:23.848070 ...] *** 2/2 Pathways Slices Now Ready +``` + +- Worker +``` +$ kubectl logs $WORKER0_POD_NAME --container pathways-worker +... +I1208 20:10:23.838022 ...] Pathways Server serving on [::]:29005 +... +I1208 20:10:25.249167 ...] MegaScale transport initialized. +I1208 20:10:25.249172 ...] MegaScale transport init succeeded. + +$ kubectl logs $WORKER1_POD_NAME --container pathways-worker +... +I1208 20:10:23.579361 ...] Pathways Server serving on [::]:29005 +I1208 20:10:24.994411 ...] MegaScale transport initialized. +I1208 20:10:24.994416 ...] MegaScale transport init succeeded. +... +``` + + +4. Find the address of the Pathways service. +``` +$ kubectl logs $WORKER0_POD_NAME --container pathways-worker | grep "\-\-resource_manager_address" +I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001' +``` ## Instructions 1. Clone `pathwaysutils`. 
-`git clone https://github.com/AI-Hypercomputer/pathways-utils.git` +``` +git clone https://github.com/AI-Hypercomputer/pathways-utils.git +``` + +2. Install `portpicker`. -2. Install portpicker +``` +pip install portpicker +``` -`pip install portpicker` +3. In your script, -3. Import `isc_pathways` and move your workload under -`with isc_pathways.connect()` statement. Refer to -[run_connect_example.py](run_connect_example.py) for reference. Example code: + - Import `isc_pathways` + - Add `with isc_pathways.connect(...)` statement. The function takes the below values: + - Cluster name + - Project name + - Region + - GCS bucket name + - Pathways Service (See instructions to find the Pathways address [here](#find-pw-service)) + - Write your ML code under this `with` block to run it on the underlying TPUs. + +See [run_connect_example.py](run_connect_example.py) for reference. Example code: ``` from pathwaysutils.experimental.shared_pathways_service import isc_pathways @@ -59,3 +136,11 @@ understanding. The connect block will deploy a proxy pod dedicated to your client and connect your local runtime environment to the proxy pod via port-forwarding. + +4. You can start another client that uses the same `pathways_service` (similar to Step#3). If the Shared Pathways +Service finds free TPU(s) that match your request, your workload will start running on the free resources. However, +if all TPUs are occupied, you can expect your script to fail. + +## Troubleshooting +Refer to [this guide](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways) +if your Pathways pods do not come up! 
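Before turning to the troubleshooting guide, it can help to confirm mechanically that every pod in the `kubectl get pods` listing is `Running` and fully ready. The sketch below is only an illustration: `unready_pods` is a hypothetical helper, not part of `pathwaysutils`, and it assumes the default column layout shown in the sample output above (NAME, READY, STATUS, ...).

```python
# Hypothetical helper (not part of pathwaysutils): scan the text printed by
# `kubectl get pods` and report pods that are not Running or not fully ready.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
def unready_pods(kubectl_output: str) -> list:
    not_ready = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything too short to be a pod row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        # READY looks like "2/2": containers ready / containers total.
        ready_count, _, total_count = ready.partition("/")
        if status != "Running" or ready_count != total_count:
            not_ready.append(name)
    return not_ready


# Sample listing copied from the README's expected output.
sample = """\
NAME                                       READY   STATUS    RESTARTS   AGE
pathways-cluster-pathways-head-0-0-zzmn2   2/2     Running   0          3m49s
pathways-cluster-worker-0-0-bdzq4          1/1     Running   0          3m36s
pathways-cluster-worker-1-0-km2rf          1/1     Running   0          3m36s
"""

print(unready_pods(sample))  # prints []
```

An empty list means the head pod and all worker pods are up. Any name returned is a candidate for `kubectl describe pod` or the troubleshooting guide.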
From b80acead02f626174a491f85293dbec5f4d86f99 Mon Sep 17 00:00:00 2001 From: Akanksha Gupta Date: Tue, 9 Dec 2025 20:38:45 -0800 Subject: [PATCH 2/3] Reword --- .../shared_pathways_service/README.md | 51 +++++-------------- 1 file changed, 14 insertions(+), 37 deletions(-) diff --git a/pathwaysutils/experimental/shared_pathways_service/README.md b/pathwaysutils/experimental/shared_pathways_service/README.md index 04e02c3..075c970 100644 --- a/pathwaysutils/experimental/shared_pathways_service/README.md +++ b/pathwaysutils/experimental/shared_pathways_service/README.md @@ -11,6 +11,7 @@ service that manages scheduling and error handling. 1. You have a GKE cluster with atleast 1 slice of `v6e-4` or `v6e-8`. Note that the Shared Pathways Service supports single-host Trillium slices only, this support will be extended soon. + 2. Start the Shared Pathways Service by using [pw-service-example.yaml](yamls/pw-service-example.yaml). Make sure to modify the following values to deploy the Pathways pods: @@ -35,53 +36,29 @@ $ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --pro # Check the status of RM and Worker pods. $ kubectl get pods -# Sample expected output +# Sample expected output (1 Head pod and 1 or more Worker pods) NAME READY STATUS RESTARTS AGE -pathways-cluster-pathways-head-0-0-zzmn2 2/2 Running 0 3m49s -pathways-cluster-worker-0-0-bdzq4 1/1 Running 0 3m36s -pathways-cluster-worker-1-0-km2rf 1/1 Running 0 3m36s +pathways-cluster-pathways-head-0-0-zzmn2 2/2 Running 0 3m49s # HEAD POD +pathways-cluster-worker-0-0-bdzq4 1/1 Running 0 3m36s # WORKER 0 +pathways-cluster-worker-1-0-km2rf 1/1 Running 0 3m36s # WORKER 1 ``` -You can also verify the pod status by looking at the project logs. Look for the below substring for the respective pod -type. +You can also verify the pod status by running below commands or by checking the project logs (Detailed instructions +for the logs are here). 
-(Detailed instructions are here) - -``` -# Set the environment variables -$ HEAD_POD_NAME=pathways-cluster-pathways-head-0-0-zzmn2 -$ WORKER0_POD_NAME=pathways-cluster-worker-0-0-bdzq4 -$ WORKER1_POD_NAME=pathways-cluster-worker-1-0-km2rf ``` +# e.g., pathways-cluster +$ JOBSET_NAME= # same as you used in [pw-service-example.yaml](#pw-service-yaml) -- RM -``` -$ kubectl logs $HEAD_POD_NAME --container pathways-rm -... -I1208 20:10:04.992524 ...] Pathways Server serving on [::]:29001 -... -I1208 20:10:23.848070 ...] *** 2/2 Pathways Slices Now Ready -``` +# e.g., pathways-cluster-pathways-head-0-0-zzmn2 +$ HEAD_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep head) -- Worker -``` -$ kubectl logs $WORKER0_POD_NAME --container pathways-worker -... -I1208 20:10:23.838022 ...] Pathways Server serving on [::]:29005 -... -I1208 20:10:25.249167 ...] MegaScale transport initialized. -I1208 20:10:25.249172 ...] MegaScale transport init succeeded. - -$ kubectl logs $WORKER1_POD_NAME --container pathways-worker -... -I1208 20:10:23.579361 ...] Pathways Server serving on [::]:29005 -I1208 20:10:24.994411 ...] MegaScale transport initialized. -I1208 20:10:24.994416 ...] MegaScale transport init succeeded. -... +# e.g., pathways-cluster-worker-0-0-bdzq4 +$ WORKER0_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep 'worker-0-0-') ``` -4. Find the address of the Pathways service. +4. Find the address of the Pathways service from the logs. We check the worker pod logs in the below command. ``` $ kubectl logs $WORKER0_POD_NAME --container pathways-worker | grep "\-\-resource_manager_address" I1208 20:10:18.148825 ...] 
argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001' From d7b6d8923cf191f9187ef63f72212c975c32391c Mon Sep 17 00:00:00 2001 From: Akanksha Gupta Date: Tue, 9 Dec 2025 21:04:01 -0800 Subject: [PATCH 3/3] reword --- .../shared_pathways_service/README.md | 79 +++++++++++-------- 1 file changed, 47 insertions(+), 32 deletions(-) diff --git a/pathwaysutils/experimental/shared_pathways_service/README.md b/pathwaysutils/experimental/shared_pathways_service/README.md index 075c970..e0a58ea 100644 --- a/pathwaysutils/experimental/shared_pathways_service/README.md +++ b/pathwaysutils/experimental/shared_pathways_service/README.md @@ -8,11 +8,16 @@ service that manages scheduling and error handling. ## Requirements -1. You have a GKE cluster with atleast 1 slice of `v6e-4` or `v6e-8`. Note that the Shared Pathways Service supports +### 1. Create a GKE cluster with TPUs + +You have a GKE cluster with at least 1 slice of `v6e-4` or `v6e-8`. Note that the Shared Pathways Service supports single-host Trillium slices only, this support will be extended soon. -2. Start the Shared Pathways Service by using [pw-service-example.yaml](yamls/pw-service-example.yaml). + +### 2. Deploy the Pathways head pod + +Start the Shared Pathways Service by using [pw-service-example.yaml](yamls/pw-service-example.yaml). Make sure to modify the following values to deploy the Pathways pods: - A unique Jobset name for the cluster's Pathways pods @@ -20,10 +25,11 @@ Make sure to modify the following values to deploy the Pathways pods: - TPU type and topology - Number of slices -3. Verify that the Shared Pathways Service components are started, specifically the Resource Manager (RM) and Worker +### 3. Verify that the pods created in Step#2 are running + +Verify that the Shared Pathways Service components are started, specifically the Resource Manager (RM) and Worker pods. -Check that the required pods are running. ``` # Set the environment variables. 
$ PROJECT= @@ -32,8 +38,11 @@ $ REGION= # e.g., us-central2 # Get credentials for your cluster. $ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project=$PROJECT && kubectl config view && kubectl config set-context --current --namespace=default +``` + +#### Option 1: List all pods -# Check the status of RM and Worker pods. +``` $ kubectl get pods # Sample expected output (1 Head pod and 1 or more Worker pods) @@ -43,8 +52,7 @@ pathways-cluster-worker-0-0-bdzq4 1/1 Running 0 3m36s pathways-cluster-worker-1-0-km2rf 1/1 Running 0 3m36s # WORKER 1 ``` -You can also verify the pod status by running below commands or by checking the project logs (Detailed instructions -for the logs are here). +#### Option 2: Check the status of the specific pods that belong to your Pathways Service ``` # e.g., pathways-cluster @@ -57,8 +65,14 @@ $ HEAD_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${J $ WORKER0_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep 'worker-0-0-') ``` +#### Option 3: Check project logs + +Find the detailed instructions +here. + -4. Find the address of the Pathways service from the logs. We check the worker pod logs in the below command. +### 4. Find the Pathways service address +Find the address of the Pathways service from the logs. The command below checks the worker pod logs. ``` $ kubectl logs $WORKER0_POD_NAME --container pathways-worker | grep "\-\-resource_manager_address" I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001' @@ -66,46 +80,47 @@ I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-c ## Instructions -1. Clone `pathwaysutils`. +### 1. Clone `pathwaysutils`. ``` git clone https://github.com/AI-Hypercomputer/pathways-utils.git ``` -2. Install `portpicker`. +### 2. 
Install `portpicker`. ``` pip install portpicker ``` -3. In your script, +### 3. Use the `isc_pathways` Context Manager - - Import `isc_pathways` - - Add `with isc_pathways.connect(...)` statement. The function takes the below values: - - Cluster name - - Project name - - Region - - GCS bucket name - - Pathways Service (See instructions to find the Pathways address [here](#find-pw-service)) - - Write your ML code under this `with` block to run it on the underlying TPUs. +In your script, + +1. Import `isc_pathways` +2. Add a `with isc_pathways.connect(...)` statement. The function takes the following values: + - Cluster name + - Project name + - Region + - GCS bucket name + - Pathways Service (See instructions to find the Pathways address [here](#find-pw-service)) +3. Write your ML code under this `with` block to run it on the underlying TPUs. See [run_connect_example.py](run_connect_example.py) for reference. Example code: ``` - from pathwaysutils.experimental.shared_pathways_service import isc_pathways - - with isc_pathways.connect( - cluster="my-cluster", - project="my-project", - region="region", - gcs_bucket="gs://user-bucket", - pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001", +from pathwaysutils.experimental.shared_pathways_service import isc_pathways +import jax.numpy as jnp +import pathwaysutils +import pprint + +with isc_pathways.connect( + cluster="my-cluster", + project="my-project", + region="region", + gcs_bucket="gs://user-bucket", + pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001", expected_tpu_instances={"tpuv6e:2x2": 2}, - ) as tm: - import jax.numpy as jnp - import pathwaysutils - import pprint - +) as tm: pathwaysutils.initialize() orig_matrix = jnp.zeros(5) ...
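The `pathways_service` value passed to `connect` is the host and port that the worker log prints in its `--resource_manager_address` flag. As a rough sketch, that value could be pulled out of captured `kubectl logs` text as follows. `resource_manager_address` is a hypothetical helper, not part of `pathwaysutils`, and the log format is assumed to match the sample line shown in the README.

```python
import re
from typing import Optional

# Matches the flag as it appears in the worker log, stopping at whitespace
# or at the closing single quote that wraps the argv entry.
_RM_FLAG = re.compile(r"--resource_manager_address=([^\s']+)")


# Hypothetical helper (not part of pathwaysutils).
def resource_manager_address(log_text: str) -> Optional[str]:
    """Return the first resource manager address found, or None."""
    match = _RM_FLAG.search(log_text)
    return match.group(1) if match else None


# Sample log line copied from the README.
log_line = (
    "I1208 20:10:18.148825 ...] argv[2]: "
    "'--resource_manager_address="
    "pathways-cluster-pathways-head-0-0.pathways-cluster:29001'"
)

print(resource_manager_address(log_line))
# prints pathways-cluster-pathways-head-0-0.pathways-cluster:29001
```

Returning `None` when the flag is absent lets a caller distinguish "worker has not logged its arguments yet" from a malformed address.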