From 7848aefd0c0af3be131e320cd4211139fdbfb05f Mon Sep 17 00:00:00 2001
From: Akanksha Gupta
Date: Thu, 4 Dec 2025 14:20:10 -0800
Subject: [PATCH 1/7] Add README for shared_pathways_service

---
 .../shared_pathways_service/README.md         | 51 +++++++++++++++++++
 1 file changed, 51 insertions(+)
 create mode 100644 pathwaysutils/experimental/shared_pathways_service/README.md

diff --git a/pathwaysutils/experimental/shared_pathways_service/README.md b/pathwaysutils/experimental/shared_pathways_service/README.md
new file mode 100644
index 0000000..3ff6478
--- /dev/null
+++ b/pathwaysutils/experimental/shared_pathways_service/README.md
@@ -0,0 +1,51 @@
+# Shared Pathways Service
+
+Shared pathways service is a multi-tenant Pathways cluster with dedicated TPU
+resources. This eliminates the need for complex cloud setup, allowing you to
+get started from a familiar local environment (like a laptop or cloud VM) with
+minimal overhead: Just wrap your Python entrypoint in a
+`with isc_pathways.connect():` block!.
+
+## Requirements
+
+Make sure that your cluster is running the Resource Manager and Worker pods.
+If not, you can use [pw-service-example.yaml](yamls/pw-service-example.yaml).
+Make sure to modify the following values to deploy these pods:
+
+- A unique Jobset name for the cluster's Pathways pods
+- GCS bucket path
+- TPU type and topology
+- Number of slices
+
+These fields are highlighted in the YAML file with trailing comments for easier
+understanding.
+
+## Instructions
+
+1. Clone `pathwaysutils`.
+
+`git clone https://github.com/AI-Hypercomputer/pathways-utils.git`
+
+2. Import `isc_pathways.py` and move your workload under
+`with isc_pathways.connect()` statement. Refer to
+[run_connect_example.py](run_connect_example.py) for reference. Example code:
+
+```
+  from pathwaysutils.experimental.shared_pathways_service import isc_pathways
+
+  with isc_pathways.connect(
+      "my-cluster",
+      "my-project",
+      "region",
+      "gs://user-bucket",
+      "pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
+      {"tpuv6e:2x2": 2},
+  ) as tm:
+    import jax.numpy as jnp
+    import pathwaysutils
+    import pprint
+
+    pathwaysutils.initialize()
+    orig_matrix = jnp.zeros(5)
+    ...
+```

From 5cf1186b863830ea79a470a5f5d9032c4df9d333 Mon Sep 17 00:00:00 2001
From: Akanksha Gupta
Date: Thu, 4 Dec 2025 14:39:52 -0800
Subject: [PATCH 2/7] Add that connect block deploys a proxy pod

---
 pathwaysutils/experimental/shared_pathways_service/README.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/pathwaysutils/experimental/shared_pathways_service/README.md b/pathwaysutils/experimental/shared_pathways_service/README.md
index 3ff6478..51dcaf6 100644
--- a/pathwaysutils/experimental/shared_pathways_service/README.md
+++ b/pathwaysutils/experimental/shared_pathways_service/README.md
@@ -49,3 +49,6 @@ understanding.
     orig_matrix = jnp.zeros(5)
     ...
 ```
+
+The connect block will deploy a proxy pod to your GKE cluster and connect your local runtime environment to the proxy
+pod via port-forwarding.

From b669b65532d7c73546325c52eccb0de668501a54 Mon Sep 17 00:00:00 2001
From: Akanksha Gupta
Date: Thu, 4 Dec 2025 18:00:07 -0800
Subject: [PATCH 3/7] fix comment

---
 pathwaysutils/experimental/shared_pathways_service/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pathwaysutils/experimental/shared_pathways_service/README.md b/pathwaysutils/experimental/shared_pathways_service/README.md
index 51dcaf6..0a767f0 100644
--- a/pathwaysutils/experimental/shared_pathways_service/README.md
+++ b/pathwaysutils/experimental/shared_pathways_service/README.md
@@ -26,7 +26,7 @@ understanding.
 
 `git clone https://github.com/AI-Hypercomputer/pathways-utils.git`
 
-2. Import `isc_pathways.py` and move your workload under
+2. Import `isc_pathways` and move your workload under
 `with isc_pathways.connect()` statement. Refer to
 [run_connect_example.py](run_connect_example.py) for reference. Example code:
 

From c3da4af13f2870fd128deb57da770ad4ac4d3609 Mon Sep 17 00:00:00 2001
From: Akanksha Gupta
Date: Thu, 4 Dec 2025 20:09:35 -0800
Subject: [PATCH 4/7] Resolving commments in shared_pathways_service/README.md

---
 .../shared_pathways_service/README.md         | 21 +++++++++++--------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/pathwaysutils/experimental/shared_pathways_service/README.md b/pathwaysutils/experimental/shared_pathways_service/README.md
index 0a767f0..87d0484 100644
--- a/pathwaysutils/experimental/shared_pathways_service/README.md
+++ b/pathwaysutils/experimental/shared_pathways_service/README.md
@@ -1,15 +1,18 @@
 # Shared Pathways Service
 
-Shared pathways service is a multi-tenant Pathways cluster with dedicated TPU
-resources. This eliminates the need for complex cloud setup, allowing you to
-get started from a familiar local environment (like a laptop or cloud VM) with
-minimal overhead: Just wrap your Python entrypoint in a
-`with isc_pathways.connect():` block!.
+The Shared Pathways Service accelerates developer iteration by providing a
+persistent, multi-tenant TPU environment. This decouples service creation from
+the development loop, allowing JAX clients to connect on-demand from a familiar
+local environment (like a laptop or cloud VM) to a long-running Pathways
+service that manages scheduling and error handling.
 
 ## Requirements
 
-Make sure that your cluster is running the Resource Manager and Worker pods.
-If not, you can use [pw-service-example.yaml](yamls/pw-service-example.yaml).
+Make sure that your GKE cluster is running the Resource Manager and Worker pods.
+You can follow the steps
+[here](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring)
+to confirm the status of these pods. If you haven't started the Pathways pods
+yet, you can use [pw-service-example.yaml](yamls/pw-service-example.yaml).
 Make sure to modify the following values to deploy these pods:
 
 - A unique Jobset name for the cluster's Pathways pods
@@ -50,5 +53,5 @@ understanding.
     ...
 ```
 
-The connect block will deploy a proxy pod to your GKE cluster and connect your local runtime environment to the proxy
-pod via port-forwarding.
+The connect block will deploy a proxy pod dedicated to your client and connect
+your local runtime environment to the proxy pod via port-forwarding.

From ac233ca152017c97206886519b9c4390d5448c5b Mon Sep 17 00:00:00 2001
From: Akanksha Gupta
Date: Thu, 4 Dec 2025 20:48:12 -0800
Subject: [PATCH 5/7] Fix instructions in Readme

---
 .../shared_pathways_service/README.md         | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/pathwaysutils/experimental/shared_pathways_service/README.md b/pathwaysutils/experimental/shared_pathways_service/README.md
index 87d0484..a9f4f36 100644
--- a/pathwaysutils/experimental/shared_pathways_service/README.md
+++ b/pathwaysutils/experimental/shared_pathways_service/README.md
@@ -29,7 +29,11 @@ understanding.
 
 `git clone https://github.com/AI-Hypercomputer/pathways-utils.git`
 
-2. Import `isc_pathways` and move your workload under
+2. Install portpicker
+
+`pip install portpicker`
+
+3. Import `isc_pathways` and move your workload under
 `with isc_pathways.connect()` statement. Refer to
 [run_connect_example.py](run_connect_example.py) for reference. Example code:
 
@@ -37,12 +41,12 @@ understanding.
   from pathwaysutils.experimental.shared_pathways_service import isc_pathways
 
   with isc_pathways.connect(
-      "my-cluster",
-      "my-project",
-      "region",
-      "gs://user-bucket",
-      "pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
-      {"tpuv6e:2x2": 2},
+      cluster="my-cluster",
+      project="my-project",
+      region="region",
+      gcs_bucket="gs://user-bucket",
+      pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
+      expected_tpu_instances={"tpuv6e:2x2": 2},
   ) as tm:
     import jax.numpy as jnp
     import pathwaysutils

From bc0c8b5145cfbbff6a6dc7fbfbbffefed0f2c8ea Mon Sep 17 00:00:00 2001
From: Akanksha Gupta
Date: Fri, 5 Dec 2025 09:32:00 -0800
Subject: [PATCH 6/7] Open hyperlink in a new tab

---
 pathwaysutils/experimental/shared_pathways_service/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pathwaysutils/experimental/shared_pathways_service/README.md b/pathwaysutils/experimental/shared_pathways_service/README.md
index a9f4f36..a46dcd4 100644
--- a/pathwaysutils/experimental/shared_pathways_service/README.md
+++ b/pathwaysutils/experimental/shared_pathways_service/README.md
@@ -10,7 +10,7 @@ service that manages scheduling and error handling.
 
 Make sure that your GKE cluster is running the Resource Manager and Worker pods.
 You can follow the steps
-[here](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring)
+here
 to confirm the status of these pods. If you haven't started the Pathways pods
 yet, you can use [pw-service-example.yaml](yamls/pw-service-example.yaml).
 Make sure to modify the following values to deploy these pods:

From 5fd6a81c2b3e3697fc00d75fea6c40a1b8beca7b Mon Sep 17 00:00:00 2001
From: Akanksha Gupta
Date: Mon, 8 Dec 2025 13:08:47 -0800
Subject: [PATCH 7/7] Update README

---
 .../shared_pathways_service/README.md         | 112 ++++++++++++++++--
 .../shared_pathways_service/validators.py     |   1 +
 2 files changed, 100 insertions(+), 13 deletions(-)

diff --git a/pathwaysutils/experimental/shared_pathways_service/README.md b/pathwaysutils/experimental/shared_pathways_service/README.md
index a46dcd4..9e462eb 100644
--- a/pathwaysutils/experimental/shared_pathways_service/README.md
+++ b/pathwaysutils/experimental/shared_pathways_service/README.md
@@ -8,33 +8,111 @@ service that manages scheduling and error handling.
 
 ## Requirements
 
-Make sure that your GKE cluster is running the Resource Manager and Worker pods.
-You can follow the steps
-here
-to confirm the status of these pods. If you haven't started the Pathways pods
-yet, you can use [pw-service-example.yaml](yamls/pw-service-example.yaml).
-Make sure to modify the following values to deploy these pods:
+1. You have a GKE cluster with at least 1 slice of `v6e-4` or `v6e-8`. Note that the Shared Pathways Service supports
+single-host Trillium slices only; this support will be extended soon.
+
+2. Start the Shared Pathways Service by using [pw-service-example.yaml](yamls/pw-service-example.yaml).
+Make sure to modify the following values to deploy the Pathways pods:
 
 - A unique Jobset name for the cluster's Pathways pods
 - GCS bucket path
 - TPU type and topology
 - Number of slices
 
-These fields are highlighted in the YAML file with trailing comments for easier
-understanding.
+3. Verify that the Shared Pathways Service components are started, specifically the Resource Manager (RM) and Worker
+pods.
+
+Check that the required pods are running.
+```
+# Set the environment variables.
+$ PROJECT=
+$ CLUSTER_NAME=
+$ REGION= # e.g., us-central2
+
+# Get credentials for your cluster.
+$ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project=$PROJECT && kubectl config view && kubectl config set-context --current --namespace=default
+
+# Check the status of RM and Worker pods.
+$ kubectl get pods
+
+# Sample expected output
+NAME                                       READY   STATUS    RESTARTS   AGE
+pathways-cluster-pathways-head-0-0-zzmn2   2/2     Running   0          3m49s
+pathways-cluster-worker-0-0-bdzq4          1/1     Running   0          3m36s
+pathways-cluster-worker-1-0-km2rf          1/1     Running   0          3m36s
+```
+
+You can also verify the pod status by looking at the project logs. Look for the below substring for the respective pod
+type.
+
+(Detailed instructions are here)
+
+```
+# Set the environment variables
+$ HEAD_POD_NAME=pathways-cluster-pathways-head-0-0-zzmn2
+$ WORKER0_POD_NAME=pathways-cluster-worker-0-0-bdzq4
+$ WORKER1_POD_NAME=pathways-cluster-worker-1-0-km2rf
+```
+
+- RM
+```
+$ kubectl logs $HEAD_POD_NAME --container pathways-rm
+...
+I1208 20:10:04.992524 ...] Pathways Server serving on [::]:29001
+...
+I1208 20:10:23.848070 ...] *** 2/2 Pathways Slices Now Ready
+```
+
+- Worker
+```
+$ kubectl logs $WORKER0_POD_NAME --container pathways-worker
+...
+I1208 20:10:23.838022 ...] Pathways Server serving on [::]:29005
+...
+I1208 20:10:25.249167 ...] MegaScale transport initialized.
+I1208 20:10:25.249172 ...] MegaScale transport init succeeded.
+
+$ kubectl logs $WORKER1_POD_NAME --container pathways-worker
+...
+I1208 20:10:23.579361 ...] Pathways Server serving on [::]:29005
+I1208 20:10:24.994411 ...] MegaScale transport initialized.
+I1208 20:10:24.994416 ...] MegaScale transport init succeeded.
+...
+```
+
+
+4. Find the address of the Pathways service.
+```
+$ kubectl logs $WORKER0_POD_NAME --container pathways-worker | grep "\-\-resource_manager_address"
+I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001'
+```
 
 ## Instructions
 
 1. Clone `pathwaysutils`.
 
-`git clone https://github.com/AI-Hypercomputer/pathways-utils.git`
+```
+git clone https://github.com/AI-Hypercomputer/pathways-utils.git
+```
+
+2. Install `portpicker`.
 
-2. Install portpicker
+```
+pip install portpicker
+```
 
-`pip install portpicker`
+3. In your script,
 
-3. Import `isc_pathways` and move your workload under
-`with isc_pathways.connect()` statement. Refer to
+    - Import `isc_pathways`
+    - Add a `with isc_pathways.connect(...)` statement. The function takes the following values:
+        - Cluster name
+        - Project name
+        - Region
+        - GCS bucket name
+        - Pathways Service (See instructions to find the Pathways address [here](#find-pw-service))
+    - Write your ML code under this `with` block to run it on the underlying TPUs.
+
+See
 [run_connect_example.py](run_connect_example.py) for reference. Example code:
 
 ```
@@ -59,3 +137,11 @@ understanding.
 
 The connect block will deploy a proxy pod dedicated to your client and connect
 your local runtime environment to the proxy pod via port-forwarding.
+
+4. You can start another client that uses the same `pathways_service` (similar to Step 3). If the Shared Pathways
+Service finds free TPU(s) that match your request, your workload will start running on the free resources. However,
+if all TPUs are occupied, you can expect your script to fail.
+
+## Troubleshooting
+Refer to [this guide](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways)
+if your Pathways pods do not come up!
\ No newline at end of file
diff --git a/pathwaysutils/experimental/shared_pathways_service/validators.py b/pathwaysutils/experimental/shared_pathways_service/validators.py
index bd3e7e6..18fbb23 100644
--- a/pathwaysutils/experimental/shared_pathways_service/validators.py
+++ b/pathwaysutils/experimental/shared_pathways_service/validators.py
@@ -47,6 +47,7 @@ def _validate_tpu_supported(tpu_instance_with_topology: str) -> None:
   Raises ValueError if the instance is not a valid TPU host.
   """
   # Mapping from Cloud TPU type prefix to max chips per host.
+  # Make sure to edit the project README if you update this mapping.
   single_host_max_chips = {
       "tpuv6e": 8,  # Cloud TPU v6e (2x4)
   }
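
The final hunk only shows the mapping inside `_validate_tpu_supported`, not the check that uses it. As a rough illustration of how a string such as `"tpuv6e:2x2"` (the key format used in `expected_tpu_instances`) could be checked against that mapping, here is a minimal sketch. The function name `validate_tpu_supported`, the parsing of the `type:topology` string, and the error messages are assumptions made for illustration; only the `"tpuv6e": 8` entry comes from the diff, and the real body of `validators.py` may differ.

```python
import math

# Only the "tpuv6e" entry is taken from the diff; the rest of this module is a
# hypothetical sketch, not the actual validators.py implementation.
SINGLE_HOST_MAX_CHIPS = {
    "tpuv6e": 8,  # Cloud TPU v6e (2x4)
}


def validate_tpu_supported(tpu_instance_with_topology: str) -> None:
    """Raises ValueError unless the instance fits on a single supported host.

    Assumes the input looks like "<tpu type>:<dim>x<dim>", e.g. "tpuv6e:2x2".
    """
    tpu_type, sep, topology = tpu_instance_with_topology.partition(":")
    if not sep or tpu_type not in SINGLE_HOST_MAX_CHIPS:
        raise ValueError(f"Unsupported TPU instance: {tpu_instance_with_topology!r}")

    try:
        dims = [int(dim) for dim in topology.split("x")]
    except ValueError:
        raise ValueError(f"Malformed topology: {topology!r}") from None
    if any(dim <= 0 for dim in dims):
        raise ValueError(f"Malformed topology: {topology!r}")

    # The chip count of a topology is the product of its dimensions.
    chips = math.prod(dims)
    max_chips = SINGLE_HOST_MAX_CHIPS[tpu_type]
    if chips > max_chips:
        raise ValueError(
            f"{tpu_instance_with_topology!r} needs {chips} chips, but a single "
            f"{tpu_type} host has at most {max_chips}."
        )
```

Under these assumptions, `validate_tpu_supported("tpuv6e:2x4")` returns without error, while `validate_tpu_supported("tpuv6e:4x4")` raises `ValueError` because 16 chips exceed the single-host limit of 8. This mirrors the README requirement that only single-host Trillium slices are supported.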