diff --git a/pathwaysutils/experimental/shared_pathways_service/README.md b/pathwaysutils/experimental/shared_pathways_service/README.md
new file mode 100644
index 0000000..9e462eb
--- /dev/null
+++ b/pathwaysutils/experimental/shared_pathways_service/README.md
@@ -0,0 +1,147 @@
# Shared Pathways Service

The Shared Pathways Service accelerates developer iteration by providing a
persistent, multi-tenant TPU environment. This decouples service creation from
the development loop, allowing JAX clients to connect on demand from a familiar
local environment (such as a laptop or cloud VM) to a long-running Pathways
service that manages scheduling and error handling.

## Requirements

1. You have a GKE cluster with at least one slice of `v6e-4` or `v6e-8`. Note that the Shared Pathways Service
supports single-host Trillium slices only; this support will be extended soon.

2. Start the Shared Pathways Service by using [pw-service-example.yaml](yamls/pw-service-example.yaml).
Make sure to modify the following values to deploy the Pathways pods:

- A unique JobSet name for the cluster's Pathways pods
- GCS bucket path
- TPU type and topology
- Number of slices

3. Verify that the Shared Pathways Service components have started, specifically the Resource Manager (RM) and Worker
pods.

Check that the required pods are running.

```
# Set the environment variables.
$ PROJECT=
$ CLUSTER_NAME=
$ REGION=  # e.g., us-central2

# Get credentials for your cluster.
$ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project=$PROJECT && kubectl config view && kubectl config set-context --current --namespace=default

# Check the status of the RM and Worker pods.
$ kubectl get pods

# Sample expected output
NAME                                       READY   STATUS    RESTARTS   AGE
pathways-cluster-pathways-head-0-0-zzmn2   2/2     Running   0          3m49s
pathways-cluster-worker-0-0-bdzq4          1/1     Running   0          3m36s
pathways-cluster-worker-1-0-km2rf          1/1     Running   0          3m36s
```

You can also verify the pod status by looking at the project logs. Look for the substrings below for the respective
pod type.

```
# Set the environment variables.
$ HEAD_POD_NAME=pathways-cluster-pathways-head-0-0-zzmn2
$ WORKER0_POD_NAME=pathways-cluster-worker-0-0-bdzq4
$ WORKER1_POD_NAME=pathways-cluster-worker-1-0-km2rf
```

- RM

```
$ kubectl logs $HEAD_POD_NAME --container pathways-rm
...
I1208 20:10:04.992524 ...] Pathways Server serving on [::]:29001
...
I1208 20:10:23.848070 ...] *** 2/2 Pathways Slices Now Ready
```

- Worker

```
$ kubectl logs $WORKER0_POD_NAME --container pathways-worker
...
I1208 20:10:23.838022 ...] Pathways Server serving on [::]:29005
...
I1208 20:10:25.249167 ...] MegaScale transport initialized.
I1208 20:10:25.249172 ...] MegaScale transport init succeeded.

$ kubectl logs $WORKER1_POD_NAME --container pathways-worker
...
I1208 20:10:23.579361 ...] Pathways Server serving on [::]:29005
I1208 20:10:24.994411 ...] MegaScale transport initialized.
I1208 20:10:24.994416 ...] MegaScale transport init succeeded.
...
```

4. Find the address of the Pathways service.

```
$ kubectl logs $WORKER0_POD_NAME --container pathways-worker | grep "\-\-resource_manager_address"
I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001'
```

## Instructions

1. Clone `pathwaysutils`.

```
git clone https://github.com/AI-Hypercomputer/pathways-utils.git
```

2. Install `portpicker`.

```
pip install portpicker
```

3. In your script,

   - Import `isc_pathways`.
   - Add a `with isc_pathways.connect(...)` statement. The function takes the following values:
     - Cluster name
     - Project name
     - Region
     - GCS bucket name
     - Pathways service address (see step 4 of the [Requirements](#requirements) section to find it)
   - Write your ML code under this `with` block to run it on the underlying TPUs.

See [run_connect_example.py](run_connect_example.py) for reference. Example code:

```
from pathwaysutils.experimental.shared_pathways_service import isc_pathways

with isc_pathways.connect(
    cluster="my-cluster",
    project="my-project",
    region="region",
    gcs_bucket="gs://user-bucket",
    pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
    expected_tpu_instances={"tpuv6e:2x2": 2},
) as tm:
  import jax.numpy as jnp
  import pathwaysutils
  import pprint

  pathwaysutils.initialize()
  orig_matrix = jnp.zeros(5)
  ...
```

The `connect` block deploys a proxy pod dedicated to your client and connects
your local runtime environment to the proxy pod via port forwarding.

4. You can start another client that uses the same `pathways_service` (similar to step 3); a sketch of such a second
client is shown below. If the Shared Pathways Service finds free TPU(s) that match your request, your workload will
start running on the free resources. However, if all TPUs are occupied, you can expect your script to fail.
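As an illustration of this multi-tenant behavior, here is a minimal sketch of a second client, assuming the service
was deployed with two `2x2` Trillium slices as in the example above and the first client is still holding one of them.
The cluster, project, region, bucket, and service address are placeholders copied from the earlier example, and the
`{"tpuv6e:2x2": 1}` request is only illustrative.

```
# Hypothetical second client sharing the same Pathways service. All
# connection values are placeholders mirroring the example above.
from pathwaysutils.experimental.shared_pathways_service import isc_pathways

with isc_pathways.connect(
    cluster="my-cluster",
    project="my-project",
    region="region",
    gcs_bucket="gs://user-bucket",
    pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
    # Request a single 2x2 Trillium slice; if no slice is free, the script
    # is expected to fail.
    expected_tpu_instances={"tpuv6e:2x2": 1},
) as tm:
  import jax
  import jax.numpy as jnp
  import pathwaysutils

  pathwaysutils.initialize()
  print(jax.devices())        # TPU devices backed by the free slice.
  print(jnp.arange(8).sum())  # Runs on the shared TPUs.
```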
## Troubleshooting

Refer to [this guide](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways)
if your Pathways pods do not come up!
\ No newline at end of file
diff --git a/pathwaysutils/experimental/shared_pathways_service/validators.py b/pathwaysutils/experimental/shared_pathways_service/validators.py
index bd3e7e6..18fbb23 100644
--- a/pathwaysutils/experimental/shared_pathways_service/validators.py
+++ b/pathwaysutils/experimental/shared_pathways_service/validators.py
@@ -47,6 +47,7 @@ def _validate_tpu_supported(tpu_instance_with_topology: str) -> None:
   Raises ValueError if the instance is not a valid TPU host.
   """
   # Mapping from Cloud TPU type prefix to max chips per host.
+  # Make sure to edit the project README if you update this mapping.
   single_host_max_chips = {
       "tpuv6e": 8,  # Cloud TPU v6e (2x4)
   }
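For context on how an `expected_tpu_instances` key such as `"tpuv6e:2x2"` relates to the `single_host_max_chips`
mapping touched above, the sketch below shows the kind of check `_validate_tpu_supported` performs. The function body
itself is outside this hunk, so the helper name `validate_tpu_supported_sketch` and its parsing logic are assumptions
for illustration, not the library's implementation.

```
# Illustrative sketch only: the real _validate_tpu_supported body is not shown
# in this diff, so the parsing below is an assumption, not the library's code.
single_host_max_chips = {
    "tpuv6e": 8,  # Cloud TPU v6e (2x4)
}


def validate_tpu_supported_sketch(tpu_instance_with_topology: str) -> None:
  """Raises ValueError if the instance does not fit on a single TPU host."""
  tpu_type, _, topology = tpu_instance_with_topology.partition(":")
  if tpu_type not in single_host_max_chips:
    raise ValueError(f"Unsupported TPU type: {tpu_type!r}")
  # A topology such as "2x2" multiplies out to the number of chips requested.
  chips = 1
  for dim in topology.split("x"):
    chips *= int(dim)
  if chips > single_host_max_chips[tpu_type]:
    raise ValueError(
        f"{tpu_instance_with_topology!r} needs {chips} chips, but a single "
        f"{tpu_type} host provides at most {single_host_max_chips[tpu_type]}."
    )


validate_tpu_supported_sketch("tpuv6e:2x2")  # OK: 4 chips fit on one host.
```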