diff --git a/pathwaysutils/experimental/shared_pathways_service/README.md b/pathwaysutils/experimental/shared_pathways_service/README.md
index a46dcd4..e0a58ea 100644
--- a/pathwaysutils/experimental/shared_pathways_service/README.md
+++ b/pathwaysutils/experimental/shared_pathways_service/README.md
@@ -8,50 +8,119 @@ service that manages scheduling and error handling.
 ## Requirements
 
-Make sure that your GKE cluster is running the Resource Manager and Worker pods.
-You can follow the steps
-here
-to confirm the status of these pods. If you haven't started the Pathways pods
-yet, you can use [pw-service-example.yaml](yamls/pw-service-example.yaml).
-Make sure to modify the following values to deploy these pods:
+### 1. Create a GKE cluster with TPUs
+
+Ensure that you have a GKE cluster with at least one slice of `v6e-4` or `v6e-8`. Note that the Shared Pathways
+Service currently supports single-host Trillium slices only; this support will be extended soon.
+
+### 2. Deploy the Pathways pods
+
+Start the Shared Pathways Service by using [pw-service-example.yaml](yamls/pw-service-example.yaml).
+Make sure to modify the following values to deploy the Pathways pods:
 
 - A unique Jobset name for the cluster's Pathways pods
 - GCS bucket path
 - TPU type and topology
 - Number of slices
 
-These fields are highlighted in the YAML file with trailing comments for easier
-understanding.
+### 3. Verify that the pods created in Step 2 are running
+
+Verify that the Shared Pathways Service components, specifically the Resource Manager (RM) and Worker pods, are
+running.
+
+```
+# Set the environment variables.
+$ PROJECT=
+$ CLUSTER_NAME=
+$ REGION= # e.g., us-central2
+
+# Get credentials for your cluster.
+$ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project=$PROJECT && kubectl config view && kubectl config set-context --current --namespace=default
+```
+
+#### Option 1: List all pods
+
+```
+$ kubectl get pods
+
+# Sample expected output (1 head pod and 1 or more worker pods)
+NAME                                       READY   STATUS    RESTARTS   AGE
+pathways-cluster-pathways-head-0-0-zzmn2   2/2     Running   0          3m49s   # HEAD POD
+pathways-cluster-worker-0-0-bdzq4          1/1     Running   0          3m36s   # WORKER 0
+pathways-cluster-worker-1-0-km2rf          1/1     Running   0          3m36s   # WORKER 1
+```
+
+#### Option 2: Check the status of the specific pods that belong to your Pathways Service
+
+```
+# e.g., pathways-cluster
+$ JOBSET_NAME=  # same Jobset name as in pw-service-example.yaml
+
+# e.g., pathways-cluster-pathways-head-0-0-zzmn2
+$ HEAD_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep head)
+
+# e.g., pathways-cluster-worker-0-0-bdzq4
+$ WORKER0_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep 'worker-0-0-')
+```
+
+#### Option 3: Check project logs
+
+Find the detailed instructions
+here.
+
+### 4. Find the Pathways service address
+
+Find the address of the Pathways service in the logs. The command below checks the worker pod logs:
+
+```
+$ kubectl logs $WORKER0_POD_NAME --container pathways-worker | grep "\-\-resource_manager_address"
+I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001'
+```
 
 ## Instructions
 
-1. Clone `pathwaysutils`.
+### 1. Clone `pathwaysutils`
 
-`git clone https://github.com/AI-Hypercomputer/pathways-utils.git`
+```
+git clone https://github.com/AI-Hypercomputer/pathways-utils.git
+```
 
-2. Install portpicker
+### 2. Install `portpicker`
-`pip install portpicker`
+```
+pip install portpicker
+```
+
+### 3. Use the `isc_pathways` Context Manager
+
+In your script:
+
-3. Import `isc_pathways` and move your workload under
-`with isc_pathways.connect()` statement. Refer to
-[run_connect_example.py](run_connect_example.py) for reference. Example code:
+1. Import `isc_pathways`.
+2. Add a `with isc_pathways.connect(...)` statement. The function takes the following values:
+   - Cluster name
+   - Project name
+   - Region
+   - GCS bucket name
+   - Pathways service address (see Step 4 of the Requirements above)
+3. Write your ML code under this `with` block to run it on the underlying TPUs.
+
+See [run_connect_example.py](run_connect_example.py) for reference. Example code:
 
 ```
-    from pathwaysutils.experimental.shared_pathways_service import isc_pathways
+from pathwaysutils.experimental.shared_pathways_service import isc_pathways
+import jax.numpy as jnp
+import pathwaysutils
+import pprint
 
-    with isc_pathways.connect(
-        cluster="my-cluster",
-        project="my-project",
-        region="region",
-        gcs_bucket="gs://user-bucket",
-        pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
+with isc_pathways.connect(
+    cluster="my-cluster",
+    project="my-project",
+    region="region",
+    gcs_bucket="gs://user-bucket",
+    pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
     expected_tpu_instances={"tpuv6e:2x2": 2},
-    ) as tm:
-        import jax.numpy as jnp
-        import pathwaysutils
-        import pprint
-
+) as tm:
     pathwaysutils.initialize()
     orig_matrix = jnp.zeros(5)
     ...
@@ -59,3 +128,11 @@ understanding.
 
 The connect block will deploy a proxy pod dedicated to your client and connect
 your local runtime environment to the proxy pod via port-forwarding.
+
+### 4. Start additional clients
+
+You can start another client that uses the same `pathways_service` (similar to Step 3). If the Shared Pathways
+Service finds free TPU(s) that match your request, your workload will start running on the free resources.
However,
+if all TPUs are occupied, you can expect your script to fail.
+
+## Troubleshooting
+
+Refer to [this guide](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways)
+if your Pathways pods do not come up.