131 changes: 104 additions & 27 deletions pathwaysutils/experimental/shared_pathways_service/README.md
@@ -8,54 +8,131 @@ service that manages scheduling and error handling.

## Requirements

Make sure that your GKE cluster is running the Resource Manager and Worker pods.
You can follow the steps
<a href="https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring" target="_blank">here</a>
to confirm the status of these pods. If you haven't started the Pathways pods
yet, you can use [pw-service-example.yaml](yamls/pw-service-example.yaml).
Make sure to modify the following values to deploy these pods:
### 1. Create a GKE cluster with TPUs

You have a GKE cluster with at least 1 slice of `v6e-4` or `v6e-8`. Note that the Shared Pathways Service supports
Collaborator
nit: "Have a GKE cluster with at least one slice"

Collaborator
I would not mention the limited support of SPS because now you need to make sure you remove this later.

single-host Trillium slices only; this support will be extended soon.

<a name="pw-service-yaml"></a>
Collaborator
Why is this line here?


### 2. Deploy the Pathways head pod

Start the Shared Pathways Service by using [pw-service-example.yaml](yamls/pw-service-example.yaml).
Make sure to modify the following values to deploy the Pathways pods:

- A unique Jobset name for the cluster's Pathways pods
Collaborator
nit: for the cluster's Pathways pods

- GCS bucket path
- TPU type and topology
- Number of slices

These fields are highlighted in the YAML file with trailing comments for easier
understanding.
### 3. Verify that the pods created in Step#2 are running
Collaborator
nit: Step 2


Verify that the Shared Pathways Service components are started, specifically the Resource Manager (RM) and Worker
Collaborator
nit: "specifically the Pathways resource manager and Pathways workers" to align with the public docs of capitalizing Pathways and not capitalizing the component names

pods.

```
# Set the environment variables.
$ PROJECT=<your-project>
$ CLUSTER_NAME=<your-cluster>
$ REGION=<cluster-region> # e.g., us-central2

# Get credentials for your cluster.
$ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project=$PROJECT && kubectl config view && kubectl config set-context --current --namespace=default
```

#### Option 1: List all pods

```
Collaborator
nit: "```shell" for highlighting

$ kubectl get pods

# Sample expected output (1 Head pod and 1 or more Worker pods)
NAME READY STATUS RESTARTS AGE
pathways-cluster-pathways-head-0-0-zzmn2 2/2 Running 0 3m49s # HEAD POD
pathways-cluster-worker-0-0-bdzq4 1/1 Running 0 3m36s # WORKER 0
pathways-cluster-worker-1-0-km2rf 1/1 Running 0 3m36s # WORKER 1
```

#### Option 2: Check the status of the specific pods that belong to your Pathways Service

```
# e.g., pathways-cluster
$ JOBSET_NAME=<your-jobset-name> # same as you used in [pw-service-example.yaml](#pw-service-yaml)

# e.g., pathways-cluster-pathways-head-0-0-zzmn2
$ HEAD_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep head)
Collaborator
I think kubectl options can be used to simplify this to not need sed and grep pipes


# e.g., pathways-cluster-worker-0-0-bdzq4
$ WORKER0_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep 'worker-0-0-')
Collaborator
I think kubectl options can be used to not need sed and grep pipes

```
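
The name lookup in Option 2 can also be done without the `sed` and `grep` pipes by parsing the JSON that `kubectl get pods -o json` returns. The following is a minimal sketch, not part of `pathwaysutils`: the helper `running_pod_names` and the sample payload are hypothetical, and the name fragments (`head`, `worker-0-0-`) are taken from the sample output above.

```python
import json


def running_pod_names(pods_json: str, name_fragment: str) -> list:
  """Returns names of Running pods whose name contains `name_fragment`."""
  pods = json.loads(pods_json).get("items", [])
  return [
      pod["metadata"]["name"]
      for pod in pods
      if pod.get("status", {}).get("phase") == "Running"
      and name_fragment in pod["metadata"]["name"]
  ]


# Hypothetical payload shaped like `kubectl get pods -o json` output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "pathways-cluster-pathways-head-0-0-zzmn2"},
         "status": {"phase": "Running"}},
        {"metadata": {"name": "pathways-cluster-worker-0-0-bdzq4"},
         "status": {"phase": "Running"}},
        {"metadata": {"name": "pathways-cluster-worker-1-0-km2rf"},
         "status": {"phase": "Pending"}},
    ]
})

head_pods = running_pod_names(sample, "head")
worker0_pods = running_pod_names(sample, "worker-0-0-")
print(head_pods, worker0_pods)
```

In practice the JSON would come from `kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=<your-jobset-name> -o json`, for example captured with `subprocess.run`.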

#### Option 3: Check project logs

Find the detailed instructions
<a href="https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring" target="_blank">here</a>.
Collaborator
Use markdown link [here](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring)


<a name="find-pw-service"></a>
Collaborator
I don't think this is needed

### 4. Find the Pathways service address
Find the address of the Pathways service in the logs. The command below checks the worker pod logs.
```
$ kubectl logs $WORKER0_POD_NAME --container pathways-worker | grep "\-\-resource_manager_address"
Collaborator
Please make a bug for this to be easier to get.

I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001'
```
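
If the address is needed inside a script, it can be pulled out of the worker log text with a regular expression. This is a minimal sketch that assumes the log line has the shape shown in the sample above; the helper `find_resource_manager_address` is hypothetical and not part of `pathwaysutils`.

```python
import re
from typing import Optional

# Matches the flag value up to the next whitespace or quote character.
_RM_ADDRESS_RE = re.compile(r"--resource_manager_address=([^\s'\"]+)")


def find_resource_manager_address(log_text: str) -> Optional[str]:
  """Returns the first resource manager address in `log_text`, or None."""
  match = _RM_ADDRESS_RE.search(log_text)
  return match.group(1) if match else None


# Log line copied from the sample output above.
log_line = (
    "I1208 20:10:18.148825 ...] argv[2]: "
    "'--resource_manager_address="
    "pathways-cluster-pathways-head-0-0.pathways-cluster:29001'"
)
address = find_resource_manager_address(log_line)
print(address)
```

The returned string is the value to pass as `pathways_service` when connecting.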

## Instructions

### 1. Clone `pathwaysutils`.

```
git clone https://github.com/AI-Hypercomputer/pathways-utils.git
```

### 2. Install `portpicker`.
Collaborator
We can add portpicker to pathwaysutils's dependencies and remove this.


```
pip install portpicker
```

### 3. Use the `isc_pathways` Context Manager

In your script,

3. Import `isc_pathways` and move your workload under
`with isc_pathways.connect()` statement. Refer to
[run_connect_example.py](run_connect_example.py) for reference. Example code:
1. Import `isc_pathways`
2. Add `with isc_pathways.connect(...)` statement. The function takes the below values:
- Cluster name
- Project name
- Region
- GCS bucket name
- Pathways Service (See instructions to find the Pathways address [here](#find-pw-service))
Collaborator
nit: Pathways Server resource manager address (See instructions for finding this [here](4.-find-the-pathways-service-address))

3. Write your ML code under this `with` block to run it on the underlying TPUs.
Collaborator
nit: "under this context manager to run..."


See [run_connect_example.py](run_connect_example.py) for reference. Example code:

```
from pathwaysutils.experimental.shared_pathways_service import isc_pathways
import jax.numpy as jnp
import pathwaysutils
import pprint

with isc_pathways.connect(
    cluster="my-cluster",
    project="my-project",
    region="region",
    gcs_bucket="gs://user-bucket",
    pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
    expected_tpu_instances={"tpuv6e:2x2": 2},
) as tm:
  pathwaysutils.initialize()
  orig_matrix = jnp.zeros(5)
  ...
```

The connect block will deploy a proxy pod dedicated to your client and connect
your local runtime environment to the proxy pod via port-forwarding.
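
Before connecting, it can help to check how many TPU chips an `expected_tpu_instances` request amounts to. The sketch below assumes that each key has the form `<tpu type>:<topology>`, as in the example above, and that the chip count of one instance is the product of the topology dimensions. Both are inferred from the example rather than from a documented API, and the helper `total_chips` is hypothetical.

```python
import math


def total_chips(expected_tpu_instances: dict) -> int:
  """Sums chips over all requested instances (assumed key: 'type:AxB')."""
  total = 0
  for key, count in expected_tpu_instances.items():
    # Assumption: the part after ':' is the topology, e.g. '2x2'.
    _, topology = key.split(":", 1)
    chips_per_instance = math.prod(int(dim) for dim in topology.split("x"))
    total += chips_per_instance * count
  return total


requested = {"tpuv6e:2x2": 2}
chips = total_chips(requested)
print(chips)
```

Under these assumptions the request in the example asks for two 2x2 instances, or 8 chips in total.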

4. You can start another client that uses the same `pathways_service` (similar to Step#3). If the Shared Pathways
Collaborator
nit: Step 3

Service finds free TPU(s) that match your request, your workload will start running on the free resources. However,
Collaborator
avoid the word "free" and use "available"

if all TPUs are occupied, you can expect your script to fail.
Collaborator
"fail after a timeout with a log indicating that there are no available resources"

What is the behavior when there are not enough resources?
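
Because a second client may fail while all TPUs are occupied, one option is to retry the workload with a growing delay. The sketch below assumes the failure surfaces as a Python exception, which the README does not confirm. The helper `run_with_retries` is hypothetical and not part of `pathwaysutils`, and the failing workload is a stand-in for a function that wraps the `isc_pathways.connect(...)` block.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    workload: Callable[[], T],
    attempts: int = 3,
    initial_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
  """Calls `workload`, retrying on any exception with a doubling delay."""
  if attempts < 1:
    raise ValueError("attempts must be >= 1")
  delay = initial_delay_s
  for attempt in range(1, attempts + 1):
    try:
      return workload()
    except Exception:  # Assumption: an unavailable TPU raises an exception.
      if attempt == attempts:
        raise
      sleep(delay)
      delay *= 2
  raise AssertionError("unreachable")


# Stand-in workload that fails twice before succeeding.
calls = {"count": 0}
delays = []


def flaky_workload() -> str:
  calls["count"] += 1
  if calls["count"] < 3:
    raise RuntimeError("no available TPUs")  # hypothetical failure
  return "done"


# `delays.append` records the waits instead of sleeping.
result = run_with_retries(
    flaky_workload, attempts=3, initial_delay_s=1.0, sleep=delays.append
)
print(result, delays)
```

Catching a narrower exception type would be preferable once the actual failure mode is known.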


## Troubleshooting
Refer to [this guide](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways)
if your Pathways pods do not come up!