Commit d7b6d89

reword
1 parent b80acea commit d7b6d89

1 file changed

+47
-32
lines changed
pathwaysutils/experimental/shared_pathways_service/README.md

@@ -8,22 +8,28 @@ service that manages scheduling and error handling.
88

99
## Requirements
1010

11-
1. You have a GKE cluster with atleast 1 slice of `v6e-4` or `v6e-8`. Note that the Shared Pathways Service supports
11+
### 1. Create a GKE cluster with TPUs
12+
13+
You have a GKE cluster with at least 1 slice of `v6e-4` or `v6e-8`. Note that the Shared Pathways Service supports
1214
single-host Trillium slices only; this support will be extended soon.
1315

1416
<a name="pw-service-yaml"></a>
15-
2. Start the Shared Pathways Service by using [pw-service-example.yaml](yamls/pw-service-example.yaml).
17+
18+
### 2. Deploy the Pathways head pod
19+
20+
Start the Shared Pathways Service by using [pw-service-example.yaml](yamls/pw-service-example.yaml).
1621
Make sure to modify the following values to deploy the Pathways pods:
1722

1823
- A unique Jobset name for the cluster's Pathways pods
1924
- GCS bucket path
2025
- TPU type and topology
2126
- Number of slices
2227

23-
3. Verify that the Shared Pathways Service components are started, specifically the Resource Manager (RM) and Worker
28+
### 3. Verify that the pods created in Step 2 are running
29+
30+
Verify that the Shared Pathways Service components are started, specifically the Resource Manager (RM) and Worker
2431
pods.
2532

26-
Check that the required pods are running.
2733
```
2834
# Set the environment variables.
2935
$ PROJECT=<your-project>
@@ -32,8 +38,11 @@ $ REGION=<cluster-region> # e.g., us-central2
3238
3339
# Get credentials for your cluster.
3440
$ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project=$PROJECT && kubectl config view && kubectl config set-context --current --namespace=default
41+
```
42+
43+
#### Option 1: List all pods
3544

36-
# Check the status of RM and Worker pods.
45+
```
3746
$ kubectl get pods
3847
3948
# Sample expected output (1 Head pod and 1 or more Worker pods)
@@ -43,8 +52,7 @@ pathways-cluster-worker-0-0-bdzq4 1/1 Running 0 3m36s
4352
pathways-cluster-worker-1-0-km2rf 1/1 Running 0 3m36s # WORKER 1
4453
```
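
The manual check above can also be scripted. The following is a minimal sketch, not part of `pathwaysutils`: it assumes the default column layout of `kubectl get pods` (`NAME READY STATUS ...`), and the helper name `pods_not_ready`, the head pod name, and the `Pending` row are all hypothetical, chosen only for illustration.

```python
def pods_not_ready(kubectl_output: str) -> list:
    """Return the names of pods that are not Running or not fully ready.

    Assumes the default `kubectl get pods` columns: NAME READY STATUS ...
    """
    not_ready = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[:3]
        # READY looks like "1/1": ready containers / total containers.
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            not_ready.append(name)
    return not_ready


# Illustrative input only: the head pod name and the Pending row are made up.
sample = """\
NAME                                      READY   STATUS    RESTARTS   AGE
pathways-cluster-pathways-head-0-0-abcde  2/2     Running   0          3m49s
pathways-cluster-worker-0-0-bdzq4         1/1     Running   0          3m36s
pathways-cluster-worker-1-0-km2rf         0/1     Pending   0          3m36s
"""

print(pods_not_ready(sample))
```

An empty list means every listed pod is running with all of its containers ready. To use it against a live cluster, pass in the captured text of `kubectl get pods`.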
4554

46-
You can also verify the pod status by running below commands or by checking the project logs (Detailed instructions
47-
for the logs are <a href="https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring" target="_blank">here</a>).
55+
#### Option 2: Check the status of the specific pods that belong to your Pathways Service
4856

4957
```
5058
# e.g., pathways-cluster
@@ -57,55 +65,62 @@ $ HEAD_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${J
5765
$ WORKER0_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep 'worker-0-0-')
5866
```
5967

68+
#### Option 3: Check project logs
69+
70+
Find the detailed instructions
71+
<a href="https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring" target="_blank">here</a>.
72+
6073
<a name="find-pw-service"></a>
61-
4. Find the address of the Pathways service from the logs. We check the worker pod logs in the below command.
74+
### 4. Find the Pathways service address
75+
Find the address of the Pathways service from the logs. The command below checks the worker pod logs.
6276
```
6377
$ kubectl logs $WORKER0_POD_NAME --container pathways-worker | grep "\-\-resource_manager_address"
6478
I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001'
6579
```
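
If you want to extract the address programmatically rather than reading it from the `grep` output, a small parser is enough. This is a hedged sketch: the helper name `find_resource_manager_address` is hypothetical, and the log line format is assumed to match the sample shown above.

```python
import re
from typing import Optional

# Matches the flag value up to the closing quote or whitespace.
_RM_FLAG = re.compile(r"--resource_manager_address=([^\s'\"]+)")


def find_resource_manager_address(log_text: str) -> Optional[str]:
    """Return the first --resource_manager_address value found, or None."""
    match = _RM_FLAG.search(log_text)
    return match.group(1) if match else None


# Log line copied from the sample output above.
log_line = (
    "I1208 20:10:18.148825 ...] argv[2]: "
    "'--resource_manager_address="
    "pathways-cluster-pathways-head-0-0.pathways-cluster:29001'"
)

address = find_resource_manager_address(log_line)
# Split on the last colon to separate the host from the port.
host, _, port = address.rpartition(":")
print(host, int(port))
```

The returned string has the `host:port` form that the sample log line shows.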
6680

6781
## Instructions
6882

69-
1. Clone `pathwaysutils`.
83+
### 1. Clone `pathwaysutils`.
7084

7185
```
7286
git clone https://github.com/AI-Hypercomputer/pathways-utils.git
7387
```
7488

75-
2. Install `portpicker`.
89+
### 2. Install `portpicker`.
7690

7791
```
7892
pip install portpicker
7993
```
8094

81-
3. In your script,
95+
### 3. Use the `isc_pathways` Context Manager
8296

83-
- Import `isc_pathways`
84-
- Add `with isc_pathways.connect(...)` statement. The function takes the below values:
85-
- Cluster name
86-
- Project name
87-
- Region
88-
- GCS bucket name
89-
- Pathways Service (See instructions to find the Pathways address [here](#find-pw-service))
90-
- Write your ML code under this `with` block to run it on the underlying TPUs.
97+
In your script,
98+
99+
1. Import `isc_pathways`
100+
2. Add a `with isc_pathways.connect(...)` statement. The function takes the following values:
101+
- Cluster name
102+
- Project name
103+
- Region
104+
- GCS bucket name
105+
- Pathways Service (See instructions to find the Pathways address [here](#find-pw-service))
106+
3. Write your ML code under this `with` block to run it on the underlying TPUs.
91107

92108
See [run_connect_example.py](run_connect_example.py) for reference. Example code:
93109

94110
```
95-
from pathwaysutils.experimental.shared_pathways_service import isc_pathways
96-
97-
with isc_pathways.connect(
98-
cluster="my-cluster",
99-
project="my-project",
100-
region="region",
101-
gcs_bucket="gs://user-bucket",
102-
pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
111+
from pathwaysutils.experimental.shared_pathways_service import isc_pathways
112+
import jax.numpy as jnp
113+
import pathwaysutils
114+
import pprint
115+
116+
with isc_pathways.connect(
117+
cluster="my-cluster",
118+
project="my-project",
119+
region="region",
120+
gcs_bucket="gs://user-bucket",
121+
pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
103122
expected_tpu_instances={"tpuv6e:2x2": 2},
104-
) as tm:
105-
import jax.numpy as jnp
106-
import pathwaysutils
107-
import pprint
108-
123+
) as tm:
109124
pathwaysutils.initialize()
110125
orig_matrix = jnp.zeros(5)
111126
...
