Update "Shared Pathways Service" README #132
base: main
Changes from 1 commit
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -8,34 +8,111 @@ service that manages scheduling and error handling. | |
|
|
||
| ## Requirements | ||
|
|
||
| Make sure that your GKE cluster is running the Resource Manager and Worker pods. | ||
| You can follow the steps | ||
| <a href="https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring" target="_blank">here</a> | ||
| to confirm the status of these pods. If you haven't started the Pathways pods | ||
| yet, you can use [pw-service-example.yaml](yamls/pw-service-example.yaml). | ||
| Make sure to modify the following values to deploy these pods: | ||
| 1. You have a GKE cluster with at least 1 slice of `v6e-4` or `v6e-8`. Note that the Shared Pathways Service supports | ||
| single-host Trillium slices only; this support will be extended soon. | ||
|
|
||
| 2. Start the Shared Pathways Service by using [pw-service-example.yaml](yamls/pw-service-example.yaml). | ||
| Make sure to modify the following values to deploy the Pathways pods: | ||
|
|
||
| - A unique Jobset name for the cluster's Pathways pods | ||
| - GCS bucket path | ||
| - TPU type and topology | ||
| - Number of slices | ||
|
|
||
| These fields are highlighted in the YAML file with trailing comments for easier | ||
| understanding. | ||
| 3. Verify that the Shared Pathways Service components are started, specifically the Resource Manager (RM) and Worker | ||
| pods. | ||
|
|
||
| Check that the required pods are running. | ||
| ``` | ||
| # Set the environment variables. | ||
| $ PROJECT=<your-project> | ||
| $ CLUSTER_NAME=<your-cluster> | ||
| $ REGION=<cluster-region> # e.g., us-central2 | ||
|
|
||
| # Get credentials for your cluster. | ||
| $ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project=$PROJECT && kubectl config view && kubectl config set-context --current --namespace=default | ||
|
|
||
| # Check the status of RM and Worker pods. | ||
| $ kubectl get pods | ||
|
|
||
| # Sample expected output | ||
| NAME READY STATUS RESTARTS AGE | ||
| pathways-cluster-pathways-head-0-0-zzmn2 2/2 Running 0 3m49s | ||
| pathways-cluster-worker-0-0-bdzq4 1/1 Running 0 3m36s | ||
| pathways-cluster-worker-1-0-km2rf 1/1 Running 0 3m36s | ||
| ``` | ||
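The pod check above can also be scripted. The following is a minimal sketch, not part of the README under review: the helper name `pods_not_running` is hypothetical, and it assumes the default column layout of `kubectl get pods` shown in the sample output (NAME, READY, STATUS first).

```python
def pods_not_running(kubectl_output: str) -> list[str]:
    """Return names of pods that are not fully ready and Running.

    Hypothetical helper: expects the plain-text table printed by
    `kubectl get pods` (NAME, READY, STATUS, ... columns).
    """
    problems = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip blank lines, short lines, and the header row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_now, _, ready_total = ready.partition("/")
        if status != "Running" or ready_now != ready_total:
            problems.append(name)
    return problems


# Sample output copied from the README text above.
SAMPLE = """\
NAME                                       READY   STATUS    RESTARTS   AGE
pathways-cluster-pathways-head-0-0-zzmn2   2/2     Running   0          3m49s
pathways-cluster-worker-0-0-bdzq4          1/1     Running   0          3m36s
pathways-cluster-worker-1-0-km2rf          1/1     Running   0          3m36s
"""

if __name__ == "__main__":
    print(pods_not_running(SAMPLE))
```

To run it against a live cluster, the text could be obtained with `subprocess.run(["kubectl", "get", "pods"], capture_output=True, text=True, check=True).stdout`, assuming `kubectl` is on the path and the credentials step has already been run.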
|
|
||
| You can also verify the pod status by looking at the project logs. Look for the following substrings for each pod | ||
| type. | ||
|
|
||
| (Detailed instructions are <a href="https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring" target="_blank">here</a>) | ||
|
|
||
| ``` | ||
| # Set the environment variables | ||
|
||
| $ HEAD_POD_NAME=pathways-cluster-pathways-head-0-0-zzmn2 | ||
| $ WORKER0_POD_NAME=pathways-cluster-worker-0-0-bdzq4 | ||
| $ WORKER1_POD_NAME=pathways-cluster-worker-1-0-km2rf | ||
| ``` | ||
|
|
||
| - RM | ||
| ``` | ||
| $ kubectl logs $HEAD_POD_NAME --container pathways-rm | ||
| ... | ||
| I1208 20:10:04.992524 ...] Pathways Server serving on [::]:29001 | ||
| ... | ||
| I1208 20:10:23.848070 ...] *** 2/2 Pathways Slices Now Ready | ||
| ``` | ||
|
|
||
| - Worker | ||
| ``` | ||
| $ kubectl logs $WORKER0_POD_NAME --container pathways-worker | ||
| ... | ||
| I1208 20:10:23.838022 ...] Pathways Server serving on [::]:29005 | ||
| ... | ||
| I1208 20:10:25.249167 ...] MegaScale transport initialized. | ||
| I1208 20:10:25.249172 ...] MegaScale transport init succeeded. | ||
|
|
||
| $ kubectl logs $WORKER1_POD_NAME --container pathways-worker | ||
| ... | ||
| I1208 20:10:23.579361 ...] Pathways Server serving on [::]:29005 | ||
| I1208 20:10:24.994411 ...] MegaScale transport initialized. | ||
| I1208 20:10:24.994416 ...] MegaScale transport init succeeded. | ||
| ... | ||
| ``` | ||
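The log check above amounts to searching each container's log for a few substrings. A rough sketch follows, assuming the substrings shown in the sample logs are the ones that indicate readiness; the names `READY_MARKERS` and `missing_markers` are hypothetical and do not come from `pathwaysutils`.

```python
# Substrings taken from the sample logs above. Treating them as the
# readiness markers for each pod type is an assumption of this sketch.
READY_MARKERS = {
    "rm": (
        "Pathways Server serving on",
        "Pathways Slices Now Ready",
    ),
    "worker": (
        "Pathways Server serving on",
        "MegaScale transport init succeeded.",
    ),
}


def missing_markers(log_text: str, pod_type: str) -> list[str]:
    """Return the expected substrings that do not appear in log_text."""
    return [m for m in READY_MARKERS[pod_type] if m not in log_text]
```

An empty list means every expected substring was found. The log text itself would come from the `kubectl logs` commands shown above.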
|
|
||
| <a name="find-pw-service"></a> | ||
|
Collaborator
I don't think this is needed |
||
| 4. Find the address of the Pathways service. | ||
| ``` | ||
| $ kubectl logs $WORKER0_POD_NAME --container pathways-worker | grep "\-\-resource_manager_address" | ||
|
Collaborator
Please make a bug for this to be easier to get. |
||
| I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001' | ||
| ``` | ||
|
|
||
| ## Instructions | ||
|
|
||
| 1. Clone `pathwaysutils`. | ||
|
|
||
| `git clone https://github.com/AI-Hypercomputer/pathways-utils.git` | ||
| ``` | ||
| git clone https://github.com/AI-Hypercomputer/pathways-utils.git | ||
| ``` | ||
|
|
||
| 2. Install `portpicker`. | ||
|
|
||
| 2. Install portpicker | ||
| ``` | ||
| pip install portpicker | ||
| ``` | ||
|
|
||
| `pip install portpicker` | ||
| 3. In your script, | ||
|
|
||
| 3. Import `isc_pathways` and move your workload under | ||
| `with isc_pathways.connect()` statement. Refer to | ||
| [run_connect_example.py](run_connect_example.py) for reference. Example code: | ||
| - Import `isc_pathways` | ||
| - Add `with isc_pathways.connect(...)` statement. The function takes the below values: | ||
| - Cluster name | ||
| - Project name | ||
| - Region | ||
| - GCS bucket name | ||
| - Pathways Service (See instructions to find the Pathways address [here](#find-pw-service)) | ||
| - Write your ML code under this `with` block to run it on the underlying TPUs. | ||
|
|
||
| See [run_connect_example.py](run_connect_example.py) for reference. Example code: | ||
|
|
||
| ``` | ||
| from pathwaysutils.experimental.shared_pathways_service import isc_pathways | ||
|
|
@@ -59,3 +136,11 @@ understanding. | |
|
|
||
| The connect block will deploy a proxy pod dedicated to your client and connect | ||
| your local runtime environment to the proxy pod via port-forwarding. | ||
|
|
||
| 4. You can start another client that uses the same `pathways_service` (similar to Step#3). If the Shared Pathways | ||
|
Collaborator
nit: Step 3 |
||
| Service finds free TPU(s) that match your request, your workload will start running on the free resources. However, | ||
|
Collaborator
avoid the word "free" and use "available" |
||
| if all TPUs are occupied, you can expect your script to fail. | ||
|
Collaborator
"fail after a timeout with a log indicating that there are no available resources" What is the behavior when there are not enough resources? |
||
|
|
||
| ## Troubleshooting | ||
| Refer to [this guide](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways) | ||
| if your Pathways pods do not come up! | ||
nit:
for the cluster's Pathways pods