Update "Shared Pathways Service" README #132
base: main
Add elaborate instructions to validate that the service components are running.
| (Detailed instructions are <a href="https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring" target="_blank">here</a>) | ||
|
|
||
| ``` | ||
| # Set the environment variables |
Maybe find these programmatically since you already have the jobset name?
I can get the pod names using `kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=<your-jobset-name> -o name`, but I'm not sure how to filter the pods by container type.
Added the commands.
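For reference, one possible way to answer the filtering question in this thread is to request the pods as JSON and filter on the container names in each pod spec. This is only a sketch, not the commands that were added to the README: the helper names `pods_with_container` and `list_jobset_pods` are hypothetical, and the container name to filter on is an assumption that depends on the YAML in use.

```python
import json
import subprocess


def pods_with_container(pod_list: dict, container_name: str) -> list[str]:
    """Return names of pods whose spec declares a container with the given name."""
    matches = []
    for pod in pod_list.get("items", []):
        containers = pod.get("spec", {}).get("containers", [])
        if any(c.get("name") == container_name for c in containers):
            matches.append(pod["metadata"]["name"])
    return matches


def list_jobset_pods(jobset_name: str) -> dict:
    """Fetch the JobSet's pods as parsed JSON via kubectl (needs cluster access)."""
    result = subprocess.run(
        [
            "kubectl", "get", "pods",
            f"--selector=jobset.sigs.k8s.io/jobset-name={jobset_name}",
            "-o", "json",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

With cluster access, `pods_with_container(list_jobset_pods("<your-jobset-name>"), "<container-name>")` would return the matching pod names; both placeholders must be filled in from your own deployment.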
| Make sure to modify the following values to deploy these pods: | ||
| ### 1. Create a GKE cluster with TPUs | ||
|
|
||
| You have a GKE cluster with atleast 1 slice of `v6e-4` or `v6e-8`. Note that the Shared Pathways Service supports |
nit: "Have a GKE cluster with at least one slice"
I would not mention the limited support of SPS, because you will then need to remember to remove this note later.
| You have a GKE cluster with atleast 1 slice of `v6e-4` or `v6e-8`. Note that the Shared Pathways Service supports | ||
| single-host Trillium slices only, this support will be extended soon. | ||
|
|
||
| <a name="pw-service-yaml"></a> |
Why is this line here?
| Start the Shared Pathways Service by using [pw-service-example.yaml](yamls/pw-service-example.yaml). | ||
| Make sure to modify the following values to deploy the Pathways pods: | ||
|
|
||
| - A unique Jobset name for the cluster's Pathways pods |
nit: for the cluster's Pathways pods
|
|
||
| These fields are highlighted in the YAML file with trailing comments for easier | ||
| understanding. | ||
| ### 3. Verify that the pods created in Step#2 are running |
nit: Step 2
| - Project name | ||
| - Region | ||
| - GCS bucket name | ||
| - Pathways Service (See instructions to find the Pathways address [here](#find-pw-service)) |
nit: Pathways Server resource manager address (See instructions for finding this [here](4.-find-the-pathways-service-address))
| - Region | ||
| - GCS bucket name | ||
| - Pathways Service (See instructions to find the Pathways address [here](#find-pw-service)) | ||
| 3. Write your ML code under this `with` block to run it on the underlying TPUs. |
nit: "under this context manager to run..."
| The connect block will deploy a proxy pod dedicated to your client and connect | ||
| your local runtime environment to the proxy pod via port-forwarding. | ||
|
|
||
| 4. You can start another client that uses the same `pathways_service` (similar to Step#3). If the Shared Pathways |
nit: Step 3
| your local runtime environment to the proxy pod via port-forwarding. | ||
|
|
||
| 4. You can start another client that uses the same `pathways_service` (similar to Step#3). If the Shared Pathways | ||
| Service finds free TPU(s) that match your request, your workload will start running on the free resources. However, |
Avoid the word "free"; use "available" instead.
|
|
||
| 4. You can start another client that uses the same `pathways_service` (similar to Step#3). If the Shared Pathways | ||
| Service finds free TPU(s) that match your request, your workload will start running on the free resources. However, | ||
| if all TPUs are occupied, you can expect your script to fail. |
"fail after a timeout with a log indicating that there are no available resources"
What is the behavior when there are not enough resources?
|
Make sure to merge your commits before merging the PR.