@@ -11,6 +11,7 @@ service that manages scheduling and error handling.
11111 . You have a GKE cluster with at least 1 slice of `v6e-4` or `v6e-8`. Note that the Shared Pathways Service currently supports
1212single-host Trillium slices only; this support will be extended soon.
1313
14+ <a name="pw-service-yaml"></a>
14152 . Start the Shared Pathways Service by using [pw-service-example.yaml](yamls/pw-service-example.yaml).
1516Make sure to modify the following values to deploy the Pathways pods:
1617
@@ -35,53 +36,29 @@ $ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --pro
3536# Check the status of RM and Worker pods.
3637$ kubectl get pods
3738
38- # Sample expected output
39+ # Sample expected output (1 Head pod and 1 or more Worker pods)
3940NAME READY STATUS RESTARTS AGE
40- pathways-cluster-pathways-head-0-0-zzmn2 2/2 Running 0 3m49s
41- pathways-cluster-worker-0-0-bdzq4 1/1 Running 0 3m36s
42- pathways-cluster-worker-1-0-km2rf 1/1 Running 0 3m36s
41+ pathways-cluster-pathways-head-0-0-zzmn2 2/2 Running 0 3m49s # HEAD POD
42+ pathways-cluster-worker-0-0-bdzq4 1/1 Running 0 3m36s # WORKER 0
43+ pathways-cluster-worker-1-0-km2rf 1/1 Running 0 3m36s # WORKER 1
4344```
4445
45- You can also verify the pod status by looking at the project logs. Look for the below substring for the respective pod
46- type .
46+ You can also verify the pod status by running the commands below or by checking the project logs (detailed instructions
47+ for the logs are <a href="https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring" target="_blank">here</a>).
4748
48- (Detailed instructions are <a href =" https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring " target =" _blank " >here</a >)
49-
50- ```
51- # Set the environment variables
52- $ HEAD_POD_NAME=pathways-cluster-pathways-head-0-0-zzmn2
53- $ WORKER0_POD_NAME=pathways-cluster-worker-0-0-bdzq4
54- $ WORKER1_POD_NAME=pathways-cluster-worker-1-0-km2rf
5549```
50+ # e.g., pathways-cluster
51+ $ JOBSET_NAME=<your-jobset-name> # same as you used in [pw-service-example.yaml](#pw-service-yaml)
5652
57- - RM
58- ```
59- $ kubectl logs $HEAD_POD_NAME --container pathways-rm
60- ...
61- I1208 20:10:04.992524 ...] Pathways Server serving on [::]:29001
62- ...
63- I1208 20:10:23.848070 ...] *** 2/2 Pathways Slices Now Ready
64- ```
53+ # e.g., pathways-cluster-pathways-head-0-0-zzmn2
54+ $ HEAD_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep head)
6555
66- - Worker
67- ```
68- $ kubectl logs $WORKER0_POD_NAME --container pathways-worker
69- ...
70- I1208 20:10:23.838022 ...] Pathways Server serving on [::]:29005
71- ...
72- I1208 20:10:25.249167 ...] MegaScale transport initialized.
73- I1208 20:10:25.249172 ...] MegaScale transport init succeeded.
74-
75- $ kubectl logs $WORKER1_POD_NAME --container pathways-worker
76- ...
77- I1208 20:10:23.579361 ...] Pathways Server serving on [::]:29005
78- I1208 20:10:24.994411 ...] MegaScale transport initialized.
79- I1208 20:10:24.994416 ...] MegaScale transport init succeeded.
80- ...
56+ # e.g., pathways-cluster-worker-0-0-bdzq4
57+ $ WORKER0_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep 'worker-0-0-')
8158```
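
The pod-name lookups above split the space-separated `jsonpath` output into one name per line and then filter it with `grep`. The sketch below reproduces that step on the sample pod names from this README, so it runs without a cluster. The helper name `pick_pod` is hypothetical, and it swaps `sed 's/ /\n/g'` for `tr ' ' '\n'` on the assumption that some readers use a `sed` (for example BSD/macOS) that does not expand `\n` in the replacement text.

```shell
# Hypothetical helper (not part of the Pathways tooling): print the first
# pod name in a space-separated list that contains the given substring.
pick_pod() {
  names="$1"    # space-separated pod names, as printed by the jsonpath query
  pattern="$2"  # fixed substring to look for, e.g. "head" or "worker-0-0-"
  printf '%s\n' "$names" | tr ' ' '\n' | grep -F -- "$pattern" | head -n 1
}

# Sample names copied from the expected `kubectl get pods` output above.
PODS="pathways-cluster-pathways-head-0-0-zzmn2 pathways-cluster-worker-0-0-bdzq4 pathways-cluster-worker-1-0-km2rf"

HEAD_POD_NAME=$(pick_pod "$PODS" head)
WORKER0_POD_NAME=$(pick_pod "$PODS" worker-0-0-)
echo "$HEAD_POD_NAME"
echo "$WORKER0_POD_NAME"
```

Against a live cluster, `PODS` would instead hold the output of the `kubectl get pods ... -o jsonpath=...` query shown above. The trailing `head -n 1` keeps a single name if more than one pod happens to match the substring.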
8259
8360<a name="find-pw-service"></a>
84- 4 . Find the address of the Pathways service.
61+ 4 . Find the address of the Pathways service from the logs. The command below checks the worker pod logs.
8562```
8663$ kubectl logs $WORKER0_POD_NAME --container pathways-worker | grep "\-\-resource_manager_address"
8764I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001'