Commit 886b6bc

Added advanced usage example for a notebook interacting with a Cloud TPU cluster (#178)

1 parent 1647ec6 commit 886b6bc

File tree

2 files changed: +349 -6 lines changed

README.md

Lines changed: 9 additions & 6 deletions
@@ -497,7 +497,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
  --cluster xpk-test --filter-by-job=$USER
```

* Workload List supports waiting for the completion of a specific job. XPK will follow an existing job until it has finished or the `timeout`, if provided, has been reached and then list the job. If no `timeout` is specified, the default value is set to the max value, 1 week. You may also set `timeout=0` to poll the job once.
(Note: `restart-on-user-code-failure` must be set
when creating the workload otherwise the workload will always finish with `Completed` status.)

@@ -516,11 +516,11 @@ when creating the workload otherwise the workload will always finish with `Compl
  --timeout=300
```

Return codes

`0`: Workload finished and completed successfully.
`124`: Timeout was reached before workload finished.
`125`: Workload finished but did not complete successfully.
`1`: Other failure.

## Inspector
* Inspector provides debug info to understand cluster health, and why workloads are not running.

@@ -1078,3 +1078,6 @@ To explore the stack traces collected in a temporary directory in Kubernetes Pod
  --workload xpk-test-workload --command "python3 main.py" --cluster \
  xpk-test --tpu-type=v5litepod-16 --deploy-stacktrace-sidecar
```

# Other advanced usage
[Use a Jupyter notebook to interact with a Cloud TPU cluster](xpk-notebooks.md)

xpk-notebooks.md

Lines changed: 340 additions & 0 deletions
@@ -0,0 +1,340 @@
<!--
 Copyright 2024 Google LLC

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

      https://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 -->

# Advanced usage - Use a Jupyter notebook to interact with a Cloud TPU cluster

[Return to README](README.md#other-advanced-usage)

## Introduction
One of the challenges researchers face when working with contemporary models is the distributed programming involved in orchestrating work across a complex architecture. This example shows you how to use XPK to create a Cloud TPU v5e-256 cluster and interact with it using a Jupyter notebook.

## Assumptions
You need to ensure you have the TPU capacity (quotas and limits) for this activity. You may need to change machine names and shapes to make this work.

To interact with the cluster, we use IPython Parallel and some [cell magic](https://ipyparallel.readthedocs.io/en/latest/tutorial/magics.html). IPython Parallel (ipyparallel) is a Python package and collection of CLI scripts for controlling clusters of IPython processes, built on the Jupyter protocol. While the default settings are adequate for this example, you should review the [ipyparallel security details](https://ipyparallel.readthedocs.io/en/latest/reference/security.html) before use in a production environment.
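Conceptually, the notebook side follows the standard ipyparallel pattern: connect a `Client` to the controller, take a view over the engines, and activate the `%px`/`%%px` magics so a single cell can run on every engine. The sketch below is a generic illustration of that pattern, not part of the original setup; cluster-specific connection details are shown in the "Use the notebook" section at the end.
```python
# Generic ipyparallel pattern, run from notebook cells (sketch only).
import ipyparallel as ipp

rc = ipp.Client()      # connects via an ipcontroller-client.json connection file
print(len(rc.ids))     # number of engines currently registered with the controller

dview = rc[:]          # DirectView over all engines
dview.activate()       # registers the %px / %%px magics in this notebook

# In a later notebook cell, %%px runs the cell body on every engine, e.g.:
# %%px
# import socket
# print(socket.gethostname())
```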
We do most of this work from a Cloud Shell instance. We will use some environment variables to make life easier.
```shell
export PROJECTID=${GOOGLE_CLOUD_PROJECT}
export CLUSTER=     # your cluster name
export REGION=      # region for cluster
export ZONE=        # zone for cluster
```

## Cluster creation
### Optional: high-MTU network
If you need to work with multiple TPU slices, it will be useful to create a high-MTU network as shown here (the remaining steps assume you do):
https://github.com/google/maxtext/tree/main/MaxText/configs#create-a-custom-mtu-network
```shell
gcloud compute networks create mtu9k --mtu=8896 \
  --project=${PROJECTID} --subnet-mode=auto \
  --bgp-routing-mode=regional

gcloud compute firewall-rules create mtu9kfw --network mtu9k \
  --allow tcp,icmp,udp --project=${PROJECTID}
```

### XPK create cluster
Install XPK. (You know, this repo!)

Create a GKE Cloud TPU cluster using XPK.
```shell
xpk cluster create --cluster ${CLUSTER} \
  --project=${PROJECTID} --default-pool-cpu-machine-type=n2-standard-8 \
  --num-slices=1 --tpu-type=v5litepod-256 --zone=${ZONE} \
  --spot --custom-cluster-arguments="--network=mtu9k --subnetwork=mtu9k"

# if you need to delete this cluster to fix errors
xpk cluster delete --cluster ${CLUSTER} --zone=${ZONE}
```

## Add storage
Enable the Filestore CSI driver so we can use an NFS Filestore instance for shared storage. (This may take 20-30 minutes.)
```shell
gcloud container clusters update ${CLUSTER} \
  --region ${REGION} --project ${PROJECTID} \
  --update-addons=GcpFilestoreCsiDriver=ENABLED
```

### Filestore instance
Create a regional NFS [Filestore instance](https://cloud.google.com/filestore/docs/creating-instances#google-cloud-console) in ``${REGION}`` on the network created above.

Note the instance ID and file share name you've used. You will need to wait until this instance is available to continue.

### Persistent volumes
Once the Filestore instance is up, create a file with the correct names and storage size so you can create a persistent volume for the cluster. You will need to update the ``volumeHandle`` and ``volumeAttributes`` below. You will also need to change the names to match.
```yaml
# persistent-volume.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: opmvol
spec:
  storageClassName: ""
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  volumeMode: Filesystem
  csi:
    driver: filestore.csi.storage.gke.io
    volumeHandle: "modeInstance/${ZONE}/nfs-opm-ase/nfs_opm_ase"
    volumeAttributes:
      ip: 10.243.23.194
      volume: nfs_opm_ase
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: opmvol-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: opmvol
  resources:
    requests:
      storage: 1T
```
Apply the change. Be sure to get the cluster credentials first if you haven't already done that.
```shell
# get cluster credentials if needed
# gcloud container clusters get-credentials ${CLUSTER} --region ${REGION} --project ${PROJECTID}
# kubectl get nodes

# add the storage to the cluster
kubectl apply -f persistent-volume.yaml
```

If it worked, you should see the volume and claim listed.
```shell
kubectl get pv
kubectl get pvc
```

## Build Docker image for IPP nodes
We will start with the MaxText image because we want to train an LLM.
```shell
# get the code
git clone "https://github.com/google/maxtext"
```

We'll start from a JAX Stable Stack image for TPUs and then update the build specification to include ipyparallel. Edit ``requirements_with_jax_stable_stack.txt`` to add this at the bottom.
```shell
# also include IPyParallel
ipyparallel
```

Build the image and upload it so we can use the image to spin up pods. Note the resulting image name. It should be something like ``gcr.io/${PROJECTID}/opm_ipp_runner/tpu``.
```shell
# use docker build to build the image and upload it
# NOTE: you may need to change the upload repository
bash ./docker_maxtext_jax_stable_stack_image_upload.sh PROJECT_ID=${PROJECTID} \
  BASEIMAGE=us-docker.pkg.dev/${PROJECTID}/jax-stable-stack/tpu:jax0.4.30-rev1 \
  CLOUD_IMAGE_NAME=opm_ipp_runner IMAGE_TAG=latest \
  MAXTEXT_REQUIREMENTS_FILE=requirements_with_jax_stable_stack.txt

# confirm the image is available
# docker image list gcr.io/${PROJECTID}/opm_ipp_runner/tpu:latest
```

## Set up LWS
We use the LeaderWorkerSet (LWS) for these IPP pods, so they are managed collectively.
```shell
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/v0.3.0/manifests.yaml
```

## Set up IPP deployment
Next we set up an LWS pod specification for our IPP instances. Create an ``ipp-deployment.yaml`` file.
You will need to update the volume mounts and the container image references. (You should also change the password.)
```yaml
# ipp-deployment.yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: ipp-deployment
  annotations:
    leaderworkerset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 65
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          app: ipp-controller
      spec:
        securityContext:
          runAsUser: 1000
          runAsGroup: 100
          fsGroup: 100
        nodeSelector:
          cloud.google.com/gke-tpu-topology: 16x16
          cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
        tolerations:
        - key: "google.com/tpu"
          operator: "Exists"
          effect: "NoSchedule"
        containers:
        - name: jupyter-notebook-server
          image: jupyter/base-notebook:latest
          args: ["start-notebook.sh", "--NotebookApp.allow_origin='https://colab.research.google.com'", "--NotebookApp.port_retries=0"]
          resources:
            limits:
              cpu: 1000m
              memory: 1Gi
            requests:
              cpu: 100m
              memory: 500Mi
          ports:
          - containerPort: 8888
            name: http-web-svc
          volumeMounts:
          - name: opmvol
            mountPath: /home/jovyan/nfs # jovyan is the default user
        - name: ipp-controller
          image: gcr.io/${PROJECTID}/opm_ipp_runner/tpu
          command:
          - bash
          - -c
          - |
            ip=$(hostname -I | awk '{print $1}')
            echo $ip
            ipcontroller --ip="$ip" --profile-dir=/app/ipp --log-level=ERROR --ping 10000
          volumeMounts:
          - name: opmvol
            mountPath: /app/ipp
        volumes:
        - name: opmvol
          persistentVolumeClaim:
            claimName: opmvol-claim

    workerTemplate:
      spec:
        nodeSelector:
          cloud.google.com/gke-tpu-topology: 16x16
          cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
        containers:
        - name: ipp-engine
          image: gcr.io/${PROJECTID}/opm_ipp_runner/tpu
          ports:
          - containerPort: 8471 # default port TPU VMs use to communicate
          securityContext:
            privileged: true
          command:
          - bash
          - -c
          - |
            sleep 20
            ipengine --file="/app/ipp/security/ipcontroller-engine.json" --timeout 5.0
          resources:
            requests:
              google.com/tpu: 4
            limits:
              google.com/tpu: 4
          volumeMounts:
          - name: opmvol
            mountPath: /app/ipp
        volumes:
        - name: opmvol
          persistentVolumeClaim:
            claimName: opmvol-claim
```

Add the resource to the GKE cluster.
```shell
kubectl apply -f ipp-deployment.yaml

# to view pod status as they come up
# kubectl get pods
```
Add a service to expose it.

Create ``ipp-service.yaml``
```yaml
# ipp-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ipp
spec:
  selector:
    app: ipp-controller
  ports:
  - protocol: TCP
    port: 8888
    targetPort: 8888
  type: ClusterIP #LoadBalancer
```

Deploy the new service.
```shell
kubectl apply -f ipp-service.yaml
```

If the pods don't come up as a multihost cluster, you may need to correct the number of hosts for the number of chips you have (e.g., a v5e-256 should have an LWS size of 65: 64 ipp-engines plus 1 ipp-controller). If you need to look at a single container in isolation, you can use something like this.
```shell
# you should NOT have to do this
# kubectl exec ipp-deployment-0-2 -c ipp-engine -- python3 -c "import jax; jax.device_count()"
```

To correct errors, you can re-apply an updated template and re-create the leader pod.
```shell
# to fetch an updated docker image without changing anything else
# kubectl delete pod ipp-deployment-0

# to update the resource definition (automatically re-creates pods)
# kubectl apply -f ipp-deployment.yaml

# to update the resource definition after an immutable change, you will likely need to use the Console
# (i.e., delete the lws-controller-manager, ipp, and ipp-deployment workloads)
# and then you'll also need to delete the resources
# kubectl delete leaderworkerset/ipp-deployment
# kubectl delete service/ipp
```

## Optional: optimize networking
If you created a high-MTU network, use the MaxText [preflight.sh](https://github.com/google/maxtext/blob/main/preflight.sh) script (which invokes another script) to tune the network settings for the pods before using them with the notebook (the MaxText reference training scripts do this automatically).
```shell
for pod in $(kubectl get pods --no-headers --output jsonpath="{range.items[*]}{..metadata.name}{'\n'}{end}" | grep ipp-deployment-0-); \
do \
  echo "${pod}";
  kubectl exec ${pod} -c ipp-engine -- bash ./preflight.sh;
done
```

## Use the notebook
Get the link to the notebook.
```shell
kubectl logs ipp-deployment-0 --container jupyter-notebook-server

# see the line that shows something like this
# http://127.0.0.1:8888/lab?token=1c9012cd239e13b2123028ae26436d2580a7d4fc1d561125
```

Set up local port forwarding to your service so requests from your browser are ultimately routed to your Jupyter service.
```shell
# you will need to do this locally (e.g., on your laptop), so you probably need to
# gcloud container clusters get-credentials ${CLUSTER} --region ${REGION} --project ${PROJECTID}
kubectl port-forward service/ipp 8888:8888

# Example notebook
# https://gist.github.com/nhira/ea4b93738aadb1111b2ee5868d56a22b
```
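
For a first cell in your own notebook, a minimal sketch might look like the following. It assumes the controller's connection files land under its profile directory (``/app/ipp`` in the deployment above), which is the same Filestore share mounted at ``/home/jovyan/nfs`` in the Jupyter container; the exact path and engine count are assumptions to verify on your cluster. The linked gist is the fuller example.
```python
# Sketch of a first notebook cell (path and engine count assume the setup above).
import ipyparallel as ipp

# Connection file written by ipcontroller into its profile dir on the shared NFS volume.
rc = ipp.Client(url_file="/home/jovyan/nfs/security/ipcontroller-client.json")

print(f"{len(rc.ids)} engines connected")  # expect 64 for a v5e-256 slice (4 chips per host)

dview = rc[:]       # DirectView over all engines
dview.activate()    # enables the %px / %%px magics

# In the next cell, check that every engine sees the full TPU slice:
# %%px
# import jax
# print(jax.process_index(), jax.local_device_count(), jax.device_count())
```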
