<!--
 Copyright 2024 Google LLC

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

 https://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 -->

# Advanced usage - Use a Jupyter notebook to interact with a Cloud TPU cluster

[Return to README](README.md#other-advanced-usage)

## Introduction
One of the challenges researchers face when working with contemporary models is the distributed programming required to orchestrate work across a complex architecture. This example shows you how to use XPK to create a Cloud TPU v5e-256 cluster and interact with it using a Jupyter notebook.

## Assumptions
You need to ensure you have the TPU capacity (quotas and limits) for this activity. You may need to change machine names and shapes to make this work.

To interact with the cluster, we use IPython Parallel and some [cell magic](https://ipyparallel.readthedocs.io/en/latest/tutorial/magics.html). IPython Parallel (ipyparallel) is a Python package and collection of CLI scripts for controlling clusters of IPython processes, built on the Jupyter protocol. While the default settings were adequate for this example, you should review the [ipyparallel security details](https://ipyparallel.readthedocs.io/en/latest/reference/security.html) before use in a production environment.
We do most of this work from a Cloud Shell instance. We will use some environment variables to make life easier.
```shell
export PROJECTID=${GOOGLE_CLOUD_PROJECT}
export CLUSTER= # your cluster name
export REGION= # region for cluster
export ZONE= # zone for cluster
```
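
For example (illustrative values only; pick a region and zone where you have v5e capacity, and make sure the zone belongs to the region):
```shell
# illustrative values - substitute your own
export CLUSTER=opm-jupyter
export REGION=us-west4
export ZONE=us-west4-a
```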

## Cluster creation
### Optional: high-MTU network
If you need to work with multiple TPU slices, it will be useful to create a high-MTU network as shown here (the remaining steps assume you do):
https://github.com/google/maxtext/tree/main/MaxText/configs#create-a-custom-mtu-network
```shell
gcloud compute networks create mtu9k --mtu=8896 \
--project=${PROJECTID} --subnet-mode=auto \
--bgp-routing-mode=regional

gcloud compute firewall-rules create mtu9kfw --network mtu9k \
--allow tcp,icmp,udp --project=${PROJECTID}
```
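
Optionally, confirm the network and firewall rule before moving on (a quick sanity check with standard gcloud commands):
```shell
# the MTU should report 8896
gcloud compute networks describe mtu9k --project=${PROJECTID} --format="value(mtu)"

# the firewall rule should allow tcp, icmp, and udp
gcloud compute firewall-rules describe mtu9kfw --project=${PROJECTID}
```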

### XPK create cluster
Install XPK. (You know, this repo!)
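
If you still need to install it, one way is to install the published package (a minimal sketch; this assumes the CLI is published on PyPI as ``xpk``; you can also install from a clone of this repository per the main README):
```shell
# install the XPK CLI into the current Python environment
pip install xpk

# confirm the CLI is available
xpk --help
```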

Create a GKE Cloud TPU cluster using XPK.
```shell
xpk cluster create --cluster ${CLUSTER} \
--project=${PROJECTID} --default-pool-cpu-machine-type=n2-standard-8 \
--num-slices=1 --tpu-type=v5litepod-256 --zone=${ZONE} \
--spot --custom-cluster-arguments="--network=mtu9k --subnetwork=mtu9k"

# if you need to delete this cluster to fix errors
xpk cluster delete --cluster ${CLUSTER} --zone=${ZONE}
```
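
Once creation finishes, you can sanity-check the node pools (a sketch; this assumes the cluster was created as a regional cluster in ``${REGION}``, which the later commands in this guide also assume):
```shell
# you should see the default CPU pool plus the TPU node pool(s)
gcloud container node-pools list --cluster=${CLUSTER} \
  --region=${REGION} --project=${PROJECTID}
```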

## Add storage
Enable the Filestore CSI driver plugin so we can use an NFS Filestore instance for shared storage. (This may take 20-30 minutes.)
```shell
gcloud container clusters update ${CLUSTER} \
--region ${REGION} --project ${PROJECTID} \
--update-addons=GcpFilestoreCsiDriver=ENABLED
```

### Filestore instance
Create a regional NFS [Filestore instance](https://cloud.google.com/filestore/docs/creating-instances#google-cloud-console) in ``${REGION}`` on the network created above.

Note the instance ID and file share name you’ve used. You will need to wait until this instance is available to continue.
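
You will also need the instance’s IP address for the ``volumeAttributes`` in the next step. A sketch for looking it up with gcloud, assuming you named the instance ``nfs-opm-ase`` as in the persistent volume manifest below (use the location you created it in):
```shell
# prints the IP address of the Filestore instance's first network interface
gcloud filestore instances describe nfs-opm-ase \
  --project=${PROJECTID} --location=${REGION} \
  --format="value(networks[0].ipAddresses[0])"
```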

### Persistent volumes
Once the Filestore instance is up, create a manifest with the correct names and storage size so you can create a persistent volume for the cluster. You will need to update the ``volumeHandle`` and ``volumeAttributes`` below to match your instance. You will also need to change the names to match.
```yaml
# persistent-volume.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: opmvol
spec:
  storageClassName: ""
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  volumeMode: Filesystem
  csi:
    driver: filestore.csi.storage.gke.io
    # format: "modeInstance/<location>/<instance id>/<file share name>"
    volumeHandle: "modeInstance/${ZONE}/nfs-opm-ase/nfs_opm_ase"
    volumeAttributes:
      ip: 10.243.23.194 # your Filestore instance IP
      volume: nfs_opm_ase # your file share name
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: opmvol-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: opmvol
  resources:
    requests:
      storage: 1Ti
```

Apply the change. Be sure to get the cluster credentials first if you haven’t already done that.
```shell
# get cluster credentials if needed
# gcloud container clusters get-credentials ${CLUSTER} --region ${REGION} --project ${PROJECTID}
# kubectl get nodes

# add the storage to the cluster
kubectl apply -f persistent-volume.yaml
```

If it worked, you should see the volume listed.
```shell
kubectl get pv
kubectl get pvc
```

## Build Docker image for IPP nodes
We will start with the MaxText image because we want to train an LLM.
```shell
# get the code
git clone "https://github.com/google/maxtext"
```

We’ll start with a JAX stable image for TPUs and then update the build specification to include ipyparallel. Edit the ``requirements_with_jax_stable_stack.txt`` file to add this at the bottom.
```shell
# also include IPyParallel
ipyparallel
```

Build the image and upload it so we can use the image to spin up pods. Note the resulting image name. It should be something like ``gcr.io/${PROJECTID}/opm_ipp_runner/tpu``.
```shell
# use docker build to build the image and upload it
# NOTE: you may need to change the upload repository
bash ./docker_maxtext_jax_stable_stack_image_upload.sh PROJECT_ID=${PROJECTID} \
  BASEIMAGE=us-docker.pkg.dev/${PROJECTID}/jax-stable-stack/tpu:jax0.4.30-rev1 \
  CLOUD_IMAGE_NAME=opm_ipp_runner IMAGE_TAG=latest \
  MAXTEXT_REQUIREMENTS_FILE=requirements_with_jax_stable_stack.txt

# confirm the image is available
# docker image list gcr.io/${PROJECTID}/opm_ipp_runner/tpu:latest
```

## Set up LWS
We use a [LeaderWorkerSet](https://github.com/kubernetes-sigs/lws) (LWS) for these IPP pods so they are managed collectively.
```shell
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/v0.3.0/manifests.yaml
```
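
You can confirm the LWS controller is up before moving on (a sketch; the v0.3.0 manifests install the controller into the ``lws-system`` namespace):
```shell
# wait for the controller deployment to become available
kubectl wait deploy/lws-controller-manager -n lws-system \
  --for=condition=Available --timeout=5m
kubectl get pods -n lws-system
```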

## Set up IPP deployment
Next we set up an LWS pod specification for our IPP instances. Create an ``ipp-deployment.yaml`` file.
You will need to update the volume mounts and the container image references. (You should also change the password.)
```yaml
# ipp-deployment.yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: ipp-deployment
  annotations:
    leaderworkerset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 65
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          app: ipp-controller
      spec:
        securityContext:
          runAsUser: 1000
          runAsGroup: 100
          fsGroup: 100
        nodeSelector:
          cloud.google.com/gke-tpu-topology: 16x16
          cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
        tolerations:
          - key: "google.com/tpu"
            operator: "Exists"
            effect: "NoSchedule"
        containers:
          - name: jupyter-notebook-server
            image: jupyter/base-notebook:latest
            args: ["start-notebook.sh", "--NotebookApp.allow_origin='https://colab.research.google.com'", "--NotebookApp.port_retries=0"]
            resources:
              limits:
                cpu: 1000m
                memory: 1Gi
              requests:
                cpu: 100m
                memory: 500Mi
            ports:
              - containerPort: 8888
                name: http-web-svc
            volumeMounts:
              - name: opmvol
                mountPath: /home/jovyan/nfs # jovyan is the default user
          - name: ipp-controller
            image: gcr.io/${PROJECTID}/opm_ipp_runner/tpu
            command:
              - bash
              - -c
              - |
                ip=$(hostname -I | awk '{print $1}')
                echo $ip
                ipcontroller --ip="$ip" --profile-dir=/app/ipp --log-level=ERROR --ping 10000
            volumeMounts:
              - name: opmvol
                mountPath: /app/ipp
        volumes:
          - name: opmvol
            persistentVolumeClaim:
              claimName: opmvol-claim

    workerTemplate:
      spec:
        nodeSelector:
          cloud.google.com/gke-tpu-topology: 16x16
          cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
        containers:
          - name: ipp-engine
            image: gcr.io/${PROJECTID}/opm_ipp_runner/tpu
            ports:
              - containerPort: 8471 # default port over which TPU VMs communicate
            securityContext:
              privileged: true
            command:
              - bash
              - -c
              - |
                sleep 20
                ipengine --file="/app/ipp/security/ipcontroller-engine.json" --timeout 5.0
            resources:
              requests:
                google.com/tpu: 4
              limits:
                google.com/tpu: 4
            volumeMounts:
              - name: opmvol
                mountPath: /app/ipp
        volumes:
          - name: opmvol
            persistentVolumeClaim:
              claimName: opmvol-claim
```

Add the resource to the GKE cluster.
```shell
kubectl apply -f ipp-deployment.yaml

# to view pod status as they come up
# kubectl get pods
```
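
While the pods come up, you can watch progress and spot-check that the engines reach the controller (a sketch; pod names follow the leader/worker pattern, e.g. ``ipp-deployment-0`` for the leader and ``ipp-deployment-0-1`` for a worker):
```shell
# all 65 pods (1 leader + 64 workers) should eventually reach Running
kubectl get pods | grep ipp-deployment

# spot-check one engine; its log should show it connecting to the controller
kubectl logs ipp-deployment-0-1 -c ipp-engine --tail=20
```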

Add a service to expose it.

Create ``ipp-service.yaml``.
```yaml
# ipp-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ipp
spec:
  selector:
    app: ipp-controller
  ports:
    - protocol: TCP
      port: 8888
      targetPort: 8888
  type: ClusterIP # or LoadBalancer if you want an external IP
```

Deploy the new service.
```shell
kubectl apply -f ipp-service.yaml
```
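
To confirm the service is wired to the leader pod (the ``app: ipp-controller`` label is only on the leader template), check the service and its endpoints:
```shell
kubectl get service ipp

# should show a single endpoint: the leader pod's IP on port 8888
kubectl get endpoints ipp
```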

If the pods don’t come up as a multihost cluster, you may need to correct the number of hosts to match the number of chips; for example, a v5e-256 should have an LWS size of 65 (64 ipp-engine workers plus 1 leader running the ipp-controller). If you need to look at a single container in isolation, you can use something like this.
```shell
# you should NOT have to do this
# kubectl exec ipp-deployment-0-2 -c ipp-engine -- python3 -c "import jax; print(jax.device_count())"
```

To correct errors, you can re-apply an updated template and re-create the leader pod.
```shell
# to fetch an updated docker image without changing anything else
# kubectl delete pod ipp-deployment-0

# to update the resource definition (automatically re-creates pods)
# kubectl apply -f ipp-deployment.yaml

# to update the resource definition after an immutable change, you will likely need to use the Console
# (i.e., delete the lws-controller-manager, ipp, and ipp-deployment workloads)
# and then you'll also need to delete the resources
# kubectl delete leaderworkerset/ipp-deployment
# kubectl delete service/ipp
```

## Optional: optimize networking
If you created a high-MTU network, you should use the MaxText [preflight.sh](https://github.com/google/maxtext/blob/main/preflight.sh) script (which invokes another script) to tune the network settings for the pods before using them with the notebook (the MaxText reference training scripts do this automatically).
```shell
for pod in $(kubectl get pods --no-headers --output jsonpath="{range .items[*]}{.metadata.name}{'\n'}{end}" | grep ipp-deployment-0-); \
do \
  echo "${pod}";
  kubectl exec ${pod} -c ipp-engine -- bash ./preflight.sh;
done
```

## Use the notebook
Get the link to the notebook from the Jupyter server logs.
```shell
kubectl logs ipp-deployment-0 --container jupyter-notebook-server

# look for the line that shows something like this
# http://127.0.0.1:8888/lab?token=1c9012cd239e13b2123028ae26436d2580a7d4fc1d561125
```

Set up local port forwarding to your service so requests from your browser are ultimately routed to your Jupyter service.
```shell
# you will need to do this locally (e.g., on your laptop), so you probably need to
# gcloud container clusters get-credentials ${CLUSTER} --region ${REGION} --project ${PROJECTID}
kubectl port-forward service/ipp 8888:8888

# Example notebook
# https://gist.github.com/nhira/ea4b93738aadb1111b2ee5868d56a22b
```