# TPU 7x DWS Queued Provisioning

This example demonstrates how to deploy a GKE cluster with **TPU 7x** nodes using **Dynamic Workload Scheduler (DWS)** with **Queued Provisioning**.

## Overview

This configuration sets up:

* A GKE cluster with a dedicated TPU 7x node pool (machine type `tpu7x-standard-4t`).
* **Flex Start (Dynamic Scaling)**: The node pool scales from 0 to N nodes based on demand.
* **Queued Provisioning**: Jobs are queued until the entire requested capacity is available, ensuring "all-or-nothing" scheduling.
* **Kueue Orchestration**: Manages the job queue and provisioning requests (a sketch of the wiring follows this list).
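
The blueprint installs Kueue and creates the queueing objects for you. As a simplified sketch of how Kueue is wired to DWS queued provisioning (the object names here are illustrative, not necessarily the blueprint's exact ones), a `ProvisioningRequestConfig` and an `AdmissionCheck` connect the two:

```yaml
# Tells Kueue to create GKE queued-provisioning ProvisioningRequests.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: dws-prov-config
spec:
  provisioningClassName: queued-provisioning.gke.io
---
# Gates admission until the ProvisioningRequest is satisfied,
# giving the "all-or-nothing" behavior described above.
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: dws-prov
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: dws-prov-config
```

A `ClusterQueue` then lists `dws-prov` under `admissionChecks`, and the `LocalQueue` that jobs target (`dws-local-queue` by default) points at that `ClusterQueue`.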

## Create a cluster

These steps guide you through the cluster creation process.

Note: If you create multiple clusters from these same cluster blueprints, ensure that all VPC and subnet names are unique within the project to prevent errors.

1. Launch [Cloud Shell](https://cloud.google.com/shell/docs/launching-cloud-shell). You can use a different environment; however, we recommend Cloud Shell because the dependencies for Cluster Toolkit are pre-installed. If you don't want to use Cloud Shell, follow the [instructions to install dependencies](https://cloud.google.com/cluster-toolkit/docs/setup/install-dependencies) to prepare a different environment.
1. Clone the Cluster Toolkit from the git repository:

    ```sh
    cd ~
    git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
    ```

1. Install the Cluster Toolkit:

    ```sh
    cd cluster-toolkit && git checkout main && make
    ```

1. Create a Cloud Storage bucket to store the state of the Terraform deployment:

    ```sh
    gcloud storage buckets create gs://BUCKET_NAME \
        --project=PROJECT_ID \
        --default-storage-class=STANDARD \
        --location=COMPUTE_REGION \
        --uniform-bucket-level-access
    gcloud storage buckets update gs://BUCKET_NAME --versioning
    ```

    Replace the following variables:

    * BUCKET_NAME: the name of the new Cloud Storage bucket.
    * PROJECT_ID: the ID of the project where the bucket is created.
    * COMPUTE_REGION: the compute region where you want to store the state of the Terraform deployment.

1. In the `examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-7x/gke-tpu-7x-deployment.yaml` file, fill in the following settings in the `terraform_backend_defaults` and `vars` sections to match the specific values for your deployment (an illustrative example follows this list):

    * `bucket`: the name of the Cloud Storage bucket you created in the previous step.
    * `project_id`: your Google Cloud project ID.
    * `deployment_name`: a unique name for this deployment.
    * `region`: the compute region for the cluster.
    * `zone`: the compute zone for the node pool.
    * `authorized_cidr`: the IP address range that you want to allow to connect to the cluster (e.g., `0.0.0.0/0`).
    * **`tpu_topology`**: defaults to `2x2x2` (8 chips).
    * **`autoscaling_max_node_count`**: **must match your topology.** For a `2x2x2` (8-chip) topology using 4-chip nodes, this must be set to `2` (8 / 4 = 2).
    * **`autoscaling_min_node_count`**: must be `0`.
    * **`enable_flex_start` & `enable_queued_provisioning`**: must be `true`.
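
    A sketch with illustrative values (the variable names mirror the list above; the backend block follows the toolkit's standard `terraform_backend_defaults` format, and the deployment name, region, and zone shown are placeholders):

    ```yaml
    terraform_backend_defaults:
      type: gcs
      configuration:
        bucket: BUCKET_NAME

    vars:
      project_id: PROJECT_ID
      deployment_name: gke-tpu-7x-dws  # illustrative deployment name
      region: us-central1              # illustrative region
      zone: us-central1-a              # illustrative zone
      authorized_cidr: 0.0.0.0/0
      tpu_topology: 2x2x2
      autoscaling_min_node_count: 0
      autoscaling_max_node_count: 2    # 8 chips / 4 chips per node = 2 nodes
      enable_flex_start: true
      enable_queued_provisioning: true
    ```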

1. Generate [Application Default Credentials (ADC)](https://cloud.google.com/docs/authentication/provide-credentials-adc#google-idp) to provide Terraform with access to your project:

    ```sh
    gcloud auth application-default login
    ```

1. Deploy the blueprint to provision the GKE infrastructure using TPU 7x machine types:

    ```sh
    cd ~/cluster-toolkit
    ./gcluster deploy -d \
        examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-7x/gke-tpu-7x-deployment.yaml \
        examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-7x/gke-tpu-7x.yaml
    ```

1. When prompted, select (A)pply to deploy the blueprint.
    * The blueprint creates VPC networks, Cloud Storage buckets, service accounts, a GKE cluster with a TPU node pool, Kueue, and JobSet.

1. Get cluster credentials:

    ```bash
    gcloud container clusters get-credentials CLUSTER_NAME --region COMPUTE_REGION --project PROJECT_ID
    ```
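
Optionally, verify that the Kueue objects created by the blueprint are in place before submitting jobs (the queue names assume the blueprint defaults, e.g. `dws-local-queue`):

```bash
kubectl get clusterqueues,localqueues,admissionchecks
```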

## Running Jobs

Two sample JobSets are provided:

* `tpu-7x-test-job.yaml`: A simple JobSet that echoes a message and sleeps. Best for initial cluster verification.
* `tpu-7x-test-job-gcs.yaml`: A JobSet that runs an **FIO benchmark** against the GCS buckets provisioned by the blueprint (training and checkpointing).

### Option 1: Simple Test

#### Submit Job

```bash
kubectl apply -f examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-7x/tpu-7x-test-job.yaml
```
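
After you apply the manifest, Kueue keeps the JobSet suspended until DWS provisions the full slice. Assuming the blueprint's default Kueue setup, you can confirm the workload is queued:

```bash
# The JobSet exists but stays suspended until capacity is provisioned.
kubectl get jobsets
# Kueue's view of the workload and its admission status.
kubectl get workloads
```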

### Option 2: GCS Storage Benchmark (FIO)

#### Find your PVC Names

The toolkit generates dynamic names for the GCS bucket PVCs. Find them with:

```bash
kubectl get pvc
```
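
To print just the names for easy copying:

```bash
# Prints each PVC as persistentvolumeclaim/<name>.
kubectl get pvc -o name
```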

#### Update Manifest

Edit `tpu-7x-test-job-gcs.yaml` and replace the `claimName` placeholders with your actual PVC names.
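
The relevant part of the manifest is the pod's `volumes` section. A sketch, assuming the manifest mounts the two GCS buckets mentioned above (the volume names and placeholder claim names here are illustrative):

```yaml
volumes:
- name: gcs-training
  persistentVolumeClaim:
    claimName: <your-training-pvc-name>    # replace with a name from `kubectl get pvc`
- name: gcs-checkpointing
  persistentVolumeClaim:
    claimName: <your-checkpoint-pvc-name>  # replace with a name from `kubectl get pvc`
```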

#### Submit Job

```bash
kubectl apply -f examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-7x/tpu-7x-test-job-gcs.yaml
```

## Monitor Provisioning

Check the status of the DWS request:

```bash
kubectl get provisioningrequests -w
```

* `ACCEPTED`: The request is queued.
* `PROVISIONED`: Resources are allocated and the nodes are being created.
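
If a request stays queued longer than expected, the condition messages usually explain why; inspect them with:

```bash
kubectl describe provisioningrequests
```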

Once the request is `PROVISIONED` and the nodes are ready, the pods start. Watch them with:

```bash
kubectl get pods -w
```

## Verifying Scale-Up and Scale-Down

To ensure the cluster is behaving correctly, you can monitor the following events:

### 1. Monitor Scale-Up

When the job is submitted, the cluster scales from 0 nodes to the required count (e.g., 2 nodes).

* Watch Nodes: `kubectl get nodes -w`
* Check Autoscaler Status:

  ```bash
  kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
  ```

  Look for the `scaleUp` status transitioning from `NoActivity` to `InProgress`, and the `ready` node counts increasing.
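
  The autoscaler writes its human-readable report under the ConfigMap's `status` data key, so you can print just the report with:

  ```bash
  kubectl get configmap cluster-autoscaler-status -n kube-system \
    -o jsonpath='{.data.status}'
  ```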

### 2. Verify Job Success

A successful DWS run means the job started *after* the full slice was provisioned and then completed its work.

* Check Pod Status: `kubectl get pods` should show `STATUS: Completed`.
* Check Logs: `kubectl logs -l jobset.sigs.k8s.io/jobset-name=tpu-7x-qp-test` should show the "Job complete!" message.

### 3. Monitor Scale-Down

After the job completes, the Cluster Autoscaler waits for a short period (typically 10 minutes) before deleting the unneeded TPU nodes.

* Observe Node Deletion: `kubectl get nodes -w` will eventually show nodes being removed.
* Confirm Zero State: `kubectl get nodes` should eventually return to showing only your system nodes.

## Custom Job Requirements

If you want to submit your own custom job, ensure the following fields are included in your manifest:

### 1. Metadata (Kueue & DWS)

These are required for the job to be admitted to the queue and recognized by DWS.
Note: The `queue-name` must match the `LocalQueue` created by the toolkit (default: `dws-local-queue`).

```yaml
metadata:
  labels:
    kueue.x-k8s.io/queue-name: dws-local-queue
  annotations:
    provreq.kueue.x-k8s.io/maxRunDurationSeconds: "3600" # Specify the duration in seconds
```

### 2. Node Selectors & Affinity

These ensure the job lands on the provisioned TPU nodes:

```yaml
nodeSelector:
  cloud.google.com/gke-tpu-topology: "2x2x2"
  cloud.google.com/gke-queued: "true"
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cloud.google.com/gke-nodepool
          operator: In
          values: ["gke-tpu-7x-pool"]
```

### 3. Tolerations (Mandatory)

Required to allow pods to land on the tainted TPU nodes:

```yaml
tolerations:
- key: "google.com/tpu"
  operator: "Equal"
  value: "present"
  effect: "NoSchedule"
- key: "cloud.google.com/gke-queued"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
```
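
Putting the three requirements together, a minimal JobSet skeleton might look like the following (the name, image, command, replica counts, and per-node chip count are illustrative assumptions; swap in your own workload):

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: my-tpu-job  # illustrative name
  labels:
    kueue.x-k8s.io/queue-name: dws-local-queue
  annotations:
    provreq.kueue.x-k8s.io/maxRunDurationSeconds: "3600"
spec:
  replicatedJobs:
  - name: workers
    replicas: 1
    template:
      spec:
        parallelism: 2  # one pod per node; match your node count
        completions: 2
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            nodeSelector:
              cloud.google.com/gke-tpu-topology: "2x2x2"
              cloud.google.com/gke-queued: "true"
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                  - matchExpressions:
                    - key: cloud.google.com/gke-nodepool
                      operator: In
                      values: ["gke-tpu-7x-pool"]
            tolerations:
            - key: "google.com/tpu"
              operator: "Equal"
              value: "present"
              effect: "NoSchedule"
            - key: "cloud.google.com/gke-queued"
              operator: "Equal"
              value: "true"
              effect: "NoSchedule"
            containers:
            - name: tpu-job
              image: python:3.11  # illustrative image
              command: ["bash", "-c", "echo 'Job complete!'"]
              resources:
                limits:
                  google.com/tpu: 4  # chips per node, assuming tpu7x-standard-4t
```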

## Validation

### 1. Simple Test Validation

If you ran `tpu-7x-test-job.yaml`, check the logs for the success message:

```bash
kubectl logs -l jobset.sigs.k8s.io/jobset-name=tpu-7x-qp-test -c tpu-job
```

Expected output:

```text
Starting TPU 7x Test Job...
Job complete!
```

### 2. GCS Storage Benchmark Validation

If you ran `tpu-7x-test-job-gcs.yaml`, you can verify the benchmark results and storage health:

1. **Verify Completion**: Look for the final success message in the logs:

    ```bash
    kubectl logs -l jobset.sigs.k8s.io/jobset-name=tpu-7x-qp-fio -c tpu-job | grep "FIO benchmark complete!"
    ```

1. **View Performance Metrics**: To see the actual read/write throughput for your GCS buckets:

    ```bash
    kubectl logs -l jobset.sigs.k8s.io/jobset-name=tpu-7x-qp-fio -c tpu-job
    ```

    In the output, look for the `Run status group` sections (an illustrative excerpt follows this list). For example:
    * **Read Performance**: Look for `READ: bw=...` (e.g., `bw=5554MiB/s`).
    * **Write Performance**: Look for `WRITE: bw=...`.
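
    For reference, FIO's summary lines follow this shape (the numbers below are illustrative; only the read bandwidth matches the example above):

    ```text
    Run status group 0 (all jobs):
       READ: bw=5554MiB/s (5824MB/s), io=325GiB (349GB), run=60001-60001msec
      WRITE: bw=1931MiB/s (2025MB/s), io=113GiB (121GB), run=60001-60001msec
    ```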

> [!TIP]
> If the job is still running, you can follow the logs in real time by adding the `-f` flag to the `kubectl logs` command.