Commit bf7c3e1

Adding GKE TPU DWS Queued Provisioning support for v6e and 7x (#5218)
1 parent 4a68a1f commit bf7c3e1

File tree

18 files changed: +1690 -14 lines changed

examples/README.md

Lines changed: 11 additions & 3 deletions
```diff
@@ -1435,9 +1435,17 @@ This blueprint takes care of the initial infrastructure setup (e.g., network cre
 
 ### [gke-consumption-options] ![core-badge]
 
-This folder holds multiple GKE blueprint examples that display different consumption options on GKE.
-* [DWS Flex Start](../examples/gke-consumption-options/dws-flex-start)
-* [DWS Flex Start with Queued Provisioning](../examples/gke-consumption-options/dws-flex-start-queued-provisioning)
+This folder holds multiple GKE blueprint examples that demonstrate different consumption options on GKE, covering hardware such as A3 Ultra (A3U), TPU v6e, and TPU 7x.
+
+* [**DWS Flex Start**](../examples/gke-consumption-options/dws-flex-start/README.md)
+  * [A3 Ultra](../examples/gke-consumption-options/dws-flex-start/gke-a3-ultragpu.yaml)
+  * [TPU 7x](../examples/gke-consumption-options/dws-flex-start/gke-tpu-7x)
+  * [TPU v6e](../examples/gke-consumption-options/dws-flex-start/gke-tpu-v6e)
+
+* [**DWS Flex Start with Queued Provisioning**](../examples/gke-consumption-options/dws-flex-start-queued-provisioning/README.md)
+  * [A3 Ultra](../examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-a3-ultragpu.yaml)
+  * [TPU 7x](../examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-7x)
+  * [TPU v6e](../examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-v6e)
 
 [gke-consumption-options]: ../examples/gke-consumption-options
```

examples/gke-consumption-options/dws-flex-start-queued-provisioning/README.md

Lines changed: 18 additions & 8 deletions
````diff
@@ -5,12 +5,12 @@
 Note the `enable_flex_start` and `enable_queued_provisioning` variables in the yaml files.
 
 ## Create a cluster
+
 These steps guide you through the cluster creation process.
 
 Note: If you create multiple clusters using these same cluster blueprints, ensure that all VPCs and subnet names are unique per project to prevent errors.
 
 1. Launch [Cloud Shell](https://cloud.google.com/shell/docs/launching-cloud-shell). You can use a different environment; however, we recommend Cloud Shell because the dependencies are already pre-installed for Cluster Toolkit. If you don't want to use Cloud Shell, follow the [instructions to install dependencies](https://cloud.google.com/cluster-toolkit/docs/setup/install-dependencies) to prepare a different environment.
-
 1. Clone the Cluster Toolkit from the git repository:
 
    ```sh
@@ -24,7 +24,7 @@ Note: If you create multiple clusters using these same cluster blueprints, ensur
    cd cluster-toolkit && git checkout main && make
    ```
 
-1. Create a {{storage_name}} bucket to store the state of the Terraform deployment:
+1. Create a Cloud Storage bucket to store the state of the Terraform deployment:
 
    ```sh
    gcloud storage buckets create gs://BUCKET_NAME \
@@ -35,10 +35,11 @@ Note: If you create multiple clusters using these same cluster blueprints, ensur
    gcloud storage buckets update gs://BUCKET_NAME --versioning
    ```
 
-   Replace the following variables:\
-   BUCKET_NAME: the name of the new Cloud Storage bucket.\
-   PROJECT_ID: ID of the project where the bucket is being created.\
-   COMPUTE_REGION: the compute region where you want to store the state of the Terraform deployment.
+   Replace the following variables:
+
+   * BUCKET_NAME: the name of the new Cloud Storage bucket.
+   * PROJECT_ID: ID of the project where the bucket is being created.
+   * COMPUTE_REGION: the compute region where you want to store the state of the Terraform deployment.
 
 1. In the examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-a3-ultragpu-deployment.yaml file, fill in the following settings in the terraform_backend_defaults and vars sections to match the specific values for your deployment:
 
@@ -71,6 +72,7 @@ Note: If you create multiple clusters using these same cluster blueprints, ensur
    ```
 
 1. When prompted, select (A)pply to deploy the blueprint.
+
    * The blueprint creates VPC networks, a GPU RDMA VPC network, service accounts, a cluster, and a nodepool.
 
 ## Note
@@ -79,6 +81,7 @@ Note: If you create multiple clusters using these same cluster blueprints, ensur
 * To use DWS Flex Start, `auto_repair` should be set to `false`.
 
 Along with these flex start requirements, there are a few queue-provisioning specific requirements.
+
 * Queued provisioning does not work with `static_node_count` and requires `autoscaling_total_min_nodes` be set to `0`.
 
 ## Run a job
@@ -100,7 +103,7 @@ The dws-flex-start-queued-provisioning example provides a `sample-job.yaml` file
    ```
 
 1. Consider using `kubectl get jobs` and `kubectl describe job <job-name>` to get information about the jobs.\
-You can also use `kubectl get pods` and `kubectl describe pod <pod-name>` to get pod information.
+   You can also use `kubectl get pods` and `kubectl describe pod <pod-name>` to get pod information.
 
 ## Deploy and run NCCL test
 
@@ -122,7 +125,7 @@ To validate the functionality of the provisioned cluster, you can run a NCCL tes
 
 Note that the `nccl-jobset-example.yaml` file has this config under jobset metadata. These are required for using queued provisioning.
 
-   ```sh
+   ```yaml
    labels:
      kueue.x-k8s.io/queue-name: dws-local-queue
    annotations:
@@ -197,3 +200,10 @@ To validate the functionality of the provisioned cluster, you can run a NCCL tes
 # Out of bounds values : 0 OK
 # Avg bus bandwidth    : 120.248
 ```
+
+## Hardware-Specific Guides
+
+For detailed deployment instructions, topology requirements, and job examples, please refer to the guide for your specific hardware:
+
+* [TPU v6e (Trillium)](gke-tpu-v6e/README.md): Optimized for `ct6e-standard-4t` clusters.
+* [TPU 7x](gke-tpu-7x/README.md): Optimized for `tpu7x-standard-4t` clusters.
````
examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-7x/README.md

Lines changed: 257 additions & 0 deletions
# TPU 7x DWS Queued Provisioning

This example demonstrates how to deploy a GKE cluster with **TPU 7x** nodes using **Dynamic Workload Scheduler (DWS)** with **Queued Provisioning**.

## Overview

This configuration sets up:

* A GKE cluster with a dedicated TPU 7x node pool (`tpu7x-standard-4t`).
* **Flex Start (Dynamic Scaling)**: The node pool scales from 0 to N nodes based on demand.
* **Queued Provisioning**: Jobs are queued until the entire requested capacity is available, ensuring "all-or-nothing" scheduling.
* **Kueue Orchestration**: Manages the job queue and provisioning requests.
## Create a cluster

These steps guide you through the cluster creation process.

Note: If you create multiple clusters using these same cluster blueprints, ensure that all VPCs and subnet names are unique per project to prevent errors.

1. Launch [Cloud Shell](https://cloud.google.com/shell/docs/launching-cloud-shell). You can use a different environment; however, we recommend Cloud Shell because the dependencies are already pre-installed for Cluster Toolkit. If you don't want to use Cloud Shell, follow the [instructions to install dependencies](https://cloud.google.com/cluster-toolkit/docs/setup/install-dependencies) to prepare a different environment.
1. Clone the Cluster Toolkit from the git repository:

   ```sh
   cd ~
   git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
   ```

1. Install the Cluster Toolkit:

   ```sh
   cd cluster-toolkit && git checkout main && make
   ```

1. Create a Cloud Storage bucket to store the state of the Terraform deployment:

   ```sh
   gcloud storage buckets create gs://BUCKET_NAME \
       --project=PROJECT_ID \
       --default-storage-class=STANDARD \
       --location=COMPUTE_REGION \
       --uniform-bucket-level-access
   gcloud storage buckets update gs://BUCKET_NAME --versioning
   ```

   Replace the following variables:

   * BUCKET_NAME: the name of the new Cloud Storage bucket.
   * PROJECT_ID: ID of the project where the bucket is being created.
   * COMPUTE_REGION: the compute region where you want to store the state of the Terraform deployment.

1. In the `examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-7x/gke-tpu-7x-deployment.yaml` file, fill in the following settings in the `terraform_backend_defaults` and `vars` sections to match the specific values for your deployment:

   * `bucket`: the name of the Cloud Storage bucket you created in the previous step.
   * `project_id`: your Google Cloud project ID.
   * `deployment_name`: a unique name for this deployment.
   * `region`: the compute region for the cluster.
   * `zone`: the compute zone for the node pool.
   * `authorized_cidr`: the IP address range that you want to allow to connect to the cluster (e.g., `0.0.0.0/0`).
   * **`tpu_topology`**: Defaults to `2x2x2` (8 chips).
   * **`autoscaling_max_node_count`**: **Must match your topology.** For a `2x2x2` (8-chip) topology using 4-chip nodes, this must be set to `2` (8 / 4 = 2).
   * **`autoscaling_min_node_count`**: Must be `0`.
   * **`enable_flex_start` & `enable_queued_provisioning`**: Must be `true`.
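As a sketch, the filled-in sections might look like the following. All values here are placeholders, and the exact structure is defined by this example's own `gke-tpu-7x-deployment.yaml`; the point is how the settings above fit together:

```yaml
# Placeholder values only; substitute your own bucket, project, region, and zone.
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  project_id: PROJECT_ID
  deployment_name: gke-tpu-7x-dws
  region: COMPUTE_REGION
  zone: COMPUTE_ZONE
  authorized_cidr: 0.0.0.0/0
  tpu_topology: 2x2x2              # 8 chips total
  autoscaling_min_node_count: 0    # required for queued provisioning
  autoscaling_max_node_count: 2    # 8 chips / 4 chips per tpu7x-standard-4t node
  enable_flex_start: true
  enable_queued_provisioning: true
```

Note how `autoscaling_max_node_count` is derived: total chips in the topology divided by chips per node (here 8 / 4 = 2).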
1. Generate [Application Default Credentials (ADC)](https://cloud.google.com/docs/authentication/provide-credentials-adc#google-idp) to provide access to Terraform.

   ```sh
   gcloud auth application-default login
   ```

1. Deploy the blueprint to provision the GKE infrastructure using TPU 7x machine types:

   ```sh
   cd ~/cluster-toolkit
   ./gcluster deploy -d \
       examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-7x/gke-tpu-7x-deployment.yaml \
       examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-7x/gke-tpu-7x.yaml
   ```

1. When prompted, select (A)pply to deploy the blueprint.

   * The blueprint creates VPC networks, Cloud Storage buckets, service accounts, a GKE cluster with a TPU node pool, Kueue, and JobSet.

1. Get credentials for the cluster:

   ```bash
   gcloud container clusters get-credentials <cluster-name> --region <region> --project <project-id>
   ```

## Running Jobs

Two sample JobSets are provided:

* `tpu-7x-test-job.yaml`: A simple JobSet that echoes a message and sleeps. Best for initial cluster verification.
* `tpu-7x-test-job-gcs.yaml`: A JobSet that performs an **FIO benchmark** against your provisioned GCS buckets (training/checkpointing).

### Option 1: Simple Test

#### Submit Job

```bash
kubectl apply -f examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-7x/tpu-7x-test-job.yaml
```

### Option 2: GCS Storage Benchmark (FIO)

#### Find your PVC Names

The toolkit creates dynamic names for your GCS buckets. Find them with:

```bash
kubectl get pvc
```

#### Update Manifest

Edit `tpu-7x-test-job-gcs.yaml` and replace the `claimName` placeholders with your actual PVC names.
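For reference, the stanza to edit looks roughly like this. This is a sketch only: the volume names and surrounding JobSet structure come from the manifest in this directory, and the claim names below are placeholders to replace with output from `kubectl get pvc`:

```yaml
# Sketch of the volume stanza in tpu-7x-test-job-gcs.yaml (names hypothetical).
# Replace each claimName with a PVC name reported by `kubectl get pvc`.
volumes:
  - name: training-data
    persistentVolumeClaim:
      claimName: <training-pvc-name>     # placeholder
  - name: checkpoint-data
    persistentVolumeClaim:
      claimName: <checkpoint-pvc-name>   # placeholder
```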
#### Submit Job

```bash
kubectl apply -f examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-7x/tpu-7x-test-job-gcs.yaml
```

## Monitor Provisioning

Check the status of the DWS request:

```bash
kubectl get provisioningrequests -w
```

* `ACCEPTED`: Request is queued.
* `PROVISIONED`: Resources are allocated and nodes are being created.

Once the nodes are ready, verify execution; the pods will start:

```bash
kubectl get pods -w
```

## Verifying Scale-Up and Scale-Down

To ensure the cluster is behaving correctly, you can monitor the following events:

### 1. Monitor Scale-Up

When the job is submitted, the cluster will scale from 0 nodes to the required count (e.g., 2 nodes).

* Watch nodes: `kubectl get nodes -w`
* Check autoscaler status:

  ```bash
  kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
  ```

  Look for `scaleUp: status: NoActivity` transitioning to activity and `ready` node counts increasing.

### 2. Verify Job Success

A successful DWS run means the job started *after* the full slice was provisioned and completed its work.

* Check pod status: `kubectl get pods` should show `STATUS: Completed`.
* Check logs: `kubectl logs -l job-name=tpu-7x-qp-test` should show the "Job complete!" message.

### 3. Monitor Scale-Down

After the job completes, the Cluster Autoscaler will wait for a short period (typically 10 minutes) before deleting the unneeded TPU nodes.

* Observe node deletion: `kubectl get nodes -w` will eventually show nodes being removed.
* Confirm zero state: `kubectl get nodes` should eventually return to showing only your system nodes.

## Custom Jobs Requirements

If you want to submit your own custom job, ensure the following fields are included in your manifest:

### 1. Metadata (Kueue & DWS)

Required for the job to be admitted to the queue and recognized by DWS.
Note: The `queue-name` must match the `LocalQueue` created by the toolkit (default: `dws-local-queue`).

```yaml
metadata:
  labels:
    kueue.x-k8s.io/queue-name: dws-local-queue
  annotations:
    provreq.kueue.x-k8s.io/maxRunDurationSeconds: "3600" # Specify duration in seconds
```
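The annotation takes a plain count of seconds, so convert whatever run budget you have in mind. A quick sanity check for a hypothetical two-hour budget:

```shell
# Convert a hypothetical 2-hour run budget to maxRunDurationSeconds.
hours=2
echo $((hours * 3600))   # prints 7200
```

Use the resulting number as the quoted annotation value.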
### 2. Node Selectors & Affinity

Ensures the job lands on the specific provisioned TPU nodes:

```yaml
nodeSelector:
  cloud.google.com/gke-tpu-topology: "2x2x2"
  cloud.google.com/gke-queued: "true"
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: cloud.google.com/gke-nodepool
              operator: In
              values: ["gke-tpu-7x-pool"]
```

### 3. Tolerations (Mandatory)

Required to allow pods to land on tainted TPU nodes:

```yaml
tolerations:
  - key: "google.com/tpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
  - key: "cloud.google.com/gke-queued"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```

## Validation

### 1. Simple Test Validation

If you ran `tpu-7x-test-job.yaml`, check logs for the success message:

```bash
kubectl logs -l jobset.sigs.k8s.io/jobset-name=tpu-7x-qp-test -c tpu-job
```

Expected output:

```text
Starting TPU 7x Test Job...
Job complete!
```

### 2. GCS Storage Benchmark Validation

If you ran `tpu-7x-test-job-gcs.yaml`, you can verify the benchmark results and storage health:

1. **Verify Completion**: Look for the final success message in the logs:

   ```bash
   kubectl logs -l jobset.sigs.k8s.io/jobset-name=tpu-7x-qp-fio -c tpu-job | grep "FIO benchmark complete!"
   ```

1. **View Performance Metrics**: To see the actual read/write throughput for your GCS buckets:

   ```bash
   kubectl logs -l jobset.sigs.k8s.io/jobset-name=tpu-7x-qp-fio -c tpu-job
   ```

   In the output, look for the `Run status group` sections. For example:

   * **Read Performance**: Look for `READ: bw=...` (e.g., `bw=5554MiB/s`).
   * **Write Performance**: Look for `WRITE: bw=...`.
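To pull just those summary lines out of captured log output, a simple filter works. This is a sketch: the sample text below only mimics the shape of FIO's `Run status group` summary, and in practice you would pipe `kubectl logs ...` into the `grep` instead:

```shell
# Filter FIO summary bandwidth lines from captured job output.
# The $log sample stands in for real `kubectl logs ...` output.
log='Run status group 0 (all jobs):
   READ: bw=5554MiB/s (5824MB/s)
Run status group 1 (all jobs):
  WRITE: bw=2100MiB/s (2202MB/s)'
printf '%s\n' "$log" | grep -E '(READ|WRITE): bw='
```

This prints only the `READ: bw=...` and `WRITE: bw=...` lines, which is usually all you need for a quick throughput comparison.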
> [!TIP]
> If the job is still running, you can follow the logs in real time by adding the `-f` flag to the `kubectl logs` command.
