Commit ef212d0

Add a4x environment configuration doc
1 parent d30b262 commit ef212d0

2 files changed

+822
-0
lines changed

Lines changed: 333 additions & 0 deletions
# Configuring the environment for running benchmark recipes on a GKE Cluster with A4X Node Pools

This guide outlines the steps to configure the environment required to run benchmark recipes on a [Google Kubernetes Engine (GKE) cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/kubernetes-engine-overview) with [A4X](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) node pools.

## Prerequisites

Before you begin, ensure you have completed the following:

1. Create a Google Cloud project with billing enabled.

   a. To create a project, see [Creating and managing projects](https://cloud.google.com/resource-manager/docs/creating-managing-projects).
   b. To enable billing, see [Verify the billing status of your projects](https://cloud.google.com/billing/docs/how-to/verify-billing-enabled).

2. Enable the following APIs:

   - [Cloud Resource Manager API](https://console.cloud.google.com/apis/library/cloudresourcemanager.googleapis.com).
   - [Service Usage API](https://console.cloud.google.com/apis/library/serviceusage.googleapis.com).
   - [Google Kubernetes Engine API](https://console.cloud.google.com/flows/enableapi?apiid=container.googleapis.com).
   - [Cloud Storage API](https://console.cloud.google.com/flows/enableapi?apiid=storage.googleapis.com).
   - [Artifact Registry API](https://console.cloud.google.com/flows/enableapi?apiid=artifactregistry.googleapis.com).
   - [Cloud Monitoring API](https://console.cloud.google.com/flows/enableapi?apiid=monitoring.googleapis.com).
   - [Cloud Logging API](https://console.cloud.google.com/flows/enableapi?apiid=logging.googleapis.com).

3. Make sure that you have a [reservation](https://cloud.google.com/compute/docs/instances/reservations-overview) for the required number of `a4x-highgpu-4g` machines using the `DENSE` deployment type.

4. Ensure that you have been granted the following IAM roles:

   - Editor (`roles/editor`)
   - Project IAM Admin (`roles/resourcemanager.projectIamAdmin`)
   - Kubernetes Engine Admin (`roles/container.admin`)
   - Service Account Admin (`roles/iam.serviceAccountAdmin`)

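The APIs listed above can also be enabled from the command line instead of the console. The following is a minimal sketch, assuming the service names that appear in the console links above; `enable_required_apis` is a hypothetical helper name, not a gcloud command.

```shell
# Service names taken from the console links in the prerequisites list.
REQUIRED_APIS="cloudresourcemanager.googleapis.com
serviceusage.googleapis.com
container.googleapis.com
storage.googleapis.com
artifactregistry.googleapis.com
monitoring.googleapis.com
logging.googleapis.com"

# Hypothetical helper: enables all required APIs in the given project.
enable_required_apis() {
  project_id="$1"
  # Word splitting on $REQUIRED_APIS is intentional: one argument per API.
  # shellcheck disable=SC2086
  gcloud services enable $REQUIRED_APIS --project="$project_id"
}

# Usage: enable_required_apis PROJECT_ID
```

Enabling an API that is already enabled is harmless, so the helper can be rerun safely.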
## The environment

The environment comprises the following components:

- A client workstation: this is used to prepare, submit, and monitor ML workloads.
- An [Artifact Registry](https://cloud.google.com/artifact-registry/docs/overview): serves as a
  private container registry for storing and managing Docker images used in the deployment.
- A [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine/docs/concepts/kubernetes-engine-overview)
  cluster configured as follows:
  - [A GKE regional standard cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/configuration-overview), version v1.32.4-gke.1236000 or later.
  - A GPU node pool with the user-specified number of [a4x-highgpu-4g](https://cloud.google.com/compute/docs/gpus) machines provisioned using the DENSE deployment type.
  - [Workload Identity Federation for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity) enabled.
  - [Cloud Storage FUSE CSI driver for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) enabled.
  - [DCGM metrics](https://cloud.google.com/kubernetes-engine/docs/how-to/dcgm-metrics) enabled.
  - [Kueue](https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta1/) and [JobSet](https://jobset.sigs.k8s.io/docs/overview/) APIs installed.
  - Kueue configured to support [Topology Aware Scheduling](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/).
- A regional [Google Cloud Storage (GCS) bucket](https://cloud.google.com/storage/docs) for storing the test environment configuration and state, and the execution logs generated by recipes.
- A regional GCS bucket with [hierarchical namespace](https://cloud.google.com/storage/docs/hns-overview) enabled for managing training datasets.
- A regional GCS bucket with [hierarchical namespace](https://cloud.google.com/storage/docs/hns-overview) enabled for managing checkpoints.

## Set up the client workstation

You have two options: you can use either your own workstation (e.g., a local machine or Google Cloud VM) or [Google Cloud Shell](https://cloud.google.com/shell/docs).

### Set up Google Cloud Shell

[Google Cloud Shell](https://cloud.google.com/shell/docs) comes with all the necessary components pre-installed, so no additional configuration is needed.

**IMPORTANT**: Make sure that you have at least 2GB of disk space remaining in your home directory.

### Set up your own workstation

If you prefer to use your own workstation, ensure you have the following components installed:

1. Cluster Toolkit [dependencies](https://cloud.google.com/cluster-toolkit/docs/setup/install-dependencies).
2. kubectl with GKE authentication plugin. To install, see the
   [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl).
3. Helm. To install, see the [Helm documentation](https://helm.sh/docs/intro/quickstart/).

## Set your project ID

Launch your client workstation and set your project ID.

```
gcloud config set project PROJECT_ID
```

Replace the following:

- PROJECT_ID: your project ID.

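To confirm that the setting took effect before you create any resources, you can compare the active project with the one you expect. This is a minimal sketch; `check_active_project` is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper: succeeds only if the active gcloud project
# matches the expected project ID passed as the first argument.
check_active_project() {
  expected="$1"
  active="$(gcloud config get-value project 2>/dev/null)"
  if [ "$active" = "$expected" ]; then
    echo "Active project: $active"
  else
    echo "Active project is '$active', expected '$expected'" >&2
    return 1
  fi
}

# Usage: check_active_project PROJECT_ID
```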
## Set up a Google Cloud Storage bucket for environment state and logs

The bucket is used to manage the state of the [Cluster Toolkit blueprint](https://cloud.google.com/cluster-toolkit/docs/setup/cluster-blueprint) that you'll use to provision a GKE cluster. The bucket is also used by the recipes to manage execution logs.

To create the bucket, run the following command:

```bash
gcloud storage buckets create gs://BUCKET_NAME \
    --location=BUCKET_LOCATION \
    --no-public-access-prevention --uniform-bucket-level-access
```

Replace the following:

- `BUCKET_NAME`: the name of your bucket. The name must comply with the
  [Cloud Storage bucket naming conventions](https://cloud.google.com/storage/docs/buckets#naming).
- `BUCKET_LOCATION`: the location of your bucket. The bucket must be in the same region as your cluster.

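A quick local check can catch an invalid bucket name before the create command fails. The sketch below covers only a subset of the naming conventions (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit), so treat a pass as "plausible", not "guaranteed valid". `is_plausible_bucket_name` is a hypothetical helper name.

```shell
# Hypothetical helper: returns 0 if the name passes a simplified subset
# of the Cloud Storage bucket naming rules, non-zero otherwise.
# It does NOT check reserved prefixes or the rules for dotted names.
is_plausible_bucket_name() {
  printf '%s\n' "$1" \
    | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'
}

# Usage:
#   is_plausible_bucket_name BUCKET_NAME || echo "check the bucket name" >&2
```

Note that the check expects the bare name: a value that still carries the `gs://` prefix is rejected.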
## Configure access control to the bucket

A4X compute recipes access Google Cloud Storage buckets using the Kubernetes default ServiceAccount. You need to grant this account the rights to access the Google Cloud Storage bucket.

```
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
    --role=roles/storage.objectAdmin \
    --member=principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/default/sa/default \
    --condition=None

gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
    --role=roles/storage.legacyBucketReader \
    --member=principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/default/sa/default \
    --condition=None
```

Replace the following:

- BUCKET_NAME: the name of your bucket.
- PROJECT_ID: your Google Cloud project ID.
- PROJECT_NUMBER: your numerical Google Cloud project number.

You can retrieve the project number from the Cloud Console or by using the following command, which stores it in the `PROJECT_NUMBER` shell variable:

```
PROJECT_NUMBER=$(gcloud projects describe PROJECT_ID --format="value(projectNumber)")
```

Replace the following:

- PROJECT_ID: your Google Cloud project ID.

## Set up an Artifact Registry

- If you use Cloud KMS for repository encryption, create your artifact registry by using the
  [instructions here](https://cloud.google.com/artifact-registry/docs/repositories/create-repos#create-repo-gcloud-docker).
- If you don't use Cloud KMS, you can create your repository by using the following command:

```bash
gcloud artifacts repositories create REPOSITORY \
    --repository-format=docker \
    --location=LOCATION \
    --description="DESCRIPTION"
```

Replace the following:

- `REPOSITORY`: the name of the repository. For each repository location in a project,
  repository names must be unique.
- `LOCATION`: the regional or multi-regional location for the repository. You can omit this
  flag if you set a default region.
- `DESCRIPTION`: a description of the repository. Don't include sensitive data because
  repository descriptions are not encrypted.

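Images stored in a Docker-format repository are addressed as `LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG`. The sketch below assembles that path; `artifact_registry_image_path` is a hypothetical helper name, and `IMAGE` and `TAG` are illustrative placeholders that this guide does not otherwise define.

```shell
# Hypothetical helper: prints the full image path for a Docker-format
# Artifact Registry repository.
#   $1 = location, $2 = project ID, $3 = repository, $4 = image, $5 = tag
artifact_registry_image_path() {
  location="$1"; project_id="$2"; repository="$3"; image="$4"; tag="$5"
  printf '%s-docker.pkg.dev/%s/%s/%s:%s\n' \
    "$location" "$project_id" "$repository" "$image" "$tag"
}

# Usage:
#   artifact_registry_image_path LOCATION PROJECT_ID REPOSITORY IMAGE TAG
```

If you push images from your own workstation, running `gcloud auth configure-docker LOCATION-docker.pkg.dev` lets Docker authenticate to the registry host.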
## Create a GKE Cluster environment with A4X Node Pools

You'll use the [Cluster Toolkit](https://cloud.google.com/cluster-toolkit/docs/overview) to create your GKE cluster environment.
The Cluster Toolkit [blueprint](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/v1.72.0/examples/gke-a4x) used in this setup creates and configures the following components:

- VPC networks, subnets, routers, and firewall rules.
- A GKE cluster with the required features enabled.
- Service accounts with the required permissions.
- An A4X node pool with `a4x-highgpu-4g` nodes.
- [JobSet](https://jobset.sigs.k8s.io/docs/overview/) and [Kueue](https://kueue.sigs.k8s.io/docs/overview/) APIs.
- Cloud Storage buckets with hierarchical namespace enabled for training data and checkpoints.

The A4X compute recipes have been validated on a cluster created with Cluster Toolkit v1.72.0.

1. Configure Application Default Credentials:

   Before deploying the Cluster Toolkit blueprint, you need to configure Application Default Credentials (ADC).

   ```
   gcloud auth application-default login
   ```

   You will be prompted to open your web browser and authenticate to Google Cloud.

2. Clone the Cluster Toolkit from the GitHub repository:

   ```bash
   git clone --branch v1.72.0 --single-branch https://github.com/GoogleCloudPlatform/cluster-toolkit
   ```

3. Install the Cluster Toolkit:

   ```
   cd cluster-toolkit && make
   ```

4. Deploy the cluster:

   ```
   ./gcluster deploy \
       examples/gke-a4x/gke-a4x.yaml \
       --backend-config "bucket=BUCKET_NAME" \
       --vars "deployment_name=CLUSTER_NAME" \
       --vars "project_id=PROJECT_ID" \
       --vars "region=COMPUTE_REGION" \
       --vars "zone=COMPUTE_ZONE" \
       --vars "authorized_cidr=AUTHORIZED_CIDR" \
       --vars "extended_reservation=RESERVATION_NAME" \
       --vars "static_node_count=NODE_COUNT"
   ```

   Replace the following:

   - BUCKET_NAME: the name of the Cloud Storage bucket that you created earlier for environment state and logs. Don't use the `gs://` prefix in the name.
   - CLUSTER_NAME: the name for your cluster. Make sure that the name is shorter than 16 characters.
   - PROJECT_ID: the project ID of your project.
   - COMPUTE_REGION: the compute region for the cluster.
   - COMPUTE_ZONE: the compute zone for the node pool of A4X machines.
   - AUTHORIZED_CIDR: the IP address range that you want to allow to connect to the cluster. This CIDR block must include the IP address of the machine that runs the Cluster Toolkit. If you want to allow access from any IP address, use `0.0.0.0/0`.
   - NODE_COUNT: the number of A4X nodes to provision in your cluster.
   - RESERVATION_NAME: the name of your reservation. If you want to target a specific block within your reservation when creating a node pool, use the following format: `RESERVATION_NAME/reservationBlocks/BLOCK_NAME`. To get the names of the blocks that are available for your reservation, run the following command:

     ```
     gcloud beta compute reservations blocks list RESERVATION_NAME \
         --zone=COMPUTE_ZONE --format "value(name)"
     ```

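Because a deployment takes a while, it can be worth checking the cluster name length constraint mentioned above before running `gcluster deploy`. This is a minimal sketch; `check_cluster_name_length` is a hypothetical helper name, and it checks only the "shorter than 16 characters" rule stated in this guide.

```shell
# Hypothetical helper: succeeds if the cluster (deployment) name is
# shorter than 16 characters, as this guide requires.
check_cluster_name_length() {
  name="$1"
  if [ "${#name}" -lt 16 ]; then
    return 0
  fi
  echo "Cluster name '$name' has ${#name} characters; it must be shorter than 16." >&2
  return 1
}

# Usage: check_cluster_name_length CLUSTER_NAME && ./gcluster deploy ...
```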
## Verify cluster settings

After the Cluster Toolkit blueprint deployment completes, verify the key configurations.

1. Get cluster credentials:

   ```
   gcloud container clusters get-credentials CLUSTER_NAME \
       --location COMPUTE_REGION
   ```

   Replace the following:

   - CLUSTER_NAME: the name of your cluster.
   - COMPUTE_REGION: the region of your cluster.

2. List the Kueue local queues:

   ```
   kubectl get queues
   ```

   You should see output similar to the following:

   ```
   NAME   CLUSTERQUEUE   PENDING WORKLOADS   ADMITTED WORKLOADS
   a4x    a4x            0                   0
   ```

   The blueprint configures Kueue using `a4x` as the default name for both the local queue and the cluster queue.

3. Make sure that all A4X nodes are in the `Ready` state:

   ```
   kubectl get nodes
   ```

   You should see output similar to the following:

   ```
   NAME                                                  STATUS   ROLES    AGE   VERSION
   gke-imo-glacier-peak-a4x-highgpu-4g-a-76a6f770-0phl   Ready    <none>   11d   v1.32.9-gke.1130000
   gke-imo-glacier-peak-a4x-highgpu-4g-a-76a6f770-12dg   Ready    <none>   11d   v1.32.9-gke.1130000
   gke-imo-glacier-peak-a4x-highgpu-4g-a-76a6f770-4ncf   Ready    <none>   11d   v1.32.9-gke.1130000
   gke-imo-glacier-peak-a4x-highgpu-4g-a-76a6f770-6t1h   Ready    <none>   11d   v1.32.9-gke.1130000
   ...
   ```

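On larger clusters, counting the ready nodes is easier than reading the list. The sketch below assumes that the A4X node names contain `a4x-highgpu-4g`, as they do in the sample output above; your node names may differ. `count_ready_a4x_nodes` is a hypothetical helper name.

```shell
# Hypothetical helper: prints the number of nodes whose name contains
# "a4x-highgpu-4g" and whose STATUS column is exactly "Ready".
# Nodes reported as "Ready,SchedulingDisabled" are not counted.
count_ready_a4x_nodes() {
  kubectl get nodes --no-headers \
    | awk '$1 ~ /a4x-highgpu-4g/ && $2 == "Ready" { n++ } END { print n + 0 }'
}

# Usage: compare the result with the NODE_COUNT value used at deployment.
#   [ "$(count_ready_a4x_nodes)" -eq NODE_COUNT ] && echo "All A4X nodes are ready"
```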
## Additional permissions for MaxText recipes

Grant the Storage Admin role (`roles/storage.admin`) to the custom IAM node pool service account created by the Cluster Toolkit blueprint. This is required to support some recipes.

```bash
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:CLUSTER_NAME-gke-np-sa@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/storage.admin"
```

Replace the following:

- PROJECT_ID: the project ID of your project.
- CLUSTER_NAME: the name of your cluster.

## What's next

Once you have set up your GKE cluster with A4X node pools, you can proceed to deploy and
run your [benchmark recipes](../README.md#benchmarks-support-matrix).

## Clean up the environment

To remove the resources created when setting up the environment, follow the instructions below.

### Clean up resources created by Cluster Toolkit

To remove resources created by the Cluster Toolkit blueprint:

```
cd ~/cluster-toolkit
./gcluster destroy DEPLOYMENT_NAME
```

The command assumes that you cloned the Cluster Toolkit into your home directory. Adjust the path if you cloned it elsewhere.

Replace the following:

- DEPLOYMENT_NAME: the name you used during the deployment. This is the name of your cluster.

### Remove Cloud Storage buckets

To remove a Cloud Storage bucket from your environment, run the following command.

**IMPORTANT**: This command removes the bucket and all objects within it. You won't be able to recover them after the command is executed.

```
gcloud storage rm -r gs://BUCKET_NAME
```

Replace the following:

- BUCKET_NAME: the name of your bucket.

### Remove Artifact Registry

To delete the Artifact Registry repository:

```
gcloud artifacts repositories delete REPOSITORY --location=LOCATION
```

Replace the following:

- REPOSITORY: the name of your repository.
- LOCATION: the location of your repository.

## Get Help

If you encounter any issues or have questions about this setup, use one of the following
resources:

- Consult the [official GKE documentation](https://cloud.google.com/kubernetes-engine/docs).
- Check the issues section of this repository for known problems and solutions.
- Reach out to Google Cloud support.
