Commit 6d85af4

Author: Copybara

Copybara import of gpu-recipes:

- b78cca628c19a8f89b86b92112fdcc559447f26c Merge "Adding Llama3.1-405B Nemo for A4 pretraining recip...
- 4669d9f7c561ad58ac0ef65118ea5d23947e62e4 Fix link and typo main README
- 1515dfbe3106680d5820c506ddd9617f6bc121fd Adding Llama-3.1-405B MaxText on 32 nodes for A4 High

GitOrigin-RevId: 1515dfbe3106680d5820c506ddd9617f6bc121fd

1 parent: 2d353ee
File tree

18 files changed: +2057 −2 lines changed
README.md

Lines changed: 7 additions & 0 deletions

```diff
@@ -35,6 +35,13 @@ Models | GPU Machine Type
 **Llama-3.1-405B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-405b/nemo-pretraining-gke/README.md)
 **Mixtral-8-7B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/mixtral-8x7b/nemo-pretraining-gke/README.md)
 
+
+### Training benchmarks A4 High
+
+Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe
+------------------ | ----------------------------------------------------------------------------------------------------------- | --------- | ------------- | ------------ | ------------------
+**Llama-3.1-405B** | [A4 High (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-high-vms) | MaxText | Pre-training | GKE | [Link](./training/a4high/llama-3.1-405b/maxtext-pretraining-gke/README.md)
+**Llama-3.1-405B** | [A4 High (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-high-vms) | NeMo | Pre-training | GKE | [Link](./training/a4high/llama-3.1-405b/nemo-pretraining-gke/README.md)
 ### Inference benchmarks A3 Mega
 
 | Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
```
Lines changed: 255 additions & 0 deletions (new file)

# Configuring the environment for running benchmark recipes on a GKE Cluster with A4 High Node Pools

This [guide](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute) outlines the steps to configure the environment required to run benchmark recipes on a [Google Kubernetes Engine (GKE) cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/kubernetes-engine-overview) with [A4 High](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) node pools.

## Prerequisites

Before you begin, ensure you have completed the following:

1. Created a Google Cloud project with billing enabled.

   a. To create a project, see [Creating and managing projects](https://cloud.google.com/resource-manager/docs/creating-managing-projects).
   b. To enable billing, see [Verify the billing status of your projects](https://cloud.google.com/billing/docs/how-to/verify-billing-enabled).

2. Enabled the following APIs:

   - [Service Usage API](https://console.cloud.google.com/apis/library/serviceusage.googleapis.com)
   - [Google Kubernetes Engine API](https://console.cloud.google.com/flows/enableapi?apiid=container.googleapis.com)
   - [Cloud Storage API](https://console.cloud.google.com/flows/enableapi?apiid=storage.googleapis.com)
   - [Artifact Registry API](https://console.cloud.google.com/flows/enableapi?apiid=artifactregistry.googleapis.com)

3. Requested enough GPU quota. Each `a4-highgpu-8g` machine has 8 B200 GPUs attached.

   1. To view quotas, see [View the quotas for your project](https://cloud.google.com/docs/quotas/view-manage). In the **Filter** field, select **Dimensions (e.g. location)** and specify [`gpu_family:NVIDIA_B200`](https://cloud.google.com/compute/resource-usage#gpu_quota).
   2. If you don't have enough quota, [request a higher quota](https://cloud.google.com/docs/quotas/view-manage#requesting_higher_quota).
## Reserve capacity

To ensure that your workloads have the A4 High GPU resources required for these instructions, you can create a [future reservation request](https://cloud.google.com/compute/docs/instances/future-reservations-overview). With this request, you reserve blocks of capacity for a defined duration in the future. At that future date and time, Compute Engine automatically provisions the blocks of capacity by creating on-demand reservations, which you can immediately consume by provisioning node pools for this cluster.

Additionally, because your reserved capacity might span multiple [blocks](https://cloud.google.com/ai-hypercomputer/docs/terminology#block), we recommend that you create GKE nodes on a specific block within your reservation.

Complete the following steps to request capacity and gather the information required to create nodes on a specific block within your reservation:

1. [Request capacity](https://cloud.google.com/ai-hypercomputer/docs/request-capacity).

1. To get the names of the blocks that are available for your reservation, run the following command:

   ```sh
   gcloud beta compute reservations blocks list RESERVATION_NAME \
       --zone=ZONE --format "value(name)"
   ```

   Replace the following:

   * `RESERVATION_NAME`: the name of your reservation.
   * `ZONE`: the compute zone of your reservation.

   The output has the format `BLOCK_NAME`. For example, the output might be similar to `example-res1-block-0001`.

1. If you want to target specific blocks within a reservation when provisioning GKE node pools, you must specify the full reference to your block as follows:

   ```none
   RESERVATION_NAME/reservationBlocks/BLOCK_NAME
   ```

   For example, using the example output in the preceding step, the full path is `example-res1/reservationBlocks/example-res1-block-0001`.
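Put together, the reservation name and block name combine into the full block reference. A minimal sketch using the example names from the preceding step (substitute your own values):

```sh
# Example values from the preceding step; substitute your own
# reservation and block names.
RESERVATION_NAME="example-res1"
BLOCK_NAME="example-res1-block-0001"

# Build the full block reference expected when provisioning node pools.
FULL_BLOCK_REF="${RESERVATION_NAME}/reservationBlocks/${BLOCK_NAME}"
echo "${FULL_BLOCK_REF}"
```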
## The environment

The environment comprises the following components:

- Client workstation: used to prepare, submit, and monitor ML workloads.
- [Google Cloud Storage (GCS) bucket](https://cloud.google.com/storage/docs): used for storing datasets and logs.
- [Artifact Registry](https://cloud.google.com/artifact-registry/docs/overview): serves as a private container registry for storing and managing Docker images used in the deployment.
- [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine/docs/concepts/kubernetes-engine-overview) cluster with A4 High node pools: provides a managed Kubernetes environment to run benchmark recipes.

## Set up the client workstation

You have two options: a local machine or Google Cloud Shell.

### Google Cloud Shell

We recommend using [Google Cloud Shell](https://cloud.google.com/shell/docs) because it comes with all the necessary components pre-installed.

### Local client

If you prefer to use your local machine, ensure it has the following components installed:

1. Google Cloud SDK. To install, see [Install the gcloud CLI](https://cloud.google.com/sdk/docs/install).
2. kubectl. To install, see the [Kubernetes documentation](https://kubernetes.io/docs/tasks/tools/#kubectl).
3. Helm. To install, see the [Helm documentation](https://helm.sh/docs/intro/quickstart/).
4. Docker. To install, see the [Docker documentation](https://docs.docker.com/engine/install/).
## Set up a Google Cloud Storage bucket

Create a bucket for storing datasets and logs:

```bash
gcloud storage buckets create gs://<BUCKET_NAME> --location=<BUCKET_LOCATION> --no-public-access-prevention
```

Replace the following:

- `BUCKET_NAME`: the name of your bucket. The name must comply with the [Cloud Storage bucket naming conventions](https://cloud.google.com/storage/docs/buckets#naming).
- `BUCKET_LOCATION`: the location of your bucket. The bucket must be located in the same region as the GKE cluster.
Add an IAM binding to allow workloads authenticated via Workload Identity (with the default service account) to access Cloud Storage objects:

```bash
PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
gcloud storage buckets add-iam-policy-binding gs://<BUCKET_NAME> \
  --role=roles/storage.objectUser \
  --member=principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/default/sa/default \
  --condition=None
```

Replace the following:

- `BUCKET_NAME`: the name of the bucket created in the previous step.

The command also expects the `PROJECT_ID` environment variable to be set to your Google Cloud project ID.
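The `--member` value above follows the Workload Identity principal format for the default Kubernetes service account in the `default` namespace. A minimal sketch of how that string is assembled, using hypothetical project values:

```sh
# Hypothetical project values for illustration only.
PROJECT_ID="my-project"
PROJECT_NUMBER="123456789012"

# Workload Identity principal for the default KSA in the default namespace.
MEMBER="principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/default/sa/default"
echo "${MEMBER}"
```

Workloads running in a different namespace or under a different Kubernetes service account need the `ns/` and `sa/` segments adjusted accordingly.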
## Set up an Artifact Registry

- If you use Cloud KMS for repository encryption, create your Artifact Registry repository by using the [instructions here](https://cloud.google.com/artifact-registry/docs/repositories/create-repos#create-repo-gcloud-docker).
- If you don't use Cloud KMS, you can create your repository by using the following command:

```bash
gcloud artifacts repositories create <REPOSITORY> \
  --repository-format=docker \
  --location=<LOCATION> \
  --description="<DESCRIPTION>"
```

Replace the following:

- `REPOSITORY`: the name of the repository. For each repository location in a project, repository names must be unique.
- `LOCATION`: the regional or multi-regional location for the repository. You can omit this flag if you set a default region.
- `DESCRIPTION`: a description of the repository. Don't include sensitive data because repository descriptions are not encrypted.
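Images pushed to this repository use the `LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG` path format. A sketch with hypothetical names (the repository and image names here are illustrative, not part of the recipe), including the `gcloud auth configure-docker` step that lets Docker push to the registry host, guarded so it is skipped where `gcloud` is unavailable:

```sh
# Hypothetical values for illustration; substitute your own.
LOCATION="us-central1"
PROJECT_ID="my-project"
REPOSITORY="gpu-recipes"

# Full Artifact Registry path for an image in this repository.
IMAGE_URI="${LOCATION}-docker.pkg.dev/${PROJECT_ID}/${REPOSITORY}/nemo-workload:latest"
echo "${IMAGE_URI}"

# Let Docker authenticate to the registry host via gcloud
# (no-op here if gcloud is not installed).
if command -v gcloud >/dev/null 2>&1; then
  gcloud auth configure-docker "${LOCATION}-docker.pkg.dev" --quiet
fi
```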
## Create a GKE Cluster with A4 High Node Pools

Follow [this guide]() for detailed instructions to create a GKE cluster with A4 High node pools and the required GPU driver versions.

The documentation uses [Cluster Toolkit](https://cloud.google.com/cluster-toolkit/docs/overview) to create your GKE cluster quickly while incorporating best practices:

- Creation of the necessary VPC networks and subnets.
- Creation of a GKE cluster with multi-networking enabled.
- Creation of an A4 High node pool with NVIDIA B200 GPUs.
- Installation of the required components for GPUDirect-RDMA and the NCCL plugin.

1. [Launch Cloud Shell](https://cloud.google.com/shell/docs/launching-cloud-shell). You can use a different environment; however, we recommend Cloud Shell because the dependencies for Cluster Toolkit are already pre-installed. If you don't want to use Cloud Shell, follow the instructions to [install dependencies](https://cloud.google.com/cluster-toolkit/docs/setup/install-dependencies) to prepare a different environment.
1. Clone the Cluster Toolkit git repository:

   ```sh
   cd ~
   git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
   ```

1. Install the Cluster Toolkit:

   ```sh
   cd cluster-toolkit && git checkout main && make
   ```

1. Create a Cloud Storage bucket to store the state of the Terraform deployment:

   ```sh
   gcloud storage buckets create gs://TF_STATE_BUCKET_NAME \
       --default-storage-class=STANDARD \
       --location=COMPUTE_REGION \
       --uniform-bucket-level-access
   gcloud storage buckets update gs://TF_STATE_BUCKET_NAME --versioning
   ```

   Replace the following:

   * `TF_STATE_BUCKET_NAME`: the name of the new Cloud Storage bucket.
   * `COMPUTE_REGION`: the compute region where you want to store the state of the Terraform deployment.
1. In the [`examples/gke-a4-highgpu/gke-a4-highgpu-deployment.yaml`](https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/examples/gke-a4-highgpu/gke-a4-highgpu-deployment.yaml) file, replace the following variables in the `terraform_backend_defaults` and `vars` sections to match the specific values for your deployment:

   * `BUCKET_NAME`: the name of the Cloud Storage bucket you created in the previous step to store the state of the Terraform deployment.
   * `PROJECT_ID`: your Google Cloud project ID.
   * `COMPUTE_REGION`: the compute region for the cluster.
   * `COMPUTE_ZONE`: the compute zone for the node pool of A4 High machines.
   * `IP_ADDRESS/SUFFIX`: the IP address range that you want to allow to connect to the cluster. This CIDR block must include the IP address of the machine that calls Terraform.
   * `RESERVATION_NAME`: the name of your reservation.
   * `BLOCK_NAME`: the name of a specific block within the reservation.
   * `NODE_COUNT`: the number of A4 High nodes in your cluster.

   To modify advanced settings, edit `examples/gke-a4-highgpu/gke-a4-highgpu.yaml`.

1. Generate [Application Default Credentials (ADC)](https://cloud.google.com/docs/authentication/provide-credentials-adc#google-idp) to provide access to Terraform.

1. Deploy the blueprint to provision the GKE infrastructure using A4 High machine types:

   ```sh
   cd ~/cluster-toolkit
   ./gcluster deploy -d \
       examples/gke-a4-highgpu/gke-a4-highgpu-deployment.yaml \
       examples/gke-a4-highgpu/gke-a4-highgpu.yaml
   ```
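Once the deployment finishes, you can point `kubectl` at the new cluster and check that the GPU nodes registered. A sketch with illustrative cluster and region names (substitute your own; the accelerator label column may show a value specific to your node pool), guarded so it is skipped where the tools are unavailable:

```sh
# Illustrative names; substitute your cluster name and region.
CLUSTER_NAME="gke-a4-high"
COMPUTE_REGION="us-central1"

# Skip gracefully where gcloud/kubectl are not installed.
if command -v gcloud >/dev/null 2>&1 && command -v kubectl >/dev/null 2>&1; then
  # Fetch credentials for kubectl.
  gcloud container clusters get-credentials "${CLUSTER_NAME}" \
      --region "${COMPUTE_REGION}"
  # List nodes with their GPU accelerator label, if any.
  kubectl get nodes -L cloud.google.com/gke-accelerator
fi
```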
## Clean up

To avoid recurring charges for the resources used on this page, clean up the resources provisioned by Cluster Toolkit, including the VPC networks and the GKE cluster:

```sh
./gcluster destroy gke-a4-high/
```

## What's next

Once you have set up your GKE cluster with A4 High node pools, you can proceed to deploy and run your [benchmark recipes](../README.md#benchmarks-support-matrix).

## Get Help

If you encounter any issues or have questions about this setup, use one of the following resources:

- Consult the [official GKE documentation](https://cloud.google.com/kubernetes-engine/docs).
- Check the issues section of this repository for known problems and solutions.
- Reach out to Google Cloud support.
Lines changed: 19 additions & 0 deletions (new file)

```yaml
hardware: gpu
dcn_data_parallelism: 2
dcn_fsdp_parallelism: 16
ici_fsdp_parallelism: 8
per_device_batch_size: 2
max_target_length: 8192
learning_rate: 0.001
model_name: llama3.1-405b
enable_checkpointing: false
quantization: fp8
attention: cudnn_flash_te
remat_policy: full
use_iota_embed: true
dataset_type: synthetic
logits_dot_in_fp32: false
enable_goodput_recording: false
monitor_goodput: false
save_config_to_gcs: true
```
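The sharding dimensions in this MaxText config multiply out to the recipe's 32-node A4 High topology. A quick sanity check, assuming MaxText's usual convention that the global batch size is the per-device batch size times the chip count:

```sh
# Values copied from the config above.
DCN_DATA=2; DCN_FSDP=16; ICI_FSDP=8
PER_DEVICE_BATCH=2
GPUS_PER_NODE=8   # a4-highgpu-8g has 8 B200 GPUs

# The parallelism dimensions must multiply to the total chip count.
TOTAL_CHIPS=$((DCN_DATA * DCN_FSDP * ICI_FSDP))
NODES=$((TOTAL_CHIPS / GPUS_PER_NODE))
GLOBAL_BATCH=$((PER_DEVICE_BATCH * TOTAL_CHIPS))

echo "chips=${TOTAL_CHIPS} nodes=${NODES} global_batch=${GLOBAL_BATCH}"
```

This yields 256 GPUs across 32 nodes, matching the "MaxText on 32 nodes" recipe named in the commit message.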
