
Commit 5b132ac

dasoto authored and committed
Copybara import of gpu-recipes:

- aea441d783812e93cb0d7ff8bbe919444494e37a Merge "Adding Llama-3.1-405B MaxText pretraining recipe" i... GitOrigin-RevId: aea441d783812e93cb0d7ff8bbe919444494e37a
1 parent 78012f9 commit 5b132ac

File tree

6 files changed: +425 −0 lines changed

README.md

Lines changed: 1 addition & 0 deletions

@@ -32,6 +32,7 @@ Welcome to the reproducible benchmark recipes repository for GPUs! This reposito
 | ---------------- | ---------------- | --------- | ------------------- | ------------ | ------------------ |
 | **Llama-3.1-70B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | MaxText | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-70b/maxtext-pretraining-gke/README.md) |
 | **Llama-3.1-70B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-70b/nemo-pretraining-gke/README.md) |
+| **Llama-3.1-405B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | MaxText | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-405b/maxtext-pretraining-gke/README.md) |
 | **Mixtral-8-7B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | MaxText | Pre-training | GKE | [Link](./training/a3ultra/mixtral-8x7b/maxtext-pretraining-gke/README.md) |
 | **Mixtral-8-7B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/mixtral-8x7b/nemo-pretraining-gke/README.md) |

src/frameworks/a3ultra/maxtext-configs/llama-3.1-405b-1024gpus-a3u-fp8.yaml

Lines changed: 20 additions & 0 deletions

@@ -0,0 +1,20 @@
hardware: gpu
dcn_data_parallelism: 2
ici_data_parallelism: 1
dcn_fsdp_parallelism: 64
ici_fsdp_parallelism: 8
per_device_batch_size: 2
max_target_length: 8192
learning_rate: 0.001
model_name: llama3.1-405b
enable_checkpointing: false
quantization: fp8
attention: cudnn_flash_te
remat_policy: full
use_iota_embed: true
dataset_type: synthetic
logits_dot_in_fp32: false
enable_goodput_recording: false
monitor_goodput: false
save_config_to_gcs: true
src/frameworks/a3ultra/maxtext-configs/llama-3.1-405b-512gpus-a3u-fp8.yaml

Lines changed: 18 additions & 0 deletions

@@ -0,0 +1,18 @@
hardware: gpu
dcn_fsdp_parallelism: 64
ici_fsdp_parallelism: 8
per_device_batch_size: 2
max_target_length: 8192
learning_rate: 0.001
model_name: llama3.1-405b
enable_checkpointing: false
quantization: fp8
attention: cudnn_flash_te
remat_policy: full
use_iota_embed: true
dataset_type: synthetic
logits_dot_in_fp32: false
enable_goodput_recording: false
monitor_goodput: false
save_config_to_gcs: true
src/frameworks/a3ultra/maxtext-configs/llama-3.1-405b-768gpus-a3u-fp8.yaml

Lines changed: 20 additions & 0 deletions

@@ -0,0 +1,20 @@
hardware: gpu
dcn_data_parallelism: 3
dcn_fsdp_parallelism: 32
ici_fsdp_parallelism: -1
ici_tensor_parallelism: 1
per_device_batch_size: 2
max_target_length: 8192
learning_rate: 0.001
model_name: llama3.1-405b
enable_checkpointing: false
quantization: fp8
attention: cudnn_flash_te
remat_policy: full
use_iota_embed: true
dataset_type: synthetic
logits_dot_in_fp32: false
enable_goodput_recording: false
monitor_goodput: false
save_config_to_gcs: true
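A quick way to sanity-check these configs: the GPU count each file targets is the product of its parallelism dimensions, where the DCN dimensions span nodes and the ICI dimensions span the 8 GPUs inside each a3-ultragpu-8g node (MaxText resolves a value of -1 to the remaining devices). A minimal sketch of the arithmetic:

```bash
# Total GPUs = product of DCN and ICI parallelism dimensions (unset dimensions default to 1).
echo "1024-GPU config: $((2 * 64 * 8)) GPUs"  # dcn_data=2, dcn_fsdp=64, ici_fsdp=8
echo "512-GPU config: $((64 * 8)) GPUs"       # dcn_fsdp=64, ici_fsdp=8
echo "768-GPU config: $((3 * 32 * 8)) GPUs"   # dcn_data=3, dcn_fsdp=32, ici_fsdp=-1 resolves to 8
```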
training/a3ultra/llama-3.1-405b/maxtext-pretraining-gke/README.md

Lines changed: 313 additions & 0 deletions

@@ -0,0 +1,313 @@
# Pretrain Llama-3.1-405B workloads on A3 Ultra GKE Node pools using MaxText

This recipe outlines the steps for running a Llama-3.1-405B pretraining workload on
[A3 Ultra GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the
[MaxText framework](https://github.com/AI-Hypercomputer/maxtext).

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Job configuration and deployment - A Helm chart is used to configure and deploy the
  [Kubernetes Indexed Job](https://kubernetes.io/blog/2021/04/19/introducing-indexed-jobs).
  This job encapsulates the
  [MaxText pretraining workload](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/train.py).
  The chart generates the job's manifest, adhering to best practices for using RDMA over Ethernet (RoCE) with Google Kubernetes Engine (GKE). You can render the manifest locally to review it, as shown below.
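For example, after you complete the setup steps in this recipe (environment variables, repository clone, and container image), you can render the chart with `helm template` to review the generated manifest without deploying anything. This sketch mirrors the install flags used later in this recipe:

```bash
# Render the job manifest locally to inspect the RoCE networking annotations.
cd $RECIPE_ROOT
helm template $USER-llama-3-1-405b-maxtext-fp8 \
    $REPO_ROOT/src/helm-charts/a3ultra/maxtext-training \
    -f values.yaml \
    --set-file maxtext_config=$REPO_ROOT/src/frameworks/a3ultra/maxtext-configs/llama-3.1-405b-512gpus-a3u-fp8.yaml \
    --set workload.image=${ARTIFACT_REGISTRY}/maxtext-benchmark \
    --set workload.run_name=$USER-llama-3-1-405b-maxtext-fp8 \
    --set workload.gpus=512 \
    --set clusterName=$CLUSTER_NAME \
    --set queue=$KUEUE_NAME \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} | less
```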
## Test environment

This recipe has been optimized for and tested with the following configuration:

- A cluster with 64, 96, or 128 [a3-ultragpu-8g](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) machines.
- Machine placement in the cluster is configured using a [compact placement policy](https://cloud.google.com/kubernetes-engine/docs/how-to/compact-placement)
- MaxText docker container
- FP8 precision training
- Uses a synthetic pretraining dataset provided by the MaxText framework. By default, the job
  is configured to execute 15 training steps. If you want to change the number of training steps,
  see [Configure and submit a pretraining job](#configure-and-submit-a-pretraining-job).
## Prerequisites

Before running this recipe, ensure your environment is configured as follows:

- A GKE cluster with the following setup:
  - An A3 Ultra node pool (64 nodes - 512 GPUs, 96 nodes - 768 GPUs, or 128 nodes - 1024 GPUs)
  - Topology-aware scheduling enabled
- An Artifact Registry repository to store the Docker image.
- A Google Cloud Storage (GCS) bucket to store results.
  *Important: This bucket must be in the same region as the GKE cluster*.
- A client workstation with the following pre-installed:
  - Google Cloud SDK
  - Helm
  - kubectl

To prepare the required environment, see
[GKE environment setup guide](../../../../docs/configuring-environment-gke-a3-ultra.md).
## Run the recipe

It is recommended to use Cloud Shell as your client to complete the steps.
Cloud Shell comes pre-installed with the necessary utilities, including
`kubectl`, the Google Cloud SDK, and Helm.

### Launch Cloud Shell

In the Google Cloud console, start a [Cloud Shell instance](https://console.cloud.google.com/?cloudshell=true).
### Configure environment settings

From your client, complete the following steps:

1. Set the environment variables to match your environment:

   ```bash
   export PROJECT_ID=<PROJECT_ID>
   export REGION=<REGION>
   export CLUSTER_REGION=<CLUSTER_REGION>
   export CLUSTER_NAME=<CLUSTER_NAME>
   export GCS_BUCKET=<GCS_BUCKET>
   export ARTIFACT_REGISTRY=<ARTIFACT_REGISTRY>
   export KUEUE_NAME=<KUEUE_NAME>
   ```

   Replace the following values:

   - `<PROJECT_ID>`: your Google Cloud project ID
   - `<REGION>`: the region where you want to run Cloud Build
   - `<CLUSTER_REGION>`: the region where your cluster is located
   - `<CLUSTER_NAME>`: the name of your GKE cluster
   - `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Do not include the `gs://` prefix
   - `<ARTIFACT_REGISTRY>`: the full name of your Artifact Registry repository in the following format: *LOCATION*-docker.pkg.dev/*PROJECT_ID*/*REPOSITORY*
   - `<KUEUE_NAME>`: the name of the Kueue queue configured for TAS. The default queue created by the Cluster Toolkit is `a3-ultra`. Verify the name of your local queue by running `kubectl get queues` and modify it as needed.

1. Set the default project:

   ```bash
   gcloud config set project $PROJECT_ID
   ```
### Get the recipe

From your client, clone the `gpu-recipes` repository and set a reference to the recipe folder.

```bash
cd
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a3ultra/llama-3.1-405b/maxtext-pretraining-gke
```
### Get cluster credentials

From your client, get the credentials for your cluster.

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```
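Optionally, verify that the credentials work and that the expected number of A3 Ultra nodes is visible. For example, using the standard `node.kubernetes.io/instance-type` node label:

```bash
# Count the a3-ultragpu-8g nodes in the cluster (expect 64, 96, or 128).
kubectl get nodes -l node.kubernetes.io/instance-type=a3-ultragpu-8g --no-headers | wc -l
```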
### Build and push a docker container image to Artifact Registry

To build the container, complete the following steps from your client:

1. Use Cloud Build to build and push the container image.

   ```bash
   cd $REPO_ROOT/src/docker/maxtext
   gcloud builds submit --region=${REGION} \
       --config cloudbuild.yml \
       --substitutions _ARTIFACT_REGISTRY=$ARTIFACT_REGISTRY \
       --timeout "2h" \
       --machine-type=e2-highcpu-32 \
       --quiet \
       --async
   ```

   This command outputs the build ID.

1. You can monitor the build progress by streaming the logs for the build ID.
   To do this, run the following command.

   Replace `<BUILD_ID>` with your build ID.

   ```bash
   BUILD_ID=<BUILD_ID>

   gcloud beta builds log $BUILD_ID --region=$REGION
   ```
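When the build completes, you can confirm that the image is available in Artifact Registry before submitting the job. The image name `maxtext-benchmark` matches the one referenced by the Helm commands below:

```bash
# List images in the repository; expect an entry for maxtext-benchmark.
gcloud artifacts docker images list $ARTIFACT_REGISTRY
```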
### Configure and submit a pretraining job

#### Using 64 nodes (512 GPUs)

The default job setting is 15 training steps and fp8 precision. To execute the job with the
default settings, run the following command from your client:

```bash
cd $RECIPE_ROOT
helm install -f values.yaml \
    --set-file maxtext_config=$REPO_ROOT/src/frameworks/a3ultra/maxtext-configs/llama-3.1-405b-512gpus-a3u-fp8.yaml \
    --set workload.image=${ARTIFACT_REGISTRY}/maxtext-benchmark \
    --set workload.run_name=$USER-llama-3-1-405b-maxtext-fp8 \
    --set workload.gpus=512 \
    --set clusterName=$CLUSTER_NAME \
    --set queue=$KUEUE_NAME \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    $USER-llama-3-1-405b-maxtext-fp8 \
    $REPO_ROOT/src/helm-charts/a3ultra/maxtext-training
```
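After the chart is installed, you can confirm that the release and its job were created. This assumes the chart names the job after the release, which is consistent with the monitoring commands later in this recipe:

```bash
# Verify the Helm release and the job it created.
helm list --filter "$USER-llama-3-1-405b-maxtext-fp8"
kubectl get jobs | grep $USER-llama-3-1-405b-maxtext-fp8
```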
#### Using 96 nodes (768 GPUs)

The default job setting is 15 training steps and fp8 precision. To execute the job with the
default settings, run the following command from your client:

```bash
cd $RECIPE_ROOT
helm install -f values.yaml \
    --set-file maxtext_config=$REPO_ROOT/src/frameworks/a3ultra/maxtext-configs/llama-3.1-405b-768gpus-a3u-fp8.yaml \
    --set workload.image=${ARTIFACT_REGISTRY}/maxtext-benchmark \
    --set workload.run_name=$USER-llama-3-1-405b-maxtext-fp8 \
    --set workload.gpus=768 \
    --set clusterName=$CLUSTER_NAME \
    --set queue=$KUEUE_NAME \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    $USER-llama-3-1-405b-maxtext-fp8 \
    $REPO_ROOT/src/helm-charts/a3ultra/maxtext-training
```

#### Using 128 nodes (1024 GPUs)

The default job setting is 15 training steps and fp8 precision. To execute the job with the
default settings, run the following command from your client:

```bash
cd $RECIPE_ROOT
helm install -f values.yaml \
    --set-file maxtext_config=$REPO_ROOT/src/frameworks/a3ultra/maxtext-configs/llama-3.1-405b-1024gpus-a3u-fp8.yaml \
    --set workload.image=${ARTIFACT_REGISTRY}/maxtext-benchmark \
    --set workload.run_name=$USER-llama-3-1-405b-maxtext-fp8 \
    --set workload.gpus=1024 \
    --set clusterName=$CLUSTER_NAME \
    --set queue=$KUEUE_NAME \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    $USER-llama-3-1-405b-maxtext-fp8 \
    $REPO_ROOT/src/helm-charts/a3ultra/maxtext-training
```
#### Configure job settings

**Examples**

- To set the number of training steps to 100, run the following command from your client:

  ```bash
  cd $RECIPE_ROOT
  helm install -f values.yaml \
      --set-file maxtext_config=$REPO_ROOT/src/frameworks/a3ultra/maxtext-configs/llama-3.1-405b-512gpus-a3u-fp8.yaml \
      --set workload.image=${ARTIFACT_REGISTRY}/maxtext-benchmark \
      --set workload.run_name=$USER-llama-3-1-405b-maxtext-fp8 \
      --set workload.gpus=512 \
      --set clusterName=$CLUSTER_NAME \
      --set queue=$KUEUE_NAME \
      --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
      --set workload.steps=100 \
      $USER-llama-3-1-405b-maxtext-fp8 \
      $REPO_ROOT/src/helm-charts/a3ultra/maxtext-training
  ```
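Other chart values can be overridden the same way. To see exactly what a release deployed, including the effect of any overrides, you can read back its rendered manifest:

```bash
# Print the manifest that Helm applied for the release.
helm get manifest $USER-llama-3-1-405b-maxtext-fp8 | head -n 50
```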
### Monitor the job

To check the status of pods in the indexed job, run the following command from your client:

```bash
kubectl get pods | grep $USER-llama-3-1-405b-maxtext-fp8
```

To get the logs for one of the pods, run the following command from your client:

```bash
kubectl logs "<pod_name>"
```
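To follow the training output across all pods as it is produced, you can also stream logs by label. This sketch assumes the pods carry the standard `job-name` label set to the release name, consistent with the pod names above:

```bash
# Stream logs from all pods in the indexed job (Ctrl+C to stop).
kubectl logs -f -l job-name=$USER-llama-3-1-405b-maxtext-fp8 --max-log-requests=10
```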
### Analyze results

When completed, the job creates TensorBoard logs in the following location:

```
gs://${GCS_BUCKET}/maxtext/$JOB_ID/tensorboard/$JOB_ID/
├── events.out.tfevents....
...
```
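To list the generated event files without mounting the bucket, you can query Cloud Storage directly from your client:

```bash
# List TensorBoard event files written by the job.
gcloud storage ls -r gs://${GCS_BUCKET}/maxtext/
```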
To inspect the text logs generated by MaxText, retrieve them from any Pod in the job using the following command:
`kubectl logs "<pod_name>"`

Here is an example of a log entry:

```
completed step: 12, seconds: 15.516, TFLOP/s/device: 508.371, Tokens/s/device: 1055.949, total_weights: 4194304, loss: 0.000
```
The logs show the step time in seconds and the TFLOP/s/device.

### Calculate training performance metrics (eMFU)

This section explains how to calculate the effective Model FLOPS Utilization (eMFU) using the logs from the pods.
Using the example log entry from the previous step, with a TFLOP/s/device of 508.371,
you can compute the eMFU using the following formula:

```
       TFLOP/s/device     508.371
eMFU = ---------------- = ------- = 0.514 = 51.4%
       MAX TFLOP H200       989
```

MAX TFLOP H200: 989
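The same arithmetic can be scripted against a captured log line; a small sketch using the example entry above:

```bash
# Compute eMFU from a MaxText log line; 989 is the MAX TFLOP H200 figure used above.
LOG='completed step: 12, seconds: 15.516, TFLOP/s/device: 508.371, Tokens/s/device: 1055.949, total_weights: 4194304, loss: 0.000'
echo "$LOG" | awk -F'TFLOP/s/device: ' '{split($2, a, ","); printf "eMFU: %.1f%%\n", 100 * a[1] / 989}'
```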
### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart.
To uninstall the Helm release, run the following command from your client:

```bash
helm uninstall $USER-llama-3-1-405b-maxtext-fp8
```
### Running the recipe on a cluster that does not use the default configuration

If you created your cluster using the [GKE environment setup guide](../../../../docs/configuring-environment-gke-a3-ultra.md), it is configured with default settings that include the names for the networks and subnetworks used for:

- Communication between the host and external services.
- GPU-to-GPU communication.

For clusters with this default configuration, the Helm chart can automatically generate the [required networking annotations in a Pod's metadata](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute-custom#configure-pod-manifests-rdma). Therefore, you can use the streamlined command to install the chart, as described in the [Configure and submit a pretraining job](#configure-and-submit-a-pretraining-job) section.

To configure the correct networking annotations for a cluster that uses non-default names for GKE Network resources, you must provide the names of the GKE Network resources in your cluster when installing the chart. Use the following example command, remembering to replace the example values with the actual names of your cluster's GKE Network resources:

```bash
cd $RECIPE_ROOT
helm install -f values.yaml \
    --set-file maxtext_config=$REPO_ROOT/src/frameworks/a3ultra/maxtext-configs/llama-3.1-405b-512gpus-a3u-fp8.yaml \
    --set workload.image=${ARTIFACT_REGISTRY}/maxtext-benchmark \
    --set workload.run_name=$USER-llama-3-1-405b-maxtext-fp8 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    --set queue=$KUEUE_NAME \
    --set network.subnetworks[0]=default \
    --set network.subnetworks[1]=gvnic-1 \
    --set network.subnetworks[2]=rdma-0 \
    --set network.subnetworks[3]=rdma-1 \
    --set network.subnetworks[4]=rdma-2 \
    --set network.subnetworks[5]=rdma-3 \
    --set network.subnetworks[6]=rdma-4 \
    --set network.subnetworks[7]=rdma-5 \
    --set network.subnetworks[8]=rdma-6 \
    --set network.subnetworks[9]=rdma-7 \
    $USER-llama-3-1-405b-maxtext-fp8 \
    $REPO_ROOT/src/helm-charts/a3ultra/maxtext-training
```
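If you are not sure which names your cluster uses, you can list its GKE Network resources first. This assumes GKE multi-networking is enabled on the cluster, as required for A3 Ultra RDMA:

```bash
# List the GKE Network resources; use these names for the network.subnetworks[...] values.
kubectl get networks.networking.gke.io
```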
