Commit ca2078f

Author: Copybara

Copybara import of gpu-recipes:

- 3836cc5d728b8e754f66129f6d29119d1fec5c94 Merge "Adding 2048 gbs runs to Llama-3.1-70B Maxtext" into main
- a49b179dd786cc1dc2e74f120511df9c966aad83 Merge "Adding 32 nodes run to Llama-3.1-405B" into main
- fed82d961cc89c89d1e11614a4aada8ee3244ddf Merge "Remove the training recipe of A3ultra MaxText Mixt...

GitOrigin-RevId: fed82d961cc89c89d1e11614a4aada8ee3244ddf

1 parent 9b1c797 · commit ca2078f

File tree: 11 files changed, +113 -372 lines


README.md

Lines changed: 0 additions & 1 deletion

```diff
@@ -32,7 +32,6 @@ Models | GPU Machine Type
 **Llama-3.1-70B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | MaxText | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-70b/maxtext-pretraining-gke/README.md)
 **Llama-3.1-70B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-70b/nemo-pretraining-gke/README.md)
 **Llama-3.1-405B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | MaxText | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-405b/maxtext-pretraining-gke/README.md)
-**Mixtral-8-7B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | MaxText | Pre-training | GKE | [Link](./training/a3ultra/mixtral-8x7b/maxtext-pretraining-gke/README.md)
 **Mixtral-8-7B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/mixtral-8x7b/nemo-pretraining-gke/README.md)
 
 ### Inference benchmarks A3 Mega
```
src/frameworks/a3ultra/maxtext-configs/llama-3.1-405b-256gpus-a3u-fp8.yaml

Lines changed: 17 additions & 0 deletions

```diff
@@ -0,0 +1,17 @@
+hardware: gpu
+dcn_fsdp_parallelism: 32
+ici_fsdp_parallelism: 8
+per_device_batch_size: 2
+max_target_length: 8192
+learning_rate: 0.001
+model_name: llama3.1-405b
+enable_checkpointing: false
+quantization: fp8
+attention: cudnn_flash_te
+remat_policy: full
+use_iota_embed: true
+dataset_type: synthetic
+logits_dot_in_fp32: false
+enable_goodput_recording: false
+monitor_goodput: false
+save_config_to_gcs: true
```
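As a quick sanity check on this topology (a minimal sketch, not part of the recipe, assuming MaxText's usual convention that the device count is the product of the DCN and ICI parallelism dimensions):

```bash
# Sketch: verify the parallelism product matches the intended cluster shape.
# Assumes total GPUs = dcn_fsdp_parallelism * ici_fsdp_parallelism (MaxText layout).
dcn_fsdp=32; ici_fsdp=8; per_device_batch=2
gpus=$((dcn_fsdp * ici_fsdp))      # 32 * 8 = 256 GPUs, i.e. 32 a3-ultragpu-8g nodes
gbs=$((gpus * per_device_batch))   # 256 * 2 = 512 sequences per global batch
echo "GPUs: ${gpus}, global batch size: ${gbs}"
```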
src/frameworks/a3ultra/maxtext-configs/llama-3.1-70b-1024gpus-a3u-fp8-gbs2048.yaml

Lines changed: 19 additions & 0 deletions

```diff
@@ -0,0 +1,19 @@
+hardware: gpu
+dcn_data_parallelism: 16
+dcn_fsdp_parallelism: 8
+ici_fsdp_parallelism: 8
+per_device_batch_size: 2
+max_target_length: 8192
+learning_rate: 0.001
+model_name: llama3.1-70b
+enable_checkpointing: false
+quantization: fp8
+attention: cudnn_flash_te
+remat_policy: save_dot_except_mlp
+use_iota_embed: true
+scan_layers: true
+dataset_type: synthetic
+logits_dot_in_fp32: false
+enable_goodput_recording: false
+monitor_goodput: false
+save_config_to_gcs: true
```
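Under the same product-of-parallelisms reading, this config spans 16 (dcn_data) × 8 (dcn_fsdp) × 8 (ici_fsdp) = 1024 GPUs, and with per_device_batch_size: 2 the global batch size comes to 1024 × 2 = 2048, matching the gbs2048 suffix used in the helm commands further below.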
src/frameworks/a3ultra/maxtext-configs/llama-3.1-70b-512gpus-a3u-fp8-gbs2048.yaml

Lines changed: 19 additions & 0 deletions

```diff
@@ -0,0 +1,19 @@
+hardware: gpu
+dcn_data_parallelism: 4
+dcn_fsdp_parallelism: 16
+ici_fsdp_parallelism: 8
+per_device_batch_size: 4
+max_target_length: 8192
+learning_rate: 0.001
+model_name: llama3.1-70b
+enable_checkpointing: false
+quantization: fp8
+attention: cudnn_flash_te
+remat_policy: save_out_proj
+use_iota_embed: true
+scan_layers: true
+dataset_type: synthetic
+logits_dot_in_fp32: false
+enable_goodput_recording: false
+monitor_goodput: false
+save_config_to_gcs: true
```
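Likewise, 4 × 16 × 8 = 512 GPUs with per_device_batch_size: 4 again yields a global batch of 2048: the doubled per-device batch compensates for half the GPUs. Note that remat_policy also changes from save_dot_except_mlp to save_out_proj, plausibly trading recomputation for the larger per-device activation footprint.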

src/frameworks/a3ultra/maxtext-configs/mixtral-8x7b-256gpus-a3u-bf16.yaml

Lines changed: 0 additions & 47 deletions
This file was deleted.

src/helm-charts/a3ultra/maxtext-training/templates/maxtext-launcher-job.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -220,7 +220,7 @@ spec:
         - name: TF_CPP_MAX_LOG_LEVEL
           value: "100"
         - name: XLA_PYTHON_CLIENT_MEM_FRACTION
-          value: "0.92"
+          value: "0.98"
         - name: CUDA_DEVICE_MAX_CONNECTIONS
           value: "1"
         - name: NVTE_FUSED_ATTN
```
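For context, XLA_PYTHON_CLIENT_MEM_FRACTION sets the fraction of each GPU's memory that JAX preallocates for the XLA client at startup, so the container now claims 98% rather than 92% of device memory. A minimal local equivalent (the entry point here is hypothetical and stands in for the chart's launcher):

```bash
# JAX preallocates this fraction of each GPU's memory for the XLA client.
# At 0.98, everything else (NCCL buffers, CUDA context) must fit in the rest.
export XLA_PYTHON_CLIENT_MEM_FRACTION=0.98
python train.py  # hypothetical script; the chart sets the variable on the pod instead
```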

training/a3ultra/llama-3.1-405b/maxtext-pretraining-gke/README.md

Lines changed: 20 additions & 2 deletions

````diff
@@ -19,7 +19,7 @@ For this recipe, the following setup is used:
 
 This recipe has been optimized for and tested with the following configuration:
 
-- A cluster with 64, 96 or 128 [a3-ultragpu-8g](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) machines.
+- A cluster with 32, 64, 96 or 128 [a3-ultragpu-8g](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) machines.
 - Machine placement in the cluster is configured using a [compact placement policy](https://cloud.google.com/kubernetes-engine/docs/how-to/compact-placement)
 - MaxText docker container
 - FP8 precision training
@@ -32,7 +32,7 @@ This recipe has been optimized for and tested with the following configuration:
 Before running this recipe, ensure your environment is configured as follows:
 
 - A GKE cluster with the following setup:
-  - An A3 Ultra node pool (64 nodes - 512 GPUs, 96 nodes - 768 GPUs or 128 nodes - 1024 GPUs)
+  - An A3 Ultra node pool (32 nodes - 256 GPUs, 64 nodes - 512 GPUs, 96 nodes - 768 GPUs or 128 nodes - 1024 GPUs)
   - Topology-aware scheduling enabled
 - An Artifact Registry repository to store the Docker image.
 - A Google Cloud Storage (GCS) bucket to store results.
@@ -140,6 +140,24 @@ To build the container, complete the following steps from your client:
 
 ### Configure and submit a pretraining job
 
+#### Using 32 nodes (256 GPUs)
+
+The default job setting is 15 training steps and fp8 precision. To execute the job with the
+default settings, run the following command from your client:
+
+```bash
+cd $RECIPE_ROOT
+helm install -f values.yaml \
+    --set-file maxtext_config=$REPO_ROOT/src/frameworks/a3ultra/maxtext-configs/llama-3.1-405b-256gpus-a3u-fp8.yaml \
+    --set workload.image=${ARTIFACT_REGISTRY}/maxtext-benchmark \
+    --set workload.run_name=$USER-llama-3-1-405b-maxtext-fp8 \
+    --set workload.gpus=256 \
+    --set queue=$KUEUE_NAME \
+    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+    $USER-llama-3-1-405b-maxtext-fp8 \
+    $REPO_ROOT/src/helm-charts/a3ultra/maxtext-training
+```
+
 #### Using 64 nodes (512 GPUs)
 
 The default job setting is 15 training steps and fp8 precision. To execute the job with the
````
training/a3ultra/llama-3.1-70b/maxtext-pretraining-gke/README.md

Lines changed: 36 additions & 0 deletions

````diff
@@ -204,6 +204,42 @@ helm install -f values.yaml \
     $REPO_ROOT/src/helm-charts/a3ultra/maxtext-training
 ```
 
+#### 64 nodes (512 GPUs) global batch size 2048
+
+The default job setting is 50 training steps and fp8 precision. To execute the job with the
+default settings, run the following command from your client:
+
+```bash
+cd $RECIPE_ROOT
+helm install -f values.yaml \
+    --set-file maxtext_config=$REPO_ROOT/src/frameworks/a3ultra/maxtext-configs/llama-3.1-70b-512gpus-a3u-fp8-gbs2048.yaml \
+    --set workload.image=${ARTIFACT_REGISTRY}/maxtext-benchmark \
+    --set workload.run_name=$USER-llama-3-1-70b-maxtext-fp8-64nodes-2048 \
+    --set workload.gpus=512 \
+    --set queue=$KUEUE_NAME \
+    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+    $USER-llama-3-1-70b-maxtext-fp8-64nodes-2048 \
+    $REPO_ROOT/src/helm-charts/a3ultra/maxtext-training
+```
+
+#### 128 nodes (1024 GPUs) global batch size 2048
+
+The default job setting is 50 training steps and fp8 precision. To execute the job with the
+default settings, run the following command from your client:
+
+```bash
+cd $RECIPE_ROOT
+helm install -f values.yaml \
+    --set-file maxtext_config=$REPO_ROOT/src/frameworks/a3ultra/maxtext-configs/llama-3.1-70b-1024gpus-a3u-fp8-gbs2048.yaml \
+    --set workload.image=${ARTIFACT_REGISTRY}/maxtext-benchmark \
+    --set workload.run_name=$USER-llama-3-1-70b-maxtext-fp8-128nodes-2048 \
+    --set workload.gpus=1024 \
+    --set queue=$KUEUE_NAME \
+    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+    $USER-llama-3-1-70b-maxtext-fp8-128nodes-2048 \
+    $REPO_ROOT/src/helm-charts/a3ultra/maxtext-training
+```
+
 #### Configure job settings
 
 **Examples**
````
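Both new run shapes target the same 2048-sequence global batch, so throughput comparisons between them isolate scaling behavior: the 64-node variant reaches it with per_device_batch_size: 4 on 512 GPUs, while the 128-node variant uses per_device_batch_size: 2 on 1024 GPUs (per the configs added above). Pick whichever matches the node pool provisioned in your cluster.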

training/a3ultra/llama-3.1-70b/maxtext-pretraining-gke/values.yaml

Lines changed: 1 addition & 0 deletions

```diff
@@ -28,6 +28,7 @@ workload:
   steps: 50
 
 network:
+  gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.0.3
   subnetworks[]:
   ncclSettings:
     - name: NCCL_DEBUG
```
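Reading from the image path, gibVersion pins the gpudirect-gib NCCL plugin image (here v1.0.3) that the chart wires into the workload's networking on A3 Ultra, rather than leaving the version to a chart default; this is an inference from the path and chart structure, not something the diff states.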
