Commit ca2078f

Author: Copybara

Copybara import of gpu-recipes:

- 3836cc5d728b8e754f66129f6d29119d1fec5c94 Merge "Adding 2048 gbs runs to Llama-3.1-70B Maxtext" into main
- a49b179dd786cc1dc2e74f120511df9c966aad83 Merge "Adding 32 nodes run to Llama-3.1-405B" into main
- fed82d961cc89c89d1e11614a4aada8ee3244ddf Merge "Remove the training recipe of A3ultra MaxText Mixt...

GitOrigin-RevId: fed82d961cc89c89d1e11614a4aada8ee3244ddf

1 parent 9b1c797 · commit ca2078f

File tree: 11 files changed, +113 -372 lines


README.md

Lines changed: 0 additions & 1 deletion

```diff
@@ -32,7 +32,6 @@ Models | GPU Machine Type
 **Llama-3.1-70B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | MaxText | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-70b/maxtext-pretraining-gke/README.md)
 **Llama-3.1-70B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-70b/nemo-pretraining-gke/README.md)
 **Llama-3.1-405B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | MaxText | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-405b/maxtext-pretraining-gke/README.md)
-**Mixtral-8-7B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | MaxText | Pre-training | GKE | [Link](./training/a3ultra/mixtral-8x7b/maxtext-pretraining-gke/README.md)
 **Mixtral-8-7B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/mixtral-8x7b/nemo-pretraining-gke/README.md)
 
 ### Inference benchmarks A3 Mega
```
src/frameworks/a3ultra/maxtext-configs/llama-3.1-405b-256gpus-a3u-fp8.yaml

Lines changed: 17 additions & 0 deletions

```diff
@@ -0,0 +1,17 @@
+hardware: gpu
+dcn_fsdp_parallelism: 32
+ici_fsdp_parallelism: 8
+per_device_batch_size: 2
+max_target_length: 8192
+learning_rate: 0.001
+model_name: llama3.1-405b
+enable_checkpointing: false
+quantization: fp8
+attention: cudnn_flash_te
+remat_policy: full
+use_iota_embed: true
+dataset_type: synthetic
+logits_dot_in_fp32: false
+enable_goodput_recording: false
+monitor_goodput: false
+save_config_to_gcs: true
```
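As a quick sanity check on this topology (a minimal sketch, not part of the recipe, assuming MaxText's usual convention that the device count is the product of the DCN and ICI parallelism dimensions):

```bash
# Sketch: verify the parallelism product matches the intended cluster shape.
# Assumes total GPUs = dcn_fsdp_parallelism * ici_fsdp_parallelism (MaxText layout).
dcn_fsdp=32; ici_fsdp=8; per_device_batch=2
gpus=$((dcn_fsdp * ici_fsdp))      # 32 * 8 = 256 GPUs, i.e. 32 a3-ultragpu-8g nodes
gbs=$((gpus * per_device_batch))   # 256 * 2 = 512 sequences per global batch
echo "GPUs: ${gpus}, global batch size: ${gbs}"
```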
src/frameworks/a3ultra/maxtext-configs/llama-3.1-70b-1024gpus-a3u-fp8-gbs2048.yaml

Lines changed: 19 additions & 0 deletions

```diff
@@ -0,0 +1,19 @@
+hardware: gpu
+dcn_data_parallelism: 16
+dcn_fsdp_parallelism: 8
+ici_fsdp_parallelism: 8
+per_device_batch_size: 2
+max_target_length: 8192
+learning_rate: 0.001
+model_name: llama3.1-70b
+enable_checkpointing: false
+quantization: fp8
+attention: cudnn_flash_te
+remat_policy: save_dot_except_mlp
+use_iota_embed: true
+scan_layers: true
+dataset_type: synthetic
+logits_dot_in_fp32: false
+enable_goodput_recording: false
+monitor_goodput: false
+save_config_to_gcs: true
```
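Under the same product-of-parallelisms reading, this config spans 16 (dcn_data) × 8 (dcn_fsdp) × 8 (ici_fsdp) = 1024 GPUs, and with per_device_batch_size: 2 the global batch size comes to 1024 × 2 = 2048, matching the gbs2048 suffix used in the helm commands further below.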
src/frameworks/a3ultra/maxtext-configs/llama-3.1-70b-512gpus-a3u-fp8-gbs2048.yaml

Lines changed: 19 additions & 0 deletions

```diff
@@ -0,0 +1,19 @@
+hardware: gpu
+dcn_data_parallelism: 4
+dcn_fsdp_parallelism: 16
+ici_fsdp_parallelism: 8
+per_device_batch_size: 4
+max_target_length: 8192
+learning_rate: 0.001
+model_name: llama3.1-70b
+enable_checkpointing: false
+quantization: fp8
+attention: cudnn_flash_te
+remat_policy: save_out_proj
+use_iota_embed: true
+scan_layers: true
+dataset_type: synthetic
+logits_dot_in_fp32: false
+enable_goodput_recording: false
+monitor_goodput: false
+save_config_to_gcs: true
```
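Likewise, 4 × 16 × 8 = 512 GPUs with per_device_batch_size: 4 again yields a global batch of 2048: the doubled per-device batch compensates for half the GPUs. Note that remat_policy also changes from save_dot_except_mlp to save_out_proj, plausibly trading recomputation for the larger per-device activation footprint.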

src/frameworks/a3ultra/maxtext-configs/mixtral-8x7b-256gpus-a3u-bf16.yaml

Lines changed: 0 additions & 47 deletions
This file was deleted.

src/helm-charts/a3ultra/maxtext-training/templates/maxtext-launcher-job.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -220,7 +220,7 @@ spec:
         - name: TF_CPP_MAX_LOG_LEVEL
           value: "100"
         - name: XLA_PYTHON_CLIENT_MEM_FRACTION
-          value: "0.92"
+          value: "0.98"
         - name: CUDA_DEVICE_MAX_CONNECTIONS
           value: "1"
         - name: NVTE_FUSED_ATTN
```
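For context, XLA_PYTHON_CLIENT_MEM_FRACTION sets the fraction of each GPU's memory that JAX preallocates for the XLA client at startup, so the container now claims 98% rather than 92% of device memory. A minimal local equivalent (the entry point here is hypothetical and stands in for the chart's launcher):

```bash
# JAX preallocates this fraction of each GPU's memory for the XLA client.
# At 0.98, everything else (NCCL buffers, CUDA context) must fit in the rest.
export XLA_PYTHON_CLIENT_MEM_FRACTION=0.98
python train.py  # hypothetical script; the chart sets the variable on the pod instead
```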

training/a3ultra/llama-3.1-405b/maxtext-pretraining-gke/README.md

Lines changed: 20 additions & 2 deletions

````diff
@@ -19,7 +19,7 @@ For this recipe, the following setup is used:
 
 This recipe has been optimized for and tested with the following configuration:
 
-- A cluster with 64, 96 or 128 [a3-ultragpu-8g](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) machines.
+- A cluster with 32, 64, 96 or 128 [a3-ultragpu-8g](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) machines.
 - Machine placement in the cluster is configured using a [compact placement policy](https://cloud.google.com/kubernetes-engine/docs/how-to/compact-placement)
 - MaxText docker container
 - FP8 precision training
@@ -32,7 +32,7 @@ This recipe has been optimized for and tested with the following configuration:
 Before running this recipe, ensure your environment is configured as follows:
 
 - A GKE cluster with the following setup:
-  - An A3 Ultra node pool (64 nodes - 512 GPUs, 96 nodes - 768 GPUs or 128 nodes - 1024 GPUs)
+  - An A3 Ultra node pool (32 nodes - 256 GPUs, 64 nodes - 512 GPUs, 96 nodes - 768 GPUs or 128 nodes - 1024 GPUs)
   - Topology-aware scheduling enabled
 - An Artifact Registry repository to store the Docker image.
 - A Google Cloud Storage (GCS) bucket to store results.
@@ -140,6 +140,24 @@ To build the container, complete the following steps from your client:
 
 ### Configure and submit a pretraining job
 
+#### Using 32 nodes (256 GPUs)
+
+The default job setting is 15 training steps and fp8 precision. To execute the job with the
+default settings, run the following command from your client:
+
+```bash
+cd $RECIPE_ROOT
+helm install -f values.yaml \
+    --set-file maxtext_config=$REPO_ROOT/src/frameworks/a3ultra/maxtext-configs/llama-3.1-405b-256gpus-a3u-fp8.yaml \
+    --set workload.image=${ARTIFACT_REGISTRY}/maxtext-benchmark \
+    --set workload.run_name=$USER-llama-3-1-405b-maxtext-fp8 \
+    --set workload.gpus=256 \
+    --set queue=$KUEUE_NAME \
+    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+    $USER-llama-3-1-405b-maxtext-fp8 \
+    $REPO_ROOT/src/helm-charts/a3ultra/maxtext-training
+```
+
 #### Using 64 nodes (512 GPUs)
 
 The default job setting is 15 training steps and fp8 precision. To execute the job with the
````
training/a3ultra/llama-3.1-70b/maxtext-pretraining-gke/README.md

Lines changed: 36 additions & 0 deletions

````diff
@@ -204,6 +204,42 @@ helm install -f values.yaml \
     $REPO_ROOT/src/helm-charts/a3ultra/maxtext-training
 ```
 
+#### 64 nodes (512 GPUs) global batch size 2048
+
+The default job setting is 50 training steps and fp8 precision. To execute the job with the
+default settings, run the following command from your client:
+
+```bash
+cd $RECIPE_ROOT
+helm install -f values.yaml \
+    --set-file maxtext_config=$REPO_ROOT/src/frameworks/a3ultra/maxtext-configs/llama-3.1-70b-512gpus-a3u-fp8-gbs2048.yaml \
+    --set workload.image=${ARTIFACT_REGISTRY}/maxtext-benchmark \
+    --set workload.run_name=$USER-llama-3-1-70b-maxtext-fp8-64nodes-2048 \
+    --set workload.gpus=512 \
+    --set queue=$KUEUE_NAME \
+    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+    $USER-llama-3-1-70b-maxtext-fp8-64nodes-2048 \
+    $REPO_ROOT/src/helm-charts/a3ultra/maxtext-training
+```
+
+#### 128 nodes (1024 GPUs) global batch size 2048
+
+The default job setting is 50 training steps and fp8 precision. To execute the job with the
+default settings, run the following command from your client:
+
+```bash
+cd $RECIPE_ROOT
+helm install -f values.yaml \
+    --set-file maxtext_config=$REPO_ROOT/src/frameworks/a3ultra/maxtext-configs/llama-3.1-70b-1024gpus-a3u-fp8-gbs2048.yaml \
+    --set workload.image=${ARTIFACT_REGISTRY}/maxtext-benchmark \
+    --set workload.run_name=$USER-llama-3-1-70b-maxtext-fp8-128nodes-2048 \
+    --set workload.gpus=1024 \
+    --set queue=$KUEUE_NAME \
+    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+    $USER-llama-3-1-70b-maxtext-fp8-128nodes-2048 \
+    $REPO_ROOT/src/helm-charts/a3ultra/maxtext-training
+```
+
 #### Configure job settings
 
 **Examples**
````
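Both new run shapes target the same 2048-sequence global batch, so throughput comparisons between them isolate scaling behavior: the 64-node variant reaches it with per_device_batch_size: 4 on 512 GPUs, while the 128-node variant uses per_device_batch_size: 2 on 1024 GPUs (per the configs added above). Pick whichever matches the node pool provisioned in your cluster.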

training/a3ultra/llama-3.1-70b/maxtext-pretraining-gke/values.yaml

Lines changed: 1 addition & 0 deletions

```diff
@@ -28,6 +28,7 @@ workload:
   steps: 50
 
 network:
+  gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.0.3
   subnetworks[]:
   ncclSettings:
     - name: NCCL_DEBUG
```
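Reading from the image path, gibVersion pins the gpudirect-gib NCCL plugin image (here v1.0.3) that the chart wires into the workload's networking on A3 Ultra, rather than leaving the version to a chart default; this is an inference from the path and chart structure, not something the diff states.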
