
Commit 78012f9

Copybara authored and committed
Copybara import of gpu-recipes:
- 8451cae1df3a4f33deb76d5164ee65b81dbadb2e Merge "Adding Mixtral-8x7B Nemo pretraining recipe" into ...
  GitOrigin-RevId: 8451cae1df3a4f33deb76d5164ee65b81dbadb2e
1 parent d9e8da3 · commit 78012f9

File tree

3 files changed: 30 additions & 6 deletions

3 files changed

+30
-6
lines changed

src/frameworks/a3ultra/nemo-configs/mixtral-8x7b-256gpus-a3u-bf16.yaml

Lines changed: 3 additions & 3 deletions
@@ -13,9 +13,9 @@ trainer:
   max_steps: 30
   max_time: 05:23:30:00
   log_every_n_steps: 1
-  val_check_interval: 50
-  limit_val_batches: 32
-  limit_test_batches: 50
+  val_check_interval: 32
+  limit_val_batches: 0
+  limit_test_batches: 5
   accumulate_grad_batches: 1
   gradient_clip_val: 1.0
   enable_model_summary: false
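As a quick sanity check of this change, the three adjusted validation keys can be read back from the config; a minimal sketch, assuming it is run from the repository root:

```bash
# Show the validation settings this commit tightens
grep -E 'val_check_interval|limit_val_batches|limit_test_batches' \
  src/frameworks/a3ultra/nemo-configs/mixtral-8x7b-256gpus-a3u-bf16.yaml
```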

training/a3ultra/mixtral-8x7b/nemo-pretraining-gke/README.md

Lines changed: 26 additions & 3 deletions
@@ -20,7 +20,7 @@ For this recipe, the following setup is used:
 
 This recipe has been optimized for and tested with the following configuration:
 
-- A cluster with 32 [a3-ultragpu-8g](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) machines
+- A cluster with 32 or 64 [a3-ultragpu-8g](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) machines
 - Machine placement in the cluster is configured using a [compact placement policy](https://cloud.google.com/kubernetes-engine/docs/how-to/compact-placement)
 - [NVIDIA NeMo NGC container image](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags): 24.07
 - BF16 precision training
@@ -33,7 +33,7 @@ This recipe has been optimized for and tested with the following configuration:
 Before running this recipe, ensure your environment is configured as follows:
 
 - A GKE cluster with the following setup:
-  - An A3 Ultra node pool (32 nodes, 256 GPUs)
+  - An A3 Ultra node pool (32 nodes - 256 GPUs or 64 nodes - 512 GPUs)
   - Topology-aware scheduling enabled
 - A Google Cloud Storage (GCS) bucket to store results.
   *Important: This bucket must be in the same region as the GKE cluster*.
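The bucket prerequisite above can be satisfied with a one-liner; a hedged sketch (the `--location` value is a placeholder and must be changed to match your cluster's region):

```bash
# Create the results bucket in the same region as the GKE cluster
# (us-central1 is a placeholder; substitute your cluster's actual region)
gcloud storage buckets create gs://${GCS_BUCKET} --location=us-central1
```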
@@ -112,6 +112,8 @@ This image is based on NVIDIA NeMo 24.07 and contains the NCCL gIB plugin v1.0.3
 
 ### Configure and submit a pretraining job
 
+#### Using 32 nodes (256 GPUs)
+
 The default job setting is 30 training steps and bf16 precision. To execute the job with the
 default settings, run the following command from your client:
 
@@ -127,6 +129,24 @@ helm install -f values.yaml \
 $REPO_ROOT/src/helm-charts/a3ultra/nemo-training
 ```
 
+#### Using 64 nodes (512 GPUs)
+
+The default job setting is 30 training steps and bf16 precision. To execute the job with the
+default settings, run the following command from your client:
+
+```bash
+cd $RECIPE_ROOT
+helm install -f values.yaml \
+--set-file nemo_config=$REPO_ROOT/src/frameworks/a3ultra/nemo-configs/mixtral-8x7b-256gpus-a3u-bf16.yaml \
+--set workload.image=us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo24.07-gib1.0.3-A3U \
+--set clusterName=$CLUSTER_NAME \
+--set queue=${KUEUE_NAME} \
+--set workload.gpus=512 \
+--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+$USER-mixtral-8x7b-nemo-512 \
+$REPO_ROOT/src/helm-charts/a3ultra/nemo-training
+```
+
 #### Configure job settings
 
 You can overwrite any of the default
@@ -157,6 +177,8 @@ To check the status of pods in the indexed job, run the following command from your client:
 
 ```
 kubectl get pods | grep $USER-mixtral-8x7b-nemo
+kubectl get pods | grep $USER-mixtral-8x7b-nemo-512
+
 ```
 
 To get the logs for one of the pods, run the following command from your client:
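The log-fetch command itself sits outside this hunk; as a hypothetical illustration with standard kubectl (the pod name below is an example, not taken from the recipe):

```bash
# Example only: fetch logs from one pod of the indexed job,
# using a pod name reported by the `kubectl get pods` command above
kubectl logs $USER-mixtral-8x7b-nemo-512-0
```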
@@ -257,7 +279,7 @@ following steps command from your client:
 ```
 
 **Note:** The `batch_size`, `num_accelerators`, `precision`, `model_type` and `accelerator_type` are the
-specific values for this recipe running the default configuration. Average step time
+specific values for this recipe running the default configuration with 32 nodes. Average step time
 is computed by default using the steps 10 to 30.
 
 For more detailed information and advanced usage instructions of this tool,
@@ -270,6 +292,7 @@ To uninstall Helm, run the following command from your client:
 
 ```bash
 helm uninstall $USER-mixtral-8x7b-nemo
+helm uninstall $USER-mixtral-8x7b-nemo-512
 ```
 
 ### Running the recipe on a cluster that does not use the default configuration.
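Before committing to a full 512-GPU run, the rendered manifests can be inspected without creating any cluster resources; a minimal sketch using Helm's standard dry-run mode (the flags mirror the 64-node command above, and any `--set` values omitted here fall back to the chart defaults):

```bash
cd $RECIPE_ROOT
# Render the chart locally to review the generated Job spec before installing
helm install --dry-run --debug -f values.yaml \
--set-file nemo_config=$REPO_ROOT/src/frameworks/a3ultra/nemo-configs/mixtral-8x7b-256gpus-a3u-bf16.yaml \
--set clusterName=$CLUSTER_NAME \
--set queue=${KUEUE_NAME} \
--set workload.gpus=512 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
$USER-mixtral-8x7b-nemo-512 \
$REPO_ROOT/src/helm-charts/a3ultra/nemo-training
```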

training/a3ultra/mixtral-8x7b/nemo-pretraining-gke/values.yaml

Lines changed: 1 addition & 0 deletions
@@ -34,6 +34,7 @@ workload:
   experiment_name: "megatron_gpt"
 
 network:
+  gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.0.3
   subnetworks[]:
 ncclSettings:
   - name: NCCL_DEBUG
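Because `network.gibVersion` is now an ordinary chart value, it can also be overridden per release at install time without editing values.yaml; a hedged sketch (the image tag shown is the one this commit pins):

```bash
# Pin a specific NCCL gIB plugin image for a single release
helm install -f values.yaml \
--set network.gibVersion=us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.0.3 \
$USER-mixtral-8x7b-nemo \
$REPO_ROOT/src/helm-charts/a3ultra/nemo-training
```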
