
Commit 78012f9

Copybara authored and committed
Copybara import of gpu-recipes:
- 8451cae1df3a4f33deb76d5164ee65b81dbadb2e Merge "Adding Mixtral-8x7B Nemo pretraining recipe" into ...
  GitOrigin-RevId: 8451cae1df3a4f33deb76d5164ee65b81dbadb2e
1 parent d9e8da3 · commit 78012f9

File tree

3 files changed: 30 additions & 6 deletions

3 files changed

+30
-6
lines changed

src/frameworks/a3ultra/nemo-configs/mixtral-8x7b-256gpus-a3u-bf16.yaml

Lines changed: 3 additions & 3 deletions
@@ -13,9 +13,9 @@ trainer:
   max_steps: 30
   max_time: 05:23:30:00
   log_every_n_steps: 1
-  val_check_interval: 50
-  limit_val_batches: 32
-  limit_test_batches: 50
+  val_check_interval: 32
+  limit_val_batches: 0
+  limit_test_batches: 5
   accumulate_grad_batches: 1
   gradient_clip_val: 1.0
   enable_model_summary: false
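As a quick sanity check of this change, the three adjusted validation keys can be read back from the config; a minimal sketch, assuming it is run from the repository root:

```bash
# Show the validation settings this commit tightens
grep -E 'val_check_interval|limit_val_batches|limit_test_batches' \
  src/frameworks/a3ultra/nemo-configs/mixtral-8x7b-256gpus-a3u-bf16.yaml
```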

training/a3ultra/mixtral-8x7b/nemo-pretraining-gke/README.md

Lines changed: 26 additions & 3 deletions
@@ -20,7 +20,7 @@ For this recipe, the following setup is used:
 
 This recipe has been optimized for and tested with the following configuration:
 
-- A cluster with 32 [a3-ultragpu-8g](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) machines
+- A cluster with 32 or 64 [a3-ultragpu-8g](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) machines
 - Machine placement in the cluster is configured using a [compact placement policy](https://cloud.google.com/kubernetes-engine/docs/how-to/compact-placement)
 - [NVIDIA NeMo NGC container image](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags): 24.07
 - BF16 precision training
@@ -33,7 +33,7 @@ This recipe has been optimized for and tested with the following configuration:
 Before running this recipe, ensure your environment is configured as follows:
 
 - A GKE cluster with the following setup:
-  - An A3 Ultra node pool (32 nodes, 256 GPUs)
+  - An A3 Ultra node pool (32 nodes - 256 GPUs or 64 nodes - 512 GPUs)
   - Topology-aware scheduling enabled
 - A Google Cloud Storage (GCS) bucket to store results.
   *Important: This bucket must be in the same region as the GKE cluster*.
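The bucket prerequisite above can be satisfied with a one-liner; a hedged sketch (the `--location` value is a placeholder and must be changed to match your cluster's region):

```bash
# Create the results bucket in the same region as the GKE cluster
# (us-central1 is a placeholder; substitute your cluster's actual region)
gcloud storage buckets create gs://${GCS_BUCKET} --location=us-central1
```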
@@ -112,6 +112,8 @@ This image is based on NVIDIA NeMo 24.07 and contains the NCCL gIB plugin v1.0.3
 
 ### Configure and submit a pretraining job
 
+#### Using 32 nodes (256 GPUs)
+
 The default job setting is 30 training steps and bf16 precision. To execute the job with the
 default settings, run the following command from your client:
 
@@ -127,6 +129,24 @@ helm install -f values.yaml \
 $REPO_ROOT/src/helm-charts/a3ultra/nemo-training
 ```
 
+#### Using 64 nodes (512 GPUs)
+
+The default job setting is 30 training steps and bf16 precision. To execute the job with the
+default settings, run the following command from your client:
+
+```bash
+cd $RECIPE_ROOT
+helm install -f values.yaml \
+--set-file nemo_config=$REPO_ROOT/src/frameworks/a3ultra/nemo-configs/mixtral-8x7b-256gpus-a3u-bf16.yaml \
+--set workload.image=us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo24.07-gib1.0.3-A3U \
+--set clusterName=$CLUSTER_NAME \
+--set queue=${KUEUE_NAME} \
+--set workload.gpus=512 \
+--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+$USER-mixtral-8x7b-nemo-512 \
+$REPO_ROOT/src/helm-charts/a3ultra/nemo-training
+```
+
 #### Configure job settings
 
 You can overwrite any of the default
@@ -157,6 +177,8 @@ To check the status of pods in the indexed job, run the following command from your client:
 
 ```
 kubectl get pods | grep $USER-mixtral-8x7b-nemo
+kubectl get pods | grep $USER-mixtral-8x7b-nemo-512
+
 ```
 
 To get the logs for one of the pods, run the following command from your client:
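The log-fetch command itself sits outside this hunk; as a hypothetical illustration with standard kubectl (the pod name below is an example, not taken from the recipe):

```bash
# Example only: fetch logs from one pod of the indexed job,
# using a pod name reported by the `kubectl get pods` command above
kubectl logs $USER-mixtral-8x7b-nemo-512-0
```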
@@ -257,7 +279,7 @@ following steps command from your client:
 ```
 
 **Note:** The `batch_size`, `num_accelerators`, `precision`, `model_type` and `accelerator_type` are the
-specific values for this recipe running the default configuration. Average step time
+specific values for this recipe running the default configuration with 32 nodes. Average step time
 is computed by default using the steps 10 to 30.
 
 For more detailed information and advanced usage instructions of this tool,
@@ -270,6 +292,7 @@ To uninstall Helm, run the following command from your client:
 
 ```bash
 helm uninstall $USER-mixtral-8x7b-nemo
+helm uninstall $USER-mixtral-8x7b-nemo-512
 ```
 
 ### Running the recipe on a cluster that does not use the default configuration.
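Before committing to a full 512-GPU run, the rendered manifests can be inspected without creating any cluster resources; a minimal sketch using Helm's standard dry-run mode (the flags mirror the 64-node command above, and any `--set` values omitted here fall back to the chart defaults):

```bash
cd $RECIPE_ROOT
# Render the chart locally to review the generated Job spec before installing
helm install --dry-run --debug -f values.yaml \
--set-file nemo_config=$REPO_ROOT/src/frameworks/a3ultra/nemo-configs/mixtral-8x7b-256gpus-a3u-bf16.yaml \
--set clusterName=$CLUSTER_NAME \
--set queue=${KUEUE_NAME} \
--set workload.gpus=512 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
$USER-mixtral-8x7b-nemo-512 \
$REPO_ROOT/src/helm-charts/a3ultra/nemo-training
```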

training/a3ultra/mixtral-8x7b/nemo-pretraining-gke/values.yaml

Lines changed: 1 addition & 0 deletions
@@ -34,6 +34,7 @@ workload:
   experiment_name: "megatron_gpt"
 
 network:
+  gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.0.3
   subnetworks[]:
 ncclSettings:
   - name: NCCL_DEBUG
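Because `network.gibVersion` is now an ordinary chart value, it can also be overridden per release at install time without editing values.yaml; a hedged sketch (the image tag shown is the one this commit pins):

```bash
# Pin a specific NCCL gIB plugin image for a single release
helm install -f values.yaml \
--set network.gibVersion=us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.0.3 \
$USER-mixtral-8x7b-nemo \
$REPO_ROOT/src/helm-charts/a3ultra/nemo-training
```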
