training/a3ultra/mixtral-8x7b/nemo-pretraining-gke/README.md
Lines changed: 26 additions & 3 deletions
@@ -20,7 +20,7 @@ For this recipe, the following setup is used:
 
 This recipe has been optimized for and tested with the following configuration:
 
-- A cluster with 32 [a3-ultragpu-8g](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) machines
+- A cluster with 32 or 64 [a3-ultragpu-8g](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) machines
 - Machine placement in the cluster is configured using a [compact placement policy](https://cloud.google.com/kubernetes-engine/docs/how-to/compact-placement)
 - [NVIDIA NeMo NGC container image](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags): 24.07
 - BF16 precision training
@@ -33,7 +33,7 @@ This recipe has been optimized for and tested with the following configuration:
 Before running this recipe, ensure your environment is configured as follows:
 
 - A GKE cluster with the following setup:
-    - An A3 Ultra node pool (32 nodes, 256 GPUs)
+    - An A3 Ultra node pool (32 nodes - 256 GPUs or 64 nodes - 512 GPUs)
     - Topology-aware scheduling enabled
 - A Google Cloud Storage (GCS) bucket to store results.
   *Important: This bucket must be in the same region as the GKE cluster*.
@@ -112,6 +112,8 @@ This image is based on NVIDIA NeMo 24.07 and contains the NCCL gIB plugin v1.0.3
 
 ### Configure and submit a pretraining job
 
+#### Using 32 nodes (256 GPUs)
+
 The default job setting is 30 training steps and bf16 precision. To execute the job with the
 default settings, run the following command from your client: