Commit abc0e44 (parent: 44baf27)

feat: Update Llama 3.1 405B recipe for 64-GPU training

4 files changed: +15 −3 lines


training/a4x/llama3-1-405b/README.md

Lines changed: 9 additions & 2 deletions
@@ -22,6 +22,13 @@ This recipe has been optimized for and tested with the following configuration:
 Please follow Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4x)
 to create your a4x GKE cluster.
 
+> [!NOTE]
+> **GKE version and workload placement**
+>
+> For GKE cluster versions `1.34.0-gke.1502000` and later, workload placement is mandatory. You must provide your own placement policy name by editing `values.yaml` to set `workload.nodeSelector.cloud.google.com/placement-policy-name`.
+>
+> For GKE cluster versions before `1.34.0-gke.1502000`, you can remove the `nodeSelector` section in `values.yaml`.
+
 ## Training dataset
 
 This recipe uses a mock pretraining dataset provided by the NeMo framework.
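The new note asks for a user-supplied placement-policy name in `values.yaml`. If you would rather not edit the file, the same key can be overridden at install time; a minimal sketch, where `my-placement-policy` is a hypothetical policy name and the dots inside the label key are backslash-escaped so Helm treats it as a single map key:

```shell
helm install $WORKLOAD_NAME . -f values.yaml \
    --set-file workload_launcher=launcher.sh \
    --set-file workload_config=llama3-1-405b-fp8cs-gbs2048-gpus64.py \
    --set workload.image=nvcr.io/nvidia/nemo:25.07 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    --set volumes.gcsMounts[0].mountPath=/job-logs \
    --set 'workload.nodeSelector.cloud\.google\.com/placement-policy-name=my-placement-policy'
```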
@@ -92,7 +99,7 @@ your client:
 export WORKLOAD_NAME=$USER-a4x-llama3-1-405b
 helm install $WORKLOAD_NAME . -f values.yaml \
     --set-file workload_launcher=launcher.sh \
-    --set-file workload_config=llama3-1-405b-fp8cs-gbs2048.py \
+    --set-file workload_config=llama3-1-405b-fp8cs-gbs2048-gpus64.py \
     --set workload.image=nvcr.io/nvidia/nemo:25.07 \
     --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
     --set volumes.gcsMounts[0].mountPath=/job-logs \
@@ -110,7 +117,7 @@ your client:
 export WORKLOAD_NAME=$USER-a4x-llama3-1-405b
 helm install $WORKLOAD_NAME . -f values.yaml \
     --set-file workload_launcher=launcher.sh \
-    --set-file workload_config=llama3-1-405b-fp8cs-gbs2048.py \
+    --set-file workload_config=llama3-1-405b-fp8cs-gbs2048-gpus64.py \
     --set workload.image=nvcr.io/nvidia/nemo:25.07 \
     --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
     --set volumes.gcsMounts[0].mountPath=/job-logs \
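Before installing, the effect of the renamed config and the new `nodeSelector` can be checked offline with `helm template`, which renders the chart without touching the cluster; a quick sketch using the same flags as above:

```shell
helm template $WORKLOAD_NAME . -f values.yaml \
    --set-file workload_launcher=launcher.sh \
    --set-file workload_config=llama3-1-405b-fp8cs-gbs2048-gpus64.py \
    --set workload.image=nvcr.io/nvidia/nemo:25.07 \
  | grep -A 2 'nodeSelector:'
```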

training/a4x/llama3-1-405b/llama3-1-405b-fp8cs-gbs2048.py → training/a4x/llama3-1-405b/llama3-1-405b-fp8cs-gbs2048-gpus64.py

File renamed without changes.

training/a4x/llama3-1-405b/templates/workload-job.yaml

Lines changed: 2 additions & 0 deletions
@@ -96,6 +96,8 @@ spec:
 {{- end }}
 {{- end }}
           spec:
+            nodeSelector:
+            {{- toYaml .Values.workload.nodeSelector | nindent 14 }}
 {{- if $root.Values.network.hostNetwork }}
             hostNetwork: true
             dnsPolicy: ClusterFirstWithHostNet
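With the defaults this commit adds to `values.yaml`, that template fragment renders to something like the following (a sketch of the rendered Pod spec; surrounding fields and exact indentation depend on the rest of the template):

```yaml
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-gb200
    cloud.google.com/placement-policy-name: a4x-workload-policy-95cbc61c
```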

training/a4x/llama3-1-405b/values.yaml

Lines changed: 4 additions & 1 deletion
@@ -30,6 +30,9 @@ volumes:
   psVolumes: false
   ssdMountPath: "/ssd"
 workload:
+  nodeSelector:
+    cloud.google.com/gke-accelerator: nvidia-gb200
+    cloud.google.com/placement-policy-name: a4x-workload-policy-95cbc61c
   arguments[]: null
   configFile: llama3-1-405b-fp8cs-gbs2048-gpus64.py
   configPath: /workload/configs/
@@ -41,7 +44,7 @@ workload:
     - --compute_dtype=fp8
     - --fp8_recipe=cs
     - --global_batch_size=2048
-    - --max_steps=5
+    - --max_steps=30
     - --micro_batch_size=1
     - --tensor_parallel_size=2
     - --context_parallel_size=1
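For orientation, the batch arguments above have to be consistent with the 64-GPU layout. A minimal sketch of the arithmetic, assuming a pipeline-parallel size of 8 (an assumption; only TP=2 and CP=1 appear in this diff):

```python
# Sketch of how the 64-GPU parallel layout relates to the batch arguments.
# Tensor/context parallel sizes and batch sizes come from values.yaml above;
# pipeline_parallel_size is an ASSUMPTION (it does not appear in this diff).
num_gpus = 64
tensor_parallel_size = 2    # --tensor_parallel_size=2
context_parallel_size = 1   # --context_parallel_size=1
pipeline_parallel_size = 8  # assumed, not in the diff
micro_batch_size = 1        # --micro_batch_size=1
global_batch_size = 2048    # --global_batch_size=2048

model_parallel = tensor_parallel_size * context_parallel_size * pipeline_parallel_size
data_parallel = num_gpus // model_parallel  # 64 // 16 = 4 replicas
# Gradient-accumulation steps needed to reach the global batch:
grad_accum = global_batch_size // (data_parallel * micro_batch_size)  # 512
print(f"data_parallel={data_parallel}, grad_accum_steps={grad_accum}")
```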
