Commit ce84f75

A4 Llama 3.1 70B recipe on NeMo 2.0 with GCSFuse storage (#37)
* A4 Llama 3.1 70B recipe on NeMo 2.0 with GCSFuse storage
1 parent 0017ec3 commit ce84f75

File tree

10 files changed: +1058 -10 lines changed


src/helm-charts/storage/gcs-fuse/templates/pv.yaml

Lines changed: 16 additions & 10 deletions

```diff
@@ -29,13 +29,11 @@ spec:
     namespace: {{ default "default" $gcs.namespace }}
   {{- if eq $gcs.type "data" }}
   mountOptions:
-    - metadata-cache:ttl-secs:-1
+    - implicit-dirs # Create implicit directories locally when accessed
+    - metadata-cache:negative-ttl-secs:0 # Disable caching for lookups of files/dirs that don't exist
+    - metadata-cache:ttl-secs:-1 # Keep cached metadata (file attributes, types) indefinitely time-wise
     - metadata-cache:stat-cache-max-size-mb:-1
     - metadata-cache:type-cache-max-size-mb:-1
-    - file-system:kernel-list-cache-ttl-secs:-1
-    - file-cache:max-size-mb:-1
-    - file-cache:enable-parallel-downloads:true
-    - write:enable-streaming-writes:true
   {{- if $gcs.dirPath }}
     - only-dir:{{ $gcs.dirPath }}
   {{- end }}
@@ -47,14 +45,22 @@ spec:
     gcsfuseMetadataPrefetchOnMount: "true"
   {{- else if eq $gcs.type "checkpoints" }}
   mountOptions:
-    - metadata-cache:ttl-secs:-1
-    - metadata-cache:negative-ttl-secs:0
+    - implicit-dirs
+    - metadata-cache:negative-ttl-secs:0
+    - metadata-cache:ttl-secs:-1
     - metadata-cache:stat-cache-max-size-mb:-1
     - metadata-cache:type-cache-max-size-mb:-1
-    - file-cache:max-size-mb:-1
-    - file-cache:enable-parallel-downloads:true
-    - file-system:kernel-list-cache-ttl-secs:0
     - write:enable-streaming-writes:true
+    # This workaround is for gcsfuse v3.5.0 and below versions.
+    # Earlier GCSFuse versions do not recognize a4-highgpu-8g as a high-performance machine.
+    # Setting machine-type to a3-highgpu-8g increases the global-max-block, which is
+    # crucial for streaming writes. Without a higher global-max-block, streaming writes
+    # would fall back to staged writes, resulting in slower write performance.
+    # For gcsfuse v3.5.0 and later, `machine-type:a3-highgpu-8g` can be commented out.
+    - machine-type:a3-highgpu-8g
+    - file-cache:max-size-mb:-1
+    - file-cache:cache-file-for-range-read:true
+    - file-cache:enable-parallel-downloads:true
   {{- if $gcs.dirPath }}
     - only-dir:{{ $gcs.dirPath }}
   {{- end }}
```
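
If you want to see how these mount options render before creating any resources, you can render the chart locally with `helm template`. This is a minimal sketch rather than part of the recipe: the release name and bucket names are placeholders, and `REPO_ROOT` is assumed to point at a local clone of `gpu-recipes` (the `gcsVolumes[...]` keys mirror the install command used later in the recipe).

```bash
# Render the gcs-fuse chart locally and inspect the generated PV mountOptions.
# REPO_ROOT and the bucket names below are placeholder assumptions, not recipe defaults.
helm template gcs-pv-preview "$REPO_ROOT/src/helm-charts/storage/gcs-fuse" \
  -f "$REPO_ROOT/src/helm-charts/storage/gcs-fuse/values.yaml" \
  --set gcsVolumes[0].bucketName=my-training-data-bucket \
  --set gcsVolumes[1].bucketName=my-checkpoints-bucket \
  | grep -A 25 "mountOptions:"
```
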
Lines changed: 20 additions & 0 deletions

```yaml
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v2
name: a4-jobset-workload
description: a4-jobset-workload
type: application
version: 0.1.0
appVersion: "1.16.0"
```
Lines changed: 267 additions & 0 deletions

<!-- mdformat global-off -->
# Pretrain llama3-1-70b-gpus128 workloads on A4 GKE Node pools with Nvidia NeMo Framework using Google Cloud Storage for training data and checkpoints

This recipe outlines the steps for running a llama3-1-70b-gpus128 pretraining workload on [A4 GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the [NVIDIA NeMo framework](https://github.com/NVIDIA/nemo).

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Pretraining job configuration and deployment - A Helm chart is used to configure and deploy the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset) resource, which manages the execution of the [NeMo pretraining workload](https://github.com/NVIDIA/nemo). The chart generates the job's manifest, adhering to best practices for running NCCL workloads on A4 GKE node pools, which includes setting optimal values for NVIDIA NCCL and the NCCL gIB plugin.

## Test environment

This recipe has been optimized for and tested with the following configuration:

- A standard GKE cluster:
  - GKE version: 1.33.5-gke.1162000 or later
  - A GPU node pool with 16 [a4-highgpu-8g](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-machine-type) machines
  - Workload Identity Federation for GKE enabled
  - Cloud Storage FUSE CSI driver enabled
  - DCGM metrics enabled
  - Kueue and JobSet APIs installed
  - Kueue configured to support Topology Aware Scheduling
- A regional Google Cloud Storage (GCS) bucket to store logs
- A regional Google Cloud Storage (GCS) bucket with [hierarchical](https://cloud.google.com/storage/docs/hns-overview) namespace to store the Pile dataset
- A regional Google Cloud Storage (GCS) bucket with [hierarchical](https://cloud.google.com/storage/docs/hns-overview) namespace to store checkpoints
- A client workstation with the following pre-installed:
  - Google Cloud SDK
  - Helm
  - kubectl

To prepare the required environment, see the [GKE environment setup guide](../../../../docs/configuring-environment-gke-a4.md).

**Important:** All GCS buckets must be in the same region as the GKE cluster.
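
Before moving on, you can optionally spot-check a few of the prerequisites above from your workstation. The commands below are a rough sketch: they assume `CLUSTER_NAME` and `CLUSTER_REGION` are already set (see the environment settings section later in this recipe) and that your `kubectl` context points at the target cluster; the CRD and CSI driver names shown are the standard ones for JobSet, Kueue, and the Cloud Storage FUSE CSI driver.

```bash
# Check the GKE control plane version.
gcloud container clusters describe "$CLUSTER_NAME" \
  --region "$CLUSTER_REGION" --format="value(currentMasterVersion)"

# Confirm the JobSet and Kueue APIs are installed.
kubectl get crd jobsets.jobset.x-k8s.io clusterqueues.kueue.x-k8s.io

# Confirm the Cloud Storage FUSE CSI driver is registered in the cluster.
kubectl get csidriver gcsfuse.csi.storage.gke.io
```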

## Training dataset

The recipe uses the [Pile dataset](https://pile.eleuther.ai/) converted to the NeMo memory map (mmap) format.

## Docker container images

This recipe uses the following Docker images:

- `nvcr.io/nvidia/nemo:25.07`
- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.0`

## Run the recipe

From your client workstation, complete the following steps:

### Configure environment settings

Set the environment variables to match your environment:

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET_LOGS=<GCS_BUCKET_LOGS>
export GCS_BUCKET_DATA=<GCS_BUCKET_DATA>
export GCS_BUCKET_CHECKPOINTS=<GCS_BUCKET_CHECKPOINTS>
export ENABLE_DATALOADING=<ENABLE_DATALOADING>
export ENABLE_CHECKPOINT_WRITE=<ENABLE_CHECKPOINT_WRITE>
export CHECKPOINT_WRITE_INTERVAL=<CHECKPOINT_WRITE_INTERVAL>
export ENABLE_CHECKPOINT_LOAD=<ENABLE_CHECKPOINT_LOAD>
export RESTORE_PATH=<RESTORE_PATH>
export TOKEN_PATH=<TOKEN_PATH>
export DATASET_PATH=<DATASET_PATH>
```

Replace the following values:

- `<PROJECT_ID>`: your Google Cloud project ID
- `<CLUSTER_REGION>`: the region where your cluster is located
- `<CLUSTER_NAME>`: the name of your GKE cluster
- `<GCS_BUCKET_LOGS>`: the name of a Cloud Storage bucket for logs. Do not include the `gs://` prefix
- `<GCS_BUCKET_DATA>`: the name of a Cloud Storage bucket for training data. Do not include the `gs://` prefix
- `<GCS_BUCKET_CHECKPOINTS>`: the name of a Cloud Storage bucket for checkpoints. Do not include the `gs://` prefix
- `<ENABLE_DATALOADING>`: set to `true` to use the real dataset for dataloading. Defaults to `false`
- `<ENABLE_CHECKPOINT_WRITE>`: set to `true` to enable checkpoint writes. Defaults to `false`
- `<CHECKPOINT_WRITE_INTERVAL>`: the step interval at which checkpoints are written
- `<ENABLE_CHECKPOINT_LOAD>`: set to `true` to enable checkpoint restore. Defaults to `false`
- `<RESTORE_PATH>`: the path to a specific checkpoint to restore from. The checkpoint bucket is mounted at `/checkpoints`, so the path must start with `/checkpoints`
- `<TOKEN_PATH>`: the path to the SentencePiece tokenizer model file
- `<DATASET_PATH>`: the path in the dataset bucket used for dataloading. The path should contain only the dataloading objects. The dataset bucket is mounted at `/data`, so the path must start with `/data`
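
As a point of reference, a filled-in configuration might look like the following. Every value here is a hypothetical placeholder rather than a default shipped with the recipe; substitute your own project, cluster, bucket names, and paths.

```bash
# Hypothetical example values; adjust for your environment.
export PROJECT_ID=my-gcp-project
export CLUSTER_REGION=us-central1
export CLUSTER_NAME=my-a4-cluster
export GCS_BUCKET_LOGS=my-a4-logs
export GCS_BUCKET_DATA=my-a4-training-data
export GCS_BUCKET_CHECKPOINTS=my-a4-checkpoints
export ENABLE_DATALOADING=true
export ENABLE_CHECKPOINT_WRITE=true
export CHECKPOINT_WRITE_INTERVAL=25
export ENABLE_CHECKPOINT_LOAD=false
export RESTORE_PATH=/checkpoints/my-previous-run/checkpoint-step-100  # only used when checkpoint load is enabled
export TOKEN_PATH=/data/tokenizer/tokenizer.model                     # path to your SentencePiece model under the /data mount
export DATASET_PATH=/data/data                                        # e.g. the "data" folder created in the upload step below
```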

Set the default project:

```bash
gcloud config set project $PROJECT_ID
```

### Upload the training dataset

The Pile dataset in the NVIDIA NeMo *mmap* format is staged in a public GCS bucket. You need to upload the dataset to your GCS bucket with hierarchical namespace enabled.

1. Create a folder for the dataset:

   ```
   gcloud storage folders create gs://${GCS_BUCKET_DATA}/data
   ```

2. Upload the dataset:

   ```
   gcloud storage cp gs://cloud-samples-data/third-party/pile/*.* gs://${GCS_BUCKET_DATA}/data
   ```
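
Optionally, confirm that the objects landed in the expected folder before you launch the workload; a quick check:

```bash
# List a few of the uploaded dataset objects.
gcloud storage ls gs://${GCS_BUCKET_DATA}/data/ | head
```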

### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder.

```
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4/llama3-1-70b/nemo-pretraining-gke/16node-bf16-seq8192-gbs512-gcs
cd $RECIPE_ROOT
```

### Get cluster credentials

```
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```

### Create Persistent Volumes and Persistent Volume Claims

The pretraining job accesses GCS buckets for training data and checkpoints through [the Cloud Storage FUSE CSI driver](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver) configured using Kubernetes Persistent Volumes (PV) and Persistent Volume Claims (PVC). You must generate PVs and PVCs for both data and checkpoint buckets using the [gcs-fuse helper Helm chart](../../../../src/helm-charts/storage/gcs-fuse). The chart configures the FUSE driver settings following the best practices for optimizing access to buckets for training data and checkpoints.

```
helm install -f $REPO_ROOT/src/helm-charts/storage/gcs-fuse/values.yaml \
    --set gcsVolumes[0].bucketName=${GCS_BUCKET_DATA} \
    --set gcsVolumes[1].bucketName=${GCS_BUCKET_CHECKPOINTS} \
    $USER-gcs-pv-pvc \
    $REPO_ROOT/src/helm-charts/storage/gcs-fuse
```
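
After the chart is installed, it is worth checking that the PersistentVolumes and PersistentVolumeClaims were created and have reached the `Bound` status. The exact resource names depend on the chart's values, so the sketch below simply lists everything:

```bash
# Both the data and checkpoint volumes should eventually report a Bound status.
kubectl get pv
kubectl get pvc
```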

### Configure and submit a pretraining job

#### Using 16 nodes (128 GPUs) with bf16-mixed precision

To execute the job with the default settings, run the following command from your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=a4-llama3-1-70b-gpus128
helm install $WORKLOAD_NAME . -f values.yaml \
    --set workload_launcher=launcher.sh \
    --set workload_config=llama3-1-70b.py \
    --set workload.image=nvcr.io/nvidia/nemo:25.07 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET_LOGS} \
    --set volumes.gcsMounts[0].mountPath=/job-logs
```

**Examples**

- To set the number of training steps to 100, run the following command from your client:

  ```bash
  cd $RECIPE_ROOT
  export WORKLOAD_NAME=a4-llama3-1-70b-gpus128
  helm install $WORKLOAD_NAME . -f values.yaml \
      --set workload_launcher=launcher.sh \
      --set workload_config=llama3-1-70b.py \
      --set workload.image=nvcr.io/nvidia/nemo:25.07 \
      --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET_LOGS} \
      --set volumes.gcsMounts[0].mountPath=/job-logs \
      --set workload.step_count=100
  ```

- To enable dataloading, checkpoint restore, and checkpoint write at every 25 steps, run the following command from your client:

  ```bash
  cd $RECIPE_ROOT
  export WORKLOAD_NAME=a4-llama3-1-70b-gpus128
  helm install $WORKLOAD_NAME . -f values.yaml \
      --set workload_launcher=launcher.sh \
      --set workload_config=llama3-1-70b.py \
      --set workload.image=nvcr.io/nvidia/nemo:25.07 \
      --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET_LOGS} \
      --set volumes.gcsMounts[0].mountPath=/job-logs \
      --set workload.enable_dataloading=$ENABLE_DATALOADING \
      --set workload.enable_ckpt_write=$ENABLE_CHECKPOINT_WRITE \
      --set workload.enable_ckpt_load=$ENABLE_CHECKPOINT_LOAD \
      --set workload.ckpt_write_interval=$CHECKPOINT_WRITE_INTERVAL \
      --set workload.token_path=$TOKEN_PATH \
      --set workload.dataset_path=$DATASET_PATH \
      --set workload.restore_path=$RESTORE_PATH
  ```
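
Once the job has been submitted, you can confirm that the Helm release exists and that a JobSet resource was created for the workload. This is a minimal sketch; it assumes the JobSet name contains the workload name chosen above.

```bash
# Confirm the Helm release is deployed.
helm list --filter "$WORKLOAD_NAME"

# Confirm a JobSet resource was created for the workload.
kubectl get jobsets | grep "$WORKLOAD_NAME"
```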

### Monitor the job

To check the status of pods in your job, run the following command:

```
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace the following:

- `JOB_NAME_PREFIX`: your job name prefix. For example, `a4-llama3-1-70b-gpus128`.

To get the logs for one of the pods, run the following command:

```
kubectl logs POD_NAME
```

Replace `POD_NAME` with the name of one of the pods returned by the previous command.

Information about the training job's progress, including crucial details such as loss, step count, and step time, is generated by the rank 0 process. This process runs on the pod whose name begins with `JOB_NAME_PREFIX-workload-0-0`, for example `a4-llama3-1-70b-gpus128-workload-0-0-s9zrv`.
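
To stream the rank 0 logs directly, you can combine the two commands above. This is a minimal sketch that assumes a single pod matches the rank 0 name pattern from the example:

```bash
# Stream logs from the rank 0 pod (name pattern taken from the example above).
RANK0_POD=$(kubectl get pods -o name | grep "a4-llama3-1-70b-gpus128-workload-0-0" | head -n 1)
kubectl logs -f "$RANK0_POD"
```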

### Analyze results

When completed, the job creates several artifacts, including logs and traces, and places them in the Google Cloud Storage logs bucket as follows:

```
gs://${GCS_BUCKET_LOGS}/nemo-experiments-storage/<JOB_ID>
├── nemo-configuration.yaml
├── lightning_logs.txt
├── nemo_error_logs.txt
├── nemo_log_globalrank-[RANK]_localrank-[LOCAL].txt
├── dllogger
│   ├── rank-0
│   │   ├── dllogger.json
...
```

- `nemo-configuration.yaml`: the NeMo configuration used by the pretraining script. This includes the combined [configuration file](../16node-bf16-seq8192-gbs512/llama3-1-70b.py) and the command line overrides
- `lightning_logs.txt`: the log files generated by PyTorch Lightning, which is used by NeMo
- `nemo_error_logs.txt`: the warning and error logs generated by NeMo
- `nemo_log_globalrank-[RANK]_localrank-[LOCAL].txt`: the NeMo logs for each rank
- `dllogger/`: the logs captured by [NVIDIA DLLogger](https://github.com/NVIDIA/dllogger). DLLogger is configured to store logs on the rank 0 node. The log is in JSON format and includes loss, step_time, and other key metrics for each training step

The NeMo log files include information about checkpoint operations on each rank. You can find checkpoint read and write information in the `nemo_log_globalrank-0_localrank-0.txt` file.
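
To inspect the artifacts locally, you can copy them down from the logs bucket. A minimal sketch, with `<JOB_ID>` standing in for the job ID that appears in the bucket listing:

```bash
# List the experiment folders, then copy one job's artifacts to the workstation.
gcloud storage ls gs://${GCS_BUCKET_LOGS}/nemo-experiments-storage/
gcloud storage cp -r gs://${GCS_BUCKET_LOGS}/nemo-experiments-storage/<JOB_ID> .

# Skim the per-step metrics that DLLogger captured on rank 0.
tail <JOB_ID>/dllogger/rank-0/dllogger.json
```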

### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart. To uninstall the Helm release, run the following command from your client:

```bash
helm uninstall $WORKLOAD_NAME
```

### Uninstall PVCs and PVs

To uninstall the Persistent Volume and Persistent Volume Claim resources for GCSFuse, run the following command:

```
helm uninstall $USER-gcs-pv-pvc
```
