<!-- mdformat global-off -->
# Pretrain llama3-1-70b-gpus128 workloads on A4 GKE Node pools with the NVIDIA NeMo Framework using Google Cloud Storage for training data and checkpoints

This recipe outlines the steps for running a llama3-1-70b-gpus128 pretraining
workload on [A4 GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the
[NVIDIA NeMo framework](https://github.com/NVIDIA/nemo).

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Pretraining job configuration and deployment - A Helm chart is used to
  configure and deploy the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset) resource, which manages the execution of the
  [NeMo pretraining workload](https://github.com/NVIDIA/nemo). The chart generates the job's manifest following best practices for running GPUDirect RDMA workloads on Google Kubernetes Engine (GKE), including optimal settings for NVIDIA NCCL and the gIB NCCL plugin.

## Test environment

This recipe has been optimized for and tested with the following configuration:

- A standard GKE cluster:
  - GKE version: 1.33.5-gke.1162000 or later
  - A GPU node pool with 16 [a4-highgpu-8g](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-machine-type) machines
  - Workload Identity Federation for GKE enabled
  - Cloud Storage FUSE CSI driver enabled
  - DCGM metrics enabled
  - Kueue and JobSet APIs installed
  - Kueue configured to support Topology Aware Scheduling
- A regional Google Cloud Storage (GCS) bucket to store logs
- A regional Google Cloud Storage (GCS) bucket with [hierarchical namespace](https://cloud.google.com/storage/docs/hns-overview) to store the Pile dataset
- A regional Google Cloud Storage (GCS) bucket with [hierarchical namespace](https://cloud.google.com/storage/docs/hns-overview) to store checkpoints
- A client workstation with the following pre-installed:
  - Google Cloud SDK
  - Helm
  - kubectl

To prepare the required environment, see the
[GKE environment setup guide](../../../../docs/configuring-environment-gke-a4.md).

**Important:** All GCS buckets must be in the same region as the GKE cluster.
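
As an optional sanity check, you can compare each bucket's location with the cluster's region before you start. The commands below are illustrative; replace the placeholders with your own bucket and cluster names:

```bash
# Both commands print a location; the bucket locations should match the cluster region.
gcloud storage buckets describe gs://<BUCKET_NAME> --format="value(location)"
gcloud container clusters describe <CLUSTER_NAME> --region <CLUSTER_REGION> --format="value(location)"
```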

## Training dataset

The recipe uses the [Pile dataset](https://pile.eleuther.ai/) converted to the NeMo memory map (mmap) format.

## Docker container image

This recipe uses the following Docker images:

- `nvcr.io/nvidia/nemo:25.07`
- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.0`

## Run the recipe

From your client workstation, complete the following steps:

### Configure environment settings

Set the environment variables to match your environment:

  ```bash
  export PROJECT_ID=<PROJECT_ID>
  export CLUSTER_REGION=<CLUSTER_REGION>
  export CLUSTER_NAME=<CLUSTER_NAME>
  export GCS_BUCKET_LOGS=<GCS_BUCKET_LOGS>
  export GCS_BUCKET_DATA=<GCS_BUCKET_DATA>
  export GCS_BUCKET_CHECKPOINTS=<GCS_BUCKET_CHECKPOINTS>
  export ENABLE_DATALOADING=<ENABLE_DATALOADING>
  export ENABLE_CHECKPOINT_WRITE=<ENABLE_CHECKPOINT_WRITE>
  export CHECKPOINT_WRITE_INTERVAL=<CHECKPOINT_WRITE_INTERVAL>
  export ENABLE_CHECKPOINT_LOAD=<ENABLE_CHECKPOINT_LOAD>
  export RESTORE_PATH=<RESTORE_PATH>
  export TOKEN_PATH=<TOKEN_PATH>
  export DATASET_PATH=<DATASET_PATH>
  ```

Replace the following values:

- `<PROJECT_ID>`: your Google Cloud project ID
- `<CLUSTER_REGION>`: the region where your cluster is located
- `<CLUSTER_NAME>`: the name of your GKE cluster
- `<GCS_BUCKET_LOGS>`: the name of the Cloud Storage bucket for logs. Do not include the `gs://` prefix
- `<GCS_BUCKET_DATA>`: the name of the Cloud Storage bucket for training data. Do not include the `gs://` prefix
- `<GCS_BUCKET_CHECKPOINTS>`: the name of the Cloud Storage bucket for checkpoints. Do not include the `gs://` prefix
- `<ENABLE_DATALOADING>`: set to `true` to use a real dataset for dataloading. Defaults to `false`
- `<ENABLE_CHECKPOINT_WRITE>`: set to `true` to enable checkpoint writing. Defaults to `false`
- `<CHECKPOINT_WRITE_INTERVAL>`: the step interval at which checkpoints are written
- `<ENABLE_CHECKPOINT_LOAD>`: set to `true` to enable checkpoint restore. Defaults to `false`
- `<RESTORE_PATH>`: the path to the specific checkpoint to restore from. The checkpoints bucket is mounted at `/checkpoints`, so the path must start with `/checkpoints`
- `<TOKEN_PATH>`: the path to the SentencePiece tokenizer model file
- `<DATASET_PATH>`: the path in the data bucket used for dataloading. The path must contain only the dataloading objects. The data bucket is mounted at `/data`, so the path must start with `/data`
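
For illustration only, a filled-in configuration might look like the following. Every value below is a made-up example, and the path-style values in particular depend on how you lay out your buckets, so substitute your own settings:

```bash
# Hypothetical example values - replace all of them with your own settings.
export PROJECT_ID=my-gcp-project
export CLUSTER_REGION=us-central1
export CLUSTER_NAME=my-a4-cluster
export GCS_BUCKET_LOGS=my-logs-bucket
export GCS_BUCKET_DATA=my-data-bucket
export GCS_BUCKET_CHECKPOINTS=my-checkpoints-bucket
export ENABLE_DATALOADING=true
export ENABLE_CHECKPOINT_WRITE=true
export CHECKPOINT_WRITE_INTERVAL=25
export ENABLE_CHECKPOINT_LOAD=false
export RESTORE_PATH=/checkpoints            # must start with /checkpoints; point it at a real checkpoint
export TOKEN_PATH=/data/tokenizer.model     # hypothetical location of your SentencePiece tokenizer model
export DATASET_PATH=/data/data              # must start with /data; matches the upload step later in this recipe
```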

Set the default project:

  ```bash
  gcloud config set project $PROJECT_ID
  ```

### Upload the training dataset

The Pile dataset in the NVIDIA NeMo *mmap* format is staged in a public GCS bucket. You need to upload the dataset to your GCS bucket with hierarchical namespace enabled.

1. Create a folder for the dataset:

   ```
   gcloud storage folders create gs://${GCS_BUCKET_DATA}/data
   ```

2. Upload the dataset:

   ```
   gcloud storage cp gs://cloud-samples-data/third-party/pile/*.* gs://${GCS_BUCKET_DATA}/data
   ```
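
Optionally, verify that the dataset objects are in place before you run the workload:

```bash
# List the uploaded dataset files (NeMo mmap datasets are typically .bin/.idx pairs).
gcloud storage ls gs://${GCS_BUCKET_DATA}/data/
```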

### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder.

```
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4/llama3-1-70b/nemo-pretraining-gke/16node-bf16-seq8192-gbs512-gcs
cd $RECIPE_ROOT
```

### Get cluster credentials

```
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```
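
Optionally, confirm that the A4 node pool is reachable from your workstation. This check relies on the standard `node.kubernetes.io/instance-type` node label:

```bash
# Expect 16 nodes backed by a4-highgpu-8g machines for this recipe.
kubectl get nodes -l node.kubernetes.io/instance-type=a4-highgpu-8g
```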

### Create Persistent Volumes and Persistent Volume Claims

The pretraining job accesses GCS buckets for training data and checkpoints through [the Cloud Storage FUSE CSI driver](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver) configured using Kubernetes Persistent Volumes (PV) and Persistent Volume Claims (PVC). You must generate PVs and PVCs for both data and checkpoint buckets using the [gcs-fuse helper Helm chart](../../../../src/helm-charts/storage/gcs-fuse). The chart configures the FUSE driver settings following the best practices for optimizing access to buckets for training data and checkpoints.

```
helm install -f $REPO_ROOT/src/helm-charts/storage/gcs-fuse/values.yaml \
--set gcsVolumes[0].bucketName=${GCS_BUCKET_DATA} \
--set gcsVolumes[1].bucketName=${GCS_BUCKET_CHECKPOINTS} \
$USER-gcs-pv-pvc \
$REPO_ROOT/src/helm-charts/storage/gcs-fuse
```
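
After the release is installed, you can check that the volumes exist and that the claims are bound (the exact resource names are defined by the chart):

```bash
# The PVCs created for the data and checkpoint buckets should show STATUS=Bound.
kubectl get pv
kubectl get pvc
```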

### Configure and submit a pretraining job

#### Using 16 nodes (128 GPUs) with bf16-mixed precision

To execute the job with the default settings, run the following command from
your client:

  ```bash
  cd $RECIPE_ROOT
  export WORKLOAD_NAME=a4-llama3-1-70b-gpus128
  helm install $WORKLOAD_NAME . -f values.yaml \
  --set workload_launcher=launcher.sh \
  --set workload_config=llama3-1-70b.py \
  --set workload.image=nvcr.io/nvidia/nemo:25.07 \
  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET_LOGS} \
  --set volumes.gcsMounts[0].mountPath=/job-logs
  ```
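
After you submit the job (with the default settings or any of the examples below), you can confirm that the Helm release was created and that a JobSet exists for the workload:

```bash
# The release and its JobSet should be listed; pods appear once the JobSet starts them.
helm list --filter $WORKLOAD_NAME
kubectl get jobset
```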

**Examples**

- To set the number of training steps to 100, run the following command from
  your client:

  ```bash
  cd $RECIPE_ROOT
  export WORKLOAD_NAME=a4-llama3-1-70b-gpus128
  helm install $WORKLOAD_NAME . -f values.yaml \
  --set workload_launcher=launcher.sh \
  --set workload_config=llama3-1-70b.py \
  --set workload.image=nvcr.io/nvidia/nemo:25.07 \
  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET_LOGS} \
  --set volumes.gcsMounts[0].mountPath=/job-logs \
  --set workload.step_count=100
  ```
- To enable dataloading, checkpoint loading, and checkpoint writing at a chosen interval (for example, every 25 steps with `CHECKPOINT_WRITE_INTERVAL=25`), run the following command from your client:

  ```bash
  cd $RECIPE_ROOT
  export WORKLOAD_NAME=a4-llama3-1-70b-gpus128
  helm install $WORKLOAD_NAME . -f values.yaml \
  --set workload_launcher=launcher.sh \
  --set workload_config=llama3-1-70b.py \
  --set workload.image=nvcr.io/nvidia/nemo:25.07 \
  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET_LOGS} \
  --set volumes.gcsMounts[0].mountPath=/job-logs \
  --set workload.enable_dataloading=$ENABLE_DATALOADING \
  --set workload.enable_ckpt_write=$ENABLE_CHECKPOINT_WRITE \
  --set workload.enable_ckpt_load=$ENABLE_CHECKPOINT_LOAD \
  --set workload.ckpt_write_interval=$CHECKPOINT_WRITE_INTERVAL \
  --set workload.token_path=$TOKEN_PATH \
  --set workload.dataset_path=$DATASET_PATH \
  --set workload.restore_path=$RESTORE_PATH
  ```

### Monitor the job

To check the status of pods in your job, run the following command:

```
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace the following:

- `JOB_NAME_PREFIX`: your job name prefix. For example, `a4-llama3-1-70b-gpus128`.

To get the logs for one of the pods, run the following command:

```
kubectl logs POD_NAME
```

Replace `POD_NAME` with the name of one of the pods returned by the previous command.

Information about the training job's progress, including crucial details such as
loss, step count, and step time, is generated by the rank 0 process.
This process runs on the pod whose name begins with
`JOB_NAME_PREFIX-workload-0-0`.
For example: `a4-llama3-1-70b-gpus128-workload-0-0-s9zrv`.
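
As a convenience, here is one way to follow the rank 0 logs directly; this assumes the default workload name and exactly one matching pod:

```bash
# Stream logs from the rank 0 pod (adjust the prefix if you changed the workload name).
kubectl logs -f $(kubectl get pods -o name | grep a4-llama3-1-70b-gpus128-workload-0-0)
```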

### Analyze results

When completed, the job creates several artifacts, including logs and traces, and places them
in the Google Cloud Storage logs bucket as follows:

```
gs://${GCS_BUCKET_LOGS}/nemo-experiments-storage/<JOB_ID>
├── nemo-configuration.yaml
├── lightning_logs.txt
├── nemo_error_logs.txt
├── nemo_log_globalrank-[RANK]_localrank-[LOCAL].txt
├── dllogger
│   ├── rank-0
│   │   ├── dllogger.json
...
```

- `nemo-configuration.yaml`: the NeMo configuration used by the pretraining script. This includes
  the combined [configuration file](../16node-bf16-seq8192-gbs512/llama3-1-70b.py)
  and the command line overrides
- `lightning_logs.txt`: the log files generated by PyTorch Lightning, which is used by NeMo
- `nemo_error_logs.txt`: the warning and error logs generated by NeMo
- `nemo_log_globalrank-[RANK]_localrank-[LOCAL].txt`: the NeMo logs for each rank
- `dllogger/`: the logs captured by [NVIDIA DLLogger](https://github.com/NVIDIA/dllogger).
  DLLogger is configured to store logs on the rank 0 node. The log is in JSON format
  and includes loss, step_time, and other key metrics for each training step; see the example after this list
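
For example, after the job finishes you can pull the rank 0 DLLogger output from the logs bucket and look at the last few training steps. Replace `<JOB_ID>` with your job's ID:

```bash
# Show the final DLLogger entries (loss, step_time, and other per-step metrics).
gcloud storage cat gs://${GCS_BUCKET_LOGS}/nemo-experiments-storage/<JOB_ID>/dllogger/rank-0/dllogger.json | tail -n 5
```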

The NeMo log files include information about checkpoint operations on each rank. For example, checkpoint read and write information for rank 0 is in `nemo_log_globalrank-0_localrank-0.txt`.

### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart. To
uninstall the Helm release, run the following command from your client:

```bash
helm uninstall $WORKLOAD_NAME
```
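
To confirm that the release and its workload pods are gone, you can run:

```bash
# The release should no longer be listed, and the grep should return no pods.
helm list --filter $WORKLOAD_NAME
kubectl get pods | grep $WORKLOAD_NAME
```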

### Uninstall PVCs and PVs

To uninstall the Persistent Volume and Persistent Volume Claim resources for Cloud Storage FUSE, run the following command:

```
helm uninstall $USER-gcs-pv-pvc
```