
Commit dc6ef1a

Copybara authored and committed

Copybara import of gpu-recipes:

- 817afcc887ae460f2b57a74ec99cc48e74c3407b update the script for building nemo aotc
- cf7622782b3381482bb4da1d052f462eca6b57b0 Merge "add mixtral-8x-7b maxtext a3u recipe" into main

GitOrigin-RevId: cf7622782b3381482bb4da1d052f462eca6b57b0

1 parent ff16a47 commit dc6ef1a

File tree

11 files changed (+443, -12 lines)

README.md

Lines changed: 2 additions & 1 deletion
@@ -30,7 +30,8 @@ Welcome to the reproducible benchmark recipes repository for GPUs! This reposito
  | Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
  | ---------------- | ---------------- | --------- | ------------------- | ------------ | ------------------ |
  | **Llama-3.1-70B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-vms) | MaxText | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-70b/maxtext-pretraining-gke/README.md)
- | **Llama-3.1-70B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-70b/nemo-pretraining-gke/README.md)
+ | **Llama-3.1-70B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-70b/nemo-pretraining-gke/README.md)
+ | **Mixtral-8-7B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-vms) | MaxText | Pre-training | GKE | [Link](./training/a3ultra/mixtral-8x7b/maxtext-pretraining-gke/README.md)
  | **Mixtral-8-7B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/mixtral-8x7b/nemo-pretraining-gke/README.md) |
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+ # Nemo 24.07 AotC Image
+
+ This Dockerfile builds a container image designed for NVIDIA NeMo training workloads. It includes the AotC library,
+ which contains Google-optimized implementations of NeMo-based workflows.
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+ # Copyright 2024 Google LLC
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #      http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ steps:
+ - name: 'gcr.io/cloud-builders/docker'
+   args:
+   - 'build'
+   - '--tag=${_ARTIFACT_REGISTRY}/nemo_workload:24.07'
+   - '--file=nemo.Dockerfile'
+   - '.'
+   automapSubstitutions: true
+
+ images:
+ - '${_ARTIFACT_REGISTRY}/nemo_workload:24.07'
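
To run this Cloud Build configuration manually, an invocation along the following lines should work. This is a sketch: the config filename and the `_ARTIFACT_REGISTRY` value are illustrative placeholders, not taken from this commit.

```bash
# Submit the build from the directory containing the config and nemo.Dockerfile.
# PROJECT_ID and REPO_NAME are placeholders for your Artifact Registry path.
gcloud builds submit \
  --config=cloudbuild.yml \
  --substitutions=_ARTIFACT_REGISTRY=us-central1-docker.pkg.dev/PROJECT_ID/REPO_NAME \
  .
```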
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+ # Copyright 2024 Google LLC
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #      http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ # Base Image
+ FROM nvcr.io/nvidia/nemo:24.07
+
+ # Set the working directory
+ WORKDIR /workspace
+ COPY requirements.txt /workspace/requirements.txt
+
+ # GCSfuse components (used to provide shared storage, not intended for high performance)
+ RUN apt-get update && apt-get install --yes --no-install-recommends \
+     ca-certificates \
+     curl \
+     gnupg \
+   && echo "deb https://packages.cloud.google.com/apt gcsfuse-buster main" \
+     | tee /etc/apt/sources.list.d/gcsfuse.list \
+   && echo "deb https://packages.cloud.google.com/apt cloud-sdk main" \
+     | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list \
+   && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - \
+   && apt-get update \
+   && apt-get install --yes gcsfuse \
+   && apt-get install --yes google-cloud-cli \
+   && apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* \
+   && mkdir /gcs
+
+ RUN pip install --require-hashes -r requirements.txt
+
+ # install kubectl
+ RUN curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl
+ RUN chmod +x ./kubectl
+ RUN mv ./kubectl /usr/local/bin
+
+ # Clone the AotC repository
+ RUN git clone https://github.com/AI-Hypercomputer/aotc.git
+ WORKDIR /workspace/aotc
+
+ # Build the wheel
+ RUN pip install build setuptools
+ RUN python3 -m pip wheel . --no-deps -w dist/
+ RUN pip install dist/*.whl
+
+ # Add the build timestamp as a label
+ ARG BUILD_TIMESTAMP
+ LABEL build_timestamp=$BUILD_TIMESTAMP
+
+ ENTRYPOINT []
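
For a one-off build outside Cloud Build, a plain `docker build` should be equivalent; this is a sketch in which the image tag is a placeholder and the `BUILD_TIMESTAMP` value is just one reasonable choice.

```bash
# Build the image locally; BUILD_TIMESTAMP feeds the LABEL defined above.
docker build \
  --file nemo.Dockerfile \
  --build-arg BUILD_TIMESTAMP="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --tag nemo_workload:24.07 \
  .
```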
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+ https://github.com/NVIDIA/dllogger/archive/refs/tags/v1.0.0.zip
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+ # This file is autogenerated by pip-compile with Python 3.11
+ # by the following command:
+ #
+ #    pip-compile --generate-hashes requirements.in
+ #
+ dllogger @ https://github.com/NVIDIA/dllogger/archive/refs/tags/v1.0.0.zip \
+     --hash=sha256:07d0cd9b9b56f454f0c186a0889137e9f94e1979fca3d35911967c874c93c191
+     # via -r requirements.in
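
Per the header above, this pinned file is regenerated from `requirements.in` with pip-tools; assuming a Python 3.11 environment to match, the workflow is:

```bash
# Recompile requirements.txt with pinned hashes after editing requirements.in.
pip install pip-tools
pip-compile --generate-hashes requirements.in
```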
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+ base_emb_dim: 4096
+ base_num_query_heads: 32
+ base_num_kv_heads: 8
+ base_mlp_dim: 14336
+ base_num_decoder_layers: 32
+ head_dim: 128
+ vocab_size: 32000
+ enable_dropout: false
+ logits_via_embedding: false
+ normalization_layer_epsilon: 0.00001
+ num_experts: 8
+ num_experts_per_tok: 2
+ rope_max_timescale: 1000000
+ decoder_block: mistral
+ attention: cudnn_flash_te
+ dataset_type: synthetic
+ tokenizer_path: "assets/tokenizer.mistral-v1"
+ max_target_length: 4096
+ use_iota_embed: true
+ reuse_example_batch: 1
+ enable_checkpointing: false
+ megablox: false
+ hardware: gpu
+ scan_layers: false
+ per_device_batch_size: 5
+ remat_policy: custom
+ logits_dot_in_fp32: false
+ enable_goodput_recording: false
+ monitor_goodput: false
+ query_proj: device
+ key_proj: device
+ value_proj: device
+ out_proj: device
+ mlpwi_0: device
+ mlpwi_1: device
+ mlpwo: device
+ dcn_fsdp_parallelism: 2
+ dcn_data_parallelism: 16
+ dcn_tensor_parallelism: 1
+ dcn_pipeline_parallelism: 1
+ ici_fsdp_parallelism: -1
+ ici_expert_parallelism: 8
+ ici_tensor_parallelism: 1
+ ici_data_parallelism: 1
+ capacity_factor: 1
+ weight_dtype: bfloat16
+ save_config_to_gcs: true
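
As a rough sanity check on the sharding layout (this reading is an assumption, not stated in the commit): the DCN dimensions multiply out to the node count and the ICI dimensions to the GPUs per node, so with 8 GPUs per A3 Ultra machine this config would imply a 256-GPU job.

```bash
# Assumed topology math: dcn_fsdp (2) x dcn_data (16) = 32 nodes;
# ici_expert (8) covers the 8 GPUs per node (ici_fsdp: -1 fills the remainder, here 1).
echo $((2 * 16 * 8))   # 256 GPUs in total
```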

src/helm-charts/a3ultra/maxtext-training/templates/maxtext-configmap.yaml

Lines changed: 4 additions & 0 deletions
@@ -20,6 +20,9 @@ data:
    maxtext-configuration.yaml: |-
  {{ .Values.maxtext_config | nindent 4 }}
    xla-flags: >-
+ {{- if .Values.xlaFlags }}
+ {{ .Values.xlaFlags }}
+ {{- else }}
      --xla_gpu_enable_triton_gemm=false
      --xla_gpu_enable_latency_hiding_scheduler=true
      --xla_gpu_graph_level=0
@@ -33,3 +36,4 @@ data:
      --xla_gpu_enable_reduce_scatter_combine_by_dim=false
      --xla_disable_hlo_passes=rematerialization
      --xla_gpu_enable_while_loop_double_buffering=true
+ {{- end }}
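
With this change the XLA flags become overridable via `.Values.xlaFlags`, falling back to the baked-in defaults otherwise. A hypothetical command-line override might look like the following (release name and flag values are placeholders):

```bash
# Override the default XLA flags at install time.
helm install my-maxtext-run $REPO_ROOT/src/helm-charts/a3ultra/maxtext-training \
  --set xlaFlags="--xla_gpu_enable_latency_hiding_scheduler=true --xla_gpu_graph_level=0"
```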

training/a3ultra/llama-3.1-70b/nemo-pretraining-gke/README.md

Lines changed: 14 additions & 11 deletions
@@ -188,6 +188,9 @@ for this job. To do this, we can set the new arguments using `--set workload.arg
    $REPO_ROOT/src/helm-charts/a3ultra/nemo-training
    ```

+ To build the AotC-based image yourself, use the build script in `$REPO_ROOT/src/docker/nemo-aotc-24.07`.
+
  ### Monitor the job

  To check the status of pods in the indexed job, run the following command from your client:

@@ -233,19 +236,19 @@ Here is an example of an entry in the DLLogger log:

(This hunk only re-indents the JSON example; its content is unchanged:)

  ```json
  DLLL{
    "timestamp": "1734117227.896116",
    "datetime": "2024-12-13 19:13:47.896116",
    "elapsedtime": "489.15554",
    "type": "LOG",
    "step": 15,
    "data": {
      "reduced_train_loss": 1.865377426147461,
      "lr": 1.1250000397922122e-06,
      "global_step": 15.0,
      "consumed_samples": 16384.0,
      "train_backward_timing in s": 4.5490265620173886e-05,
      "grad_norm": 19.41560935974121,
      "train_step_timing in s": 20.021318435668945,
      "epoch": 0
    }
  }
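
To pull step timings out of such a log, one option is to strip the `DLLL` prefix and parse the remainder as JSON. This is a sketch: it assumes the JSON payload directly follows the prefix, and the log filename is a placeholder.

```bash
# Print step number and per-step timing from DLLogger entries.
grep '^DLLL' dllogger.json \
  | sed 's/^DLLL *//' \
  | python3 -c '
import json, sys
for line in sys.stdin:
    entry = json.loads(line)
    print(entry.get("step"), entry.get("data", {}).get("train_step_timing in s"))'
```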
