
Commit dc6ef1a

Copybara authored and committed

Copybara import of gpu-recipes:

- 817afcc887ae460f2b57a74ec99cc48e74c3407b update the script for building nemo aotc
- cf7622782b3381482bb4da1d052f462eca6b57b0 Merge "add mixtral-8x-7b maxtext a3u recipe" into main

GitOrigin-RevId: cf7622782b3381482bb4da1d052f462eca6b57b0

1 parent ff16a47 commit dc6ef1a

File tree

11 files changed (+443, -12 lines)

README.md

Lines changed: 2 additions & 1 deletion
@@ -30,7 +30,8 @@ Welcome to the reproducible benchmark recipes repository for GPUs! This reposito
  | Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
  | ---------------- | ---------------- | --------- | ------------------- | ------------ | ------------------ |
  | **Llama-3.1-70B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-vms) | MaxText | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-70b/maxtext-pretraining-gke/README.md)
- | **Llama-3.1-70B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-70b/nemo-pretraining-gke/README.md)
+ | **Llama-3.1-70B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-70b/nemo-pretraining-gke/README.md)
+ | **Mixtral-8-7B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-vms) | MaxText | Pre-training | GKE | [Link](./training/a3ultra/mixtral-8x7b/maxtext-pretraining-gke/README.md)
  | **Mixtral-8-7B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/mixtral-8x7b/nemo-pretraining-gke/README.md) |
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+ # Nemo 24.07 AotC Image
+
+ This Dockerfile builds a container image designed for NVIDIA NeMo training workloads. It includes the AotC library,
+ which contains Google-optimized implementations of NeMo-based workflows.
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+ # Copyright 2024 Google LLC
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #      http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ steps:
+ - name: 'gcr.io/cloud-builders/docker'
+   args:
+   - 'build'
+   - '--tag=${_ARTIFACT_REGISTRY}/nemo_workload:24.07'
+   - '--file=nemo.Dockerfile'
+   - '.'
+   automapSubstitutions: true
+
+ images:
+ - '${_ARTIFACT_REGISTRY}/nemo_workload:24.07'
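
To run this Cloud Build configuration manually, an invocation along the following lines should work. This is a sketch: the config filename and the `_ARTIFACT_REGISTRY` value are illustrative placeholders, not taken from this commit.

```bash
# Submit the build from the directory containing the config and nemo.Dockerfile.
# PROJECT_ID and REPO_NAME are placeholders for your Artifact Registry path.
gcloud builds submit \
  --config=cloudbuild.yml \
  --substitutions=_ARTIFACT_REGISTRY=us-central1-docker.pkg.dev/PROJECT_ID/REPO_NAME \
  .
```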
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+ # Copyright 2024 Google LLC
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #      http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ # Base Image
+ FROM nvcr.io/nvidia/nemo:24.07
+
+ # Set the working directory
+ WORKDIR /workspace
+ COPY requirements.txt /workspace/requirements.txt
+
+ # GCSfuse components (used to provide shared storage, not intended for high performance)
+ RUN apt-get update && apt-get install --yes --no-install-recommends \
+     ca-certificates \
+     curl \
+     gnupg \
+   && echo "deb https://packages.cloud.google.com/apt gcsfuse-buster main" \
+     | tee /etc/apt/sources.list.d/gcsfuse.list \
+   && echo "deb https://packages.cloud.google.com/apt cloud-sdk main" \
+     | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list \
+   && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - \
+   && apt-get update \
+   && apt-get install --yes gcsfuse \
+   && apt-get install --yes google-cloud-cli \
+   && apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* \
+   && mkdir /gcs
+
+ RUN pip install --require-hashes -r requirements.txt
+
+ # install kubectl
+ RUN curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl
+ RUN chmod +x ./kubectl
+ RUN mv ./kubectl /usr/local/bin
+
+ # Clone the AotC repository
+ RUN git clone https://github.com/AI-Hypercomputer/aotc.git
+ WORKDIR /workspace/aotc
+
+ # Build the wheel
+ RUN pip install build setuptools
+ RUN python3 -m pip wheel . --no-deps -w dist/
+ RUN pip install dist/*.whl
+
+ # Add the build timestamp as a label
+ ARG BUILD_TIMESTAMP
+ LABEL build_timestamp=$BUILD_TIMESTAMP
+
+ ENTRYPOINT []
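
For a one-off build outside Cloud Build, a plain `docker build` should be equivalent; this is a sketch in which the image tag is a placeholder and the `BUILD_TIMESTAMP` value is just one reasonable choice.

```bash
# Build the image locally; BUILD_TIMESTAMP feeds the LABEL defined above.
docker build \
  --file nemo.Dockerfile \
  --build-arg BUILD_TIMESTAMP="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --tag nemo_workload:24.07 \
  .
```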
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+ https://github.com/NVIDIA/dllogger/archive/refs/tags/v1.0.0.zip
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+ # This file is autogenerated by pip-compile with Python 3.11
+ # by the following command:
+ #
+ #    pip-compile --generate-hashes requirements.in
+ #
+ dllogger @ https://github.com/NVIDIA/dllogger/archive/refs/tags/v1.0.0.zip \
+     --hash=sha256:07d0cd9b9b56f454f0c186a0889137e9f94e1979fca3d35911967c874c93c191
+     # via -r requirements.in
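
Per the header above, this pinned file is regenerated from `requirements.in` with pip-tools; assuming a Python 3.11 environment to match, the workflow is:

```bash
# Recompile requirements.txt with pinned hashes after editing requirements.in.
pip install pip-tools
pip-compile --generate-hashes requirements.in
```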
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+ base_emb_dim: 4096
+ base_num_query_heads: 32
+ base_num_kv_heads: 8
+ base_mlp_dim: 14336
+ base_num_decoder_layers: 32
+ head_dim: 128
+ vocab_size: 32000
+ enable_dropout: false
+ logits_via_embedding: false
+ normalization_layer_epsilon: 0.00001
+ num_experts: 8
+ num_experts_per_tok: 2
+ rope_max_timescale: 1000000
+ decoder_block: mistral
+ attention: cudnn_flash_te
+ dataset_type: synthetic
+ tokenizer_path: "assets/tokenizer.mistral-v1"
+ max_target_length: 4096
+ use_iota_embed: true
+ reuse_example_batch: 1
+ enable_checkpointing: false
+ megablox: false
+ hardware: gpu
+ scan_layers: false
+ per_device_batch_size: 5
+ remat_policy: custom
+ logits_dot_in_fp32: false
+ enable_goodput_recording: false
+ monitor_goodput: false
+ query_proj: device
+ key_proj: device
+ value_proj: device
+ out_proj: device
+ mlpwi_0: device
+ mlpwi_1: device
+ mlpwo: device
+ dcn_fsdp_parallelism: 2
+ dcn_data_parallelism: 16
+ dcn_tensor_parallelism: 1
+ dcn_pipeline_parallelism: 1
+ ici_fsdp_parallelism: -1
+ ici_expert_parallelism: 8
+ ici_tensor_parallelism: 1
+ ici_data_parallelism: 1
+ capacity_factor: 1
+ weight_dtype: bfloat16
+ save_config_to_gcs: true
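
As a rough sanity check on the sharding layout (this reading is an assumption, not stated in the commit): the DCN dimensions multiply out to the node count and the ICI dimensions to the GPUs per node, so with 8 GPUs per A3 Ultra machine this config would imply a 256-GPU job.

```bash
# Assumed topology math: dcn_fsdp (2) x dcn_data (16) = 32 nodes;
# ici_expert (8) covers the 8 GPUs per node (ici_fsdp: -1 fills the remainder, here 1).
echo $((2 * 16 * 8))   # 256 GPUs in total
```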

src/helm-charts/a3ultra/maxtext-training/templates/maxtext-configmap.yaml

Lines changed: 4 additions & 0 deletions
@@ -20,6 +20,9 @@ data:
    maxtext-configuration.yaml: |-
  {{ .Values.maxtext_config | nindent 4 }}
    xla-flags: >-
+ {{- if .Values.xlaFlags }}
+ {{ .Values.xlaFlags }}
+ {{- else }}
      --xla_gpu_enable_triton_gemm=false
      --xla_gpu_enable_latency_hiding_scheduler=true
      --xla_gpu_graph_level=0
@@ -33,3 +36,4 @@ data:
      --xla_gpu_enable_reduce_scatter_combine_by_dim=false
      --xla_disable_hlo_passes=rematerialization
      --xla_gpu_enable_while_loop_double_buffering=true
+ {{- end }}
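
With this change the XLA flags become overridable via `.Values.xlaFlags`, falling back to the baked-in defaults otherwise. A hypothetical command-line override might look like the following (release name and flag values are placeholders):

```bash
# Override the default XLA flags at install time.
helm install my-maxtext-run $REPO_ROOT/src/helm-charts/a3ultra/maxtext-training \
  --set xlaFlags="--xla_gpu_enable_latency_hiding_scheduler=true --xla_gpu_graph_level=0"
```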

training/a3ultra/llama-3.1-70b/nemo-pretraining-gke/README.md

Lines changed: 14 additions & 11 deletions
@@ -188,6 +188,9 @@ for this job. To do this, we can set the new arguments using `--set workload.arg
    $REPO_ROOT/src/helm-charts/a3ultra/nemo-training
    ```

+ To build the AotC-based image yourself, use the build script in `$REPO_ROOT/src/docker/nemo-aotc-24.07`.
+
  ### Monitor the job

  To check the status of pods in the indexed job, run the following command from your client:

@@ -233,19 +236,19 @@ Here is an example of an entry in the DLLogger log:

(This hunk only re-indents the JSON example; its content is unchanged:)

  ```json
  DLLL{
    "timestamp": "1734117227.896116",
    "datetime": "2024-12-13 19:13:47.896116",
    "elapsedtime": "489.15554",
    "type": "LOG",
    "step": 15,
    "data": {
      "reduced_train_loss": 1.865377426147461,
      "lr": 1.1250000397922122e-06,
      "global_step": 15.0,
      "consumed_samples": 16384.0,
      "train_backward_timing in s": 4.5490265620173886e-05,
      "grad_norm": 19.41560935974121,
      "train_step_timing in s": 20.021318435668945,
      "epoch": 0
    }
  }
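
To pull step timings out of such a log, one option is to strip the `DLLL` prefix and parse the remainder as JSON. This is a sketch: it assumes the JSON payload directly follows the prefix, and the log filename is a placeholder.

```bash
# Print step number and per-step timing from DLLogger entries.
grep '^DLLL' dllogger.json \
  | sed 's/^DLLL *//' \
  | python3 -c '
import json, sys
for line in sys.stdin:
    entry = json.loads(line)
    print(entry.get("step"), entry.get("data", {}).get("train_step_timing in s"))'
```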
