
Commit 7fa4518

Copybara authored and committed
Copybara import of gpu-recipes:

- 308cffe6b1bebd0880c37fb9404908271c299489 Fix chart directory name
- 0a464addb419c1e9c31f042618afea145d46e9a4 Merge "Changing remat_policy to the new one available to ...
- c88870caedb6ef7215486be77041b4a073d23ef8 Adding Llama-3.1-70B Nemo pretraining recipe for A3Ultra

GitOrigin-RevId: c88870caedb6ef7215486be77041b4a073d23ef8
1 parent c72e38d commit 7fa4518

8 files changed, +523 -15 lines changed


README.md

Lines changed: 9 additions & 2 deletions
@@ -16,17 +16,24 @@ Welcome to the reproducible benchmark recipes repository for GPUs! This reposito
 
 ## Benchmarks support matrix
 
-### Training benchmarks
+### Training benchmarks A3 Mega
 
 | Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
 | ---------------- | ---------------- | --------- | ------------------- | ------------ | ------------------ |
 | **GPT3-175B** | [A3 Mega (NVIDIA H100)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-mega-vms) | NeMo | Pre-training | GKE | [Link](./training/a3mega/gpt3-175b/nemo-pretraining-gke/README.md) |
 | **Llama-3-70B** | [A3 Mega (NVIDIA H100)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-mega-vms) | NeMo | Pre-training | GKE | [Link](./training/a3mega/llama-3-70b/nemo-pretraining-gke/README.md) |
 | **Llama-3.1-70B** | [A3 Mega (NVIDIA H100)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-mega-vms) | NeMo | Pre-training | GKE | [Link](./training/a3mega/llama-3.1-70b/nemo-pretraining-gke/README.md) |
-| **Llama-3.1-70B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-vms) | MaxText | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-70b/maxtext-pretraining-gke/README.md)
 | **Mixtral-8-7B** | [A3 Mega (NVIDIA H100)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-mega-vms) | NeMo | Pre-training | GKE | [Link](./training/a3mega/mixtral-8x7b/nemo-pretraining-gke/README.md) |
+
+### Training benchmarks A3 Ultra
+
+| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
+| ---------------- | ---------------- | --------- | ------------------- | ------------ | ------------------ |
+| **Llama-3.1-70B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-vms) | MaxText | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-70b/maxtext-pretraining-gke/README.md)
+| **Llama-3.1-70B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-70b/nemo-pretraining-gke/README.md)
 | **Mixtral-8-7B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/mixtral-8x7b/nemo-pretraining-gke/README.md) |
 
+
 ## Repository structure
 
 * **[training/](./training)**: Contains recipes to reproduce training benchmarks with GPUs.

src/frameworks/a3ultra/maxtext-configs/llama-3.1-70b-256gpus-a3u-bf16.yaml

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ learning_rate: 0.001
 model_name: llama3.1-70b
 enable_checkpointing: false
 attention: cudnn_flash_te
-remat_policy: save_dot_except_mlp
+remat_policy: save_dot_with_context_except_mlp
 use_iota_embed: true
 scan_layers: true
 dataset_type: synthetic

src/frameworks/a3ultra/maxtext-configs/llama-3.1-70b-256gpus-a3u-fp8.yaml

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ model_name: llama3.1-70b
 enable_checkpointing: false
 quantization: fp8
 attention: cudnn_flash_te
-remat_policy: save_dot_except_mlp
+remat_policy: save_dot_with_context_except_mlp
 use_iota_embed: true
 scan_layers: true
 dataset_type: synthetic
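Both hunks make the same one-line change: the MaxText rematerialization policy moves from save_dot_except_mlp to the newly available save_dot_with_context_except_mlp. A quick way to confirm what each A3 Ultra MaxText config now carries, and how the bf16 and fp8 variants differ, is to load and compare the two files. A minimal sketch, assuming PyYAML is installed and the script is run from the repository root (this is an editorial illustration, not part of the recipe):

```python
# Minimal sketch: compare the bf16 and fp8 MaxText configs key by key.
from pathlib import Path
import yaml

CONFIG_DIR = Path("src/frameworks/a3ultra/maxtext-configs")
bf16 = yaml.safe_load((CONFIG_DIR / "llama-3.1-70b-256gpus-a3u-bf16.yaml").read_text())
fp8 = yaml.safe_load((CONFIG_DIR / "llama-3.1-70b-256gpus-a3u-fp8.yaml").read_text())

# Both should now report the new policy: save_dot_with_context_except_mlp
print("remat_policy (bf16):", bf16["remat_policy"])
print("remat_policy (fp8): ", fp8["remat_policy"])

# Keys where the two variants differ (e.g. the fp8 file's quantization: fp8).
for key in sorted(set(bf16) | set(fp8)):
    if bf16.get(key) != fp8.get(key):
        print(f"{key}: bf16={bf16.get(key)!r}  fp8={fp8.get(key)!r}")
```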
Lines changed: 168 additions & 0 deletions
@@ -0,0 +1,168 @@
run:
  name: llama-3.1-70b-a3u-bf16
  time_limit: 0-03:30:00
  dependency: singleton
trainer:
  devices: 8
  accelerator: gpu
  precision: bf16
  logger: false
  enable_checkpointing: false
  use_distributed_sampler: false
  max_epochs: null
  max_steps: 30
  max_time: 05:23:30:00
  log_every_n_steps: 1
  val_check_interval: 200
  limit_val_batches: 5
  limit_test_batches: 5
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
exp_manager:
  exp_dir: null
  name: megatron_gpt
  resume_if_exists: false
  create_dllogger_logger: true
  dllogger_logger_kwargs:
    verbose: true
    stdout: true
  resume_ignore_no_checkpoint: true
  create_checkpoint_callback: false
  checkpoint_callback_params:
    monitor: val_loss
    save_top_k: 10
    mode: min
    always_save_nemo: false
    save_nemo_on_train_end: false
    model_parallel_size: ${multiply:${model.tensor_model_parallel_size}, ${model.pipeline_model_parallel_size}}
  log_step_timing: true
  step_timing_kwargs:
    sync_cuda: true
    buffer_size: 5
  seconds_to_sleep: 60
  explicit_log_dir: null
model:
  mcore_gpt: true
  micro_batch_size: 1
  global_batch_size: 1024
  rampup_batch_size: null
  tensor_model_parallel_size: 2
  pipeline_model_parallel_size: 4
  virtual_pipeline_model_parallel_size: 20
  context_parallel_size: 1
  encoder_seq_length: 8192
  max_position_embeddings: 8192
  num_layers: 80
  hidden_size: 8192
  ffn_hidden_size: 28672
  num_attention_heads: 64
  num_query_groups: 8
  init_method_std: 0.008944
  use_scaled_init_method: true
  hidden_dropout: 0.0
  attention_dropout: 0.0
  ffn_dropout: 0.0
  kv_channels: null
  apply_query_key_layer_scaling: true
  normalization: rmsnorm
  layernorm_epsilon: 1.0e-05
  do_layer_norm_weight_decay: false
  make_vocab_size_divisible_by: 128
  pre_process: true
  post_process: true
  persist_layer_norm: true
  bias: false
  activation: fast-swiglu
  headscale: false
  transformer_block_type: pre_ln
  openai_gelu: false
  normalize_attention_scores: true
  position_embedding_type: rope
  rotary_percentage: 1.0
  apply_rope_fusion: true
  attention_type: multihead
  share_embeddings_and_output_weights: false
  tokenizer:
    library: megatron
    type: GPT2BPETokenizer
    model: null
    delimiter: null
    vocab_file: gpt2-vocab.json
    merge_file: gpt2-merges.txt
  native_amp_init_scale: 4294967296
  native_amp_growth_interval: 1000
  hysteresis: 2
  fp32_residual_connection: false
  fp16_lm_cross_entropy: false
  megatron_amp_O2: true
  grad_allreduce_chunk_size_mb: 125
  grad_div_ar_fusion: true
  gradient_accumulation_fusion: true
  bias_activation_fusion: true
  bias_dropout_add_fusion: true
  masked_softmax_fusion: true
  seed: 1234
  resume_from_checkpoint: null
  use_cpu_initialization: false
  onnx_safe: false
  apex_transformer_log_level: 30
  gradient_as_bucket_view: true
  sync_batch_comm: false
  activations_checkpoint_granularity: null
  activations_checkpoint_method: null
  activations_checkpoint_num_layers: null
  num_micro_batches_with_partial_activation_checkpoints: null
  activations_checkpoint_layers_per_pipeline: null
  sequence_parallel: true
  transformer_engine: true
  fp8: true
  fp8_e4m3: true
  fp8_hybrid: true
  fp8_margin: 0
  fp8_interval: 1
  fp8_amax_history_len: 1024
  fp8_amax_compute_algo: max
  ub_tp_comm_overlap: false
  use_flash_attention: true
  overlap_p2p_comm: true
  batch_p2p_comm: false
  gc_interval: 100
  optim:
    name: distributed_fused_adam
    lr: 0.00015
    weight_decay: 0.1
    betas:
    - 0.9
    - 0.95
    bucket_cap_mb: 125
    overlap_grad_sync: true
    overlap_param_sync: true
    contiguous_grad_buffer: true
    contiguous_param_buffer: true
    grad_sync_dtype: bf16
  sched:
    name: CosineAnnealing
    warmup_steps: 2000
    constant_steps: 11873
    min_lr: 1.0e-05
  data:
    data_impl: mock
    splits_string: 90,8,2
    seq_length: 8192
    skip_warmup: true
    num_workers: 2
    dataloader_type: single
    reset_position_ids: false
    reset_attention_mask: false
    eod_mask_loss: false
    index_mapping_dir: null
    data_prefix: []
  nsys_profile:
    enabled: false
    start_step: 17
    end_step: 19
    ranks:
    - 0
    - 8
    gen_shape: false
  fp8_params: true
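The model section of the new NeMo config fixes micro_batch_size: 1, global_batch_size: 1024, tensor_model_parallel_size: 2 and pipeline_model_parallel_size: 4; together with the world size, these determine the data-parallel degree and the number of gradient-accumulation micro-batches per step. A worked sketch of that arithmetic, assuming for illustration a 256-GPU run (32 A3 Ultra nodes × trainer.devices = 8; only the per-node GPU count appears in the file itself):

```python
# Worked example of how the parallelism settings above compose.
# The 256-GPU world size is an assumption for illustration.
num_gpus = 256                 # assumed: 32 nodes x 8 GPUs per node
tp = 2                         # model.tensor_model_parallel_size
pp = 4                         # model.pipeline_model_parallel_size
micro_batch_size = 1           # model.micro_batch_size
global_batch_size = 1024       # model.global_batch_size
seq_length = 8192              # model.encoder_seq_length

dp = num_gpus // (tp * pp)                                              # 256 / 8  = 32 replicas
micro_batches_per_step = global_batch_size // (micro_batch_size * dp)   # 1024 / 32 = 32
tokens_per_step = global_batch_size * seq_length                        # 8,388,608 tokens

print(f"data-parallel size     : {dp}")
print(f"micro-batches per step : {micro_batches_per_step}")
print(f"tokens per global step : {tokens_per_step:,}")
```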

src/utils/training_metrics/src/data_defs.py

Lines changed: 4 additions & 0 deletions
@@ -35,6 +35,10 @@
         "h200",
         "bf16",
     ): 989,
+    (
+        "h200",
+        "fp8",
+    ): 1978,
 }
 
 MODEL_FLOPS_PER_SAMPLE = {
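This hunk adds an H200 fp8 entry (1978 TFLOPS, twice the 989 TFLOPS bf16 peak) to the table of per-GPU peak throughput keyed by (chip, precision). A minimal sketch of how such a table is typically used to turn a measured step time into Model FLOPS Utilization (MFU); the function and the numbers in the usage line are hypothetical illustrations, not the repository's actual metrics code:

```python
# Hypothetical illustration of an MFU calculation backed by a
# (chip, precision) -> peak TFLOPS table like the one in data_defs.py.
PEAK_TFLOPS_PER_GPU = {
    ("h200", "bf16"): 989,
    ("h200", "fp8"): 1978,  # entry added by this commit
}

def mfu(model_tflops_per_step: float, step_time_s: float,
        num_gpus: int, chip: str, precision: str) -> float:
    """Fraction of theoretical peak throughput achieved per training step."""
    achieved_tflops_per_s = model_tflops_per_step / step_time_s
    peak_tflops_per_s = PEAK_TFLOPS_PER_GPU[(chip, precision)] * num_gpus
    return achieved_tflops_per_s / peak_tflops_per_s

# Made-up numbers purely to show the call shape, not a benchmark result.
print(f"MFU: {mfu(3.0e6, 12.0, 256, 'h200', 'fp8'):.2%}")
```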
