
Commit 507bfb7

feat: add more PEFT lora recipes (#959)
Signed-off-by: Zhiyu Li <[email protected]>
1 parent 1645e61 commit 507bfb7

File tree: 6 files changed, +514 -2 lines changed

docs/performance-summary.md

Lines changed: 9 additions & 2 deletions
@@ -25,9 +25,12 @@ The table below shows finetuning (LoRA) performance for full sequences with no p
 | Model | #GPUs | GBS | MBS | LBS | GA | Seq Length | TP | PP | CP | EP | VP | FSDP | Kernel Optimizations | Time per Global Step (s) | Model TFLOPs/sec/GPU | Tokens/sec/GPU |
 |-------|------:|----:|----:|----:|---:|-----------:|---:|---:|---:|---:|---:|-----:|---------|-------------------------:|---------------------:|---------------:|
+| Llama3 8B | 1 | 32 | 2 | 2 | 16 | 4096 | 1 | 1 | 1 | - | 1 | 1 | - | 10.51 | 402 | 12472.87 |
+| Qwen2.5 7B | 1 | 32 | 2 | 2 | 16 | 4096 | 1 | 1 | 1 | - | 1 | 1 | - | 9.29 | 423 | 14110.05 |
 | Llama3 70B | 8 | 32 | 1 | 4 | 4 | 4096 | 2 | 4 | 1 | - | 10 | 1 | - | 26.92 | 176 | 608.42 |
-| Qwen2.5 32B | 8 | 32 | 1 | 8 | 2 | 4096 | 1 | 4 | 1 | - | 8 | 1 | - | 8.40 | 261 | 1950.93 |
-
+| Qwen2.5 32B | 8 | 32 | 1 | 8 | 2 | 4096 | 1 | 4 | 1 | - | 8 | 1 | 2 | 8.40 | 261 | 1950.93 |
+| Llama3 70B 2-node | 16 | 32 | 1 | 4 | 2 | 4096 | 2 | 4 | 1 | - | 10 | 1 | 2 | 12.78 | 185 | 640.95 |
+| Qwen2.5 32B 2-node | 16 | 32 | 1 | 8 | 1 | 4096 | 1 | 4 | 1 | - | 8 | 1 | 4 | 4.48 | 244 | 1826.49 |
 ## Glossary
 
 - **MFU**: Model FLOPs Utilization - ratio of achieved compute to peak hardware capability
@@ -55,8 +58,12 @@ All benchmark configurations are available in [`examples/benchmark/configs/`](ht
 - [`qwen3_moe_30b_te_deepep.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/benchmark/configs/qwen3_moe_30b_te_deepep.yaml) - Qwen3 MoE with TE + DeepEP
 - [`gptoss_20b_te_deepep.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/benchmark/configs/gptoss_20b_te_deepep.yaml) - GPT-OSS 20B with optimizations
 - [`gptoss_120b_te_deepep.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/benchmark/configs/gptoss_120b_te_deepep.yaml) - GPT-OSS 120B optimized
+- [`Llama_8b_lora.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/llm_finetune/llama3_1/llama3_1_8b_peft_benchmark.yaml) - Llama-8B Finetuning (LoRA) optimized
+- [`Qwen2_5_7b_lora.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/llm_finetune/qwen/qwen2_5_7b_peft_benchmark.yaml) - Qwen2.5-7B Finetuning (LoRA) optimized
 - [`Llama_70b_lora.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/llm_finetune/llama3_3/custom_llama3_3_70b_instruct_peft_benchmark.yaml) - Llama-70B Finetuning (LoRA) optimized
 - [`Qwen2_5_32b_lora.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/llm_finetune/qwen/qwen2_5_32b_peft_benchmark.yaml) - Qwen2.5-32B Finetuning (LoRA) optimized
+- [`Llama_70b_lora_2nodes.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/llm_finetune/llama3_3/custom_llama3_3_70b_instruct_peft_benchmark_2nodes.yaml) - Llama-70B Finetuning (LoRA) optimized on 2 nodes
+- [`Qwen2_5_32b_lora_2nodes.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/llm_finetune/qwen/qwen2_5_32b_peft_benchmark_2nodes.yaml) - Qwen2.5-32B Finetuning (LoRA) optimized on 2 nodes
 
 ---
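Note: the throughput columns in the table above can be cross-checked from the batch and timing columns. A minimal sketch of the arithmetic, assuming the usual definitions of tokens/sec/GPU and MFU rather than the exact formulas in the benchmark harness:

```python
# Sketch: how the table's throughput columns relate, using the Llama3 8B LoRA row.
# Assumes the standard definitions; not extracted from the benchmark code itself.

global_batch_size = 32          # GBS column
seq_length = 4096               # Seq Length column
num_gpus = 1                    # #GPUs column
time_per_global_step = 10.51    # Time per Global Step (s) column
model_tflops_per_gpu = 402      # Model TFLOPs/sec/GPU column
peak_tflops = 989               # H100 peak, as used by `benchmark.peak_tflops` in the recipes

tokens_per_sec_per_gpu = global_batch_size * seq_length / (time_per_global_step * num_gpus)
mfu = model_tflops_per_gpu / peak_tflops

print(f"tokens/sec/GPU ~ {tokens_per_sec_per_gpu:.0f}")  # ~12471, close to the table's 12472.87
print(f"MFU ~ {mfu:.2f}")                                # ~0.41 of H100 peak
```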
examples/llm_finetune/llama3_1/llama3_1_8b_peft_benchmark.yaml

Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# LoRA benchmark configuration for Llama-3.1-8B
# Uses LoRA adapters on all linear layers with a synthetic (mock) dataset
#
# To run this recipe, please use the following command:
# torchrun --nproc-per-node=1 nemo_automodel/recipes/llm/benchmark.py --config examples/llm_finetune/llama3_1/llama3_1_8b_peft_benchmark.yaml

seed: 42

# NEW: Add benchmark section
benchmark:
  warmup_steps: 5
  peak_tflops: 989  # H100: 989, A100: 312
  nsys_start: -1    # Set to step number to profile (e.g., 10)
  nsys_end: -1      # Set to end step (e.g., 15)
  nsys_ranks: []    # e.g., [0] to profile rank 0
  num_nodes: 1

step_scheduler:
  global_batch_size: 32
  local_batch_size: 2
  ckpt_every_steps: 50
  val_every_steps: 1000
  max_steps: 10

dist_env:
  backend: nccl
  timeout_minutes: 1

rng:
  _target_: nemo_automodel.components.training.rng.StatefulRNG
  seed: 42
  ranked: true

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.1-8B

peft:
  _target_: nemo_automodel.components._peft.lora.PeftConfig
  match_all_linear: true
  dim: 16
  alpha: 32
  dropout: 0.1
  use_triton: true

distributed:
  _target_: nemo_automodel.components.distributed.fsdp2.FSDP2Manager
  dp_size: none
  dp_replicate_size: 1
  tp_size: 1
  cp_size: 1
  sequence_parallel: false

loss_fn:
  _target_: nemo_automodel.components.loss.masked_ce.MaskedCrossEntropy

# Use MockIterableDataset for benchmarking (faster, no I/O)
dataset:
  _target_: nemo_automodel.components.datasets.llm.mock_iterable_dataset.MockIterableDataset
  vocab_size: 100
  seq_len: 4096
  num_samples: 1000000
  batch_size: 2

dataloader:
  _target_: torch.utils.data.DataLoader
  batch_size: null  # Dataset already yields batches
  # Note: model_config will be auto-injected by train_ft.py for PP models

# validation_dataset:
#   _target_: nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset
#   path_or_dataset_id: Muennighoff/natural-instructions
#   split: validation
#   column_mapping:
#     instruction: definition
#     question: inputs
#     answer: targets

optimizer:
  _target_: torch.optim.AdamW
  betas: [0.9, 0.999]
  eps: 1e-8
  lr: 1.0e-5
  weight_decay: 0.01

# Uncomment and configure for W&B logging
# wandb:
#   project: <your_wandb_project>
#   entity: <your_wandb_entity>
#   name: llama3_1_8b_squad_qlora
#   save_dir: <your_wandb_save_dir>
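Note: the recipe above pins only the global and local batch sizes; the gradient-accumulation (GA) value reported in the performance table follows from them and the data-parallel size. A small illustrative sketch, assuming the standard GBS = DP x LBS x GA relationship (the helper below is hypothetical, not part of nemo_automodel):

```python
# Hypothetical helper illustrating how GBS, LBS, DP and gradient accumulation (GA)
# relate for the single-GPU Llama-3.1-8B recipe above. Not part of nemo_automodel.

def grad_accum_steps(global_batch_size: int, local_batch_size: int, dp_size: int) -> int:
    # GBS must be evenly divisible across data-parallel ranks and accumulation steps.
    assert global_batch_size % (local_batch_size * dp_size) == 0
    return global_batch_size // (local_batch_size * dp_size)

# 1 GPU, GBS 32, LBS 2 -> GA 16, matching the GA column in the performance table.
print(grad_accum_steps(global_batch_size=32, local_batch_size=2, dp_size=1))  # 16
```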
examples/llm_finetune/llama3_3/custom_llama3_3_70b_instruct_peft_benchmark_2nodes.yaml

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
# Based on your existing config, modified for benchmarking

seed: 42

# NEW: Add benchmark section
benchmark:
  warmup_steps: 5
  peak_tflops: 989  # H100: 989, A100: 312
  nsys_start: -1    # Set to step number to profile (e.g., 10)
  nsys_end: -1      # Set to end step (e.g., 15)
  nsys_ranks: []    # e.g., [0] to profile rank 0
  num_nodes: 2

step_scheduler:
  global_batch_size: 32
  local_batch_size: 8
  ckpt_every_steps: 50
  val_every_steps: 1000
  max_steps: 10

dist_env:
  backend: nccl
  timeout_minutes: 1

rng:
  _target_: nemo_automodel.components.training.rng.StatefulRNG
  seed: 1111
  ranked: true

model:
  _target_: nemo_automodel.components.models.llama.model.build_llama_model
  pretrained_model_name_or_path: meta-llama/Llama-3.3-70B-Instruct
  torch_dtype: bf16

peft:
  _target_: nemo_automodel.components._peft.lora.PeftConfig
  match_all_linear: True
  dim: 16
  alpha: 32
  use_triton: True

checkpoint:
  enabled: false
  checkpoint_dir: checkpoints/
  model_save_format: safetensors
  save_consolidated: false

distributed:
  _target_: nemo_automodel.components.distributed.fsdp2.FSDP2Manager
  dp_size: 2
  tp_size: 2
  cp_size: 1
  pp_size: 4
  sequence_parallel: false
  activation_checkpointing: true

autopipeline:
  _target_: nemo_automodel.components.distributed.pipelining.autopipeline.AutoPipeline
  pp_schedule: interleaved1f1b
  pp_microbatch_size: 1
  layers_per_stage: 2
  scale_grads_in_schedule: false
  round_virtual_stages_to_pp_multiple: up
  dtype: bf16

loss_fn:
  _target_: nemo_automodel.components.loss.masked_ce.MaskedCrossEntropy

# Use MockIterableDataset for benchmarking (faster, no I/O)
dataset:
  _target_: nemo_automodel.components.datasets.llm.mock_iterable_dataset.MockIterableDataset
  vocab_size: 100
  seq_len: 4096
  num_samples: 1000000

dataloader:
  _target_: torch.utils.data.DataLoader
  batch_size: null  # Dataset already yields batches
  # Note: model_config will be auto-injected by train_ft.py for PP models

optimizer:
  _target_: torch.optim.Adam
  betas: [0.9, 0.999]
  eps: 1e-8
  lr: 1.0e-5
  weight_decay: 0

lr_scheduler:
  lr_decay_style: cosine
  min_lr: 1.0e-6
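Note: for the 2-node recipe above, the world size must equal the product of the parallelism degrees in the distributed block. A quick sanity check, assuming 8 GPUs per node (an assumption; the recipe itself only records num_nodes):

```python
# Sanity check of the 2-node Llama-3.3-70B parallel layout above.
# Assumes 8 GPUs per node; not part of the recipe code.

num_nodes = 2        # benchmark.num_nodes
gpus_per_node = 8    # assumption for typical H100 nodes
dp_size, tp_size, cp_size, pp_size = 2, 2, 1, 4  # distributed block above

world_size = num_nodes * gpus_per_node
assert world_size == dp_size * tp_size * cp_size * pp_size  # 16 == 16
print(f"world size {world_size} = dp {dp_size} x tp {tp_size} x cp {cp_size} x pp {pp_size}")
```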
examples/llm_finetune/qwen/qwen2_5_32b_peft_benchmark_2nodes.yaml

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
# Custom Qwen2.5-32B with combined projections for benchmarking
# Based on Qwen2_72B_peft_benchmark.yaml but using custom model implementation
# Similar to custom_llama3_3_70b_instruct_peft_benchmark.yaml

seed: 42

# Benchmark section
benchmark:
  warmup_steps: 5
  peak_tflops: 989  # H100: 989, A100: 312
  nsys_start: -1    # Set to step number to profile (e.g., 10)
  nsys_end: -1      # Set to end step (e.g., 15)
  nsys_ranks: []    # e.g., [0] to profile rank 0
  num_nodes: 2

step_scheduler:
  global_batch_size: 32
  local_batch_size: 8
  ckpt_every_steps: 50
  val_every_steps: 1000
  max_steps: 10

dist_env:
  backend: nccl
  timeout_minutes: 1

rng:
  _target_: nemo_automodel.components.training.rng.StatefulRNG
  seed: 1111
  ranked: true

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: Qwen/Qwen2.5-32B-Instruct
  torch_dtype: bf16

peft:
  _target_: nemo_automodel.components._peft.lora.PeftConfig
  match_all_linear: True
  dim: 16
  alpha: 32
  use_triton: True

checkpoint:
  enabled: false
  checkpoint_dir: checkpoints/
  model_save_format: safetensors
  save_consolidated: false

distributed:
  _target_: nemo_automodel.components.distributed.fsdp2.FSDP2Manager
  dp_size: 4
  tp_size: 1
  cp_size: 1
  pp_size: 4
  sequence_parallel: false
  activation_checkpointing: true

autopipeline:
  _target_: nemo_automodel.components.distributed.pipelining.autopipeline.AutoPipeline
  pp_schedule: interleaved1f1b
  pp_microbatch_size: 1
  layers_per_stage: 2
  scale_grads_in_schedule: false
  round_virtual_stages_to_pp_multiple: up
  dtype: bf16

loss_fn:
  _target_: nemo_automodel.components.loss.masked_ce.MaskedCrossEntropy

# Use MockIterableDataset for benchmarking (faster, no I/O)
dataset:
  _target_: nemo_automodel.components.datasets.llm.mock_iterable_dataset.MockIterableDataset
  vocab_size: 100
  seq_len: 4096
  num_samples: 1000000

dataloader:
  _target_: torch.utils.data.DataLoader
  batch_size: null  # Dataset already yields batches
  # Note: model_config will be auto-injected by train_ft.py for PP models

optimizer:
  _target_: torch.optim.Adam
  betas: [0.9, 0.999]
  eps: 1e-8
  lr: 1.0e-5
  weight_decay: 0

lr_scheduler:
  lr_decay_style: cosine
  min_lr: 1.0e-6
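Note: the VP (virtual pipeline) values in the performance table fall out of the autopipeline settings above: with layers_per_stage: 2 and pp_size: 4, each pipeline rank hosts num_layers / layers_per_stage / pp_size virtual stages. A rough check, with the layer counts (64 for Qwen2.5-32B, 80 for Llama-3.3-70B) taken as assumptions rather than from the recipes:

```python
# Rough check of the VP column in the performance table, derived from the
# autopipeline settings above. Layer counts are assumptions (Qwen2.5-32B: 64,
# Llama-3.3-70B: 80); the recipes themselves do not specify them.

def virtual_stages_per_rank(num_layers: int, layers_per_stage: int, pp_size: int) -> int:
    # Total pipeline stages = num_layers / layers_per_stage, split across pp ranks.
    return num_layers // layers_per_stage // pp_size

print(virtual_stages_per_rank(num_layers=64, layers_per_stage=2, pp_size=4))  # 8, Qwen2.5 32B rows
print(virtual_stages_per_rank(num_layers=80, layers_per_stage=2, pp_size=4))  # 10, Llama3 70B rows
```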
