
Commit 2cb541b

add llama|qwen multi-node peft benchmark datapoints
Signed-off-by: Zhiyu Li <[email protected]>
1 parent: 64df343 · commit: 2cb541b

File tree: 1 file changed (+7 −2 lines)


docs/performance-summary.md

Lines changed: 7 additions & 2 deletions
@@ -28,8 +28,9 @@ The table below shows finetuning (LoRA) performance for full sequences with no p
 | Llama3 8B | 1 | 32 | 2 | 2 | 16 | 4096 | 1 | 1 | 1 | - | 1 | 1 | - | 10.51 | 402 | 12472.87 |
 | Qwen2.5 7B | 1 | 32 | 2 | 2 | 16 | 4096 | 1 | 1 | 1 | - | 1 | 1 | - | 9.29 | 423 | 14110.05 |
 | Llama3 70B | 8 | 32 | 1 | 4 | 4 | 4096 | 2 | 4 | 1 | - | 10 | 1 | - | 26.92 | 176 | 608.42 |
-| Qwen2.5 32B | 8 | 32 | 1 | 8 | 2 | 4096 | 1 | 4 | 1 | - | 8 | 1 | - | 8.40 | 261 | 1950.93 |
-
+| Qwen2.5 32B | 8 | 32 | 1 | 8 | 2 | 4096 | 1 | 4 | 1 | - | 8 | 1 | 2 | 8.40 | 261 | 1950.93 |
+| Llama3 70B 2-node | 16 | 32 | 1 | 4 | 2 | 4096 | 2 | 4 | 1 | - | 10 | 1 | 2 | 12.78 | 185 | 640.95 |
+| Qwen2.5 32B 2-node | 16 | 32 | 1 | 8 | 1 | 4096 | 1 | 4 | 1 | - | 8 | 1 | 4 | 4.48 | 244 | 1826.49 |
 ## Glossary
 
 - **MFU**: Model FLOPs Utilization - ratio of achieved compute to peak hardware capability
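A note on reading the new rows: the glossary entry visible in the diff context above defines MFU as the ratio of achieved to peak compute, i.e. (restated as a formula for convenience; the diff does not show how achieved FLOPs are derived from the measured token throughput):

$$
\mathrm{MFU} = \frac{\text{achieved FLOPs/s}}{\text{peak hardware FLOPs/s}}
$$

In practice the numerator is usually estimated from measured tokens/s and a per-token FLOP count for the model, but the exact accounting behind these benchmark numbers is not part of this commit.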
@@ -57,8 +58,12 @@ All benchmark configurations are available in [`examples/benchmark/configs/`](ht
 - [`qwen3_moe_30b_te_deepep.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/benchmark/configs/qwen3_moe_30b_te_deepep.yaml) - Qwen3 MoE with TE + DeepEP
 - [`gptoss_20b_te_deepep.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/benchmark/configs/gptoss_20b_te_deepep.yaml) - GPT-OSS 20B with optimizations
 - [`gptoss_120b_te_deepep.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/benchmark/configs/gptoss_120b_te_deepep.yaml) - GPT-OSS 120B optimized
+- [`Llama_8b_lora.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/llm_finetune/llama3_1/llama3_1_8b_peft_benchmark.yaml) - Llama-8B Finetuning (LoRA) optimized
+- [`Qwen2_5_7b_lora.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/llm_finetune/qwen/qwen2_5_7b_peft_benchmark.yaml) - Qwen2.5-7B Finetuning (LoRA) optimized
 - [`Llama_70b_lora.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/llm_finetune/llama3_3/custom_llama3_3_70b_instruct_peft_benchmark.yaml) - Llama-70B Finetuning (LoRA) optimized
 - [`Qwen2_5_32b_lora.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/llm_finetune/qwen/qwen2_5_32b_peft_benchmark.yaml) - Qwen2.5-32B Finetuning (LoRA) optimized
+- [`Llama_70b_lora_2nodes.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/llm_finetune/llama3_3/custom_llama3_3_70b_instruct_peft_benchmark_2nodes.yaml) - Llama-70B Finetuning (LoRA) optimized on 2 nodes
+- [`Qwen2_5_32b_lora_2nodes.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/llm_finetune/qwen/qwen2_5_32b_peft_benchmark_2nodes.yaml) - Qwen2.5-32B Finetuning (LoRA) optimized on 2 nodes
 
 ---
 
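The two new 2-node rows correspond to the `*_2nodes.yaml` configs linked in the second hunk. A minimal launch sketch, assuming a standard torchrun rendezvous across both nodes and 8 GPUs per node; the entry-point script path and `--config` flag are illustrative assumptions, not taken from this commit, and only the YAML path comes from the diff:

```bash
# Hypothetical 2-node launch for the Llama-70B LoRA benchmark config added here.
# Script path and --config flag are assumptions; GPUs-per-node is illustrative.
# Run the same command on both nodes, changing --node_rank (0 on the head node, 1 on the other).
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --node_rank=0 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=<head-node-ip>:29500 \
  examples/llm_finetune/finetune.py \
  --config examples/llm_finetune/llama3_3/custom_llama3_3_70b_instruct_peft_benchmark_2nodes.yaml
```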
