# Performance

As part of the NVIDIA NeMo Framework, NeMo RL provides optimal performance for reinforcement learning on generative AI models by incorporating the latest optimizations, such as refit optimizations, mixed-precision training, and off-policy training.

This page provides performance benchmarks for LLMs and VLMs using NeMo RL across different GPU systems and configurations.
## Nomenclature

- **GBS**: Global Batch Size
- **MBS**: Micro Batch Size
- **TP**: Tensor Parallel Size
- **PP**: Pipeline Parallel Size
- **CP**: Context Parallel Size
- **VP**: Virtual Pipeline Parallel Size
- **EP**: Expert Parallel Size
- **T-**: Training related
- **G-**: Generation related
- **Training backend**: NeMo RL has two training backends: Megatron and PyTorch DTensor. This performance summary currently only shows numbers from the Megatron backend.
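When reading the parallelism columns in the tables below, a useful sanity check is that the product of the data-, tensor-, pipeline-, and context-parallel sizes equals the total GPU count. The sketch below illustrates this generic relationship; it is not NeMo RL's actual configuration API.

```python
# Generic relationship between parallelism sizes and GPU count
# (an illustrative sketch, not NeMo RL's configuration API).

def world_size(dp: int, tp: int, pp: int, cp: int) -> int:
    """Total number of GPUs implied by a parallelism layout."""
    return dp * tp * pp * cp

# Example: a training layout with TP=4, CP=1, PP=4 on 32 GPUs implies DP=2.
assert world_size(dp=2, tp=4, pp=4, cp=1) == 32
```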

## Performance Metrics

Since reinforcement learning consists of training, generation, and the transition between the two, performance measurement reflects all three. Specifically, we track the following metrics:
- **Step time**: Time for each step, which includes training, generation, policy logprob computation, and refit time.
- **Tokens/sec/GPU**: The rate at which tokens are processed by a stage (such as training, generation, or refitting) on a single GPU:

  $$
  \text{Tokens/sec/GPU} = \frac{\text{Total Tokens Processed}}{\text{Time for Stage} \times \text{Number of GPUs}}
  $$

- **Training MFU**: Model FLOPs Utilization, the achieved model floating-point operations per second per GPU relative to the GPU's peak theoretical FLOP/s.

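The two throughput metrics above can be computed as follows. This is a minimal sketch: all input numbers are illustrative, and the 989e12 FLOP/s figure is an assumed dense-BF16 peak for an H100 SXM GPU, not a value from the benchmarks below.

```python
def tokens_per_sec_per_gpu(total_tokens: int, stage_time_s: float, num_gpus: int) -> float:
    """Per-GPU token throughput of a stage (training, generation, or refit)."""
    return total_tokens / (stage_time_s * num_gpus)

def training_mfu(model_flops: float, step_time_s: float, num_gpus: int,
                 peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved model FLOP/s per GPU divided by peak FLOP/s."""
    achieved = model_flops / (step_time_s * num_gpus)
    return achieved / peak_flops_per_gpu

# Illustrative numbers: 2,048 sequences averaging 1,000 tokens, processed in 80 s on 16 GPUs.
rate = tokens_per_sec_per_gpu(total_tokens=2_048 * 1_000, stage_time_s=80.0, num_gpus=16)
print(round(rate))  # 1600

# Illustrative numbers: 1e18 model FLOPs in a 100 s step on 16 GPUs, against an
# assumed 989e12 FLOP/s dense-BF16 peak per GPU (H100 SXM assumption).
util = training_mfu(model_flops=1e18, step_time_s=100.0, num_gpus=16,
                    peak_flops_per_gpu=989e12)
print(f"{util:.2f}")  # 0.63
```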
## Performance Summary for Large Language Models

Below are performance benchmarks for various large language models, organized by release version. These results were obtained using the performance recipes available [here](https://github.com/NVIDIA-NeMo/RL/tree/r0.4.0/examples/configs/recipes/llm/performance).

The performance data includes:

- **RL Performance**: Performance metrics for various model sizes and architectures across RL algorithms (currently GRPO; DAPO and PPO are planned), covering both on-policy and asynchronous (off-policy) runs
- **System Configurations**: Results across different GPU systems (currently DGX-H100; DGX-GB200 and DGX-B200 are planned)
- **Precision Options**: Performance comparisons between different precision modes (BF16, FP8)

---

## NeMo RL v0.4

* GRPO Dataset: [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2)
* System: DGX-H100
* Precision: Training BF16, Generation BF16
* Training Backend: Megatron-core

| Model | On/Off policy | T-Max Sequence Length | G-Average Seq Len | # GPUs | G-GBS | T-GBS | Generation [TP,PP] | Training [TP,CP,EP,PP,VPP] | Tokens/sec/GPU | Total Step Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3.1_8B | On policy  | 4,096 | 1,060 | 16  | 2,048 | 512 | [1,1]  | [1,1,1,1,1,2,n/a] | 1,562 | 97.7 |
| LLAMA3.1_8B | 1-step Off | 4,096 | 1,129 | 16  | 2,048 | 512 | [1,1]  | [1,1,1,1,1,2,n/a] | 2,161 | 74.6 |
| DeepSeek V3 | On policy  | 1,536 | 745   | 256 | 512   | 512 | [32,1] | [1,1,16,16,n/a]   | 11    | 154  |
| DeepSeek V3 | 1-step Off | 1,536 | 744   | 512 | 512   | 512 | [32,1] | [1,1,16,16,n/a]   | 11.0  | 77.9 |
| Qwen3-235B  | On policy  | 8,192 | 5,671 | 128 | 512   | 512 | [16,1] | [2,2,16,8,n/a]    | 45.7  | 506  |
| Qwen3-235B  | 1-step Off | 8,192 | 5,691 | 256 | 512   | 512 | [8,1]  | [4,1,16,8,n/a]    | 52.2  | 241  |
| Qwen3-30B3A | On policy  | 4,096 | 3,154 | 32  | 2,048 | 512 | [4,1]  | [2,1,8,1,n/a]     | 925   | 225  |
| Qwen3-30B3A | 1-step Off | 4,096 | 3,158 | 32  | 2,048 | 512 | [4,1]  | [2,1,8,1,n/a]     | 864   | 244  |
| Qwen3-32B   | On policy  | 4,096 | 3,206 | 32  | 2,048 | 512 | [4,1]  | [4,1,1,4,n/a]     | 540   | 393  |
| Qwen3-32B   | 1-step Off | 4,096 | 3,207 | 64  | 2,048 | 512 | [4,1]  | [4,1,1,4,n/a]     | 494   | 215  |

Note:

* All Mixture-of-Experts (MoE) model training uses dropless token routing (no tokens are dropped).
* The following metrics are averaged over 5 steps: G-Average Seq Len, Tokens/sec/GPU, and Total Step Time (s). Because of this averaging, the numbers in the table do not exactly match the equation stated in Performance Metrics above, but the difference is small.