Commit 50a72d2

Merge remote-tracking branch 'origin' into yifu/force_on_policy
2 parents: 8bd1d3f + cff17f8

65 files changed: +2879 additions, −1118 deletions


.github/CODEOWNERS

Lines changed: 2 additions & 0 deletions

```diff
@@ -59,3 +59,5 @@
 
 # Codeowners
 /.github/CODEOWNERS @nvidia-nemo/rl_maintainers
+
+/research/template_project @terrykong
```

docker/Dockerfile

Lines changed: 7 additions & 1 deletion

```diff
@@ -76,7 +76,9 @@ ENV TORCH_CUDA_ARCH_LIST="9.0 10.0"
 
 # First copy only the dependency files
 COPY --from=nemo-rl pyproject.toml uv.lock ./
+COPY --from=nemo-rl nemo_rl/__init__.py nemo_rl/package_info.py ./nemo_rl/
 COPY --from=nemo-rl tools/build-custom-vllm.sh ./tools/build-custom-vllm.sh
+COPY --from=nemo-rl --link research/ ./research/
 COPY --from=nemo-rl --link 3rdparty/ ./3rdparty/
 
 RUN <<"EOF" bash -exu
@@ -108,6 +110,8 @@ FROM hermetic AS release
 ARG NEMO_RL_COMMIT
 ARG NVIDIA_BUILD_ID
 ARG NVIDIA_BUILD_REF
+ARG RC_DATE=00.00
+ARG TARGETARCH
 ENV NEMO_RL_COMMIT=${NEMO_RL_COMMIT:-<unknown>}
 ENV NVIDIA_BUILD_ID=${NVIDIA_BUILD_ID:-<unknown>}
 ENV NVIDIA_BUILD_REF=${NVIDIA_BUILD_REF:-<unknown>}
@@ -123,4 +127,6 @@ COPY --from=nemo-rl . /opt/nemo-rl
 # so do a quick check before trying to unshallow.
 RUN git rev-parse --is-shallow-repository | grep -q true && git fetch --unshallow || true
 RUN UV_LINK_MODE=symlink uv run nemo_rl/utils/prefetch_venvs.py
-
+# NOTICES.txt file points to where the OSS source code is archived
+RUN echo "This distribution includes open source which is archived at the following URL: https://opensource.nvidia.com/oss/teams/nvidia/nemo-rl/${RC_DATE}:linux-${TARGETARCH}/index.html" > NOTICES.txt && \
+    echo "For further inquiries or assistance, contact us at oss-requests@nvidia.com" >> NOTICES.txt
```
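The NOTICES.txt generation added above can be exercised outside of a Docker build. This is a sketch: `RC_DATE=25.01` and `TARGETARCH=amd64` are hypothetical stand-ins for the build args, not values used by the commit.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical stand-ins for the Docker build args RC_DATE and TARGETARCH
RC_DATE=25.01
TARGETARCH=amd64

# Same two-line NOTICES.txt the release stage writes
echo "This distribution includes open source which is archived at the following URL: https://opensource.nvidia.com/oss/teams/nvidia/nemo-rl/${RC_DATE}:linux-${TARGETARCH}/index.html" > NOTICES.txt
echo "For further inquiries or assistance, contact us at oss-requests@nvidia.com" >> NOTICES.txt

cat NOTICES.txt
```

At build time the values would instead come from `docker build --build-arg RC_DATE=...`; `TARGETARCH` is normally populated automatically by BuildKit for the target platform.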

docker/Dockerfile.ngc_pytorch

Lines changed: 5 additions & 0 deletions

```diff
@@ -117,6 +117,8 @@ FROM hermetic AS release
 ARG NEMO_RL_COMMIT
 ARG NVIDIA_BUILD_ID
 ARG NVIDIA_BUILD_REF
+ARG RC_DATE=00.00
+ARG TARGETARCH
 ENV UV_NO_SYNC=1
 ENV NEMO_RL_COMMIT=${NEMO_RL_COMMIT:-<unknown>}
 ENV NVIDIA_BUILD_ID=${NVIDIA_BUILD_ID:-<unknown>}
@@ -136,3 +138,6 @@ COPY --from=nemo-rl . /opt/nemo-rl
 # so do a quick check before trying to unshallow.
 RUN git rev-parse --is-shallow-repository | grep -q true && git fetch --unshallow || true
 RUN UV_LINK_MODE=symlink uv sync --locked --inexact $UV_NO_INSTALL_PACKAGES
+# NOTICES.txt file points to where the OSS source code is archived
+RUN echo "This distribution includes open source which is archived at the following URL: https://opensource.nvidia.com/oss/teams/nvidia/nemo-rl/${RC_DATE}:linux-${TARGETARCH}/index.html" > NOTICES.txt && \
+    echo "For further inquiries or assistance, contact us at oss-requests@nvidia.com" >> NOTICES.txt
```

docs/Makefile

Lines changed: 1 addition & 1 deletion

```diff
@@ -46,7 +46,7 @@ ensure-docs-env:
 	@if [ ! -x "$(PYTHON)" ]; then \
 		echo "📦 Creating isolated docs environment..."; \
 		uv venv .venv; \
-		uv sync --no-config; \
+		uv sync --project ../pyproject.toml --group docs; \
 		echo "✅ Docs environment ready."; \
 		echo "📝 To activate it: $(ACTIVATE_CMD)"; \
 	fi
```

docs/about/performance-summary.md

Lines changed: 70 additions & 0 deletions (new file)

# Performance

As part of the NVIDIA NeMo Framework, NeMo RL provides optimized performance for reinforcement learning on generative AI models by incorporating the latest optimizations, such as refit optimizations, mixed-precision training, and off-policy training.

This page provides performance benchmarks for LLMs and VLMs using NeMo RL across different GPU systems and configurations.

## Nomenclature

- **GBS**: Global Batch Size
- **MBS**: Micro Batch Size
- **TP**: Tensor Parallel Size
- **PP**: Pipeline Parallel Size
- **CP**: Context Parallel Size
- **VP**: Virtual Pipeline Parallel Size
- **EP**: Expert Parallel Size
- **T-**: Training related
- **G-**: Generation related
- **Training backend**: NeMo RL has two training backends: Megatron and PyTorch DTensor. This performance summary currently only shows numbers from the Megatron backend.

## Performance Metrics

Since reinforcement learning consists of training, generation, and transitions between the two, the performance measurements reflect all three stages. Specifically, we track the following metrics:

- **Step time**: Time for each step, which includes training, generation, policy logprobs, and refit time.
- **Tokens/sec/GPU**: The rate at which tokens are processed by a stage (such as training, generation, or refitting) on a single GPU:

$$
\text{Tokens/sec/GPU} = \frac{\text{Total Tokens Processed}}{\text{Time for Stage} \times \text{Number of GPUs}}
$$

- **Training MFU**: Model FLOPs utilization, computed from the achieved model floating-point operations per second per GPU.
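The Tokens/sec/GPU definition above can be sketched directly. This is a minimal illustration; the numbers below are hypothetical, not measured values.

```python
def tokens_per_sec_per_gpu(total_tokens: int, stage_time_s: float, num_gpus: int) -> float:
    """Tokens/sec/GPU = total tokens processed / (stage time * number of GPUs)."""
    if stage_time_s <= 0 or num_gpus <= 0:
        raise ValueError("stage_time_s and num_gpus must be positive")
    return total_tokens / (stage_time_s * num_gpus)

# Hypothetical example: 2,170,880 tokens processed in 90 s on 16 GPUs
rate = tokens_per_sec_per_gpu(2_170_880, 90.0, 16)
print(round(rate, 1))  # → 1507.6
```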
## Performance Summary for Large Language Models

Below are performance benchmarks for various large language models, organized by release version. These results were obtained using the performance recipes available [here](https://github.com/NVIDIA-NeMo/RL/tree/r0.4.0/examples/configs/recipes/llm/performance).

The performance data includes:

- **RL Performance**: Performance metrics for various model sizes and architectures on different RL algorithms (currently GRPO; DAPO and PPO in the future), for both on-policy and asynchronous training.
- **System Configurations**: Results across different GPU systems (DGX-H100; DGX-GB200 and DGX-B200 in the future).
- **Precision Options**: Performance comparisons between different precision modes (BF16, FP8).

---

## NeMo RL v0.4

* GRPO Dataset: [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2)
* System: DGX-H100
* Precision: Training BF16, Generation BF16
* Training Backend: Megatron-core

| Model | On/Off policy | T-Max Sequence Length | G-Average Seq Len | # GPUs | G-GBS | T-GBS | Generation [TP,PP] | Training [TP,CP,EP,PP,VPP] | Tokens/sec/GPU | Total Step Time (s) |
|-------|---------------|-----------------------|-------------------|--------|-------|-------|--------------------|----------------------------|----------------|---------------------|
| LLAMA3.1_8B | On policy  | 4,096 | 1,060 | 16  | 2,048 | 512 | [1,1]  | [1,1,1,1,1,2,n/a] | 1,562 | 97.7 |
| LLAMA3.1_8B | 1-step Off | 4,096 | 1,129 | 16  | 2,048 | 512 | [1,1]  | [1,1,1,1,1,2,n/a] | 2,161 | 74.6 |
| DeepSeek V3 | On policy  | 1,536 | 745   | 256 | 512   | 512 | [32,1] | [1,1,16,16,n/a]   | 11    | 154  |
| DeepSeek V3 | 1-step Off | 1,536 | 744   | 512 | 512   | 512 | [32,1] | [1,1,16,16,n/a]   | 11.0  | 77.9 |
| Qwen3-235B  | On policy  | 8,192 | 5,671 | 128 | 512   | 512 | [16,1] | [2,2,16,8,n/a]    | 45.7  | 506  |
| Qwen3-235B  | 1-step Off | 8,192 | 5,691 | 256 | 512   | 512 | [8,1]  | [4,1,16,8,n/a]    | 52.2  | 241  |
| Qwen3-30B3A | On policy  | 4,096 | 3,154 | 32  | 2,048 | 512 | [4,1]  | [2,1,8,1,n/a]     | 925   | 225  |
| Qwen3-30B3A | 1-step Off | 4,096 | 3,158 | 32  | 2,048 | 512 | [4,1]  | [2,1,8,1,n/a]     | 864   | 244  |
| Qwen3-32B   | On policy  | 4,096 | 3,206 | 32  | 2,048 | 512 | [4,1]  | [4,1,1,4,n/a]     | 540   | 393  |
| Qwen3-32B   | 1-step Off | 4,096 | 3,207 | 64  | 2,048 | 512 | [4,1]  | [4,1,1,4,n/a]     | 494   | 215  |

Note:

* All Mixture-of-Experts (MoE) model training uses dropless token routing.
* The following metrics are averaged over 5 steps: G-Average Seq Len, Tokens/sec/GPU, Total Step Time (s). Because of this averaging, the numbers in the table do not exactly match the equation stated in Performance Metrics above, but the difference is small.
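The averaging caveat can be illustrated with hypothetical per-step numbers (not taken from the table): the mean of the per-step rates is close to, but not identical with, the formula applied to averaged totals.

```python
num_gpus = 16
# Hypothetical per-step measurements over 5 steps
tokens  = [2_100_000, 2_200_000, 2_150_000, 2_250_000, 2_180_000]
times_s = [95.0, 99.0, 96.5, 101.0, 97.0]

# What a per-step average reports: mean of the per-step rates
per_step_rates = [t / (s * num_gpus) for t, s in zip(tokens, times_s)]
avg_of_rates = sum(per_step_rates) / len(per_step_rates)

# The Tokens/sec/GPU formula applied to averaged totals instead
rate_of_averages = (sum(tokens) / len(tokens)) / ((sum(times_s) / len(times_s)) * num_gpus)

# The two agree only approximately
print(round(avg_of_rates, 2), round(rate_of_averages, 2))
```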

docs/fp8.md

Lines changed: 2 additions & 7 deletions

````diff
@@ -66,9 +66,9 @@ FP8 generations are recommended to be configured with the following settings:
 
 ## Compatibility Note for Deepseek-Style FP8 Training
 
-When using FP8 training with Deepseek-style FP8 (sub-channel scaling), be aware of the following compatibility issue:
+The TransformerEngine implementation for this recipe requires **CUDA version ≥ 12.9**. The latest nemo-rl depends on torch 2.8.0 + CUDA 12.9 (since this [commit](https://github.com/NVIDIA-NeMo/RL/commit/3f36d14b53e906b27c01c06e36dbbd2b8eb300cd)). Users should check out the latest code and build the container from `docker/Dockerfile` ([instructions](docker.md)).
 
-The TransformerEngine implementation for this recipe requires **cuBLAS version ≥ 12.9**. However, `nemo-rl` currently depends on **Torch 2.7.1**, which in turn requires **CUDA 12.8**. As a result, attempting to use the default setup will trigger the following error:
+If you are using nemo-rl from before this [commit](https://github.com/NVIDIA-NeMo/RL/commit/3f36d14b53e906b27c01c06e36dbbd2b8eb300cd), you will see the following error when trying to use FP8 training:
 
 ```
 File "/opt/ray_venvs/nemo_rl.models.policy.megatron_policy_worker.MegatronPolicyWorker/lib/python3.12/site-packages/transformer_engine/pytorch/fp8.py", line 646, in fp8_autocast
@@ -78,11 +78,6 @@ assert fp8_block_available, reason_for_no_fp8_block
 ^^^^^^^^^^^^^^^^^^^
 AssertionError: FP8 block scaled GEMM requires Hopper and CUDA >= 12.9.
 ```
-This issue will be resolved once the Torch version is upgraded to **≥ 2.8.0** (Please follow [#1122](https://github.com/NVIDIA-NeMo/RL/issues/1122) for more progress on the upgrade). In the meantime, you can enable Deepseek-style FP8 training using the following workaround:
-
-- **Build the NGC PyTorch container** from `docker/Dockerfile.ngc_pytorch`.
-  This setup uses the system Python environment, which includes **CUDA version 12.9 or higher**, meeting the requirements for TransformerEngine's FP8 implementation.
-
 
 
 ## Accuracy
````
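The precondition in the error message above (Hopper-class GPU and CUDA ≥ 12.9) can be checked up front. This helper is a sketch of the documented requirement, not TransformerEngine's actual gating logic, and the function name is made up here.

```python
def supports_fp8_block_scaling(cuda_version: str, compute_capability: tuple) -> bool:
    """Mirror the documented precondition: Hopper-class GPU (sm_90+) and CUDA >= 12.9."""
    major, minor = (int(p) for p in cuda_version.split(".")[:2])
    return (major, minor) >= (12, 9) and compute_capability >= (9, 0)

# With torch installed, one could pass torch.version.cuda and
# torch.cuda.get_device_capability() here.
print(supports_fp8_block_scaling("12.9", (9, 0)))  # True
print(supports_fp8_block_scaling("12.8", (9, 0)))  # False: CUDA too old
```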

docs/index.md

Lines changed: 3 additions & 0 deletions

````diff
@@ -170,6 +170,7 @@ Comprehensive reference for all NeMo RL modules, classes, functions, and methods
 :hidden:
 
 about/overview
+about/performance-summary
 about/features
 about/backends
 about/quick-start
@@ -180,6 +181,8 @@ about/clusters
 about/tips-and-tricks
 ```
 
+
+
 ```{toctree}
 :caption: Environment Start
 :hidden:
````

docs/pyproject.toml

Lines changed: 0 additions & 22 deletions
This file was deleted.

0 commit comments
