# Performance

As part of the NVIDIA NeMo Framework, NeMo RL provides optimal performance for reinforcement learning on generative AI models by incorporating the latest optimizations, such as refit optimizations, mixed-precision training, and off-policy training.

This page provides performance benchmarks for LLMs and VLMs using NeMo RL across different GPU systems and configurations.
## Nomenclature

- **GBS**: Global Batch Size
- **MBS**: Micro Batch Size
- **TP**: Tensor Parallel Size
- **PP**: Pipeline Parallel Size
- **CP**: Context Parallel Size
- **VP**: Virtual Pipeline Parallel Size
- **EP**: Expert Parallel Size
- **T-**: Training related
- **G-**: Generation related
- **Training backend**: NeMo RL has two training backends: Megatron and PyTorch DTensor. This performance summary currently only shows numbers from the Megatron backend.
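When reading the parallelism columns in the tables below, a useful sanity check is that the product of the data-, tensor-, pipeline-, and context-parallel sizes equals the total GPU count. The sketch below illustrates this generic relationship; it is not NeMo RL's actual configuration API.

```python
# Generic relationship between parallelism sizes and GPU count
# (an illustrative sketch, not NeMo RL's configuration API).

def world_size(dp: int, tp: int, pp: int, cp: int) -> int:
    """Total number of GPUs implied by a parallelism layout."""
    return dp * tp * pp * cp

# Example: a training layout with TP=4, CP=1, PP=4 on 32 GPUs implies DP=2.
assert world_size(dp=2, tp=4, pp=4, cp=1) == 32
```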

## Performance Metrics

Since reinforcement learning consists of training, generation, and the transition between the two, performance measurement reflects all three. Specifically, we track the following metrics:
- **Step time**: Time for each step, which includes training, generation, policy logprob computation, and refit time.
- **Tokens/sec/GPU**: The rate at which tokens are processed by a stage (such as training, generation, or refitting) on a single GPU:

  $$
  \text{Tokens/sec/GPU} = \frac{\text{Total Tokens Processed}}{\text{Time for Stage} \times \text{Number of GPUs}}
  $$

- **Training MFU**: Model FLOPs Utilization, the achieved model floating-point operations per second per GPU relative to the GPU's peak theoretical FLOP/s.

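The two throughput metrics above can be computed as follows. This is a minimal sketch: all input numbers are illustrative, and the 989e12 FLOP/s figure is an assumed dense-BF16 peak for an H100 SXM GPU, not a value from the benchmarks below.

```python
def tokens_per_sec_per_gpu(total_tokens: int, stage_time_s: float, num_gpus: int) -> float:
    """Per-GPU token throughput of a stage (training, generation, or refit)."""
    return total_tokens / (stage_time_s * num_gpus)

def training_mfu(model_flops: float, step_time_s: float, num_gpus: int,
                 peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved model FLOP/s per GPU divided by peak FLOP/s."""
    achieved = model_flops / (step_time_s * num_gpus)
    return achieved / peak_flops_per_gpu

# Illustrative numbers: 2,048 sequences averaging 1,000 tokens, processed in 80 s on 16 GPUs.
rate = tokens_per_sec_per_gpu(total_tokens=2_048 * 1_000, stage_time_s=80.0, num_gpus=16)
print(round(rate))  # 1600

# Illustrative numbers: 1e18 model FLOPs in a 100 s step on 16 GPUs, against an
# assumed 989e12 FLOP/s dense-BF16 peak per GPU (H100 SXM assumption).
util = training_mfu(model_flops=1e18, step_time_s=100.0, num_gpus=16,
                    peak_flops_per_gpu=989e12)
print(f"{util:.2f}")  # 0.63
```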
## Performance Summary for Large Language Models

Below are performance benchmarks for various large language models, organized by release version. These results were obtained using the performance recipes available [here](https://github.com/NVIDIA-NeMo/RL/tree/r0.4.0/examples/configs/recipes/llm/performance).

The performance data includes:

- **RL Performance**: Performance metrics for various model sizes and architectures across RL algorithms (currently GRPO; DAPO and PPO are planned), covering both on-policy and asynchronous (off-policy) runs
- **System Configurations**: Results across different GPU systems (currently DGX-H100; DGX-GB200 and DGX-B200 are planned)
- **Precision Options**: Performance comparisons between different precision modes (BF16, FP8)

---

## NeMo RL v0.4

* GRPO Dataset: [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2)
* System: DGX-H100
* Precision: Training BF16, Generation BF16
* Training Backend: Megatron-core

| Model | On/Off policy | T-Max Sequence Length | G-Average Seq Len | # GPUs | G-GBS | T-GBS | Generation [TP,PP] | Training [TP,CP,EP,PP,VPP] | Tokens/sec/GPU | Total Step Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3.1_8B | On policy  | 4,096 | 1,060 | 16  | 2,048 | 512 | [1,1]  | [1,1,1,1,1,2,n/a] | 1,562 | 97.7 |
| LLAMA3.1_8B | 1-step Off | 4,096 | 1,129 | 16  | 2,048 | 512 | [1,1]  | [1,1,1,1,1,2,n/a] | 2,161 | 74.6 |
| DeepSeek V3 | On policy  | 1,536 | 745   | 256 | 512   | 512 | [32,1] | [1,1,16,16,n/a]   | 11    | 154  |
| DeepSeek V3 | 1-step Off | 1,536 | 744   | 512 | 512   | 512 | [32,1] | [1,1,16,16,n/a]   | 11.0  | 77.9 |
| Qwen3-235B  | On policy  | 8,192 | 5,671 | 128 | 512   | 512 | [16,1] | [2,2,16,8,n/a]    | 45.7  | 506  |
| Qwen3-235B  | 1-step Off | 8,192 | 5,691 | 256 | 512   | 512 | [8,1]  | [4,1,16,8,n/a]    | 52.2  | 241  |
| Qwen3-30B3A | On policy  | 4,096 | 3,154 | 32  | 2,048 | 512 | [4,1]  | [2,1,8,1,n/a]     | 925   | 225  |
| Qwen3-30B3A | 1-step Off | 4,096 | 3,158 | 32  | 2,048 | 512 | [4,1]  | [2,1,8,1,n/a]     | 864   | 244  |
| Qwen3-32B   | On policy  | 4,096 | 3,206 | 32  | 2,048 | 512 | [4,1]  | [4,1,1,4,n/a]     | 540   | 393  |
| Qwen3-32B   | 1-step Off | 4,096 | 3,207 | 64  | 2,048 | 512 | [4,1]  | [4,1,1,4,n/a]     | 494   | 215  |

Note:

* All Mixture-of-Experts (MoE) model training uses dropless token routing (no tokens are dropped).
* The following metrics are averaged over 5 steps: G-Average Seq Len, Tokens/sec/GPU, and Total Step Time (s). Because of this averaging, the numbers in the table do not exactly match the equation stated in Performance Metrics above, but the difference is small.