NVIDIA-NeMo · YianZhang · Aug 4, 2025 · Aug 4, 2025 · Aug 4, 2025 · Aug 5, 2025
diff --git a/.dockerignore b/.dockerignore
@@ -1,6 +1,8 @@
 # Adding to .gitignore helps reduce the size of your working_dir
 
-.git
+# Note: .git is no longer ignored, since the git history is valuable for
+#       identifying which commit this container was built from
+# .git
 *.out
 *.log
 *.tar

diff --git a/.github/workflows/cicd-main.yml b/.github/workflows/cicd-main.yml
@@ -162,13 +162,15 @@ jobs:
   build-container:
     if: ${{ needs.pre-flight.outputs.test_level != 'none' }}
     needs: [pre-flight]
-    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_build_container.yml@v0.30.0
+    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_build_container.yml@v0.52.0
     with:
       build-ref: ${{ github.sha }}
       image-name: nemo_rl_container
       dockerfile: docker/Dockerfile
       image-label: nemo-rl
       target: hermetic
+      build-contexts: |
+        nemo-rl=${{ github.run_id }}/
       build-args: |
         MAX_JOBS=32
         NEMO_RL_COMMIT=${{ github.sha }}

diff --git a/.gitignore b/.gitignore
@@ -34,6 +34,7 @@ hf_datasets_cache/
 datasets/
 docker/*
 !docker/Dockerfile
+!docker/Dockerfile.ngc_pytorch
 !docker/README.md
 wandb/
 checkpoints/

diff --git a/.gitmodules b/.gitmodules
@@ -1,7 +1,7 @@
 [submodule "3rdparty/NeMo"]
 	path = 3rdparty/NeMo-workspace/NeMo
 	url = https://github.com/NVIDIA/NeMo.git
-	branch = zhiyul/yukih/prepare-refit-info
+	branch = pjin/ashors/rl-qwen3-export
 	shallow = true
 [submodule "3rdparty/Megatron-LM"]
 	path = 3rdparty/Megatron-LM-workspace/Megatron-LM

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -3,11 +3,9 @@ repos:
     rev: v4.4.0
     hooks:
     - id: end-of-file-fixer
-      # only include python files
-      files: \.py$
+      types_or: [python, pyi] # Only include Python files.
     - id: trailing-whitespace
-      # only include python files
-      files: \.py$
+      types_or: [python, pyi] # Only include Python files.
 
   - repo: https://github.com/astral-sh/ruff-pre-commit
     rev: "v0.9.9" # Use the appropriate version
@@ -36,8 +34,15 @@ repos:
         exclude: '^\.github/'
         types: [file]
 
-  - repo: https://github.com/facebook/pyrefly
-    rev: 0.24.2
+  - repo: local
     hooks:
       - id: pyrefly-typecheck
-        files: \.py$
+        name: pyrefly check
+        entry: uv run --group dev pyrefly check
+        types_or: [python, pyi]
+        language: system
+        pass_filenames: false # Pyrefly reads config & project roots itself.
+        args: []
+        require_serial: true
+        additional_dependencies: []
+        minimum_pre_commit_version: "2.9.2"
diff --git a/3rdparty/NeMo-workspace/NeMo b/3rdparty/NeMo-workspace/NeMo
diff --git a/README.md b/README.md
@@ -105,41 +105,37 @@ sudo apt-get update
 sudo apt-get install cudnn-cuda-12
 ```
 
-Install `uv`.
-```sh
-# For faster setup and environment isolation, we use `uv`
-pip install uv
+For faster setup and environment isolation, we use [uv](https://docs.astral.sh/uv/).
+Follow [these instructions](https://docs.astral.sh/uv/getting-started/installation/) to install uv.
 
-# Initialize NeMo RL project virtual environment
-# NOTE: Please do not use -p/--python and instead allow uv venv to read it from .python-version
-#       This ensures that the version of python used is always what we prescribe.
+Then, initialize the NeMo RL project virtual environment:
+```sh
 uv venv
+```
+> [!NOTE]
+> Do not pass `-p/--python`; instead, let `uv venv` read the Python version from `.python-version`.
+> This ensures that the Python version used is always the one we prescribe.
 
-# If working outside a container, it can help to build flash-attn and warm the
-# uv cache before your first run. The NeMo RL Dockerfile will warm the uv cache
-# with flash-attn. See https://docs.nvidia.com/nemo/rl/latest/docker.html for
-# instructions if you are looking for the NeMo RL container.
+If working outside a container, it can help to build [flash-attn](https://github.com/Dao-AILab/flash-attention) and warm the uv cache before your first run.
+```sh
 bash tools/build-flash-attn-in-uv-cache.sh
-# If sucessful, you should see "✅ flash-attn successfully added to uv cache"
-
-# If you cannot install at the system level, you can install for your user with
-# pip install --user uv
-
-# Use `uv run` to launch all commands. It handles pip installing implicitly and
-# ensures your environment is up to date with our lock file.
-
-# Note that it is not recommended to activate the venv and instead use `uv run` since
-# it ensures consistent environment usage across different shells and sessions.
-# Example: uv run python examples/run_grpo_math.py
 ```
+> [!NOTE]
+> On the first install, `flash-attn` can take a while to build (~45 min with 48 CPU hyperthreads). Once built, it is stored in your uv cache directory, which makes subsequent installs much quicker.
+
+> [!TIP]
+> The NeMo RL Dockerfile will warm the uv cache with flash-attn.
+> See https://docs.nvidia.com/nemo/rl/latest/docker.html for instructions if you are looking for the NeMo RL container.
 
-**Important Notes:**
+If successful, you should see `✅ flash-attn successfully added to uv cache`.
 
-- Use the `uv run <command>` to execute scripts within the managed environment. This helps maintain consistency across different shells and sessions.
-- Ensure you have the necessary CUDA drivers and PyTorch installed compatible with your hardware.
-- On the first install, `flash-attn` can take a while to install (~45min with 48 CPU hyperthreads). After it is built once, it is cached in your `uv`'s cache dir making subsequent installs much quicker.
-- If you update your environment in `pyproject.toml`, it is necessary to force a rebuild of the virtual environments by setting `NRL_FORCE_REBUILD_VENVS=true` next time you launch a run.
-- **Reminder**: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.
+Use `uv run` to launch all commands. It handles pip installing implicitly and ensures your environment is up to date with our lock file.
+> [!NOTE]
+> - Activating the `venv` is not recommended; use `uv run <command>` instead to execute scripts within the managed environment.
+>   This ensures consistent environment usage across different shells and sessions. Example: `uv run python examples/run_grpo_math.py`
+> - Ensure that the CUDA drivers and PyTorch you have installed are compatible with your hardware.
+> - If you update your environment in `pyproject.toml`, it is necessary to force a rebuild of the virtual environments by setting `NRL_FORCE_REBUILD_VENVS=true` next time you launch a run.
+> - **Reminder**: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.
 
 ## Training Backends
 
@@ -413,13 +409,13 @@ uv run python examples/converters/convert_dcp_to_hf.py \
     --hf-ckpt-path results/grpo/hf
 ```
 
-If you have a model saved in Megatron format, you can use the following command to convert it to Hugging Face format prior to running evaluation:
+If you have a model saved in Megatron format, you can use the following command to convert it to Hugging Face format prior to running evaluation. This script requires mcore, so make sure to launch with the mcore extra:
 
 ```sh
 # Example for a GRPO checkpoint at step 170
-uv run python examples/converters/convert_megatron_to_hf.py \
+uv run --extra mcore python examples/converters/convert_megatron_to_hf.py \
     --config results/grpo/step_170/config.yaml \
-    --dcp-ckpt-path results/grpo/step_170/policy/weights/iter_0000000 \
+    --megatron-ckpt-path results/grpo/step_170/policy/weights/iter_0000000 \
     --hf-ckpt-path results/grpo/hf
 ```
 

diff --git a/docker/Dockerfile b/docker/Dockerfile
@@ -1,4 +1,14 @@
+# Usage:
+# Self-contained build (default: builds from main): docker buildx build -f docker/Dockerfile --tag <registry>/nemo-rl:latest --push .
+# Self-contained build (specific git ref): docker buildx build -f docker/Dockerfile --build-arg NRL_GIT_REF=r0.3.0 --tag <registry>/nemo-rl:r0.3.0 --push .
+# Self-contained build (remote NeMo RL source; no need for a local clone of NeMo RL): docker buildx build -f docker/Dockerfile --build-arg NRL_GIT_REF=r0.3.0 --tag <registry>/nemo-rl:r0.3.0 --push https://github.com/NVIDIA-NeMo/RL.git
+# Local NeMo RL source override: docker buildx build --build-context nemo-rl=. -f docker/Dockerfile --tag <registry>/nemo-rl:latest --push .
+
 ARG BASE_IMAGE=nvcr.io/nvidia/cuda-dl-base:25.05-cuda12.9-devel-ubuntu24.04
+FROM scratch AS nemo-rl
+ARG NRL_GIT_REF=main
+ADD --keep-git-dir=true https://github.com/NVIDIA-NeMo/RL.git#${NRL_GIT_REF} /
+
 FROM ${BASE_IMAGE} AS base
 
 # It is more convenient for users to run as root
@@ -65,8 +75,8 @@ VIRTUAL_ENV=$UV_PROJECT_ENVIRONMENT uv pip install --link-mode symlink flash-att
 EOF
 
 # First copy only the dependency files
-COPY pyproject.toml uv.lock ./
-COPY --link 3rdparty/ ./3rdparty/
+COPY --from=nemo-rl pyproject.toml uv.lock ./
+COPY --from=nemo-rl --link 3rdparty/ ./3rdparty/
 
 RUN <<"EOF" bash -exu
 # uv sync has a more reliable resolver than simple uv pip install which can fail
@@ -100,7 +110,11 @@ LABEL com.nvidia.build.ref="${NVIDIA_BUILD_REF}"
 
 ENV NEMO_RL_VENV_DIR=/opt/ray_venvs
 
-# Copy in source and prefetch all virtual environments
-COPY . /opt/nemo-rl
+# Copy in source from build context (defaults to cloned repo, can be overridden)
+COPY --from=nemo-rl . /opt/nemo-rl
+# Unshallow the repo to get the full history (in the case it was from the scratch layer).
+# Potentially not necessary if the repo is passed in as a complete repository (w/ full git history),
+# so do a quick check before trying to unshallow.
+RUN git rev-parse --is-shallow-repository | grep -q true && git fetch --unshallow || true
 RUN UV_LINK_MODE=symlink uv run nemo_rl/utils/prefetch_venvs.py
 
diff --git a/docker/Dockerfile.ngc_pytorch b/docker/Dockerfile.ngc_pytorch
@@ -0,0 +1,128 @@
+# This Dockerfile is used to build a Docker image for NeMo RL with the NGC PyTorch base image.
+# However, it is still a work in progress and is not yet ready for production use.
+#
+# Usage:
+# Self-contained build (default: builds from main): docker buildx build -f docker/Dockerfile.ngc_pytorch --tag <registry>/nemo-rl:latest --push .
+# Self-contained build (specific git ref): docker buildx build -f docker/Dockerfile.ngc_pytorch --build-arg NRL_GIT_REF=r0.3.0 --tag <registry>/nemo-rl:r0.3.0 --push .
+# Self-contained build (remote NeMo RL source; no need for a local clone of NeMo RL): docker buildx build -f docker/Dockerfile.ngc_pytorch --build-arg NRL_GIT_REF=r0.3.0 --tag <registry>/nemo-rl:r0.3.0 --push https://github.com/NVIDIA-NeMo/RL.git
+# Local NeMo RL source override: docker buildx build --build-context nemo-rl=. -f docker/Dockerfile.ngc_pytorch --tag <registry>/nemo-rl:latest --push .
+#
+# If installing new dependencies in the container, then use "uv pip install new-dependency"
+ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:25.06-py3
+FROM scratch AS nemo-rl
+ARG NRL_GIT_REF=main
+ADD --keep-git-dir=true https://github.com/NVIDIA-NeMo/RL.git#${NRL_GIT_REF} /
+
+FROM ${BASE_IMAGE} AS base
+
+# It is more convenient for users to run as root
+USER root
+
+RUN <<"EOF" bash -exu -o pipefail
+export DEBIAN_FRONTEND=noninteractive
+export TZ=America/Los_Angeles
+
+apt-get update
+apt-get install -y --no-install-recommends \
+    jq \
+    curl \
+    git \
+    rsync \
+    wget \
+    less \
+    vim \
+
+
+apt-get clean
+rm -rf /var/lib/apt/lists/*
+EOF
+
+# Install uv at /usr/local/bin in case the root home directory is bind mounted
+ARG UV_VERSION=0.7.2
+RUN curl -LsSf https://astral.sh/uv/${UV_VERSION}/install.sh | XDG_BIN_HOME=/usr/local/bin sh
+
+# Disable usage stats by default for users who are sensitive to sharing usage.
+# Users are encouraged to enable if they wish.
+ENV RAY_USAGE_STATS_ENABLED=0
+ENV NEMO_RL_VENV_DIR=/opt/ray_venvs
+
+# Build vLLM from source to use with the NVIDIA PyTorch base image
+FROM base AS build_vllm
+
+ARG MAX_JOBS=32
+WORKDIR /opt
+COPY --from=nemo-rl uv.lock /tmp/uv.lock
+
+RUN <<"EOF" bash -exu
+echo "Building vLLM from source for PyTorch base image"
+VLLM_VERSION=$(grep -A 1 'name = "vllm"' /tmp/uv.lock | grep 'version =' | sed 's/version = "\(.*\)"/\1/') && \
+echo "Building vLLM version: $VLLM_VERSION"
+git clone https://github.com/vllm-project/vllm.git
+cd vllm
+git checkout v$VLLM_VERSION
+python use_existing_torch.py
+pip install -r requirements/build.txt
+pip wheel --no-deps --no-build-isolation -v .
+EOF
+
+FROM base AS hermetic
+
+WORKDIR /opt/nemo-rl
+
+# Variables to control the build of TE. If there are issues with parallelization, consider
+# setting these to 1.
+ARG MAX_JOBS
+ARG NVTE_BUILD_THREADS_PER_JOB
+
+ENV UV_PROJECT_ENVIRONMENT=/opt/nemo_rl_venv
+ENV UV_CACHE_DIR=/opt/uv_cache
+ENV UV_LINK_MODE=copy
+
+# Define the no-install-package arguments for PyTorch base images
+ARG BASE_IMAGE
+ARG UV_NO_INSTALL_PACKAGES="--no-install-package torch --no-install-package torchvision --no-install-package triton --no-install-package nvidia-cublas-cu12 --no-install-package nvidia-cuda-cupti-cu12 --no-install-package nvidia-cuda-nvrtc-cu12 --no-install-package nvidia-cuda-runtime-cu12 --no-install-package nvidia-cudnn-cu12 --no-install-package nvidia-cufft-cu12 --no-install-package nvidia-cufile-cu12 --no-install-package nvidia-curand-cu12 --no-install-package nvidia-cusolver-cu12 --no-install-package nvidia-cusparse-cu12 --no-install-package nvidia-cusparselt-cu12 --no-install-package nvidia-nccl-cu12 --no-install-package vllm --no-install-package flash-attn --no-install-package transformer-engine --no-install-package transformer-engine-cu12 --no-install-package transformer-engine-torch --no-install-package numpy"
+ENV UV_NO_INSTALL_PACKAGES=${UV_NO_INSTALL_PACKAGES}
+ENV PATH="/opt/nemo_rl_venv/bin:$PATH"
+
+# First copy only the dependency files
+COPY --from=nemo-rl pyproject.toml uv.lock ./
+COPY --from=nemo-rl --link 3rdparty/ ./3rdparty/
+
+
+RUN --mount=type=bind,from=build_vllm,source=/opt/,target=/tmp/build_vllm/ <<"EOF" bash -exu
+
+# uv sync has a more reliable resolver than simple uv pip install which can fail
+# The venv is symlinked to avoid bloating the layer size
+uv venv --system-site-packages ${UV_PROJECT_ENVIRONMENT}
+uv pip install --no-cache-dir --no-deps /tmp/build_vllm/vllm/vllm*.whl
+uv sync --link-mode symlink --locked --inexact --extra vllm --extra mcore --extra automodel --all-groups --no-install-project $UV_NO_INSTALL_PACKAGES
+EOF
+
+ENV NEMO_RL_VENV_DIR=/opt/ray_venvs
+
+WORKDIR /opt/nemo-rl
+
+FROM hermetic AS release
+
+ARG NEMO_RL_COMMIT
+ARG NVIDIA_BUILD_ID
+ARG NVIDIA_BUILD_REF
+ENV UV_NO_SYNC=1
+ENV NEMO_RL_COMMIT=${NEMO_RL_COMMIT:-<unknown>}
+ENV NVIDIA_BUILD_ID=${NVIDIA_BUILD_ID:-<unknown>}
+ENV NVIDIA_BUILD_REF=${NVIDIA_BUILD_REF:-<unknown>}
+ENV NEMO_RL_PY_EXECUTABLES_SYSTEM=1
+# The 25.06 PyTorch container is not compatible with vLLM standalone compile, so we disable it
+ENV VLLM_USE_STANDALONE_COMPILE=0
+LABEL com.nvidia.build.id="${NVIDIA_BUILD_ID}"
+LABEL com.nvidia.build.ref="${NVIDIA_BUILD_REF}"
+
+ENV NEMO_RL_VENV_DIR=/opt/ray_venvs
+
+# Copy in source from build context (defaults to cloned repo, can be overridden)
+COPY --from=nemo-rl . /opt/nemo-rl
+# Unshallow the repo to get the full history (in the case it was from the scratch layer).
+# Potentially not necessary if the repo is passed in as a complete repository (w/ full git history),
+# so do a quick check before trying to unshallow.
+RUN git rev-parse --is-shallow-repository | grep -q true && git fetch --unshallow || true
+RUN UV_LINK_MODE=symlink uv sync --locked --inexact $UV_NO_INSTALL_PACKAGES
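
Reviewer note: the `build_vllm` stage above derives the vLLM tag from `uv.lock` with a `grep -A 1 ... | grep ... | sed ...` pipeline, which relies on the `version = "..."` line directly following `name = "vllm"`. The sketch below expresses the same lookup in Python to make that assumption explicit. It is illustrative only and not part of this PR; the helper name `locked_version` and the sample lock text (including the `0.0.0` version) are hypothetical.

```python
import re


def locked_version(lock_text, package):
    """Return the version recorded for `package` in uv.lock-style text, or None.

    Assumes, like the grep/sed pipeline, that the `version = "..."` line
    immediately follows the `name = "<package>"` line.
    """
    pattern = rf'name = "{re.escape(package)}"\nversion = "([^"]+)"'
    match = re.search(pattern, lock_text)
    return match.group(1) if match else None


# Hypothetical lock-file excerpt; the version numbers are placeholders.
SAMPLE_LOCK = '''
[[package]]
name = "vllm"
version = "0.0.0"

[[package]]
name = "other-package"
version = "1.2.3"
'''

vllm_version = locked_version(SAMPLE_LOCK, "vllm")
missing_version = locked_version(SAMPLE_LOCK, "not-locked")
print(vllm_version, missing_version)
```

Returning `None` for a missing package makes the failure explicit, whereas the shell pipeline would yield an empty `VLLM_VERSION` and only fail later at `git checkout v`.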
diff --git a/docker/README.md b/docker/README.md
@@ -3,8 +3,8 @@ NOTE: *We use `docker buildx` instead of `docker build` for these containers*
 
 This directory contains the `Dockerfile` for NeMo-RL Docker images.
 You can build two types of images:
-- A **base image**: A minimal image where Python dependencies can be specified at runtime.
-- A **hermetic image**: An image that includes default dependencies for offline use.
+- A **release image** (recommended): Contains everything from the hermetic image, plus the nemo-rl source code and pre-fetched virtual environments for isolated workers.
+- A **hermetic image**: Includes the base image plus pre-fetched NeMo RL Python packages in the `uv` cache.
 
 
 For detailed instructions on building these images, please see [docs/docker.md](../docs/docker.md).
diff --git a/docs/adding-new-models.md b/docs/adding-new-models.md
@@ -152,3 +152,42 @@ uv run --extra vllm tools/model_diagnostics/2.long_generation_decode_vs_prefill.
 # ...
 # [Qwen/Qwen2.5-1.5B] ALL GOOD!
 ```
+
+## [3.check_hf_model_embeddings_untrained.py](https://github.com/NVIDIA-NeMo/RL/blob/main/tools/model_diagnostics/3.check_hf_model_embeddings_untrained.py)
+
+Detects untrained or improperly initialized Hugging Face model embeddings by scanning for near-zero rows and rows with near-identical values in both input and output embeddings. The script also reports whether word embeddings are tied and summarizes basic statistics.
+
+```sh
+# Example run
+uv run --extra mcore tools/model_diagnostics/3.check_hf_model_embeddings_untrained.py --model nvidia/Nemotron-H-8B-Base-8K
+
+# ....
+#================================================================================
+#EMBEDDING SUMMARIES
+#================================================================================
+#
+#--- Input Embeddings Summary ---
+#Shape: torch.Size([131072, 4096]), Dtype: torch.bfloat16
+#Near-zero embeddings (abs < 1.00e-10): 1039/131072 (0.8%)
+#  Indices: 0-1,3-999,1192-1193,1245-1255,55014,77579,81772,81819,82312,82500,82725,82737,82977,84020,84121,84521,84794,85015,86409,87411,89412,90320,91368,94485,96385,104097,108262,112147,112327,112497,114755
+#Identical embeddings (std < 1.00e-08): 1041/131072 (0.8%)
+#  Indices: 0-1,3-999,1192-1193,1245-1255,55014,77579,81772,81819,82312,82500,82725,82737,82977,83855,84020,84121,84521,84794,85015,86409,87411,89412,90320,91368,94485,96385,101707,104097,108262,112147,112327,112497,114755
+#Statistics: mean_abs=0.007874, max_abs=0.196289, std_range=[0.000000, 0.015442]
+#⚠️  POTENTIAL ISSUES: 1039 near-zero embeddings, 1041 identical embeddings
+#
+#--- Output Embeddings Summary (Tied: False) ---
+#Shape: torch.Size([131072, 4096]), Dtype: torch.bfloat16
+#Near-zero embeddings (abs < 1.00e-10): 0/131072 (0.0%)
+#Identical embeddings (std < 1.00e-08): 0/131072 (0.0%)
+#Statistics: mean_abs=0.006775, max_abs=0.200195, std_range=[0.004089, 0.021240]
+#✅ No obvious untrained patterns detected
+#
+#=== Final Summary ===
+#Model: nvidia/Nemotron-H-8B-Base-8K
+#Analysis complete.
+```
+
+- Thresholds can be adjusted via flags:
+  - `--near-zero-threshold` (default: `1e-10`)
+  - `--identical-threshold` (default: `1e-8`)
+- If any near-zero or identical rows are reported, the model may become numerically unstable (e.g., inf grad norms) during post-training if any of these problematic tokens are encountered. We have observed this when special tokens are reserved in the tokenizer and embedding but never encountered during pre-training. It may help to initialize these embeddings similarly to how they were initialized during pre-training.
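
Reviewer note: the two checks described in the new docs section (near-zero rows and rows with near-identical values) can be sketched in a few lines. This is a minimal illustration, not the implementation in `3.check_hf_model_embeddings_untrained.py`: it uses plain Python lists instead of tensors, and it assumes "near-zero" means the largest absolute value in a row is below the threshold and "identical" means the row's population standard deviation is below the threshold. The function name `find_suspect_rows` is hypothetical; the default thresholds mirror the documented flag defaults.

```python
from statistics import pstdev


def find_suspect_rows(rows, near_zero_threshold=1e-10, identical_threshold=1e-8):
    """Return (near_zero, identical) lists of row indices.

    near_zero: rows whose largest absolute value is below near_zero_threshold.
    identical: rows whose population std is below identical_threshold.
    Both criteria are assumptions about how the real script defines them.
    """
    near_zero, identical = [], []
    for idx, row in enumerate(rows):
        if max(abs(value) for value in row) < near_zero_threshold:
            near_zero.append(idx)
        if pstdev(row) < identical_threshold:
            identical.append(idx)
    return near_zero, identical


# Toy embedding matrix: row 0 is all zeros, row 1 is constant, row 2 looks trained.
rows = [
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
    [0.01, -0.02, 0.03, -0.04],
]
near_zero, identical = find_suspect_rows(rows)
print(near_zero, identical)
```

Under these definitions every near-zero row is also flagged as identical, which is consistent with the sample output above, where the identical count (1041) is slightly larger than the near-zero count (1039).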