cp: `fix: Incompatible configuration between reward normalization and the loo (1519)` into `r0.4.0` by chtruong814 · Pull Request #1533 · NVIDIA-NeMo/RL

chtruong814 · 2025-11-18T01:11:00Z

beep boop [🤖]: Hi @ffrujeri 👋,

we've cherry picked #1519 into  for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

Bug Fixes
- Improved numerical stability in advantage normalization by safely handling edge cases with zero variance.
Tests
- Enhanced test coverage for normalization logic, including comprehensive edge case scenarios with zero variance conditions.

…loo (#1519) Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

coderabbitai · 2025-11-18T01:14:00Z

Walkthrough

Modified the normalize_advantages_with_epsilon function in the GRPO algorithm to skip normalization for samples with zero standard deviation, preventing division by zero. Implemented using a mask-based selective division approach. Updated corresponding unit tests to validate edge cases for zero-std entries.
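A minimal sketch of the mask-based selective division described above, written with plain Python lists so it runs without dependencies. The function name `normalize_advantages_sketch` and the default `epsilon=1e-6` are assumptions for illustration; the function in `nemo_rl/algorithms/grpo.py` operates on tensors and may differ in detail.

```python
from typing import List


def normalize_advantages_sketch(
    advantages: List[float],
    std: List[float],
    epsilon: float = 1e-6,  # assumed default, not taken from the PR
) -> List[float]:
    """Divide each advantage by (std + epsilon), but only where std > 0.

    Entries whose std is exactly zero are left untouched. The list is
    modified in place and also returned.
    """
    if len(advantages) != len(std):
        raise ValueError("advantages and std must have the same length")
    for i, s in enumerate(std):
        if s > 0:  # the "mask": skip zero-std entries entirely
            advantages[i] = advantages[i] / (s + epsilon)
    return advantages


# Example: the first and last entries have std == 0 and stay unchanged.
adv = [1.0, -2.0, 0.5]
out = normalize_advantages_sketch(adv, [0.0, 2.0, 0.0])
```

With tensors, the same idea would typically be expressed with a boolean mask such as `std > 0` used to index both operands, rather than an explicit loop.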

Changes

| Cohort / File(s) | Summary |
|---|---|
| Algorithm normalization logic `nemo_rl/algorithms/grpo.py` | Modified `normalize_advantages_with_epsilon` to use mask-based selective normalization: skips division by zero for samples with std == 0, while applying normalization with epsilon stabilization to samples with std > 0. |
| Unit tests `tests/unit/algorithms/test_grpo.py` | Updated existing test expectations and added new test cases to validate normalization behavior for edge cases: zero-std entries remaining unchanged, non-zero std entries normalizing with epsilon, and combinations with zero/small std values. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

- The normalization logic change is localized to a single function with clear intent (mask-based division to avoid zero-std issues)
- Test updates follow a consistent pattern validating the new behavior across edge cases
- Changes are straightforward once the masking approach is understood; minimal interdependencies

Possibly related PRs

- feat: enhance advantages tracking and normalization stability in GRPO #1423: Introduced the original `normalize_advantages_with_epsilon` function with epsilon-based stability; this PR refines its behavior for zero-std edge cases.
- fix: Incompatible configuration between reward normalization and the loo #1519: Related modifications to the same function using boolean masking for zero-std handling to prevent division by zero.
- cp: feat: enhance advantages tracking and normalization stability in GRPO (1423) into r0.4.0 #1516: Earlier related work on epsilon-based normalization in the same function; this PR extends the logic for leave-one-out baseline scenarios.

Suggested labels

CI:L1, r0.4.0

Suggested reviewers

ffrujeri
terrykong

Pre-merge checks and finishing touches

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title references fixing incompatibility between reward normalization and leave-one-out baseline, which aligns with the actual changes normalizing only positive std samples. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Test Results For Major Changes | ✅ Passed | PR contains targeted bug fix with comprehensive test coverage for numerical stability in reward normalization with zero standard deviation. |

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between abeff54 and cd16afd.

📒 Files selected for processing (2)

nemo_rl/algorithms/grpo.py (1 hunks)
tests/unit/algorithms/test_grpo.py (3 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

tests/unit/algorithms/test_grpo.py (1)

nemo_rl/algorithms/grpo.py (1)

normalize_advantages_with_epsilon (545-569)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)

GitHub Check: sphinx-build / Build docs
GitHub Check: build-container / main
GitHub Check: Lint check
GitHub Check: Check if PR branch is up to date
GitHub Check: Lint check
GitHub Check: Lint check
GitHub Check: Lint check
GitHub Check: Post automodel integration comment / Comment on PR
GitHub Check: Post submodule check comment / Comment on PR

🔇 Additional comments (4)

nemo_rl/algorithms/grpo.py (1)

545-569: LGTM! Clean implementation of the mask-based normalization.

The mask-based approach correctly handles the incompatibility between reward normalization and leave-one-out baseline by skipping normalization for samples with zero standard deviation. The logic is efficient and the docstring clearly explains the behavior.
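One way to see why a leave-one-out baseline can clash with epsilon-only normalization is the sketch below. It assumes the per-sample baseline and std are both computed from the other samples in the group, and it uses a population std; both are assumptions made for illustration and are not confirmed by this PR. The helper `leave_one_out_stats` is hypothetical.

```python
import statistics
from typing import List, Tuple


def leave_one_out_stats(rewards: List[float]) -> List[Tuple[float, float]]:
    """Return (baseline, std) per sample, computed from the other samples only.

    Hypothetical helper; using the population std here is an assumption.
    """
    if len(rewards) < 2:
        raise ValueError("leave-one-out needs at least two samples")
    stats = []
    for i in range(len(rewards)):
        others = rewards[:i] + rewards[i + 1:]
        stats.append((statistics.fmean(others), statistics.pstdev(others)))
    return stats


rewards = [1.0, 0.0, 0.0]  # one prompt, three sampled responses
stats = leave_one_out_stats(rewards)
advantages = [r - baseline for r, (baseline, _) in zip(rewards, stats)]
stds = [std for _, std in stats]

epsilon = 1e-6
# Dividing every entry by (std + epsilon) inflates the zero-std sample.
naive = [a / (s + epsilon) for a, s in zip(advantages, stds)]
# Skipping entries with std == 0 keeps that sample's advantage as-is.
masked = [a / (s + epsilon) if s > 0 else a for a, s in zip(advantages, stds)]
```

Under these assumptions the first sample has a non-zero advantage but a zero std, because the two remaining rewards are identical. The unmasked division scales it by roughly `1 / epsilon`, while the masked version leaves it at its original value.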

tests/unit/algorithms/test_grpo.py (3)

1228-1241: Test correctly validates zero std handling.

The updated test properly validates that samples with std=0 remain unchanged while samples with std>0 are normalized with epsilon.

1244-1256: Good fix for in-place modification testing.

Properly captures expected values before the in-place modification by cloning. This ensures the test validates the actual behavior rather than comparing the modified tensor to itself.

1288-1343: Excellent test coverage for edge cases.

The new test functions provide comprehensive validation of the mask-based normalization:

- Leave-one-out baseline compatibility is verified
- Zero std with both zero and non-zero advantages are tested
- Small non-zero std values are confirmed to be normalized (no thresholding)

All tests correctly use clone() to capture expected values before in-place modification.
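The testing pattern described here can be sketched as follows, again with plain lists. The helper `normalize_in_place` and the test names are hypothetical and do not reproduce the tests in `tests/unit/algorithms/test_grpo.py`. Copying the list with `list(...)` plays the role that `clone()` plays for tensors: without the snapshot, `expected` would alias the object being modified and the comparison would pass trivially.

```python
import math
from typing import List


def normalize_in_place(
    advantages: List[float], std: List[float], epsilon: float = 1e-6
) -> List[float]:
    """Hypothetical stand-in: divide in place where std > 0, skip otherwise."""
    for i, s in enumerate(std):
        if s > 0:
            advantages[i] /= s + epsilon
    return advantages


def test_zero_std_entries_unchanged() -> None:
    advantages = [0.0, 3.0]  # zero and non-zero advantage, both with std == 0
    expected = list(advantages)  # snapshot BEFORE the in-place call
    result = normalize_in_place(advantages, [0.0, 0.0])
    assert result == expected


def test_small_nonzero_std_is_still_normalized() -> None:
    epsilon = 1e-6
    advantages = [2.0]
    expected = [2.0 / (1e-8 + epsilon)]  # computed from literals, not from `advantages`
    result = normalize_in_place(advantages, [1e-8], epsilon)
    # No thresholding: any std > 0 is normalized, however small.
    assert math.isclose(result[0], expected[0], rel_tol=1e-12)


if __name__ == "__main__":
    test_zero_std_entries_unchanged()
    test_small_nonzero_std_is_still_normalized()
```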

fix: Incompatible configuration between reward normalization and the …

cd16afd

…loo (#1519) Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

chtruong814 requested a review from a team as a code owner November 18, 2025 01:11

chtruong814 requested a review from ffrujeri November 18, 2025 01:11

chtruong814 added cherry-pick Run CICD labels Nov 18, 2025

terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Nov 18, 2025

terrykong enabled auto-merge (squash) November 18, 2025 01:12

terrykong approved these changes Nov 18, 2025

terrykong temporarily deployed to nemo-ci November 18, 2025 01:12 — with GitHub Actions Inactive

terrykong temporarily deployed to nemo-ci November 18, 2025 01:31 — with GitHub Actions Inactive

terrykong temporarily deployed to nemo-ci November 18, 2025 03:04 — with GitHub Actions Inactive

terrykong merged commit f3feb8c into r0.4.0 Nov 18, 2025
64 of 71 checks passed

terrykong deleted the cherry-pick-1519-r0.4.0 branch November 18, 2025 05:10

3 participants

Conversation

chtruong814 commented Nov 18, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Nov 18, 2025

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

chtruong814 commented Nov 18, 2025 •

edited by coderabbitai bot

Loading