fix: Incompatible configuration between reward normalization and the loo (#1519)
Conversation
Force-pushed 4e271f4 to 5d9e0f9
…patible with leave one out estimation. Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Force-pushed 5d9e0f9 to 3bb7dbc
📝 Walkthrough

The PR modifies the advantage normalization function in GRPO to conditionally apply normalization only when the standard deviation is positive, skipping normalization for zero or near-zero std cases. This prevents normalized advantages from exploding when using leave-one-out baselines with identical reward samples. Both algorithm and test code are updated accordingly.

Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10-15 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning) | ✅ Passed checks (5 passed)
Actionable comments posted: 0
🧹 Nitpick comments (2)
nemo_rl/algorithms/grpo.py (1)
655-660: In-place tensor modification may cause unexpected behavior.

The function modifies the `advantages` tensor in-place using `advantages[non_zero_std_mask] = ...`. This mutates the input tensor, which could be surprising to callers expecting a pure function. Consider either:

- Documenting this mutation behavior clearly in the docstring
- Creating a copy with `advantages = advantages.clone()` at the start

Apply this diff to document the mutation:
```diff
 """Normalize advantages by standard deviation, skipping samples with zero std.

 When std is exactly zero (from leave-one-out baseline with identical rewards),
 normalization is skipped for those samples to prevent numerical instability.
 This makes normalize_rewards compatible with use_leave_one_out_baseline.

 Args:
     advantages: Tensor of shape (batch_size, 1) containing advantage values
+        Note: This tensor is modified in-place.
     std: Tensor of shape (batch_size,) containing standard deviation values
     epsilon: Small value to avoid division by very small std, defaults to 1e-6

 Returns:
     Normalized advantages tensor of same shape as input advantages
 """
```

tests/unit/algorithms/test_grpo.py (1)
1315-1316: Clarify comment: normalization is skipped, not performed.

The comment states "Sample 0: std=0 but advantage=0 -> normalize (gives 0)", but this is misleading. When `std=0`, the new implementation skips normalization entirely; the advantage remains unchanged at 0, rather than being "normalized to 0". While the result is the same, the reasoning matters for understanding the code's behavior.

Apply this diff to clarify:
```diff
- # Sample 0: std=0 but advantage=0 -> normalize (gives 0)
+ # Sample 0: std=0 but advantage=0 -> skip normalization (remains 0)
  assert torch.allclose(result[0], torch.tensor([[0.0]]), rtol=1e-5)
```
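For context on the behavior the review is discussing, here is a minimal NumPy sketch of the masked, skip-on-zero-std normalization. It is a hypothetical standalone version written for illustration only: the actual `normalize_advantages_with_epsilon` in `nemo_rl/algorithms/grpo.py` operates on torch tensors, and the `.copy()` here sidesteps the in-place mutation concern raised above.

```python
import numpy as np

def normalize_advantages_sketch(advantages, std, epsilon=1e-6):
    """Normalize advantages by std, skipping samples whose std is exactly zero.

    advantages: array of shape (batch_size, 1)
    std: array of shape (batch_size,)
    """
    advantages = advantages.copy()  # avoid mutating the caller's array
    non_zero_std_mask = std > 0  # exact zero (leave-one-out, identical rewards) is skipped
    advantages[non_zero_std_mask] = (
        advantages[non_zero_std_mask]
        / (std[non_zero_std_mask][:, None] + epsilon)
    )
    return advantages

adv = np.array([[0.0], [2.0], [4.0]])
std = np.array([0.0, 1.0, 2.0])
print(normalize_advantages_sketch(adv, std))
```

The first row (std exactly zero) passes through unchanged, while the other rows are divided by their std plus epsilon.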
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
nemo_rl/algorithms/grpo.py (1 hunks)
tests/unit/algorithms/test_grpo.py (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-09-20T14:59:08.052Z
Learning: If a change could affect numerics or convergence, include evidence in the PR description demonstrating no regression.
🧬 Code graph analysis (1)
tests/unit/algorithms/test_grpo.py (1)
nemo_rl/algorithms/grpo.py (1)
normalize_advantages_with_epsilon (636-660)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (5)
nemo_rl/algorithms/grpo.py (1)
655-660: Consider clarifying the threshold choice (std > 0 vs a small epsilon threshold).

The implementation uses `std > 0` to determine which samples to normalize. This means that even std values like `1e-300` will be normalized (with epsilon protection), while exactly `0.0` will be skipped. This is mathematically clean but worth verifying:

- Is this the intended behavior, or should there be a small threshold (e.g., `std > 1e-10`) to handle near-zero floating-point values?
- The current approach treats exact zero specially, which aligns with the leave-one-out case producing exact zero variance

The current approach is likely correct for the stated use case, but please confirm that exact zero (not near-zero) is the intended condition for skipping normalization. If near-zero values should also be skipped, consider using a small threshold instead of comparing to zero.
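The distinction can be made concrete with a small standalone check (illustrative NumPy values, mirroring the `std > 0` mask discussed above):

```python
import numpy as np

epsilon = 1e-6
std_values = np.array([0.0, 1e-300, 1e-10, 0.5])

# Mask mirrors the reviewed condition: exact zero is skipped,
# while any positive value, however tiny, is normalized.
mask = std_values > 0
print(mask.tolist())  # [False, True, True, True]

# For std = 1e-300 the divisor collapses to epsilon, so the scale
# is bounded by 1/epsilon rather than overflowing.
divisors = std_values[mask] + epsilon
print((1.0 / divisors).tolist())
```

So even a denormal-range std is normalized, but the epsilon term keeps the resulting scale finite and bounded.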
tests/unit/algorithms/test_grpo.py (4)
1240-1245: Test correctly updated for new normalization behavior.

The expected values have been properly updated to reflect that samples with `std=0` now remain unchanged rather than being normalized. The comment accurately describes the expected behavior.
1256-1258: Test correctly handles all-zero std case.

The test properly validates that when all std values are zero, all advantages remain unchanged. This covers an important edge case for the leave-one-out baseline scenario.
1290-1305: Excellent test coverage for the leave-one-out baseline scenario.

This test directly addresses the root cause described in issue #1518. It properly validates:

- The sample with `std=0` keeps its advantage unchanged (preventing explosion)
- Samples with `std>0` are normalized correctly

The test documentation clearly explains the leave-one-out case, making it easy to understand the scenario being tested.
1325-1335: Good test for small non-zero std values, but consider numerical implications.

This test correctly validates that small but non-zero std values still trigger normalization (no threshold). However, note that for very small std values (e.g., `0.0001`), the normalized advantage can still become quite large: `advantage / (0.0001 + 1e-6) ≈ advantage / 0.0001 = 10,000 * advantage`. While not as extreme as the previous `1/epsilon` explosion, this could still produce large values in practice.

Based on learnings: Consider whether convergence testing has been performed with small positive std values to ensure numerical stability.
…loo (#1519) Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
What does this PR do?
Fixes numerical instability in advantage normalization when using leave-one-out baseline by gracefully skipping normalization for zero-variance samples.
Issues
Closes #1518
Changes
Summary
- Updated `normalize_advantages_with_epsilon()` to skip normalization when `std=0` and `advantages != 0`
- Makes `normalize_rewards=True` and `use_leave_one_out_baseline=True` fully compatible

Before your PR is "Ready for review"
Pre checks:
Summary by CodeRabbit
Bug Fixes
Tests