
fix: Incompatible configuration between reward normalization and the loo#1519

Merged
terrykong merged 3 commits into main from ffrujeri/loo-normalization-bug
Nov 18, 2025
Conversation

Contributor

@ffrujeri ffrujeri commented Nov 13, 2025

What does this PR do?

Fixes numerical instability in advantage normalization when using leave-one-out baseline by gracefully skipping normalization for zero-variance samples.

Issues

Closes #1518

Changes

Summary

  • Modified normalize_advantages_with_epsilon() to skip normalization when std = 0, leaving those advantages (including nonzero ones) unchanged
  • This makes normalize_rewards=True and use_leave_one_out_baseline=True fully compatible
  • Added comprehensive tests to verify correct behavior in all edge cases
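The described change can be sketched as follows. This is an illustrative NumPy re-implementation, not the merged code: the real normalize_advantages_with_epsilon in nemo_rl/algorithms/grpo.py operates on torch tensors and may differ in signature and shapes.

```python
import numpy as np

def normalize_advantages_with_epsilon(advantages, std, epsilon=1e-6):
    # Sketch of the fixed behavior: only entries whose std is strictly
    # positive are divided by (std + epsilon); zero-std samples (e.g. a
    # leave-one-out group with identical rewards) are left unchanged.
    out = advantages.copy()          # avoid mutating the caller's array
    mask = std > 0
    out[mask] = out[mask] / (std[mask] + epsilon)
    return out

adv = np.array([0.0, 2.0, -1.0])     # sample 0 comes from a zero-variance group
std = np.array([0.0, 1.0, 0.5])
result = normalize_advantages_with_epsilon(adv, std)
# result[0] stays 0.0; result[1] and result[2] are normalized as before
```

With this shape, normalize_rewards=True no longer explodes when a leave-one-out group produces an exactly-zero standard deviation.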

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Summary by CodeRabbit

  • Bug Fixes

    • Modified advantage normalization to skip processing when standard deviation values are zero or near-zero, preventing potential numerical issues.
  • Tests

    • Added comprehensive test coverage for edge cases involving zero and near-zero standard deviation scenarios, including those from leave-one-out baseline calculations.

@ffrujeri ffrujeri force-pushed the ffrujeri/loo-normalization-bug branch from 4e271f4 to 5d9e0f9 on November 15, 2025 06:52
…patible with leave one out estimation.

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
@ffrujeri ffrujeri force-pushed the ffrujeri/loo-normalization-bug branch from 5d9e0f9 to 3bb7dbc on November 17, 2025 19:07
@ffrujeri ffrujeri marked this pull request as ready for review November 17, 2025 19:09
@ffrujeri ffrujeri requested review from a team as code owners November 17, 2025 19:09
@coderabbitai
Contributor

coderabbitai bot commented Nov 17, 2025

📝 Walkthrough

Walkthrough

The PR modifies the advantage normalization function in GRPO to conditionally apply normalization only when standard deviation is positive, skipping normalization for zero or near-zero std cases. This prevents normalized advantages from exploding when using leave-one-out baselines with identical reward samples. Both algorithm and test code are updated accordingly.

Changes

Advantage Normalization Logic (nemo_rl/algorithms/grpo.py):
Modified normalize_advantages_with_epsilon to compute a mask for positive std values and selectively normalize only those entries, leaving others unchanged. Prevents division by near-zero epsilon when std ≤ 0 with leave-one-out baselines. Updated docstring to reflect new behavior.

Test Coverage (tests/unit/algorithms/test_grpo.py):
Added comprehensive test cases for normalize_advantages_with_epsilon covering zero std scenarios, leave-one-out baseline edge cases, mixed std values, zero advantages with zero std, and very small non-zero std values to verify normalization occurs only when std > 0.
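The failure mode these tests guard against is easy to reproduce with plain arithmetic; the values below are illustrative, not taken from the repository:

```python
# A zero std paired with a nonzero advantage (the situation described in
# issue #1518) previously made advantage / (0 + epsilon) blow up by a
# factor of 1/epsilon = 1e6. The fix leaves such samples unnormalized.
epsilon = 1e-6
advantage, std = 0.5, 0.0              # illustrative values

exploded = advantage / (std + epsilon)  # old behavior: 500000.0
skipped = advantage                     # new behavior: left unchanged
```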

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10-15 minutes

  • Localized conditional logic change in a single utility function
  • Straightforward test additions with clear expected behaviors
  • Changes directly address a known incompatibility without architectural modifications
  • Recommend reviewing the mask application logic for edge cases with different tensor batch shapes

Possibly related PRs

  • PR #1516: Implements opposing zero/near-zero std handling in the same normalization function, providing context for alternative design approaches
  • PR #1423: Introduces the initial epsilon-based normalization logic that this PR refines to conditionally apply based on std values

Suggested labels

Run CICD

Suggested reviewers

  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name: Title check
Status: ⚠️ Warning
Explanation: The title is incomplete and vague, cutting off mid-sentence ('loo' appears truncated), making it unclear what incompatibility is being addressed.
Resolution: Complete and clarify the title to fully describe the fix, such as: 'fix: Skip advantage normalization when std=0 with leave-one-out baseline' or 'fix: Handle zero-std variance in advantage normalization with leave-one-out baseline'.
✅ Passed checks (5 passed)
  • Linked Issues check (✅ Passed): The PR implements Option 3 (Conditional Normalization) from issue #1518, modifying normalize_advantages_with_epsilon to skip normalization when std=0 or is near-zero, preventing numerical explosions while keeping both features compatible.
  • Out of Scope Changes check (✅ Passed): All changes directly address the incompatibility issue: modifying normalization logic in grpo.py and adding corresponding test cases to verify the fix handles zero/near-zero std scenarios correctly.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Test Results For Major Changes (✅ Passed): PR includes comprehensive test coverage for the numerical stability fix, with three specific tests addressing the zero standard deviation issue and additional edge case tests.
  • Description Check (✅ Passed): Check skipped - CodeRabbit's high-level summary is enabled.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
nemo_rl/algorithms/grpo.py (1)

655-660: In-place tensor modification may cause unexpected behavior.

The function modifies the advantages tensor in-place using advantages[non_zero_std_mask] = .... This mutates the input tensor, which could be surprising to callers expecting a pure function. Consider either:

  1. Documenting this mutation behavior clearly in the docstring
  2. Creating a copy: advantages = advantages.clone() at the start

Apply this diff to document the mutation:

     """Normalize advantages by standard deviation, skipping samples with zero std.

     When std is exactly zero (from leave-one-out baseline with identical rewards),
     normalization is skipped for those samples to prevent numerical instability.
     This makes normalize_rewards compatible with use_leave_one_out_baseline.

     Args:
         advantages: Tensor of shape (batch_size, 1) containing advantage values
+            Note: This tensor is modified in-place.
         std: Tensor of shape (batch_size,) containing standard deviation values
         epsilon: Small value to avoid division by very small std, defaults to 1e-6

     Returns:
         Normalized advantages tensor of same shape as input advantages
     """
tests/unit/algorithms/test_grpo.py (1)

1315-1316: Clarify comment: normalization is skipped, not performed.

The comment states "Sample 0: std=0 but advantage=0 -> normalize (gives 0)", but this is misleading. When std=0, the new implementation skips normalization entirely—the advantage remains unchanged at 0, rather than being "normalized to 0". While the result is the same, the reasoning matters for understanding the code's behavior.

Apply this diff to clarify:

-    # Sample 0: std=0 but advantage=0 -> normalize (gives 0)
+    # Sample 0: std=0 but advantage=0 -> skip normalization (remains 0)
     assert torch.allclose(result[0], torch.tensor([[0.0]]), rtol=1e-5)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6fc917f and 3bb7dbc.

📒 Files selected for processing (2)
  • nemo_rl/algorithms/grpo.py (1 hunks)
  • tests/unit/algorithms/test_grpo.py (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-09-20T14:59:08.052Z
Learning: If a change could affect numerics or convergence, include evidence in the PR description demonstrating no regression.
🧬 Code graph analysis (1)
tests/unit/algorithms/test_grpo.py (1)
nemo_rl/algorithms/grpo.py (1)
  • normalize_advantages_with_epsilon (636-660)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (5)
nemo_rl/algorithms/grpo.py (1)

655-660: Consider clarifying the threshold choice (std > 0 vs a small epsilon threshold).

The implementation uses std > 0 to determine which samples to normalize. This means that even std values like 1e-300 will be normalized (with epsilon protection), while exactly 0.0 will be skipped. This is mathematically clean but worth verifying:

  • Is this the intended behavior, or should there be a small threshold (e.g., std > 1e-10) to handle near-zero floating-point values?
  • The current approach treats exact zero specially, which aligns with the leave-one-out case producing exact zero variance

The current approach is likely correct for the stated use case, but please confirm that exact zero (not near-zero) is the intended condition for skipping normalization. If near-zero values should also be skipped, consider using a small threshold instead of comparing to zero.
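The distinction the reviewer raises can be seen numerically. The sketch below is a hypothetical NumPy stand-in for the torch implementation; it shows that an exactly-zero std is skipped while an arbitrarily small positive std is still divided through by roughly epsilon:

```python
import numpy as np

def normalize_advantages_with_epsilon(advantages, std, epsilon=1e-6):
    out = advantages.copy()
    mask = std > 0          # exact zero is skipped; any positive value is not
    out[mask] = out[mask] / (std[mask] + epsilon)
    return out

adv = np.array([1.0, 1.0])
std = np.array([0.0, 1e-300])  # exactly zero vs. a denormal-small positive std
result = normalize_advantages_with_epsilon(adv, std)
# result[0] stays 1.0; result[1] is divided by ~epsilon, i.e. scaled to ~1e6
```

Under a `std > 0` mask, only exact zero (the value leave-one-out produces for identical rewards) is treated specially; a threshold like `std > 1e-10` would additionally catch near-zero floating-point values.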

tests/unit/algorithms/test_grpo.py (4)

1240-1245: Test correctly updated for new normalization behavior.

The expected values have been properly updated to reflect that samples with std=0 now remain unchanged rather than being normalized. The comment accurately describes the expected behavior.


1256-1258: Test correctly handles all-zero std case.

The test properly validates that when all std values are zero, all advantages remain unchanged. This covers an important edge case for the leave-one-out baseline scenario.


1290-1305: Excellent test coverage for the leave-one-out baseline scenario.

This test directly addresses the root cause described in issue #1518. It properly validates:

  • Sample with std=0 keeps its advantage unchanged (preventing explosion)
  • Samples with std>0 are normalized correctly

The test documentation clearly explains the leave-one-out case, making it easy to understand the scenario being tested.


1325-1335: Good test for small non-zero std values, but consider numerical implications.

This test correctly validates that small but non-zero std values still trigger normalization (no threshold). However, note that for very small std values (e.g., 0.0001), the normalized advantage can still become quite large: advantage / (0.0001 + 1e-6) ≈ advantage / 0.0001 = 10,000 * advantage. While not as extreme as the previous 1/epsilon explosion, this could still produce large values in practice.

Based on learnings: Consider whether convergence testing has been performed with small positive std values to ensure numerical stability.
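The magnitude involved can be checked directly (illustrative numbers, assuming the epsilon default of 1e-6 described in the PR):

```python
# A std just above zero still produces a large, though bounded, scale factor:
epsilon = 1e-6
advantage, std = 1.0, 1e-4
scaled = advantage / (std + epsilon)   # about 9901, far below the old 1e6 blow-up
```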

terrykong previously approved these changes Nov 17, 2025
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
@ffrujeri ffrujeri added the CI:L1 Run doctests, unit tests, and functional tests label Nov 17, 2025
@ffrujeri ffrujeri changed the title fix: Incompatible configuration of normalize_rewards with use_leave_one_out_baseline fix: Incompatible configuration between reward normalization and use_leave_one_out_baseline Nov 17, 2025
@ffrujeri ffrujeri changed the title fix: Incompatible configuration between reward normalization and use_leave_one_out_baseline fix: Incompatible configuration between reward normalization and the loo Nov 17, 2025
@terrykong terrykong merged commit 775fc34 into main Nov 18, 2025
41 of 51 checks passed
@terrykong terrykong deleted the ffrujeri/loo-normalization-bug branch November 18, 2025 01:10
chtruong814 pushed a commit that referenced this pull request Nov 18, 2025
…loo (#1519)

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
…loo (NVIDIA-NeMo#1519)

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
DeL-TaiseiOzaki pushed a commit to DeL-TaiseiOzaki/RL that referenced this pull request Jan 8, 2026
…loo (NVIDIA-NeMo#1519)

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
…loo (NVIDIA-NeMo#1519)

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

Labels

CI:L1 Run doctests, unit tests, and functional tests r0.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Leave-One-Out Baseline and Advantage Normalization are incompatible

2 participants