Skip to content

cp: fix: Incompatible configuration between reward normalization and the loo (1519) into r0.4.0#1533

Merged
terrykong merged 1 commit intor0.4.0from
cherry-pick-1519-r0.4.0
Nov 18, 2025
Merged

cp: fix: Incompatible configuration between reward normalization and the loo (1519) into r0.4.0#1533
terrykong merged 1 commit intor0.4.0from
cherry-pick-1519-r0.4.0

Conversation

@chtruong814
Copy link
Copy Markdown
Contributor

@chtruong814 chtruong814 commented Nov 18, 2025

beep boop [🤖]: Hi @ffrujeri 👋,

we've cherry picked #1519 into  for you! 🚀

Please review and approve this cherry pick by your convenience!

Summary by CodeRabbit

  • Bug Fixes

    • Improved numerical stability in advantage normalization by safely handling edge cases with zero variance.
  • Tests

    • Enhanced test coverage for normalization logic, including comprehensive edge case scenarios with zero variance conditions.

…loo (#1519)

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@chtruong814 chtruong814 requested a review from a team as a code owner November 18, 2025 01:11
@chtruong814 chtruong814 requested a review from ffrujeri November 18, 2025 01:11
@chtruong814 chtruong814 requested a review from a team as a code owner November 18, 2025 01:11
@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Nov 18, 2025
@terrykong terrykong enabled auto-merge (squash) November 18, 2025 01:12
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Nov 18, 2025

📝 Walkthrough

Walkthrough

Modified the normalize_advantages_with_epsilon function in the GRPO algorithm to skip normalization for samples with zero standard deviation, preventing division by zero. Implemented using a mask-based selective division approach. Updated corresponding unit tests to validate edge cases for zero-std entries.

Changes

Cohort / File(s) Summary
Algorithm normalization logic
nemo_rl/algorithms/grpo.py
Modified normalize_advantages_with_epsilon to use mask-based selective normalization: skips division by zero for samples with std == 0, while applying normalization with epsilon stabilization to samples with std > 0.
Unit tests
tests/unit/algorithms/test_grpo.py
Updated existing test expectations and added new test cases to validate normalization behavior for edge cases: zero-std entries remaining unchanged, non-zero std entries normalizing with epsilon, and combinations with zero/small std values.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • The normalization logic change is localized to a single function with clear intent (mask-based division to avoid zero-std issues)
  • Test updates follow a consistent pattern validating the new behavior across edge cases
  • Changes are straightforward once the masking approach is understood; minimal interdependencies

Possibly related PRs

Suggested labels

CI:L1, r0.4.0

Suggested reviewers

  • ffrujeri
  • terrykong

Pre-merge checks and finishing touches

✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title references fixing incompatibility between reward normalization and leave-one-out baseline, which aligns with the actual changes normalizing only positive std samples.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Test Results For Major Changes ✅ Passed PR contains targeted bug fix with comprehensive test coverage for numerical stability in reward normalization with zero standard deviation.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch cherry-pick-1519-r0.4.0

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between abeff54 and cd16afd.

📒 Files selected for processing (2)
  • nemo_rl/algorithms/grpo.py (1 hunks)
  • tests/unit/algorithms/test_grpo.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/unit/algorithms/test_grpo.py (1)
nemo_rl/algorithms/grpo.py (1)
  • normalize_advantages_with_epsilon (545-569)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: sphinx-build / Build docs
  • GitHub Check: build-container / main
  • GitHub Check: Lint check
  • GitHub Check: Check if PR branch is up to date
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (4)
nemo_rl/algorithms/grpo.py (1)

545-569: LGTM! Clean implementation of the mask-based normalization.

The mask-based approach correctly handles the incompatibility between reward normalization and leave-one-out baseline by skipping normalization for samples with zero standard deviation. The logic is efficient and the docstring clearly explains the behavior.

tests/unit/algorithms/test_grpo.py (3)

1228-1241: Test correctly validates zero std handling.

The updated test properly validates that samples with std=0 remain unchanged while samples with std>0 are normalized with epsilon.


1244-1256: Good fix for in-place modification testing.

Properly captures expected values before the in-place modification by cloning. This ensures the test validates the actual behavior rather than comparing the modified tensor to itself.


1288-1343: Excellent test coverage for edge cases.

The new test functions provide comprehensive validation of the mask-based normalization:

  • Leave-one-out baseline compatibility is verified
  • Zero std with both zero and non-zero advantages are tested
  • Small non-zero std values are confirmed to be normalized (no thresholding)

All tests correctly use clone() to capture expected values before in-place modification.

Tip

📝 Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.

  • Provide your own instructions using the high_level_summary_instructions setting.
  • Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
  • Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

  1. 📝 Description — Summarize the main change in 50–60 words, explaining what was done.
  2. 📓 References — List relevant issues, discussions, documentation, or related PRs.
  3. 📦 Dependencies & Requirements — Mention any new/updated dependencies, environment variable changes, or configuration updates.
  4. 📊 Contributor Summary — Include a Markdown table showing contributions:
    | Contributor | Lines Added | Lines Removed | Files Changed |
  5. ✔️ Additional Notes — Add any extra reviewer context.
    Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@terrykong terrykong merged commit f3feb8c into r0.4.0 Nov 18, 2025
64 of 71 checks passed
@terrykong terrykong deleted the cherry-pick-1519-r0.4.0 branch November 18, 2025 05:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cherry-pick CI:L1 Run doctests, unit tests, and functional tests Run CICD

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants