fix: Incompatible configuration between reward normalization and the loo (#1519)
Conversation
Force-pushed 4e271f4 to 5d9e0f9
…patible with leave one out estimation. Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Force-pushed 5d9e0f9 to 3bb7dbc
📝 Walkthrough

The PR modifies the advantage normalization function in GRPO to conditionally apply normalization only when the standard deviation is positive, skipping normalization for zero or near-zero std cases. This prevents normalized advantages from exploding when using leave-one-out baselines with identical reward samples. Both algorithm and test code are updated accordingly.

Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10-15 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning) | ✅ Passed checks (5 passed)
Actionable comments posted: 0
🧹 Nitpick comments (2)
nemo_rl/algorithms/grpo.py (1)
655-660: In-place tensor modification may cause unexpected behavior.

The function modifies the `advantages` tensor in-place using `advantages[non_zero_std_mask] = ...`. This mutates the input tensor, which could be surprising to callers expecting a pure function. Consider either:

- Documenting this mutation behavior clearly in the docstring
- Creating a copy with `advantages = advantages.clone()` at the start

Apply this diff to document the mutation:
```diff
 """Normalize advantages by standard deviation, skipping samples with zero std.

 When std is exactly zero (from leave-one-out baseline with identical rewards),
 normalization is skipped for those samples to prevent numerical instability.
 This makes normalize_rewards compatible with use_leave_one_out_baseline.

 Args:
     advantages: Tensor of shape (batch_size, 1) containing advantage values
+        Note: This tensor is modified in-place.
     std: Tensor of shape (batch_size,) containing standard deviation values
     epsilon: Small value to avoid division by very small std, defaults to 1e-6

 Returns:
     Normalized advantages tensor of same shape as input advantages
 """
```

tests/unit/algorithms/test_grpo.py (1)
1315-1316: Clarify comment: normalization is skipped, not performed.

The comment states "Sample 0: std=0 but advantage=0 -> normalize (gives 0)", but this is misleading. When `std=0`, the new implementation skips normalization entirely; the advantage remains unchanged at 0, rather than being "normalized to 0". While the result is the same, the reasoning matters for understanding the code's behavior.

Apply this diff to clarify:
```diff
- # Sample 0: std=0 but advantage=0 -> normalize (gives 0)
+ # Sample 0: std=0 but advantage=0 -> skip normalization (remains 0)
  assert torch.allclose(result[0], torch.tensor([[0.0]]), rtol=1e-5)
```
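For context on the behavior the review is discussing, here is a minimal NumPy sketch of the masked, skip-on-zero-std normalization. It is a hypothetical standalone version written for illustration only: the actual `normalize_advantages_with_epsilon` in `nemo_rl/algorithms/grpo.py` operates on torch tensors, and the `.copy()` here sidesteps the in-place mutation concern raised above.

```python
import numpy as np

def normalize_advantages_sketch(advantages, std, epsilon=1e-6):
    """Normalize advantages by std, skipping samples whose std is exactly zero.

    advantages: array of shape (batch_size, 1)
    std: array of shape (batch_size,)
    """
    advantages = advantages.copy()  # avoid mutating the caller's array
    non_zero_std_mask = std > 0  # exact zero (leave-one-out, identical rewards) is skipped
    advantages[non_zero_std_mask] = (
        advantages[non_zero_std_mask]
        / (std[non_zero_std_mask][:, None] + epsilon)
    )
    return advantages

adv = np.array([[0.0], [2.0], [4.0]])
std = np.array([0.0, 1.0, 2.0])
print(normalize_advantages_sketch(adv, std))
```

The first row (std exactly zero) passes through unchanged, while the other rows are divided by their std plus epsilon.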
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
nemo_rl/algorithms/grpo.py (1 hunks)
tests/unit/algorithms/test_grpo.py (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-09-20T14:59:08.052Z
Learning: If a change could affect numerics or convergence, include evidence in the PR description demonstrating no regression.
🧬 Code graph analysis (1)
tests/unit/algorithms/test_grpo.py (1)
nemo_rl/algorithms/grpo.py (1)
normalize_advantages_with_epsilon (636-660)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (5)
nemo_rl/algorithms/grpo.py (1)
655-660: Consider clarifying the threshold choice (std > 0 vs a small epsilon threshold).

The implementation uses `std > 0` to determine which samples to normalize. This means that even std values like `1e-300` will be normalized (with epsilon protection), while exactly `0.0` will be skipped. This is mathematically clean but worth verifying:

- Is this the intended behavior, or should there be a small threshold (e.g., `std > 1e-10`) to handle near-zero floating-point values?
- The current approach treats exact zero specially, which aligns with the leave-one-out case producing exact zero variance

The current approach is likely correct for the stated use case, but please confirm that exact zero (not near-zero) is the intended condition for skipping normalization. If near-zero values should also be skipped, consider using a small threshold instead of comparing to zero.
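The distinction can be made concrete with a small standalone check (illustrative NumPy values, mirroring the `std > 0` mask discussed above):

```python
import numpy as np

epsilon = 1e-6
std_values = np.array([0.0, 1e-300, 1e-10, 0.5])

# Mask mirrors the reviewed condition: exact zero is skipped,
# while any positive value, however tiny, is normalized.
mask = std_values > 0
print(mask.tolist())  # [False, True, True, True]

# For std = 1e-300 the divisor collapses to epsilon, so the scale
# is bounded by 1/epsilon rather than overflowing.
divisors = std_values[mask] + epsilon
print((1.0 / divisors).tolist())
```

So even a denormal-range std is normalized, but the epsilon term keeps the resulting scale finite and bounded.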
tests/unit/algorithms/test_grpo.py (4)
1240-1245: Test correctly updated for new normalization behavior.

The expected values have been properly updated to reflect that samples with `std=0` now remain unchanged rather than being normalized. The comment accurately describes the expected behavior.
1256-1258: Test correctly handles all-zero std case.

The test properly validates that when all std values are zero, all advantages remain unchanged. This covers an important edge case for the leave-one-out baseline scenario.
1290-1305: Excellent test coverage for the leave-one-out baseline scenario.

This test directly addresses the root cause described in issue #1518. It properly validates:

- The sample with `std=0` keeps its advantage unchanged (preventing explosion)
- Samples with `std>0` are normalized correctly

The test documentation clearly explains the leave-one-out case, making it easy to understand the scenario being tested.
1325-1335: Good test for small non-zero std values, but consider numerical implications.

This test correctly validates that small but non-zero std values still trigger normalization (no threshold). However, note that for very small std values (e.g., `0.0001`), the normalized advantage can still become quite large: `advantage / (0.0001 + 1e-6) ≈ advantage / 0.0001 = 10,000 * advantage`. While not as extreme as the previous `1/epsilon` explosion, this could still produce large values in practice.

Based on learnings: Consider whether convergence testing has been performed with small positive std values to ensure numerical stability.
…loo (#1519) Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
What does this PR do?
Fixes numerical instability in advantage normalization when using leave-one-out baseline by gracefully skipping normalization for zero-variance samples.
Issues
Closes #1518
Changes
Summary
- Updated `normalize_advantages_with_epsilon()` to skip normalization when `std=0` and `advantages != 0`
- Makes `normalize_rewards=True` and `use_leave_one_out_baseline=True` fully compatible

Before your PR is "Ready for review"
Pre checks:
Summary by CodeRabbit
Bug Fixes
Tests