
Conversation


@hijkzzz hijkzzz commented Jan 22, 2026

  1. REINFORCE++-baseline advantage estimator
  2. Length Penalty
  3. ICEPOP

Summary by CodeRabbit

  • New Features
    • Added ProRL configuration for reinforcement learning training with dynamic sampling and asymmetric clipping.
    • Introduced configurable advantage estimation methods supporting both GRPO and Reinforce++ variants.
    • Added truncated importance sampling (TIS/ICE-POP) support for improved off-policy corrections.
    • Integrated KL penalty support in reward calculation for enhanced training.


@hijkzzz hijkzzz requested a review from a team as a code owner January 22, 2026 09:30
Contributor

coderabbitai bot commented Jan 22, 2026

📝 Walkthrough

Walkthrough

This PR introduces ProRL support with Reinforce++ advantage estimation and ICE-POP importance sampling filtering. It adds a new configuration file, implements configurable advantage estimators, refactors grpo.py to defer advantage computation until logprobs are available, and extends loss functions with truncated importance sampling options.
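
For orientation, the new knobs described in this walkthrough can be sketched as a Python config fragment. The key names below are taken from the PR description and the review comments that follow; the nesting, the clip-bound key names, and the concrete values are assumptions for illustration, not the literal contents of prorl.yaml.

prorl_overrides = {
    "grpo": {
        "adv_estimator": {
            "name": "reinforce_plus_plus",   # or "grpo" for the standard estimator
            "minus_baseline": True,          # Reinforce++: subtract the per-prompt baseline
        },
        "use_leave_one_out_baseline": False,
        "normalize_rewards": True,
    },
    "loss_fn": {
        "ratio_clip_min": 0.2,                           # assumed key names for the
        "ratio_clip_max": 0.27,                          # asymmetric clipping bounds
        "truncated_importance_sampling_type": "icepop",  # or "tis"
        "truncated_importance_sampling_ratio_min": 0.5,
        "use_kl_in_reward": False,
    },
}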

Changes

  • ProRL Configuration (examples/configs/prorl.yaml)
    New ProRL v2 configuration inheriting from grpo_math_1B.yaml, enabling dynamic sampling (max 10 batches, 1.5× multiplier), Reinforce++ advantage estimation with optional baseline handling, reward shaping with a stop-penalty, and ICE-POP truncated importance sampling with asymmetric clipping bounds (0.2–0.27).
  • Advantage Estimation (nemo_rl/algorithms/advantage_estimator.py)
    New module implementing GRPOAdvantageEstimator (leave-one-out baseline with optional reward normalization) and ReinforcePlusPlusAdvantageEstimator (per-prompt baseline subtraction with optional KL penalty integration and global batch normalization); see the sketch after this list.
  • GRPO Training Flow (nemo_rl/algorithms/grpo.py)
    Adds AdvEstimatorConfig and LengthPenaltyConfig TypedDicts; extends GRPOConfig to include adv_estimator and length_penalty; refactors advantage computation to defer it until logprobs are available; initializes the appropriate estimator at runtime with validation.
  • Loss Function & IS (nemo_rl/algorithms/loss_functions.py)
    Extends ClippedPGLossConfig with truncated_importance_sampling_type (tis/icepop modes), truncated_importance_sampling_ratio_min, and a use_kl_in_reward flag; implements ICE-POP filtering to mask importance weights outside the valid range and adds validation assertions.
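
To make the two estimators concrete, here is a minimal sketch of the math summarized above: a leave-one-out group baseline (optionally normalizing by the group reward std) for GRPO, and a per-prompt mean baseline followed by global, mask-aware batch normalization for Reinforce++. Function names, tensor shapes, and the std-based normalization are illustrative assumptions, not the actual advantage_estimator.py API.

import torch

def grpo_advantages(rewards, group_size, normalize_rewards=True, eps=1e-6):
    # rewards: (num_prompts * group_size,), grouped so each row is one prompt's samples
    r = rewards.view(-1, group_size)
    # leave-one-out baseline: mean reward of the other samples in the group
    baseline = (r.sum(dim=-1, keepdim=True) - r) / (group_size - 1)
    adv = r - baseline
    if normalize_rewards:
        adv = adv / (r.std(dim=-1, keepdim=True) + eps)
    return adv.view(-1)

def reinforce_pp_advantages(rewards, group_size, mask, minus_baseline=True):
    # mask: float tensor of shape (num_sequences, seq_len) marking valid response tokens
    r = rewards.view(-1, group_size)
    if minus_baseline:
        r = r - r.mean(dim=-1, keepdim=True)       # per-prompt baseline subtraction
    adv = r.reshape(-1, 1) * mask                   # broadcast per-sequence advantage over tokens
    adv_mean = (adv * mask).sum() / mask.sum()      # global batch normalization over valid tokens
    adv_var = ((adv - adv_mean).pow(2) * mask).sum() / mask.sum()
    return (adv - adv_mean) * adv_var.clamp(min=1e-8).rsqrt() * mask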

Sequence Diagram(s)

sequenceDiagram
    participant Sampler as Dynamic Sampler
    participant AdvEst as Advantage Estimator
    participant Reward as Reward Shaping
    participant Loss as Loss Function

    Sampler->>Sampler: Filter non-informative prompts<br/>(max 10 batches, 1.5x multiplier)
    activate Sampler
    Sampler->>AdvEst: Pass prompt_ids, rewards, mask
    deactivate Sampler
    
    activate AdvEst
    alt Reinforce++ Path
        AdvEst->>AdvEst: Subtract per-prompt baseline (optional)
        AdvEst->>AdvEst: Integrate KL penalty into reward<br/>(if use_kl_in_reward enabled)
    else GRPO Path
        AdvEst->>AdvEst: Calculate leave-one-out baseline
        AdvEst->>AdvEst: Normalize rewards by std (optional)
    end
    AdvEst->>AdvEst: Apply global batch normalization
    AdvEst->>Reward: Return standardized advantages
    deactivate AdvEst
    
    activate Reward
    Reward->>Reward: Apply stop-penalty shaping
    Reward->>Loss: Pass shaped rewards
    deactivate Reward
    
    activate Loss
    Loss->>Loss: Compute per-token loss with<br/>asymmetric clipping (0.2–0.27)
    Loss->>Loss: Apply Truncated Importance Sampling<br/>(TIS: clamp | ICE-POP: filter)
    Loss->>Loss: Optionally adjust with KL penalties
    Loss->>Loss: Output final loss
    deactivate Loss
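
The last two loss steps in the diagram amount to the sketch below: an asymmetric PPO-style clip on the policy ratio, plus a detached off-policy correction weight that TIS clamps at an upper bound while ICE-POP zeroes out wherever the weight falls outside [min, max]. Names and the exact formula are illustrative; the clip bounds (0.2/0.27) and the ICE-POP lower bound (0.5) come from values quoted in this review, the upper bound corresponds to the existing truncated_importance_sampling_ratio setting, and its default here is an assumption.

import torch

def truncated_is_weights(actor_logprobs, rollout_logprobs, mode="icepop",
                         ratio_max=2.0, ratio_min=0.5):
    # detached importance weight between the trained policy and the rollout policy
    w = torch.exp(actor_logprobs - rollout_logprobs).detach()
    if mode == "tis":
        return w.clamp(max=ratio_max)               # TIS: clamp the upper tail
    if mode == "icepop":
        keep = (w >= ratio_min) & (w <= ratio_max)  # ICE-POP: filter instead of clamp
        return w * keep                             # filtered tokens contribute zero loss
    raise ValueError(f"unknown truncated importance sampling type: {mode}")

def clipped_pg_loss(logprobs, old_logprobs, advantages, mask,
                    clip_min=0.2, clip_max=0.27, is_weights=None):
    # PPO-style token loss with asymmetric clipping range [1 - clip_min, 1 + clip_max]
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = ratio.clamp(1.0 - clip_min, 1.0 + clip_max) * advantages
    loss = -torch.minimum(unclipped, clipped)
    if is_weights is not None:
        loss = loss * is_weights                    # apply the TIS / ICE-POP correction
    return (loss * mask).sum() / mask.sum().clamp(min=1)
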

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • RL#1423: Modifies advantage computation/normalization paths with baseline and std calculation refactoring affecting grpo.py and utils.py
  • RL#1348: Adds truncated importance sampling ratio clipping to ClippedPGLossFn, extended here with ICE-POP filtering and type selection

Suggested labels

CI:L2, r0.4.0

Suggested reviewers

  • terrykong

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 37.50%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Test Results For Major Changes (⚠️ Warning): The PR introduces 300+ lines of code changes (advantage estimator, length penalty, ICEPOP) with no test results or validation information in the description. Resolution: add test results, performance metrics, and numerical validation to demonstrate correctness and convergence of the new advantage estimators and loss functions.

✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title 'feat: Implement ProRLv2 recipe' accurately reflects the main changes: introducing a ProRL v2 configuration file with new advantage estimators (REINFORCE++), loss function enhancements (ICEPOP/TIS), and supporting infrastructure.


Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@examples/configs/prorl.yaml`:
- Line 40: The value of use_leave_one_out_baseline is misspelled as "fasle", which YAML
parses as a string rather than the boolean false; change it to the literal false so the
flag behaves as expected.

In `@nemo_rl/algorithms/advantage_estimator.py`:
- Around line 20-21: The class docstring for ReinforcePlusPlusAdvantageEstimator
contains a duplicated phrase "KL penalty in reward and KL penalty in reward";
update the docstring in the ReinforcePlusPlusAdvantageEstimator declaration to
remove the duplicate and make the wording concise (e.g., "Reinforce++ with
optional baseline subtraction (minus_baseline) and KL penalty in reward"),
ensuring the docstring accurately describes the minus_baseline and KL penalty
behavior.
- Around line 91-98: The code adds a KL penalty without validating the KL config. Tighten
the guard that uses use_kl_in_reward so calculate_kl is only called when self.kl_coef and
self.kl_type are both not None (in addition to self.use_kl_in_reward being set and
logprobs being non-None). If either value is missing, either raise a clear ValueError
naming self.kl_coef/self.kl_type or skip applying the KL term to adv.
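
A minimal, standalone sketch of that guard, assuming a calculate_kl(logprobs, reference_logprobs, kl_type) helper with roughly that signature (calculate_kl is passed in since its real signature isn't shown here, and reference_logprobs is a placeholder name):

def apply_kl_to_advantages(adv, logprobs, reference_logprobs, use_kl_in_reward,
                           kl_coef, kl_type, calculate_kl):
    # Only apply the KL term when the KL config is fully specified.
    if not (use_kl_in_reward and logprobs is not None):
        return adv
    if kl_coef is None or kl_type is None:
        raise ValueError(
            "use_kl_in_reward=True requires reference_policy_kl_penalty (kl_coef) "
            "and reference_policy_kl_type (kl_type) to be set"
        )
    kl = calculate_kl(logprobs, reference_logprobs, kl_type=kl_type)
    return adv - kl_coef * kl
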
🧹 Nitpick comments (6)
nemo_rl/algorithms/loss_functions.py (1)

141-148: Default values in code contradict coding guidelines.

Per coding guidelines: "YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values." The defaults "tis" and 0.5 should be defined in YAML configs, not here.

Proposed fix
-        # Type of truncated importance sampling: "tis" (clamp max) or "icepop" (filter [min, max])
-        self.truncated_importance_sampling_type = cfg.get(
-            "truncated_importance_sampling_type", "tis"
-        )
-        # Lower bound for ICE-POP filtering (default 0.5)
-        self.truncated_importance_sampling_ratio_min = cfg.get(
-            "truncated_importance_sampling_ratio_min", 0.5
-        )
+        # Type of truncated importance sampling: "tis" (clamp max) or "icepop" (filter [min, max])
+        self.truncated_importance_sampling_type = cfg.get(
+            "truncated_importance_sampling_type"
+        )
+        # Lower bound for ICE-POP filtering
+        self.truncated_importance_sampling_ratio_min = cfg.get(
+            "truncated_importance_sampling_ratio_min"
+        )

Then ensure all YAML configs that use truncated_importance_sampling_ratio also specify truncated_importance_sampling_type and truncated_importance_sampling_ratio_min. Based on coding guidelines.

nemo_rl/algorithms/advantage_estimator.py (2)

67-72: Default values in code contradict coding guidelines.

Similar to loss_functions.py, defaults like True, 0.01, and "k3" should come from YAML configs per coding guidelines.

Proposed fix
     def __init__(self, estimator_config: dict, loss_config: dict):
-        self.minus_baseline = estimator_config.get("minus_baseline", True)
-        self.use_kl_in_reward = loss_config.get("use_kl_in_reward", False)
-        self.kl_coef = loss_config.get("reference_policy_kl_penalty", 0.01)
-        self.kl_type = loss_config.get("reference_policy_kl_type", "k3")
+        self.minus_baseline = estimator_config.get("minus_baseline")
+        self.use_kl_in_reward = loss_config.get("use_kl_in_reward")
+        self.kl_coef = loss_config.get("reference_policy_kl_penalty")
+        self.kl_type = loss_config.get("reference_policy_kl_type")

Based on coding guidelines.


100-106: Global normalization may have numerical issues with small masks.

mask.sum() could theoretically be zero or very small, leading to division issues. Consider adding a minimum clamp similar to the variance clamp.

Proposed fix
         # global normalization across the batch
-        adv_mean = (adv * mask).sum() / mask.sum()
-        adv_var = ((adv - adv_mean).pow(2) * mask).sum() / mask.sum()
+        mask_sum = mask.sum().clamp(min=1)
+        adv_mean = (adv * mask).sum() / mask_sum
+        adv_var = ((adv - adv_mean).pow(2) * mask).sum() / mask_sum
         adv_rstd = adv_var.clamp(min=1e-8).rsqrt()
         adv = (adv - adv_mean) * adv_rstd
nemo_rl/algorithms/grpo.py (2)

1113-1129: Duplicated adv_estimator initialization code.

The same initialization logic appears in both grpo_train (lines 1113-1129) and async_grpo_train (lines 2120-2136). Extract it into a helper function to follow the DRY principle.

Proposed refactor
def _create_advantage_estimator(master_config: MasterConfig):
    """Create advantage estimator based on configuration."""
    adv_estimator_config = master_config["grpo"].get("adv_estimator", {})
    adv_estimator_config.setdefault("name", "grpo")
    adv_estimator_config.setdefault("use_leave_one_out_baseline", master_config["grpo"]["use_leave_one_out_baseline"])
    adv_estimator_config.setdefault("normalize_rewards", master_config["grpo"]["normalize_rewards"])
    loss_config = master_config["loss_fn"]

    adv_estimator_name = adv_estimator_config["name"]
    if adv_estimator_name == "grpo":
        print(f"  ✓ Using GRPO advantage estimator")
        return GRPOAdvantageEstimator(adv_estimator_config, loss_config)
    elif adv_estimator_name == "reinforce_plus_plus":
        print(f"  ✓ Using Reinforce++ advantage estimator")
        return ReinforcePlusPlusAdvantageEstimator(adv_estimator_config, loss_config)
    else:
        raise ValueError(f"Invalid adv_estimator name: {adv_estimator_name}")

1397-1406: Confusing placeholder advantages reuse.

The placeholder advantages tensor has shape (batch_size, 1) and is later expanded to message["token_ids"].shape in lines 1437-1438. This works, but the intermediate del advantages on line 1440 happens before the real advantages are computed in lines 1511-1517, so the same variable name is reused for the placeholder and the final values, which is confusing.

Consider renaming the placeholder to placeholder_advantages for clarity, or restructuring so the placeholder isn't deleted before the real computation.

examples/configs/prorl.yaml (1)

89-90: use_kl_in_reward: false but Reinforce++ selected.

With adv_estimator.name: "reinforce_plus_plus" and use_kl_in_reward: false, the KL penalty will be in the loss (not reward). This is valid but worth noting that the "full" Reinforce++ experience typically uses KL in reward. Consider adding a comment clarifying this choice.
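
For readers comparing the two modes, a schematic sketch of the difference (helper and variable names here are placeholders, not the nemo_rl API):

def apply_kl(reward, pg_loss, kl_penalty, kl_coef, use_kl_in_reward):
    # use_kl_in_reward=False (this config): the KL term stays in the loss.
    # use_kl_in_reward=True ("full" Reinforce++ style): the KL term is folded into the
    # reward, so it flows through the advantage estimator instead of the loss.
    if use_kl_in_reward:
        return reward - kl_coef * kl_penalty, pg_loss
    return reward, pg_loss + kl_coef * kl_penalty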

name: "reinforce_plus_plus" # Use "grpo" for standard GRPO
# GRPO specific
normalize_rewards: true
use_leave_one_out_baseline: fasle
Contributor

⚠️ Potential issue | 🔴 Critical

Typo: fasle should be false.

The boolean value is misspelled; YAML parses "fasle" as the string "fasle" rather than the boolean false, so the flag will not behave as expected.

Proposed fix
-    use_leave_one_out_baseline: fasle
+    use_leave_one_out_baseline: false
📝 Committable suggestion


Suggested change
use_leave_one_out_baseline: fasle
use_leave_one_out_baseline: false

@joyang-nv
Member

@hijkzzz, thanks for the contribution. I think @yuki-97 had a draft PR with similar content before. Is this an improvement on it or fresh work? If you are resuming her work, can you add her as Co-authored-by in the initial commit?

@hijkzzz hijkzzz changed the title [DRAFT] feat: Implement ProRL recipe feat: Implement ProRL recipe Jan 23, 2026
@hijkzzz hijkzzz changed the title feat: Implement ProRL recipe feat: Implement ProRLv2 recipe Jan 23, 2026
Author

hijkzzz commented Jan 23, 2026

@joyang-nv, @yuki-97 doesn't have a complete implementation of ProRL v2, and it's not based on the latest codebase.

@joyang-nv
Member

@hijkzzz, you are right. @yuki-97 couldn't finish what she planned because of priorities set by the management team, and it looks like you are resuming the rest. :)
My only suggestion: if your work is based on @yuki-97's previous work, leave her name on the initial commit that contains it. I always do this for all developers.

Author

hijkzzz commented Jan 23, 2026

@joyang-nv This implementation wasn't developed based on @yuki-97's MR; it was rewritten from scratch and is still being debugged.
Of course, @yuki-97 is also welcome to continue and complete the MR that wasn’t finished earlier.

@hijkzzz hijkzzz requested review from a team as code owners January 23, 2026 09:58
Author

hijkzzz commented Jan 23, 2026

@terrykong

Training logs:

[Three training-log screenshots attached]

@joyang-nv
Member

Hi @hijkzzz :

Please note that you are the author of ProRL, and @yuki-97 has also been contributing to ProRL and to landing it in Nemo RL. Yuki was pulled into this contribution at my request last year and spent over a month of dedicated effort on it. Unfortunately, I had to move her to other high-priority tasks this year, but I still want to recognize her efforts on bringing ProRL to Nemo RL. Thank you as well for your contribution to ProRL and Nemo RL.

Here is the list of work Yuki has done and the pending tasks:
Merged:

  1. TIS: feat: support truncated importance sampling #1348
  2. KL: feat: add kl penalty k1, k2 #1349

Pending:

  1. Rf++: yuki-97@04854fb
  2. Stop properly penalty: yuki-97@2b3e4d6
  3. Adv boost: yuki-97@04232c2
  4. Split val sampling args: yuki-97@7fc515a
  5. Fix vLLM fail when high concurrent: yuki-97@bbc5a58

Dev branches:

Author

hijkzzz commented Jan 26, 2026

@joyang-nv This MR was redeveloped from scratch and wasn’t based on the previous code. Most of the time went into debugging rather than writing code; the code itself was completed in about an hour using Cursor.

@hijkzzz hijkzzz requested a review from a team as a code owner January 27, 2026 03:04
@hijkzzz hijkzzz requested a review from a team as a code owner January 27, 2026 05:03
Signed-off-by: jianh <[email protected]>
hijkzzz and others added 25 commits January 28, 2026 17:04
Signed-off-by: jianh <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Yi-Fu Wu <[email protected]>
Signed-off-by: jianh <[email protected]>
- Change epsilon from 1e-8 to 1e-6 in GRPOAdvantageEstimator to match existing implementation
- Remove .get() defaults in ReinforcePlusPlusAdvantageEstimator, access config directly
- Add default values to grpo_math_1B.yaml base config (adv_estimator, stop_properly_penalty_coef, truncated_importance_sampling_ratio_min, truncated_importance_sampling_type, use_kl_in_reward)
- Remove unused normalize_advantages_with_epsilon function from grpo.py
- Replace normalize_advantages_with_epsilon tests with GRPOAdvantageEstimator and ReinforcePlusPlusAdvantageEstimator tests

Signed-off-by: jianh <[email protected]>
…mator defaults

- Add test cases for stop_properly_penalty_coef in reward_functions
- Remove setdefault() calls in grpo.py - use yaml defaults instead
- Update minus_baseline default to true in grpo_math_1B.yaml

Signed-off-by: jianh <[email protected]>
Signed-off-by: jianh <[email protected]>
@yfw added and then removed the CI:L1 (Run doctests, unit tests, and functional tests) label on Jan 29, 2026