
Conversation


@hijkzzz hijkzzz commented Jan 22, 2026

  1. REINFORCE++-baseline advantage estimator
  2. Length Penalty
  3. ICEPOP

Summary by CodeRabbit

  • New Features
    • Added ProRL configuration for reinforcement learning training with dynamic sampling and asymmetric clipping.
    • Introduced configurable advantage estimation methods supporting both GRPO and Reinforce++ variants.
    • Added truncated importance sampling (TIS/ICE-POP) support for improved off-policy corrections.
    • Integrated KL penalty support in reward calculation for enhanced training.


@hijkzzz hijkzzz requested a review from a team as a code owner January 22, 2026 09:30
Contributor

coderabbitai bot commented Jan 22, 2026

📝 Walkthrough

Walkthrough

This PR introduces ProRL support with Reinforce++ advantage estimation and ICE-POP importance sampling filtering. It adds a new configuration file, implements configurable advantage estimators, refactors grpo.py to defer advantage computation until logprobs are available, and extends loss functions with truncated importance sampling options.
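
For orientation, the new knobs described in this walkthrough can be sketched as a Python config fragment. The key names below are taken from the PR description and the review comments that follow; the nesting, the clip-bound key names, and the concrete values are assumptions for illustration, not the literal contents of prorl.yaml.

prorl_overrides = {
    "grpo": {
        "adv_estimator": {
            "name": "reinforce_plus_plus",   # or "grpo" for the standard estimator
            "minus_baseline": True,          # Reinforce++: subtract the per-prompt baseline
        },
        "use_leave_one_out_baseline": False,
        "normalize_rewards": True,
    },
    "loss_fn": {
        "ratio_clip_min": 0.2,                           # assumed key names for the
        "ratio_clip_max": 0.27,                          # asymmetric clipping bounds
        "truncated_importance_sampling_type": "icepop",  # or "tis"
        "truncated_importance_sampling_ratio_min": 0.5,
        "use_kl_in_reward": False,
    },
}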

Changes

  • ProRL Configuration (examples/configs/prorl.yaml)
    New ProRL v2 configuration inheriting from grpo_math_1B.yaml, enabling dynamic sampling (max 10 batches, 1.5× multiplier), Reinforce++ advantage estimation with optional baseline handling, reward shaping with a stop-penalty, and ICE-POP truncated importance sampling with asymmetric clipping bounds (0.2–0.27).
  • Advantage Estimation (nemo_rl/algorithms/advantage_estimator.py)
    New module implementing GRPOAdvantageEstimator (leave-one-out baseline with optional reward normalization) and ReinforcePlusPlusAdvantageEstimator (per-prompt baseline subtraction with optional KL penalty integration and global batch normalization); see the sketch after this list.
  • GRPO Training Flow (nemo_rl/algorithms/grpo.py)
    Adds AdvEstimatorConfig and LengthPenaltyConfig TypedDicts; extends GRPOConfig to include adv_estimator and length_penalty; refactors advantage computation to defer it until logprobs are available; initializes the appropriate estimator at runtime with validation.
  • Loss Function & IS (nemo_rl/algorithms/loss_functions.py)
    Extends ClippedPGLossConfig with truncated_importance_sampling_type (tis/icepop modes), truncated_importance_sampling_ratio_min, and a use_kl_in_reward flag; implements ICE-POP filtering to mask importance weights outside the valid range and adds validation assertions.
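
To make the two estimators concrete, here is a minimal sketch of the math summarized above: a leave-one-out group baseline (optionally normalizing by the group reward std) for GRPO, and a per-prompt mean baseline followed by global, mask-aware batch normalization for Reinforce++. Function names, tensor shapes, and the std-based normalization are illustrative assumptions, not the actual advantage_estimator.py API.

import torch

def grpo_advantages(rewards, group_size, normalize_rewards=True, eps=1e-6):
    # rewards: (num_prompts * group_size,), grouped so each row is one prompt's samples
    r = rewards.view(-1, group_size)
    # leave-one-out baseline: mean reward of the other samples in the group
    baseline = (r.sum(dim=-1, keepdim=True) - r) / (group_size - 1)
    adv = r - baseline
    if normalize_rewards:
        adv = adv / (r.std(dim=-1, keepdim=True) + eps)
    return adv.view(-1)

def reinforce_pp_advantages(rewards, group_size, mask, minus_baseline=True):
    # mask: float tensor of shape (num_sequences, seq_len) marking valid response tokens
    r = rewards.view(-1, group_size)
    if minus_baseline:
        r = r - r.mean(dim=-1, keepdim=True)       # per-prompt baseline subtraction
    adv = r.reshape(-1, 1) * mask                   # broadcast per-sequence advantage over tokens
    adv_mean = (adv * mask).sum() / mask.sum()      # global batch normalization over valid tokens
    adv_var = ((adv - adv_mean).pow(2) * mask).sum() / mask.sum()
    return (adv - adv_mean) * adv_var.clamp(min=1e-8).rsqrt() * mask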

Sequence Diagram(s)

sequenceDiagram
    participant Sampler as Dynamic Sampler
    participant AdvEst as Advantage Estimator
    participant Reward as Reward Shaping
    participant Loss as Loss Function

    Sampler->>Sampler: Filter non-informative prompts<br/>(max 10 batches, 1.5x multiplier)
    activate Sampler
    Sampler->>AdvEst: Pass prompt_ids, rewards, mask
    deactivate Sampler
    
    activate AdvEst
    alt Reinforce++ Path
        AdvEst->>AdvEst: Subtract per-prompt baseline (optional)
        AdvEst->>AdvEst: Integrate KL penalty into reward<br/>(if use_kl_in_reward enabled)
    else GRPO Path
        AdvEst->>AdvEst: Calculate leave-one-out baseline
        AdvEst->>AdvEst: Normalize rewards by std (optional)
    end
    AdvEst->>AdvEst: Apply global batch normalization
    AdvEst->>Reward: Return standardized advantages
    deactivate AdvEst
    
    activate Reward
    Reward->>Reward: Apply stop-penalty shaping
    Reward->>Loss: Pass shaped rewards
    deactivate Reward
    
    activate Loss
    Loss->>Loss: Compute per-token loss with<br/>asymmetric clipping (0.2–0.27)
    Loss->>Loss: Apply Truncated Importance Sampling<br/>(TIS: clamp | ICE-POP: filter)
    Loss->>Loss: Optionally adjust with KL penalties
    Loss->>Loss: Output final loss
    deactivate Loss
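
The last two loss steps in the diagram amount to the sketch below: an asymmetric PPO-style clip on the policy ratio, plus a detached off-policy correction weight that TIS clamps at an upper bound while ICE-POP zeroes out wherever the weight falls outside [min, max]. Names and the exact formula are illustrative; the clip bounds (0.2/0.27) and the ICE-POP lower bound (0.5) come from values quoted in this review, the upper bound corresponds to the existing truncated_importance_sampling_ratio setting, and its default here is an assumption.

import torch

def truncated_is_weights(actor_logprobs, rollout_logprobs, mode="icepop",
                         ratio_max=2.0, ratio_min=0.5):
    # detached importance weight between the trained policy and the rollout policy
    w = torch.exp(actor_logprobs - rollout_logprobs).detach()
    if mode == "tis":
        return w.clamp(max=ratio_max)               # TIS: clamp the upper tail
    if mode == "icepop":
        keep = (w >= ratio_min) & (w <= ratio_max)  # ICE-POP: filter instead of clamp
        return w * keep                             # filtered tokens contribute zero loss
    raise ValueError(f"unknown truncated importance sampling type: {mode}")

def clipped_pg_loss(logprobs, old_logprobs, advantages, mask,
                    clip_min=0.2, clip_max=0.27, is_weights=None):
    # PPO-style token loss with asymmetric clipping range [1 - clip_min, 1 + clip_max]
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = ratio.clamp(1.0 - clip_min, 1.0 + clip_max) * advantages
    loss = -torch.minimum(unclipped, clipped)
    if is_weights is not None:
        loss = loss * is_weights                    # apply the TIS / ICE-POP correction
    return (loss * mask).sum() / mask.sum().clamp(min=1)
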

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • RL#1423: Modifies advantage computation/normalization paths with baseline and std calculation refactoring affecting grpo.py and utils.py
  • RL#1348: Adds truncated importance sampling ratio clipping to ClippedPGLossFn, extended here with ICE-POP filtering and type selection

Suggested labels

CI:L2, r0.4.0

Suggested reviewers

  • terrykong

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 37.50%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Test Results For Major Changes (⚠️ Warning): The PR introduces 300+ lines of code changes (advantage estimator, length penalty, ICEPOP) with no test results or validation information in the description. Resolution: add test results, performance metrics, and numerical validation to demonstrate correctness and convergence of the new advantage estimators and loss functions.

✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title 'feat: Implement ProRLv2 recipe' accurately reflects the main changes: introducing a ProRL v2 configuration file with new advantage estimators (REINFORCE++), loss function enhancements (ICEPOP/TIS), and supporting infrastructure.


Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@examples/configs/prorl.yaml`:
- Line 40: The value of use_leave_one_out_baseline is misspelled as "fasle", which YAML
parses as a string rather than the boolean false; change it to the literal false so the
flag behaves as expected.

In `@nemo_rl/algorithms/advantage_estimator.py`:
- Around line 20-21: The class docstring for ReinforcePlusPlusAdvantageEstimator
contains a duplicated phrase "KL penalty in reward and KL penalty in reward";
update the docstring in the ReinforcePlusPlusAdvantageEstimator declaration to
remove the duplicate and make the wording concise (e.g., "Reinforce++ with
optional baseline subtraction (minus_baseline) and KL penalty in reward"),
ensuring the docstring accurately describes the minus_baseline and KL penalty
behavior.
- Around line 91-98: The code adds a KL penalty without validating the KL config. Tighten
the guard that uses use_kl_in_reward so calculate_kl is only called when self.kl_coef and
self.kl_type are both not None (in addition to self.use_kl_in_reward being set and
logprobs being non-None). If either value is missing, either raise a clear ValueError
naming self.kl_coef/self.kl_type or skip applying the KL term to adv.
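
A minimal, standalone sketch of that guard, assuming a calculate_kl(logprobs, reference_logprobs, kl_type) helper with roughly that signature (calculate_kl is passed in since its real signature isn't shown here, and reference_logprobs is a placeholder name):

def apply_kl_to_advantages(adv, logprobs, reference_logprobs, use_kl_in_reward,
                           kl_coef, kl_type, calculate_kl):
    # Only apply the KL term when the KL config is fully specified.
    if not (use_kl_in_reward and logprobs is not None):
        return adv
    if kl_coef is None or kl_type is None:
        raise ValueError(
            "use_kl_in_reward=True requires reference_policy_kl_penalty (kl_coef) "
            "and reference_policy_kl_type (kl_type) to be set"
        )
    kl = calculate_kl(logprobs, reference_logprobs, kl_type=kl_type)
    return adv - kl_coef * kl
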
🧹 Nitpick comments (6)
nemo_rl/algorithms/loss_functions.py (1)

141-148: Default values in code contradict coding guidelines.

Per coding guidelines: "YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values." The defaults "tis" and 0.5 should be defined in YAML configs, not here.

Proposed fix
-        # Type of truncated importance sampling: "tis" (clamp max) or "icepop" (filter [min, max])
-        self.truncated_importance_sampling_type = cfg.get(
-            "truncated_importance_sampling_type", "tis"
-        )
-        # Lower bound for ICE-POP filtering (default 0.5)
-        self.truncated_importance_sampling_ratio_min = cfg.get(
-            "truncated_importance_sampling_ratio_min", 0.5
-        )
+        # Type of truncated importance sampling: "tis" (clamp max) or "icepop" (filter [min, max])
+        self.truncated_importance_sampling_type = cfg.get(
+            "truncated_importance_sampling_type"
+        )
+        # Lower bound for ICE-POP filtering
+        self.truncated_importance_sampling_ratio_min = cfg.get(
+            "truncated_importance_sampling_ratio_min"
+        )

Then ensure all YAML configs that use truncated_importance_sampling_ratio also specify truncated_importance_sampling_type and truncated_importance_sampling_ratio_min. Based on coding guidelines.

nemo_rl/algorithms/advantage_estimator.py (2)

67-72: Default values in code contradict coding guidelines.

Similar to loss_functions.py, defaults like True, 0.01, and "k3" should come from YAML configs per coding guidelines.

Proposed fix
     def __init__(self, estimator_config: dict, loss_config: dict):
-        self.minus_baseline = estimator_config.get("minus_baseline", True)
-        self.use_kl_in_reward = loss_config.get("use_kl_in_reward", False)
-        self.kl_coef = loss_config.get("reference_policy_kl_penalty", 0.01)
-        self.kl_type = loss_config.get("reference_policy_kl_type", "k3")
+        self.minus_baseline = estimator_config.get("minus_baseline")
+        self.use_kl_in_reward = loss_config.get("use_kl_in_reward")
+        self.kl_coef = loss_config.get("reference_policy_kl_penalty")
+        self.kl_type = loss_config.get("reference_policy_kl_type")

Based on coding guidelines.


100-106: Global normalization may have numerical issues with small masks.

mask.sum() could theoretically be zero or very small, leading to division issues. Consider adding a minimum clamp similar to the variance clamp.

Proposed fix
         # global normalization across the batch
-        adv_mean = (adv * mask).sum() / mask.sum()
-        adv_var = ((adv - adv_mean).pow(2) * mask).sum() / mask.sum()
+        mask_sum = mask.sum().clamp(min=1)
+        adv_mean = (adv * mask).sum() / mask_sum
+        adv_var = ((adv - adv_mean).pow(2) * mask).sum() / mask_sum
         adv_rstd = adv_var.clamp(min=1e-8).rsqrt()
         adv = (adv - adv_mean) * adv_rstd
nemo_rl/algorithms/grpo.py (2)

1113-1129: Duplicated adv_estimator initialization code.

The same initialization logic appears in both grpo_train (lines 1113-1129) and async_grpo_train (lines 2120-2136). Extract it into a helper function to follow the DRY principle.

Proposed refactor
def _create_advantage_estimator(master_config: MasterConfig):
    """Create advantage estimator based on configuration."""
    adv_estimator_config = master_config["grpo"].get("adv_estimator", {})
    adv_estimator_config.setdefault("name", "grpo")
    adv_estimator_config.setdefault("use_leave_one_out_baseline", master_config["grpo"]["use_leave_one_out_baseline"])
    adv_estimator_config.setdefault("normalize_rewards", master_config["grpo"]["normalize_rewards"])
    loss_config = master_config["loss_fn"]

    adv_estimator_name = adv_estimator_config["name"]
    if adv_estimator_name == "grpo":
        print(f"  ✓ Using GRPO advantage estimator")
        return GRPOAdvantageEstimator(adv_estimator_config, loss_config)
    elif adv_estimator_name == "reinforce_plus_plus":
        print(f"  ✓ Using Reinforce++ advantage estimator")
        return ReinforcePlusPlusAdvantageEstimator(adv_estimator_config, loss_config)
    else:
        raise ValueError(f"Invalid adv_estimator name: {adv_estimator_name}")

1397-1406: Confusing placeholder advantages reuse.

The placeholder advantages tensor has shape (batch_size, 1) and is later expanded to message["token_ids"].shape in lines 1437-1438. This works, but the intermediate del advantages on line 1440 happens before the real advantages are computed in lines 1511-1517, so the same variable name is reused for the placeholder and the final values, which is confusing.

Consider renaming the placeholder to placeholder_advantages for clarity, or restructuring so the placeholder isn't deleted before the real computation.

examples/configs/prorl.yaml (1)

89-90: use_kl_in_reward: false but Reinforce++ selected.

With adv_estimator.name: "reinforce_plus_plus" and use_kl_in_reward: false, the KL penalty will be in the loss (not reward). This is valid but worth noting that the "full" Reinforce++ experience typically uses KL in reward. Consider adding a comment clarifying this choice.
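
For readers comparing the two modes, a schematic sketch of the difference (helper and variable names here are placeholders, not the nemo_rl API):

def apply_kl(reward, pg_loss, kl_penalty, kl_coef, use_kl_in_reward):
    # use_kl_in_reward=False (this config): the KL term stays in the loss.
    # use_kl_in_reward=True ("full" Reinforce++ style): the KL term is folded into the
    # reward, so it flows through the advantage estimator instead of the loss.
    if use_kl_in_reward:
        return reward - kl_coef * kl_penalty, pg_loss
    return reward, pg_loss + kl_coef * kl_penalty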

name: "reinforce_plus_plus" # Use "grpo" for standard GRPO
# GRPO specific
normalize_rewards: true
use_leave_one_out_baseline: fasle
Contributor

⚠️ Potential issue | 🔴 Critical

Typo: fasle should be false.

The boolean value is misspelled; YAML parses "fasle" as the string "fasle" rather than the boolean false, so the flag will not behave as expected.

Proposed fix
-    use_leave_one_out_baseline: fasle
+    use_leave_one_out_baseline: false
📝 Committable suggestion


Suggested change
use_leave_one_out_baseline: fasle
use_leave_one_out_baseline: false

@joyang-nv
Member

@hijkzzz, thanks for the contribution. I think @yuki-97 had a draft PR with similar content before. Is this an improvement on it or fresh work? If you are resuming her work, can you add her as Co-authored-by in the initial commit?

@hijkzzz hijkzzz changed the title [DRAFT] feat: Implement ProRL recipe feat: Implement ProRL recipe Jan 23, 2026
@hijkzzz hijkzzz changed the title feat: Implement ProRL recipe feat: Implement ProRLv2 recipe Jan 23, 2026
Author

hijkzzz commented Jan 23, 2026

@joyang-nv, @yuki-97 doesn't have a complete implementation of ProRL v2, and it's not based on the latest codebase.

@joyang-nv
Member

@hijkzzz, you are right. @yuki-97 couldn't finish what she planned because of priorities set by the management team, and it looks like you are resuming the rest. :)
My only suggestion: if your work is based on @yuki-97's previous work, leave her name on the initial commit that contains it. I always do this for all developers.

Author

hijkzzz commented Jan 23, 2026

@joyang-nv This implementation wasn't developed based on @yuki-97's MR; it was rewritten from scratch and is still being debugged.
Of course, @yuki-97 is also welcome to continue and complete the MR that wasn’t finished earlier.

@hijkzzz hijkzzz requested review from a team as code owners January 23, 2026 09:58
Author

hijkzzz commented Jan 23, 2026

@terrykong

Training logs:

[Three training-log screenshots attached]

@joyang-nv
Member

Hi @hijkzzz :

Please note that you are the author of ProRL, and @yuki-97 has also been contributing to ProRL and to landing it in Nemo RL. Yuki was pulled into this contribution at my request last year and spent over a month of dedicated effort on it. Unfortunately, I had to move her to other high-priority tasks this year, but I still want to recognize her efforts on bringing ProRL to Nemo RL. Thank you as well for your contribution to ProRL and Nemo RL.

Here is the list of work Yuki has done and the pending tasks:
Merged:

  1. TIS: feat: support truncated importance sampling #1348
  2. KL: feat: add kl penalty k1, k2 #1349

Pending:

  1. Rf++: yuki-97@04854fb
  2. Stop properly penalty: yuki-97@2b3e4d6
  3. Adv boost: yuki-97@04232c2
  4. Split val sampling args: yuki-97@7fc515a
  5. Fix vLLM fail when high concurrent: yuki-97@bbc5a58

Dev branches:

Author

hijkzzz commented Jan 26, 2026

@joyang-nv This MR was redeveloped from scratch and wasn’t based on the previous code. Most of the time went into debugging rather than writing code; the code itself was completed in about an hour using Cursor.

@hijkzzz hijkzzz requested a review from a team as a code owner January 27, 2026 03:04
@hijkzzz hijkzzz requested a review from a team as a code owner January 27, 2026 05:03
Signed-off-by: jianh <[email protected]>
hijkzzz and others added 25 commits January 28, 2026 17:04
Signed-off-by: jianh <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Yi-Fu Wu <[email protected]>
Signed-off-by: jianh <[email protected]>
- Change epsilon from 1e-8 to 1e-6 in GRPOAdvantageEstimator to match existing implementation
- Remove .get() defaults in ReinforcePlusPlusAdvantageEstimator, access config directly
- Add default values to grpo_math_1B.yaml base config (adv_estimator, stop_properly_penalty_coef, truncated_importance_sampling_ratio_min, truncated_importance_sampling_type, use_kl_in_reward)
- Remove unused normalize_advantages_with_epsilon function from grpo.py
- Replace normalize_advantages_with_epsilon tests with GRPOAdvantageEstimator and ReinforcePlusPlusAdvantageEstimator tests

Signed-off-by: jianh <[email protected]>
…mator defaults

- Add test cases for stop_properly_penalty_coef in reward_functions
- Remove setdefault() calls in grpo.py - use yaml defaults instead
- Update minus_baseline default to true in grpo_math_1B.yaml

Signed-off-by: jianh <[email protected]>
Signed-off-by: jianh <[email protected]>
@yfw added and then removed the CI:L1 (Run doctests, unit tests, and functional tests) label on Jan 29, 2026