feat: Implement ProRLv2 recipe #1809
base: main
Conversation
📝 Walkthrough
This PR introduces ProRL support with Reinforce++ advantage estimation and ICE-POP importance sampling filtering. It adds a new configuration file, implements configurable advantage estimators, refactors grpo.py to defer advantage computation until logprobs are available, and extends loss functions with truncated importance sampling options.
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant Sampler as Dynamic Sampler
    participant AdvEst as Advantage Estimator
    participant Reward as Reward Shaping
    participant Loss as Loss Function
    Sampler->>Sampler: Filter non-informative prompts<br/>(max 10 batches, 1.5x multiplier)
    activate Sampler
    Sampler->>AdvEst: Pass prompt_ids, rewards, mask
    deactivate Sampler
    activate AdvEst
    alt Reinforce++ Path
        AdvEst->>AdvEst: Subtract per-prompt baseline (optional)
        AdvEst->>AdvEst: Integrate KL penalty into reward<br/>(if use_kl_in_reward enabled)
    else GRPO Path
        AdvEst->>AdvEst: Calculate leave-one-out baseline
        AdvEst->>AdvEst: Normalize rewards by std (optional)
    end
    AdvEst->>AdvEst: Apply global batch normalization
    AdvEst->>Reward: Return standardized advantages
    deactivate AdvEst
    activate Reward
    Reward->>Reward: Apply stop-penalty shaping
    Reward->>Loss: Pass shaped rewards
    deactivate Reward
    activate Loss
    Loss->>Loss: Compute per-token loss with<br/>asymmetric clipping (0.2–0.27)
    Loss->>Loss: Apply Truncated Importance Sampling<br/>(TIS: clamp | ICE-POP: filter)
    Loss->>Loss: Optionally adjust with KL penalties
    Loss->>Loss: Output final loss
    deactivate Loss
```
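The "Truncated Importance Sampling" step in the diagram distinguishes TIS (clamp the importance ratio from above) from ICE-POP (keep only ratios inside a [min, max] window and drop the rest). The sketch below is illustrative only and is not the repository's implementation: the function name `truncate_importance_ratio` and the `ratio_max` value in the example call are assumptions, while `ratio_min = 0.5` mirrors the ICE-POP lower bound discussed later in this review.

```python
import torch

def truncate_importance_ratio(
    ratio: torch.Tensor,
    tis_type: str,
    ratio_max: float,
    ratio_min: float = 0.5,
) -> torch.Tensor:
    # Hypothetical helper, not the PR's code.
    if tis_type == "tis":
        # TIS: clamp the importance ratio from above.
        return ratio.clamp(max=ratio_max)
    if tis_type == "icepop":
        # ICE-POP: keep ratios inside [ratio_min, ratio_max], zero out the rest
        # so the corresponding tokens no longer contribute to the loss.
        keep = (ratio >= ratio_min) & (ratio <= ratio_max)
        return ratio * keep
    raise ValueError(f"Unknown truncated_importance_sampling_type: {tis_type!r}")

# Example: ratios far from 1 are clamped (TIS) or dropped (ICE-POP).
r = torch.tensor([0.2, 0.8, 1.0, 3.0])
print(truncate_importance_ratio(r, "tis", ratio_max=2.0))     # tensor([0.2000, 0.8000, 1.0000, 2.0000])
print(truncate_importance_ratio(r, "icepop", ratio_max=2.0))  # tensor([0.0000, 0.8000, 1.0000, 0.0000])
```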
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
Warning: Review ran into problems
🔥 Problems
Git: Failed to clone repository. Please run the
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@examples/configs/prorl.yaml`:
- Line 40: The YAML boolean for the key use_leave_one_out_baseline is misspelled
as "fasle" causing parse errors; update the value to the correct boolean literal
"false" for use_leave_one_out_baseline in the prorl config so the YAML parses
and the flag behaves as expected.
In `@nemo_rl/algorithms/advantage_estimator.py`:
- Around line 20-21: The class docstring for ReinforcePlusPlusAdvantageEstimator
contains a duplicated phrase "KL penalty in reward and KL penalty in reward";
update the docstring in the ReinforcePlusPlusAdvantageEstimator declaration to
remove the duplicate and make the wording concise (e.g., "Reinforce++ with
optional baseline subtraction (minus_baseline) and KL penalty in reward"),
ensuring the docstring accurately describes the minus_baseline and KL penalty
behavior.
- Around line 91-98: The code adds a KL penalty without validating KL config;
update the guard in the block using use_kl_in_reward so you only call
calculate_kl when kl_coef and kl_type are not None (e.g., change the condition
to check self.kl_coef is not None and self.kl_type is not None in addition to
self.use_kl_in_reward and non-None logprobs), and if either is missing either
raise a clear ValueError mentioning self.kl_coef/self.kl_type or skip applying
the KL term to adv; reference the symbols use_kl_in_reward, kl_coef, kl_type,
calculate_kl, and adv when making the change.
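One way the validation described above could look, written as a self-contained sketch rather than the repository's method: the function name `apply_kl_in_reward` is hypothetical, the KL term is a simple stand-in (log-prob difference, i.e. the k1 estimator) instead of the repository's `calculate_kl`, and the sign/placement of the penalty is illustrative only.

```python
import torch

def apply_kl_in_reward(
    adv: torch.Tensor,
    logprobs: torch.Tensor | None,
    ref_logprobs: torch.Tensor | None,
    use_kl_in_reward: bool,
    kl_coef: float | None,
    kl_type: str | None,
) -> torch.Tensor:
    """Hypothetical standalone version of the guarded KL-in-reward step."""
    if not use_kl_in_reward or logprobs is None or ref_logprobs is None:
        return adv
    if kl_coef is None or kl_type is None:
        raise ValueError(
            "use_kl_in_reward is enabled but kl_coef (reference_policy_kl_penalty) "
            "or kl_type (reference_policy_kl_type) is missing"
        )
    # Stand-in for calculate_kl: the k1 estimator (log-prob difference).
    # The repository's calculate_kl selects among estimators via kl_type.
    kl = logprobs - ref_logprobs
    # Subtract the penalty from the advantage; the actual sign convention
    # follows whatever the repository's code does.
    return adv - kl_coef * kl
```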
🧹 Nitpick comments (6)
nemo_rl/algorithms/loss_functions.py (1)

141-148: Default values in code contradict coding guidelines.
Per coding guidelines: "YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values." The defaults "tis" and 0.5 should be defined in YAML configs, not here.

Proposed fix:

```diff
-        # Type of truncated importance sampling: "tis" (clamp max) or "icepop" (filter [min, max])
-        self.truncated_importance_sampling_type = cfg.get(
-            "truncated_importance_sampling_type", "tis"
-        )
-        # Lower bound for ICE-POP filtering (default 0.5)
-        self.truncated_importance_sampling_ratio_min = cfg.get(
-            "truncated_importance_sampling_ratio_min", 0.5
-        )
+        # Type of truncated importance sampling: "tis" (clamp max) or "icepop" (filter [min, max])
+        self.truncated_importance_sampling_type = cfg.get(
+            "truncated_importance_sampling_type"
+        )
+        # Lower bound for ICE-POP filtering
+        self.truncated_importance_sampling_ratio_min = cfg.get(
+            "truncated_importance_sampling_ratio_min"
+        )
```

Then ensure all YAML configs that use truncated_importance_sampling_ratio also specify truncated_importance_sampling_type and truncated_importance_sampling_ratio_min. Based on coding guidelines. A hypothetical guard for that requirement is sketched below.
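A minimal sketch of such a guard, assuming the loss config is a plain dict; the function name `check_tis_config` is hypothetical and not part of the repository:

```python
def check_tis_config(cfg: dict) -> None:
    """Hypothetical guard: if the TIS ratio is set in YAML, require the companion keys too."""
    if cfg.get("truncated_importance_sampling_ratio") is None:
        return
    missing = [
        key
        for key in (
            "truncated_importance_sampling_type",
            "truncated_importance_sampling_ratio_min",
        )
        if cfg.get(key) is None
    ]
    if missing:
        raise ValueError(
            f"truncated_importance_sampling_ratio is set but missing companion keys: {missing}"
        )

# Example: raises because the type and min ratio are not specified.
try:
    check_tis_config({"truncated_importance_sampling_ratio": 2.0})
except ValueError as err:
    print(err)
```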
nemo_rl/algorithms/advantage_estimator.py (2)

67-72: Default values in code contradict coding guidelines.
Similar to loss_functions.py, defaults like True, 0.01, and "k3" should come from YAML configs per coding guidelines.

Proposed fix:

```diff
     def __init__(self, estimator_config: dict, loss_config: dict):
-        self.minus_baseline = estimator_config.get("minus_baseline", True)
-        self.use_kl_in_reward = loss_config.get("use_kl_in_reward", False)
-        self.kl_coef = loss_config.get("reference_policy_kl_penalty", 0.01)
-        self.kl_type = loss_config.get("reference_policy_kl_type", "k3")
+        self.minus_baseline = estimator_config.get("minus_baseline")
+        self.use_kl_in_reward = loss_config.get("use_kl_in_reward")
+        self.kl_coef = loss_config.get("reference_policy_kl_penalty")
+        self.kl_type = loss_config.get("reference_policy_kl_type")
```

Based on coding guidelines.
100-106: Global normalization may have numerical issues with small masks.
mask.sum() could theoretically be zero or very small, leading to division issues. Consider adding a minimum clamp similar to the variance clamp.

Proposed fix:

```diff
         # global normalization across the batch
-        adv_mean = (adv * mask).sum() / mask.sum()
-        adv_var = ((adv - adv_mean).pow(2) * mask).sum() / mask.sum()
+        mask_sum = mask.sum().clamp(min=1)
+        adv_mean = (adv * mask).sum() / mask_sum
+        adv_var = ((adv - adv_mean).pow(2) * mask).sum() / mask_sum
         adv_rstd = adv_var.clamp(min=1e-8).rsqrt()
         adv = (adv - adv_mean) * adv_rstd
```
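A quick standalone check of the suggested clamped normalization; the tensor values here are made up, and the snippet simply re-runs the proposed fix on toy data.

```python
import torch

# Toy tensors: a small batch of per-token advantages with a very sparse mask.
adv = torch.tensor([[1.0, 2.0], [3.0, 0.0]])
mask = torch.tensor([[1.0, 0.0], [0.0, 0.0]])  # only one valid token

mask_sum = mask.sum().clamp(min=1)  # avoids dividing by zero / tiny sums
adv_mean = (adv * mask).sum() / mask_sum
adv_var = ((adv - adv_mean).pow(2) * mask).sum() / mask_sum
adv_rstd = adv_var.clamp(min=1e-8).rsqrt()
normalized = (adv - adv_mean) * adv_rstd

print(normalized.isfinite().all())  # tensor(True): stays finite even with a near-empty mask
```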
nemo_rl/algorithms/grpo.py (2)

1113-1129: Duplicated adv_estimator initialization code.
The same initialization logic appears in both grpo_train (lines 1113-1129) and async_grpo_train (lines 2120-2136). Extract to a helper function to follow the DRY principle.

Proposed refactor:

```python
def _create_advantage_estimator(master_config: MasterConfig):
    """Create advantage estimator based on configuration."""
    adv_estimator_config = master_config["grpo"].get("adv_estimator", {})
    adv_estimator_config.setdefault("name", "grpo")
    adv_estimator_config.setdefault(
        "use_leave_one_out_baseline", master_config["grpo"]["use_leave_one_out_baseline"]
    )
    adv_estimator_config.setdefault(
        "normalize_rewards", master_config["grpo"]["normalize_rewards"]
    )
    loss_config = master_config["loss_fn"]
    adv_estimator_name = adv_estimator_config["name"]
    if adv_estimator_name == "grpo":
        print(f"  ✓ Using GRPO advantage estimator")
        return GRPOAdvantageEstimator(adv_estimator_config, loss_config)
    elif adv_estimator_name == "reinforce_plus_plus":
        print(f"  ✓ Using Reinforce++ advantage estimator")
        return ReinforcePlusPlusAdvantageEstimator(adv_estimator_config, loss_config)
    else:
        raise ValueError(f"Invalid adv_estimator name: {adv_estimator_name}")
```
1397-1406: Placeholder advantages with shape mismatch.
The placeholder advantages has shape (batch_size, 1) but later gets expanded to message["token_ids"].shape in lines 1437-1438. This works, but the intermediate del advantages on line 1440 happens before the actual advantages are computed; the real advantages are then computed in lines 1511-1517. The flow is correct, but the variable reuse is confusing.
Consider renaming the placeholder to placeholder_advantages for clarity, or restructuring so the placeholder isn't deleted before the real computation.

examples/configs/prorl.yaml (1)
89-90: use_kl_in_reward: false but Reinforce++ selected.
With adv_estimator.name: "reinforce_plus_plus" and use_kl_in_reward: false, the KL penalty will be applied in the loss (not the reward). This is valid, but worth noting that the "full" Reinforce++ setup typically uses KL in the reward. Consider adding a comment clarifying this choice.
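To make the distinction in the comment above concrete, here are two illustrative configuration fragments expressed as Python dicts; the keys follow the names used in this PR, but the grouping and values are examples, not the config's actual defaults.

```python
# KL penalty applied in the loss (what examples/configs/prorl.yaml selects):
kl_in_loss = {
    "adv_estimator": {"name": "reinforce_plus_plus"},
    "loss_fn": {"use_kl_in_reward": False},
}

# "Full" Reinforce++ style: the KL penalty is folded into the reward instead.
kl_in_reward = {
    "adv_estimator": {"name": "reinforce_plus_plus"},
    "loss_fn": {"use_kl_in_reward": True},
}
```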
examples/configs/prorl.yaml (Outdated)

```yaml
name: "reinforce_plus_plus" # Use "grpo" for standard GRPO
# GRPO specific
normalize_rewards: true
use_leave_one_out_baseline: fasle
```
Typo: fasle should be false.
The boolean value is misspelled, which will cause a YAML parse error or unexpected behavior.

Proposed fix:

```diff
-use_leave_one_out_baseline: fasle
+use_leave_one_out_baseline: false
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```yaml
use_leave_one_out_baseline: false
```
🤖 Prompt for AI Agents
In `@examples/configs/prorl.yaml` at line 40, The YAML boolean for the key
use_leave_one_out_baseline is misspelled as "fasle" causing parse errors; update
the value to the correct boolean literal "false" for use_leave_one_out_baseline
in the prorl config so the YAML parses and the flag behaves as expected.
@hijkzzz Thanks for the contribution. I think @yuki-97 has a draft PR with similar contents from before. Is this an improvement on that or fresh-start work? If you are resuming her work, can you add her to Co-authored-By in the initial commit?
@joyang-nv @yuki-97 doesn’t have a complete implementation of ProRL v2… and it’s not based on the latest codebase.
@hijkzzz You are right. @yuki-97 couldn’t finish what she planned due to priorities from the management team. And it looks like you are resuming the rest. :)
@joyang-nv This implementation wasn’t developed based on @yuki-97’s MR. It was rewritten from scratch and is still being debugged.
Hi @hijkzzz: Please note that you are the author of ProRL, and @yuki-97 has also been contributing to ProRL and landing it in Nemo RL. Yuki was pulled into this contribution at my request last year and spent over a month of dedicated effort on it. Unfortunately, I pulled her onto other high-priority tasks this year, but I still want to recognize her efforts on bringing ProRL to Nemo RL. I am also thankful for your contribution to ProRL and Nemo RL. Here is a list of the work Yuki has done and the pending tasks:
Pending:
Dev branches:
@joyang-nv This MR was redeveloped from scratch and wasn’t based on the previous code. Most of the time went into debugging rather than writing code; the code itself was completed in about an hour using Cursor.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Yi-Fu Wu <[email protected]>
Signed-off-by: jianh <[email protected]>
- Change epsilon from 1e-8 to 1e-6 in GRPOAdvantageEstimator to match existing implementation
- Remove .get() defaults in ReinforcePlusPlusAdvantageEstimator, access config directly
- Add default values to grpo_math_1B.yaml base config (adv_estimator, stop_properly_penalty_coef, truncated_importance_sampling_ratio_min, truncated_importance_sampling_type, use_kl_in_reward)
- Remove unused normalize_advantages_with_epsilon function from grpo.py
- Replace normalize_advantages_with_epsilon tests with GRPOAdvantageEstimator and ReinforcePlusPlusAdvantageEstimator tests

Signed-off-by: jianh <[email protected]>
…mator defaults
- Add test cases for stop_properly_penalty_coef in reward_functions
- Remove setdefault() calls in grpo.py - use yaml defaults instead
- Update minus_baseline default to true in grpo_math_1B.yaml

Signed-off-by: jianh <[email protected]>



Summary by CodeRabbit