Conversation
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Prevents incorrect dp size in parallel_state during initial import. Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Add packed_seq_params change to get_topk_logits too Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Jiaqi Zeng <jiaqiz@nvidia.com>
📝 Walkthrough
A new configuration option, `force_on_policy_ratio`, is added to force the PPO ratio to 1.0 for fully on-policy training.
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_rl/algorithms/loss_functions.py (1)
36-61: Fix duplicated `force_on_policy_ratio` key in `ClippedPGLossConfig`
`ClippedPGLossConfig` currently declares `force_on_policy_ratio` twice: once as a required `bool` and once as `NotRequired[bool]` with the explanatory comment. Type checkers will flag this, and the intent (an optional flag with documented semantics) matches the second declaration. Recommend keeping a single optional entry:

```diff
 class ClippedPGLossConfig(TypedDict):
@@
-    token_level_loss: bool
-    force_on_policy_ratio: bool
+    token_level_loss: bool
@@
-    disable_ppo_ratio: NotRequired[bool]
-    # If True, force the ratio to 1.0 for truly on-policy behavior,
+    disable_ppo_ratio: NotRequired[bool]
+    # If True, force the ratio to 1.0 for truly on-policy behavior,
@@
-    force_on_policy_ratio: NotRequired[bool]
+    force_on_policy_ratio: NotRequired[bool]
```

That keeps the docs and avoids inconsistent typing.
🧹 Nitpick comments (2)
nemo_rl/algorithms/grpo.py (1)
585-595: On-policy ratio validation looks correct but is quite strict

The assertion enforces the documented requirement that `force_on_policy_ratio` only be used when `train_global_batch_size == num_prompts_per_step * num_generations_per_prompt`, which is good for catching misconfigurations early. Note that this will also forbid using the flag with async GRPO even if users think the config is "close enough," since async inherently reuses trajectories; if that's intentional, consider adding a brief clarifying note in the docs or config comments so users don't try to combine the two.

nemo_rl/algorithms/loss_functions.py (1)
117-155: `force_on_policy_ratio` ctor wiring is correct; consider guarding incompatible modes

The ctor wiring of `self.force_on_policy_ratio = cfg.get("force_on_policy_ratio", False)` is consistent with existing flags and lets older configs omit the key. Behavior-wise it works, but note that you now have three mutually interacting switches (`disable_ppo_ratio`, `sequence_level_importance_ratios`, `force_on_policy_ratio`). It may be worth asserting that obviously incompatible combinations (e.g., `disable_ppo_ratio=True` and `force_on_policy_ratio=True`) are rejected early rather than producing confusing metrics.
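Both suggested guards (the batch-size constraint and the incompatible-flag check) could be combined into one early validation helper; the function name, dict shapes, and error messages below are illustrative, not code from the PR:

```python
def validate_on_policy_config(loss_cfg: dict, grpo_cfg: dict) -> None:
    """Reject configs where force_on_policy_ratio cannot be truly on-policy."""
    if not loss_cfg.get("force_on_policy_ratio", False):
        return
    # Mutually exclusive switches: forcing the ratio to 1.0 while also
    # disabling the PPO ratio would produce confusing metrics.
    if loss_cfg.get("disable_ppo_ratio", False):
        raise ValueError(
            "force_on_policy_ratio=True is incompatible with disable_ppo_ratio=True"
        )
    # Fully on-policy requires consuming every trajectory in one step.
    expected = (
        grpo_cfg["num_prompts_per_step"] * grpo_cfg["num_generations_per_prompt"]
    )
    if grpo_cfg["train_global_batch_size"] != expected:
        raise ValueError(
            "force_on_policy_ratio requires train_global_batch_size == "
            "num_prompts_per_step * num_generations_per_prompt "
            f"({grpo_cfg['train_global_batch_size']} != {expected})"
        )
```

Running this once at setup fails fast on misconfiguration instead of surfacing as odd training metrics later.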
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- examples/configs/grpo_math_1B.yaml (1 hunks)
- examples/configs/vlm_grpo_3B.yaml (1 hunks)
- examples/configs/vlm_grpo_3B_megatron.yaml (1 hunks)
- examples/penguin/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml (1 hunks)
- nemo_rl/algorithms/grpo.py (3 hunks)
- nemo_rl/algorithms/loss_functions.py (8 hunks)
- tests/unit/algorithms/test_grpo.py (1 hunks)
- tests/unit/algorithms/test_loss_functions.py (2 hunks)
- tests/unit/algorithms/test_sequence_packing_gradients.py (1 hunks)
- tests/unit/models/policy/test_dtensor_worker.py (1 hunks)
- tests/unit/models/policy/test_megatron_worker.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
- examples/configs/grpo_math_1B.yaml
- examples/penguin/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml
- examples/configs/vlm_grpo_3B_megatron.yaml
- tests/unit/models/policy/test_dtensor_worker.py
- examples/configs/vlm_grpo_3B.yaml
- tests/unit/algorithms/test_sequence_packing_gradients.py
- tests/unit/models/policy/test_megatron_worker.py
- tests/unit/algorithms/test_grpo.py
- nemo_rl/algorithms/grpo.py
- tests/unit/algorithms/test_loss_functions.py
- nemo_rl/algorithms/loss_functions.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code
Files:
- tests/unit/models/policy/test_dtensor_worker.py
- tests/unit/algorithms/test_sequence_packing_gradients.py
- tests/unit/models/policy/test_megatron_worker.py
- tests/unit/algorithms/test_grpo.py
- nemo_rl/algorithms/grpo.py
- tests/unit/algorithms/test_loss_functions.py
- nemo_rl/algorithms/loss_functions.py
**/*.{py,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)
Files:
- tests/unit/models/policy/test_dtensor_worker.py
- tests/unit/algorithms/test_sequence_packing_gradients.py
- tests/unit/models/policy/test_megatron_worker.py
- tests/unit/algorithms/test_grpo.py
- nemo_rl/algorithms/grpo.py
- tests/unit/algorithms/test_loss_functions.py
- nemo_rl/algorithms/loss_functions.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
For any source file under nemo_rl/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes
Files:
- nemo_rl/algorithms/grpo.py
- nemo_rl/algorithms/loss_functions.py
🧠 Learnings (1)
📚 Learning: 2025-09-18T13:26:43.307Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-8b-base-2n8g-fsdp2tp2.v1.yaml:19-26
Timestamp: 2025-09-18T13:26:43.307Z
Learning: In on-policy distillation workflows, validation can use downstream task performance (like math problem solving) as RL-like reward metrics rather than traditional distillation metrics like KL divergence. In this case, "val_reward" with "higher_is_better: true" is the correct checkpoint monitoring configuration.
Applied to files:
examples/configs/vlm_grpo_3B_megatron.yaml
🧬 Code graph analysis (2)
nemo_rl/algorithms/grpo.py (1)
tests/check_metrics.py (2)
min (25-27), max (30-32)
tests/unit/algorithms/test_loss_functions.py (1)
nemo_rl/algorithms/loss_functions.py (1)
ClippedPGLossFn (76-478)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: sphinx-build / Build docs
- GitHub Check: Lint check
- GitHub Check: Lint check
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (14)
examples/configs/vlm_grpo_3B.yaml (1)
34-49: Loss config flag addition is consistent and safely defaulted
`force_on_policy_ratio: false` is added with a clear comment, and this config already satisfies the documented prerequisite (128 = 8 * 16), so the flag can be safely toggled later without further changes.

examples/configs/grpo_math_1B.yaml (1)
38-56: force_on_policy_ratio flag wired correctly into math GRPO config

The new `force_on_policy_ratio: false` field and its comment are consistent with other configs, and the batch-size relation (512 = 32 * 16) already meets the stated requirement for enabling it.

examples/penguin/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml (1)
31-45: Penguin GRPO recipe aligns flag and batch-size constraint

Adding `force_on_policy_ratio: false` with the explanatory comment is consistent with other recipes, and computing `train_global_batch_size` directly from `num_prompts_per_step * num_generations_per_prompt` guarantees the documented prerequisite holds when this flag is enabled.

tests/unit/models/policy/test_dtensor_worker.py (1)
667-685: DTensor PG loss test config correctly extended with force_on_policy_ratio

Including `"force_on_policy_ratio": False` in the `ClippedPGLossFn` config keeps this test aligned with the updated loss configuration while preserving its original behavior (microbatch-size invariance under standard PPO ratios).

examples/configs/vlm_grpo_3B_megatron.yaml (1)
31-45: Megatron VLM GRPO config cleanly integrates force_on_policy_ratio

The new `force_on_policy_ratio: false` option and its comment match other configs, and the batch-size relation (128 = 8 * 16) satisfies the documented requirement if users choose to enable this behavior.

tests/unit/algorithms/test_sequence_packing_gradients.py (1)
129-145: Sequence-packing gradient test loss_config updated appropriately

Adding `"force_on_policy_ratio": False` to `loss_config` ensures compatibility with the extended `ClippedPGLossFn` configuration without affecting the baseline-vs-packed gradient comparisons this test asserts.

tests/unit/algorithms/test_grpo.py (1)
886-902: GRPO unit tests now pass full ClippedPG loss config

The addition of `"force_on_policy_ratio": False` to the `ClippedPGLossFn` configuration keeps these GRPO tests aligned with the expanded loss options, while preserving existing training-loop behavior in both synchronous and async modes.

tests/unit/models/policy/test_megatron_worker.py (1)
39-54: Megatron basic_pg_loss_test_config extended with force_on_policy_ratio

Updating `basic_pg_loss_test_config` to include `"force_on_policy_ratio": False` correctly aligns all Megatron PG-loss tests with the updated `ClippedPGLossConfig` while maintaining their existing expectations around ratios and gradients.

nemo_rl/algorithms/grpo.py (2)
1355-1365: Ratio min/max aggregation is reasonable; just ensure upstream always passes iterables

The new handling for `probs_ratio_min/max` and `probs_ratio_clamped_min/max` (filtering out `inf` and falling back to `-1.0` when all values are invalid) matches the sentinel semantics from the loss and should keep logs sane even when some microbatches/sequences have no valid tokens. This loop assumes `v` is iterable (e.g., a NumPy array or list); if any training path ever supplies a scalar here, `for x in v` will fail. Worth double-checking that `train_results["all_mb_metrics"]` always uses array-like containers.
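The reduction described above can be sketched in pure Python (names are illustrative; the real loop runs over `train_results["all_mb_metrics"]` entries, which are array-like):

```python
import math


def reduce_ratio_metric(per_mb_values, kind):
    """Aggregate per-microbatch ratio extremes, ignoring ±inf sentinels.

    Microbatches with no valid tokens emit +inf (for min metrics) or -inf
    (for max metrics). If every value is a sentinel, fall back to -1.0 so
    the logged metric stays finite and recognizably "no data".
    """
    finite = [x for v in per_mb_values for x in v if not math.isinf(x)]
    if not finite:
        return -1.0
    return min(finite) if kind == "min" else max(finite)
```

The `-1.0` fallback is unambiguous here because genuine PPO ratios are positive.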
2294-2304: Mirror of ratio min/max aggregation in async path

The async GRPO metrics reduction mirrors the sync path and correctly reuses the same `inf`-filtering semantics for the new ratio metrics. The same caveat applies: this assumes each metric value is iterable rather than a bare scalar.

nemo_rl/algorithms/loss_functions.py (3)
311-335: On-policy ratio forcing implementation matches intent

The new `force_on_policy_ratio` branch (`log_ratios = curr_logprobs - curr_logprobs.detach()`) correctly yields `ratios = 1.0` in the forward pass while keeping non-zero gradients w.r.t. `curr_logprobs`, so the actor loss behaves as "truly on-policy PPO" without referencing `prev_logprobs`. Clamping is effectively a no-op in this mode, which is exactly what you want. No issues here.
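The detach trick can be seen in a small standalone sketch, assuming PyTorch (tensor values and variable names here are illustrative): the forward ratio is exactly 1, but the gradient is that of exp(x - const) evaluated at x = const, i.e. 1 per element, so advantages still flow into the policy.

```python
import torch

# Current-policy token log-probs (requires grad, as in the actor loss).
curr_logprobs = torch.tensor([-0.5, -1.2, -0.1], requires_grad=True)

# force_on_policy_ratio branch: subtracting the detached copy makes the
# forward value exp(0) == 1 while keeping the graph through curr_logprobs.
log_ratios = curr_logprobs - curr_logprobs.detach()
ratios = torch.exp(log_ratios)  # all ones in the forward pass

# d/dx exp(x - c) at x == c is exp(0) == 1, so the surrogate loss still
# propagates advantages: grad of -(ratios * A).sum() is exactly -A.
advantages = torch.tensor([0.3, -0.7, 1.1])
loss = -(ratios * advantages).sum()
loss.backward()
```

Clamping is a no-op here since every ratio is already 1.0, matching the review's observation.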
429-455: Ratio min/max metrics and sentinels look good

The min/max ratio metrics:
- Compute over valid tokens only (`mask.bool()`), in line with how the mean ratios are computed.
- Use `±inf` sentinels when there are no valid tokens so downstream aggregation can distinguish "no data" from real extremes.

This design matches the aggregation logic in `grpo.py` and `SequencePackingLossWrapper`; just be aware that any consumer that doesn't implement the same sentinel filtering will see infinities for completely masked batches.
944-961: SequencePackingLossWrapper aggregation correctly handles new ratio metrics

Initializing `probs_ratio_min`/`_clamped_min` with `+inf` and the max variants with `-inf`, then skipping values for which `math.isinf(val)` holds when updating, ensures that:
- Sequences with no valid tokens (which emit `±inf` from the base loss) don't influence the packed aggregate.
- "True" min/max values across packed sequences are preserved.

For all other metrics you keep the existing sum-across-sequences behavior. This is a clean extension for the new metrics.
tests/unit/algorithms/test_loss_functions.py (1)
30-45: Good coverage for the on-policy ratio path

Extending `basic_pg_loss_test_config` with `force_on_policy_ratio` and adding `test_clipped_pg_loss_force_on_policy_ratio` gives solid unit coverage of the new mode: you reuse the existing PPO scenario, assert the actor loss matches the "all ratios = 1.0" expectation, and verify all ratio-related metrics report exactly 1. This should catch regressions both in the ratio-forcing path and in the new min/max metrics.

Also applies to: 566-619
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
terrykong
left a comment
very clever optimization!
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: Jiaqi Zeng <jiaqiz@nvidia.com> Co-authored-by: Jiaqi Zeng <jiaqiz@nvidia.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: Jiaqi Zeng <jiaqiz@nvidia.com> Co-authored-by: Jiaqi Zeng <jiaqiz@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
What does this PR do ?
Adds a flag for "forcing" the on-policy ratio to 1 in the fully on-policy case (`num_prompts_per_step * num_generations_per_prompt == train_global_batch_size`). Original PR created by @HeyyyyyyG

Issues
List issues that this PR closes (syntax):
Usage
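A hedged sketch of enabling the new flag in a GRPO YAML config; the exact key nesting follows the exemplar configs touched in this PR (e.g. grpo_math_1B.yaml) but should be treated as illustrative:

```yaml
# GRPO config excerpt (key nesting is illustrative)
grpo:
  num_prompts_per_step: 32
  num_generations_per_prompt: 16

policy:
  train_global_batch_size: 512  # must equal 32 * 16 when the flag is on

loss_fn:
  # Force the PPO ratio to exactly 1.0 (truly on-policy updates).
  force_on_policy_ratio: true
```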
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
`force_on_policy_ratio` configuration option to enforce on-policy PPO behavior.

Chores
✏️ Tip: You can customize this high-level summary in your review settings.