Conversation
Hi @yuki-97, this is the new PR created based on the branch on the nemo-rl repo.
📝 Walkthrough

This PR introduces GDPO (Group Decision Policy Optimization) support to the NeMo RL framework, adding a new multi-component reward advantage estimator, a math task dataset (GSM8K) with dedicated verification, multi-reward environment support, and associated training pipeline enhancements to handle per-component reward signals.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Rollout as Rollout Collector
    participant Env as MathMultiRewardEnvironment
    participant Worker as HFMultiRewardVerifyWorker
    participant Estimator as GDPOAdvantageEstimator
    participant Batch as Batch
    Rollout->>Env: step(actions, predictions)
    Env->>Env: chunk predictions for parallel processing
    loop For each chunk
        Env->>Worker: verify(predictions)
        Worker->>Worker: compute 3 reward signals<br/>(correctness, format, int)
        Worker-->>Env: rewards[3]
    end
    Env->>Env: stack rewards into tensor
    Env-->>Rollout: (obs, rewards[batch, 3], done, metadata)
    Rollout->>Rollout: accumulate per-component<br/>rewards as reward1, reward2, reward3
    Rollout->>Batch: add reward1, reward2, reward3 fields
    Batch->>Estimator: compute_advantage(prompt_ids, rewards, repeated_batch, mask)
    Estimator->>Estimator: loop over reward components
    Estimator->>Estimator: compute per-component baselines
    Estimator->>Estimator: normalize per-component rewards
    Estimator->>Estimator: sum components
    Estimator->>Estimator: apply global normalization
    Estimator-->>Batch: advantages (normalized)
```
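The estimator steps at the end of the diagram (per-component baseline, per-component normalization, sum, global normalization) can be sketched in plain Python. This is a simplified, list-based stand-in for the tensor implementation; the function name, group layout, and epsilon are assumptions, not the actual GDPO code:

```python
def gdpo_advantages(rewards_per_component, group_size, eps=1e-6):
    """Normalize each reward component within its prompt group, sum the
    normalized components, then apply a global normalization pass.

    rewards_per_component: list of lists, shape [num_components][batch],
    where consecutive runs of `group_size` samples share a prompt.
    """
    batch = len(rewards_per_component[0])
    summed = [0.0] * batch
    for comp in rewards_per_component:
        # per-component, per-group baseline and spread
        for g in range(0, batch, group_size):
            group = comp[g:g + group_size]
            mean = sum(group) / len(group)
            var = sum((r - mean) ** 2 for r in group) / len(group)
            std = var ** 0.5
            for j, r in enumerate(group):
                summed[g + j] += (r - mean) / (std + eps)
    # global normalization over the summed advantages
    mean = sum(summed) / batch
    var = sum((a - mean) ** 2 for a in summed) / batch
    std = var ** 0.5
    return [(a - mean) / (std + eps) for a in summed]
```

With a single alternating component and groups of two, the output is approximately +1/-1 per group, matching the "normalize within group, then globally" description above.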
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_rl/algorithms/grpo.py (1)
955-976: ⚠️ Potential issue | 🟠 Major

Validate the reward-scaling range before computing the linear map.

If `source_max <= source_min`, `_scale` divides by zero and can poison rewards with `inf`/`nan`.

💡 Proposed fix

```diff
 source_min = float(reward_scaling_cfg["source_min"])
 source_max = float(reward_scaling_cfg["source_max"])
 target_min = float(reward_scaling_cfg["target_min"])
 target_max = float(reward_scaling_cfg["target_max"])
+if source_max <= source_min:
+    raise ValueError(
+        "Invalid reward scaling config: source_max must be greater than source_min"
+    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/algorithms/grpo.py` around lines 955 - 976, Check that reward_scaling_cfg values define a valid source range before using them: verify float(reward_scaling_cfg["source_max"]) > float(reward_scaling_cfg["source_min"]) and if not, raise a clear exception or disable scaling; do this check near where source_min/source_max are parsed (the block that defines source_min, source_max, target_min, target_max) and ensure _scale will never divide by zero (i.e., skip scaling or clamp to a no-op mapping when invalid). Include the config keys (reward_scaling_cfg, source_min, source_max) and the _scale function name in your change so reviewers can find the validation logic.
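A guarded version of the whole linear map might look like the following. This is a sketch only; `make_scaler` and the closure shape are illustrative, while the config keys and the `_scale` name come from the comment above:

```python
def make_scaler(reward_scaling_cfg):
    """Build a linear reward map target = target_min + (r - source_min) * slope,
    validating the source range before any division."""
    source_min = float(reward_scaling_cfg["source_min"])
    source_max = float(reward_scaling_cfg["source_max"])
    target_min = float(reward_scaling_cfg["target_min"])
    target_max = float(reward_scaling_cfg["target_max"])
    if source_max <= source_min:
        raise ValueError(
            "Invalid reward scaling config: source_max must be greater than source_min"
        )
    slope = (target_max - target_min) / (source_max - source_min)

    def _scale(reward):
        return target_min + (reward - source_min) * slope

    return _scale
```

Constructing the scaler up front means the divide-by-zero check happens once, at config-parse time, rather than on every reward.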
🧹 Nitpick comments (1)

nemo_rl/environments/interfaces.py (1)

47-47: Move the reward-shape contract into the `EnvironmentReturn` docstring.

Keeping this as an inline `##` comment on a public interface field makes the contract easy to miss; document the expected shapes (e.g., `[batch]` vs `[batch, num_rewards]`) in the class docstring and keep the annotation clean. As per coding guidelines, "For interfaces that may be used outside a file, prefer docstrings over comments".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/environments/interfaces.py` at line 47, The inline comment on the EnvironmentReturn.rewards field should be removed and its contract moved into the EnvironmentReturn class docstring: update the EnvironmentReturn docstring to describe the expected shapes for rewards (e.g., [batch] for scalar per-batch, [batch, num_rewards] for vector rewards), mention optional semantics (None vs zeros) if relevant, and keep the field annotation as simply "rewards: Tensor"; refer to the EnvironmentReturn class and the rewards attribute when editing so the contract is discoverable to external users.
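One way the documented contract could look (an illustrative stand-in, not the real interface class; only the `rewards` field and its docstring text are taken from the comment above):

```python
from typing import Any, NamedTuple


class EnvironmentReturn(NamedTuple):
    """Illustrative sketch of the per-step environment return.

    Attributes:
        observations: per-sample observation messages.
        rewards: either shape [batch] (one scalar reward per sample) or
            [batch, num_rewards] (one row of reward components per sample).
    """

    observations: list
    rewards: Any
```

The shape contract lives in the docstring, so tools that render API docs surface it, and the field annotation stays clean.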
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/prompts/gsm8k.txt`:
- Around line 10-12: The prompt's output tags (<think> and <answer>) are
incompatible with the math verifiers (nemo_rl/environments/dapo_math_verifier.py
and nemo_rl/environments/math_environment.py) that expect an "Answer:" prefix;
update the example prompt (the <answer> block in examples/prompts/gsm8k.txt) to
emit a final line starting with "Answer:" followed by the integer result (e.g.,
"Answer: 42") so it matches the extractor used by the verifier and preserves
integer-only output.
In `@examples/run_grpo.py`:
- Around line 143-147: The code assumes config["grpo"]["adv_estimator"] always
exists and will KeyError for legacy configs; guard access by checking that
config.get("grpo") and config["grpo"].get("adv_estimator") (or use
"adv_estimator" in config["grpo"]) exist and are dict-like before inspecting
["name"], and only raise NotImplementedError when that name == "gdpo"; update
the conditional around the existing check in examples/run_grpo.py so it first
verifies presence of adv_estimator and its "name" key before comparing to
"gdpo".
In `@nemo_rl/algorithms/advantage_estimator.py`:
- Around line 146-153: The current normalization computes mean/std over all
entries and then expands with mask, letting inactive (masked-out) samples skew
stats; change normalization to compute mean and std only over active entries
indicated by mask: extract masked_adv = advantages[mask.bool()] (or equivalent),
compute mean and std from masked_adv, normalize masked_adv (handling std==0 by
subtracting mean only), then write the normalized values back into the original
advantages tensor at the masked positions while leaving inactive entries
unchanged (or zero), and finally return advantages.expand(mask.shape). Update
the block around the existing advantages/std logic (the variables `advantages`
and `mask`) to perform masked statistics and assignment.
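The masked statistics can be illustrated with a list-based sketch (the real code operates on tensors; zeroing inactive entries and centering-only on zero std are the choices suggested above):

```python
def masked_normalize(advantages, mask, eps_free=True):
    """Normalize using mean/std computed over active (mask=1) entries only,
    leaving inactive entries at zero."""
    active = [a for a, m in zip(advantages, mask) if m]
    mean = sum(active) / len(active)
    var = sum((a - mean) ** 2 for a in active) / len(active)
    std = var ** 0.5
    out = []
    for a, m in zip(advantages, mask):
        if not m:
            out.append(0.0)          # inactive entry: leave zeroed
        elif std == 0:
            out.append(a - mean)     # degenerate spread: center only
        else:
            out.append((a - mean) / std)
    return out
```

Note how the masked-out outlier (100.0 below) no longer skews the statistics of the active entries.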
In `@nemo_rl/data/datasets/response_datasets/gsm8k.py`:
- Around line 22-25: The function _extract_hash_answer currently returns None
when "####" is missing which can propagate into assistant content; change it to
always return a str by returning an empty string (or text.strip() if you prefer
preserving input) instead of None, and update the type annotation to just str.
Concretely, modify _extract_hash_answer to check for "####" and return "" (or
text.strip()) when absent, and keep the existing split behavior when present so
callers like any code reading assistant content never receive None.
In `@nemo_rl/environments/math_environment.py`:
- Line 300: Remove the leftover commented-out debug call "_mute_output()" in the
verifier path of the reward computation and any other commented executable lines
around the verifier in MathEnvironment (e.g., in the reward function / verifier
handling); either delete these debug-only comments or replace them with a short
explanatory comment stating why the code is intentionally disabled and when it
should be re-enabled, so intent is clear before merge.
- Around line 301-303: The code currently provides hidden defaults for config
keys (e.g., using kwargs.get("math_verify_impl", "hf_math_verify"), cfg.get(...,
"math"), and self.cfg.get(..., "hf_math_verify")) which can mask missing YAML
settings; update the code to read these required config values directly without
non-None fallbacks (access kwargs["math_verify_impl"], cfg["math"],
self.cfg["hf_math_verify"] or corresponding required keys) and validate presence
explicitly (raise a clear KeyError or ValueError with context if missing) before
using them (e.g., before calling correctness_reward_func with math_verify_impl);
ensure any downstream conditional logic uses the retrieved value rather than an
implicit default.
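The fail-loudly pattern could be as simple as the following (the helper name is hypothetical; the key name is one of those listed above):

```python
def require_key(cfg: dict, key: str):
    """Read a required config value without a hidden default, raising a
    clear error that points at the YAML when the key is missing."""
    if key not in cfg:
        raise KeyError(
            f"Missing required config key '{key}'; set it explicitly in the YAML"
        )
    return cfg[key]
```

Downstream code then uses `require_key(cfg, "math_verify_impl")` instead of `cfg.get("math_verify_impl", "hf_math_verify")`, so a missing YAML setting surfaces immediately rather than silently picking a default.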
- Around line 637-657: global_post_process_and_metrics assumes scalar rewards
but batch["rewards"] can be [batch_size, 3]; fix by collapsing to a per-sample
scalar correctness reward before using it: detect if batch["rewards"].dim() > 1
and then select the correctness channel (e.g., rewards = batch["rewards"][:,
CORRECTNESS_CHANNEL] or pick the channel by name if provided), then replace
subsequent uses (the masking for correct_solution_generation_lengths that
indexes generation_lengths, the multiplication batch["rewards"] *
batch["is_end"], and the metrics "accuracy" and pass@samples_per_prompt calls)
to operate on this 1D rewards tensor; ensure correct_solution_generation_lengths
uses (batch["generation_lengths"] - batch["prompt_lengths"])[rewards == 1] and
pass the 1D rewards into calculate_pass_rate_per_prompt and metrics.
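A simplified sketch of the collapse (list-based stand-in for the tensor code; the channel index is an assumption, since the comment leaves picking it by position or by name open):

```python
CORRECTNESS_CHANNEL = 0  # assumed position of the correctness reward


def collapse_rewards(rewards):
    """Reduce [batch, num_rewards] rows to a per-sample correctness
    scalar; pass 1-D reward lists through unchanged."""
    if rewards and isinstance(rewards[0], list):
        return [row[CORRECTNESS_CHANNEL] for row in rewards]
    return rewards
```

The metrics code then operates on the returned 1-D rewards regardless of whether the environment emitted one component or three.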
- Around line 575-600: The code currently dispatches remote work via
self.workers[i].verify.remote(...) and ray.get(...) before checking
return_extracted_answer and raising NotImplementedError; move the early
validation so that return_extracted_answer is checked and the
NotImplementedError is raised before creating futures or calling ray.get to
avoid wasted cluster work. Locate the block that builds futures and calls
ray.get (uses self.workers[i].verify.remote, futures, and ray.get) and add a
guard that validates return_extracted_answer (and any related flags) first; if
unsupported, raise the NotImplementedError immediately, otherwise proceed to
build futures and call ray.get.
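The guard-before-dispatch ordering can be sketched without Ray as follows (plain callables stand in for `verify.remote` futures and `ray.get`; names are illustrative):

```python
def verify_batch(workers, predictions, return_extracted_answer=False):
    """Validate unsupported flags before dispatching any work, so no
    cluster effort is wasted on a request that will be rejected."""
    if return_extracted_answer:
        # fail fast: no futures built, nothing to await
        raise NotImplementedError(
            "return_extracted_answer is not supported by this environment"
        )
    # only now fan out to the workers (ray.get on futures in the real code)
    return [worker(predictions) for worker in workers]
```

The only change relative to the flagged code is ordering: the `NotImplementedError` check moves above the fan-out instead of after it.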
In `@nemo_rl/experience/rollouts.py`:
- Around line 837-840: The async rollouts drop per-component rewards because
run_async_multi_turn_rollout constructs final_batch from a fixed schema and
never stacks the reward1..rewardN keys that rollouts.py adds via
multi_reward_seen and reward_acc_list into final_sample_state. Update
run_async_multi_turn_rollout to detect per-component reward keys (e.g., keys
starting with "reward" or using the multi_reward_seen flag) when assembling
final_batch, and for each reward{n} call the same stacking/concatenation logic
used for other tensor fields (using torch.stack or the existing batching helper)
so reward1..rewardN are included in final_batch with correct dimensions and
dtypes; reference the symbols final_sample_state, reward_acc_list,
multi_reward_seen, and final_batch to locate where to insert this behavior.
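The key scan could be sketched like this (list appends stand in for `torch.stack`; the `reward` prefix convention comes from the comment above, and the helper name is hypothetical):

```python
def assemble_reward_fields(sample_states):
    """Collect any per-component reward{n} keys present on the per-sample
    states into batch-level lists, mirroring the sync rollout path."""
    batch = {}
    reward_keys = sorted(
        k for k in sample_states[0] if k.startswith("reward")
    )
    for key in reward_keys:
        # torch.stack([...]) in the real code, preserving dtype/shape
        batch[key] = [state[key] for state in sample_states]
    return batch
```

Merging the result into `final_batch` gives the async path the same `reward1..rewardN` fields the sync path already produces.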
- Around line 474-490: The accumulation logic fails when env_output.rewards has
shape [N,1] because number_of_rewards==1 sends it to the else branch and
attempts to add a [N,1] tensor into total_rewards [N]. Update the accumulation
to handle 2D single-component rewards: if number_of_rewards == 1 and
env_output.rewards.ndim >= 2, squeeze the last dimension (e.g., rewards =
env_output.rewards.squeeze(-1)) before adding to total_rewards[active_indices];
otherwise keep existing handling for multi_rewards (number_of_rewards > 1) and
the 1D reward case. Apply this change around the branches that reference
number_of_rewards, multi_rewards, env_output.rewards, total_rewards, and
active_indices.
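A list-based sketch of the accumulation fix (a `[N, 1]` reward column is modeled as one-element lists; the real code would call `squeeze(-1)` on the tensor before adding):

```python
def accumulate_rewards(total_rewards, env_rewards):
    """Fold a possibly-[N, 1] reward column into a flat [N] accumulator,
    leaving already-1-D rewards untouched."""
    flat = [
        r[0] if isinstance(r, list) and len(r) == 1 else r
        for r in env_rewards
    ]
    return [t + r for t, r in zip(total_rewards, flat)]
```

Both the `[N, 1]` single-component case and the plain 1-D case now land in the same flat accumulator.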
In `@tests/functional/gdpo.sh`:
- Line 37: The script is using unquoted $@ which can split arguments with
spaces; update the invocation that currently passes $@ to instead pass "$@" so
all forwarded CLI args are preserved as distinct arguments (replace the unquoted
$@ occurrence in the gdpo.sh invocation with the quoted form "$@").
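A minimal demonstration of the quoting difference (the function name is illustrative):

```shell
forward() {
  # "$@" keeps each forwarded argument as one word; an unquoted $@
  # would re-split "a b" into two separate arguments
  printf '%s\n' "$@"
}

forward "a b" c
```

With `"$@"` the call prints two lines (`a b` and `c`); with unquoted `$@` it would print three.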
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c2ad0f49-dfb3-488a-92a2-175c53a3280c
📒 Files selected for processing (17)
- examples/configs/gdpo_math_1B.yaml
- examples/prompts/gsm8k.txt
- examples/run_grpo.py
- nemo_rl/algorithms/advantage_estimator.py
- nemo_rl/algorithms/grpo.py
- nemo_rl/algorithms/utils.py
- nemo_rl/data/datasets/response_datasets/__init__.py
- nemo_rl/data/datasets/response_datasets/gsm8k.py
- nemo_rl/data/processors.py
- nemo_rl/distributed/ray_actor_environment_registry.py
- nemo_rl/environments/interfaces.py
- nemo_rl/environments/math_environment.py
- nemo_rl/environments/utils.py
- nemo_rl/experience/rollouts.py
- tests/functional/L1_Functional_Tests_GPU.sh
- tests/functional/gdpo.sh
- tests/unit/algorithms/test_grpo.py
Force-pushed faa7794 to a410935
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Shih-Yang Liu <shihyangl@nvidia.com>
terrykong left a comment:
great work @nbasyl !
some misc:
- could you also add this to the "news" section on the front page readme. feel free to reference your paper :)
- could you add a section about this in grpo.md so that users know how to enable?
- in environments.md could you also mention the new schema: environments can return a scalar or a dict/tensor (depending on that thread started about the positional ordering of the rewards). It would also be good to mention the shape of the rewards coming out of the rollout calculation when using multiple rewards. This will help @bxyu-nvidia when the gym side integrates, since he'll know from reading the docs what the shape of the result should be
(see RL/docs/guides/environments.md, line 5 in 8c70ee6)
- could you add some unit tests for the new environment?
- @yuki-97 will eval be okay given this change? any tests or code @nbasyl should look at?
```diff
   metadata: list[MetadataT]
   next_stop_strings: list[list[str] | None] | list[None]
-  rewards: Tensor
+  rewards: Tensor ## This could be of different shape
```
nit: could this comment be a little more specific about the shapes? something like [B] | [B,num_reward] or however the shape is arranged?
```python
)

# set a reward of 0 for any incorrectly ended sequences
rewards = rewards * batch["is_end"]
```
it looks like there's a subtle behavior change here. we used to mask the rewards in place before:

```python
batch["rewards"] = (
    batch["rewards"] * batch["is_end"]
)
```

could you elaborate on why that's not needed anymore?
good catch on the in-place change!
hmm, but I took a look at the code, and it seems global_post_process_and_metrics isn't actually used anywhere, so I think this change should be fine.
we should clean up or make use of this function in another PR. correct me if I missed anything.
```python
if multi_rewards is not None:
    num_reward_components = multi_rewards.shape[1]
    for i in range(num_reward_components):
        current_batch[f"reward{i + 1}"] = multi_rewards[:, i].clone()
```
in a multi-environment rollout, won't we have issues with this if one environment has [math_format, math_correctness] as rewards but another has [code_correctness, code_format], i.e., different orderings?
should environments then return scalar rewards or a dict of rewards so we can figure this out from the metrics? otherwise it looks like we need to read the code to figure out which reward is which, and multiple rewards of different types may get combined in unexpected ways if they are positional
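One shape the dict-of-rewards alternative could take, merging by name so component ordering no longer matters across environments (a sketch; the names and the zero-fill default for missing components are assumptions):

```python
def merge_named_rewards(batch_rewards):
    """Merge per-sample dicts of named rewards into aligned rows: columns
    line up by reward name, not by the order each environment emitted
    them, with 0.0 filled in for components a sample lacks."""
    names = sorted({name for sample in batch_rewards for name in sample})
    rows = [[sample.get(name, 0.0) for name in names] for sample in batch_rewards]
    return rows, names
```

Samples from a math environment and a code environment then share one column layout, and the returned `names` list makes each column self-describing in the metrics.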
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage

```python
# Add a code snippet demonstrating how to use this
```

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Tests