
feat: support GDPO (New)#2069

Open
nbasyl wants to merge 8 commits into main from GDPO

Conversation


@nbasyl nbasyl commented Mar 5, 2026

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added GDPO algorithm support for model training.
    • Integrated GSM8K dataset for mathematics problem training.
    • Enabled multi-reward training with separate reward components.
    • Added math environment supporting multiple distinct reward signals.
  • Tests

    • Added GDPO functional test suite.

@nbasyl nbasyl requested review from a team as code owners March 5, 2026 09:39

copy-pr-bot bot commented Mar 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nbasyl nbasyl requested a review from yuki-97 March 5, 2026 09:40

nbasyl commented Mar 5, 2026

Hi @yuki-97, this is the new PR created based on the branch on the nemo-rl repo


coderabbitai bot commented Mar 5, 2026

📝 Walkthrough

Walkthrough

This PR introduces GDPO (Group Decision Policy Optimization) support to the NeMo RL framework, adding a new multi-component reward advantage estimator, math task dataset (GSM8K) with dedicated verification, multi-reward environment support, and associated training pipeline enhancements to handle per-component reward signals.

Changes

  • GDPO Advantage Estimator (nemo_rl/algorithms/advantage_estimator.py, nemo_rl/algorithms/utils.py): Introduces GDPOAdvantageEstimator for multi-component reward estimation with per-component baselines and normalization. Updates GRPOAdvantageEstimator and ReinforcePlusPlusAdvantageEstimator signatures to accept a repeated_batch parameter. Adds utility function get_gdpo_reward_component_keys for reward component extraction.
  • GRPO Training Integration (nemo_rl/algorithms/grpo.py, examples/run_grpo.py): Adds GDPO configuration support and estimator creation logic. Updates AdvEstimatorConfig to document the "gdpo" option. Implements reward component scaling and baseline alignment for multi-reward scenarios. Adds an early validation guard in the async GRPO path preventing GDPO usage with async training.
  • Multi-Reward Environments (nemo_rl/environments/math_environment.py, nemo_rl/environments/utils.py, nemo_rl/distributed/ray_actor_environment_registry.py): Introduces HFMultiRewardVerifyWorker and MathMultiRewardEnvironment for parallel multi-reward signal computation across correctness, integer format, and response formatting. Registers the math_multi_reward environment and the MathMultiRewardEnvironment actor in the distributed registries.
  • GSM8K Dataset & Processing (nemo_rl/data/datasets/response_datasets/gsm8k.py, nemo_rl/data/datasets/response_datasets/__init__.py, nemo_rl/data/processors.py): Adds a GSM8KDataset class with answer extraction support. Registers the gsm8k dataset in DATASET_REGISTRY. Implements math_gdpo_data_processor for GDPO-style data formatting with loss masking and environment metadata propagation.
  • Rollout Multi-Reward Support (nemo_rl/experience/rollouts.py): Extends single- and multi-turn rollout collection to handle multi-component rewards. Infers the reward component count from environment output and exposes per-component rewards as reward1, reward2, etc. in batch and sample states. Maintains backward compatibility for single-reward scenarios.
  • Configuration & Prompts (examples/configs/gdpo_math_1B.yaml, examples/prompts/gsm8k.txt): New GDPO configuration inheriting from grpo_math_1B.yaml with GDPO-specific settings, dataset configuration for GSM8K, and math_multi_reward environment setup. New prompt file providing a step-by-step reasoning format for math problem solving.
  • Data Structure & Testing (nemo_rl/environments/interfaces.py, tests/unit/algorithms/test_grpo.py, tests/functional/gdpo.sh, tests/functional/L1_Functional_Tests_GPU.sh): Updates EnvironmentReturn documentation for variable reward shape. Updates unit tests for the new compute_advantage signatures. Adds a functional test script for GDPO and integrates it into the CI test suite.
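The estimator described above (per-component baselines and normalization, a sum over components, then global normalization) can be sketched as follows. This is an illustrative reconstruction from the walkthrough text only, not the actual nemo_rl implementation; the function name and the assumption that consecutive rows of `group_size` share a prompt are hypothetical.

```python
import torch

def gdpo_advantages(rewards: torch.Tensor, group_size: int, eps: float = 1e-6) -> torch.Tensor:
    """Sketch of per-component advantage estimation in the GDPO style.

    rewards: [batch, num_components]; each consecutive block of
    `group_size` rows is assumed to share the same prompt.
    """
    batch, num_components = rewards.shape
    groups = rewards.view(batch // group_size, group_size, num_components)
    # Per-component, per-group baseline and spread.
    baseline = groups.mean(dim=1, keepdim=True)
    std = groups.std(dim=1, keepdim=True)
    normalized = (groups - baseline) / (std + eps)
    # Sum the per-component advantages, then normalize globally.
    summed = normalized.sum(dim=-1).view(batch)
    return (summed - summed.mean()) / (summed.std() + eps)
```

With two components and group size 2, a sample that beats its group on every component gets a positive advantage, while a sample that is average on all components lands near zero.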

Sequence Diagram(s)

sequenceDiagram
    participant Rollout as Rollout Collector
    participant Env as MathMultiRewardEnvironment
    participant Worker as HFMultiRewardVerifyWorker
    participant Estimator as GDPOAdvantageEstimator
    participant Batch as Batch

    Rollout->>Env: step(actions, predictions)
    Env->>Env: chunk predictions for parallel processing
    loop For each chunk
        Env->>Worker: verify(predictions)
        Worker->>Worker: compute 3 reward signals<br/>(correctness, format, int)
        Worker-->>Env: rewards[3]
    end
    Env->>Env: stack rewards into tensor
    Env-->>Rollout: (obs, rewards[batch, 3], done, metadata)
    
    Rollout->>Rollout: accumulate per-component<br/>rewards as reward1, reward2, reward3
    Rollout->>Batch: add reward1, reward2, reward3 fields
    
    Batch->>Estimator: compute_advantage(prompt_ids, rewards, repeated_batch, mask)
    Estimator->>Estimator: loop over reward components
    Estimator->>Estimator: compute per-component baselines
    Estimator->>Estimator: normalize per-component rewards
    Estimator->>Estimator: sum components
    Estimator->>Estimator: apply global normalization
    Estimator-->>Batch: advantages (normalized)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

CI:L1, Run CICD

Suggested reviewers

  • terrykong
  • yuki-97
🚥 Pre-merge checks | ✅ 2 passed | ❌ 2 failed

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 70.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Test Results For Major Changes ⚠️ Warning — The PR description is an empty template with no documented test results, convergence validation, performance metrics, or testing information, despite 600+ lines of algorithmic changes. Resolution: update the PR description with test execution results from tests/functional/gdpo.sh, convergence metrics, performance comparisons, and regression test confirmation.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title 'feat: support GDPO (New)' directly and clearly describes the main change: introducing GDPO as a new advantage estimator across configuration, environment, and training infrastructure.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_rl/algorithms/grpo.py (1)

955-976: ⚠️ Potential issue | 🟠 Major

Validate reward-scaling range before computing the linear map.

If source_max <= source_min, _scale divides by zero and can poison rewards with inf/nan.

💡 Proposed fix
         source_min = float(reward_scaling_cfg["source_min"])
         source_max = float(reward_scaling_cfg["source_max"])
         target_min = float(reward_scaling_cfg["target_min"])
         target_max = float(reward_scaling_cfg["target_max"])
+        if source_max <= source_min:
+            raise ValueError(
+                "Invalid reward scaling config: source_max must be greater than source_min"
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/algorithms/grpo.py` around lines 955 - 976, Check that
reward_scaling_cfg values define a valid source range before using them: verify
float(reward_scaling_cfg["source_max"]) >
float(reward_scaling_cfg["source_min"]) and if not, raise a clear exception or
disable scaling; do this check near where source_min/source_max are parsed (the
block that defines source_min, source_max, target_min, target_max) and ensure
_scale will never divide by zero (i.e., skip scaling or clamp to a no-op mapping
when invalid). Include the config keys (reward_scaling_cfg, source_min,
source_max) and the _scale function name in your change so reviewers can find
the validation logic.
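A self-contained sketch of the validated linear map this comment asks for. `make_scale_fn` and the config-dict shape are hypothetical stand-ins for the reward-scaling block in grpo.py; only the guard mirrors the proposed fix.

```python
def make_scale_fn(cfg: dict):
    """Build a linear reward-scaling function with the suggested range validation."""
    source_min = float(cfg["source_min"])
    source_max = float(cfg["source_max"])
    target_min = float(cfg["target_min"])
    target_max = float(cfg["target_max"])
    if source_max <= source_min:
        # Without this guard, the division below produces inf/nan rewards.
        raise ValueError(
            "Invalid reward scaling config: source_max must be greater than source_min"
        )

    def _scale(reward: float) -> float:
        # Map [source_min, source_max] linearly onto [target_min, target_max].
        t = (reward - source_min) / (source_max - source_min)
        return target_min + t * (target_max - target_min)

    return _scale
```

A degenerate config (source_max == source_min) now fails loudly at construction time rather than silently poisoning the reward tensor mid-training.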
🧹 Nitpick comments (1)
nemo_rl/environments/interfaces.py (1)

47-47: Move reward-shape contract into the EnvironmentReturn docstring.

Keeping this as an inline ## comment on a public interface field makes the contract easy to miss; document expected shapes (e.g., [batch] vs [batch, num_rewards]) in the class docstring and keep the annotation clean.

As per coding guidelines, "For interfaces that may be used outside a file, prefer docstrings over comments".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/environments/interfaces.py` at line 47, The inline comment on the
EnvironmentReturn.rewards field should be removed and its contract moved into
the EnvironmentReturn class docstring: update the EnvironmentReturn docstring to
describe the expected shapes for rewards (e.g., [batch] for scalar per-batch,
[batch, num_rewards] for vector rewards), mention optional semantics (None vs
zeros) if relevant, and keep the field annotation as simply "rewards: Tensor";
refer to the EnvironmentReturn class and the rewards attribute when editing so
the contract is discoverable to external users.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/prompts/gsm8k.txt`:
- Around line 10-12: The prompt's output tags (<think> and <answer>) are
incompatible with the math verifiers (nemo_rl/environments/dapo_math_verifier.py
and nemo_rl/environments/math_environment.py) that expect an "Answer:" prefix;
update the example prompt (the <answer> block in examples/prompts/gsm8k.txt) to
emit a final line starting with "Answer:" followed by the integer result (e.g.,
"Answer: 42") so it matches the extractor used by the verifier and preserves
integer-only output.

In `@examples/run_grpo.py`:
- Around line 143-147: The code assumes config["grpo"]["adv_estimator"] always
exists and will KeyError for legacy configs; guard access by checking that
config.get("grpo") and config["grpo"].get("adv_estimator") (or use
"adv_estimator" in config["grpo"]) exist and are dict-like before inspecting
["name"], and only raise NotImplementedError when that name == "gdpo"; update
the conditional around the existing check in examples/run_grpo.py so it first
verifies presence of adv_estimator and its "name" key before comparing to
"gdpo".

In `@nemo_rl/algorithms/advantage_estimator.py`:
- Around line 146-153: The current normalization computes mean/std over all
entries and then expands with mask, letting inactive (masked-out) samples skew
stats; change normalization to compute mean and std only over active entries
indicated by mask: extract masked_adv = advantages[mask.bool()] (or equivalent),
compute mean and std from masked_adv, normalize masked_adv (handling std==0 by
subtracting mean only), then write the normalized values back into the original
advantages tensor at the masked positions while leaving inactive entries
unchanged (or zero), and finally return advantages.expand(mask.shape). Update
the block around the existing advantages/std logic (the variables `advantages`
and `mask`) to perform masked statistics and assignment.

In `@nemo_rl/data/datasets/response_datasets/gsm8k.py`:
- Around line 22-25: The function _extract_hash_answer currently returns None
when "####" is missing which can propagate into assistant content; change it to
always return a str by returning an empty string (or text.strip() if you prefer
preserving input) instead of None, and update the type annotation to just str.
Concretely, modify _extract_hash_answer to check for "####" and return "" (or
text.strip()) when absent, and keep the existing split behavior when present so
callers like any code reading assistant content never receive None.

In `@nemo_rl/environments/math_environment.py`:
- Line 300: Remove the leftover commented-out debug call "_mute_output()" in the
verifier path of the reward computation and any other commented executable lines
around the verifier in MathEnvironment (e.g., in the reward function / verifier
handling); either delete these debug-only comments or replace them with a short
explanatory comment stating why the code is intentionally disabled and when it
should be re-enabled, so intent is clear before merge.
- Around line 301-303: The code currently provides hidden defaults for config
keys (e.g., using kwargs.get("math_verify_impl", "hf_math_verify"), cfg.get(...,
"math"), and self.cfg.get(..., "hf_math_verify")) which can mask missing YAML
settings; update the code to read these required config values directly without
non-None fallbacks (access kwargs["math_verify_impl"], cfg["math"],
self.cfg["hf_math_verify"] or corresponding required keys) and validate presence
explicitly (raise a clear KeyError or ValueError with context if missing) before
using them (e.g., before calling correctness_reward_func with math_verify_impl);
ensure any downstream conditional logic uses the retrieved value rather than an
implicit default.
- Around line 637-657: global_post_process_and_metrics assumes scalar rewards
but batch["rewards"] can be [batch_size, 3]; fix by collapsing to a per-sample
scalar correctness reward before using it: detect if batch["rewards"].dim() > 1
and then select the correctness channel (e.g., rewards = batch["rewards"][:,
CORRECTNESS_CHANNEL] or pick the channel by name if provided), then replace
subsequent uses (the masking for correct_solution_generation_lengths that
indexes generation_lengths, the multiplication batch["rewards"] *
batch["is_end"], and the metrics "accuracy" and pass@samples_per_prompt calls)
to operate on this 1D rewards tensor; ensure correct_solution_generation_lengths
uses (batch["generation_lengths"] - batch["prompt_lengths"])[rewards == 1] and
pass the 1D rewards into calculate_pass_rate_per_prompt and metrics.
- Around line 575-600: The code currently dispatches remote work via
self.workers[i].verify.remote(...) and ray.get(...) before checking
return_extracted_answer and raising NotImplementedError; move the early
validation so that return_extracted_answer is checked and the
NotImplementedError is raised before creating futures or calling ray.get to
avoid wasted cluster work. Locate the block that builds futures and calls
ray.get (uses self.workers[i].verify.remote, futures, and ray.get) and add a
guard that validates return_extracted_answer (and any related flags) first; if
unsupported, raise the NotImplementedError immediately, otherwise proceed to
build futures and call ray.get.
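The reward-collapsing step suggested for global_post_process_and_metrics might look like this. `CORRECTNESS_CHANNEL = 0` is an assumption about channel ordering, and `collapse_rewards` is an illustrative helper, not repo code:

```python
import torch

CORRECTNESS_CHANNEL = 0  # assumed position of the correctness reward

def collapse_rewards(rewards: torch.Tensor) -> torch.Tensor:
    """Reduce [batch, num_rewards] to a per-sample correctness scalar.

    1D (single-reward) tensors pass through unchanged, preserving
    backward compatibility with scalar-reward environments.
    """
    if rewards.dim() > 1:
        return rewards[:, CORRECTNESS_CHANNEL]
    return rewards
```

Downstream metrics (accuracy, pass@k, correct-solution lengths) would then operate on the returned 1D tensor.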

In `@nemo_rl/experience/rollouts.py`:
- Around line 837-840: The async rollouts drop per-component rewards because
run_async_multi_turn_rollout constructs final_batch from a fixed schema and
never stacks the reward1..rewardN keys that rollouts.py adds via
multi_reward_seen and reward_acc_list into final_sample_state. Update
run_async_multi_turn_rollout to detect per-component reward keys (e.g., keys
starting with "reward" or using the multi_reward_seen flag) when assembling
final_batch, and for each reward{n} call the same stacking/concatenation logic
used for other tensor fields (using torch.stack or the existing batching helper)
so reward1..rewardN are included in final_batch with correct dimensions and
dtypes; reference the symbols final_sample_state, reward_acc_list,
multi_reward_seen, and final_batch to locate where to insert this behavior.
- Around line 474-490: The accumulation logic fails when env_output.rewards has
shape [N,1] because number_of_rewards==1 sends it to the else branch and
attempts to add a [N,1] tensor into total_rewards [N]. Update the accumulation
to handle 2D single-component rewards: if number_of_rewards == 1 and
env_output.rewards.ndim >= 2, squeeze the last dimension (e.g., rewards =
env_output.rewards.squeeze(-1)) before adding to total_rewards[active_indices];
otherwise keep existing handling for multi_rewards (number_of_rewards > 1) and
the 1D reward case. Apply this change around the branches that reference
number_of_rewards, multi_rewards, env_output.rewards, total_rewards, and
active_indices.
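The suggested squeeze for 2D single-component rewards, as an isolated sketch. The names mirror the symbols in the comment, but the function itself is hypothetical; the real accumulation also has a separate multi-component branch.

```python
import torch

def accumulate_rewards(
    total_rewards: torch.Tensor,      # 1D running total, shape [batch]
    env_rewards: torch.Tensor,        # [N] or [N, 1] from the environment
    active_indices: torch.Tensor,     # indices of still-active samples
) -> torch.Tensor:
    """Add environment rewards into the 1D total, squeezing a trailing [N, 1] dim."""
    if env_rewards.dim() >= 2 and env_rewards.shape[-1] == 1:
        env_rewards = env_rewards.squeeze(-1)
    total_rewards[active_indices] += env_rewards
    return total_rewards
```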

In `@tests/functional/gdpo.sh`:
- Line 37: The script is using unquoted $@ which can split arguments with
spaces; update the invocation that currently passes $@ to instead pass "$@" so
all forwarded CLI args are preserved as distinct arguments (replace the unquoted
$@ occurrence in the gdpo.sh invocation with the quoted form "$@").
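The quoting issue can be demonstrated in isolation; `count_args` and the forwarding functions below are illustrative only, not part of gdpo.sh.

```shell
#!/usr/bin/env bash
# Unquoted $@ re-splits forwarded arguments on whitespace; "$@" preserves them.
count_args() { echo "$#"; }

forward_unquoted() { count_args $@; }    # loses argument boundaries
forward_quoted()   { count_args "$@"; }  # forwards each argument intact
```

Calling `forward_unquoted "a b" c` reports 3 arguments, while `forward_quoted "a b" c` correctly reports 2.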

---

Outside diff comments:
In `@nemo_rl/algorithms/grpo.py`:
- Around line 955-976: Check that reward_scaling_cfg values define a valid
source range before using them: verify float(reward_scaling_cfg["source_max"]) >
float(reward_scaling_cfg["source_min"]) and if not, raise a clear exception or
disable scaling; do this check near where source_min/source_max are parsed (the
block that defines source_min, source_max, target_min, target_max) and ensure
_scale will never divide by zero (i.e., skip scaling or clamp to a no-op mapping
when invalid). Include the config keys (reward_scaling_cfg, source_min,
source_max) and the _scale function name in your change so reviewers can find
the validation logic.

---

Nitpick comments:
In `@nemo_rl/environments/interfaces.py`:
- Line 47: The inline comment on the EnvironmentReturn.rewards field should be
removed and its contract moved into the EnvironmentReturn class docstring:
update the EnvironmentReturn docstring to describe the expected shapes for
rewards (e.g., [batch] for scalar per-batch, [batch, num_rewards] for vector
rewards), mention optional semantics (None vs zeros) if relevant, and keep the
field annotation as simply "rewards: Tensor"; refer to the EnvironmentReturn
class and the rewards attribute when editing so the contract is discoverable to
external users.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c2ad0f49-dfb3-488a-92a2-175c53a3280c

📥 Commits

Reviewing files that changed from the base of the PR and between c4f8e1c and 81efbfa.

📒 Files selected for processing (17)
  • examples/configs/gdpo_math_1B.yaml
  • examples/prompts/gsm8k.txt
  • examples/run_grpo.py
  • nemo_rl/algorithms/advantage_estimator.py
  • nemo_rl/algorithms/grpo.py
  • nemo_rl/algorithms/utils.py
  • nemo_rl/data/datasets/response_datasets/__init__.py
  • nemo_rl/data/datasets/response_datasets/gsm8k.py
  • nemo_rl/data/processors.py
  • nemo_rl/distributed/ray_actor_environment_registry.py
  • nemo_rl/environments/interfaces.py
  • nemo_rl/environments/math_environment.py
  • nemo_rl/environments/utils.py
  • nemo_rl/experience/rollouts.py
  • tests/functional/L1_Functional_Tests_GPU.sh
  • tests/functional/gdpo.sh
  • tests/unit/algorithms/test_grpo.py

yuki-97 and others added 2 commits March 5, 2026 21:26
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Shih-Yang Liu <shihyangl@nvidia.com>
@yuki-97 yuki-97 added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label Mar 6, 2026
yuki-97 added 2 commits March 6, 2026 00:21
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 left a comment
thanks for the effort @nbasyl ! left some small comments.

and could you help to copy the curve you pasted in the previous PR to this PR's description? so that people can take it as a reference.

Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 requested a review from a team as a code owner March 6, 2026 17:14
@yuki-97 yuki-97 requested a review from terrykong March 6, 2026 17:15
@yuki-97 yuki-97 mentioned this pull request Mar 6, 2026
@terrykong terrykong left a comment

great work @nbasyl !

some misc:

  • could you also add this to the "news" section on the front page readme. feel free to reference your paper :)
  • could you add a section about this in grpo.md so that users know how to enable?
  • in environments.md
    ## Math Environment
    could you also mention the new schema: environments can return a scalar or a dict/tensor of rewards (depending on that thread started about the positional ordering of the rewards). Would also be good to mention the shape of the rewards coming out of the rollout calculation if using multiple rewards. This will help @bxyu-nvidia when the gym side integrates since he'll know from reading the docs what the shape of the result should be
  • could you add some unit tests for the new environment?
  • @yuki-97 will eval be okay given this change? any tests or code @nbasyl should look at?

metadata: list[MetadataT]
next_stop_strings: list[list[str] | None] | list[None]
rewards: Tensor
rewards: Tensor ## This could be of different shape

nit: could this comment be a little more specific about the shapes? something like [B] | [B,num_reward] or however the shape is arranged?

)

# set a reward of 0 for any incorrectly ended sequences
rewards = rewards * batch["is_end"]

it looks like there's a subtle behavior change here. we used to mask the rewards in place before:

batch["rewards"] = (
            batch["rewards"] * batch["is_end"]

could you elaborate on why that's not needed anymore?


good catch for the in place change!

hmm, but I took a look at the code, seems global_post_process_and_metrics isn't actually used anywhere, so I think this change should be fine.
we should cleanup or make use of this function in another PR. correct me if I missed anything.

if multi_rewards is not None:
num_reward_components = multi_rewards.shape[1]
for i in range(num_reward_components):
current_batch[f"reward{i + 1}"] = multi_rewards[:, i].clone()

in a multi-environment rollout, won't we have issues with this if one environment returns [math_format, math_correctness] as rewards while another returns [code_correctness, code_format], i.e., different orderings?

should environments then return scalar rewards or a dict of rewards so we can figure this out from the metrics? otherwise it looks like we need to read code to figure out which reward is which, and multiple rewards of different types may get combined in unexpected ways if they are positional

Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 added CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) and removed CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) labels Mar 9, 2026

Labels

CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)
