
feat: support GDPO (New)#2069

Open
nbasyl wants to merge 8 commits into main from GDPO

Conversation


@nbasyl nbasyl commented Mar 5, 2026

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added GDPO algorithm support for model training.
    • Integrated GSM8K dataset for mathematics problem training.
    • Enabled multi-reward training with separate reward components.
    • Added math environment supporting multiple distinct reward signals.
  • Tests

    • Added GDPO functional test suite.

@nbasyl nbasyl requested review from a team as code owners March 5, 2026 09:39

copy-pr-bot bot commented Mar 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nbasyl nbasyl requested a review from yuki-97 March 5, 2026 09:40

nbasyl commented Mar 5, 2026

Hi @yuki-97, this is the new PR created based on the branch on the nemo-rl repo


coderabbitai bot commented Mar 5, 2026

📝 Walkthrough

Walkthrough

This PR introduces GDPO (Group Decision Policy Optimization) support to the NeMo RL framework, adding a new multi-component reward advantage estimator, math task dataset (GSM8K) with dedicated verification, multi-reward environment support, and associated training pipeline enhancements to handle per-component reward signals.

Changes

  • GDPO Advantage Estimator (nemo_rl/algorithms/advantage_estimator.py, nemo_rl/algorithms/utils.py): Introduces GDPOAdvantageEstimator for multi-component reward estimation with per-component baselines and normalization. Updates GRPOAdvantageEstimator and ReinforcePlusPlusAdvantageEstimator signatures to accept a repeated_batch parameter. Adds utility function get_gdpo_reward_component_keys for reward component extraction.
  • GRPO Training Integration (nemo_rl/algorithms/grpo.py, examples/run_grpo.py): Adds GDPO configuration support and estimator creation logic. Updates AdvEstimatorConfig to document the "gdpo" option. Implements reward component scaling and baseline alignment for multi-reward scenarios. Adds an early validation guard in the async GRPO path preventing GDPO usage with async training.
  • Multi-Reward Environments (nemo_rl/environments/math_environment.py, nemo_rl/environments/utils.py, nemo_rl/distributed/ray_actor_environment_registry.py): Introduces HFMultiRewardVerifyWorker and MathMultiRewardEnvironment for parallel multi-reward signal computation across correctness, integer format, and response formatting. Registers the math_multi_reward environment and the MathMultiRewardEnvironment actor in the distributed registries.
  • GSM8K Dataset & Processing (nemo_rl/data/datasets/response_datasets/gsm8k.py, nemo_rl/data/datasets/response_datasets/__init__.py, nemo_rl/data/processors.py): Adds a GSM8KDataset class with answer extraction support. Registers the gsm8k dataset in DATASET_REGISTRY. Implements math_gdpo_data_processor for GDPO-style data formatting with loss masking and environment metadata propagation.
  • Rollout Multi-Reward Support (nemo_rl/experience/rollouts.py): Extends single- and multi-turn rollout collection to handle multi-component rewards. Infers the reward component count from environment output and exposes per-component rewards as reward1, reward2, etc. in batch and sample states. Maintains backward compatibility for single-reward scenarios.
  • Configuration & Prompts (examples/configs/gdpo_math_1B.yaml, examples/prompts/gsm8k.txt): New GDPO configuration inheriting from grpo_math_1B.yaml with GDPO-specific settings, dataset configuration for GSM8K, and math_multi_reward environment setup. New prompt file providing a step-by-step reasoning format for math problem solving.
  • Data Structure & Testing (nemo_rl/environments/interfaces.py, tests/unit/algorithms/test_grpo.py, tests/functional/gdpo.sh, tests/functional/L1_Functional_Tests_GPU.sh): Updates EnvironmentReturn documentation for variable reward shape. Updates unit tests for the new compute_advantage signatures. Adds a functional test script for GDPO and integrates it into the CI test suite.
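The estimator described above (per-component baselines and normalization, a sum over components, then global normalization) can be sketched as follows. This is an illustrative reconstruction from the walkthrough text only, not the actual nemo_rl implementation; the function name and the assumption that consecutive rows of `group_size` share a prompt are hypothetical.

```python
import torch

def gdpo_advantages(rewards: torch.Tensor, group_size: int, eps: float = 1e-6) -> torch.Tensor:
    """Sketch of per-component advantage estimation in the GDPO style.

    rewards: [batch, num_components]; each consecutive block of
    `group_size` rows is assumed to share the same prompt.
    """
    batch, num_components = rewards.shape
    groups = rewards.view(batch // group_size, group_size, num_components)
    # Per-component, per-group baseline and spread.
    baseline = groups.mean(dim=1, keepdim=True)
    std = groups.std(dim=1, keepdim=True)
    normalized = (groups - baseline) / (std + eps)
    # Sum the per-component advantages, then normalize globally.
    summed = normalized.sum(dim=-1).view(batch)
    return (summed - summed.mean()) / (summed.std() + eps)
```

With two components and group size 2, a sample that beats its group on every component gets a positive advantage, while a sample that is average on all components lands near zero.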

Sequence Diagram(s)

sequenceDiagram
    participant Rollout as Rollout Collector
    participant Env as MathMultiRewardEnvironment
    participant Worker as HFMultiRewardVerifyWorker
    participant Estimator as GDPOAdvantageEstimator
    participant Batch as Batch

    Rollout->>Env: step(actions, predictions)
    Env->>Env: chunk predictions for parallel processing
    loop For each chunk
        Env->>Worker: verify(predictions)
        Worker->>Worker: compute 3 reward signals<br/>(correctness, format, int)
        Worker-->>Env: rewards[3]
    end
    Env->>Env: stack rewards into tensor
    Env-->>Rollout: (obs, rewards[batch, 3], done, metadata)
    
    Rollout->>Rollout: accumulate per-component<br/>rewards as reward1, reward2, reward3
    Rollout->>Batch: add reward1, reward2, reward3 fields
    
    Batch->>Estimator: compute_advantage(prompt_ids, rewards, repeated_batch, mask)
    Estimator->>Estimator: loop over reward components
    Estimator->>Estimator: compute per-component baselines
    Estimator->>Estimator: normalize per-component rewards
    Estimator->>Estimator: sum components
    Estimator->>Estimator: apply global normalization
    Estimator-->>Batch: advantages (normalized)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

CI:L1, Run CICD

Suggested reviewers

  • terrykong
  • yuki-97
🚥 Pre-merge checks | ✅ 2 passed | ❌ 2 failed

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 70.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Test Results For Major Changes ⚠️ Warning — The PR description is an empty template with no documented test results, convergence validation, performance metrics, or testing information, despite 600+ lines of algorithmic changes. Resolution: update the PR description with test execution results from tests/functional/gdpo.sh, convergence metrics, performance comparisons, and regression test confirmation.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title 'feat: support GDPO (New)' directly and clearly describes the main change: introducing GDPO as a new advantage estimator across configuration, environment, and training infrastructure.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_rl/algorithms/grpo.py (1)

955-976: ⚠️ Potential issue | 🟠 Major

Validate reward-scaling range before computing the linear map.

If source_max <= source_min, _scale divides by zero and can poison rewards with inf/nan.

💡 Proposed fix
         source_min = float(reward_scaling_cfg["source_min"])
         source_max = float(reward_scaling_cfg["source_max"])
         target_min = float(reward_scaling_cfg["target_min"])
         target_max = float(reward_scaling_cfg["target_max"])
+        if source_max <= source_min:
+            raise ValueError(
+                "Invalid reward scaling config: source_max must be greater than source_min"
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/algorithms/grpo.py` around lines 955 - 976, Check that
reward_scaling_cfg values define a valid source range before using them: verify
float(reward_scaling_cfg["source_max"]) >
float(reward_scaling_cfg["source_min"]) and if not, raise a clear exception or
disable scaling; do this check near where source_min/source_max are parsed (the
block that defines source_min, source_max, target_min, target_max) and ensure
_scale will never divide by zero (i.e., skip scaling or clamp to a no-op mapping
when invalid). Include the config keys (reward_scaling_cfg, source_min,
source_max) and the _scale function name in your change so reviewers can find
the validation logic.
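A self-contained sketch of the validated linear map this comment asks for. `make_scale_fn` and the config-dict shape are hypothetical stand-ins for the reward-scaling block in grpo.py; only the guard mirrors the proposed fix.

```python
def make_scale_fn(cfg: dict):
    """Build a linear reward-scaling function with the suggested range validation."""
    source_min = float(cfg["source_min"])
    source_max = float(cfg["source_max"])
    target_min = float(cfg["target_min"])
    target_max = float(cfg["target_max"])
    if source_max <= source_min:
        # Without this guard, the division below produces inf/nan rewards.
        raise ValueError(
            "Invalid reward scaling config: source_max must be greater than source_min"
        )

    def _scale(reward: float) -> float:
        # Map [source_min, source_max] linearly onto [target_min, target_max].
        t = (reward - source_min) / (source_max - source_min)
        return target_min + t * (target_max - target_min)

    return _scale
```

A degenerate config (source_max == source_min) now fails loudly at construction time rather than silently poisoning the reward tensor mid-training.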
🧹 Nitpick comments (1)
nemo_rl/environments/interfaces.py (1)

47-47: Move reward-shape contract into the EnvironmentReturn docstring.

Keeping this as an inline ## comment on a public interface field makes the contract easy to miss; document expected shapes (e.g., [batch] vs [batch, num_rewards]) in the class docstring and keep the annotation clean.

As per coding guidelines, "For interfaces that may be used outside a file, prefer docstrings over comments".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/environments/interfaces.py` at line 47, The inline comment on the
EnvironmentReturn.rewards field should be removed and its contract moved into
the EnvironmentReturn class docstring: update the EnvironmentReturn docstring to
describe the expected shapes for rewards (e.g., [batch] for scalar per-batch,
[batch, num_rewards] for vector rewards), mention optional semantics (None vs
zeros) if relevant, and keep the field annotation as simply "rewards: Tensor";
refer to the EnvironmentReturn class and the rewards attribute when editing so
the contract is discoverable to external users.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/prompts/gsm8k.txt`:
- Around line 10-12: The prompt's output tags (<think> and <answer>) are
incompatible with the math verifiers (nemo_rl/environments/dapo_math_verifier.py
and nemo_rl/environments/math_environment.py) that expect an "Answer:" prefix;
update the example prompt (the <answer> block in examples/prompts/gsm8k.txt) to
emit a final line starting with "Answer:" followed by the integer result (e.g.,
"Answer: 42") so it matches the extractor used by the verifier and preserves
integer-only output.

In `@examples/run_grpo.py`:
- Around line 143-147: The code assumes config["grpo"]["adv_estimator"] always
exists and will KeyError for legacy configs; guard access by checking that
config.get("grpo") and config["grpo"].get("adv_estimator") (or use
"adv_estimator" in config["grpo"]) exist and are dict-like before inspecting
["name"], and only raise NotImplementedError when that name == "gdpo"; update
the conditional around the existing check in examples/run_grpo.py so it first
verifies presence of adv_estimator and its "name" key before comparing to
"gdpo".

In `@nemo_rl/algorithms/advantage_estimator.py`:
- Around line 146-153: The current normalization computes mean/std over all
entries and then expands with mask, letting inactive (masked-out) samples skew
stats; change normalization to compute mean and std only over active entries
indicated by mask: extract masked_adv = advantages[mask.bool()] (or equivalent),
compute mean and std from masked_adv, normalize masked_adv (handling std==0 by
subtracting mean only), then write the normalized values back into the original
advantages tensor at the masked positions while leaving inactive entries
unchanged (or zero), and finally return advantages.expand(mask.shape). Update
the block around the existing advantages/std logic (the variables `advantages`
and `mask`) to perform masked statistics and assignment.

In `@nemo_rl/data/datasets/response_datasets/gsm8k.py`:
- Around line 22-25: The function _extract_hash_answer currently returns None
when "####" is missing which can propagate into assistant content; change it to
always return a str by returning an empty string (or text.strip() if you prefer
preserving input) instead of None, and update the type annotation to just str.
Concretely, modify _extract_hash_answer to check for "####" and return "" (or
text.strip()) when absent, and keep the existing split behavior when present so
callers like any code reading assistant content never receive None.

In `@nemo_rl/environments/math_environment.py`:
- Line 300: Remove the leftover commented-out debug call "_mute_output()" in the
verifier path of the reward computation and any other commented executable lines
around the verifier in MathEnvironment (e.g., in the reward function / verifier
handling); either delete these debug-only comments or replace them with a short
explanatory comment stating why the code is intentionally disabled and when it
should be re-enabled, so intent is clear before merge.
- Around line 301-303: The code currently provides hidden defaults for config
keys (e.g., using kwargs.get("math_verify_impl", "hf_math_verify"), cfg.get(...,
"math"), and self.cfg.get(..., "hf_math_verify")) which can mask missing YAML
settings; update the code to read these required config values directly without
non-None fallbacks (access kwargs["math_verify_impl"], cfg["math"],
self.cfg["hf_math_verify"] or corresponding required keys) and validate presence
explicitly (raise a clear KeyError or ValueError with context if missing) before
using them (e.g., before calling correctness_reward_func with math_verify_impl);
ensure any downstream conditional logic uses the retrieved value rather than an
implicit default.
- Around line 637-657: global_post_process_and_metrics assumes scalar rewards
but batch["rewards"] can be [batch_size, 3]; fix by collapsing to a per-sample
scalar correctness reward before using it: detect if batch["rewards"].dim() > 1
and then select the correctness channel (e.g., rewards = batch["rewards"][:,
CORRECTNESS_CHANNEL] or pick the channel by name if provided), then replace
subsequent uses (the masking for correct_solution_generation_lengths that
indexes generation_lengths, the multiplication batch["rewards"] *
batch["is_end"], and the metrics "accuracy" and pass@samples_per_prompt calls)
to operate on this 1D rewards tensor; ensure correct_solution_generation_lengths
uses (batch["generation_lengths"] - batch["prompt_lengths"])[rewards == 1] and
pass the 1D rewards into calculate_pass_rate_per_prompt and metrics.
- Around line 575-600: The code currently dispatches remote work via
self.workers[i].verify.remote(...) and ray.get(...) before checking
return_extracted_answer and raising NotImplementedError; move the early
validation so that return_extracted_answer is checked and the
NotImplementedError is raised before creating futures or calling ray.get to
avoid wasted cluster work. Locate the block that builds futures and calls
ray.get (uses self.workers[i].verify.remote, futures, and ray.get) and add a
guard that validates return_extracted_answer (and any related flags) first; if
unsupported, raise the NotImplementedError immediately, otherwise proceed to
build futures and call ray.get.
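The reward-collapsing step suggested for global_post_process_and_metrics might look like this. `CORRECTNESS_CHANNEL = 0` is an assumption about channel ordering, and `collapse_rewards` is an illustrative helper, not repo code:

```python
import torch

CORRECTNESS_CHANNEL = 0  # assumed position of the correctness reward

def collapse_rewards(rewards: torch.Tensor) -> torch.Tensor:
    """Reduce [batch, num_rewards] to a per-sample correctness scalar.

    1D (single-reward) tensors pass through unchanged, preserving
    backward compatibility with scalar-reward environments.
    """
    if rewards.dim() > 1:
        return rewards[:, CORRECTNESS_CHANNEL]
    return rewards
```

Downstream metrics (accuracy, pass@k, correct-solution lengths) would then operate on the returned 1D tensor.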

In `@nemo_rl/experience/rollouts.py`:
- Around line 837-840: The async rollouts drop per-component rewards because
run_async_multi_turn_rollout constructs final_batch from a fixed schema and
never stacks the reward1..rewardN keys that rollouts.py adds via
multi_reward_seen and reward_acc_list into final_sample_state. Update
run_async_multi_turn_rollout to detect per-component reward keys (e.g., keys
starting with "reward" or using the multi_reward_seen flag) when assembling
final_batch, and for each reward{n} call the same stacking/concatenation logic
used for other tensor fields (using torch.stack or the existing batching helper)
so reward1..rewardN are included in final_batch with correct dimensions and
dtypes; reference the symbols final_sample_state, reward_acc_list,
multi_reward_seen, and final_batch to locate where to insert this behavior.
- Around line 474-490: The accumulation logic fails when env_output.rewards has
shape [N,1] because number_of_rewards==1 sends it to the else branch and
attempts to add a [N,1] tensor into total_rewards [N]. Update the accumulation
to handle 2D single-component rewards: if number_of_rewards == 1 and
env_output.rewards.ndim >= 2, squeeze the last dimension (e.g., rewards =
env_output.rewards.squeeze(-1)) before adding to total_rewards[active_indices];
otherwise keep existing handling for multi_rewards (number_of_rewards > 1) and
the 1D reward case. Apply this change around the branches that reference
number_of_rewards, multi_rewards, env_output.rewards, total_rewards, and
active_indices.
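The suggested squeeze for 2D single-component rewards, as an isolated sketch. The names mirror the symbols in the comment, but the function itself is hypothetical; the real accumulation also has a separate multi-component branch.

```python
import torch

def accumulate_rewards(
    total_rewards: torch.Tensor,      # 1D running total, shape [batch]
    env_rewards: torch.Tensor,        # [N] or [N, 1] from the environment
    active_indices: torch.Tensor,     # indices of still-active samples
) -> torch.Tensor:
    """Add environment rewards into the 1D total, squeezing a trailing [N, 1] dim."""
    if env_rewards.dim() >= 2 and env_rewards.shape[-1] == 1:
        env_rewards = env_rewards.squeeze(-1)
    total_rewards[active_indices] += env_rewards
    return total_rewards
```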

In `@tests/functional/gdpo.sh`:
- Line 37: The script is using unquoted $@ which can split arguments with
spaces; update the invocation that currently passes $@ to instead pass "$@" so
all forwarded CLI args are preserved as distinct arguments (replace the unquoted
$@ occurrence in the gdpo.sh invocation with the quoted form "$@").
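The quoting issue can be demonstrated in isolation; `count_args` and the forwarding functions below are illustrative only, not part of gdpo.sh.

```shell
#!/usr/bin/env bash
# Unquoted $@ re-splits forwarded arguments on whitespace; "$@" preserves them.
count_args() { echo "$#"; }

forward_unquoted() { count_args $@; }    # loses argument boundaries
forward_quoted()   { count_args "$@"; }  # forwards each argument intact
```

Calling `forward_unquoted "a b" c` reports 3 arguments, while `forward_quoted "a b" c` correctly reports 2.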

---

Outside diff comments:
In `@nemo_rl/algorithms/grpo.py`:
- Around line 955-976: Check that reward_scaling_cfg values define a valid
source range before using them: verify float(reward_scaling_cfg["source_max"]) >
float(reward_scaling_cfg["source_min"]) and if not, raise a clear exception or
disable scaling; do this check near where source_min/source_max are parsed (the
block that defines source_min, source_max, target_min, target_max) and ensure
_scale will never divide by zero (i.e., skip scaling or clamp to a no-op mapping
when invalid). Include the config keys (reward_scaling_cfg, source_min,
source_max) and the _scale function name in your change so reviewers can find
the validation logic.

---

Nitpick comments:
In `@nemo_rl/environments/interfaces.py`:
- Line 47: The inline comment on the EnvironmentReturn.rewards field should be
removed and its contract moved into the EnvironmentReturn class docstring:
update the EnvironmentReturn docstring to describe the expected shapes for
rewards (e.g., [batch] for scalar per-batch, [batch, num_rewards] for vector
rewards), mention optional semantics (None vs zeros) if relevant, and keep the
field annotation as simply "rewards: Tensor"; refer to the EnvironmentReturn
class and the rewards attribute when editing so the contract is discoverable to
external users.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c2ad0f49-dfb3-488a-92a2-175c53a3280c

📥 Commits

Reviewing files that changed from the base of the PR and between c4f8e1c and 81efbfa.

📒 Files selected for processing (17)
  • examples/configs/gdpo_math_1B.yaml
  • examples/prompts/gsm8k.txt
  • examples/run_grpo.py
  • nemo_rl/algorithms/advantage_estimator.py
  • nemo_rl/algorithms/grpo.py
  • nemo_rl/algorithms/utils.py
  • nemo_rl/data/datasets/response_datasets/__init__.py
  • nemo_rl/data/datasets/response_datasets/gsm8k.py
  • nemo_rl/data/processors.py
  • nemo_rl/distributed/ray_actor_environment_registry.py
  • nemo_rl/environments/interfaces.py
  • nemo_rl/environments/math_environment.py
  • nemo_rl/environments/utils.py
  • nemo_rl/experience/rollouts.py
  • tests/functional/L1_Functional_Tests_GPU.sh
  • tests/functional/gdpo.sh
  • tests/unit/algorithms/test_grpo.py

yuki-97 and others added 2 commits March 5, 2026 21:26
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Shih-Yang Liu <shihyangl@nvidia.com>
@yuki-97 yuki-97 added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label Mar 6, 2026
yuki-97 added 2 commits March 6, 2026 00:21
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 left a comment
thanks for the effort @nbasyl ! left some small comments.

and could you help to copy the curve you pasted in the previous PR to this PR's description? so that people can take it as a reference.

Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 requested a review from a team as a code owner March 6, 2026 17:14
@yuki-97 yuki-97 requested a review from terrykong March 6, 2026 17:15
@yuki-97 yuki-97 mentioned this pull request Mar 6, 2026
@terrykong terrykong left a comment

great work @nbasyl !

some misc:

  • could you also add this to the "news" section on the front page readme. feel free to reference your paper :)
  • could you add a section about this in grpo.md so that users know how to enable?
  • in environments.md
    ## Math Environment
    could you also mention the new schema: environments can return a scalar or a dict/tensor of rewards (depending on that thread started about the positional ordering of the rewards). Would also be good to mention the shape of the rewards coming out of the rollout calculation if using multiple rewards. This will help @bxyu-nvidia when the gym side integrates since he'll know from reading the docs what the shape of the result should be
  • could you add some unit tests for the new environment?
  • @yuki-97 will eval be okay given this change? any tests or code @nbasyl should look at?

metadata: list[MetadataT]
next_stop_strings: list[list[str] | None] | list[None]
rewards: Tensor
rewards: Tensor ## This could be of different shape

nit: could this comment be a little more specific about the shapes? something like [B] | [B,num_reward] or however the shape is arranged?

)

# set a reward of 0 for any incorrectly ended sequences
rewards = rewards * batch["is_end"]

it looks like there's a subtle behavior change here. we used to mask the rewards in place before:

batch["rewards"] = (
            batch["rewards"] * batch["is_end"]

could you elaborate on why that's not needed anymore?


good catch for the in place change!

hmm, but I took a look at the code, seems global_post_process_and_metrics isn't actually used anywhere, so I think this change should be fine.
we should cleanup or make use of this function in another PR. correct me if I missed anything.

if multi_rewards is not None:
num_reward_components = multi_rewards.shape[1]
for i in range(num_reward_components):
current_batch[f"reward{i + 1}"] = multi_rewards[:, i].clone()

in a multi-environment rollout, won't we have issues with this if one environment returns [math_format, math_correctness] as rewards while another returns [code_correctness, code_format], i.e., different orderings?

should environments then return scalar rewards or a dict of rewards so we can figure this out from the metrics? otherwise it looks like we need to read code to figure out which reward is which, and multiple rewards of different types may get combined in unexpected ways if they are positional

Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 added CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) and removed CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) labels Mar 9, 2026

Labels

CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)
