feat: Support DAPO dynamic sampling and reward shaping#602
terrykong merged 116 commits into NVIDIA-NeMo:main from
Conversation
Signed-off-by: Dheeraj Peri <peri.dheeraj@gmail.com>
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.
Thank you for the PR @peri044, and apologies for the delay in reviewing. If available, could you include some convergence plots in the PR description?
Another note -- I think the general approach you've proposed looks good, but we should try to come up with an implementation that is as data efficient as possible (e.g. if only half of a batch is needed to fill up the current buffer, save the other half for the subsequent rollout rather than scrapping it and starting from scratch with the next batch). Let me know if you have any thoughts on this. Thanks again!
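The carry-over idea described above could be sketched roughly as follows. This is a minimal illustration of the suggestion, not the PR's actual implementation; the class and method names are hypothetical.

```python
class RolloutBuffer:
    """Hypothetical sketch: keep filtered prompt groups across rollouts so a
    partially filled batch is carried over instead of being thrown away."""

    def __init__(self, target_size):
        self.target_size = target_size  # number of groups needed per training step
        self.groups = []                # leftover groups survive here between rollouts

    def add(self, groups):
        """Append the groups that passed dynamic-sampling filtering."""
        self.groups.extend(groups)

    def ready(self):
        """True once enough groups have accumulated for a full batch."""
        return len(self.groups) >= self.target_size

    def pop_batch(self):
        """Return one full batch; any surplus stays for the next rollout."""
        batch = self.groups[: self.target_size]
        self.groups = self.groups[self.target_size :]
        return batch
```

With `target_size=3`, adding two groups leaves the buffer not ready; adding two more makes it ready, and popping a batch leaves one group carried over for the next rollout.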
Signed-off-by: Dheeraj Peri <peri.dheeraj@gmail.com>
Thank you for the review @ashors1. I've moved the DAPO changes into separate files and updated the logic to address your comments. Please review again. I'm working on the training part and shall update this PR with convergence plots once they are available.
Thanks for the quick response @peri044! Is there a reason you decided to move the DAPO implementation to a separate file? It looks like the majority of the code is duplicated from GRPO, so I think it makes sense to keep them together to emphasize their similarities and to improve maintainability. If readability is a concern, maybe we can have a helper function that handles the dynamic sampling logic. What do you think?
Sounds good. I moved it for complexity reasons and to ensure GRPO doesn't get bloated, but I shall merge them back now.
Signed-off-by: Dheeraj Peri <peri.dheeraj@gmail.com>
chore: move DAPO impl back to GRPO
Signed-off-by: Dheeraj Peri <peri.dheeraj@gmail.com>
ℹ️ File Consistency Check (based on commit f25b48b, PR #602 from
✅ DTensor Policy Worker Synchronization Check: both DTensor policy worker files were modified in this PR.
Please ensure that the changes are consistent between both files where applicable. This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Dheeraj Peri <peri.dheeraj@gmail.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Co-authored-by: ashors1 <ashors@nvidia.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
What does this PR do ?
This PR implements DAPO dynamic sampling and reward shaping as extensions to the current GRPO algorithm. It doesn't include tests and docs yet, but please review whether this is the right direction for implementing DAPO.
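For context, the two DAPO extensions could be sketched as below. This is a simplified illustration of the technique from the DAPO paper, not the code in this PR; all names are hypothetical. Dynamic sampling drops prompt groups whose rewards have zero variance (all-correct or all-wrong, hence no advantage signal), and reward shaping applies a soft linear penalty to overlong responses.

```python
def filter_groups(rewards_per_group):
    """Dynamic sampling: keep only prompt groups whose rewards are not all
    identical, since zero-variance groups contribute no gradient signal.
    Returns the indices of the groups to keep."""
    return [
        gid
        for gid, rewards in enumerate(rewards_per_group)
        if len(set(rewards)) > 1
    ]


def overlong_reward_penalty(response_len, max_len, buffer_len):
    """Soft overlong punishment: zero penalty up to (max_len - buffer_len),
    then a linear ramp down to -1 at max_len and beyond."""
    threshold = max_len - buffer_len
    if response_len <= threshold:
        return 0.0
    if response_len >= max_len:
        return -1.0
    return (threshold - response_len) / buffer_len
```

For example, with `max_len=200` and `buffer_len=50`, a 175-token response receives a penalty of -0.5, halfway down the ramp.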
Issues
Fixes #425
Usage
Here is an example config file for a DAPO experiment:
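The example config itself is not reproduced here; the fragment below is a hypothetical sketch of what such a config might contain, assuming DAPO is exposed as options on the GRPO config. All key names are illustrative and may differ from the PR's actual schema.

```yaml
# Hypothetical DAPO-on-GRPO config sketch; key names are assumptions,
# not the actual schema introduced by this PR.
grpo:
  num_prompts_per_step: 32
  num_generations_per_prompt: 16
  use_dynamic_sampling: true          # drop zero-variance reward groups
  max_dynamic_sampling_attempts: 10   # resampling cap before giving up
  reward_shaping:
    overlong_penalty: true
    max_response_length: 4096         # hard response-length limit
    overlong_buffer_length: 512       # linear penalty ramp before the limit
```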
Before your PR is "Ready for review"
Pre checks:
Additional Information