
feat: Support DAPO dynamic sampling and reward shaping#602

Merged
terrykong merged 116 commits into NVIDIA-NeMo:main from peri044:dapo on Oct 17, 2025

feat: Support DAPO dynamic sampling and reward shaping#602
terrykong merged 116 commits intoNVIDIA-NeMo:mainfrom
peri044:dapo

Conversation

@peri044 (Contributor) commented Jul 3, 2025

What does this PR do?

This PR implements DAPO dynamic sampling and reward shaping as extensions to the current GRPO algorithm. It doesn't include tests and docs yet, but please review whether this is the right direction for implementing DAPO.

Issues

Fixes #425

Usage

Here is an example config for a DAPO experiment:

# DAPO configuration
grpo:
  num_prompts_per_step: 24
  num_generations_per_prompt: 8
  max_rollout_turns: 1 # for multi-turn rollouts. Math Environments just have 1 turn (answering the question)
  max_num_steps: 100
  normalize_rewards: true
  use_leave_one_out_baseline: false
  val_period: 10
  val_at_start: false
  max_val_samples: 480
  val_batch_size: 32
  use_dynamic_sampling: true # when enabled, set num_prompts_per_step much higher than train_global_batch_size so there is sufficient data to sample from.
  max_num_gen_batches: 10

reward_fn:
  enabled: true
  overlong_buffer_length: 1024
  overlong_buffer_penalty: 1.0
  max_response_length: ${policy.max_total_sequence_length}
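For reference, the reward_fn settings above correspond to DAPO's "overlong buffer" shaping: responses that run past max_response_length - overlong_buffer_length receive a linearly growing penalty. A minimal sketch of that calculation (the function name and defaults are illustrative, not the PR's actual API):

```python
def overlong_penalty(response_length: int,
                     max_response_length: int = 4096,
                     overlong_buffer_length: int = 1024,
                     overlong_buffer_penalty: float = 1.0) -> float:
    """Non-positive reward adjustment for overlong responses (hypothetical helper)."""
    # Responses up to this length incur no penalty.
    expected_length = max_response_length - overlong_buffer_length
    exceed = response_length - expected_length
    if exceed <= 0:
        return 0.0
    # Linear ramp across the buffer, clipped at the full penalty.
    return -min(exceed / overlong_buffer_length, 1.0) * overlong_buffer_penalty
```

With these assumed defaults, a 3584-token response is halfway into the 1024-token buffer and is penalized by 0.5, while anything at or past 4096 tokens takes the full penalty of 1.0.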

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features
    • Ready-to-run DAPO config for Qwen2.5-32B with Megatron + vLLM, dynamic sampling, reward scaling/shaping, generation colocated options, sequence packing, dynamic batching, logging backends, and multi-node cluster settings.
    • DAPOMath17K dataset added and exported for easy loading.
    • Optional DAPO-style math answer verifier integrated into verification flow.
  • Bug Fixes
    • Hardened prompt handling to fallback safely when prompts are absent.

@github-actions

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jul 24, 2025
@github-actions

This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this Jul 31, 2025
@terrykong terrykong reopened this Jul 31, 2025
@github-actions github-actions bot removed the Stale label Aug 1, 2025
@ashors1 (Contributor) commented Aug 7, 2025

Thank you for the PR @peri044, and apologies for the delay in reviewing. If available, could you include some convergence plots in the PR description?

@ashors1 (Contributor) commented Aug 7, 2025

Another note -- I think the general approach you've proposed looks good, but we should try to come up with an implementation that is as data efficient as possible (e.g. if only half of a batch is needed to fill up the current buffer, save the other half for the subsequent rollout rather than scrapping it and starting from scratch with the next batch). Let me know if you have any thoughts on this. Thanks again!
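The data-efficiency idea above can be sketched as follows. Under GRPO, a prompt group whose generations all receive the same reward (all-correct or all-wrong) yields zero advantage, so dynamic sampling drops it and keeps generating until the batch is full; any surplus groups are carried over to the next step instead of being scrapped. All names here are illustrative, not the PR's implementation:

```python
def keep_group(rewards):
    """A prompt group is useful only if its rewards vary (non-zero variance)."""
    return len(set(rewards)) > 1

def collect_batch(sample_rounds, batch_size, carry=None):
    """Accumulate useful groups; stash the surplus for the next training step.

    sample_rounds: iterable of lists of (prompt_id, rewards) groups, one list
    per generation round (stands in for the rollout engine).
    """
    batch = list(carry or [])  # start from groups carried over from last step
    rounds_used = 0
    for round_groups in sample_rounds:
        if len(batch) >= batch_size:
            break
        rounds_used += 1
        # Drop zero-variance groups; they carry no advantage signal.
        batch.extend(g for g in round_groups if keep_group(g[1]))
    # Surplus groups are returned for reuse, not discarded.
    return batch[:batch_size], batch[batch_size:], rounds_used

rounds = [
    [("p0", [1, 1, 1, 1]), ("p1", [0, 1, 0, 1])],  # p0 has zero variance
    [("p2", [0, 0, 1, 0]), ("p3", [1, 0, 1, 1])],
]
batch, leftover, used = collect_batch(rounds, batch_size=2)
```

A real implementation would thread `leftover` back in as `carry` on the next step and give up after grpo.max_num_gen_batches generation rounds.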

peri044 added 3 commits August 9, 2025 21:11
chore: rebase
Signed-off-by: Dheeraj Peri <peri.dheeraj@gmail.com>
@peri044 (Contributor, Author) commented Aug 11, 2025

Thank you for the review @ashors1. I've moved the DAPO changes into separate files and updated the logic to address your comments. Please review again. I'm working on the training part and will update this PR with convergence plots once they are available.

@ashors1 (Contributor) commented Aug 11, 2025

Thanks for the quick response @peri044! Is there a reason you decided to move the DAPO implementation to a separate file? It looks like the majority of the code is duplicated from GRPO, so I think it makes sense to keep them together to emphasize their similarities and to improve maintainability. If readability is a concern, maybe we can have a helper function that handles the dynamic sampling logic. What do you think?

@peri044 (Contributor, Author) commented Aug 11, 2025

> Thanks for the quick response @peri044! Is there a reason you decided to move the DAPO implementation to a separate file? It looks like the majority of the code is duplicated from GRPO, so I think it makes sense to keep them together to emphasize their similarities and to improve maintainability. If readability is a concern, maybe we can have a helper function that handles the dynamic sampling logic. What do you think?

Sounds good. I moved it because of the complexity, and to ensure GRPO doesn't get bloated, but I shall merge them back now.

chore: move DAPO impl back to GRPO
Signed-off-by: Dheeraj Peri <peri.dheeraj@gmail.com>
@snowmanwwg snowmanwwg linked an issue Aug 13, 2025 that may be closed by this pull request
@github-actions

ℹ️ File Consistency Check

Check based on commit: f25b48b (PR #602 from dapo)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

Signed-off-by: ashors1 <ashors@nvidia.com>
terrykong
terrykong previously approved these changes Oct 17, 2025
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 17, 2025
@terrykong terrykong merged commit 7bd853a into NVIDIA-NeMo:main Oct 17, 2025
40 of 41 checks passed
terrykong added commits that referenced this pull request Nov 1, 2025
Signed-off-by: Dheeraj Peri <peri.dheeraj@gmail.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Co-authored-by: ashors1 <ashors@nvidia.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
@coderabbitai coderabbitai bot mentioned this pull request Feb 18, 2026
@coderabbitai coderabbitai bot mentioned this pull request Mar 5, 2026

Labels

  • CI:L1 (Run doctests, unit tests, and functional tests)
  • community-request
  • documentation (Improvements or additions to documentation)
  • r0.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • DAPO features
  • DAPO Dynamic sampling
  • Add DAPO reward shaping

8 participants