feat: Support DAPO dynamic sampling and reward shaping#602
terrykong merged 116 commits into NVIDIA-NeMo:main from
Conversation
Signed-off-by: Dheeraj Peri <peri.dheeraj@gmail.com>
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.
Thank you for the PR @peri044, and apologies for the delay in reviewing. If available, could you include some convergence plots in the PR description?
Another note -- I think the general approach you've proposed looks good, but we should try to come up with an implementation that is as data efficient as possible (e.g. if only half of a batch is needed to fill up the current buffer, save the other half for the subsequent rollout rather than scrapping it and starting from scratch with the next batch). Let me know if you have any thoughts on this. Thanks again!
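The carry-over idea described above could be sketched roughly as follows. This is a minimal illustration of the suggestion, not the PR's actual implementation; the class and method names are hypothetical.

```python
class RolloutBuffer:
    """Hypothetical sketch: keep filtered prompt groups across rollouts so a
    partially filled batch is carried over instead of being thrown away."""

    def __init__(self, target_size):
        self.target_size = target_size  # number of groups needed per training step
        self.groups = []                # leftover groups survive here between rollouts

    def add(self, groups):
        """Append the groups that passed dynamic-sampling filtering."""
        self.groups.extend(groups)

    def ready(self):
        """True once enough groups have accumulated for a full batch."""
        return len(self.groups) >= self.target_size

    def pop_batch(self):
        """Return one full batch; any surplus stays for the next rollout."""
        batch = self.groups[: self.target_size]
        self.groups = self.groups[self.target_size :]
        return batch
```

With `target_size=3`, adding two groups leaves the buffer not ready; adding two more makes it ready, and popping a batch leaves one group carried over for the next rollout.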
Signed-off-by: Dheeraj Peri <peri.dheeraj@gmail.com>
Thank you for the review @ashors1. I've moved the DAPO changes into separate files and updated the logic to address your comments. Please review again. I'm working on the training part and shall update this PR with convergence plots once they are available.
Thanks for the quick response @peri044! Is there a reason you decided to move the DAPO implementation to a separate file? It looks like the majority of the code is duplicated from GRPO, so I think it makes sense to keep them together to emphasize their similarities and to improve maintainability. If readability is a concern, maybe we can have a helper function that handles the dynamic sampling logic. What do you think?
Sounds good. I moved it for complexity reasons and to ensure GRPO doesn't get bloated, but I shall merge them back now.
Signed-off-by: Dheeraj Peri <peri.dheeraj@gmail.com>
chore: move DAPO impl back to GRPO
Signed-off-by: Dheeraj Peri <peri.dheeraj@gmail.com>
ℹ️ File Consistency Check (based on commit f25b48b, PR #602 from
✅ DTensor Policy Worker Synchronization Check: both DTensor policy worker files were modified in this PR.
Please ensure that the changes are consistent between both files where applicable. This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Dheeraj Peri <peri.dheeraj@gmail.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Co-authored-by: ashors1 <ashors@nvidia.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
What does this PR do ?
This PR implements DAPO dynamic sampling and reward shaping as extensions to the current GRPO algorithm. It doesn't include tests and docs yet, but please review whether this is the right direction for implementing DAPO.
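For context, the two DAPO extensions could be sketched as below. This is a simplified illustration of the technique from the DAPO paper, not the code in this PR; all names are hypothetical. Dynamic sampling drops prompt groups whose rewards have zero variance (all-correct or all-wrong, hence no advantage signal), and reward shaping applies a soft linear penalty to overlong responses.

```python
def filter_groups(rewards_per_group):
    """Dynamic sampling: keep only prompt groups whose rewards are not all
    identical, since zero-variance groups contribute no gradient signal.
    Returns the indices of the groups to keep."""
    return [
        gid
        for gid, rewards in enumerate(rewards_per_group)
        if len(set(rewards)) > 1
    ]


def overlong_reward_penalty(response_len, max_len, buffer_len):
    """Soft overlong punishment: zero penalty up to (max_len - buffer_len),
    then a linear ramp down to -1 at max_len and beyond."""
    threshold = max_len - buffer_len
    if response_len <= threshold:
        return 0.0
    if response_len >= max_len:
        return -1.0
    return (threshold - response_len) / buffer_len
```

For example, with `max_len=200` and `buffer_len=50`, a 175-token response receives a penalty of -0.5, halfway down the ramp.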
Issues
Fixes #425
Usage
Here is an example config file for a DAPO experiment:
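The example config itself is not reproduced here; the fragment below is a hypothetical sketch of what such a config might contain, assuming DAPO is exposed as options on the GRPO config. All key names are illustrative and may differ from the PR's actual schema.

```yaml
# Hypothetical DAPO-on-GRPO config sketch; key names are assumptions,
# not the actual schema introduced by this PR.
grpo:
  num_prompts_per_step: 32
  num_generations_per_prompt: 16
  use_dynamic_sampling: true          # drop zero-variance reward groups
  max_dynamic_sampling_attempts: 10   # resampling cap before giving up
  reward_shaping:
    overlong_penalty: true
    max_response_length: 4096         # hard response-length limit
    overlong_buffer_length: 512       # linear penalty ramp before the limit
```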
Before your PR is "Ready for review"
Pre checks:
Additional Information