@Ritesh1905 (Contributor) commented Sep 19, 2025

The mean reward converges in fewer than 15 training steps, and the run takes ~5 minutes.

wandb run: https://meta.wandb.io/torchforge/sumdigits-training/runs/uxzowpkp?nw=nwuserrithesh

  • Corrects the GRPO loss function.
  • Updates the sumdigits example to use the GRPO loss.
  • Introduces a small curriculum setup.
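
For readers unfamiliar with GRPO, the core idea behind the loss is that each sampled response is scored relative to the other responses for the same prompt. The sketch below is not the code from this PR: it only illustrates the commonly described group-relative advantage step, assuming a hypothetical exact-match reward for a digit-summing prompt. The function names, the `eps` value, and the choice of population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def exact_match_reward(response: str, target: int) -> float:
    """Hypothetical reward: 1.0 if the response parses to the target integer, else 0.0."""
    try:
        return 1.0 if int(response.strip()) == target else 0.0
    except ValueError:
        # Non-numeric responses get no reward.
        return 0.0


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    eps is an assumed small constant that avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative group: four sampled responses to "sum the digits of 472".
target = sum(int(d) for d in "472")
responses = ["13", "12", "13", "not sure"]
rewards = [exact_match_reward(r, target) for r in responses]
advantages = group_relative_advantages(rewards)
```

Correct responses end up with positive advantages and incorrect ones with negative advantages. When every response in a group gets the same reward, all advantages are zero, so that group contributes no learning signal.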

TODO:

  • There is a lot of redundant code, which will be cleaned up once the abstractions have landed.

@meta-cla meta-cla bot added the CLA Signed label Sep 19, 2025
@Ritesh1905 Ritesh1905 changed the title from "Sumdigits exmaple with GPRO" to "Sumdigits exmaple with GRPO" Sep 19, 2025
@Ritesh1905 Ritesh1905 marked this pull request as ready for review September 19, 2025 01:50
@Ritesh1905 Ritesh1905 merged commit ce155a3 into main Sep 19, 2025
5 checks passed
@Ritesh1905 Ritesh1905 deleted the rithesh/grpo branch October 7, 2025 17:30