
Conversation

Contributor

@Ritesh1905 Ritesh1905 commented Sep 18, 2025

A simple toy-app RL loop that (almost) converges in less than 5 minutes. It uses a much simpler REINFORCE loss; I could not get the reward mean to converge with the GRPO loss. Sending this PR to get early feedback, and once this makes sense I will figure out how to make it work with the GRPO loss.

https://meta.wandb.io/rithesh/sumdigits-training/runs/kmj952x7?nw=nwuserrithesh

[image: W&B plot from the run linked above]
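
For reference, a minimal sketch of the kind of REINFORCE-style loss referred to above (PyTorch; the function name, tensor shapes, and mean-reward baseline are illustrative assumptions, not this PR's actual implementation):

import torch

def reinforce_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # logprobs: [batch, seq] log-probabilities of the sampled tokens under the current policy
    # rewards:  [batch] scalar reward per episode
    advantages = rewards - rewards.mean()  # simple mean baseline to reduce variance
    # Negate so that minimizing the loss raises the probability of high-reward episodes.
    return -(advantages.unsqueeze(-1) * logprobs).mean()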

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Sep 18, 2025
@Ritesh1905 Ritesh1905 marked this pull request as ready for review September 18, 2025 18:12
Member

@joecummings joecummings left a comment


Overall, this looks great and makes huge strides toward correctness in Forge. Just a couple of comments.



Scalar = Union[int, float]

Member


I wouldn't say we're confident that these Episode and Group abstractions are the best ones yet - I'd be more comfortable if you just copy-pasta'd them into the sumdigits.py file in order to use them for now.

Contributor Author


@vidhyav is rolling out the abstractions soon. Just centralizing this so that he has one place to fix.

Let me know if you still want me to copy-paste them.

Member


I would still prefer a copy paste if that's alright? Sorry for being a stickler :)

Contributor Author


Cool. Copy-pasted the code and added a TODO.

mlogger.log("loss/training_step", loss, training_step)
print(f"loss/training_step: {loss} at {training_step}")
if training_step % 5 == 0:
    await trainer.push_weights.call(policy_version)
Member


Weight sync is off by 5?

Contributor Author


Yes, because this is a toy app and the weight sync takes a long time. :)

Ideally I would like us to have accumulate-and-apply-gradients abstractions so that we can just accumulate the gradients and apply them after every N batches (in this case 5), along the lines of the sketch below.
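
For illustration, a minimal sketch of such an accumulate-and-apply pattern in plain PyTorch (the function and its arguments are hypothetical, not Forge APIs):

def train_with_grad_accumulation(optimizer, batches, compute_loss, accum_steps=5):
    # compute_loss is a hypothetical callable(batch) -> scalar loss tensor.
    optimizer.zero_grad()
    for step, batch in enumerate(batches, start=1):
        loss = compute_loss(batch)
        # Scale so the applied update averages over the accumulated batches.
        (loss / accum_steps).backward()
        if step % accum_steps == 0:
            optimizer.step()      # apply the accumulated gradients
            optimizer.zero_grad()
            # a weight sync to the generator/policy would go here, once every N batches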

Member


Makes sense - just curious, how much faster does it converge when weight sync is just off by 1 via the replay buffer?

Contributor Author


Updated to be on-policy. We can figure out what's best later when we're setting up the CI.

Member

@joecummings joecummings left a comment


LGTM!

@joecummings joecummings merged commit d4fb5e1 into main Sep 18, 2025
5 checks passed
@Ritesh1905 Ritesh1905 deleted the rithesh/toy_app branch September 18, 2025 20:37
@JenniferWang
Contributor

@Ritesh1905, based on your experience, is this a regression?
[image: W&B plot from the run linked below]

https://meta.wandb.io/jiyue/sumdigits-training/runs/ah2js4mb?nw=nwuserjiyue
