Flatten GRPO main: Group and Episode #400

Jack-Khuu · 2025-10-14T07:10:36Z

The updates boil down to 2 changes that don't alter behavior in grpo/main:

- Group is downgraded from a dataclass to a typedef of list[Episode], since the wrapper class is never required.
- Episode now holds a Completion directly, and the attributes in Episode that duplicated it are removed.
  - See df0e5a9 for how the redundant fields are mapped between Episode and ScoredCompletion.
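For readers without the diff open, the shape of the two changes can be sketched roughly as follows. This is an illustration only: the field names (`prompt`, `text`, `token_ids`, `episode_id`), the `response` property, and the `make_group` helper are assumptions and are not taken from the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Completion:
    """Hypothetical stand-in for the generator's completion type."""
    prompt: str
    text: str
    token_ids: list[int]


@dataclass
class Episode:
    """One rollout; generation fields live on `completion`, not on Episode."""
    episode_id: str
    completion: Completion
    reward: Optional[float] = None
    advantage: Optional[float] = None

    @property
    def response(self) -> str:
        # Derived from the Completion rather than stored a second time.
        return self.completion.text


# After the change, a group is only an alias, not a wrapper dataclass.
Group = list[Episode]


def make_group(prompt: str, responses: list[str]) -> Group:
    """Build one group: several sampled responses to the same prompt."""
    return [
        Episode(
            episode_id=f"ep-{i}",
            completion=Completion(prompt=prompt, text=text, token_ids=[]),
        )
        for i, text in enumerate(responses)
    ]


group = make_group("2 + 2 =", ["4", "5"])
```

Because `Group` is a plain list alias, call sites can iterate over the episodes directly instead of reaching through a `.episodes` attribute.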

(There are also various type-hint improvements sprinkled in.)

Note: This PR does not address or utilize Episode from data_models, but convergence is imminent

python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml

Wandb looks roughly the same
Before: torchforge/grpo-training/runs/wca6wke2
After: torchforge/grpo-training/runs/ul34xjr9

JenniferWang

+1 for removing the Group abstraction.

JenniferWang · 2025-10-14T14:34:20Z

apps/grpo/main.py

            # Calculate advantages and add to replay buffer
-            advantages = await compute_advantages.compute.call_one(group)
-            for episode, advantage in zip(group.episodes, advantages):
+            advantages = await compute_advantages.compute.call_one(episodes)


Not related to this diff, but since we're scrutinizing the main flow again: I think making compute_advantages its own Actor is very weird and probably the opposite of an "optimization".

We do not expose a way to specify the host mesh for a specific actor. Ideally, this actor would be colocated with the generator replica that produces the batch.

ComputeAdvantage only needs the rewards, yet very likely the entire episodes are serialized to reach it.

I wonder whether, for now, it should just be inlined in the sample call, or whether we should allocate a proc on the Policy mesh alongside the PolicyWorker to handle the computation, chaining the calls so the result is returned together in sample.
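To make the "only needs the rewards" point concrete, here is a minimal sketch of an inlined advantage computation. It assumes the usual GRPO group-relative normalization, (reward - mean) / (std + eps); the actual body of compute_advantages is not shown in this PR, so the formula, the `eps` value, the choice of population standard deviation, and the pared-down `Episode` class are all assumptions.

```python
import statistics
from dataclasses import dataclass
from typing import Optional


@dataclass
class Episode:
    """Hypothetical pared-down episode: only the fields this sketch needs."""
    reward: float
    advantage: Optional[float] = None


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


def attach_advantages(episodes: list[Episode]) -> list[Episode]:
    """Compute advantages from the rewards alone and write them back in place."""
    rewards = [ep.reward for ep in episodes]  # only floats leave the episodes
    for ep, adv in zip(episodes, group_advantages(rewards)):
        ep.advantage = adv
    return episodes


episodes = attach_advantages([Episode(1.0), Episode(0.0), Episode(1.0), Episode(0.0)])
```

Since the computation touches nothing but a short list of floats, running it next to the code that already holds the episodes avoids shipping whole episodes to a separate actor.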

allenwang28 · 2025-10-14T14:44:57Z

@JenniferWang these are good points. I want to propose an idea (not for you to implement, @Jack-Khuu; just brainstorming whether this makes sense):

    with policy.session() as s:
        host: HostMesh = await s.get_host_mesh()  # returns the host mesh associated with this replica
        advantages = host.run_task(compute_advantages)  # where compute_advantages is a function

joecummings · 2025-10-14T14:53:15Z

Chained calls would be cool 👀

This looks legit; +1 on chained calls

joecummings

Awesome stuff! Just a bunch of small comments.

Push initial removal; debugging hang

df0e5a9

meta-cla bot added the CLA Signed label Oct 14, 2025

Jack-Khuu marked this pull request as draft October 14, 2025 07:10

Fix hang, remove test properties

85fab12

Jack-Khuu changed the title ~~[WIP] Flatten GRPO main: Group and Episode~~ Flatten GRPO main: Group and Episode Oct 14, 2025

Jack-Khuu requested review from allenwang28, ebsmothers, felipemello1, joecummings and pbontrager October 14, 2025 08:40

Jack-Khuu marked this pull request as ready for review October 14, 2025 08:40

JenniferWang reviewed Oct 14, 2025

allenwang28 reviewed Oct 14, 2025

joecummings reviewed Oct 14, 2025

Jack-Khuu added 4 commits October 14, 2025 09:50

Merge remote-tracking branch 'origin/main' into grpo-group

de51076

Address comments

843420b

Remove device

6c7e600

Merge branch 'main' into grpo-group

9ad1291

JenniferWang approved these changes Oct 14, 2025

joecummings approved these changes Oct 14, 2025

Jack-Khuu merged commit 4c14792 into main Oct 14, 2025
9 checks passed

Jack-Khuu deleted the grpo-group branch October 14, 2025 21:42

allenwang28 pushed a commit to allenwang28/forge that referenced this pull request Oct 15, 2025

Flatten GRPO main: Group and Episode (meta-pytorch#400)

25ebf86
