
Conversation

@joecummings joecummings commented Aug 18, 2025

What this PR DOES do:

Core GRPO Implementation:

  • Complete GRPO training system for RL fine-tuning
  • Integration with vLLM policy actors for text generation during rollouts
  • Multi-reward system supporting math correctness and thinking tag rewards (see the sketch after this list)
  • Replay buffer with configurable batch sizes and policy version tracking
  • Async training loops with concurrent rollout generation and policy training
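
For illustration, a minimal sketch of what a multi-reward setup along these lines could look like (the function names and the `####` answer parsing are assumptions for this example, not the PR's actual code):

```python
import re


def thinking_tag_reward(response: str) -> float:
    # Reward responses that wrap their reasoning in <think>...</think> tags.
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0


def math_correctness_reward(response: str, target: str) -> float:
    # Naive exact-match check on the final answer after a GSM8K-style "####" marker.
    answer = response.split("####")[-1].strip()
    return 1.0 if answer == target.strip() else 0.0


def total_reward(response: str, target: str) -> float:
    return math_correctness_reward(response, target) + thinking_tag_reward(response)
```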

Service Infrastructure:

  • Service-based architecture using Monarch actors for distributed components
  • Policy actors with configurable sampling, device assignment, and tensor parallelism
  • Trainer actors with GRPO loss computation, KL regularization, and gradient clipping
  • Reference model actors for computing baseline log probabilities
  • Advantage computation using reward-to-go with normalization
  • Dataset integration with GSM8K math problems

What this PR will NOT cover:

Missing Features:

  • Weight synchronization between trainer and policy (commented out: # await trainer.update_weights(policy)) @pradeepfn @pbontrager
  • Policy versioning system (hardcoded to version 0) @joecummings @Jack-Khuu
  • Logging/monitoring infrastructure (No wandb / tensorboard integration) @calvinpelletier @DNXie
  • Multi-turn conversations or complex dialog handling
  • Production-ready error handling and recovery mechanisms @allenwang28

Incomplete Work:

  • TODO: Move policy processing initialization into setup (line 453 in **main.py**) @Jack-Khuu
  • Manual device assignment instead of automatic GPU allocation @pbontrager
  • Hardcoded hyperparameters throughout the system

Technical Limitations:

  • Single model support (Qwen3-1.7B)
  • Fixed reward functions (no pluggable reward interface) @DNXie
  • No distributed training across multiple machines

This is a working prototype of GRPO with the core RL loop functional but requiring additional work for production deployment, proper weight updates, and comprehensive logging.

@meta-cla meta-cla bot added the CLA Signed label Aug 18, 2025
training_step = 0
while True:
    batch = await replay_buffer.sample.choose(curr_policy_version=0)
    if batch is None:
Contributor:

I feel this should just be inside of the buffer.sample method. Then you'll just await until there's enough data for a sample.

Member Author:

Hmm, I disagree. The contract here is just that the buffer will return a sample when it computationally can, not only when there's something inside.

In addition, this would push possible errors a layer down if the replay buffer isn't getting filled for some reason. I'd rather have this be logic that's exposed to the user.

Contributor:

When does sample return None? Is it when the number of usable rollouts is < batch_size? I feel like this is going to surprise some people who try to use this and start running into Nones in their batch.

Member Author:

Yeah, if the replay buffer has nothing in it.

Contributor:

I think exposing the logic to the user makes sense; my only nit is that I might want it to have more of a queue-like semantic, e.g. we set a timeout and it raises an error if we hit the timeout.

Member Author:

Does choose have a built-in timeout feature?

Contributor:

It doesn't, but the buffer itself can return an exception, which will be propagated through choose.

However, the current service implementation will then mark the replica as unhealthy, so we shouldn't add this yet.
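
A possible caller-side version of the queue-like semantics being discussed, as a hedged sketch (the `replay_buffer.sample.choose` call is from the PR; the polling helper, its name, and the timeout behavior are assumptions):

```python
import asyncio


async def sample_with_timeout(replay_buffer, policy_version: int,
                              timeout_s: float = 60.0, poll_s: float = 1.0):
    """Poll the replay buffer until a batch is available or the deadline passes."""
    deadline = asyncio.get_running_loop().time() + timeout_s
    while True:
        batch = await replay_buffer.sample.choose(curr_policy_version=policy_version)
        if batch is not None:
            return batch
        if asyncio.get_running_loop().time() >= deadline:
            raise TimeoutError("replay buffer produced no batch before the timeout")
        await asyncio.sleep(poll_s)
```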

Contributor
@ebsmothers ebsmothers left a comment

A bunch of initial questions -- I understand this is not really meant to be the final form of the GRPO loop so let me know which gaps are deliberate hacks to unblock vs ones that require more thought. (Ideally just put this in the PR summary to be explicit.)

Would also be good to start factoring out some stuff like rewards, actors, etc into separate files so that main.py starts looking closer to a pure training script.

) # Remove batch dimension for single response


class DatasetActor(ForgeActor):
Contributor:

Longer-term what is our plan here? (I.e. is our HfIterableDataset compatible with how we're setting things up here?)

Member Author:

I think we need to rebuild our HfIterableDataset based on the patterns we observe when testing out this training loop. Right now, it's not super intuitive to use, and it's also not clear what we actually need from it.

My guess is that we'll keep 75% of the current implementation, but I don't want to assume before we actually start testing.

Comment on lines +375 to +376
ds = ds.map(gsm8k_to_messages)
ds = ds.shuffle()
Contributor:

I notice no split_dataset_by_node. Is this by design?

Member Author:

No, just missing it. Will add.
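
For reference, a sketch of how `split_dataset_by_node` from Hugging Face `datasets` would slot in here (the rank/world-size plumbing is an assumption about how this actor gets its coordinates):

```python
from datasets.distributed import split_dataset_by_node

# Shard the dataset so each data-loading rank sees a disjoint slice,
# then apply the same transforms as before.
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
ds = ds.map(gsm8k_to_messages)
ds = ds.shuffle()
```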

avg_reward = sum(group.reward for group in episode.groups) / len(
    episode.groups
)
wandb.log({"rollout_count": rollout_count, "avg_reward": avg_reward})
Contributor:

Will this eventually be its own actor?

Member Author:

I'm not sure it needs to be, unless we're making so many requests from all over to log things that we want them going through a central source.

Otherwise, we can just treat the logger as a normal component.

Contributor:

Would wandb complain about having multiple "sessions" created from the same job? If we can minimize the messages being passed through Monarch, that generally seems like a good idea.

Contributor:

I think we want it to be its own actor so logs can be called from within actors too. It's a lot of unnecessary data passing otherwise, especially if people log artifacts.

Contributor:

Hmm, I'm not sure I understand that comment @pbontrager.

If the metrics logger is an actor, then all actors would need to pass messages over Monarch to the metrics actor, right? If actors can just write directly to wandb, that would minimize data passing.
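
For concreteness, a rough sketch of the actor-based option being debated, reusing the `ForgeActor`/`@endpoint` pattern visible elsewhere in this PR (the class name, endpoints, and wandb usage here are assumptions, not code from the PR):

```python
# `ForgeActor` and `endpoint` are the project's actor primitives (imports omitted).
class MetricsLogger(ForgeActor):
    """Single owner of the wandb session; other actors send it small metric dicts."""

    @endpoint
    async def setup(self, project: str) -> None:
        import wandb

        self.wandb = wandb
        self.wandb.init(project=project)

    @endpoint
    async def log(self, metrics: dict, step: int | None = None) -> None:
        # Called remotely by trainer / rollout actors instead of each opening
        # its own wandb session.
        self.wandb.log(metrics, step=step)
```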

self.lambda_ = lambda_ # GAE lambda parameter

@endpoint
async def __call__(self, groups: list[Group]) -> list[float]:
Contributor:

Is this based on some specific reference implementation?

Member Author:

My brain, the old impl of GRPO in torchtune, and good ol devmate.

Contributor:

I'm not sure I trust that consensus...

Contributor:

We're probably not going to want a Group type, but we've put the type work on the back burner for now.

Contributor:

We should spin up a track fully dedicated to ensuring numerical correctness (which I think is the basis of @ebsmothers' comment), but IMO the right timing for this is after we have the working prototype.
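
For that correctness track, the usual GRPO baseline is a group-relative advantage, i.e. each response's reward normalized against its own group's statistics; a reference sketch (not necessarily what this PR computes) looks like:

```python
import torch


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    # A_i = (r_i - mean(r)) / (std(r) + eps), computed per group of responses
    # sampled from the same prompt.
    r = torch.tensor(rewards, dtype=torch.float32)
    return ((r - r.mean()) / (r.std() + eps)).tolist()
```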

@endpoint
async def add(self, trajectory: Trajectory) -> None:
    self.buffer.append(trajectory)
async def add(self, episode) -> None:
Contributor:

So we are moving from Trajectories to Episodes in the replay buffer? Isn't this overly tailored to GRPO?

Member Author:

Everything in here is overly tailored to GRPO. I want to base our design decisions on real life, otherwise we can talk all day. This is definitely not the final state of our APIs.

Contributor:

Episode isn't GRPO-specific; it's basically another word for Trajectory.

@pbontrager pbontrager mentioned this pull request Aug 26, 2025
@joecummings joecummings changed the title [WIP] Skeleton of GRPO Skeleton of GRPO Aug 27, 2025
Contributor
@allenwang28 allenwang28 left a comment

This is awesome, great work @joecummings!

self.prompt = prompt
self.target = target
self.policy_version = policy_version
self.groups: list[Group] = []
Contributor:

Naming nit, but isn't what we're calling Group here actually an individual output, and what we're calling self.groups the actual Group?

Member Author:

Fair lol

ref_logprobs_list = []
advantages_list = []

for group in groups:
Contributor:

No action needed now, but IMO it would be nifty for the episode structure itself to handle things like this:

for episode in batch:
    tokenized = self.tokenizer(
        episode.as_response_group(),
        ...
    )
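
A hedged sketch of what that could look like; the field names and the `as_response_group` helper are hypothetical, loosely based on the `prompt`/`groups`/`reward` attributes visible in this diff:

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    response: str
    reward: float


@dataclass
class Episode:
    prompt: str
    groups: list[Group] = field(default_factory=list)

    def as_response_group(self) -> list[str]:
        # One prompt+response string per group member, so the trainer can
        # tokenize a whole episode with a single tokenizer call.
        return [self.prompt + g.response for g in self.groups]
```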


Contributor
@pbontrager pbontrager left a comment

This is really great! There will be a lot of refactor work going forward so I won't push for changes here.

@endpoint
async def setup(self):
    # Set up policy_worker
    self.available_devices = (
Contributor:

This should really come from the script and not the config. Ideally the script makes the master set of ranks and then passes them to the actors.
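
A hedged sketch of that suggestion, i.e. the driver script owning the device map and handing slices to the actors at spawn time (the partitioning and the spawn call in the comment below are assumptions, not the project's API):

```python
import torch

# The script builds one master list of devices...
all_devices = list(range(torch.cuda.device_count()))

# ...and partitions it explicitly instead of each actor reading a config.
policy_devices = all_devices[:2]
trainer_devices = all_devices[2:4]
ref_model_devices = all_devices[4:6]

# Hypothetical: passed into each actor when it is spawned, e.g.
# policy = await spawn(Policy, available_devices=policy_devices)
```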

@joecummings joecummings merged commit 6d76a41 into meta-pytorch:main Aug 28, 2025
4 checks passed