Conversation

joecummings
Member

@meta-cla meta-cla bot added the CLA Signed label Sep 8, 2025
@joecummings joecummings changed the title from Working updates to On-policy GRPO Sep 8, 2025
@casteryh
Contributor

casteryh commented Sep 9, 2025

Just FYI, the link gives me a 404.

@joecummings
Member Author

Just FYI, the link gives me a 404.

https://wandb.ai/jcummings/grpo-training/workspace?nw=nwuserjcummings

Try now?

Contributor

@pbontrager pbontrager left a comment

This is very important to get in; really, thanks for this! I added a comment on how I think we should handle the policy update logic, though.

return
prompt, target = sample["request"], sample["target"]
version = 0 # await policy.get_current_version.choose()
responses = await policy.generate.choose(prompt)
Contributor

We'll throw away a lot of data this way for fully on-policy training.

Member Author

For short responses, yeah, definitely. If you look at the WandB logs (buffer_size/rollout), you can see that we build up a buffer of about 100 episodes and then evict the majority of them back and forth during weight updates.

When we start allowing much longer generations and using much bigger models, this won't be as big of an issue.
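For illustration, a minimal sketch of the kind of version-based eviction being described; Episode, ReplayBuffer, and policy_version are hypothetical names for this sketch, not the PR's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    prompt: str
    response: str
    policy_version: int  # version of the policy that generated this episode


@dataclass
class ReplayBuffer:
    episodes: list[Episode] = field(default_factory=list)

    def add(self, episode: Episode) -> None:
        self.episodes.append(episode)

    def evict_stale(self, current_version: int) -> int:
        """Drop episodes generated under an older policy; return how many were evicted."""
        before = len(self.episodes)
        self.episodes = [e for e in self.episodes if e.policy_version == current_version]
        return before - len(self.episodes)
```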

logger.log("loss/training_step", loss_value, training_step)
# await trainer.update_weights(policy)
logger.log("loss/training_step", loss, training_step)
await trainer.push_weights.choose(policy_version)
Contributor

This should technically also be call, even though choose works since replicas=1.

Contributor

This should technically also be call, even though choose works since replicas=1.

As a side note, even if we do call(), what we're doing here is having all the trainer replicas train on the same batch, right?

The replicas are just for fault tolerance? In that case, if we want different trainers to train on different batches, the trainers themselves have to pull the batches, right?

An alternative is to split the batch into microbatches and call choose() on each microbatch. After the whole batch is done, we do an all_reduce (or another form of reduce) to average the weights.
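A rough sketch of that microbatch idea, assuming only that trainer.train_step.choose(mb) is awaitable as in the snippets above; the Python-side averaging at the end stands in for the proposed all_reduce over weights and is purely illustrative.

```python
import asyncio


async def train_on_microbatches(trainer, batch, num_microbatches: int):
    """Split a batch into microbatches and dispatch each one via choose()."""
    size = max(1, len(batch) // num_microbatches)
    microbatches = [batch[i : i + size] for i in range(0, len(batch), size)]

    # Each choose() routes one microbatch to some replica of the trainer service.
    losses = await asyncio.gather(
        *(trainer.train_step.choose(mb) for mb in microbatches)
    )

    # Stand-in for the reduce step (e.g. an all_reduce averaging weights across replicas).
    return sum(losses) / len(losses)
```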

await asyncio.sleep(0.1)
else:
training_result = await trainer.train_step.choose(batch)
loss = await trainer.train_step.choose(batch)
Contributor

This should also be a call

@casteryh
Contributor

casteryh commented Sep 9, 2025

Just FYI, the link gives me a 404.

https://wandb.ai/jcummings/grpo-training/workspace?nw=nwuserjcummings

Try now?

Saw the graphs 📈 nice!! it's working!

project="grpo-training",
)

store = await MultiProcessStore.create_store()
Member Author

@LucasLLC Is this still the recommended way of doing things?

Contributor

Seems like we are using ts.initialize() now and there is a global singleton torchstore. But I will let @LucasLLC weigh in.

f"Starting model update from torchstore with key: {self.state_dict_key}{DELIM}{version}"
)

key = f"{self.state_dict_key}{DELIM}{version}"
Contributor

Let's make this a function since the caller also uses it

Contributor

Definitely make this a function, and we should do f"{version}{DELIM}{self.state_dict_key}" instead, which is the correct hierarchy to use given the new keys(prefix) in the torchstore API.
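A minimal sketch of the suggested helper; the function name and the DELIM value shown here are placeholders, while the f-string itself is the one proposed above.

```python
DELIM = "/"  # assumption: stands in for whatever delimiter constant the module already defines


def model_update_key(version: int, state_dict_key: str, delim: str = DELIM) -> str:
    """Build the torchstore key with the version as the top-level prefix, per the suggestion above."""
    return f"{version}{delim}{state_dict_key}"
```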

@joecummings joecummings changed the title from On-policy GRPO to Off-policy GRPO Sep 11, 2025
@joecummings joecummings changed the title from Off-policy GRPO to Off-by-1 GRPO Sep 11, 2025
beta: float = 0.1
epsilon: float = 0.1
device: torch.device | None = None
store: MultiProcessStore | None = None
Contributor

This isn't the currently recommended way to use the store. You should just access it as a singleton inside the trainer.


self.logger.info(f"Trainer model initialized on {self.device}")

def _qwen3_hf_to_vllm(self, saved_sd):
Contributor

Can you put this in the trainer.py file? We'll need to reuse this with titan, and this will merge nicely with Pradeep's PR.

await asyncio.sleep(0.1)
else:
training_result = await trainer.train_step.choose(batch)
loss = sum(await trainer.train_step.call(batch))
Contributor

This shouldn't be returning a list to sum over, right? There should be 1 replica for this service?

Contributor

Service call currently returns a list regardless of num_replicas

Contributor

I was wrong earlier; this should be a choose call. I think only policy.update_weights should be a call.
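To summarize the choose/call distinction discussed in this thread, a hedged sketch assuming the semantics described above (choose routes to a single replica; call fans out to every replica and returns a list even when replicas=1):

```python
async def train_once(trainer, batch):
    """Illustrative comparison of the two endpoint invocation styles discussed here."""
    # choose(): route the batch to one replica and get a single result back.
    loss = await trainer.train_step.choose(batch)

    # call(): fan out to every replica; the result is a list even when replicas=1,
    # which is why the PR sums over it before logging.
    losses = await trainer.train_step.call(batch)
    mean_loss = sum(losses) / len(losses)
    return loss, mean_loss
```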

asyncio.run(main(cfg))
if __name__ == "__main__":

@parse
Contributor

Can you explain this change?


@endpoint
async def generate(self, prompt: str, priority: int = 0) -> List[CompletionOutput]:
async def generate(self, prompt: str, priority: int = 0) -> RequestOutput:
Contributor

Why was this necessary? This pattern is never great for readability

start = time.time()
await self._load_tensor_parallel_state_dict(current_state_dict, version)
logger.debug("Successfully updated model weights from torchstore")
self.logger.debug(
Contributor

Better to log this to WandB than spam the terminal.
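A hypothetical sketch of what that could look like, reusing the lines from the diff above and the metric-logger pattern logger.log(name, value, step) used for loss elsewhere in this PR; the function, metric name, and step choice are made up for illustration.

```python
import time


async def update_and_report(worker, state_dict, version, metrics):
    """Time the state-dict load and report the duration as a metric instead of a debug line."""
    start = time.time()
    await worker._load_tensor_parallel_state_dict(state_dict, version)
    # Same logging pattern as the loss metric: name, value, step.
    metrics.log("weight_update/duration_s", time.time() - start, version)
```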

Contributor

Why do we need to use self.logger instead of just logger?

Member Author

With the creation of ForgeActor, each actor has its own logger and should use that (because it logs information about which actor is logging) instead of a global logger.

Contributor

Ah, the ForgeActor logger was implemented in a way where it's supposed to just work with logger (i.e., no need for self.logger), but please let me know if that doesn't work.
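For context, a hypothetical illustration (not the actual ForgeActor code) of why a per-actor logger identifies the emitting actor; the class and logger names are invented for this sketch.

```python
import logging


class ActorWithLogger:
    """Hypothetical stand-in for ForgeActor: each actor owns a logger named after itself."""

    def __init__(self) -> None:
        # Every record emitted through self.logger carries the actor's class name.
        self.logger = logging.getLogger(f"forge.{type(self).__name__}")


class Trainer(ActorWithLogger):
    def train_step(self) -> None:
        self.logger.info("running a train step")  # record is tagged "forge.Trainer"
```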

Contributor

@allenwang28 allenwang28 left a comment

Generally looks good; if you could just do me a favor and add some of these issue tags!

pass
async def push_weights(self, version: int):
"""Update policy model weights with trainer's current weights."""
start_time = time.time()
Contributor

Suggested change:
- start_time = time.time()
+ # TODO - issues/148 followup
+ start_time = time.time()

Just for my future reference, tagging some pieces for observability


@endpoint
async def compute(self, group: Group) -> list[float]:
# TODO: add batch processing
Contributor

Suggested change:
- # TODO: add batch processing
+ # TODO: issues/120 add batch processing

model = self.worker.model_runner.model
current_state_dict = model.state_dict()

start = time.time()
Contributor

Suggested change:
- start = time.time()
+ # TODO - issues/148
+ start = time.time()

Contributor

@pbontrager pbontrager left a comment

Thanks for going through so much detail. Please remove the "_functions" but good to go.

@joecummings joecummings merged commit a6ca591 into meta-pytorch:main Sep 12, 2025
5 checks passed
self.engine.checkpointer.close()


def _qwen3_hf_to_vllm(
Contributor

Maybe for the next PR: thoughts on having this on the generator/policy side? Ideally the trainer should be agnostic to how the policy packs/transforms certain tensors.

"""Compute math correctness reward."""
# Parse expected
expected_answer = self._to_float(target)
target_number = self._to_float(target)
Contributor

Curious: do we have a plan for math verifier libraries that provide certain things (ExactMatch, partial match, etc.) out of the box?
