
Conversation

@casteryh casteryh commented Sep 19, 2025

Add a flag use_vllm_builtin_load to the policy and trainer.
When set to true, the vllm builtin load_weights() method is used to exchange weights.
In particular, this works correctly with TP.
Tested with the sumdigits example at tp_size = 2.
https://meta.wandb.io/torchforge/sumdigits-training/runs/f9jb060e/panel/2pa6e8ptg?nw=nwuseryuxuanh
avg_reward >0.9 in 3k steps
[W&B chart: avg_reward over training steps, captured 9/23/2025 10:22 PM]
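For orientation, a minimal sketch of what the flag gates (hedged: _update_vllm_builtin is a hypothetical name; _update_hf_nonsharded appears in the diff below):

    # Hedged sketch, not this PR's exact control flow: the flag selects
    # between the new vllm builtin path and the older HF state-dict path.
    @endpoint
    async def update_weights(self, policy_version: int):
        if self.use_vllm_builtin_load:
            await self._update_vllm_builtin(policy_version)  # hypothetical name
        else:
            await self._update_hf_nonsharded(policy_version)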

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Sep 19, 2025
@casteryh changed the title from "[WIP][not for land] use vllm load_weights() in GRPO" to "[WIP] use vllm builtin load_weights" Sep 23, 2025
@casteryh changed the title from "[WIP] use vllm builtin load_weights" to "[WIP] vllm builtin load_weights()" Sep 23, 2025
@casteryh changed the title from "[WIP] vllm builtin load_weights()" to "Weight loading working correctly with tp: use vllm builtin load_weights()" Sep 24, 2025
@casteryh marked this pull request as ready for review September 24, 2025 05:49

logger.debug(f"Starting weight update on {self.__class__.__name__}")
await self.policy_worker.update.call(version=policy_version)
if self.use_vllm_builtin_load:
Contributor

Eventually, this will be the default right?

Contributor Author

seems like the plan

logger.debug(f"Loaded state dict from {key} in {time.time() - start} seconds")

@endpoint
async def _update_hf_nonsharded(self, version: int):
Contributor

Why is this specific to hf??

Contributor Author (@casteryh, Sep 25, 2025)

This just means we are pushing/reading the state dict in the Hugging Face format, not titan, not vllm.

Contributor

Perhaps call it update_DEPRECATED and update. I'd like to keep the DEPRECATED one just for A/B testing and delete it before the PTC.

Contributor

Could you explain the choice between using get_state_dict/put_state_dict and the get/put API?

Contributor

I am also confused at the load_weights API -- will it handle sharding itself? If so, should we call this function on the driver worker (0) once?

Contributor Author

> I am also confused at the load_weights API -- will it handle sharding itself? If so, should we call this function on the driver worker (0) once?

Every worker (rank) has to call load_weights(). When it does, each worker figures out its own rank and reads only its own shard.

Moreover, load_weights() supports incremental updates: if there is only one tensor in the passed-in weights, it will update that part specifically (it even handles concatenated weights). For example, if you pass in a single kv pair "model.layers.0.q_proj.xxx" -> full_tensor (I am making up the fqn, but you get the point), it will actually update the q_proj part of the fused qkv_proj weight.
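A minimal sketch of that per-tensor pattern, assuming model is the vllm nn.Module already constructed on this rank (the fqn is made up, as in the comment above):

    import torch

    # Each TP rank runs this itself; vllm maps the HF-format name onto its
    # fused qkv_proj parameter and copies only this rank's shard of it.
    name = "model.layers.0.self_attn.q_proj.weight"  # illustrative fqn
    full_tensor = torch.randn(4096, 4096)            # full, unsharded weight

    # A single (name, tensor) pair updates just the q_proj slice of the
    # fused weight; load_weights returns the names of the params it touched.
    loaded = model.load_weights([(name, full_tensor)])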

@vidhyav (Contributor) commented Sep 24, 2025

Is the source of the problem also that we have deconstructed the vllm engine and are using it piecemeal instead of using it as a whole?

Comment on lines 410 to 414
mlogger.log(
    "push_weights_time/training_step",
    time.perf_counter() - start_time,
    training_step,
)
Contributor

This is great! I think let's split this diff into:

  1. add a weight sync counter
  2. add options to do per-tensor weight sync

Contributor Author

added this for debugging myself, will do!


@endpoint
async def update_weights(self):
async def update_weights(self, policy_version: int):
Contributor

You probably want to rebase on this #181
I'll address the comments and merge the PR ASAP

else:
await self._push_weights_sharded(policy_version)

async def _push_weights_sharded(self, policy_version: int) -> None:
Contributor

I have some confusion: in my fix, I feel that this path is not actually sharded. The difference is just whether we process the state dict or not. Basically, we just need to skip the _qwen3_hf_to_vllm path and keep the rest as is?

Contributor Author

Yep, I am bad at naming things; it actually has nothing to do with sharding at this point.
Maybe we just call this push_weights_vllm vs push_weights_hf (or push_weights_DEPRECATED vs push_weights, if you will).
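In code terms, the difference is whether the HF-to-vllm key transform runs before pushing; a hedged sketch (prepare_push is a made-up helper name; _qwen3_hf_to_vllm is from this thread):

    # Hedged sketch: the vllm path pushes HF-format keys untouched and lets
    # the policy-side load_weights() do the fusing/renaming; the old path
    # pre-converts the keys into vllm's fused layout.
    def prepare_push(state_dict: dict, use_vllm_builtin_load: bool) -> dict:
        if use_vllm_builtin_load:
            return state_dict
        return _qwen3_hf_to_vllm(state_dict)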

logger.debug(f"Loaded state dict from {key} in {time.time() - start} seconds")

@endpoint
async def _update_hf_nonsharded(self, version: int):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps call it update_DEPRECATED and update. I'd like to keep the DEPRECATED one just for A/B testing and delete it before the PTC.

logger.debug(f"Loaded state dict from {key} in {time.time() - start} seconds")

@endpoint
async def _update_hf_nonsharded(self, version: int):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain the choice between using get_state_dict/get_state_dict and the get/put API?

logger.debug(f"Loaded state dict from {key} in {time.time() - start} seconds")

@endpoint
async def _update_hf_nonsharded(self, version: int):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also confused at the load_weights API -- will it handle sharding itself? If so, should we call this function on the driver worker (0) once?

@casteryh (Contributor Author) commented Sep 26, 2025

@JenniferWang

> Could you explain the choice between using get_state_dict/put_state_dict and the get/put API?

get_state_dict fetches all the weights at once, which we probably don't want.
I don't have a problem with put_state_dict per se, but if I am not using get_state_dict, I am not sure how to properly read something written with put_state_dict.
I'd rather have a flat kv structure where I can control everything myself.
Also, the torchstore API explicitly says get_state_dict and put_state_dict are for testing purposes (last time I checked the codebase); I have no idea why they're used here in the first place.
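A sketch of the flat kv pattern being argued for, assuming torchstore's async get/put pair discussed above and the get_param_key helper that appears later in this diff:

    import torchstore as ts

    async def push_flat(version: int, hf_state_dict: dict) -> None:
        # Trainer side: one key per (version, parameter), no monolithic blob.
        for name, tensor in hf_state_dict.items():
            await ts.put(get_param_key(version, name), tensor)

    async def read_one(version: int, name: str):
        # Policy side: read back exactly one tensor at a time.
        return await ts.get(get_param_key(version, name))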

@casteryh (Contributor Author) commented Sep 26, 2025

@vidhyav

> Is the source of the problem also that we have deconstructed the vllm engine and are using it piecemeal instead of using it as a whole?

What we are doing here is using the vllm workers but letting monarch handle the collective operations instead of vllm's default method. That said, we can still use the load_weights() method on the vllm workers.

  • Ordinary vllm weight loading: vllm uses its own collectives and calls load_weights() on each worker.
  • This PR / our approach: monarch calls load_weights() on each worker.
  • Previous buggy approach: each vllm worker (wrapped in a monarch actor) tried to circumvent the load_weights() method altogether and directly operate on the state_dict() of the underlying nn.Module.
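Under the actor conventions visible elsewhere in this PR, the middle approach looks roughly like this; a sketch that mirrors the per-parameter loop shown further down the diff (the attribute chain to the underlying module is an assumption):

    @endpoint
    async def update(self, version: int):
        # Monarch broadcasts this endpoint to every vllm worker actor; each
        # rank calls the engine's own load_weights() locally instead of
        # writing into the nn.Module's state_dict directly.
        model = self.worker.model_runner.model  # assumed path to the module
        for name in hf_names:
            param = await ts.get(get_param_key(version, name))
            model.load_weights([(name, param)])
            del param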


@endpoint
async def push_weights(self, version: int):
async def push_weights_DEPRECATED(self, policy_version: int): # noqa: N802
Member

If we're confident in this fix, we should just fully delete the old way. My thinking is as follows:

  1. Gets everyone immediately testing the new version for any bugs 👍
  2. Reduces the chance an end user sees and uses this endpoint 👍
  3. Less code to parse through right now 👍

Contributor Author (@casteryh, Sep 26, 2025)

> Gets everyone immediately testing the new version for any bugs 👍

Yes, the new one is the default now! I think the plan is to keep the DEPRECATED method just for benchmarking purposes for now? @JenniferWang

# Instead, we just call load_weights with one parameter at a time.
for name in hf_names:
    param = await ts.get(get_param_key(version, name))
    loaded = model.load_weights([(name, param)])
Member

This is super cool! I didn't realize you could do it per-param :)

Contributor Author

yeah it's surprisingly good

    loaded = model.load_weights([(name, param)])
    del param
    loaded_weights.update(loaded)
self.logger.info(f"Updated {len(loaded_weights)} parameters")
Member

nit: I prefer the old debug message that prints out the time it took to update the weights

Contributor Author

will add it back
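For reference, a minimal sketch of the timing message being restored, in the style of the debug lines quoted elsewhere in this thread (names assumed from the surrounding diff):

    import time

    start = time.perf_counter()
    loaded = model.load_weights([(name, param)])
    logger.debug(
        f"Loaded {len(loaded)} weights in {time.perf_counter() - start:.3f} seconds"
    )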

@joecummings (Member) left a comment

This seems reasonable!

@parse
def _main(cfg):
    asyncio.run(main(cfg))
    with TemporaryDirectory(prefix="forge_run_", dir="/dev/shm") as dcp_path:
Member

?

Contributor Author

yeah this won't work lemme revert

@casteryh merged commit 5f19d68 into meta-pytorch:main Sep 27, 2025
5 checks passed
@casteryh deleted the weight-loading branch September 27, 2025 05:23