
Conversation

casteryh (Contributor) commented Oct 2, 2025

#252
test run: https://meta.wandb.io/torchforge/grpo-training/runs/9epdrv7m

```
> ls forge_dcp_tmp/
policy_ver_0000000014.dcp_whole_state_dict
```

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 2, 2025
@casteryh casteryh requested a review from joecummings October 2, 2025 20:55
Member:
Why not put this in _torchstore_utils?

It seems that's where all the other weight-sync information is. If that's not supposed to be the end location, I'd almost rather have all of _torchstore_utils moved out into a weight_sync.py file.

casteryh (Contributor Author):

> rather all _torchstore_utils be moved out into a weight_sync.py

Maybe I will do this. Let me know what you think. Do you want me to make it _weight_sync.py instead?

```python
)


async def drop_weights(version: int):
```
Member:

Now that we're here, I'd prefer a name like "delete_old_weights"

And instead of version, something like "oldest_version_to_keep"



```python
async def drop_weights(version: int):
    print(f"Dropping weights @ version {version}")
```
Member:

Remove this

```python
for key in matching_keys:
    await ts.delete(key)
elapsed = time.perf_counter() - start_time
print(f"Dropped weights @ version {version}, took {elapsed:.2f} seconds")
```
Member:

Log instead of print

```python
# await drop_weights(training_step - 1)
# t.step("drop_weights")
if training_step >= 2:
    await drop_weights(training_step - 1)
```
Member:

Can this be truly async or does it have to be blocking like this?

casteryh (Contributor Author):

This can be truly async if we just create a task and don't await it.
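A minimal sketch of the fire-and-forget pattern being described, using plain asyncio (the `drop_weights` body here is a placeholder, and keeping a reference to each task guards against premature garbage collection):

```python
import asyncio

async def drop_weights(version: int, dropped: list) -> None:
    # Placeholder for the real cleanup coroutine.
    await asyncio.sleep(0)
    dropped.append(version)

async def training_loop() -> list:
    dropped: list = []
    background_tasks: set = set()
    for training_step in range(4):
        if training_step >= 2:
            # Schedule cleanup without blocking the training step.
            task = asyncio.create_task(drop_weights(training_step - 1, dropped))
            background_tasks.add(task)
            # Drop the reference once the task completes.
            task.add_done_callback(background_tasks.discard)
    # Before exiting, make sure the scheduled cleanups have finished.
    await asyncio.gather(*background_tasks)
    return dropped

dropped_versions = asyncio.run(training_loop())
print(sorted(dropped_versions))
```

Without the final gather (or a later join), a task scheduled on the last step could still be pending when the loop exits.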

@casteryh changed the title from "factor out weight cleanup to separate file" to "factor out weight cleanup to separate file, also non-blocking now" on Oct 7, 2025
casteryh (Contributor Author) commented Oct 7, 2025

[image]

Ended up doing a whole refactor. It now does the weight cleanup in the background in a non-blocking way.
tested: unit tests and running grpo main (https://meta.wandb.io/torchforge/grpo-training/runs/r8q88uie)
ptal @joecummings

LucasLLC (Contributor) left a comment:

Asyncio looks reasonable to me. A good way to measure this would be to run this code for a while, always synchronizing (finishing your tasks) on the weight-sync step, and measure the delta between doing that and waiting until the next step.
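A rough sketch of that measurement, comparing a step that synchronizes on cleanup against one that defers it to a background task (the 50 ms sleep is a stand-in for the real deletion work):

```python
import asyncio
import time

async def fake_cleanup() -> None:
    await asyncio.sleep(0.05)  # stand-in for deleting old weights

async def blocking_step() -> float:
    # Synchronize: finish the cleanup inside the measured region.
    start = time.perf_counter()
    await fake_cleanup()
    return time.perf_counter() - start

async def nonblocking_step() -> float:
    # Defer: schedule cleanup and measure only the scheduling cost.
    start = time.perf_counter()
    task = asyncio.create_task(fake_cleanup())
    elapsed = time.perf_counter() - start
    await task  # join later, outside the measured region
    return elapsed

async def main() -> tuple:
    return await blocking_step(), await nonblocking_step()

blocking, nonblocking = asyncio.run(main())
print(blocking > nonblocking)
```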

```python
start_time = time.perf_counter()
prefix = get_param_prefix(version)
matching_keys = await ts.keys(prefix)
# TODO: once we have something like `get_meta()` in torchstore, we can just
```
Contributor:

We do have a `get_meta` in torchstore (although it's lacking a proper object).

```python
matching_keys = await ts.keys(prefix)
# TODO: once we have something like `get_meta()` in torchstore, we can just
# query the type of the object instead of relying on keys.
dcp_key = get_dcp_whole_state_dict_key(version)
```
Contributor:

Is this implementation specific to DCP?

Do we need something like `ts.delete(r"key.*")` support in torchstore?

casteryh (Contributor Author):

> Is this implementation specific to DCP?

Yes.

> Do we need something like `ts.delete(r"key.*")` support in torchstore?

It would be good if we could have it, although currently it is not a bottleneck to simply call delete on every key.
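A sketch of the per-key deletion being discussed, with an in-memory dict standing in for the store (only the `keys`/`delete` calls mirror the PR's torchstore usage; the mock class and key scheme are assumptions for illustration):

```python
import asyncio

class InMemoryStore:
    """Stand-in for the torchstore client (a mock for illustration)."""

    def __init__(self):
        self._data = {}

    async def put(self, key: str, value) -> None:
        self._data[key] = value

    async def keys(self, prefix: str) -> list:
        return [k for k in self._data if k.startswith(prefix)]

    async def delete(self, key: str) -> None:
        del self._data[key]

async def drop_weights(ts: InMemoryStore, version: int) -> int:
    # Until something like a regex-based ts.delete exists, list the
    # matching keys and delete them one by one.
    prefix = f"policy_ver_{version:010d}"  # hypothetical key scheme
    matching_keys = await ts.keys(prefix)
    for key in matching_keys:
        await ts.delete(key)
    return len(matching_keys)

async def main() -> tuple:
    ts = InMemoryStore()
    await ts.put("policy_ver_0000000001/layer.0", b"...")
    await ts.put("policy_ver_0000000001/layer.1", b"...")
    await ts.put("policy_ver_0000000002/layer.0", b"...")
    deleted = await drop_weights(ts, 1)
    return deleted, await ts.keys("policy_ver")

deleted, remaining = asyncio.run(main())
print(deleted, remaining)
```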

```python
if training_step >= 2:
    await drop_weights(training_step - 1)
    t.step("drop_weights")
# weight cleanup is non-blocking, the task is executed in the background
```
Member:

How are you confirming this all finishes before adding more weights?

Also in typical async form this step would just be an async method that you'd await now or later. Why is there an extra method called "wait"?

casteryh (Contributor Author), Oct 7, 2025:

> How are you confirming this all finishes before adding more weights?

I thought the point is that you don't, if you just need the weights to be eventually deleted. When you call step(), the task is scheduled in the background and everything else proceeds as normal.

> Also in typical async form this step would just be an async method that you'd await now or later.

Yes, but in that case, if we want to schedule the task in the background and not await it, we need to manage the task in main.py, which we supposedly don't want to do. This essentially hides the task-scheduling logic in the WeightCleaner class.

> Why is there an extra method called "wait"?

If you want to make sure all the scheduled tasks have indeed completed (i.e. all old weights are deleted, like you mentioned earlier), you can await weight_cleaner.wait(). Presumably this can be named better; let me know what you think.

casteryh (Contributor Author):

> Also in typical async form this step would just be an async method that you'd await now or later. Why is there an extra method called "wait"?

My understanding is that in typical async code, a coroutine never gets executed unless you either await it or explicitly create a task from it. I think we could also always schedule the task and return a join handle.
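Under those assumptions, a minimal sketch of the WeightCleaner shape being discussed — step() fires off a background task, wait() joins whatever is still outstanding (the method bodies here are guesses for illustration, not the PR's actual code):

```python
import asyncio

async def drop_weights(version: int, log: list) -> None:
    # Placeholder for the real deletion coroutine.
    await asyncio.sleep(0)
    log.append(version)

class WeightCleaner:
    """Hides the task-scheduling logic, as described in the thread."""

    def __init__(self, log: list):
        self._log = log
        self._tasks: set = set()

    def step(self, training_step: int) -> None:
        # Fire-and-forget: schedule cleanup of the previous version
        # without blocking the training loop.
        if training_step >= 2:
            task = asyncio.create_task(drop_weights(training_step - 1, self._log))
            self._tasks.add(task)
            task.add_done_callback(self._tasks.discard)

    async def wait(self) -> None:
        # Join all outstanding cleanup tasks (e.g. before shutdown).
        await asyncio.gather(*self._tasks)

async def main() -> list:
    log: list = []
    cleaner = WeightCleaner(log)
    for training_step in range(5):
        cleaner.step(training_step)
    await cleaner.wait()
    return sorted(log)

result = asyncio.run(main())
print(result)
```

Callers that don't care about completion never touch wait(); callers that do (tests, shutdown paths) get a single join point.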

Member:

Move to core app/

casteryh (Contributor Author):

Everything, or only the WeightCleaner? Trainer and policy both need functions in this file.

casteryh (Contributor Author):
ptal @joecummings

@casteryh casteryh requested a review from Jack-Khuu October 16, 2025 22:58
4 participants