shared memory multiprocess prefetch for weight update #430
base: main
Conversation
    log_stats=None,
)
self._start_processing()
fetcher_procs = this_host().spawn_procs(
Can we guard this with `if prefetch_weights`? It may also be a bit cleaner to put this init in another function, with some documentation for what the fetchers here do.
will move to a separate function.
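A rough sketch of what that guarded, factored-out init could look like, as a method on the Generator; the helper name and the fetcher actor-spawn call are assumptions rather than the PR's final code:

# Sketch only: helper name and actor-spawning details are assumptions.
def _maybe_spawn_weight_fetchers(self) -> None:
    """Spawn _WeightFetcher actors on this host. They prefetch weights from
    torchstore into shared memory so the GPU workers can map them locally
    instead of each pulling over the network. No-op when prefetch is disabled."""
    if not self.prefetch_weights_to_shm:
        self._fetchers = None
        return
    fetcher_procs = this_host().spawn_procs(
        per_host={"procs": self.n_fetcher_procs}
    )
    # Assumed spawn signature: (actor name, actor class).
    self._fetchers = fetcher_procs.spawn("weight_fetcher", _WeightFetcher)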
)
self._start_processing()
fetcher_procs = this_host().spawn_procs(
    per_host={"procs": self.n_fetcher_procs}
Not something we need to address now, but I think we need to spawn these fetcher procs across all generator nodes, right?
Yes, but I assume the setup call is broadcast and every Generator node will spawn its own fetcher_procs?
That is correct. I meant the case where a generator's workers span 2 nodes, e.g. DeepSeek. In that case we would probably want to spin up the fetchers on the worker nodes, right?
Correct
I am a bit confused -- shouldn't a vLLM worker be scoped to a single node?
I also don't follow why you need more than 1. Is it to allow you to parallelize torchstore requests?
@casteryh could you split the PR into separate concerns? Also maybe add a comment somewhere saying the following is up for discussion.
I also don't quite get the "before" and "after" table.
I think it makes little sense to split, see below.
Currently TorchStore RDMA only works CPU-to-CPU.
This actually comes automatically once you have separate processes fetching the weights to shared memory.
Will do
Ah, maybe it's confusing because I am trying to do two things at once. Basically the speedup comes from 1. multiprocess shared memory (this saves 30 seconds) and 2. prefetching while on-the-fly generation completes (this saves about 10 seconds on average).
Ahha, yes, I got it now. Okay, so the boolean guard is not "use prefetch or not" -- it should be "use shared memory or not". I think you should try profiling with proc = 8 / replica for the policy.
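For illustration, a minimal sketch of the prefetch-during-generation overlap described two comments up; fetch_all and finish_inflight_generation are hypothetical names, and only update_weights appears elsewhere in this PR:

import asyncio

async def sync_weights_with_prefetch(generator, fetchers, version: int) -> None:
    # Start copying the new weights from torchstore into shared memory right away.
    prefetch = asyncio.create_task(fetchers.fetch_all(version))   # hypothetical endpoint
    # Let requests that are already in flight finish on the old weights.
    await generator.finish_inflight_generation()                  # hypothetical method
    # Most of the fetch has now overlapped with decoding; only the load remains.
    handles = await prefetch
    await generator.update_weights(handles)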
@JenniferWang
@allenwang28 @JenniferWang ptal
Yes, I was thinking that tp = 8 on the policy would be worse without shared memory?
I think we should make shared memory the default for CPU-based weight sync, with a flag to turn it off.
This commit fixes multiple memory leak issues in the SharedTensor implementation by introducing explicit lifecycle management and proper cleanup patterns.

Key changes:
1. Fixed __del__ bug: changed hasattr(self, "shm") to check "_shm"
2. Added an explicit close() method for releasing shared memory handles
3. Changed tensor from @cached_property to @property with manual caching
4. Added closed-state tracking with an is_closed property
5. Made tensor access after close() raise RuntimeError (fail fast)
6. Made get_handle() after close() raise RuntimeError
7. Updated drop() to call close() first, then unlink
8. Added context manager support (__enter__/__exit__)
9. Fixed _WeightFetcher to explicitly close after getting the handle
10. Fixed GeneratorWorker to close shared memory after loading weights
11. Optimized SharedTensorHandle.drop() to not create unnecessary instances

Memory leak prevention:
- Creators must call close() after getting the handle
- Receivers must call close() after using the tensor
- One process should call drop() to unlink after all are done
- close() and drop() are idempotent and safe to call multiple times

Documentation:
- Added a comprehensive class docstring with the lifecycle model
- Documented that cached tensor references become invalid after close()
- Added warnings about not relying on __del__ for cleanup
- Added 12 new tests for close/cleanup behavior

Test results: 65/65 tests pass with no warnings
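A minimal sketch of the lifecycle model described above, built on the standard library's multiprocessing.shared_memory; only the listed behaviors (idempotent close(), drop() = close + unlink, fail-fast access after close, context-manager support, the __del__ attribute check) come from the commit message, while the constructor and attribute layout here are assumptions:

from multiprocessing import shared_memory

import torch


class SharedTensorSketch:
    """Illustrative stand-in for SharedTensor: explicit close()/drop() lifecycle."""

    def __init__(self, tensor: torch.Tensor):
        data = tensor.detach().contiguous().cpu()
        self._numel = data.numel()
        self._dtype = data.dtype
        self._shape = tuple(data.shape)
        self._shm = shared_memory.SharedMemory(
            create=True, size=max(1, self._numel * data.element_size())
        )
        # Copy the weight into the shared segment; the temporary view is dropped right away.
        torch.frombuffer(self._shm.buf, dtype=self._dtype, count=self._numel).copy_(
            data.flatten()
        )
        self._closed = False

    @property
    def is_closed(self) -> bool:
        return self._closed

    @property
    def tensor(self) -> torch.Tensor:
        if self._closed:  # fail fast instead of handing out a dangling view
            raise RuntimeError("SharedTensor is closed")
        return torch.frombuffer(
            self._shm.buf, dtype=self._dtype, count=self._numel
        ).view(self._shape)

    def close(self) -> None:
        # Release this process's handle only; idempotent.
        if not self._closed:
            self._shm.close()
            self._closed = True

    def drop(self) -> None:
        # Close, then remove the backing segment for every process.
        self.close()
        try:
            self._shm.unlink()
        except FileNotFoundError:
            pass  # another process already unlinked it

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()

    def __del__(self):
        # The __del__ bug fix: check the real attribute name before touching it.
        if hasattr(self, "_shm"):
            self.close()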
Refactor generator code to use the context manager pattern (with statement) for SharedTensor cleanup instead of explicit close() calls. This provides:
- Clearer intent: the context manager makes the lifecycle explicit
- Automatic cleanup: ensures close() is called even on exceptions
- More idiomatic Python: the standard pattern for resource management

Changes:
- GeneratorWorker.update_weights(): use 'with' for SharedTensor created from handles
- _WeightFetcher.fetch(): use 'with' when creating SharedTensor and getting the handle

The context manager automatically calls close() on exit, making the code more concise and safer.
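Roughly, the receiver-side pattern described above (which the diff hunks later in this thread do not show) would look like the following; the handle-based constructor kwarg and the load_weight helper are assumptions, not the PR's code:

def load_param_from_handle(handle) -> None:
    # Receiver side: map the segment, copy the weight into the engine,
    # then release this process's handle; the creator unlinks later via drop().
    with SharedTensor(handle=handle) as shared_tensor:   # constructor kwarg is an assumption
        load_weight(shared_tensor.tensor)                # hypothetical engine-side loader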
Yes, I believe it was ~100 seconds without shared memory for tp = 8, but I have a problem with my Slurm node and can't test now.
Change prefetch_weights_to_shm from False to True to enable the new shared memory-based weight prefetching feature by default.
done
Remove qwen3_32b_experimental.yaml as the shared memory weight prefetching feature is now enabled by default and no longer experimental.
fixed a memory leak
    shm.close()
    shm.unlink()
except Exception:
    pass
What's the consideration behind swallowing the exceptions in cleaning up the resource?
To make this idempotent and safe to call from multiple processes. Open to other ideas.
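One narrower alternative, sketched here as an assumption rather than the PR's code: only swallow the error that signals another process already unlinked the segment, and let anything unexpected surface.

from multiprocessing import shared_memory

def drop_segment(name: str) -> None:
    """Idempotent, multi-process-safe unlink of a named shared memory segment."""
    try:
        shm = shared_memory.SharedMemory(name=name)
    except FileNotFoundError:
        return  # already unlinked by another process
    shm.close()
    try:
        shm.unlink()
    except FileNotFoundError:
        pass  # raced with another process's unlink; safe to ignore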
Co-authored-by: Jiyue Wang <[email protected]>
This is awesome! I think we'll want to go over how we're doing prefetch again after some of this is upstreamed to torchstore, but otherwise it looks great. I wonder if this is too risky of a change to make before PTC, though?
)

class _WeightFetcher(ForgeActor):
I think this could be a method on the generator that gets called from main, so prefetch is controlled and visible from the main loop. I am curious if this has to actually be a separate process since this is an async method and I would think most of the time it's waiting on ts.get.
This has to be a separate actor because it has to be launched in a separate process.
param_key = get_param_key(version, name)
param = await ts.get(param_key)
# Use context manager to ensure cleanup after getting handle
with SharedTensor(tensor=param) as shared_tensor:
Is the plan to move this to TS and hide the rdma/shared memory logic from the user?
Hopefully yes.
engine_args: EngineArgs | Mapping = field(default_factory=EngineArgs)
sampling_params: SamplingParams | Mapping = field(default_factory=SamplingParams)
use_dcp_for_weight_sync: bool | None = None
prefetch_weights_to_shm: bool = True
In general we should try to avoid changing the "public" api when we expect to quickly change the backend again. After launch we should try to keep this in mind.
agreed.
Yes, it's 2x faster than 1 process. I haven't tuned this parameter too much, though. @pbontrager
I am testing its stability right now. But fwiw, the current main is not stable / well tested either.
We can also switch the flag to be False by default.
What this PR does
Perf
TL;DR: e2e weight sync time is now ~50s for Qwen3 32B; one training step takes <70s
Tested with