
Conversation

casteryh
Contributor

@casteryh casteryh commented Sep 1, 2025

Add AsyncRWLock-Based Weight Transfer for GRPO Training

Overview

Implements weight transfer between the GRPO trainer and policy actors, using AsyncRWLock for concurrency control and temporary files for serialization.

Key Changes

1. AsyncRWLock for Safe Weight Updates

  • Added src/forge/controller/synchronization.py with an AsyncRWLock implementation
  • When a weight update is requested, new generation requests are canceled immediately
  • Existing requests complete before the weight update proceeds
  • Prevents race conditions during model weight updates (see the sketch after this list)
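
A minimal sketch of the writer-priority behavior described above; the class name comes from this PR, but the method names, the use of async context managers, and the exception raised for rejected readers are illustrative assumptions rather than the actual API in src/forge/controller/synchronization.py:

```python
import asyncio
from contextlib import asynccontextmanager


class AsyncRWLock:
    """Illustrative reader-writer lock: new readers fail fast once a writer is waiting."""

    def __init__(self) -> None:
        self._readers = 0
        self._writers_waiting = 0
        self._writer_active = False
        self._cond = asyncio.Condition()

    @asynccontextmanager
    async def read(self):
        async with self._cond:
            if self._writers_waiting or self._writer_active:
                # New generation requests are rejected while an update is pending.
                raise RuntimeError("weight update in progress")
            self._readers += 1
        try:
            yield
        finally:
            async with self._cond:
                self._readers -= 1
                self._cond.notify_all()

    @asynccontextmanager
    async def write(self):
        async with self._cond:
            self._writers_waiting += 1
            # Wait for in-flight generation requests to drain and any active writer to finish.
            await self._cond.wait_for(
                lambda: self._readers == 0 and not self._writer_active
            )
            self._writers_waiting -= 1
            self._writer_active = True
        try:
            yield
        finally:
            async with self._cond:
                self._writer_active = False
                self._cond.notify_all()
```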

2. GRPO Weight Update Integration

  • Added an export_weights() method to the Trainer class in apps/grpo/main.py
  • Exports model weights to SafeTensors format in temporary files
  • Integrated periodic weight updates into the training loop
  • Uses /dev/shm for memory-backed storage when available (sketched below)
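
A rough sketch of what the export path could look like; the WeightsHandle fields and the exact signature of export_weights() are assumptions for illustration, not the PR's actual code:

```python
import os
import tempfile
from dataclasses import dataclass

import torch
from safetensors.torch import save_file


@dataclass
class WeightsHandle:
    """Hypothetical handle returned to the caller; the PR's actual fields may differ."""
    path: str
    version: int


class Trainer:
    model: torch.nn.Module  # set up elsewhere in apps/grpo/main.py

    def export_weights(self, version: int) -> WeightsHandle:
        # Prefer memory-backed /dev/shm when present, else the default temp dir.
        tmp_dir = "/dev/shm" if os.path.isdir("/dev/shm") else None
        fd, path = tempfile.mkstemp(suffix=".safetensors", dir=tmp_dir)
        os.close(fd)
        # safetensors expects contiguous CPU tensors.
        state = {
            k: v.detach().contiguous().cpu()
            for k, v in self.model.state_dict().items()
        }
        save_file(state, path)
        return WeightsHandle(path=path, version=version)
```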

3. vLLM Model Weight Loading

  • Implements _update_from_file() in src/forge/actors/policy.py
  • Uses vLLM's model.load_weights() method, which handles the HuggingFace → vLLM format conversion
  • Converts SafeTensors contents to the required Iterable[tuple[str, Tensor]] format
  • Leverages vLLM's existing tensor-parallel loading logic (sketched below)
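
A sketch of the loading side, assuming direct access to the underlying vLLM model object inside the policy worker; the PolicyWorker class name is a placeholder, and how the model object is reached varies across vLLM versions:

```python
from safetensors import safe_open


class PolicyWorker:
    # self.model is assumed to be the underlying vLLM nn.Module.

    def _update_from_file(self, path: str) -> None:
        """Stream weights from a SafeTensors file into vLLM's loader."""

        def weight_iter():
            # safe_open reads tensors lazily rather than materializing a full state dict.
            with safe_open(path, framework="pt", device="cpu") as f:
                for name in f.keys():
                    yield name, f.get_tensor(name)

        # vLLM models expose load_weights(Iterable[tuple[str, torch.Tensor]]),
        # which applies the HuggingFace -> vLLM name/layout mapping internally.
        self.model.load_weights(weight_iter())
```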

4. Temporary File-Based Transfer

  • Uses SafeTensors files instead of TorchStore due to implementation issues
  • TorchStore client serialization through Monarch RPC is unreliable
  • Sending actor references via RPC also exhibits buggy behavior
  • Temporary files provide a stable workaround until the RPC issues are resolved

5. Monarch RPC Compatibility Issues

  • The current implementation reveals limitations in what objects can be sent via Monarch RPC
  • TorchStore clients and actor references fail to serialize/deserialize properly
  • Need systematic investigation of RPC-compatible types
  • Using file-based approach as interim solution

Technical Details

Weight transfer flow:

  1. Trainer serializes model weights to a SafeTensors file in a temp directory
  2. Returns a WeightsHandle with the file path
  3. Policy worker loads the file and converts it to vLLM format
  4. Uses vLLM's load_weights() for a proper model update (handoff sketched below)
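
Putting the pieces together, the handoff might look roughly like this; the actor method names and call style are placeholders rather than the PR's actual Monarch endpoints:

```python
import os


async def sync_weights(trainer, policy, version: int) -> None:
    # Steps 1-2: trainer writes a SafeTensors file and returns a handle to it.
    handle = trainer.export_weights(version)
    # Steps 3-4: the policy worker reads the file and applies it via vLLM's load_weights().
    await policy.update_weights(handle)
    # Manual cleanup of the temp file (see Known Limitations below).
    os.remove(handle.path)
```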

Concurrency control:

  • Generation requests acquire a read lock and fail fast if a writer is waiting
  • Weight updates acquire the write lock and block until all readers finish
  • Prevents corrupted model state during updates (usage sketched after this list)
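
On the policy side, the two lock paths could be used roughly like this; the method names are illustrative, and the read/write context managers match the AsyncRWLock sketch above:

```python
class Policy:
    # ... vLLM engine and AsyncRWLock setup elided ...

    async def generate(self, prompt):
        # Readers fail fast if a weight update is pending, so requests are rejected
        # rather than queuing behind the writer and running against stale weights.
        async with self._rw_lock.read():
            return await self._run_generation(prompt)

    async def update_weights(self, handle):
        # The writer blocks until all in-flight generations drain, then swaps weights.
        async with self._rw_lock.write():
            self._update_from_file(handle.path)
```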

Testing

  • AsyncRWLock test suite covering all concurrency scenarios
  • Tests a writer blocking readers, a waiting writer blocking new readers, and writer mutual exclusion (one scenario is sketched below)
  • Exception safety and cleanup verification
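
One of the scenarios above, sketched as a pytest-asyncio test; the import path is inferred from the file location named earlier, and the exception type used for fail-fast readers is an assumption:

```python
import asyncio

import pytest

# Assumed import path, based on src/forge/controller/synchronization.py above.
from forge.controller.synchronization import AsyncRWLock


@pytest.mark.asyncio
async def test_waiting_writer_blocks_new_readers():
    lock = AsyncRWLock()

    async def take_write_lock():
        async with lock.write():
            pass

    async with lock.read():
        # Start a writer; it must wait for the reader above to release.
        writer = asyncio.create_task(take_write_lock())
        await asyncio.sleep(0)  # let the writer register as waiting
        # A new reader should now be rejected instead of starving the writer.
        with pytest.raises(RuntimeError):
            async with lock.read():
                pass
    await writer
```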

Known Limitations

  • File cleanup requires manual management
  • Large models need sufficient /dev/shm space
  • RPC compatibility scope unclear, needs investigation

Future Work

  • Investigate full Monarch RPC type compatibility
  • Implement TorchStore integration once RPC resolved
  • Add automatic temp file cleanup

@meta-cla bot added the CLA Signed label Sep 1, 2025
@allenwang28
Contributor

Thanks @casteryh for taking the initiative! This is an interesting approach, but we're actively developing towards using torchstore as our abstraction for weights sync (e.g. https://github.com/meta-pytorch/forge/pull/108/files)

@casteryh
Contributor Author

casteryh commented Sep 2, 2025

> Thanks @casteryh for taking the initiative! This is an interesting approach, but we're actively developing towards using torchstore as our abstraction for weights sync (e.g. https://github.com/meta-pytorch/forge/pull/108/files)

Hi @allenwang28, the main point of this PR is orthogonal to the underlying backend of the weight sync mechanism:

  • We need some sort of locking and cancellation mechanism to ensure the integrity of model weights and to ensure model updates actually get scheduled.
  • It also adds the policy versioning logic.

The file-based weight transfer is only a placeholder until the actual torchstore transfer mechanism is ready, so that we have a runnable prototype. Can you review the rest of the PR?

@pbontrager
Contributor

The weight sync solution does not require locking. There's already a workstream to build out the solution. You can inquire about it to learn more. Since we are not moving in this direction, I'm going to close this PR now.

@pbontrager pbontrager closed this Sep 2, 2025