
Conversation

casteryh
Contributor

@casteryh casteryh commented Sep 1, 2025

Add AsyncRWLock-Based Weight Transfer for GRPO Training

Overview

Implements weight transfer between the GRPO trainer and policy actors, using AsyncRWLock for concurrency control and temporary files for serialization.

Key Changes

1. AsyncRWLock for Safe Weight Updates

  • Added src/forge/controller/synchronization.py with an AsyncRWLock implementation
  • When a weight update is requested, new generation requests are canceled immediately
  • Existing requests complete before the weight update proceeds
  • Prevents race conditions during model weight updates (see the sketch after this list)
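
A minimal sketch of the writer-priority behavior described above; the class name comes from this PR, but the method names, the use of async context managers, and the exception raised for rejected readers are illustrative assumptions rather than the actual API in src/forge/controller/synchronization.py:

```python
import asyncio
from contextlib import asynccontextmanager


class AsyncRWLock:
    """Illustrative reader-writer lock: new readers fail fast once a writer is waiting."""

    def __init__(self) -> None:
        self._readers = 0
        self._writers_waiting = 0
        self._writer_active = False
        self._cond = asyncio.Condition()

    @asynccontextmanager
    async def read(self):
        async with self._cond:
            if self._writers_waiting or self._writer_active:
                # New generation requests are rejected while an update is pending.
                raise RuntimeError("weight update in progress")
            self._readers += 1
        try:
            yield
        finally:
            async with self._cond:
                self._readers -= 1
                self._cond.notify_all()

    @asynccontextmanager
    async def write(self):
        async with self._cond:
            self._writers_waiting += 1
            # Wait for in-flight generation requests to drain and any active writer to finish.
            await self._cond.wait_for(
                lambda: self._readers == 0 and not self._writer_active
            )
            self._writers_waiting -= 1
            self._writer_active = True
        try:
            yield
        finally:
            async with self._cond:
                self._writer_active = False
                self._cond.notify_all()
```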

2. GRPO Weight Update Integration

  • Added an export_weights() method to the Trainer class in apps/grpo/main.py
  • Exports model weights to SafeTensors format in temporary files
  • Integrated periodic weight updates into the training loop
  • Uses /dev/shm for memory-backed storage when available (sketched below)
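
A rough sketch of what the export path could look like; the WeightsHandle fields and the exact signature of export_weights() are assumptions for illustration, not the PR's actual code:

```python
import os
import tempfile
from dataclasses import dataclass

import torch
from safetensors.torch import save_file


@dataclass
class WeightsHandle:
    """Hypothetical handle returned to the caller; the PR's actual fields may differ."""
    path: str
    version: int


class Trainer:
    model: torch.nn.Module  # set up elsewhere in apps/grpo/main.py

    def export_weights(self, version: int) -> WeightsHandle:
        # Prefer memory-backed /dev/shm when present, else the default temp dir.
        tmp_dir = "/dev/shm" if os.path.isdir("/dev/shm") else None
        fd, path = tempfile.mkstemp(suffix=".safetensors", dir=tmp_dir)
        os.close(fd)
        # safetensors expects contiguous CPU tensors.
        state = {
            k: v.detach().contiguous().cpu()
            for k, v in self.model.state_dict().items()
        }
        save_file(state, path)
        return WeightsHandle(path=path, version=version)
```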

3. vLLM Model Weight Loading

  • Implements _update_from_file() in src/forge/actors/policy.py
  • Uses vLLM's model.load_weights() method, which handles the HuggingFace → vLLM format conversion
  • Converts SafeTensors contents to the required Iterable[tuple[str, Tensor]] format
  • Leverages vLLM's existing tensor-parallel loading logic (sketched below)
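
A sketch of the loading side, assuming direct access to the underlying vLLM model object inside the policy worker; the PolicyWorker class name is a placeholder, and how the model object is reached varies across vLLM versions:

```python
from safetensors import safe_open


class PolicyWorker:
    # self.model is assumed to be the underlying vLLM nn.Module.

    def _update_from_file(self, path: str) -> None:
        """Stream weights from a SafeTensors file into vLLM's loader."""

        def weight_iter():
            # safe_open reads tensors lazily rather than materializing a full state dict.
            with safe_open(path, framework="pt", device="cpu") as f:
                for name in f.keys():
                    yield name, f.get_tensor(name)

        # vLLM models expose load_weights(Iterable[tuple[str, torch.Tensor]]),
        # which applies the HuggingFace -> vLLM name/layout mapping internally.
        self.model.load_weights(weight_iter())
```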

4. Temporary File-Based Transfer

  • Uses SafeTensors files instead of TorchStore due to implementation issues
  • TorchStore client serialization through Monarch RPC is unreliable
  • Sending actor references via RPC also exhibits buggy behavior
  • Temporary files provide a stable workaround until the RPC issues are resolved

5. Monarch RPC Compatibility Issues

  • The current implementation reveals limitations in what objects can be sent via Monarch RPC
  • TorchStore clients and actor references fail to serialize/deserialize properly
  • Need systematic investigation of RPC-compatible types
  • Using file-based approach as interim solution

Technical Details

Weight transfer flow:

  1. Trainer serializes model weights to a SafeTensors file in a temp directory
  2. Returns a WeightsHandle with the file path
  3. Policy worker loads the file and converts it to vLLM format
  4. Uses vLLM's load_weights() for a proper model update (handoff sketched below)
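
Putting the pieces together, the handoff might look roughly like this; the actor method names and call style are placeholders rather than the PR's actual Monarch endpoints:

```python
import os


async def sync_weights(trainer, policy, version: int) -> None:
    # Steps 1-2: trainer writes a SafeTensors file and returns a handle to it.
    handle = trainer.export_weights(version)
    # Steps 3-4: the policy worker reads the file and applies it via vLLM's load_weights().
    await policy.update_weights(handle)
    # Manual cleanup of the temp file (see Known Limitations below).
    os.remove(handle.path)
```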

Concurrency control:

  • Generation requests acquire a read lock and fail fast if a writer is waiting
  • Weight updates acquire the write lock and block until all readers finish
  • Prevents corrupted model state during updates (usage sketched after this list)
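
On the policy side, the two lock paths could be used roughly like this; the method names are illustrative, and the read/write context managers match the AsyncRWLock sketch above:

```python
class Policy:
    # ... vLLM engine and AsyncRWLock setup elided ...

    async def generate(self, prompt):
        # Readers fail fast if a weight update is pending, so requests are rejected
        # rather than queuing behind the writer and running against stale weights.
        async with self._rw_lock.read():
            return await self._run_generation(prompt)

    async def update_weights(self, handle):
        # The writer blocks until all in-flight generations drain, then swaps weights.
        async with self._rw_lock.write():
            self._update_from_file(handle.path)
```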

Testing

  • AsyncRWLock test suite covering all concurrency scenarios
  • Tests a writer blocking readers, a waiting writer blocking new readers, and writer mutual exclusion (one scenario is sketched below)
  • Exception safety and cleanup verification
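
One of the scenarios above, sketched as a pytest-asyncio test; the import path is inferred from the file location named earlier, and the exception type used for fail-fast readers is an assumption:

```python
import asyncio

import pytest

# Assumed import path, based on src/forge/controller/synchronization.py above.
from forge.controller.synchronization import AsyncRWLock


@pytest.mark.asyncio
async def test_waiting_writer_blocks_new_readers():
    lock = AsyncRWLock()

    async def take_write_lock():
        async with lock.write():
            pass

    async with lock.read():
        # Start a writer; it must wait for the reader above to release.
        writer = asyncio.create_task(take_write_lock())
        await asyncio.sleep(0)  # let the writer register as waiting
        # A new reader should now be rejected instead of starving the writer.
        with pytest.raises(RuntimeError):
            async with lock.read():
                pass
    await writer
```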

Known Limitations

  • File cleanup requires manual management
  • Large models need sufficient /dev/shm space
  • RPC compatibility scope unclear, needs investigation

Future Work

  • Investigate full Monarch RPC type compatibility
  • Implement TorchStore integration once RPC resolved
  • Add automatic temp file cleanup

@meta-cla bot added the CLA Signed label Sep 1, 2025
@allenwang28
Contributor

Thanks @casteryh for taking the initiative! This is an interesting approach, but we're actively developing towards using torchstore as our abstraction for weights sync (e.g. https://github.com/meta-pytorch/forge/pull/108/files)

@casteryh
Contributor Author

casteryh commented Sep 2, 2025

> Thanks @casteryh for taking the initiative! This is an interesting approach, but we're actively developing towards using torchstore as our abstraction for weights sync (e.g. https://github.com/meta-pytorch/forge/pull/108/files)

Hi @allenwang28, the main point of this PR is orthogonal to the underlying backend of the weight sync mechanism:

  • We need some sort of locking and cancellation mechanism to ensure the integrity of model weights and to ensure model updates actually get scheduled.
  • It also adds the policy versioning logic.

The file-based weight transfer is only a placeholder until the actual torchstore transfer mechanism is ready, so that we have a runnable prototype. Can you review the rest of the PR?

@pbontrager
Contributor

The weight sync solution does not require locking. There's already a workstream to build out the solution. You can inquire about it to learn more. Since we are not moving in this direction, I'm going to close this PR now.

@pbontrager pbontrager closed this Sep 2, 2025