
RDMA Write Operation Fails When Fetching Model Weights #572

@zeocax


🐛 Describe the bug

Reproduction command:

python -m apps.grpo.main --config apps/grpo/qwen3_8b.yaml

Error Log

...
[TitanTrainer-0/2] 2025-11-14 20:00:59 INFO Pushing weights for policy version 1
[TitanTrainer-1/2] 2025-11-14 20:00:59 INFO Pushing weights for policy version 1
[ReferenceModel-0/1] 2025-11-14 20:01:02 INFO [GC] Performing periodic GC collection took 0.00 seconds
[ReferenceModel-0/1] 2025-11-14 20:01:10 INFO [GC] Performing periodic GC collection took 0.00 seconds
[TitanTrainer-0/2] 2025-11-14 20:01:10 INFO Completed weights push in 11.85 seconds
[TitanTrainer-1/2] 2025-11-14 20:01:11 INFO Completed weights push in 12.03 seconds
[Generator-0/1] 2025-11-14 20:01:11 INFO [Generator] Fetching weights for v1 to shared memory 
CRITICAL:root:Unhandled exception in actor endpoint
Traceback (most recent call last):
  File "/mnt/shared-storage-user/[USERNAME]/zeocax/envs/torchforge/lib/python3.12/site-packages/monarch/_src/actor/actor_mesh.py", line 935, in handle
    result = await the_method(*args, **kwargs) 
  File "/mnt/shared-storage-user/[USERNAME]/zeocax/envs/torchforge/lib/python3.12/site-packages/torchstore/storage_volume.py", line 65, in get
    return await self.store.get(key, transport_buffer, request)
  File "/mnt/shared-storage-user/[USERNAME]/zeocax/envs/torchforge/lib/python3.12/site-packages/torchstore/storage_volume.py", line 257, in get
    await transport_buffer.write_from(extracted_tensor)
  File "/mnt/shared-storage-user/[USERNAME]/zeocax/envs/torchforge/lib/python3.12/site-packages/torchstore/transport/buffers.py", line 219, in write_from
    await self.rdma_buffers[idx].write_from(chunk)
  File "/mnt/shared-storage-user/[USERNAME]/zeocax/envs/torchforge/lib/python3.12/site-packages/monarch/_src/actor/future.py", line 138, in mark_complete
    func, value = fut.set_result, await coro
  File "/mnt/shared-storage-user/[USERNAME]/zeocax/envs/torchforge/lib/python3.12/site-packages/monarch/_src/rdma/rdma.py", line 333, in write_from_nonblocking
    res = await self._buffer.write_from(
Exception: failed to write from buffer: RDMA polling completion failed: Send work completion failed with status: 9, vendor error: 138, wr_id: 0, send_cq_idx: 0 [lkey=2546176, rkey=2546176, addr=0x7fe166e7f040, size=622329856]

Could this be a Monarch issue rather than a torchstore one? The failure surfaces in monarch's RDMA layer (monarch/_src/rdma/rdma.py).
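
For reference, the status code in the completion error appears to follow libibverbs' ibv_wc_status numbering, under which status 9 is IBV_WC_REM_INV_REQ_ERR (remote invalid request error). A minimal decode sketch, assuming the standard libibverbs enum order (this lookup table is illustrative only, not a monarch or torchstore API):

# Decode the ibv_wc_status code from "Send work completion failed with
# status: 9". Assumes the standard libibverbs enum numbering.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",   # remote rejected the request as invalid
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

print(IBV_WC_STATUS[9])  # -> IBV_WC_REM_INV_REQ_ERR

IBV_WC_REM_INV_REQ_ERR generally indicates that the remote peer detected an invalid request, for example a message that does not fit the remote's registered buffer. The failing write here is a single chunk of size=622329856 (~593 MiB), which may be worth checking against the remote buffer registration; this is a guess, not a confirmed root cause.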

Versions

torchmonarch 0.1.2
torchshow 0.5.2
torchstore 0.1.2
torchtitan 0.2.0
