Description
🐛 Describe the bug
python -m apps.grpo.main --config apps/grpo/qwen3_8b.yaml
Error Log
...
[TitanTrainer-0/2] 2025-11-14 20:00:59 INFO Pushing weights for policy version 1
[TitanTrainer-1/2] 2025-11-14 20:00:59 INFO Pushing weights for policy version 1
[ReferenceModel-0/1] 2025-11-14 20:01:02 INFO [GC] Performing periodic GC collection took 0.00 seconds
[ReferenceModel-0/1] 2025-11-14 20:01:10 INFO [GC] Performing periodic GC collection took 0.00 seconds
[TitanTrainer-0/2] 2025-11-14 20:01:10 INFO Completed weights push in 11.85 seconds
[TitanTrainer-1/2] 2025-11-14 20:01:11 INFO Completed weights push in 12.03 seconds
[Generator-0/1] 2025-11-14 20:01:11 INFO [Generator] Fetching weights for v1 to shared memory
CRITICAL:root:Unhandled exception in actor endpoint
Traceback (most recent call last):
  File "/mnt/shared-storage-user/[USERNAME]/zeocax/envs/torchforge/lib/python3.12/site-packages/monarch/_src/actor/actor_mesh.py", line 935, in handle
    result = await the_method(*args, **kwargs)
  File "/mnt/shared-storage-user/[USERNAME]/zeocax/envs/torchforge/lib/python3.12/site-packages/torchstore/storage_volume.py", line 65, in get
    return await self.store.get(key, transport_buffer, request)
  File "/mnt/shared-storage-user/[USERNAME]/zeocax/envs/torchforge/lib/python3.12/site-packages/torchstore/storage_volume.py", line 257, in get
    await transport_buffer.write_from(extracted_tensor)
  File "/mnt/shared-storage-user/[USERNAME]/zeocax/envs/torchforge/lib/python3.12/site-packages/torchstore/transport/buffers.py", line 219, in write_from
    await self.rdma_buffers[idx].write_from(chunk)
  File "/mnt/shared-storage-user/[USERNAME]/zeocax/envs/torchforge/lib/python3.12/site-packages/monarch/_src/actor/future.py", line 138, in mark_complete
    func, value = fut.set_result, await coro
  File "/mnt/shared-storage-user/[USERNAME]/zeocax/envs/torchforge/lib/python3.12/site-packages/monarch/_src/rdma/rdma.py", line 333, in write_from_nonblocking
    res = await self._buffer.write_from(
Exception: failed to write from buffer: RDMA polling completion failed: Send work completion failed with status: 9, vendor error: 138, wr_id: 0, send_cq_idx: 0 [lkey=2546176, rkey=2546176, addr=0x7fe166e7f040, size=622329856]
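This might be a monarch issue, since the exception is raised from monarch/_src/rdma/rdma.py during the generator's weight fetch.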
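For triage, the numeric fields in the exception message can be decoded with a short script. This is only a sketch: it assumes the reported `status` is the raw `ibv_wc_status` value from libibverbs (`infiniband/verbs.h`), which the log does not confirm, and the helper names `parse_rdma_error` and `describe` are made up for illustration. The vendor error code (138) is device-specific and is left undecoded.

```python
import re

# Names of enum ibv_wc_status from libibverbs, in declaration order
# (index == numeric value). Assumption: monarch reports this raw value.
IBV_WC_STATUS = [
    "IBV_WC_SUCCESS", "IBV_WC_LOC_LEN_ERR", "IBV_WC_LOC_QP_OP_ERR",
    "IBV_WC_LOC_EEC_OP_ERR", "IBV_WC_LOC_PROT_ERR", "IBV_WC_WR_FLUSH_ERR",
    "IBV_WC_MW_BIND_ERR", "IBV_WC_BAD_RESP_ERR", "IBV_WC_LOC_ACCESS_ERR",
    "IBV_WC_REM_INV_REQ_ERR", "IBV_WC_REM_ACCESS_ERR", "IBV_WC_REM_OP_ERR",
    "IBV_WC_RETRY_EXC_ERR", "IBV_WC_RNR_RETRY_EXC_ERR",
    "IBV_WC_LOC_RDD_VIOL_ERR", "IBV_WC_REM_INV_RD_REQ_ERR",
    "IBV_WC_REM_ABORT_ERR", "IBV_WC_INV_EECN_ERR",
    "IBV_WC_INV_EEC_STATE_ERR", "IBV_WC_FATAL_ERR",
    "IBV_WC_RESP_TIMEOUT_ERR", "IBV_WC_GENERAL_ERR",
]

# Matches "name: 123", "name=123" and "name=0xabc" pairs in the message.
FIELD_RE = re.compile(
    r"(status|vendor error|wr_id|send_cq_idx|lkey|rkey|addr|size)"
    r"[:=]\s*(0x[0-9a-fA-F]+|\d+)"
)


def parse_rdma_error(message):
    """Extract the numeric fields of the error message into a dict."""
    return {
        name.replace(" ", "_"): int(raw, 0)
        for name, raw in FIELD_RE.findall(message)
    }


def describe(message):
    """Return (status name, transfer size in MiB) for an error message."""
    fields = parse_rdma_error(message)
    status = fields.get("status")
    if status is not None and 0 <= status < len(IBV_WC_STATUS):
        name = IBV_WC_STATUS[status]
    else:
        name = "unknown"
    size_mib = fields["size"] / 2**20 if "size" in fields else None
    return name, size_mib


# The exception text copied from the log above.
MESSAGE = (
    "failed to write from buffer: RDMA polling completion failed: "
    "Send work completion failed with status: 9, vendor error: 138, "
    "wr_id: 0, send_cq_idx: 0 [lkey=2546176, rkey=2546176, "
    "addr=0x7fe166e7f040, size=622329856]"
)

if __name__ == "__main__":
    print(describe(MESSAGE))
```

Under that assumption, status 9 corresponds to `IBV_WC_REM_INV_REQ_ERR` (the remote side rejected the request as invalid), and the failing write is a single transfer of about 593.5 MiB. That may be worth comparing against the device's maximum message size, for example the `max_msg_sz` value printed by `ibv_devinfo -v`.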
Versions
torchmonarch 0.1.2
torchshow 0.5.2
torchstore 0.1.2
torchtitan 0.2.0
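If it helps others compare environments, a version list like the one above can be collected with the standard library. This is a sketch, not the method used for this report; the helper name `installed_versions` is made up for illustration, and the package names are the ones listed above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version string."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            # Keep the entry so missing packages are visible in the report.
            result[name] = "not installed"
    return result


if __name__ == "__main__":
    names = ["torchmonarch", "torchstore", "torchtitan"]
    for name, ver in installed_versions(names).items():
        print(name, ver)
```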