Closed
Labels
bug (Something isn't working)
Description
🐛 Describe the bug
I was debugging the metric logger issue by commenting out all of the metric-logger-related statements in grpo/main: #437
This code change shouldn't change anything, and the metric logger shouldn't do anything related to torchstore.
Running
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
I got:
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 4.91it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 4.91it/s]
[0] [0]
[0] [0] INFO 10-16 12:47:37 [default_loader.py:262] Loading weights took 0.46 seconds
[0] [0] INFO 10-16 12:47:38 [gpu_model_runner.py:1892] Model loading took 3.2152 GiB and 1.035845 seconds
[0] [0] INFO 10-16 12:47:43 [backends.py:530] Using cache directory: /home/dxie/.cache/vllm/torch_compile_cache/774b299512/rank_0_0/backbone for vLLM's torch.compile
[0] [0] INFO 10-16 12:47:43 [backends.py:541] Dynamo bytecode transform time: 4.28 s
[0] [0] INFO 10-16 12:47:47 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 2.596 s
[-]E1016 12:47:48.760388 775752 hyperactor/src/channel/net.rs:872] error_msg:session unix:@Z9sV1AJ0HJs8FWClUBuaww2e.7455948586553057363: failed to deliver message within timeout
[-]E1016 12:47:50.826796 775752 hyperactor/src/channel/net.rs:872] error_msg:session unix:@cXwFuzc6LZR0EaOTBVt2PSVx.14241017779262408940: failed to deliver message within timeout
[0] [0] INFO 10-16 12:47:51 [monitor.py:34] torch.compile takes 4.28 s in total
[0] [0] INFO 10-16 12:47:53 [gpu_worker.py:255] Available KV cache memory: 76.00 GiB
[0] [0] INFO 10-16 12:47:53 [kv_cache_utils.py:833] GPU KV cache size: 711,520 tokens
[0] [0] INFO 10-16 12:47:53 [kv_cache_utils.py:837] Maximum concurrency for 40,960 tokens per request: 17.37x
Capturing CUDA graph shapes: 100%|██████████| 67/67 [00:01<00:00, 33.62it/s]
[0] [0] INFO 10-16 12:47:56 [gpu_model_runner.py:2485] Graph capturing finished in 2 secs, took 0.60 GiB
All services initialized successfully!
[-]E1016 12:47:57.552378 775752 hyperactor/src/channel/net.rs:872] error_msg:session unix:@Z9sV1AJ0HJs8FWClUBuaww2e.2645662419857517690: failed to deliver message within timeout
[-]E1016 12:48:28.750134 775752 hyperactor/src/channel/net.rs:885] error_msg:session unix:@Y8fq0TvZm97486NFz2xbpJab.7448927969152336640: failed to receive ack within timeout 30 secs; link is currently broken
[-]E1016 12:48:28.750514 775752 hyperactor/src/mailbox.rs:782] message not delivered, , name:undelivered_message_abandoned, actor_name:client, actor_id:unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0], dest:unix:@Y8fq0TvZm97486NFz2xbpJab,anon_0_1dGYT3NiRuxZ,comm_1jLMe6EJnpiN[0][13147652568889606402<hyperactor_mesh::comm::multicast::CastMessage>], headers:hyperactor::mailbox::headers::send_timestamp=2025-10-16T19:47:58.649452225+00:00,hyperactor_mesh::actor_mesh::cast_actor_mesh_id=agent, data:CastMessage{"dest":{"selection":"True","slice":{"offset":0,"sizes":[1],"strides":[1]}},"message":{"actor_mesh_id":{"V1":{"Reserved":"agent"}},"data":{"bindings":"[[119644[...436 chars] CRC:1e5c6f6d 5b 5b 31 31 39 36 34 34 [...428 bytes]","message":{"encoded":{"Multipart":{"body":"[0,0,0,0[...418 chars] CRC:15df9d6b 5b 30 2c 30 2c 30 2c 30 [...410 bytes]","is_illegal":false,"parts":[]}},"typehash":7252717004981455529}},"dest_port":{"actor_name":"agent","port":9548687619301914726},"sender":"[{\"Direc[...86 chars] CRC:c4ca528b 5b 7b 22 44 69 72 65 63 [...78 bytes]","shape":{"labels":"[\"procs\"[...9 chars] CRC:d1f1d7de 5b 22 70 72 6f 63 73 22 [...1 bytes]","slice":{"offset":0,"sizes":[1],"strides":[1]}}}}
[-]E1016 12:48:28.750589 775752 hyperactor/src/mailbox.rs:782] message not delivered, , name:undelivered_message_abandoned, actor_name:client, actor_id:unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0], dest:unix:@Y8fq0TvZm97486NFz2xbpJab,anon_0_1dGYT3NiRuxZ,comm_1jLMe6EJnpiN[0][13147652568889606402<hyperactor_mesh::comm::multicast::CastMessage>], headers:hyperactor::mailbox::headers::send_timestamp=2025-10-16T19:47:58.701559725+00:00,hyperactor_mesh::actor_mesh::cast_actor_mesh_id=agent, data:CastMessage{"dest":{"selection":"True","slice":{"offset":0,"sizes":[1],"strides":[1]}},"message":{"actor_mesh_id":{"V1":{"Reserved":"agent"}},"data":{"bindings":"[[104803[...125 chars] CRC:424d5e55 5b 5b 31 30 34 38 30 33 [...117 bytes]","message":{"encoded":{"Multipart":{"body":"[0,0,0,0[...286 chars] CRC:365a9783 5b 30 2c 30 2c 30 2c 30 [...278 bytes]","is_illegal":false,"parts":[]}},"typehash":1330557499926649425}},"dest_port":{"actor_name":"agent","port":10639140449372694939},"sender":"[{\"Direc[...86 chars] CRC:c4ca528b 5b 7b 22 44 69 72 65 63 [...78 bytes]","shape":{"labels":"[\"procs\"[...9 chars] CRC:d1f1d7de 5b 22 70 72 6f 63 73 22 [...1 bytes]","slice":{"offset":0,"sizes":[1],"strides":[1]}}}}
[-]E1016 12:48:28.750643 775752 hyperactor/src/mailbox.rs:782] message not delivered, , name:undelivered_message_abandoned, actor_name:client, actor_id:unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0], dest:unix:@Y8fq0TvZm97486NFz2xbpJab,anon_0_1dGYT3NiRuxZ,comm_1jLMe6EJnpiN[0][13147652568889606402<hyperactor_mesh::comm::multicast::CastMessage>], headers:hyperactor_mesh::actor_mesh::cast_actor_mesh_id=agent,hyperactor::mailbox::headers::send_timestamp=2025-10-16T19:47:58.701597422+00:00, data:CastMessage{"dest":{"selection":"True","slice":{"offset":0,"sizes":[1],"strides":[1]}},"message":{"actor_mesh_id":{"V1":{"Reserved":"agent"}},"data":{"bindings":"[[119644[...436 chars] CRC:97c45964 5b 5b 31 31 39 36 34 34 [...428 bytes]","message":{"encoded":{"Multipart":{"body":"[0,0,0,0[...389 chars] CRC:3b1000b4 5b 30 2c 30 2c 30 2c 30 [...381 bytes]","is_illegal":false,"parts":[]}},"typehash":7252717004981455529}},"dest_port":{"actor_name":"agent","port":9548687619301914726},"sender":"[{\"Direc[...86 chars] CRC:c4ca528b 5b 7b 22 44 69 72 65 63 [...78 bytes]","shape":{"labels":"[\"procs\"[...9 chars] CRC:d1f1d7de 5b 22 70 72 6f 63 73 22 [...1 bytes]","slice":{"offset":0,"sizes":[1],"strides":[1]}}}}
[-]E1016 12:48:28.750695 775752 hyperactor/src/mailbox.rs:782] message not delivered, , name:undelivered_message_abandoned, actor_name:client, actor_id:unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0], dest:unix:@Y8fq0TvZm97486NFz2xbpJab,anon_0_1dGYT3NiRuxZ,comm_1jLMe6EJnpiN[0][13147652568889606402<hyperactor_mesh::comm::multicast::CastMessage>], headers:hyperactor::mailbox::headers::send_timestamp=2025-10-16T19:47:58.754350855+00:00,hyperactor_mesh::actor_mesh::cast_actor_mesh_id=log_forwarder_1rLHDBEhwHnQ, data:CastMessage{"dest":{"selection":"True","slice":{"offset":0,"sizes":[1],"strides":[1]}},"message":{"actor_mesh_id":{"V1":{"Suffixed":"[\"log_fo[...38 chars] CRC:22a39fcb 5b 22 6c 6f 67 5f 66 6f [...30 bytes]"}},"data":{"bindings":[],"message":{"encoded":{"Multipart":{"body":"[1,0,0,0[...11 chars] CRC:5597ea9f 5b 31 2c 30 2c 30 2c 30 [...3 bytes]","is_illegal":false,"parts":[]}},"typehash":12098283485503622895}},"dest_port":{"actor_name":"log_forw[...26 chars] CRC:ef862055 6c 6f 67 5f 66 6f 72 77 [...18 bytes]","port":18039862042805291298},"sender":"[{\"Direc[...86 chars] CRC:c4ca528b 5b 7b 22 44 69 72 65 63 [...78 bytes]","shape":{"labels":"[\"procs\"[...9 chars] CRC:d1f1d7de 5b 22 70 72 6f 63 73 22 [...1 bytes]","slice":{"offset":0,"sizes":[1],"strides":[1]}}}}
[-]E1016 12:48:28.750748 775752 hyperactor/src/mailbox.rs:782] message not delivered, , name:undelivered_message_abandoned, actor_name:client, actor_id:unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0], dest:unix:@Y8fq0TvZm97486NFz2xbpJab,anon_0_1dGYT3NiRuxZ,comm_1jLMe6EJnpiN[0][13147652568889606402<hyperactor_mesh::comm::multicast::CastMessage>], headers:hyperactor::mailbox::headers::send_timestamp=2025-10-16T19:47:58.754369373+00:00,hyperactor_mesh::actor_mesh::cast_actor_mesh_id=logger_1aZQw2Sm9N34, data:CastMessage{"dest":{"selection":"True","slice":{"offset":0,"sizes":[1],"strides":[1]}},"message":{"actor_mesh_id":{"V1":{"Suffixed":"[\"logger[...30 chars] CRC:7be1ce35 5b 22 6c 6f 67 67 65 72 [...22 bytes]"}},"data":{"bindings":[],"message":{"encoded":{"Multipart":{"body":"[0,0,0,0[...12 chars] CRC:b564c5bf 5b 30 2c 30 2c 30 2c 30 [...4 bytes]","is_illegal":false,"parts":[]}},"typehash":10911722563623013595}},"dest_port":{"actor_name":"logger_1[...19 chars] CRC:6940c6b 6c 6f 67 67 65 72 5f 31 [...11 bytes]","port":10648924583654510167},"sender":"[{\"Direc[...86 chars] CRC:c4ca528b 5b 7b 22 44 69 72 65 63 [...78 bytes]","shape":{"labels":"[\"procs\"[...9 chars] CRC:d1f1d7de 5b 22 70 72 6f 63 73 22 [...1 bytes]","slice":{"offset":0,"sizes":[1],"strides":[1]}}}}
[-]E1016 12:48:28.750804 775752 hyperactor/src/mailbox.rs:782] message not delivered, , name:undelivered_message_abandoned, actor_name:client, actor_id:unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0], dest:unix:@Y8fq0TvZm97486NFz2xbpJab,anon_0_1dGYT3NiRuxZ,comm_1jLMe6EJnpiN[0][13147652568889606402<hyperactor_mesh::comm::multicast::CastMessage>], headers:hyperactor::mailbox::headers::send_timestamp=2025-10-16T19:47:58.756514783+00:00,hyperactor_mesh::actor_mesh::cast_actor_mesh_id=agent, data:CastMessage{"dest":{"selection":"True","slice":{"offset":0,"sizes":[1],"strides":[1]}},"message":{"actor_mesh_id":{"V1":{"Reserved":"agent"}},"data":{"bindings":"[[104803[...125 chars] CRC:424d5e55 5b 5b 31 30 34 38 30 33 [...117 bytes]","message":{"encoded":{"Multipart":{"body":"[0,0,0,0[...512 chars] CRC:eb4c594f 5b 30 2c 30 2c 30 2c 30 [...504 bytes]","is_illegal":false,"parts":[]}},"typehash":1330557499926649425}},"dest_port":{"actor_name":"agent","port":10639140449372694939},"sender":"[{\"Direc[...86 chars] CRC:c4ca528b 5b 7b 22 44 69 72 65 63 [...78 bytes]","shape":{"labels":"[\"procs\"[...9 chars] CRC:d1f1d7de 5b 22 70 72 6f 63 73 22 [...1 bytes]","slice":{"offset":0,"sizes":[1],"strides":[1]}}}}
[-]E1016 12:48:28.750881 775752 hyperactor/src/mailbox.rs:782] message not delivered, , name:undelivered_message_abandoned, actor_name:client, actor_id:unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0], dest:unix:@Y8fq0TvZm97486NFz2xbpJab,anon_0_1dGYT3NiRuxZ,comm_1jLMe6EJnpiN[0][13147652568889606402<hyperactor_mesh::comm::multicast::CastMessage>], headers:hyperactor::mailbox::headers::send_timestamp=2025-10-16T19:47:58.756561193+00:00,hyperactor_mesh::actor_mesh::cast_actor_mesh_id=agent, data:CastMessage{"dest":{"selection":"True","slice":{"offset":0,"sizes":[1],"strides":[1]}},"message":{"actor_mesh_id":{"V1":{"Reserved":"agent"}},"data":{"bindings":"[[119644[...436 chars] CRC:3914ab48 5b 5b 31 31 39 36 34 34 [...428 bytes]","message":{"encoded":{"Multipart":{"body":"[0,0,0,0[...448 chars] CRC:944c8fe3 5b 30 2c 30 2c 30 2c 30 [...440 bytes]","is_illegal":false,"parts":[]}},"typehash":7252717004981455529}},"dest_port":{"actor_name":"agent","port":9548687619301914726},"sender":"[{\"Direc[...86 chars] CRC:c4ca528b 5b 7b 22 44 69 72 65 63 [...78 bytes]","shape":{"labels":"[\"procs\"[...9 chars] CRC:d1f1d7de 5b 22 70 72 6f 63 73 22 [...1 bytes]","slice":{"offset":0,"sizes":[1],"strides":[1]}}}}
[-]E1016 12:48:28.750931 775752 hyperactor/src/mailbox.rs:782] message not delivered, , name:undelivered_message_abandoned, actor_name:client, actor_id:unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0], dest:unix:@Y8fq0TvZm97486NFz2xbpJab,anon_0_1dGYT3NiRuxZ,comm_1jLMe6EJnpiN[0][13147652568889606402<hyperactor_mesh::comm::multicast::CastMessage>], headers:hyperactor_mesh::actor_mesh::cast_actor_mesh_id=agent,hyperactor::mailbox::headers::send_timestamp=2025-10-16T19:47:58.649415320+00:00, data:CastMessage{"dest":{"selection":"True","slice":{"offset":0,"sizes":[1],"strides":[1]}},"message":{"actor_mesh_id":{"V1":{"Reserved":"agent"}},"data":{"bindings":"[[104803[...125 chars] CRC:424d5e55 5b 5b 31 30 34 38 30 33 [...117 bytes]","message":{"encoded":{"Multipart":{"body":"[0,0,0,0[...614 chars] CRC:6119f83f 5b 30 2c 30 2c 30 2c 30 [...606 bytes]","is_illegal":false,"parts":[]}},"typehash":1330557499926649425}},"dest_port":{"actor_name":"agent","port":10639140449372694939},"sender":"[{\"Direc[...86 chars] CRC:c4ca528b 5b 7b 22 44 69 72 65 63 [...78 bytes]","shape":{"labels":"[\"procs\"[...9 chars] CRC:d1f1d7de 5b 22 70 72 6f 63 73 22 [...1 bytes]","slice":{"offset":0,"sizes":[1],"strides":[1]}}}}
[-]E1016 12:48:28.750994 775752 hyperactor/src/mailbox.rs:782] message not delivered, , name:undelivered_message_abandoned, actor_name:client, actor_id:unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0], dest:unix:@Y8fq0TvZm97486NFz2xbpJab,anon_0_1dGYT3NiRuxZ,agent[0][16476089041836231337<hyperactor_mesh::resource::GetRankStatus>], headers:hyperactor_mesh::comm::multicast::cast_point=procs=0/1,hyperactor_mesh::comm::multicast::cast_originating_sender=unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0],hyperactor::mailbox::headers::send_timestamp=2025-10-16T19:47:58.536707210+00:00, data:GetRankStatus{"name":{"Suffixed":"[\"comm\",[...28 chars] CRC:69aeeab5 5b 22 63 6f 6d 6d 22 2c [...20 bytes]"},"reply":{"phantom":null,"port_id":"[[{\"Dire[...93 chars] CRC:8e6e607d 5b 5b 7b 22 44 69 72 65 [...85 bytes]","reducer_opts":null,"reducer_spec":{"builder_params":null,"typehash":4493460774099310317}}}
[-]E1016 12:48:28.751029 775752 hyperactor/src/mailbox.rs:782] message not delivered, , name:undelivered_message_abandoned, actor_name:client, actor_id:unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0], dest:unix:@Y8fq0TvZm97486NFz2xbpJab,anon_0_1dGYT3NiRuxZ,comm_1jLMe6EJnpiN[0][9438540675267141827<hyperactor_mesh::comm::CommActorMode>], headers:hyperactor::mailbox::headers::send_timestamp=2025-10-16T19:47:58.648765882+00:00, data:CommActorMode{"Mesh":"[0,{\"0\":[...105 chars] CRC:e8e3dbf2 5b 30 2c 7b 22 30 22 3a [...97 bytes]"}
Traceback (most recent call last):
File "/home/dxie/.conda/envs/forge/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/dxie/.conda/envs/forge/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/dxie/forge/apps/grpo/main.py", line 510, in <module>
_main() # @parse grabs the cfg from CLI
File "/home/dxie/forge/src/forge/cli/config.py", line 310, in wrapper
sys.exit(recipe_main(conf))
File "/home/dxie/forge/apps/grpo/main.py", line 508, in _main
asyncio.run(main(cfg))
File "/home/dxie/.conda/envs/forge/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/home/dxie/.conda/envs/forge/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/home/dxie/forge/apps/grpo/main.py", line 352, in main
await ts.initialize(
File "/home/dxie/.conda/envs/forge/lib/python3.10/site-packages/torchstore/api.py", line 80, in initialize
await controller.init.call(
File "/home/dxie/.conda/envs/forge/lib/python3.10/site-packages/monarch/_src/actor/endpoint.py", line 132, in call
extent = self._send(args, kwargs, port=p)
File "/home/dxie/.conda/envs/forge/lib/python3.10/site-packages/monarch/_src/actor/actor_mesh.py", line 372, in _send
objects, buffer = flatten((args, kwargs), _is_ref_or_mailbox)
File "/home/dxie/.conda/envs/forge/lib/python3.10/site-packages/monarch/_src/actor/pickle.py", line 86, in flatten
pickler.dump(obj)
File "/home/dxie/.conda/envs/forge/lib/python3.10/site-packages/cloudpickle/cloudpickle.py", line 1303, in dump
return super().dump(obj)
RuntimeError: error spawning actor mesh: statuses: Timeout(30.000849209s)=0..1
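For triage, the abandoned messages in the log above can be tallied by message type. This is a minimal sketch that assumes the `undelivered_message_abandoned` line format shown above. The `summarize_undelivered` helper and its regex are hypothetical and not part of Monarch, hyperactor, or torchstore.

```python
import re
from collections import Counter

# Matches the "[rank][port<message::Type>]" suffix of the `dest:` field in
# "undelivered_message_abandoned" lines, as they appear in the log above.
# The pattern is an assumption based on that log, not a documented format.
_DEST_TYPE = re.compile(r"dest:\S*?\[\d+\]\[\d+<([^>]+)>\]")


def summarize_undelivered(log_text: str) -> Counter:
    """Count abandoned messages per message type (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        # Only look at lines that report an abandoned message.
        if "undelivered_message_abandoned" not in line:
            continue
        match = _DEST_TYPE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing the saved console output to `summarize_undelivered` gives a per-type count, which makes it easier to see whether the dropped messages are mostly casts to the `agent` mesh or also include status and setup messages.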
Versions
No response