GRPO gives torchstore error when commenting metric logger related statements. #439

@DNXie

Description

🐛 Describe the bug

I was debugging the metric logger issue by commenting out all the metric-logger-related statements in grpo/main: #437
This code change shouldn't change anything, and the metric logger shouldn't do anything related to torchstore.

Running

python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml

I got:

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  4.91it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  4.91it/s]
[0] [0]
[0] [0] INFO 10-16 12:47:37 [default_loader.py:262] Loading weights took 0.46 seconds
[0] [0] INFO 10-16 12:47:38 [gpu_model_runner.py:1892] Model loading took 3.2152 GiB and 1.035845 seconds
[0] [0] INFO 10-16 12:47:43 [backends.py:530] Using cache directory: /home/dxie/.cache/vllm/torch_compile_cache/774b299512/rank_0_0/backbone for vLLM's torch.compile
[0] [0] INFO 10-16 12:47:43 [backends.py:541] Dynamo bytecode transform time: 4.28 s
[0] [0] INFO 10-16 12:47:47 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 2.596 s
[-]E1016 12:47:48.760388 775752 hyperactor/src/channel/net.rs:872] error_msg:session unix:@Z9sV1AJ0HJs8FWClUBuaww2e.7455948586553057363: failed to deliver message within timeout
[-]E1016 12:47:50.826796 775752 hyperactor/src/channel/net.rs:872] error_msg:session unix:@cXwFuzc6LZR0EaOTBVt2PSVx.14241017779262408940: failed to deliver message within timeout
[0] [0] INFO 10-16 12:47:51 [monitor.py:34] torch.compile takes 4.28 s in total
[0] [0] INFO 10-16 12:47:53 [gpu_worker.py:255] Available KV cache memory: 76.00 GiB
[0] [0] INFO 10-16 12:47:53 [kv_cache_utils.py:833] GPU KV cache size: 711,520 tokens
[0] [0] INFO 10-16 12:47:53 [kv_cache_utils.py:837] Maximum concurrency for 40,960 tokens per request: 17.37x
Capturing CUDA graph shapes: 100%|██████████| 67/67 [00:01<00:00, 33.62it/s]
[0] [0] INFO 10-16 12:47:56 [gpu_model_runner.py:2485] Graph capturing finished in 2 secs, took 0.60 GiB
All services initialized successfully!
[-]E1016 12:47:57.552378 775752 hyperactor/src/channel/net.rs:872] error_msg:session unix:@Z9sV1AJ0HJs8FWClUBuaww2e.2645662419857517690: failed to deliver message within timeout
[-]E1016 12:48:28.750134 775752 hyperactor/src/channel/net.rs:885] error_msg:session unix:@Y8fq0TvZm97486NFz2xbpJab.7448927969152336640: failed to receive ack within timeout 30 secs; link is currently broken
[-]E1016 12:48:28.750514 775752 hyperactor/src/mailbox.rs:782] message not delivered, , name:undelivered_message_abandoned, actor_name:client, actor_id:unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0], dest:unix:@Y8fq0TvZm97486NFz2xbpJab,anon_0_1dGYT3NiRuxZ,comm_1jLMe6EJnpiN[0][13147652568889606402<hyperactor_mesh::comm::multicast::CastMessage>], headers:hyperactor::mailbox::headers::send_timestamp=2025-10-16T19:47:58.649452225+00:00,hyperactor_mesh::actor_mesh::cast_actor_mesh_id=agent, data:CastMessage{"dest":{"selection":"True","slice":{"offset":0,"sizes":[1],"strides":[1]}},"message":{"actor_mesh_id":{"V1":{"Reserved":"agent"}},"data":{"bindings":"[[119644[...436 chars] CRC:1e5c6f6d 5b 5b 31 31 39 36 34 34 [...428 bytes]","message":{"encoded":{"Multipart":{"body":"[0,0,0,0[...418 chars] CRC:15df9d6b 5b 30 2c 30 2c 30 2c 30 [...410 bytes]","is_illegal":false,"parts":[]}},"typehash":7252717004981455529}},"dest_port":{"actor_name":"agent","port":9548687619301914726},"sender":"[{\"Direc[...86 chars] CRC:c4ca528b 5b 7b 22 44 69 72 65 63 [...78 bytes]","shape":{"labels":"[\"procs\"[...9 chars] CRC:d1f1d7de 5b 22 70 72 6f 63 73 22 [...1 bytes]","slice":{"offset":0,"sizes":[1],"strides":[1]}}}}
[-]E1016 12:48:28.750589 775752 hyperactor/src/mailbox.rs:782] message not delivered, , name:undelivered_message_abandoned, actor_name:client, actor_id:unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0], dest:unix:@Y8fq0TvZm97486NFz2xbpJab,anon_0_1dGYT3NiRuxZ,comm_1jLMe6EJnpiN[0][13147652568889606402<hyperactor_mesh::comm::multicast::CastMessage>], headers:hyperactor::mailbox::headers::send_timestamp=2025-10-16T19:47:58.701559725+00:00,hyperactor_mesh::actor_mesh::cast_actor_mesh_id=agent, data:CastMessage{"dest":{"selection":"True","slice":{"offset":0,"sizes":[1],"strides":[1]}},"message":{"actor_mesh_id":{"V1":{"Reserved":"agent"}},"data":{"bindings":"[[104803[...125 chars] CRC:424d5e55 5b 5b 31 30 34 38 30 33 [...117 bytes]","message":{"encoded":{"Multipart":{"body":"[0,0,0,0[...286 chars] CRC:365a9783 5b 30 2c 30 2c 30 2c 30 [...278 bytes]","is_illegal":false,"parts":[]}},"typehash":1330557499926649425}},"dest_port":{"actor_name":"agent","port":10639140449372694939},"sender":"[{\"Direc[...86 chars] CRC:c4ca528b 5b 7b 22 44 69 72 65 63 [...78 bytes]","shape":{"labels":"[\"procs\"[...9 chars] CRC:d1f1d7de 5b 22 70 72 6f 63 73 22 [...1 bytes]","slice":{"offset":0,"sizes":[1],"strides":[1]}}}}
[-]E1016 12:48:28.750643 775752 hyperactor/src/mailbox.rs:782] message not delivered, , name:undelivered_message_abandoned, actor_name:client, actor_id:unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0], dest:unix:@Y8fq0TvZm97486NFz2xbpJab,anon_0_1dGYT3NiRuxZ,comm_1jLMe6EJnpiN[0][13147652568889606402<hyperactor_mesh::comm::multicast::CastMessage>], headers:hyperactor_mesh::actor_mesh::cast_actor_mesh_id=agent,hyperactor::mailbox::headers::send_timestamp=2025-10-16T19:47:58.701597422+00:00, data:CastMessage{"dest":{"selection":"True","slice":{"offset":0,"sizes":[1],"strides":[1]}},"message":{"actor_mesh_id":{"V1":{"Reserved":"agent"}},"data":{"bindings":"[[119644[...436 chars] CRC:97c45964 5b 5b 31 31 39 36 34 34 [...428 bytes]","message":{"encoded":{"Multipart":{"body":"[0,0,0,0[...389 chars] CRC:3b1000b4 5b 30 2c 30 2c 30 2c 30 [...381 bytes]","is_illegal":false,"parts":[]}},"typehash":7252717004981455529}},"dest_port":{"actor_name":"agent","port":9548687619301914726},"sender":"[{\"Direc[...86 chars] CRC:c4ca528b 5b 7b 22 44 69 72 65 63 [...78 bytes]","shape":{"labels":"[\"procs\"[...9 chars] CRC:d1f1d7de 5b 22 70 72 6f 63 73 22 [...1 bytes]","slice":{"offset":0,"sizes":[1],"strides":[1]}}}}
[-]E1016 12:48:28.750695 775752 hyperactor/src/mailbox.rs:782] message not delivered, , name:undelivered_message_abandoned, actor_name:client, actor_id:unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0], dest:unix:@Y8fq0TvZm97486NFz2xbpJab,anon_0_1dGYT3NiRuxZ,comm_1jLMe6EJnpiN[0][13147652568889606402<hyperactor_mesh::comm::multicast::CastMessage>], headers:hyperactor::mailbox::headers::send_timestamp=2025-10-16T19:47:58.754350855+00:00,hyperactor_mesh::actor_mesh::cast_actor_mesh_id=log_forwarder_1rLHDBEhwHnQ, data:CastMessage{"dest":{"selection":"True","slice":{"offset":0,"sizes":[1],"strides":[1]}},"message":{"actor_mesh_id":{"V1":{"Suffixed":"[\"log_fo[...38 chars] CRC:22a39fcb 5b 22 6c 6f 67 5f 66 6f [...30 bytes]"}},"data":{"bindings":[],"message":{"encoded":{"Multipart":{"body":"[1,0,0,0[...11 chars] CRC:5597ea9f 5b 31 2c 30 2c 30 2c 30 [...3 bytes]","is_illegal":false,"parts":[]}},"typehash":12098283485503622895}},"dest_port":{"actor_name":"log_forw[...26 chars] CRC:ef862055 6c 6f 67 5f 66 6f 72 77 [...18 bytes]","port":18039862042805291298},"sender":"[{\"Direc[...86 chars] CRC:c4ca528b 5b 7b 22 44 69 72 65 63 [...78 bytes]","shape":{"labels":"[\"procs\"[...9 chars] CRC:d1f1d7de 5b 22 70 72 6f 63 73 22 [...1 bytes]","slice":{"offset":0,"sizes":[1],"strides":[1]}}}}
[-]E1016 12:48:28.750748 775752 hyperactor/src/mailbox.rs:782] message not delivered, , name:undelivered_message_abandoned, actor_name:client, actor_id:unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0], dest:unix:@Y8fq0TvZm97486NFz2xbpJab,anon_0_1dGYT3NiRuxZ,comm_1jLMe6EJnpiN[0][13147652568889606402<hyperactor_mesh::comm::multicast::CastMessage>], headers:hyperactor::mailbox::headers::send_timestamp=2025-10-16T19:47:58.754369373+00:00,hyperactor_mesh::actor_mesh::cast_actor_mesh_id=logger_1aZQw2Sm9N34, data:CastMessage{"dest":{"selection":"True","slice":{"offset":0,"sizes":[1],"strides":[1]}},"message":{"actor_mesh_id":{"V1":{"Suffixed":"[\"logger[...30 chars] CRC:7be1ce35 5b 22 6c 6f 67 67 65 72 [...22 bytes]"}},"data":{"bindings":[],"message":{"encoded":{"Multipart":{"body":"[0,0,0,0[...12 chars] CRC:b564c5bf 5b 30 2c 30 2c 30 2c 30 [...4 bytes]","is_illegal":false,"parts":[]}},"typehash":10911722563623013595}},"dest_port":{"actor_name":"logger_1[...19 chars] CRC:6940c6b 6c 6f 67 67 65 72 5f 31 [...11 bytes]","port":10648924583654510167},"sender":"[{\"Direc[...86 chars] CRC:c4ca528b 5b 7b 22 44 69 72 65 63 [...78 bytes]","shape":{"labels":"[\"procs\"[...9 chars] CRC:d1f1d7de 5b 22 70 72 6f 63 73 22 [...1 bytes]","slice":{"offset":0,"sizes":[1],"strides":[1]}}}}
[-]E1016 12:48:28.750804 775752 hyperactor/src/mailbox.rs:782] message not delivered, , name:undelivered_message_abandoned, actor_name:client, actor_id:unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0], dest:unix:@Y8fq0TvZm97486NFz2xbpJab,anon_0_1dGYT3NiRuxZ,comm_1jLMe6EJnpiN[0][13147652568889606402<hyperactor_mesh::comm::multicast::CastMessage>], headers:hyperactor::mailbox::headers::send_timestamp=2025-10-16T19:47:58.756514783+00:00,hyperactor_mesh::actor_mesh::cast_actor_mesh_id=agent, data:CastMessage{"dest":{"selection":"True","slice":{"offset":0,"sizes":[1],"strides":[1]}},"message":{"actor_mesh_id":{"V1":{"Reserved":"agent"}},"data":{"bindings":"[[104803[...125 chars] CRC:424d5e55 5b 5b 31 30 34 38 30 33 [...117 bytes]","message":{"encoded":{"Multipart":{"body":"[0,0,0,0[...512 chars] CRC:eb4c594f 5b 30 2c 30 2c 30 2c 30 [...504 bytes]","is_illegal":false,"parts":[]}},"typehash":1330557499926649425}},"dest_port":{"actor_name":"agent","port":10639140449372694939},"sender":"[{\"Direc[...86 chars] CRC:c4ca528b 5b 7b 22 44 69 72 65 63 [...78 bytes]","shape":{"labels":"[\"procs\"[...9 chars] CRC:d1f1d7de 5b 22 70 72 6f 63 73 22 [...1 bytes]","slice":{"offset":0,"sizes":[1],"strides":[1]}}}}
[-]E1016 12:48:28.750881 775752 hyperactor/src/mailbox.rs:782] message not delivered, , name:undelivered_message_abandoned, actor_name:client, actor_id:unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0], dest:unix:@Y8fq0TvZm97486NFz2xbpJab,anon_0_1dGYT3NiRuxZ,comm_1jLMe6EJnpiN[0][13147652568889606402<hyperactor_mesh::comm::multicast::CastMessage>], headers:hyperactor::mailbox::headers::send_timestamp=2025-10-16T19:47:58.756561193+00:00,hyperactor_mesh::actor_mesh::cast_actor_mesh_id=agent, data:CastMessage{"dest":{"selection":"True","slice":{"offset":0,"sizes":[1],"strides":[1]}},"message":{"actor_mesh_id":{"V1":{"Reserved":"agent"}},"data":{"bindings":"[[119644[...436 chars] CRC:3914ab48 5b 5b 31 31 39 36 34 34 [...428 bytes]","message":{"encoded":{"Multipart":{"body":"[0,0,0,0[...448 chars] CRC:944c8fe3 5b 30 2c 30 2c 30 2c 30 [...440 bytes]","is_illegal":false,"parts":[]}},"typehash":7252717004981455529}},"dest_port":{"actor_name":"agent","port":9548687619301914726},"sender":"[{\"Direc[...86 chars] CRC:c4ca528b 5b 7b 22 44 69 72 65 63 [...78 bytes]","shape":{"labels":"[\"procs\"[...9 chars] CRC:d1f1d7de 5b 22 70 72 6f 63 73 22 [...1 bytes]","slice":{"offset":0,"sizes":[1],"strides":[1]}}}}
[-]E1016 12:48:28.750931 775752 hyperactor/src/mailbox.rs:782] message not delivered, , name:undelivered_message_abandoned, actor_name:client, actor_id:unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0], dest:unix:@Y8fq0TvZm97486NFz2xbpJab,anon_0_1dGYT3NiRuxZ,comm_1jLMe6EJnpiN[0][13147652568889606402<hyperactor_mesh::comm::multicast::CastMessage>], headers:hyperactor_mesh::actor_mesh::cast_actor_mesh_id=agent,hyperactor::mailbox::headers::send_timestamp=2025-10-16T19:47:58.649415320+00:00, data:CastMessage{"dest":{"selection":"True","slice":{"offset":0,"sizes":[1],"strides":[1]}},"message":{"actor_mesh_id":{"V1":{"Reserved":"agent"}},"data":{"bindings":"[[104803[...125 chars] CRC:424d5e55 5b 5b 31 30 34 38 30 33 [...117 bytes]","message":{"encoded":{"Multipart":{"body":"[0,0,0,0[...614 chars] CRC:6119f83f 5b 30 2c 30 2c 30 2c 30 [...606 bytes]","is_illegal":false,"parts":[]}},"typehash":1330557499926649425}},"dest_port":{"actor_name":"agent","port":10639140449372694939},"sender":"[{\"Direc[...86 chars] CRC:c4ca528b 5b 7b 22 44 69 72 65 63 [...78 bytes]","shape":{"labels":"[\"procs\"[...9 chars] CRC:d1f1d7de 5b 22 70 72 6f 63 73 22 [...1 bytes]","slice":{"offset":0,"sizes":[1],"strides":[1]}}}}
[-]E1016 12:48:28.750994 775752 hyperactor/src/mailbox.rs:782] message not delivered, , name:undelivered_message_abandoned, actor_name:client, actor_id:unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0], dest:unix:@Y8fq0TvZm97486NFz2xbpJab,anon_0_1dGYT3NiRuxZ,agent[0][16476089041836231337<hyperactor_mesh::resource::GetRankStatus>], headers:hyperactor_mesh::comm::multicast::cast_point=procs=0/1,hyperactor_mesh::comm::multicast::cast_originating_sender=unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0],hyperactor::mailbox::headers::send_timestamp=2025-10-16T19:47:58.536707210+00:00, data:GetRankStatus{"name":{"Suffixed":"[\"comm\",[...28 chars] CRC:69aeeab5 5b 22 63 6f 6d 6d 22 2c [...20 bytes]"},"reply":{"phantom":null,"port_id":"[[{\"Dire[...93 chars] CRC:8e6e607d 5b 5b 7b 22 44 69 72 65 [...85 bytes]","reducer_opts":null,"reducer_spec":{"builder_params":null,"typehash":4493460774099310317}}}
[-]E1016 12:48:28.751029 775752 hyperactor/src/mailbox.rs:782] message not delivered, , name:undelivered_message_abandoned, actor_name:client, actor_id:unix:@JNSI8BUWkXh9JHXoSCLF4Lh4,mesh_root_client_proc,client[0], dest:unix:@Y8fq0TvZm97486NFz2xbpJab,anon_0_1dGYT3NiRuxZ,comm_1jLMe6EJnpiN[0][9438540675267141827<hyperactor_mesh::comm::CommActorMode>], headers:hyperactor::mailbox::headers::send_timestamp=2025-10-16T19:47:58.648765882+00:00, data:CommActorMode{"Mesh":"[0,{\"0\":[...105 chars] CRC:e8e3dbf2 5b 30 2c 7b 22 30 22 3a [...97 bytes]"}
Traceback (most recent call last):
  File "/home/dxie/.conda/envs/forge/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dxie/.conda/envs/forge/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/dxie/forge/apps/grpo/main.py", line 510, in <module>
    _main()  # @parse grabs the cfg from CLI
  File "/home/dxie/forge/src/forge/cli/config.py", line 310, in wrapper
    sys.exit(recipe_main(conf))
  File "/home/dxie/forge/apps/grpo/main.py", line 508, in _main
    asyncio.run(main(cfg))
  File "/home/dxie/.conda/envs/forge/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/home/dxie/.conda/envs/forge/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/dxie/forge/apps/grpo/main.py", line 352, in main
    await ts.initialize(
  File "/home/dxie/.conda/envs/forge/lib/python3.10/site-packages/torchstore/api.py", line 80, in initialize
    await controller.init.call(
  File "/home/dxie/.conda/envs/forge/lib/python3.10/site-packages/monarch/_src/actor/endpoint.py", line 132, in call
    extent = self._send(args, kwargs, port=p)
  File "/home/dxie/.conda/envs/forge/lib/python3.10/site-packages/monarch/_src/actor/actor_mesh.py", line 372, in _send
    objects, buffer = flatten((args, kwargs), _is_ref_or_mailbox)
  File "/home/dxie/.conda/envs/forge/lib/python3.10/site-packages/monarch/_src/actor/pickle.py", line 86, in flatten
    pickler.dump(obj)
  File "/home/dxie/.conda/envs/forge/lib/python3.10/site-packages/cloudpickle/cloudpickle.py", line 1303, in dump
    return super().dump(obj)
RuntimeError: error spawning actor mesh: statuses: Timeout(30.000849209s)=0..1
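
The hyperactor lines above are long enough that it is hard to see which messages were dropped. As a triage aid, here is a minimal sketch that groups the `undelivered_message_abandoned` lines by message type and target actor mesh. The regular expressions and the `summarize_undelivered` helper are assumptions inferred from the log lines pasted in this issue, not a documented hyperactor log format or API.

```python
import re
from collections import Counter

# Both patterns are inferred from the log lines quoted in this issue.
# They are assumptions about the format, not a documented log schema.
MESH_ID = re.compile(r"cast_actor_mesh_id=([A-Za-z0-9_]+)")
MSG_TYPE = re.compile(r"\[\d+<([\w:]+)>\]")


def summarize_undelivered(log_text: str) -> Counter:
    """Count abandoned messages per (message type, target actor mesh)."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        # Only the mailbox "abandoned" lines are of interest here.
        if "undelivered_message_abandoned" not in line:
            continue
        msg_type = MSG_TYPE.search(line)
        mesh_id = MESH_ID.search(line)
        key = (
            # Keep only the last path segment, e.g. "CastMessage".
            msg_type.group(1).rsplit("::", 1)[-1] if msg_type else "unknown",
            # Lines without a cast_actor_mesh_id header are grouped as "n/a".
            mesh_id.group(1) if mesh_id else "n/a",
        )
        counts[key] += 1
    return counts
```

Passing the saved console output to `summarize_undelivered` and printing `most_common()` gives a short table in place of the raw dump. In the log above, every abandoned message is addressed to `unix:@Y8fq0TvZm97486NFz2xbpJab`, the same endpoint the `net.rs:885` line reports as broken after the 30-second ack timeout. This may point at a single process that never came up rather than at torchstore itself.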

Versions

No response

Labels: bug (Something isn't working)
