
[BUG] When I run mix_chord.yaml, an error appears: "Trainer synchronizing weights failed." There seems to be a problem with weight synchronization. How can I fix this? #352

@chongzi1990

Description


Bug Description

Please provide a detailed description of the issue you encountered.

Environment Information

  • Operating System: [e.g., Ubuntu 20.04]
  • Python Version: [e.g., 3.10.12]
  • GPU: [e.g., NVIDIA A100-80G]
  • CUDA Version: [e.g., 12.4]
  • Installation Method: [e.g., pypi, source]
  • Trinity-RFT Version: [e.g., 0.2.1]
  • Other relevant dependencies or configurations you think might be helpful

Actual Behavior

(pid=101782) INFO 10-30 11:10:30 [importing.py:53] Triton module has been replaced with a placeholder. [repeated 31x across cluster]
(pid=101782) INFO 10-30 11:10:30 [init.py:239] Automatically detected platform cuda. [repeated 31x across cluster]
(Explorer pid=89687) INFO 10-30 11:10:38 [explorer.py:337] Log metrics of step 0
(Trainer pid=89688) INFO 10-30 11:10:38 [trainer.py:152] Trainer synchronizing weights at step 0 starting..
(WorkflowRunner pid=101780) Using blocking ray.get inside async actor. This blocks the event loop. Please use await on object ref with asyncio.gather if you want to yield execution to the event loop instead.
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s] [repeated 3x across cluster]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.42it/s] [repeated 11x across cluster]
(vLLMRolloutModel pid=91482) 2025-10-30 11:09:56,729 INFO worker.py:1694 -- Connecting to existing Ray cluster at address: 10.166.70.58:6379... [repeated 3x across cluster]
(vLLMRolloutModel pid=91476) 2025-10-30 11:09:56,790 INFO worker.py:1879 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 [repeated 3x across cluster]
(Synchronizer pid=90977) ERROR 10-30 11:30:38 [synchronizer.py:302] Explorer is not ready for model weight sync.
(Trainer pid=89688) ERROR 10-30 11:30:38 [trainer.py:160] Trainer synchronizing weights failed.
(Trainer pid=89688) INFO 10-30 11:30:39 [trainer.py:170] Trainer synchronizing weights at step 0 end.

Metadata


Labels: bug (Something isn't working)
