Open
Labels
bug: Something isn't working
Description
Describe the bug
When launching an instance of any model with pipeline and RDMA, it gets stuck on WARMING UP.
To Reproduce
Steps to reproduce the behavior:
- Launch an instance of any model with pipeline and RDMA
- It will get stuck on WARMING UP
Expected behavior
Instance should pass warm up and reach READY state.
Actual behavior
Gets stuck in WARMING UP. The logs show no tokens generated at all, which suggests it is stuck in communication between the nodes, perhaps because of an ordering issue.
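To illustrate the kind of ordering issue suspected here, below is a toy sketch. It is not exo code and uses no exo or MLX APIs; `RendezvousChannel` and `run_pair` are hypothetical names, and it assumes blocking, rendezvous-style send/recv between two pipeline stages. If both stages issue `send` first, neither ever reaches its `recv`, and both hang (modelled here with a timeout so the demo terminates).

```python
import queue
import threading


class RendezvousChannel:
    """Hypothetical blocking channel: send() returns only after a recv() took the item."""

    def __init__(self):
        self._items = queue.Queue(maxsize=1)
        self._acks = queue.Queue(maxsize=1)

    def send(self, item, timeout):
        self._items.put(item, timeout=timeout)
        # Wait until the peer has received; raises queue.Empty on timeout.
        self._acks.get(timeout=timeout)

    def recv(self, timeout):
        item = self._items.get(timeout=timeout)
        self._acks.put(True)
        return item


def run_pair(rank0_ops, rank1_ops, timeout=0.5):
    """Run two simulated ranks; return True if both finish, False if either times out."""
    chan = {(0, 1): RendezvousChannel(), (1, 0): RendezvousChannel()}
    finished = [False, False]

    def worker(rank, ops):
        peer = 1 - rank
        try:
            for op in ops:
                if op == "send":
                    chan[(rank, peer)].send(f"data from rank {rank}", timeout)
                else:
                    chan[(peer, rank)].recv(timeout)
            finished[rank] = True
        except (queue.Empty, queue.Full):
            pass  # timed out: stands in for a hang in a real system

    threads = [
        threading.Thread(target=worker, args=(rank, ops))
        for rank, ops in enumerate((rank0_ops, rank1_ops))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(finished)


if __name__ == "__main__":
    # Matched order: rank 0 sends while rank 1 receives, then the roles swap.
    print("matched order completes:", run_pair(["send", "recv"], ["recv", "send"]))
    # Mismatched order: both ranks send first, so neither send is ever matched.
    print("mismatched order completes:", run_pair(["send", "recv"], ["send", "recv"]))
```

Whether the real hang has this shape is unconfirmed; the sketch only shows why a mismatch in operation order between two nodes would produce a stall with no tokens generated.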
Environment
- macOS Version: 26.3
- EXO Version: Latest main (007eb8002919182e3c2149c7a089ef8f44ffcab4)
- Hardware:
  - 2 x 512GB M3 Ultra
- Interconnection:
  - TB5 + Ethernet switch (all-to-all)
The annoying part is that it gets stuck while communicating: the GPU shows 100% utilization, and killing exo then leaves the GPU locked.