[BUG] All models get stuck on WARMING UP with pipeline/RDMA #1137

@AlexCheema

Describe the bug

Launching an instance of any model with pipeline and RDMA leaves it stuck on WARMING UP.

To Reproduce

Steps to reproduce the behavior:

  1. Launch an instance of any model with pipeline and RDMA
  2. It will get stuck on WARMING UP

Expected behavior

Instance should pass warm up and reach READY state.

Actual behavior

The instance gets stuck in WARMING UP. The logs show no tokens generated at all, which suggests it is stuck in communication. Perhaps an ordering issue.
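To illustrate what an ordering issue could look like, here is a toy sketch in plain Python. It is not exo code and does not use RDMA. It assumes blocking, rendezvous-style point-to-point transfers between two pipeline ranks. `Channel` and `run_ranks` are hypothetical names made up for this example, and the timeouts exist only so the demo terminates instead of hanging.

```python
import queue
import threading


class Channel:
    """Hypothetical one-way rendezvous channel: send() blocks until the peer calls recv()."""

    def __init__(self):
        self._data = queue.Queue(maxsize=1)
        self._ack = queue.Queue(maxsize=1)

    def send(self, item, timeout):
        self._data.put(item, timeout=timeout)
        # Block until the receiver has actually taken the item.
        self._ack.get(timeout=timeout)

    def recv(self, timeout):
        item = self._data.get(timeout=timeout)
        self._ack.put(None)
        return item


def run_ranks(ops0, ops1, timeout=0.2):
    """Run two ranks with the given op orders; return True only if both finish."""
    ch01, ch10 = Channel(), Channel()  # rank 0 -> rank 1, rank 1 -> rank 0
    finished = {}

    def worker(rank, ops, out_ch, in_ch):
        try:
            for op in ops:
                if op == "send":
                    out_ch.send(f"activations from rank {rank}", timeout)
                else:
                    in_ch.recv(timeout)
            finished[rank] = True
        except (queue.Empty, queue.Full):
            # A real blocking transport would hang here instead of timing out.
            finished[rank] = False

    threads = [
        threading.Thread(target=worker, args=(0, ops0, ch01, ch10)),
        threading.Thread(target=worker, args=(1, ops1, ch10, ch01)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return finished[0] and finished[1]


if __name__ == "__main__":
    # Both ranks send first: each waits for a receive that never happens.
    print("same order:", run_ranks(["send", "recv"], ["send", "recv"]))
    # Complementary order: every send is matched by a receive.
    print("matched order:", run_ranks(["send", "recv"], ["recv", "send"]))
```

If the warm-up path issues its sends and receives in an order like the first case, every rank would wait on its peer before any token is produced, which would be consistent with the empty logs. Whether this is what actually happens here is unconfirmed.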

Environment

  • macOS Version: 26.3
  • EXO Version: Latest main 007eb8002919182e3c2149c7a089ef8f44ffcab4
  • Hardware:
    • 2 x 512GB M3 Ultra
  • Interconnection:
    • TB5 + Ethernet switch (all-to-all)

The annoying part is that it hangs while communicating: the GPU shows 100% utilization, and killing exo then leaves the GPU locked.

Metadata

Labels: bug (Something isn't working)
