mDNS discovery: nodes visible but fail to establish connections #1305

@doogie-bigmack

Description

On a 4-node M3 Ultra Mac Studio cluster, nodes successfully discover each other via mDNS but fail to establish the connections needed for distributed inference.

Environment

  • OS: macOS Sequoia 15.x
  • Hardware: 4x M3 Ultra Mac Studios (192GB RAM each)
  • Network: Same LAN, mDNS enabled
  • Node names: cluster-1, cluster-3, cluster-4, cluster-6

Observed Behavior

  1. Start EXO on all 4 nodes
  2. Nodes appear in each other's peer lists (mDNS discovery works)
  3. When initiating distributed inference, connections between nodes fail or time out
  4. Work is not distributed; only the local node processes the request

Expected Behavior

Once nodes discover each other, they should successfully establish connections and distribute inference workload.

Debug Observations

When investigating the discovery mechanism in exo/networking/discovery.rs, we noticed the TTL configuration:

Duration::from_secs(2_500)  // 2,500 s, about 41.7 minutes

We suspected this might be a typo and should be from_millis(2_500) (2.5 seconds), which is more typical for mDNS refresh intervals. After making this change locally, nodes were able to connect successfully.
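To make the size of the difference concrete, here is a minimal standalone sketch comparing the two constructors. It uses only `std::time::Duration`; the helper `minutes_and_seconds` is a made-up name for illustration and is not part of the EXO codebase.

```rust
use std::time::Duration;

/// Splits a duration into whole minutes and leftover seconds for display.
/// (Hypothetical helper, not taken from exo/networking/discovery.rs.)
fn minutes_and_seconds(d: Duration) -> (u64, u64) {
    let secs = d.as_secs();
    (secs / 60, secs % 60)
}

fn main() {
    // Value as it appears in the source we looked at.
    let as_written = Duration::from_secs(2_500);
    // Value we suspected was intended.
    let suspected = Duration::from_millis(2_500);

    let (m, s) = minutes_and_seconds(as_written);
    println!("from_secs(2_500)   = {m} min {s} s"); // 41 min 40 s
    println!("from_millis(2_500) = {:?}", suspected); // 2.5s

    // The two values differ by a factor of 1000.
    println!("ratio = {}", as_written.as_millis() / suspected.as_millis());
}
```

If the unit really is a typo, a stale record could stay valid about 1000 times longer than intended, though that alone does not prove it is the cause of the failed connections.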

Questions

  1. Is the 2500-second TTL intentional? It seems very long for dynamic peer discovery
  2. Could the long TTL cause stale peer information that breaks connection establishment?
  3. Are there recommended network/firewall settings we should verify?
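On question 3, one check that is independent of mDNS is a plain TCP connect with a timeout from each node to every other node. The sketch below uses only the Rust standard library. The `.local` hostnames and the port number are placeholders (assumptions, not values taken from EXO); substitute whatever address and port the nodes actually listen on.

```rust
use std::net::{SocketAddr, TcpStream, ToSocketAddrs};
use std::time::Duration;

/// Returns true if a TCP connection to `addr` succeeds within `timeout`.
fn can_connect(addr: &SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(addr, timeout).is_ok()
}

/// Resolves `host:port` and probes every address it resolves to.
/// Returns an empty list if resolution fails.
fn probe(host: &str, port: u16, timeout: Duration) -> Vec<(SocketAddr, bool)> {
    match (host, port).to_socket_addrs() {
        Ok(addrs) => addrs.map(|a| (a, can_connect(&a, timeout))).collect(),
        Err(_) => Vec::new(),
    }
}

fn main() {
    // Placeholder values: the hostnames assume the node names resolve as
    // `<name>.local`, and the port is hypothetical.
    let hosts = ["cluster-1.local", "cluster-3.local", "cluster-4.local", "cluster-6.local"];
    let port: u16 = 50_000;
    let timeout = Duration::from_secs(2);

    for host in hosts {
        let results = probe(host, port, timeout);
        if results.is_empty() {
            println!("{host}: could not resolve");
        }
        for (addr, ok) in results {
            println!("{host} -> {addr}: {}", if ok { "reachable" } else { "unreachable" });
        }
    }
}
```

If a peer shows up in the peer list but this probe reports it unreachable, the problem is more likely a firewall or routing issue than the discovery TTL.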

Workaround

Building from source with modified TTL values appeared to resolve the issue, but we want to understand whether this is the actual root cause or whether something else is happening.

Logs

Happy to provide debug logs if you can point me to how to enable verbose mDNS/networking logging.


Related to closed PR #1297 - opening as issue per maintainer request to investigate further.
