Description
Running a 4-node M3 Ultra Mac Studio cluster, nodes successfully discover each other via mDNS but fail to establish actual connections for distributed inference.
Environment
- OS: macOS Sequoia 15.x
- Hardware: 4x M3 Ultra Mac Studios (192GB RAM each)
- Network: Same LAN, mDNS enabled
- Node names: cluster-1, cluster-3, cluster-4, cluster-6
Observed Behavior
- Start EXO on all 4 nodes
- Nodes appear in each other's peer lists (mDNS discovery works)
- When initiating distributed inference, connections between nodes fail or time out
- Work is not distributed; only the local node processes the request
Expected Behavior
Once nodes discover each other, they should successfully establish connections and distribute inference workload.
Debug Observations
When investigating the discovery mechanism in exo/networking/discovery.rs, we noticed the TTL configuration:
Duration::from_secs(2_500) // This equals ~41 minutes
We suspected this might be a typo for from_millis(2_500) (2.5 seconds), which is more typical for mDNS refresh intervals. After making this change locally, nodes were able to connect successfully.
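To make the size of the difference concrete, here is a minimal standalone sketch comparing the two constructors. The `is_stale` helper and the 60-second offline interval are hypothetical and used only for illustration; they are not taken from exo's code, and we have not confirmed that exo expires peers this way.

```rust
use std::time::Duration;

/// Hypothetical helper (not from exo): a peer record counts as stale once
/// the time since it was last seen exceeds the configured TTL.
fn is_stale(since_last_seen: Duration, ttl: Duration) -> bool {
    since_last_seen > ttl
}

fn main() {
    let ttl_secs = Duration::from_secs(2_500); // value currently in the source
    let ttl_millis = Duration::from_millis(2_500); // value we changed it to locally

    println!("from_secs(2_500)   = {:.1} minutes", ttl_secs.as_secs_f64() / 60.0);
    println!("from_millis(2_500) = {:.1} seconds", ttl_millis.as_secs_f64());

    // Assumed scenario: a peer has been unreachable for one minute.
    let gone_for = Duration::from_secs(60);
    println!("stale under from_secs TTL:   {}", is_stale(gone_for, ttl_secs));
    println!("stale under from_millis TTL: {}", is_stale(gone_for, ttl_millis));
}
```

Under this assumed expiry rule, a peer that dropped off a minute ago would still be treated as live with the seconds-based TTL, but would have expired long ago with the milliseconds-based one.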
Questions
- Is the 2500-second TTL intentional? It seems very long for dynamic peer discovery
- Could the long TTL cause stale peer information that breaks connection establishment?
- Are there recommended network/firewall settings we should verify?
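One way we could separate "discovered but unreachable" from a stale-record problem is to probe each advertised address directly. The following is a rough sketch using only the Rust standard library; it assumes the peers accept plain TCP connections on the advertised port, which we have not confirmed for exo. Addresses are passed on the command line, so no particular port is assumed here.

```rust
use std::env;
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Returns true if a TCP connection to `addr` can be opened within `timeout`.
fn peer_reachable(addr: &SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(addr, timeout).is_ok()
}

fn main() {
    // Each argument is expected to be an "ip:port" pair copied from the
    // peer list that discovery reports.
    for arg in env::args().skip(1) {
        match arg.parse::<SocketAddr>() {
            Ok(addr) => {
                let ok = peer_reachable(&addr, Duration::from_secs(2));
                println!("{addr}: {}", if ok { "reachable" } else { "unreachable" });
            }
            Err(err) => eprintln!("skipping {arg}: {err}"),
        }
    }
}
```

If a peer shows up in the discovery list but fails this probe, a firewall rule or an outdated address seems more likely than a problem in the inference layer itself.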
Workaround
Building from source with modified TTL values appeared to resolve the issue, but we want to understand whether this is the actual root cause or whether something else is happening.
Logs
Happy to provide debug logs if you can point me to how to enable verbose mDNS/networking logging.
Related to closed PR #1297 - opening as issue per maintainer request to investigate further.