## Summary
mlx_lm share (#871) is an incredibly useful tool — thank you @angeloskath for creating it. It's saved us enormous amounts of time distributing models across our 5-node Mac Studio cluster. The broadcast capability over RDMA at 5-6 GB/s opens up use cases (dynamic model loading, expert parallelism, hot-swapping) that weren't viable before.
However, we initially missed that it existed because there's no documentation beyond the --help output. We spent time building custom RDMA transfer scripts before discovering mlx_lm share does the same thing 40% faster with built-in broadcast support. Others may be in the same situation.
Below is a proposed guide based on our experience running mlx_lm share extensively on a 5-node M3 Ultra Thunderbolt 5 cluster. Happy to submit this as a PR to mlx_lm/SHARE.md (alongside the existing LORA.md, SERVER.md, etc.) if that's useful, or it can serve as a reference for however you'd like to document it.
## Proposed Documentation
### Quick Start

```shell
# Broadcast a local model to all nodes
mlx_lm share --path /models/Llama-405B-4bit --hostfile cluster.json

# Download from HuggingFace and distribute in one step
mlx_lm share --model mlx-community/Llama-3.1-8B-Instruct-4bit --hostfile cluster.json
```

### Critical: Do Not Use mlx.launch
`mlx_lm share` has its own built-in launcher. Wrapping it in `mlx.launch` causes hangs or silent failures:
```shell
# CORRECT:
mlx_lm share --path ./my-model --hostfile hosts.json

# WRONG — will hang:
mlx.launch --hostfile hosts.json -- python -m mlx_lm.share --path ./my-model
```

This was our most time-consuming mistake to diagnose. A note in the `--help` output or README would save others the same debugging effort.
### Setup: RDMA Mesh (once per boot)
RDMA device names are ephemeral — they change after every reboot. The mesh must be reconfigured:
```shell
mlx.distributed_config \
    --hostfile ethernet-hosts.json \
    --backend jaccl \
    --over thunderbolt \
    --auto-setup \
    --output-hostfile cluster.json
```

Then distribute the hostfile to all nodes (each process reads it locally):
```shell
for host in node1 node2 node3 node4 node5; do
    scp cluster.json ${host}:cluster.json
done
```

Verify RDMA health:

```shell
for host in node1 node2 node3 node4 node5; do
    count=$(ssh $host "ibv_devinfo 2>/dev/null | grep -c PORT_ACTIVE")
    echo "$host: $count active RDMA ports"
done
```

### How It Works
`mlx_lm share` uses `all_sum` to broadcast: rank 0 contributes the file data, other ranks contribute zeros, and the result is the data on every rank. `async_eval` pipelines the RDMA transfer with disk I/O. Because it's a collective, adding receivers doesn't slow the sender — a 5-node broadcast runs at the same speed as a 2-node copy.
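The summation trick is easy to see in a toy pure-Python sketch (illustrative only; the real transfer runs MLX's all_sum collective over RDMA, and the short byte string here stands in for actual weight files):

```python
def all_sum(contributions):
    """Elementwise sum across all ranks (toy stand-in for the collective)."""
    return [sum(vals) for vals in zip(*contributions)]

file_bytes = list(b"model weights")   # rank 0's local file
n_ranks = 5

# Rank 0 contributes the data; every other rank contributes zeros.
contributions = [file_bytes if rank == 0 else [0] * len(file_bytes)
                 for rank in range(n_ranks)]

# data + 0 + 0 + 0 + 0 = data, so every rank ends up with rank 0's bytes
result = all_sum(contributions)
assert bytes(result) == b"model weights"
```

Since the sum is the same regardless of how many zero contributions join it, adding receivers adds no work for the sender.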
### Performance (measured on M3 Ultra, TB5 full mesh, JACCL)
2-node transfer:
| Model Size | Time | Throughput |
|---|---|---|
| 9.5 GB | ~2s | 5.4 GB/s |
| 17 GB | ~3s | 5.5 GB/s |
| 213 GB | 37s | 6.1 GB/s |
5-node broadcast (1 sender → 4 receivers simultaneously):
| Model Size | Time | Throughput/link |
|---|---|---|
| 17 GB | ~4s | 5.3 GB/s |
| 213 GB | 51s | 5.3 GB/s |
For comparison, rsync over the same Thunderbolt link achieves 300-500 MB/s — mlx_lm share is 10-20x faster.
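The 10-20x figure follows from the measurements above (a quick sanity calculation, using the slowest per-link rate measured here):

```python
share_gbs = 5.3                      # slowest measured mlx_lm share link rate
rsync_low, rsync_high = 0.3, 0.5     # rsync's 300-500 MB/s, in GB/s

speedup_vs_fast_rsync = share_gbs / rsync_high   # ~10.6x
speedup_vs_slow_rsync = share_gbs / rsync_low    # ~17.7x
assert 10 < speedup_vs_fast_rsync < 11
assert 17 < speedup_vs_slow_rsync < 18
```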
### Use Cases Unlocked by 5+ GB/s Transfer
At these speeds (comparable to local NVMe reads), several workflows become practical:
- Dynamic model loading — load a 17GB model onto a node on-demand in 3 seconds, enabling request-based scheduling
- Expert Parallelism for MoE — distribute different expert subsets to different nodes at runtime without pre-staging
- Quantization pipelines — quantize on one node, broadcast the result immediately
- Checkpoint distribution — broadcast fine-tuning checkpoints for parallel evaluation
- Hot-swap in production — transfer the new model at 5+ GB/s while the old one keeps serving
### Hostfile Format

```json
{
  "backend": "jaccl",
  "envs": [],
  "hosts": [
    {"ssh": "node1", "ips": ["10.0.0.1", "10.0.0.1"], "rdma": [null, "rdma_en5", "rdma_en3"]},
    {"ssh": "node2", "ips": ["10.0.0.2", "10.0.0.2"], "rdma": ["rdma_en2", null, "rdma_en4"]},
    {"ssh": "node3", "ips": ["10.0.0.3", "10.0.0.3"], "rdma": ["rdma_en3", "rdma_en5", null]}
  ]
}
```

- `hosts[0]` is rank 0 (the sender); all others receive
- `rdma[i]` is the local RDMA device used to reach rank `i`; `null` for self
- Device names are ephemeral — regenerate with `distributed_config` after any reboot
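Because a malformed hostfile tends to surface only as opaque RDMA errors later, a quick structural check can help. This `check_hostfile` helper is hypothetical (not part of mlx_lm); it only verifies the shape described above:

```python
import json

def check_hostfile(path):
    """Check that every host has one rdma entry per rank, with
    null (None) at its own index, as the jaccl backend expects."""
    with open(path) as f:
        cfg = json.load(f)
    hosts = cfg["hosts"]
    n = len(hosts)
    problems = []
    for i, host in enumerate(hosts):
        rdma = host.get("rdma", [])
        if len(rdma) != n:
            problems.append(f"{host['ssh']}: {len(rdma)} rdma entries, expected {n}")
        elif rdma[i] is not None:
            problems.append(f"{host['ssh']}: rdma[{i}] should be null (self)")
    return problems

# Usage: an empty list means the hostfile shape looks sane
# problems = check_hostfile("cluster.json")
```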
### Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| `RTR errno 60 (ETIMEDOUT)` | Mesh not configured or PD exhaustion | Re-run `distributed_config --auto-setup`; if that also fails, reboot the nodes |
| Hangs silently | Called via `mlx.launch` | Call `mlx_lm share` directly |
| `Couldn't allocate protection domain` | PD exhaustion (see below) | Reboot the affected nodes |
| `OSError: Directory not empty` | Destination exists | Delete first, or ignore (the transfer succeeded) |
| `RTR errno 22 (EINVAL)` | Thunderbolt Bridge active | Remove the Thunderbolt interfaces from `bridge0` |
| No progress bar over SSH | tqdm buffered | Use `python -u -m mlx_lm.share` |
### Protection Domain (PD) Exhaustion
Each JACCL session allocates kernel RDMA resources (Protection Domains) that aren't fully released on teardown. After many init/teardown cycles in one boot session, the kernel refuses new allocations.
**Symptoms:** `Couldn't allocate protection domain`, or `RTR errno 60` on previously working operations.

**Recovery:** Reboot the affected nodes. There is no other way to reclaim PDs. After reboot, re-run `distributed_config` and redistribute the hostfile.

**Prevention:** Prefer one N-node broadcast (1 init/teardown cycle) over N-1 sequential 2-node transfers. For heavy usage, schedule periodic reboots.
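The cycle count behind that recommendation, as a toy accounting sketch (how much PD state leaks per cycle isn't documented, so this only counts init/teardown cycles):

```python
def jaccl_cycles(n_nodes, strategy):
    """Count JACCL init/teardown cycles; each one leaks some kernel
    Protection Domain state until the next reboot."""
    if strategy == "broadcast":     # one N-node collective session
        return 1
    if strategy == "sequential":    # N-1 separate 2-node transfers
        return n_nodes - 1
    raise ValueError(strategy)

# Distributing one model across a 5-node cluster:
assert jaccl_cycles(5, "broadcast") == 1
assert jaccl_cycles(5, "sequential") == 4
```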
Happy to submit any of this as a PR if helpful. Thanks again for building mlx_lm share — it's a game-changer for multi-node Apple Silicon clusters.