
Documentation for mlx_lm share — setup, performance, troubleshooting #955

@guruswami-ai

Description

Summary

mlx_lm share (#871) is an incredibly useful tool — thank you @angeloskath for creating it. It's saved us enormous amounts of time distributing models across our 5-node Mac Studio cluster. The broadcast capability over RDMA at 5-6 GB/s opens up use cases (dynamic model loading, expert parallelism, hot-swapping) that weren't viable before.

However, we initially missed that it existed because there's no documentation beyond the --help output. We spent time building custom RDMA transfer scripts before discovering mlx_lm share does the same thing 40% faster with built-in broadcast support. Others may be in the same situation.

Below is a proposed guide based on our experience running mlx_lm share extensively on a 5-node M3 Ultra Thunderbolt 5 cluster. Happy to submit this as a PR to mlx_lm/SHARE.md (alongside the existing LORA.md, SERVER.md, etc.) if that's useful, or it can serve as a reference for however you'd like to document it.


Proposed Documentation

Quick Start

# Broadcast a local model to all nodes
mlx_lm share --path /models/Llama-405B-4bit --hostfile cluster.json

# Download from HuggingFace and distribute in one step
mlx_lm share --model mlx-community/Llama-3.1-8B-Instruct-4bit --hostfile cluster.json

Critical: Do Not Use mlx.launch

mlx_lm share has its own built-in launcher. Wrapping it in mlx.launch causes hangs or silent failures:

# CORRECT:
mlx_lm share --path ./my-model --hostfile hosts.json

# WRONG — will hang:
mlx.launch --hostfile hosts.json -- python -m mlx_lm.share --path ./my-model

This was our most time-consuming mistake to diagnose. A note in the --help output or README would save others the same debugging.

Setup: RDMA Mesh (once per boot)

RDMA device names are ephemeral — they change after every reboot. The mesh must be reconfigured:

mlx.distributed_config \
  --hostfile ethernet-hosts.json \
  --backend jaccl \
  --over thunderbolt \
  --auto-setup \
  --output-hostfile cluster.json

Then distribute the hostfile to all nodes (each process reads it locally):

for host in node1 node2 node3 node4 node5; do
  scp cluster.json ${host}:cluster.json
done

Verify RDMA health:

for host in node1 node2 node3 node4 node5; do
  count=$(ssh $host "ibv_devinfo 2>/dev/null | grep -c PORT_ACTIVE")
  echo "$host: $count active RDMA ports"
done

How It Works

mlx_lm share uses all_sum to broadcast: rank 0 contributes the file data, other ranks contribute zeros, and the result is the data on every rank. async_eval pipelines RDMA transfer with disk I/O. Because it's a collective, adding receivers doesn't slow the sender — a 5-node broadcast runs at the same speed as a 2-node copy.
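The collective trick can be illustrated with a toy in-process simulation (a sketch only — the real implementation uses MLX's distributed all_sum over RDMA across actual nodes, not Python lists in one process):

```python
# Toy in-process simulation of the all_sum broadcast trick.
# mlx_lm share does this with an all_sum collective over RDMA; here
# each "rank" is just a list of byte values in the same process.

def all_sum(contributions):
    """Element-wise sum across all ranks, like a distributed all-reduce."""
    return [sum(vals) for vals in zip(*contributions)]

n_ranks = 5
file_chunk = list(b"model-weights")   # rank 0 holds the file data
zeros = [0] * len(file_chunk)         # ranks 1..4 contribute zeros

contributions = [file_chunk] + [zeros[:] for _ in range(n_ranks - 1)]
result = all_sum(contributions)

# data + zeros + ... + zeros = data, and the collective leaves the
# result on every rank — that's the broadcast.
assert bytes(result) == b"model-weights"
```

Because the zeros cost nothing to generate, the sender's work is the same whether 1 or 4 receivers participate, which is why broadcast time doesn't grow with node count.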

Performance (measured on M3 Ultra, TB5 full mesh, JACCL)

2-node transfer:

| Model Size | Time | Throughput |
|------------|------|------------|
| 9.5 GB     | ~2s  | 5.4 GB/s   |
| 17 GB      | ~3s  | 5.5 GB/s   |
| 213 GB     | 37s  | 6.1 GB/s   |

5-node broadcast (1 sender → 4 receivers simultaneously):

| Model Size | Time | Throughput/link |
|------------|------|-----------------|
| 17 GB      | ~4s  | 5.3 GB/s        |
| 213 GB     | 51s  | 5.3 GB/s        |

For comparison, rsync over the same Thunderbolt link achieves 300-500 MB/s — mlx_lm share is 10-20x faster.

Use Cases Unlocked by 5+ GB/s Transfer

At these speeds (comparable to local NVMe reads), several workflows become practical:

  • Dynamic model loading — load a 17GB model onto a node on-demand in 3 seconds, enabling request-based scheduling
  • Expert Parallelism for MoE — distribute different expert subsets to different nodes at runtime without pre-staging
  • Quantization pipelines — quantize on one node, broadcast the result immediately
  • Checkpoint distribution — broadcast fine-tuning checkpoints for parallel evaluation
  • Hot-swap in production — transfer the new model at 5+ GB/s while the old one keeps serving

Hostfile Format

{
  "backend": "jaccl",
  "envs": [],
  "hosts": [
    {"ssh": "node1", "ips": ["10.0.0.1", "10.0.0.1"], "rdma": [null, "rdma_en5", "rdma_en3"]},
    {"ssh": "node2", "ips": ["10.0.0.2", "10.0.0.2"], "rdma": ["rdma_en2", null, "rdma_en4"]},
    {"ssh": "node3", "ips": ["10.0.0.3", "10.0.0.3"], "rdma": ["rdma_en3", "rdma_en5", null]}
  ]
}
  • hosts[0] is rank 0 (sender), all others receive
  • rdma[i] = local RDMA device to reach rank i; null for self
  • Device names are ephemeral — regenerate with distributed_config after any reboot
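Since a malformed hostfile is an easy way to hit opaque RTR errors, the structural rules above can be machine-checked. The following is a hypothetical helper (not part of mlx) that verifies each rank's rdma list has one entry per rank, with null exactly at its own index:

```python
# Hypothetical hostfile sanity check -- not part of mlx.
import json

def check_hostfile(text):
    """Validate the rdma mesh structure of a JACCL hostfile.
    Returns the number of ranks if the structure is consistent."""
    cfg = json.loads(text)
    hosts = cfg["hosts"]
    n = len(hosts)
    for rank, host in enumerate(hosts):
        rdma = host["rdma"]
        assert len(rdma) == n, f"rank {rank}: expected {n} rdma entries, got {len(rdma)}"
        assert rdma[rank] is None, f"rank {rank}: rdma[{rank}] must be null (self)"
    return n

example = """
{
  "backend": "jaccl",
  "envs": [],
  "hosts": [
    {"ssh": "node1", "ips": ["10.0.0.1"], "rdma": [null, "rdma_en5", "rdma_en3"]},
    {"ssh": "node2", "ips": ["10.0.0.2"], "rdma": ["rdma_en2", null, "rdma_en4"]},
    {"ssh": "node3", "ips": ["10.0.0.3"], "rdma": ["rdma_en3", "rdma_en5", null]}
  ]
}
"""
print(check_hostfile(example))
```

Running this after every distributed_config regeneration catches a stale or truncated hostfile before it manifests as a hang or timeout.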

Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| RTR errno 60 (ETIMEDOUT) | Mesh not configured, or PD exhaustion | Re-run distributed_config --auto-setup; if that also fails, reboot the nodes |
| Hangs silently | Called via mlx.launch | Call mlx_lm share directly |
| Couldn't allocate protection domain | PD exhaustion (see below) | Reboot the affected nodes |
| OSError: Directory not empty | Destination exists | Delete it first, or ignore (the transfer succeeded) |
| RTR errno 22 (EINVAL) | Thunderbolt Bridge active | Remove TB interfaces from bridge0 |
| No progress bar over SSH | tqdm output buffered | Run python -u -m mlx_lm.share |

Protection Domain (PD) Exhaustion

Each JACCL session allocates kernel RDMA resources (Protection Domains) that aren't fully released on teardown. After many init/teardown cycles in one boot session, the kernel refuses new allocations.

Symptoms: Couldn't allocate protection domain, or RTR errno 60 on previously working operations.

Recovery: Reboot the affected nodes. There is no other way to reclaim PDs. After reboot, re-run distributed_config and redistribute the hostfile.

Prevention: Prefer one N-node broadcast (1 init/teardown cycle) over N-1 sequential 2-node transfers. For heavy usage, schedule periodic reboots.


Happy to submit any of this as a PR if helpful. Thanks again for building mlx_lm share — it's a game-changer for multi-node Apple Silicon clusters.
