## Summary
mlx_lm share (#871) is an incredibly useful tool — thank you @angeloskath for creating it. It's saved us enormous amounts of time distributing models across our 5-node Mac Studio cluster. The broadcast capability over RDMA at 5-6 GB/s opens up use cases (dynamic model loading, expert parallelism, hot-swapping) that weren't viable before.
However, we initially missed that it existed because there's no documentation beyond the --help output. We spent time building custom RDMA transfer scripts before discovering mlx_lm share does the same thing 40% faster with built-in broadcast support. Others may be in the same situation.
Below is a proposed guide based on our experience running mlx_lm share extensively on a 5-node M3 Ultra Thunderbolt 5 cluster. Happy to submit this as a PR to mlx_lm/SHARE.md (alongside the existing LORA.md, SERVER.md, etc.) if that's useful, or it can serve as a reference for however you'd like to document it.
## Proposed Documentation
### Quick Start

```shell
# Broadcast a local model to all nodes
mlx_lm share --path /models/Llama-405B-4bit --hostfile cluster.json

# Download from HuggingFace and distribute in one step
mlx_lm share --model mlx-community/Llama-3.1-8B-Instruct-4bit --hostfile cluster.json
```

### Critical: Do Not Use mlx.launch
`mlx_lm share` has its own built-in launcher. Wrapping it in `mlx.launch` causes hangs or silent failures:
```shell
# CORRECT:
mlx_lm share --path ./my-model --hostfile hosts.json

# WRONG — will hang:
mlx.launch --hostfile hosts.json -- python -m mlx_lm.share --path ./my-model
```

This was our most time-consuming mistake to diagnose. A note in the `--help` output or README would save others the same debugging effort.
### Setup: RDMA Mesh (once per boot)
RDMA device names are ephemeral — they change after every reboot. The mesh must be reconfigured:
```shell
mlx.distributed_config \
    --hostfile ethernet-hosts.json \
    --backend jaccl \
    --over thunderbolt \
    --auto-setup \
    --output-hostfile cluster.json
```

Then distribute the hostfile to all nodes (each process reads it locally):
```shell
for host in node1 node2 node3 node4 node5; do
    scp cluster.json ${host}:cluster.json
done
```

Verify RDMA health:

```shell
for host in node1 node2 node3 node4 node5; do
    count=$(ssh $host "ibv_devinfo 2>/dev/null | grep -c PORT_ACTIVE")
    echo "$host: $count active RDMA ports"
done
```

### How It Works
`mlx_lm share` uses `all_sum` to broadcast: rank 0 contributes the file data, other ranks contribute zeros, and the result is the data on every rank. `async_eval` pipelines the RDMA transfer with disk I/O. Because it's a collective, adding receivers doesn't slow the sender — a 5-node broadcast runs at the same speed as a 2-node copy.
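The summation trick is easy to see in a toy pure-Python sketch (illustrative only; the real transfer runs MLX's all_sum collective over RDMA, and the short byte string here stands in for actual weight files):

```python
def all_sum(contributions):
    """Elementwise sum across all ranks (toy stand-in for the collective)."""
    return [sum(vals) for vals in zip(*contributions)]

file_bytes = list(b"model weights")   # rank 0's local file
n_ranks = 5

# Rank 0 contributes the data; every other rank contributes zeros.
contributions = [file_bytes if rank == 0 else [0] * len(file_bytes)
                 for rank in range(n_ranks)]

# data + 0 + 0 + 0 + 0 = data, so every rank ends up with rank 0's bytes
result = all_sum(contributions)
assert bytes(result) == b"model weights"
```

Since the sum is the same regardless of how many zero contributions join it, adding receivers adds no work for the sender.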
### Performance (measured on M3 Ultra, TB5 full mesh, JACCL)
2-node transfer:
| Model Size | Time | Throughput |
|---|---|---|
| 9.5 GB | ~2s | 5.4 GB/s |
| 17 GB | ~3s | 5.5 GB/s |
| 213 GB | 37s | 6.1 GB/s |
5-node broadcast (1 sender → 4 receivers simultaneously):
| Model Size | Time | Throughput/link |
|---|---|---|
| 17 GB | ~4s | 5.3 GB/s |
| 213 GB | 51s | 5.3 GB/s |
For comparison, rsync over the same Thunderbolt link achieves 300-500 MB/s — mlx_lm share is 10-20x faster.
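The 10-20x figure follows from the measurements above (a quick sanity calculation, using the slowest per-link rate measured here):

```python
share_gbs = 5.3                      # slowest measured mlx_lm share link rate
rsync_low, rsync_high = 0.3, 0.5     # rsync's 300-500 MB/s, in GB/s

speedup_vs_fast_rsync = share_gbs / rsync_high   # ~10.6x
speedup_vs_slow_rsync = share_gbs / rsync_low    # ~17.7x
assert 10 < speedup_vs_fast_rsync < 11
assert 17 < speedup_vs_slow_rsync < 18
```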
### Use Cases Unlocked by 5+ GB/s Transfer
At these speeds (comparable to local NVMe reads), several workflows become practical:
- Dynamic model loading — load a 17GB model onto a node on-demand in 3 seconds, enabling request-based scheduling
- Expert Parallelism for MoE — distribute different expert subsets to different nodes at runtime without pre-staging
- Quantization pipelines — quantize on one node, broadcast the result immediately
- Checkpoint distribution — broadcast fine-tuning checkpoints for parallel evaluation
- Hot-swap in production — transfer the new model at 5+ GB/s while the old one keeps serving
### Hostfile Format

```json
{
  "backend": "jaccl",
  "envs": [],
  "hosts": [
    {"ssh": "node1", "ips": ["10.0.0.1", "10.0.0.1"], "rdma": [null, "rdma_en5", "rdma_en3"]},
    {"ssh": "node2", "ips": ["10.0.0.2", "10.0.0.2"], "rdma": ["rdma_en2", null, "rdma_en4"]},
    {"ssh": "node3", "ips": ["10.0.0.3", "10.0.0.3"], "rdma": ["rdma_en3", "rdma_en5", null]}
  ]
}
```

- `hosts[0]` is rank 0 (the sender); all others receive
- `rdma[i]` is the local RDMA device used to reach rank `i`; `null` for self
- Device names are ephemeral — regenerate with `distributed_config` after any reboot
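Because a malformed hostfile tends to surface only as opaque RDMA errors later, a quick structural check can help. This `check_hostfile` helper is hypothetical (not part of mlx_lm); it only verifies the shape described above:

```python
import json

def check_hostfile(path):
    """Check that every host has one rdma entry per rank, with
    null (None) at its own index, as the jaccl backend expects."""
    with open(path) as f:
        cfg = json.load(f)
    hosts = cfg["hosts"]
    n = len(hosts)
    problems = []
    for i, host in enumerate(hosts):
        rdma = host.get("rdma", [])
        if len(rdma) != n:
            problems.append(f"{host['ssh']}: {len(rdma)} rdma entries, expected {n}")
        elif rdma[i] is not None:
            problems.append(f"{host['ssh']}: rdma[{i}] should be null (self)")
    return problems

# Usage: an empty list means the hostfile shape looks sane
# problems = check_hostfile("cluster.json")
```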
### Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| `RTR errno 60 (ETIMEDOUT)` | Mesh not configured or PD exhaustion | Re-run `distributed_config --auto-setup`; if that also fails, reboot the nodes |
| Hangs silently | Called via `mlx.launch` | Call `mlx_lm share` directly |
| `Couldn't allocate protection domain` | PD exhaustion (see below) | Reboot the affected nodes |
| `OSError: Directory not empty` | Destination exists | Delete first, or ignore (the transfer succeeded) |
| `RTR errno 22 (EINVAL)` | Thunderbolt Bridge active | Remove the Thunderbolt interfaces from `bridge0` |
| No progress bar over SSH | tqdm buffered | Use `python -u -m mlx_lm.share` |
### Protection Domain (PD) Exhaustion
Each JACCL session allocates kernel RDMA resources (Protection Domains) that aren't fully released on teardown. After many init/teardown cycles in one boot session, the kernel refuses new allocations.
**Symptoms:** `Couldn't allocate protection domain`, or `RTR errno 60` on previously working operations.

**Recovery:** Reboot the affected nodes. There is no other way to reclaim PDs. After reboot, re-run `distributed_config` and redistribute the hostfile.

**Prevention:** Prefer one N-node broadcast (1 init/teardown cycle) over N-1 sequential 2-node transfers. For heavy usage, schedule periodic reboots.
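The cycle count behind that recommendation, as a toy accounting sketch (how much PD state leaks per cycle isn't documented, so this only counts init/teardown cycles):

```python
def jaccl_cycles(n_nodes, strategy):
    """Count JACCL init/teardown cycles; each one leaks some kernel
    Protection Domain state until the next reboot."""
    if strategy == "broadcast":     # one N-node collective session
        return 1
    if strategy == "sequential":    # N-1 separate 2-node transfers
        return n_nodes - 1
    raise ValueError(strategy)

# Distributing one model across a 5-node cluster:
assert jaccl_cycles(5, "broadcast") == 1
assert jaccl_cycles(5, "sequential") == 4
```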
Happy to submit any of this as a PR if helpful. Thanks again for building mlx_lm share — it's a game-changer for multi-node Apple Silicon clusters.