Guide: RDMA file transfer over Thunderbolt 5 with JACCL (3.5+ GB/s) #3207

@guruswami-ai


RDMA File Transfer over Thunderbolt 5 with JACCL

We built a high-speed file transfer tool using MLX's distributed primitives over RDMA/Thunderbolt 5, achieving 3.5–3.8 GB/s sustained throughput — roughly 23× faster than rsync over the same link. Sharing the working solution and the critical workarounds we discovered, since there's no existing documentation for this use case.

Tested on Mac Studio M3 Ultra (macOS 26.3.1) with MLX 0.30+ and JACCL backend.

The Problem

JACCL provides RDMA over Thunderbolt 5, but mx.distributed.send/recv is designed for tensor operations during training and inference, not file I/O. Using it naively for file transfer hits several issues that crash or hang the process. After extensive debugging, we found three critical workarounds and one macOS-level fix that are required.

Results

[RDMA-DIR][rank=0] DONE: 16 files, 49.63 GB in 13.0s (3.80 GB/s)
Method                  Speed            235 GB transfer
RDMA (this approach)    3.5–3.8 GB/s     ~67 seconds
rsync over 10GbE        ~100–200 MB/s    ~30–40 minutes
rsync over TB5 IP       ~300–500 MB/s    ~10–15 minutes

We distributed 705 GB of quantized LLM weights across 3 nodes in ~3.5 minutes using sequential RDMA transfers between directly connected pairs.
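A rough sanity check on that figure, assuming an even three-way shard and two sequential point-to-point hops from the source node (the split and hop count are assumptions, not stated above):

```python
# Back-of-the-envelope check for the 705 GB / 3-node distribution.
total_gb = 705
nodes = 3
throughput = 3.5  # GB/s, lower end of the measured RDMA range

per_node_gb = total_gb / nodes   # each peer receives one shard (assumed even split)
hops = nodes - 1                 # sequential pair-wise transfers from the source
transfer_s = hops * per_node_gb / throughput

print(f"{per_node_gb:.0f} GB per node, ~{transfer_s:.0f}s of pure wire time")
```

That is roughly 134 s of raw transfer time; the gap to the observed ~3.5 minutes is plausibly per-file handshaking and session setup.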


Critical Workaround 1: Use stream=mx.cpu for All Send/Recv

Symptom: Receiver crashes with [METAL] Command buffer execution failed: Caused GPU Timeout Error. Sender reports high throughput but receiver writes zero files.

Root cause: By default, mx.distributed.send/recv are scheduled on the GPU stream. Metal has an unconfigurable command buffer timeout (~5 seconds). RDMA recv operations that take longer than this trigger the timeout and kill the process. The sender doesn't crash because send operations complete quickly once data is in the RDMA buffer, but the receiver must wait for data to arrive.

Fix: Pass stream=mx.cpu on every send() and recv() call:

s = mx.cpu

# Sender
mx.eval(mx.distributed.send(data, peer, group=group, stream=s))

# Receiver
result = mx.distributed.recv(shape, dtype, peer, group=group, stream=s)
mx.eval(result)

This bypasses Metal entirely. The RDMA transfer still happens at full hardware speed — the CPU stream just means MLX doesn't create a Metal command buffer for the operation.

What doesn't work:

  • Smaller chunks (16 MB, 4 MB) — timeout still triggers
  • mx.eval() after every recv — can deadlock with the sender
  • MLX_METAL_FAST_SYNCH=0 — no effect
  • Converting to numpy immediately — crash happens before conversion

Related: #3142

Critical Workaround 2: Init Barrier After distributed.init()

Symptom: First send/recv after init returns garbage data or wrong shapes. Subsequent operations may hang or produce corrupted results.

Root cause: JACCL needs all ranks to synchronize after initialization before point-to-point operations work reliably. Without a barrier, one rank may start sending before the other has finished setting up its RDMA resources.

Fix: Run an all_sum barrier immediately after init, before any send/recv:

world = mx.distributed.init()

# REQUIRED: init barrier. Must use mx.ones(10), NOT mx.ones(1).
barrier = mx.distributed.all_sum(mx.ones(10), group=world, stream=mx.cpu)
mx.eval(barrier)

Important: Use mx.ones(10), not mx.ones(1). The single-element variant doesn't reliably trigger the full synchronization path in JACCL.

Related: #3149

Critical Workaround 3: Single JACCL Session (PD Exhaustion)

Symptom: After ~60 successful file transfers, JACCL fails with Couldn't allocate protection domain.

Root cause: Each mx.distributed.init() / teardown cycle allocates and (incompletely) releases RDMA protection domains. The kernel has a hard limit. An approach that launches a separate JACCL session per file hits this limit quickly.

Fix: Transfer all files in a single JACCL session. One init, all transfers, one teardown.

Recovery: If you hit PD exhaustion, the only fix is to reboot the affected node. There is no way to release protection domains without a reboot.

macOS Thunderbolt Bridge (Required Fix)

macOS creates a "Thunderbolt Bridge" network service that absorbs all TB ports into a single bridge0 interface. This completely breaks RDMA — JACCL cannot address individual Thunderbolt ports through a bridge.

Symptom: Every RDMA device combination fails with RTR errno 22 (EINVAL) during queue pair setup.

Critical: macOS re-enables Thunderbolt Bridge on every reboot and OS update. You must run the fix after every boot.

#!/bin/bash
# Fix Thunderbolt Bridge for RDMA — create individual TB port services.
# Run as root: sudo bash fix-tb-bridge.sh
set -euo pipefail

PREFS="/Library/Preferences/SystemConfiguration/preferences.plist"

# 1. Destroy bridge0
if ifconfig bridge0 &>/dev/null; then
  ifconfig bridge0 | awk '/member/ {print $2}' | while read -r m; do
    ifconfig bridge0 deletem "$m" 2>/dev/null || true
  done
  ifconfig bridge0 destroy 2>/dev/null || true
fi

# 2. Remove bridge from system preferences
/usr/libexec/PlistBuddy \
    -c "Delete :VirtualNetworkInterfaces:Bridge:bridge0" "$PREFS" 2>/dev/null || true

# 3. Create individual network services per Thunderbolt port
networksetup -listallhardwareports \
    | awk -F': ' '/Hardware Port: Thunderbolt [0-9]/ {print $2}' \
    | while read -r port; do
  svc="RDMA $port"
  if ! networksetup -listallnetworkservices | grep -q "$svc"; then
    networksetup -createnetworkservice "$svc" "$port" 2>/dev/null || true
  fi
  networksetup -setv6automatic "$svc" 2>/dev/null || true
done

# 4. Disable the bridge service
networksetup -setnetworkserviceenabled "Thunderbolt Bridge" off 2>/dev/null || true

# 5. Verify
ibv_devinfo 2>/dev/null | grep -E 'hca_id|state'

After running this, ibv_devinfo should show individual RDMA devices with PORT_ACTIVE state for each connected Thunderbolt cable.


Complete Transfer Script

A working script that transfers an entire directory in a single JACCL session with all workarounds applied:

"""
Single-session RDMA directory transfer via JACCL.

Uses CPU stream for all send/recv to avoid Metal GPU timeouts (#3142).
Uses all_sum barrier after init to avoid shape corruption (#3149).

Usage:
  mlx.launch --hostfile hostfile.json --backend jaccl -- \
    rdma_dir_transfer.py /source/dir /dest/dir
"""

import gc
import os
import pathlib
import struct
import sys
import time

import mlx.core as mx

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB


def log(msg):
    r = mx.distributed.init().rank()
    print(f"[RDMA-DIR][rank={r}] {msg}", flush=True)


def send_directory(src_dir, peer, group):
    files = sorted(f for f in src_dir.iterdir() if f.is_file())

    s = mx.cpu
    cnt = mx.distributed.send(mx.array([len(files)], dtype=mx.int64), peer, group=group, stream=s)
    mx.eval(cnt)

    total = 0
    t0 = time.time()

    for i, fp in enumerate(files):
        sz = fp.stat().st_size
        nm = fp.name.encode()

        hdr = mx.array(list(struct.pack("<QQ", len(nm), sz)), dtype=mx.uint8)
        mx.eval(mx.distributed.send(hdr, peer, group=group, stream=s))
        mx.eval(mx.distributed.send(mx.array(list(nm), dtype=mx.uint8), peer, group=group, stream=s))

        if sz > 0:
            with fp.open("rb") as f:
                sent = 0
                while sent < sz:
                    chunk = f.read(CHUNK_SIZE)
                    if not chunk:
                        break
                    arr = mx.array(memoryview(chunk), dtype=mx.uint8)
                    mx.eval(mx.distributed.send(arr, peer, group=group, stream=s))
                    sent += len(chunk)

        total += sz
        el = time.time() - t0
        spd = total / el / 1e9 if el > 0 else 0
        log(f"  [{i+1}/{len(files)}] {fp.name} ({sz/1e6:.1f} MB) {spd:.2f} GB/s")

    el = time.time() - t0
    spd = total / el / 1e9 if el > 0 else 0
    log(f"DONE: {len(files)} files, {total/1e9:.2f} GB in {el:.1f}s ({spd:.2f} GB/s)")


def recv_directory(dst_dir, peer, group):
    dst_dir.mkdir(parents=True, exist_ok=True)
    s = mx.cpu

    cnt = mx.distributed.recv((1,), mx.int64, peer, group=group, stream=s)
    mx.eval(cnt)
    file_count = int(cnt[0].item())
    log(f"Expecting {file_count} files")

    total = 0
    t0 = time.time()

    for i in range(file_count):
        hdr = mx.distributed.recv((16,), mx.uint8, peer, group=group, stream=s)
        mx.eval(hdr)
        nlen, sz = struct.unpack("<QQ", bytes(memoryview(hdr)))

        nm = mx.distributed.recv((nlen,), mx.uint8, peer, group=group, stream=s)
        mx.eval(nm)
        name = bytes(memoryview(nm)).decode()

        path = dst_dir / name

        if sz > 0:
            fd = os.open(str(path), os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            try:
                received = 0
                while received < sz:
                    csz = min(CHUNK_SIZE, sz - received)
                    arr = mx.distributed.recv((csz,), mx.uint8, peer, group=group, stream=s)
                    mx.eval(arr)
                    os.write(fd, memoryview(arr))
                    received += csz
                    del arr
                os.fsync(fd)
            finally:
                os.close(fd)
        else:
            path.touch()

        total += sz
        el = time.time() - t0
        spd = total / el / 1e9 if el > 0 else 0
        log(f"  [{i+1}/{file_count}] {name} ({sz/1e6:.1f} MB) {spd:.2f} GB/s")

    el = time.time() - t0
    spd = total / el / 1e9 if el > 0 else 0
    log(f"DONE: {file_count} files, {total/1e9:.2f} GB in {el:.1f}s ({spd:.2f} GB/s)")


def main():
    if len(sys.argv) != 3:
        print(f"Usage: {sys.argv[0]} SRC_DIR DST_DIR", file=sys.stderr)
        sys.exit(1)

    src = pathlib.Path(sys.argv[1])
    dst = pathlib.Path(sys.argv[2])

    world = mx.distributed.init()
    if world.size() < 2:
        print("ERROR: need at least 2 ranks", file=sys.stderr)
        sys.exit(1)

    # Init barrier — required for JACCL (#3149). Must use mx.ones(10), not mx.ones(1).
    barrier = mx.distributed.all_sum(mx.ones(10), group=world, stream=mx.cpu)
    mx.eval(barrier)

    try:
        if world.rank() == 0:
            if not src.is_dir():
                print(f"ERROR: Source not found: {src}", file=sys.stderr)
                sys.exit(1)
            send_directory(src, 1, world)
        elif world.rank() == 1:
            recv_directory(dst, 0, world)
    except Exception as e:
        if "protection domain" in str(e).lower():
            log(f"PD EXHAUSTION: {e}")
            sys.exit(2)
        log(f"ERROR: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)

    gc.collect()


if __name__ == "__main__":
    main()

Wire Protocol

Sender (rank 0)                    Receiver (rank 1)
─────────────────                  ──────────────────
send(file_count: int64)     →      recv(file_count)

For each file:
  send(header: 16 bytes)    →      recv(header)
    [name_len: uint64]                [parse name_len, file_size]
    [file_size: uint64]
  send(filename: uint8[])   →      recv(filename)
  For each 64MB chunk:
    send(chunk: uint8[])    →      recv(chunk) → write to disk
                                   fsync()

Header is packed as struct.pack("<QQ", name_len, file_size) — two little-endian 64-bit unsigned integers. The last chunk of each file may be smaller than CHUNK_SIZE.
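The header framing can be exercised standalone with nothing but the standard library (the filename and size below are made-up examples):

```python
import struct

CHUNK_SIZE = 64 * 1024 * 1024  # matches the transfer script

# Made-up example file: a ~5 GB weight shard.
name = "model-00001-of-00016.safetensors".encode()
file_size = 4_997_000_000

# Sender side: fixed 16-byte header, then the variable-length filename.
header = struct.pack("<QQ", len(name), file_size)
assert len(header) == 16

# Receiver side: parse the header, then derive the chunk schedule.
name_len, size = struct.unpack("<QQ", header)
full_chunks, tail = divmod(size, CHUNK_SIZE)
print(f"name_len={name_len}, size={size}: "
      f"{full_chunks} full chunks + one {tail}-byte tail chunk")
```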

Troubleshooting

Error                                 Cause                                   Fix
RTR errno 22 (EINVAL)                 Thunderbolt Bridge active               Run fix-tb-bridge.sh
RTR errno 16 (EBUSY)                  Stale RDMA resources from a crash       Reboot the node
Caused GPU Timeout Error              Missing stream=mx.cpu                   Add the CPU stream to all send/recv
Couldn't allocate protection domain   Too many init/teardown cycles           Reboot; use a single session
Shape corruption on first recv        Missing init barrier                    Add all_sum(mx.ones(10)) after init
ibv_devinfo shows PORT_DOWN           No cable / wrong port / bridge active   Check cables, run the fix script

Hostfile Notes

RDMA device names (rdma_en2, rdma_en5, etc.) and TB5 interface IPs are dynamically assigned every time mlx.distributed configures the mesh. They change after every reboot or mesh reconfiguration. The only stable identifiers are the 10GbE IPs in the hostfile's ips field. The hostfile must be present on all participating nodes — not just the launcher.
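For reference, a minimal two-node hostfile sketch. The hostnames and IPs are placeholders, and the "ssh" key name is an assumption based on mlx.launch conventions; only the ips field is confirmed above, so verify the exact schema against your MLX version:

```python
import json

# Hypothetical hostfile.json contents. "ssh" is an assumed key; "ips" holds
# the stable 10GbE addresses, NOT the dynamically assigned TB5 interface IPs.
hosts = [
    {"ssh": "studio-0.local", "ips": ["10.0.0.10"]},
    {"ssh": "studio-1.local", "ips": ["10.0.0.11"]},
]

print(json.dumps(hosts, indent=2))
```

Remember to copy the same file to every participating node, not just the launcher.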

Limitations

  • Point-to-point only: Rank 0 sends, rank 1 receives. For multi-node distribution, run sequential transfers between directly connected pairs.
  • Flat directories only: Does not recurse into subdirectories.
  • No resume: If interrupted, re-run the entire directory. Destination files are opened with O_TRUNC, so a re-run overwrites any partial file from an interrupted transfer.
  • Reboot required for PD exhaustion: No programmatic recovery.

Hardware

  • 5× Mac Studio M3 Ultra (512 GB unified memory)
  • Thunderbolt 5 full mesh (4 cables per node, direct links between all pairs)
  • ibv_devinfo reports 8X width at 10.0 Gbps per lane = 80 Gbps per port
  • Measured ~37% utilization of theoretical single-port bandwidth
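The utilization figure follows directly from the link parameters reported by ibv_devinfo:

```python
# Utilization check: 8 lanes x 10.0 Gbps = 80 Gbps per Thunderbolt 5 port.
lanes = 8
gbps_per_lane = 10.0
link_gbps = lanes * gbps_per_lane    # 80 Gbps theoretical single-port bandwidth

measured_gbs = 3.7                   # GB/s, mid-range of the 3.5-3.8 measured
measured_gbps = measured_gbs * 8     # bytes/s -> bits/s

utilization = measured_gbps / link_gbps
print(f"{measured_gbps:.1f} Gbps on an {link_gbps:.0f} Gbps link = {utilization:.0%}")
```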

Hope this helps others working with MLX distributed and Thunderbolt 5!
