RDMA File Transfer over Thunderbolt 5 with JACCL
We built a high-speed file transfer tool using MLX's distributed primitives over RDMA/Thunderbolt 5, achieving 3.5–3.8 GB/s sustained throughput — roughly 23× faster than rsync over the same link. Sharing the working solution and the critical workarounds we discovered, since there's no existing documentation for this use case.
Tested on Mac Studio M3 Ultra (macOS 26.3.1) with MLX 0.30+ and JACCL backend.
The Problem
JACCL provides RDMA over Thunderbolt 5, but mx.distributed.send/recv is designed for tensor operations during training and inference, not file I/O. Using it naively for file transfer hits several issues that crash or hang the process. After extensive debugging, we found three critical workarounds and one macOS-level fix that are required.
Results
```
[RDMA-DIR][rank=0] DONE: 16 files, 49.63 GB in 13.0s (3.80 GB/s)
```
| Method | Speed | 235 GB Transfer |
|---|---|---|
| RDMA (this approach) | 3.5–3.8 GB/s | ~67 seconds |
| rsync over 10GbE | ~100–200 MB/s | ~30–40 minutes |
| rsync over TB5 IP | ~300–500 MB/s | ~10–15 minutes |
We distributed 705 GB of quantized LLM weights across 3 nodes in ~3.5 minutes using sequential RDMA transfers between directly-connected pairs.
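The timing claims above can be sanity-checked with simple division; this is a quick sketch using only the figures already quoted in this section:

```python
# Sanity-check the timing claims with simple division.
# All figures come from the Results section above.
dataset_gb = 235
rdma_low_gbs = 3.5                  # low end of measured RDMA throughput
t_rdma = dataset_gb / rdma_low_gbs  # ~67 s, matching the table

# 705 GB distributed in ~3.5 min of sequential pair transfers:
effective_gbs = 705 / (3.5 * 60)    # ~3.36 GB/s sustained, consistent with the measured range
print(f"{t_rdma:.0f} s, {effective_gbs:.2f} GB/s")
```

Note that the sequential multi-node distribution sustains essentially the same per-link throughput as a single pair, since each hop is a dedicated point-to-point TB5 link.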
Critical Workaround 1: Use stream=mx.cpu for All Send/Recv
Symptom: Receiver crashes with `[METAL] Command buffer execution failed: Caused GPU Timeout Error`. Sender reports high throughput but receiver writes zero files.
Root cause: By default, mx.distributed.send/recv are scheduled on the GPU stream. Metal has an unconfigurable command buffer timeout (~5 seconds). RDMA recv operations that take longer than this trigger the timeout and kill the process. The sender doesn't crash because send operations complete quickly once data is in the RDMA buffer, but the receiver must wait for data to arrive.
Fix: Pass `stream=mx.cpu` on every `send()` and `recv()` call:

```python
s = mx.cpu

# Sender
mx.eval(mx.distributed.send(data, peer, group=group, stream=s))

# Receiver
result = mx.distributed.recv(shape, dtype, peer, group=group, stream=s)
mx.eval(result)
```

This bypasses Metal entirely. The RDMA transfer still happens at full hardware speed — the CPU stream just means MLX doesn't create a Metal command buffer for the operation.
What doesn't work:
- Smaller chunks (16 MB, 4 MB) — timeout still triggers
- `mx.eval()` after every recv — can deadlock with the sender
- `MLX_METAL_FAST_SYNCH=0` — no effect
- Converting to numpy immediately — crash happens before conversion
Related: #3142
Critical Workaround 2: Init Barrier After distributed.init()
Symptom: First send/recv after init returns garbage data or wrong shapes. Subsequent operations may hang or produce corrupted results.
Root cause: JACCL needs all ranks to synchronize after initialization before point-to-point operations work reliably. Without a barrier, one rank may start sending before the other has finished setting up its RDMA resources.
Fix: Run an all_sum barrier immediately after init, before any send/recv:
```python
world = mx.distributed.init()

# REQUIRED: init barrier. Must use mx.ones(10), NOT mx.ones(1).
barrier = mx.distributed.all_sum(mx.ones(10), group=world, stream=mx.cpu)
mx.eval(barrier)
```

Important: Use `mx.ones(10)`, not `mx.ones(1)`. The single-element variant doesn't reliably trigger the full synchronization path in JACCL.
Related: #3149
Critical Workaround 3: Single JACCL Session (PD Exhaustion)
Symptom: After ~60 successful file transfers, JACCL fails with `Couldn't allocate protection domain`.
Root cause: Each mx.distributed.init() / teardown cycle allocates and (incompletely) releases RDMA protection domains. The kernel has a hard limit. An approach that launches a separate JACCL session per file hits this limit quickly.
Fix: Transfer all files in a single JACCL session. One init, all transfers, one teardown.
Recovery: If you hit PD exhaustion, the only fix is to reboot the affected node. There is no way to release protection domains without a reboot.
macOS Thunderbolt Bridge (Required Fix)
macOS creates a "Thunderbolt Bridge" network service that absorbs all TB ports into a single bridge0 interface. This completely breaks RDMA — JACCL cannot address individual Thunderbolt ports through a bridge.
Symptom: Every RDMA device combination fails with `RTR errno 22 (EINVAL)` during queue pair setup.
Critical: macOS re-enables Thunderbolt Bridge on every reboot and OS update. You must run the fix after every boot.
```bash
#!/bin/bash
# Fix Thunderbolt Bridge for RDMA — create individual TB port services.
# Run as root: sudo bash fix-tb-bridge.sh
set -euo pipefail

PREFS="/Library/Preferences/SystemConfiguration/preferences.plist"

# 1. Destroy bridge0
if ifconfig bridge0 &>/dev/null; then
  ifconfig bridge0 | awk '/member/ {print $2}' | while read m; do
    ifconfig bridge0 deletem "$m" 2>/dev/null || true
  done
  ifconfig bridge0 destroy 2>/dev/null || true
fi

# 2. Remove bridge from system preferences
/usr/libexec/PlistBuddy \
  -c "Delete :VirtualNetworkInterfaces:Bridge:bridge0" "$PREFS" 2>/dev/null || true

# 3. Create individual network services per Thunderbolt port
networksetup -listallhardwareports \
  | awk -F': ' '/Hardware Port: Thunderbolt [0-9]/ {print $2}' \
  | while read port; do
      svc="RDMA $port"
      if ! networksetup -listallnetworkservices | grep -q "$svc"; then
        networksetup -createnetworkservice "$svc" "$port" 2>/dev/null || true
      fi
      networksetup -setv6automatic "$svc" 2>/dev/null || true
    done

# 4. Disable the bridge service
networksetup -setnetworkserviceenabled "Thunderbolt Bridge" off 2>/dev/null || true

# 5. Verify
ibv_devinfo 2>/dev/null | grep -E 'hca_id|state'
```

After running this, `ibv_devinfo` should show individual RDMA devices with PORT_ACTIVE state for each connected Thunderbolt cable.
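Since the fix must re-run after every boot, one option is a launchd daemon that executes it at startup. This is a sketch, not a tested deployment — the label and the install path `/usr/local/bin/fix-tb-bridge.sh` are placeholders you'd adapt:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <!-- Unique identifier for this daemon (placeholder name) -->
  <key>Label</key>
  <string>local.fix-tb-bridge</string>
  <!-- Run the bridge-fix script via bash -->
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/usr/local/bin/fix-tb-bridge.sh</string>
  </array>
  <!-- Execute once at every boot -->
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>
```

Install it to `/Library/LaunchDaemons/` and load with `sudo launchctl load -w /Library/LaunchDaemons/local.fix-tb-bridge.plist`. Verify with `ibv_devinfo` after the next reboot before trusting it.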
Complete Transfer Script
A working script that transfers an entire directory in a single JACCL session with all workarounds applied:
"""
Single-session RDMA directory transfer via JACCL.
Uses CPU stream for all send/recv to avoid Metal GPU timeouts (#3142).
Uses all_sum barrier after init to avoid shape corruption (#3149).
Usage:
mlx.launch --hostfile hostfile.json --backend jaccl -- \
rdma_dir_transfer.py /source/dir /dest/dir
"""
import gc
import os
import pathlib
import struct
import sys
import time
import mlx.core as mx
CHUNK_SIZE = 64 * 1024 * 1024 # 64 MB
def log(msg):
r = mx.distributed.init().rank()
print(f"[RDMA-DIR][rank={r}] {msg}", flush=True)
def send_directory(src_dir, peer, group):
files = sorted(f for f in src_dir.iterdir() if f.is_file())
s = mx.cpu
cnt = mx.distributed.send(mx.array([len(files)], dtype=mx.int64), peer, group=group, stream=s)
mx.eval(cnt)
total = 0
t0 = time.time()
for i, fp in enumerate(files):
sz = fp.stat().st_size
nm = fp.name.encode()
hdr = mx.array(list(struct.pack("<QQ", len(nm), sz)), dtype=mx.uint8)
mx.eval(mx.distributed.send(hdr, peer, group=group, stream=s))
mx.eval(mx.distributed.send(mx.array(list(nm), dtype=mx.uint8), peer, group=group, stream=s))
if sz > 0:
with fp.open("rb") as f:
sent = 0
while sent < sz:
chunk = f.read(CHUNK_SIZE)
if not chunk:
break
arr = mx.array(memoryview(chunk), dtype=mx.uint8)
mx.eval(mx.distributed.send(arr, peer, group=group, stream=s))
sent += len(chunk)
total += sz
el = time.time() - t0
spd = total / el / 1e9 if el > 0 else 0
log(f" [{i+1}/{len(files)}] {fp.name} ({sz/1e6:.1f} MB) {spd:.2f} GB/s")
el = time.time() - t0
spd = total / el / 1e9 if el > 0 else 0
log(f"DONE: {len(files)} files, {total/1e9:.2f} GB in {el:.1f}s ({spd:.2f} GB/s)")
def recv_directory(dst_dir, peer, group):
dst_dir.mkdir(parents=True, exist_ok=True)
s = mx.cpu
cnt = mx.distributed.recv((1,), mx.int64, peer, group=group, stream=s)
mx.eval(cnt)
file_count = int(cnt[0].item())
log(f"Expecting {file_count} files")
total = 0
t0 = time.time()
for i in range(file_count):
hdr = mx.distributed.recv((16,), mx.uint8, peer, group=group, stream=s)
mx.eval(hdr)
nlen, sz = struct.unpack("<QQ", bytes(memoryview(hdr)))
nm = mx.distributed.recv((nlen,), mx.uint8, peer, group=group, stream=s)
mx.eval(nm)
name = bytes(memoryview(nm)).decode()
path = dst_dir / name
if sz > 0:
fd = os.open(str(path), os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
try:
received = 0
while received < sz:
csz = min(CHUNK_SIZE, sz - received)
arr = mx.distributed.recv((csz,), mx.uint8, peer, group=group, stream=s)
mx.eval(arr)
os.write(fd, memoryview(arr))
received += csz
del arr
os.fsync(fd)
finally:
os.close(fd)
else:
path.touch()
total += sz
el = time.time() - t0
spd = total / el / 1e9 if el > 0 else 0
log(f" [{i+1}/{file_count}] {name} ({sz/1e6:.1f} MB) {spd:.2f} GB/s")
el = time.time() - t0
spd = total / el / 1e9 if el > 0 else 0
log(f"DONE: {file_count} files, {total/1e9:.2f} GB in {el:.1f}s ({spd:.2f} GB/s)")
def main():
if len(sys.argv) != 3:
print(f"Usage: {sys.argv[0]} SRC_DIR DST_DIR", file=sys.stderr)
sys.exit(1)
src = pathlib.Path(sys.argv[1])
dst = pathlib.Path(sys.argv[2])
world = mx.distributed.init()
if world.size() < 2:
sys.exit(1)
# Init barrier — required for JACCL (#3149). Must use mx.ones(10), not mx.ones(1).
barrier = mx.distributed.all_sum(mx.ones(10), group=world, stream=mx.cpu)
mx.eval(barrier)
try:
if world.rank() == 0:
if not src.is_dir():
print(f"ERROR: Source not found: {src}", file=sys.stderr)
sys.exit(1)
send_directory(src, 1, world)
elif world.rank() == 1:
recv_directory(dst, 0, world)
except Exception as e:
if "protection domain" in str(e).lower():
log(f"PD EXHAUSTION: {e}")
sys.exit(2)
log(f"ERROR: {e}")
import traceback
traceback.print_exc()
sys.exit(1)
gc.collect()
if __name__ == "__main__":
main()Wire Protocol
```
Sender (rank 0)                   Receiver (rank 1)
─────────────────                 ──────────────────
send(file_count: int64)      →    recv(file_count)
For each file:
  send(header: 16 bytes)     →    recv(header)
    [name_len: uint64]              [parse name_len, file_size]
    [file_size: uint64]
  send(filename: uint8[])    →    recv(filename)
  For each 64MB chunk:
    send(chunk: uint8[])     →    recv(chunk) → write to disk
                                  fsync()
```
Header is packed as struct.pack("<QQ", name_len, file_size) — two little-endian 64-bit unsigned integers. The last chunk of each file may be smaller than CHUNK_SIZE.
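The header framing can be exercised without any RDMA hardware. This plain-Python sketch (the helper names `pack_meta`/`parse_meta` are illustrative, not from the transfer script) packs and parses the per-file metadata exactly as the wire protocol describes:

```python
import struct

def pack_meta(filename: str, file_size: int) -> bytes:
    """Header (name_len, file_size as little-endian uint64s) followed by the name bytes."""
    name = filename.encode()
    return struct.pack("<QQ", len(name), file_size) + name

def parse_meta(buf: bytes):
    """Inverse of pack_meta: returns (filename, file_size)."""
    name_len, file_size = struct.unpack("<QQ", buf[:16])
    return buf[16:16 + name_len].decode(), file_size

meta = pack_meta("model-00001.safetensors", 3_500_000_000)
name, size = parse_meta(meta)
print(name, size)  # round-trips the original values
```

In the real script the header and name travel as separate uint8 sends, but the byte layout is identical.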
Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| `RTR errno 22 (EINVAL)` | Thunderbolt Bridge active | Run `fix-tb-bridge.sh` |
| `RTR errno 16 (EBUSY)` | Stale RDMA resources from crash | Reboot the node |
| `Caused GPU Timeout Error` | Missing `stream=mx.cpu` | Add CPU stream to all send/recv |
| `Couldn't allocate protection domain` | Too many init/teardown cycles | Reboot; use a single session |
| Shape corruption on first recv | Missing init barrier | Add `all_sum(mx.ones(10))` after init |
| `ibv_devinfo` shows PORT_DOWN | No cable / wrong port / bridge active | Check cables, run fix script |
Hostfile Notes
RDMA device names (`rdma_en2`, `rdma_en5`, etc.) and TB5 interface IPs are dynamically assigned every time `mlx.distributed` configures the mesh. They change after every reboot or mesh reconfiguration. The only stable identifiers are the 10GbE IPs in the hostfile's `ips` field. The hostfile must be present on all participating nodes — not just the launcher.
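For illustration of the stable-IP point, a hostfile might look like the following. This is an assumption about the schema (a list of entries with `ssh` and `ips` fields) — the exact format depends on your MLX version, and the hostnames and addresses here are placeholders:

```json
[
  {"ssh": "node0", "ips": ["10.0.0.10"]},
  {"ssh": "node1", "ips": ["10.0.0.11"]}
]
```

Only the stable 10GbE IPs go in `ips`; the RDMA devices and TB5 addresses are discovered dynamically at launch.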
Limitations
- Point-to-point only: Rank 0 sends, rank 1 receives. For multi-node distribution, sequential transfers between directly-connected pairs.
- Flat directories only: Does not recurse into subdirectories.
- No resume: If interrupted, re-run the entire directory. Files are overwritten atomically (O_TRUNC).
- Reboot required for PD exhaustion: No programmatic recovery.
Hardware
- 5× Mac Studio M3 Ultra (512 GB unified memory)
- Thunderbolt 5 full mesh (4 cables per node, direct links between all pairs)
- `ibv_devinfo` reports 8X width at 10.0 Gbps per lane = 80 Gbps per port
- Measured ~37% utilization of theoretical single-port bandwidth
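The utilization figure follows directly from the numbers above; this quick check shows the measured 3.5–3.8 GB/s range corresponds to roughly 35–38% of a port's 80 Gbps (the quoted ~37% sits mid-range):

```python
# Single-port utilization from the hardware figures above:
# 8 lanes x 10.0 Gbps = 80 Gbps theoretical; measured 3.5-3.8 GB/s.
port_gbps = 8 * 10.0          # 80 Gbps per port
lo = 3.5 * 8 / port_gbps      # low end of measured throughput, in fraction of port
hi = 3.8 * 8 / port_gbps      # peak measured throughput
print(f"{lo:.0%}-{hi:.0%} of theoretical")
```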
Hope this helps others working with MLX distributed and Thunderbolt 5!