Guide: RDMA file transfer over Thunderbolt 5 with JACCL (3.5+ GB/s) #3207

@guruswami-ai


RDMA File Transfer over Thunderbolt 5 with JACCL

We built a high-speed file transfer tool using MLX's distributed primitives over RDMA/Thunderbolt 5, achieving 3.5–3.8 GB/s sustained throughput — roughly 23× faster than rsync over the same link. Sharing the working solution and the critical workarounds we discovered, since there's no existing documentation for this use case.

Tested on Mac Studio M3 Ultra (macOS 26.3.1) with MLX 0.30+ and JACCL backend.

The Problem

JACCL provides RDMA over Thunderbolt 5, but mx.distributed.send/recv is designed for tensor operations during training and inference, not file I/O. Using it naively for file transfer hits several issues that crash or hang the process. After extensive debugging, we found three critical workarounds and one macOS-level fix that are required.

Results

[RDMA-DIR][rank=0] DONE: 16 files, 49.63 GB in 13.0s (3.80 GB/s)
Method                  Speed            235 GB transfer
RDMA (this approach)    3.5–3.8 GB/s     ~67 seconds
rsync over 10GbE        ~100–200 MB/s    ~30–40 minutes
rsync over TB5 IP       ~300–500 MB/s    ~10–15 minutes

We distributed 705 GB of quantized LLM weights across 3 nodes in ~3.5 minutes using sequential RDMA transfers between directly connected pairs.
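A rough sanity check on that figure, assuming an even three-way shard and two sequential point-to-point hops from the source node (the split and hop count are assumptions, not stated above):

```python
# Back-of-the-envelope check for the 705 GB / 3-node distribution.
total_gb = 705
nodes = 3
throughput = 3.5  # GB/s, lower end of the measured RDMA range

per_node_gb = total_gb / nodes   # each peer receives one shard (assumed even split)
hops = nodes - 1                 # sequential pair-wise transfers from the source
transfer_s = hops * per_node_gb / throughput

print(f"{per_node_gb:.0f} GB per node, ~{transfer_s:.0f}s of pure wire time")
```

That is roughly 134 s of raw transfer time; the gap to the observed ~3.5 minutes is plausibly per-file handshaking and session setup.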


Critical Workaround 1: Use stream=mx.cpu for All Send/Recv

Symptom: Receiver crashes with [METAL] Command buffer execution failed: Caused GPU Timeout Error. Sender reports high throughput but receiver writes zero files.

Root cause: By default, mx.distributed.send/recv are scheduled on the GPU stream. Metal has an unconfigurable command buffer timeout (~5 seconds). RDMA recv operations that take longer than this trigger the timeout and kill the process. The sender doesn't crash because send operations complete quickly once data is in the RDMA buffer, but the receiver must wait for data to arrive.

Fix: Pass stream=mx.cpu on every send() and recv() call:

s = mx.cpu

# Sender
mx.eval(mx.distributed.send(data, peer, group=group, stream=s))

# Receiver
result = mx.distributed.recv(shape, dtype, peer, group=group, stream=s)
mx.eval(result)

This bypasses Metal entirely. The RDMA transfer still happens at full hardware speed — the CPU stream just means MLX doesn't create a Metal command buffer for the operation.

What doesn't work:

  • Smaller chunks (16 MB, 4 MB) — timeout still triggers
  • mx.eval() after every recv — can deadlock with the sender
  • MLX_METAL_FAST_SYNCH=0 — no effect
  • Converting to numpy immediately — crash happens before conversion

Related: #3142

Critical Workaround 2: Init Barrier After distributed.init()

Symptom: First send/recv after init returns garbage data or wrong shapes. Subsequent operations may hang or produce corrupted results.

Root cause: JACCL needs all ranks to synchronize after initialization before point-to-point operations work reliably. Without a barrier, one rank may start sending before the other has finished setting up its RDMA resources.

Fix: Run an all_sum barrier immediately after init, before any send/recv:

world = mx.distributed.init()

# REQUIRED: init barrier. Must use mx.ones(10), NOT mx.ones(1).
barrier = mx.distributed.all_sum(mx.ones(10), group=world, stream=mx.cpu)
mx.eval(barrier)

Important: Use mx.ones(10), not mx.ones(1). The single-element variant doesn't reliably trigger the full synchronization path in JACCL.

Related: #3149

Critical Workaround 3: Single JACCL Session (PD Exhaustion)

Symptom: After ~60 successful file transfers, JACCL fails with Couldn't allocate protection domain.

Root cause: Each mx.distributed.init() / teardown cycle allocates and (incompletely) releases RDMA protection domains. The kernel has a hard limit. An approach that launches a separate JACCL session per file hits this limit quickly.

Fix: Transfer all files in a single JACCL session. One init, all transfers, one teardown.

Recovery: If you hit PD exhaustion, the only fix is to reboot the affected node. There is no way to release protection domains without a reboot.

macOS Thunderbolt Bridge (Required Fix)

macOS creates a "Thunderbolt Bridge" network service that absorbs all TB ports into a single bridge0 interface. This completely breaks RDMA — JACCL cannot address individual Thunderbolt ports through a bridge.

Symptom: Every RDMA device combination fails with RTR errno 22 (EINVAL) during queue pair setup.

Critical: macOS re-enables Thunderbolt Bridge on every reboot and OS update. You must run the fix after every boot.

#!/bin/bash
# Fix Thunderbolt Bridge for RDMA — create individual TB port services.
# Run as root: sudo bash fix-tb-bridge.sh
set -euo pipefail

PREFS="/Library/Preferences/SystemConfiguration/preferences.plist"

# 1. Destroy bridge0
if ifconfig bridge0 &>/dev/null; then
  ifconfig bridge0 | awk '/member/ {print $2}' | while read -r m; do
    ifconfig bridge0 deletem "$m" 2>/dev/null || true
  done
  ifconfig bridge0 destroy 2>/dev/null || true
fi

# 2. Remove bridge from system preferences
/usr/libexec/PlistBuddy \
    -c "Delete :VirtualNetworkInterfaces:Bridge:bridge0" "$PREFS" 2>/dev/null || true

# 3. Create individual network services per Thunderbolt port
networksetup -listallhardwareports \
    | awk -F': ' '/Hardware Port: Thunderbolt [0-9]/ {print $2}' \
    | while read -r port; do
  svc="RDMA $port"
  if ! networksetup -listallnetworkservices | grep -q "$svc"; then
    networksetup -createnetworkservice "$svc" "$port" 2>/dev/null || true
  fi
  networksetup -setv6automatic "$svc" 2>/dev/null || true
done

# 4. Disable the bridge service
networksetup -setnetworkserviceenabled "Thunderbolt Bridge" off 2>/dev/null || true

# 5. Verify
ibv_devinfo 2>/dev/null | grep -E 'hca_id|state'

After running this, ibv_devinfo should show individual RDMA devices with PORT_ACTIVE state for each connected Thunderbolt cable.


Complete Transfer Script

A working script that transfers an entire directory in a single JACCL session with all workarounds applied:

"""
Single-session RDMA directory transfer via JACCL.

Uses CPU stream for all send/recv to avoid Metal GPU timeouts (#3142).
Uses all_sum barrier after init to avoid shape corruption (#3149).

Usage:
  mlx.launch --hostfile hostfile.json --backend jaccl -- \
    rdma_dir_transfer.py /source/dir /dest/dir
"""

import gc
import os
import pathlib
import struct
import sys
import time

import mlx.core as mx

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB


def log(msg):
    r = mx.distributed.init().rank()
    print(f"[RDMA-DIR][rank={r}] {msg}", flush=True)


def send_directory(src_dir, peer, group):
    files = sorted(f for f in src_dir.iterdir() if f.is_file())

    s = mx.cpu
    cnt = mx.distributed.send(mx.array([len(files)], dtype=mx.int64), peer, group=group, stream=s)
    mx.eval(cnt)

    total = 0
    t0 = time.time()

    for i, fp in enumerate(files):
        sz = fp.stat().st_size
        nm = fp.name.encode()

        hdr = mx.array(list(struct.pack("<QQ", len(nm), sz)), dtype=mx.uint8)
        mx.eval(mx.distributed.send(hdr, peer, group=group, stream=s))
        mx.eval(mx.distributed.send(mx.array(list(nm), dtype=mx.uint8), peer, group=group, stream=s))

        if sz > 0:
            with fp.open("rb") as f:
                sent = 0
                while sent < sz:
                    chunk = f.read(CHUNK_SIZE)
                    if not chunk:
                        break
                    arr = mx.array(memoryview(chunk), dtype=mx.uint8)
                    mx.eval(mx.distributed.send(arr, peer, group=group, stream=s))
                    sent += len(chunk)

        total += sz
        el = time.time() - t0
        spd = total / el / 1e9 if el > 0 else 0
        log(f"  [{i+1}/{len(files)}] {fp.name} ({sz/1e6:.1f} MB) {spd:.2f} GB/s")

    el = time.time() - t0
    spd = total / el / 1e9 if el > 0 else 0
    log(f"DONE: {len(files)} files, {total/1e9:.2f} GB in {el:.1f}s ({spd:.2f} GB/s)")


def recv_directory(dst_dir, peer, group):
    dst_dir.mkdir(parents=True, exist_ok=True)
    s = mx.cpu

    cnt = mx.distributed.recv((1,), mx.int64, peer, group=group, stream=s)
    mx.eval(cnt)
    file_count = int(cnt[0].item())
    log(f"Expecting {file_count} files")

    total = 0
    t0 = time.time()

    for i in range(file_count):
        hdr = mx.distributed.recv((16,), mx.uint8, peer, group=group, stream=s)
        mx.eval(hdr)
        nlen, sz = struct.unpack("<QQ", bytes(memoryview(hdr)))

        nm = mx.distributed.recv((nlen,), mx.uint8, peer, group=group, stream=s)
        mx.eval(nm)
        name = bytes(memoryview(nm)).decode()

        path = dst_dir / name

        if sz > 0:
            fd = os.open(str(path), os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            try:
                received = 0
                while received < sz:
                    csz = min(CHUNK_SIZE, sz - received)
                    arr = mx.distributed.recv((csz,), mx.uint8, peer, group=group, stream=s)
                    mx.eval(arr)
                    os.write(fd, memoryview(arr))
                    received += csz
                    del arr
                os.fsync(fd)
            finally:
                os.close(fd)
        else:
            path.touch()

        total += sz
        el = time.time() - t0
        spd = total / el / 1e9 if el > 0 else 0
        log(f"  [{i+1}/{file_count}] {name} ({sz/1e6:.1f} MB) {spd:.2f} GB/s")

    el = time.time() - t0
    spd = total / el / 1e9 if el > 0 else 0
    log(f"DONE: {file_count} files, {total/1e9:.2f} GB in {el:.1f}s ({spd:.2f} GB/s)")


def main():
    if len(sys.argv) != 3:
        print(f"Usage: {sys.argv[0]} SRC_DIR DST_DIR", file=sys.stderr)
        sys.exit(1)

    src = pathlib.Path(sys.argv[1])
    dst = pathlib.Path(sys.argv[2])

    world = mx.distributed.init()
    if world.size() < 2:
        print("ERROR: need at least 2 ranks", file=sys.stderr)
        sys.exit(1)

    # Init barrier — required for JACCL (#3149). Must use mx.ones(10), not mx.ones(1).
    barrier = mx.distributed.all_sum(mx.ones(10), group=world, stream=mx.cpu)
    mx.eval(barrier)

    try:
        if world.rank() == 0:
            if not src.is_dir():
                print(f"ERROR: Source not found: {src}", file=sys.stderr)
                sys.exit(1)
            send_directory(src, 1, world)
        elif world.rank() == 1:
            recv_directory(dst, 0, world)
    except Exception as e:
        if "protection domain" in str(e).lower():
            log(f"PD EXHAUSTION: {e}")
            sys.exit(2)
        log(f"ERROR: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)

    gc.collect()


if __name__ == "__main__":
    main()

Wire Protocol

Sender (rank 0)                    Receiver (rank 1)
─────────────────                  ──────────────────
send(file_count: int64)     →      recv(file_count)

For each file:
  send(header: 16 bytes)    →      recv(header)
    [name_len: uint64]                [parse name_len, file_size]
    [file_size: uint64]
  send(filename: uint8[])   →      recv(filename)
  For each 64MB chunk:
    send(chunk: uint8[])    →      recv(chunk) → write to disk
                                   fsync()

Header is packed as struct.pack("<QQ", name_len, file_size) — two little-endian 64-bit unsigned integers. The last chunk of each file may be smaller than CHUNK_SIZE.
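The header framing can be exercised standalone with nothing but the standard library (the filename and size below are made-up examples):

```python
import struct

CHUNK_SIZE = 64 * 1024 * 1024  # matches the transfer script

# Made-up example file: a ~5 GB weight shard.
name = "model-00001-of-00016.safetensors".encode()
file_size = 4_997_000_000

# Sender side: fixed 16-byte header, then the variable-length filename.
header = struct.pack("<QQ", len(name), file_size)
assert len(header) == 16

# Receiver side: parse the header, then derive the chunk schedule.
name_len, size = struct.unpack("<QQ", header)
full_chunks, tail = divmod(size, CHUNK_SIZE)
print(f"name_len={name_len}, size={size}: "
      f"{full_chunks} full chunks + one {tail}-byte tail chunk")
```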

Troubleshooting

Error                                 Cause                                   Fix
RTR errno 22 (EINVAL)                 Thunderbolt Bridge active               Run fix-tb-bridge.sh
RTR errno 16 (EBUSY)                  Stale RDMA resources from a crash       Reboot the node
Caused GPU Timeout Error              Missing stream=mx.cpu                   Add the CPU stream to all send/recv
Couldn't allocate protection domain   Too many init/teardown cycles           Reboot; use a single session
Shape corruption on first recv        Missing init barrier                    Add all_sum(mx.ones(10)) after init
ibv_devinfo shows PORT_DOWN           No cable / wrong port / bridge active   Check cables, run the fix script

Hostfile Notes

RDMA device names (rdma_en2, rdma_en5, etc.) and TB5 interface IPs are dynamically assigned every time mlx.distributed configures the mesh. They change after every reboot or mesh reconfiguration. The only stable identifiers are the 10GbE IPs in the hostfile's ips field. The hostfile must be present on all participating nodes — not just the launcher.
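For reference, a minimal two-node hostfile sketch. The hostnames and IPs are placeholders, and the "ssh" key name is an assumption based on mlx.launch conventions; only the ips field is confirmed above, so verify the exact schema against your MLX version:

```python
import json

# Hypothetical hostfile.json contents. "ssh" is an assumed key; "ips" holds
# the stable 10GbE addresses, NOT the dynamically assigned TB5 interface IPs.
hosts = [
    {"ssh": "studio-0.local", "ips": ["10.0.0.10"]},
    {"ssh": "studio-1.local", "ips": ["10.0.0.11"]},
]

print(json.dumps(hosts, indent=2))
```

Remember to copy the same file to every participating node, not just the launcher.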

Limitations

  • Point-to-point only: Rank 0 sends, rank 1 receives. For multi-node distribution, run sequential transfers between directly connected pairs.
  • Flat directories only: Does not recurse into subdirectories.
  • No resume: If interrupted, re-run the entire directory. Destination files are opened with O_TRUNC, so a re-run overwrites any partial file from an interrupted transfer.
  • Reboot required for PD exhaustion: No programmatic recovery.

Hardware

  • 5× Mac Studio M3 Ultra (512 GB unified memory)
  • Thunderbolt 5 full mesh (4 cables per node, direct links between all pairs)
  • ibv_devinfo reports 8X width at 10.0 Gbps per lane = 80 Gbps per port
  • Measured ~37% utilization of theoretical single-port bandwidth
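The utilization figure follows directly from the link parameters reported by ibv_devinfo:

```python
# Utilization check: 8 lanes x 10.0 Gbps = 80 Gbps per Thunderbolt 5 port.
lanes = 8
gbps_per_lane = 10.0
link_gbps = lanes * gbps_per_lane    # 80 Gbps theoretical single-port bandwidth

measured_gbs = 3.7                   # GB/s, mid-range of the 3.5-3.8 measured
measured_gbps = measured_gbs * 8     # bytes/s -> bits/s

utilization = measured_gbps / link_gbps
print(f"{measured_gbps:.1f} Gbps on an {link_gbps:.0f} Gbps link = {utilization:.0%}")
```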

Hope this helps others working with MLX distributed and Thunderbolt 5!
