| 1 | +# RDMA Test Configuration Guide |
| 2 | + |
| 3 | +This document explains how to configure the RDMA environment and run tests for `MooncakeTransferEngineConnector`. |
| 4 | + |
| 5 | +## Table of Contents |
| 6 | + |
| 7 | +- [Docker Container Permissions](#docker-container-permissions) |
| 8 | +- [Single-Node Testing](#single-node-testing) |
| 9 | +- [Multi-Node Testing](#multi-node-testing) |
| 10 | +- [Running Tests](#running-tests) |
| 11 | +- [Cross-Node Testing](#cross-node-testing) |
| 12 | +- [Troubleshooting](#troubleshooting) |
| 13 | + |
| 14 | +--- |
| 15 | + |
| 16 | +## Docker Container Permissions |
| 17 | + |
| 18 | +RDMA tests require access to InfiniBand/RoCE devices and system topology. Add the following permissions when running `docker run`. |
| 19 | + |
| 20 | +### Option 1: Minimal Permissions (Recommended) |
| 21 | + |
| 22 | +```bash |
| 23 | +docker run -it \ |
| 24 | + --cap-add=SYS_PTRACE \ |
| 25 | + --cap-add=IPC_LOCK \ |
| 26 | + --security-opt seccomp=unconfined \ |
| 27 | + --network=host \ |
| 28 | + --device=/dev/infiniband \ |
| 29 | + -v /sys/class/infiniband:/sys/class/infiniband:ro \ |
| 30 | + your-image:tag |
| 31 | +``` |
| 32 | + |
| 33 | +Parameter explanation: |
| 34 | +- `--cap-add=SYS_PTRACE`: Allow reading system topology information |
| 35 | +- `--cap-add=IPC_LOCK`: Allow memory locking (required for RDMA memory registration) |
| 36 | +- `--security-opt seccomp=unconfined`: Disable seccomp restrictions |
| 37 | +- `--network=host`: Use host network (required for RDMA) |
| 38 | +- `--device=/dev/infiniband`: Mount InfiniBand devices |
| 39 | +- `-v /sys/class/infiniband`: Mount IB device info (read-only) |
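Once the container is running, a short Python check can confirm that the flags above took effect. This is a minimal sketch and not part of the test suite; the helper name `check_rdma_container_access` is hypothetical. It reports which device nodes and RDMA devices are visible and what the locked-memory limit is.

```python
import os
import resource


def check_rdma_container_access(dev_dir="/dev/infiniband",
                                sysfs_dir="/sys/class/infiniband"):
    """Report whether RDMA device nodes and sysfs entries are visible."""
    # RLIMIT_MEMLOCK caps how much memory an unprivileged process may pin.
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return {
        # Character devices such as uverbs0 (empty if --device was omitted).
        "device_nodes": sorted(os.listdir(dev_dir)) if os.path.isdir(dev_dir) else [],
        # Device names such as mlx5_0 (empty if the sysfs mount is missing).
        "rdma_devices": sorted(os.listdir(sysfs_dir)) if os.path.isdir(sysfs_dir) else [],
        "memlock_unlimited": soft == resource.RLIM_INFINITY,
        "memlock_soft_bytes": soft,
    }


if __name__ == "__main__":
    for key, value in check_rdma_container_access().items():
        print(f"{key}: {value}")
```

An empty `device_nodes` or `rdma_devices` list suggests the `--device` or `-v` flag is missing. A finite memlock limit is not necessarily a problem when the container has `IPC_LOCK`, because on Linux that capability exempts the process from the limit.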
| 40 | + |
| 41 | +### Option 2: Full Permissions (Quick but not recommended for production) |
| 42 | + |
| 43 | +```bash |
| 44 | +docker run -it \ |
| 45 | + --privileged \ |
| 46 | + --network=host \ |
| 47 | + your-image:tag |
| 48 | +``` |
| 49 | + |
| 50 | +`--privileged` grants full host permissions. Suitable for quick testing but not recommended for production. |
| 51 | + |
| 52 | +--- |
| 53 | + |
| 54 | +## Single-Node Testing |
| 55 | + |
| 56 | +When running single-node tests (producer and consumer on the same machine), ensure they use the **same RDMA device**. |
| 57 | + |
| 58 | +### Problem Background |
| 59 | + |
| 60 | +InfiniBand devices use LIDs (Local Identifiers) for routing within a subnet. Each port has its own LID, and ports attached to different subnets or fabrics cannot reach each other directly. If no device is specified, Mooncake may assign different devices to the two connectors, which can cause handshake failures. |
| 61 | + |
| 62 | +Common error: |
| 63 | +``` |
| 64 | +[Handshake] Failed to modify QP to RTR, check mtu, gid, peer lid, peer qp num: Invalid argument [22] |
| 65 | +``` |
| 66 | + |
| 67 | +### Solution |
| 68 | + |
| 69 | +**Method 1: Set Environment Variable (Recommended)** |
| 70 | + |
| 71 | +```bash |
| 72 | +# List available RDMA devices |
| 73 | +ibstat |
| 74 | + |
| 75 | +# Select a device (e.g., mlx5_0) |
| 76 | +export RDMA_DEVICE_NAME='mlx5_0' |
| 77 | + |
| 78 | +# Run tests |
| 79 | +pytest test_mooncake_transfer_engine_rdma.py -v -s |
| 80 | +``` |
| 81 | + |
| 82 | +**Method 2: Use RoCE Devices** |
| 83 | + |
| 84 | +If the system has RoCE devices (using IPv4 routing), the test code will automatically detect and prefer them. RoCE device GIDs start with `00:00:00:00:00:00:00:00:00:00:ff:ff` (IPv4-mapped). |
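The prefix check described above can be sketched in a few lines of Python. This is an illustration only: the function name `ipv4_from_gid` is hypothetical, and the way the test code actually detects RoCE devices is not shown in this guide. The sketch assumes the GID is given in the colon-separated hex form that sysfs uses.

```python
import ipaddress


def ipv4_from_gid(gid: str):
    """Return the IPv4 address embedded in an IPv4-mapped GID, else None.

    `gid` is the colon-separated hex form, for example
    "0000:0000:0000:0000:0000:ffff:0a00:0001".
    """
    raw = bytes.fromhex(gid.replace(":", ""))
    if len(raw) != 16:
        raise ValueError(f"expected a 16-byte GID, got {len(raw)} bytes")
    # IPv4-mapped prefix: ten zero bytes followed by ff:ff.
    if raw[:12] != bytes(10) + b"\xff\xff":
        return None
    # The last four bytes are the IPv4 address.
    return str(ipaddress.IPv4Address(raw[12:]))


print(ipv4_from_gid("0000:0000:0000:0000:0000:ffff:0a00:0001"))  # 10.0.0.1
print(ipv4_from_gid("fe80:0000:0000:0000:0202:c9ff:fe00:0001"))  # None
```

On Linux, the GID table entries for a port are typically readable under `/sys/class/infiniband/<device>/ports/<port>/gids/`, so a string read from one of those files can be passed to the function after stripping whitespace.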
| 85 | + |
| 86 | +**Method 3: Ensure MTU Consistency** |
| 87 | + |
| 88 | +Make sure both endpoints use the same MTU: |
| 89 | + |
| 90 | +```bash |
| 91 | +# Check device MTU (see the max_mtu and active_mtu fields) |
| 92 | +ibv_devinfo -d mlx5_0 |
| 93 | +``` |
| 94 | + |
| 95 | +--- |
| 96 | + |
| 97 | +## Multi-Node Testing |
| 98 | + |
| 99 | +For multi-node tests, the producer and consumer run on different machines connected via an InfiniBand switch. |
| 100 | + |
| 101 | +### Prerequisites |
| 102 | + |
| 103 | +1. Both machines have Mooncake and RDMA drivers installed |
| 104 | +2. Both machines are in the same InfiniBand subnet |
| 105 | +3. Switch is properly configured |
| 106 | + |
| 107 | +### Configuration |
| 108 | + |
| 109 | +**Machine A (Producer):** |
| 110 | + |
| 111 | +```bash |
| 112 | +# Set RDMA host IP (InfiniBand interface IP) |
| 113 | +export RDMA_TEST_HOST='10.0.0.1' |
| 114 | + |
| 115 | +# Optional: Specify device |
| 116 | +export RDMA_DEVICE_NAME='mlx5_0' |
| 117 | +``` |
| 118 | + |
| 119 | +**Machine B (Consumer):** |
| 120 | + |
| 121 | +```bash |
| 122 | +# Set RDMA host IP |
| 123 | +export RDMA_TEST_HOST='10.0.0.2' |
| 124 | + |
| 125 | +# Optional: Specify device |
| 126 | +export RDMA_DEVICE_NAME='mlx5_0' |
| 127 | +``` |
| 128 | + |
| 129 | +### Verify Connectivity |
| 130 | + |
| 131 | +```bash |
| 132 | +# Ping IB interface |
| 133 | +ping 10.0.0.2 |
| 134 | + |
| 135 | +# Test RDMA connectivity with ibping |
| 136 | +# On Machine B (server) |
| 137 | +ibping -S |
| 138 | + |
| 139 | +# On Machine A (client) |
| 140 | +ibping -G <Machine_B_PORT_GUID> |
| 141 | +``` |
| 142 | + |
| 143 | +--- |
| 144 | + |
| 145 | +## Running Tests |
| 146 | + |
| 147 | +### Run All RDMA Tests (Single-Node, fast suite) |
| 148 | + |
| 149 | +Slow tests (large payloads, stress, concurrency integrity) are marked `@pytest.mark.slow`. Use `-m "not slow"` to skip them in quick CI or local fast iteration. |
| 150 | + |
| 151 | +```bash |
| 152 | +cd tests/distributed/omni_connectors |
| 153 | + |
| 154 | +# Fast suite only (excludes slow/stress tests) |
| 155 | +pytest test_mooncake_transfer_engine_rdma.py test_mooncake_transfer_engine_buffer.py -v -s -m "not slow" |
| 156 | +``` |
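For `-m "not slow"` to run without unknown-marker warnings, the `slow` marker has to be registered with pytest. The repository may already do this in its pytest configuration; if it does not, a `conftest.py` hook along these lines is one way to register it. The file placement and the marker description text are assumptions.

```python
# conftest.py (hypothetical placement next to the test files)


def pytest_configure(config):
    """Register the `slow` marker so `-m "not slow"` selects cleanly."""
    config.addinivalue_line(
        "markers",
        "slow: large-payload, stress, and concurrency-integrity tests",
    )
```

With the marker registered, `pytest --markers` lists `slow` alongside the built-in markers.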
| 157 | + |
| 158 | +### Run Including Slow Tests |
| 159 | + |
| 160 | +```bash |
| 161 | +# Run ALL tests including slow/stress tests |
| 162 | +pytest test_mooncake_transfer_engine_rdma.py test_mooncake_transfer_engine_buffer.py -v -s |
| 163 | + |
| 164 | +# Run ONLY the slow/stress tests |
| 165 | +pytest test_mooncake_transfer_engine_rdma.py test_mooncake_transfer_engine_buffer.py -v -s -m slow |
| 166 | +``` |
| 167 | + |
| 168 | +### Run Buffer Management Tests |
| 169 | + |
| 170 | +```bash |
| 171 | +# Fast only |
| 172 | +pytest test_mooncake_transfer_engine_buffer.py -v -s -m "not slow" |
| 173 | + |
| 174 | +# Including allocator invariant tests (double-free, overlap, merge) |
| 175 | +pytest test_mooncake_transfer_engine_buffer.py -v -s |
| 176 | +``` |
| 177 | + |
| 178 | +### Run Specific Test Classes |
| 179 | + |
| 180 | +```bash |
| 181 | +# Basic connector tests |
| 182 | +pytest test_mooncake_transfer_engine_rdma.py::TestBasicConnector -v -s |
| 183 | + |
| 184 | +# End-to-end RDMA transfer tests |
| 185 | +pytest test_mooncake_transfer_engine_rdma.py::TestEndToEnd -v -s |
| 186 | + |
| 187 | +# Lifecycle & resource management tests |
| 188 | +pytest test_mooncake_transfer_engine_rdma.py::TestLifecycle -v -s |
| 189 | + |
| 190 | +# GPU memory pool tests (requires CUDA) |
| 191 | +pytest test_mooncake_transfer_engine_rdma.py::TestGPUPool -v -s |
| 192 | + |
| 193 | +# Stress / correctness tests (slow) |
| 194 | +pytest test_mooncake_transfer_engine_rdma.py::TestStressCorrectness -v -s |
| 195 | +``` |
| 196 | + |
| 197 | +### RDMA Environment Diagnostics |
| 198 | + |
| 199 | +For quick diagnostics (device status, Mooncake availability, env vars, etc.), |
| 200 | +see the [Troubleshooting section](../../../docs/design/feature/omni_connectors/mooncake_transfer_engine_connector.md#troubleshooting) |
| 201 | +in the connector documentation. |
| 202 | + |
| 203 | +--- |
| 204 | + |
| 205 | +## Cross-Node Testing |
| 206 | + |
| 207 | +The `cross_node_mooncake_transfer_engine.py` script enables testing RDMA transfers between two separate physical machines. This script is **not** auto-discovered by `pytest` (it does not start with `test_`) — it must be run manually on each node. |
| 208 | + |
| 209 | +### Prerequisites |
| 210 | + |
| 211 | +1. Both machines have Mooncake installed |
| 212 | +2. Both machines are connected via InfiniBand/RoCE switch |
| 213 | +3. Firewall allows ZMQ ports (default: 15500, 15501) |
| 214 | +4. Same RDMA device name on both nodes (if multiple devices exist) |
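Prerequisite 3 can be checked from the consumer machine once the producer is running. The sketch below assumes the ZMQ endpoints use TCP, which this guide does not state explicitly; the helper name `tcp_port_open` is hypothetical.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Replace the address with the producer's RDMA IP.
    for port in (15500, 15501):
        print(port, tcp_port_open("10.0.0.1", port))
```

A `False` result means either that nothing is listening on that port yet or that a firewall is dropping the traffic, so run the check only after the producer has started.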
| 215 | + |
| 216 | +### Running Cross-Node Tests |
| 217 | + |
| 218 | +**On Machine A (Producer) — start first:** |
| 219 | + |
| 220 | +```bash |
| 221 | +cd benchmarks/distributed/omni_connectors/ |
| 222 | + |
| 223 | +# Optional: specify device if multiple exist |
| 224 | +export RDMA_DEVICE_NAME='mlx5_0' |
| 225 | + |
| 226 | +python cross_node_mooncake_transfer_engine.py \ |
| 227 | + --role producer \ |
| 228 | + --local-host <PRODUCER_IP> \ |
| 229 | + --remote-host <CONSUMER_IP> \ |
| 230 | + --tensor-size-mb 100 \ |
| 231 | + --num-transfers 3 |
| 232 | +``` |
| 233 | + |
| 234 | +**On Machine B (Consumer) — start after producer:** |
| 235 | + |
| 236 | +```bash |
| 237 | +cd benchmarks/distributed/omni_connectors/ |
| 238 | + |
| 239 | +export RDMA_DEVICE_NAME='mlx5_0' |
| 240 | + |
| 241 | +python cross_node_mooncake_transfer_engine.py \ |
| 242 | + --role consumer \ |
| 243 | + --local-host <CONSUMER_IP> \ |
| 244 | + --remote-host <PRODUCER_IP> \ |
| 245 | + --tensor-size-mb 100 \ |
| 246 | + --num-transfers 3 |
| 247 | +``` |
| 248 | + |
| 249 | +### Transfer Modes |
| 250 | + |
| 251 | +| Mode | Description | Example | |
| 252 | +|------|-------------|---------| |
| 253 | +| `copy` | Normal path — tensor copied to RDMA pool (default) | `--mode copy` | |
| 254 | +| `zerocopy` | Zero-copy path — data created directly in RDMA pool | `--mode zerocopy` | |
| 255 | +| `gpu` | GPU transfer — RDMA pool on GPU, uses GPUDirect | `--mode gpu --gpu-id 0` | |
| 256 | + |
| 257 | +### Benchmark Mode |
| 258 | + |
| 259 | +Skip MD5 verification and measure pure RDMA throughput: |
| 260 | + |
| 261 | +```bash |
| 262 | +# Producer |
| 263 | +python cross_node_mooncake_transfer_engine.py \ |
| 264 | + --role producer \ |
| 265 | + --local-host <PRODUCER_IP> \ |
| 266 | + --remote-host <CONSUMER_IP> \ |
| 267 | + --tensor-size-mb 1024 \ |
| 268 | + --num-transfers 20 \ |
| 269 | + --benchmark |
| 270 | + |
| 271 | +# Consumer |
| 272 | +python cross_node_mooncake_transfer_engine.py \ |
| 273 | + --role consumer \ |
| 274 | + --local-host <CONSUMER_IP> \ |
| 275 | + --remote-host <PRODUCER_IP> \ |
| 276 | + --tensor-size-mb 1024 \ |
| 277 | + --num-transfers 20 \ |
| 278 | + --benchmark |
| 279 | +``` |
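To sanity-check the numbers a benchmark run reports, throughput can be recomputed from the tensor size, the transfer count, and the elapsed wall-clock time. The sketch below assumes that `--tensor-size-mb` means units of 2^20 bytes and reports GiB/s; the units the script itself prints may differ.

```python
def throughput_gib_per_s(tensor_size_mb: float, num_transfers: int,
                         elapsed_s: float) -> float:
    """Aggregate throughput in GiB/s for a batch of equal-sized transfers."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    total_bytes = tensor_size_mb * 1024 * 1024 * num_transfers
    return total_bytes / elapsed_s / (1024 ** 3)


# 20 transfers of 1024 MB completed in 10 seconds.
print(throughput_gib_per_s(1024, 20, 10.0))  # 2.0
```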
| 280 | + |
| 281 | +### Cross-Node Test Options |
| 282 | + |
| 283 | +| Option | Description | Default | |
| 284 | +|--------|-------------|---------| |
| 285 | +| `--role` | `producer` or `consumer` | Required | |
| 286 | +| `--local-host` | Local RDMA IP address | Required | |
| 287 | +| `--remote-host` | Remote RDMA IP address | Required | |
| 288 | +| `--local-port` | Local ZMQ port for RDMA data | 15500 | |
| 289 | +| `--remote-port` | Remote ZMQ port for RDMA data | 15500 | |
| 290 | +| `--ctrl-port` | Control channel port | 15501 | |
| 291 | +| `--tensor-size-mb` | Tensor size in MB | 100 | |
| 292 | +| `--num-transfers` | Number of transfers | 3 | |
| 293 | +| `--mode` | `copy`, `zerocopy`, or `gpu` | `copy` | |
| 294 | +| `--gpu-id` | GPU ID for GPU mode | 0 | |
| 295 | +| `--benchmark` | Skip MD5, pure performance test | off | |
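The options table corresponds to a command-line interface roughly like the `argparse` sketch below. This is a reconstruction from the table, not the script's actual source: the function name `build_parser` and the argument types are assumptions, and only the option names and defaults come from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented cross-node options."""
    p = argparse.ArgumentParser(description="Cross-node RDMA transfer test")
    p.add_argument("--role", required=True, choices=["producer", "consumer"])
    p.add_argument("--local-host", required=True, help="Local RDMA IP address")
    p.add_argument("--remote-host", required=True, help="Remote RDMA IP address")
    p.add_argument("--local-port", type=int, default=15500)
    p.add_argument("--remote-port", type=int, default=15500)
    p.add_argument("--ctrl-port", type=int, default=15501)
    p.add_argument("--tensor-size-mb", type=int, default=100)
    p.add_argument("--num-transfers", type=int, default=3)
    p.add_argument("--mode", choices=["copy", "zerocopy", "gpu"], default="copy")
    p.add_argument("--gpu-id", type=int, default=0)
    p.add_argument("--benchmark", action="store_true")
    return p


args = build_parser().parse_args(
    ["--role", "producer", "--local-host", "10.0.0.1", "--remote-host", "10.0.0.2"]
)
print(args.mode, args.local_port, args.ctrl_port, args.benchmark)
```

Only `--role`, `--local-host`, and `--remote-host` have to be supplied; every other option falls back to the default listed in the table.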
| 296 | + |
| 297 | +--- |
| 298 | + |
| 299 | +## Troubleshooting |
| 300 | + |
| 301 | +### 1. "Failed to modify QP to RTR" Error |
| 302 | + |
| 303 | +**Cause**: QP handshake failed, usually due to device configuration mismatch. |
| 304 | + |
| 305 | +**Solution**: |
| 306 | +```bash |
| 307 | +# Force using the same device |
| 308 | +export RDMA_DEVICE_NAME='mlx5_0' |
| 309 | +``` |
| 310 | + |
| 311 | +### 2. "Mooncake TransferEngine is not available" |
| 312 | + |
| 313 | +**Cause**: Mooncake not installed or import failed. |
| 314 | + |
| 315 | +**Solution**: |
| 316 | +```bash |
| 317 | +# Check Mooncake installation |
| 318 | +python -c "from mooncake.engine import TransferEngine; print('OK')" |
| 319 | + |
| 320 | +# Reinstall if needed |
| 321 | +pip install mooncake-transfer-engine |
| 322 | +# Or using uv |
| 323 | +uv pip install mooncake-transfer-engine |
| 324 | + |
| 325 | +``` |
| 326 | + |
| 327 | +### 3. "Permission denied" accessing /dev/infiniband |
| 328 | + |
| 329 | +**Cause**: Container lacks IB device access permissions. |
| 330 | + |
| 331 | +**Solution**: |
| 332 | +```bash |
| 333 | +docker run --device=/dev/infiniband --cap-add=IPC_LOCK ... |
| 334 | +``` |
| 335 | + |
| 336 | +### 4. Test Timeout |
| 337 | + |
| 338 | +**Cause**: RDMA connection establishment failed, or network latency is high. |
| 339 | + |
| 340 | +**Solution**: |
| 341 | +```bash |
| 342 | +# Check network status |
| 343 | +ibstat |
| 344 | +ibstatus |
| 345 | +``` |
| 346 | + |
| 347 | +### 5. GPU Test Failed "CUDA is not available" |
| 348 | + |
| 349 | +**Cause**: CUDA environment not configured or GPU unavailable. |
| 350 | + |
| 351 | +**Solution**: |
| 352 | +```bash |
| 353 | +# Check CUDA |
| 354 | +python -c "import torch; print(torch.cuda.is_available())" |
| 355 | + |
| 356 | +# Docker needs NVIDIA runtime |
| 357 | +docker run --gpus all ... |
| 358 | +``` |
| 359 | + |
| 360 | +--- |
| 361 | + |
| 362 | +## Environment Variables Reference |
| 363 | + |
| 364 | +| Variable | Description | Example | |
| 365 | +|----------|-------------|---------| |
| 366 | +| `RDMA_DEVICE_NAME` | Specify RDMA device name | `mlx5_0` | |
| 367 | +| `RDMA_TEST_HOST` | Specify test host IP | `10.0.0.1` | |
| 368 | +| `MC_TE_METRIC` | Enable Mooncake metrics | `1` | |
| 369 | +| `MC_IB_PCI_RELAXED_ORDERING` | Enable PCIe relaxed ordering | `1` | |
| 370 | + |
| 371 | +--- |
| 372 | + |
| 373 | +## Test Files Overview |
| 374 | + |
| 375 | +| File | Description | Auto-discovered by pytest | |
| 376 | +|------|-------------|--------------------------| |
| 377 | +| `test_mooncake_transfer_engine_rdma.py` | Integration tests for MooncakeTransferEngineConnector (basic, E2E, lifecycle, GPU) | Yes | |
| 378 | +| `test_mooncake_transfer_engine_buffer.py` | Memory pool and buffer management unit tests | Yes | |
| 379 | +| `cross_node_mooncake_transfer_engine.py` | Cross-node (multi-machine) testing script — run manually | No (filename does not start with `test_`) | |
| 380 | + |
| 381 | +### test_mooncake_transfer_engine_rdma.py — Test Classes |
| 382 | + |
| 383 | +| Test Class | Memory Pool | Marker | Description | |
| 384 | +|------------|-------------|--------|-------------| |
| 385 | +| `TestBasicConnector` | CPU | — | Initialization, put tensor/bytes/object, cleanup, pool exhaustion | |
| 386 | +| `TestEndToEnd` | CPU | — | E2E RDMA transfer: tensor, bytes, object, zero-copy, large payload (100MB), mixed types, concurrency | |
| 387 | +| `TestLifecycle` | CPU | — | Close, context manager, double-close safety | |
| 388 | +| `TestGPUPool` | GPU | — | GPU pool init, put CPU/GPU tensor, GPU E2E transfer | |
| 389 | +| `TestStressCorrectness` | CPU | `slow` | Concurrent put+get with MD5 integrity, bidirectional concurrency, edge cases (1-element tensor, empty bytes), 500MB payload, rapid alloc/free cycles | |
| 390 | + |
| 391 | +### test_mooncake_transfer_engine_buffer.py — Test Classes |
| 392 | + |
| 393 | +| Test Class | Marker | Description | |
| 394 | +|------------|--------|-------------| |
| 395 | +| `TestBufferAllocator` | — | Basic alloc/free, alignment, exhaustion/recovery, thread safety | |
| 396 | +| `TestAllocatorInvariants` | `slow` | Double-free safety, overlap corruption detection, adjacent-block merging, fragmentation/defrag | |
| 397 | +| `TestManagedBuffer` | — | Tensor views, context manager | |