Commit 1589931

[Feature] vLLM-Omni RDMA connector (#1019)
Signed-off-by: natureofnature <wzliu@connect.hku.hk>
1 parent 7b02b85 commit 1589931

20 files changed, +3645 -63 lines changed
Lines changed: 397 additions & 0 deletions
@@ -0,0 +1,397 @@
# RDMA Test Configuration Guide

This document explains how to configure the RDMA environment and run tests for `MooncakeTransferEngineConnector`.

## Table of Contents

- [Docker Container Permissions](#docker-container-permissions)
- [Single-Node Testing](#single-node-testing)
- [Multi-Node Testing](#multi-node-testing)
- [Running Tests](#running-tests)
- [Cross-Node Testing](#cross-node-testing)
- [Troubleshooting](#troubleshooting)
- [Environment Variables Reference](#environment-variables-reference)
- [Test Files Overview](#test-files-overview)

---

## Docker Container Permissions

RDMA tests require access to InfiniBand/RoCE devices and system topology. Add the following permissions when running `docker run`.

### Option 1: Minimal Permissions (Recommended)

```bash
docker run -it \
  --cap-add=SYS_PTRACE \
  --cap-add=IPC_LOCK \
  --security-opt seccomp=unconfined \
  --network=host \
  --device=/dev/infiniband \
  -v /sys/class/infiniband:/sys/class/infiniband:ro \
  your-image:tag
```

Parameter explanation:
- `--cap-add=SYS_PTRACE`: Allow reading system topology information
- `--cap-add=IPC_LOCK`: Allow memory locking (required for RDMA memory registration)
- `--security-opt seccomp=unconfined`: Disable seccomp restrictions
- `--network=host`: Use host network (required for RDMA)
- `--device=/dev/infiniband`: Mount InfiniBand devices
- `-v /sys/class/infiniband`: Mount IB device info (read-only)
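
To confirm from inside the container that these flags took effect, a small preflight check can help. The following is a minimal sketch, not part of the test suite: `inspect_rdma_container` is a hypothetical helper, and it assumes the two paths mounted by the `docker run` command above.

```python
import resource
from pathlib import Path


def inspect_rdma_container(dev_dir="/dev/infiniband", sysfs_dir="/sys/class/infiniband"):
    """Collect basic facts about RDMA visibility inside the current container."""
    sysfs = Path(sysfs_dir)
    # Each entry under the sysfs directory is one RDMA device (for example mlx5_0).
    devices = sorted(p.name for p in sysfs.iterdir()) if sysfs.is_dir() else []
    # Soft RLIMIT_MEMLOCK; informational only, because a process that holds
    # CAP_IPC_LOCK is not bound by this limit.
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return {
        "dev_dir_present": Path(dev_dir).is_dir(),
        "devices": devices,
        "memlock_unlimited": soft_limit == resource.RLIM_INFINITY,
    }


if __name__ == "__main__":
    print(inspect_rdma_container())
```

An empty `devices` list or a missing device directory usually points at the `--device` or `-v` flag rather than at Mooncake itself.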

### Option 2: Full Permissions (Quick but not recommended for production)

```bash
docker run -it \
  --privileged \
  --network=host \
  your-image:tag
```

`--privileged` grants full host permissions. It is suitable for quick testing but not recommended for production.

---

## Single-Node Testing

When running single-node tests (producer and consumer on the same machine), ensure they use the **same RDMA device**.

### Problem Background

InfiniBand routes traffic by LID (Local Identifier), and each device port has its own LID. Two different devices on one host are not guaranteed to reach each other directly. If no device is specified, Mooncake may assign different devices to the two connectors, which causes handshake failures.

Common error:
```
[Handshake] Failed to modify QP to RTR, check mtu, gid, peer lid, peer qp num: Invalid argument [22]
```

### Solution

**Method 1: Set Environment Variable (Recommended)**

```bash
# List available RDMA devices
ibstat

# Select a device (e.g., mlx5_0)
export RDMA_DEVICE_NAME='mlx5_0'

# Run tests
pytest test_mooncake_transfer_engine_rdma.py -v -s
```
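
One way to make the single-device requirement explicit is to resolve the device name once and fail early when it is ambiguous. The sketch below is illustrative only: `list_rdma_devices` and `resolve_device` are hypothetical helpers, not the selection logic used by the tests, and reading device names from `/sys/class/infiniband` is an assumption about a typical Linux RDMA stack.

```python
import os
from pathlib import Path


def list_rdma_devices(sysfs_dir="/sys/class/infiniband"):
    """Return the RDMA device names visible under sysfs, sorted by name."""
    root = Path(sysfs_dir)
    return sorted(p.name for p in root.iterdir()) if root.is_dir() else []


def resolve_device(environ=os.environ, sysfs_dir="/sys/class/infiniband"):
    """Return the one device both connectors should share, or raise ValueError."""
    devices = list_rdma_devices(sysfs_dir)
    requested = environ.get("RDMA_DEVICE_NAME")
    if requested:
        # An explicit choice wins, but it must name a device that exists.
        if requested not in devices:
            raise ValueError(f"RDMA_DEVICE_NAME={requested!r} not found; available: {devices}")
        return requested
    if len(devices) == 1:
        # With a single device there is nothing to disagree about.
        return devices[0]
    raise ValueError(f"set RDMA_DEVICE_NAME explicitly; available: {devices}")
```

Raising on ambiguity instead of picking a default is a deliberate choice here: a silent default is how two connectors end up on different devices.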

**Method 2: Use RoCE Devices**

If the system has RoCE devices (using IPv4 routing), the test code will automatically detect and prefer them. RoCE device GIDs start with `00:00:00:00:00:00:00:00:00:00:ff:ff` (IPv4-mapped).
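
The IPv4-mapped prefix can be recognized with the standard library. This is a minimal sketch, not the detection code used by the tests: `roce_ipv4_from_gid` is a hypothetical helper, and it assumes the GID is written as eight colon-separated 16-bit groups (the IPv6-style form), rather than the byte-wise form shown above.

```python
import ipaddress


def roce_ipv4_from_gid(gid):
    """Return the IPv4 address embedded in an IPv4-mapped GID, or None otherwise."""
    # A GID is 128 bits wide, so it parses as an IPv6 address; ipv4_mapped is
    # set only when the address lies in ::ffff:0:0/96.
    mapped = ipaddress.IPv6Address(gid).ipv4_mapped
    return str(mapped) if mapped is not None else None


print(roce_ipv4_from_gid("0000:0000:0000:0000:0000:ffff:0a00:0001"))  # 10.0.0.1
print(roce_ipv4_from_gid("fe80:0000:0000:0000:0002:c903:00ab:cdef"))  # None
```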

**Method 3: Ensure MTU Consistency**

Make sure both endpoints use the same MTU:

```bash
# Check device MTU (see max_mtu and active_mtu in the output)
ibv_devinfo -d mlx5_0
```

---

## Multi-Node Testing

For multi-node tests, producer and consumer run on different machines connected via an InfiniBand switch.

### Prerequisites

1. Both machines have Mooncake and RDMA drivers installed
2. Both machines are in the same InfiniBand subnet
3. Switch is properly configured

### Configuration

**Machine A (Producer):**

```bash
# Set RDMA host IP (InfiniBand interface IP)
export RDMA_TEST_HOST='10.0.0.1'

# Optional: Specify device
export RDMA_DEVICE_NAME='mlx5_0'
```

**Machine B (Consumer):**

```bash
# Set RDMA host IP
export RDMA_TEST_HOST='10.0.0.2'

# Optional: Specify device
export RDMA_DEVICE_NAME='mlx5_0'
```

### Verify Connectivity

```bash
# Ping IB interface
ping 10.0.0.2

# Test RDMA connectivity with ibping
# On Machine B (server)
ibping -S

# On Machine A (client); -G takes the server's port GUID
ibping -G <Machine_B_PORT_GUID>
```

---

## Running Tests

### Run All RDMA Tests (Single-Node, fast suite)

Slow tests (large payloads, stress, concurrency integrity) are marked `@pytest.mark.slow`. Use `-m "not slow"` to skip them in quick CI runs or fast local iteration.

```bash
cd tests/distributed/omni_connectors

# Fast suite only (excludes slow/stress tests)
pytest test_mooncake_transfer_engine_rdma.py test_mooncake_transfer_engine_buffer.py -v -s -m "not slow"
```

### Run Including Slow Tests

```bash
# Run ALL tests including slow/stress tests
pytest test_mooncake_transfer_engine_rdma.py test_mooncake_transfer_engine_buffer.py -v -s

# Run ONLY the slow/stress tests
pytest test_mooncake_transfer_engine_rdma.py test_mooncake_transfer_engine_buffer.py -v -s -m slow
```

### Run Buffer Management Tests

```bash
# Fast only
pytest test_mooncake_transfer_engine_buffer.py -v -s -m "not slow"

# Including allocator invariant tests (double-free, overlap, merge)
pytest test_mooncake_transfer_engine_buffer.py -v -s
```

### Run Specific Test Classes

```bash
# Basic connector tests
pytest test_mooncake_transfer_engine_rdma.py::TestBasicConnector -v -s

# End-to-end RDMA transfer tests
pytest test_mooncake_transfer_engine_rdma.py::TestEndToEnd -v -s

# Lifecycle & resource management tests
pytest test_mooncake_transfer_engine_rdma.py::TestLifecycle -v -s

# GPU memory pool tests (requires CUDA)
pytest test_mooncake_transfer_engine_rdma.py::TestGPUPool -v -s

# Stress / correctness tests (slow)
pytest test_mooncake_transfer_engine_rdma.py::TestStressCorrectness -v -s
```

### RDMA Environment Diagnostics

For quick diagnostics (device status, Mooncake availability, env vars, etc.),
see the [Troubleshooting section](../../../docs/design/feature/omni_connectors/mooncake_transfer_engine_connector.md#troubleshooting)
in the connector documentation.

---

## Cross-Node Testing

The `cross_node_mooncake_transfer_engine.py` script tests RDMA transfers between two separate physical machines. `pytest` does **not** auto-discover this script because its filename does not start with `test_`, so it must be run manually on each node.

### Prerequisites

1. Both machines have Mooncake installed
2. Both machines are connected via an InfiniBand/RoCE switch
3. The firewall allows the ZMQ ports (default: 15500, 15501)
4. Same RDMA device name on both nodes (if multiple devices exist)
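
The firewall prerequisite can be checked from the peer machine once the other side is listening. A minimal sketch, assuming the ZMQ endpoints are plain TCP listeners (an assumption; the guide only gives the port numbers); `tcp_port_open` is a hypothetical helper, not part of the script.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Replace the address with the peer's RDMA IP; the ports are the documented defaults.
    for port in (15500, 15501):
        print(port, tcp_port_open("10.0.0.1", port))
```

A `False` result is only meaningful while the peer process is running, since a closed port and a blocked port look the same from the client.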

### Running Cross-Node Tests

**On Machine A (Producer), start first:**

```bash
cd benchmarks/distributed/omni_connectors/

# Optional: specify device if multiple exist
export RDMA_DEVICE_NAME='mlx5_0'

python cross_node_mooncake_transfer_engine.py \
  --role producer \
  --local-host <PRODUCER_IP> \
  --remote-host <CONSUMER_IP> \
  --tensor-size-mb 100 \
  --num-transfers 3
```

**On Machine B (Consumer), start after producer:**

```bash
cd benchmarks/distributed/omni_connectors/

export RDMA_DEVICE_NAME='mlx5_0'

python cross_node_mooncake_transfer_engine.py \
  --role consumer \
  --local-host <CONSUMER_IP> \
  --remote-host <PRODUCER_IP> \
  --tensor-size-mb 100 \
  --num-transfers 3
```

### Transfer Modes

| Mode | Description | Example |
|------|-------------|---------|
| `copy` | Normal path: tensor copied to RDMA pool (default) | `--mode copy` |
| `zerocopy` | Zero-copy path: data created directly in RDMA pool | `--mode zerocopy` |
| `gpu` | GPU transfer: RDMA pool on GPU, uses GPUDirect | `--mode gpu --gpu-id 0` |
257+
### Benchmark Mode
258+
259+
Skip MD5 verification and measure pure RDMA throughput:
260+
261+
```bash
262+
# Producer
263+
python cross_node_mooncake_transfer_engine.py \
264+
--role producer \
265+
--local-host <PRODUCER_IP> \
266+
--remote-host <CONSUMER_IP> \
267+
--tensor-size-mb 1024 \
268+
--num-transfers 20 \
269+
--benchmark
270+
271+
# Consumer
272+
python cross_node_mooncake_transfer_engine.py \
273+
--role consumer \
274+
--local-host <CONSUMER_IP> \
275+
--remote-host <PRODUCER_IP> \
276+
--tensor-size-mb 1024 \
277+
--num-transfers 20 \
278+
--benchmark
279+
```
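
For reading benchmark results, aggregate throughput follows directly from the tensor size, the transfer count, and the elapsed time. The sketch below shows that arithmetic only; `throughput_gbps` is a hypothetical helper, and treating 1 MB as 2**20 bytes is an assumption, since the guide does not say which convention the script uses.

```python
def throughput_gbps(tensor_size_mb, num_transfers, elapsed_seconds):
    """Aggregate throughput in gigabits per second, taking 1 MB as 2**20 bytes."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    total_bits = tensor_size_mb * 2**20 * num_transfers * 8
    return total_bits / elapsed_seconds / 1e9


# 20 transfers of 1024 MB finishing in 10 seconds (illustrative numbers, not a measurement)
print(round(throughput_gbps(1024, 20, 10.0), 2))  # 17.18
```

Comparing this figure with the link rate reported by `ibstatus` gives a rough sense of how close a run is to line rate.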

### Cross-Node Test Options

| Option | Description | Default |
|--------|-------------|---------|
| `--role` | `producer` or `consumer` | Required |
| `--local-host` | Local RDMA IP address | Required |
| `--remote-host` | Remote RDMA IP address | Required |
| `--local-port` | Local ZMQ port for RDMA data | 15500 |
| `--remote-port` | Remote ZMQ port for RDMA data | 15500 |
| `--ctrl-port` | Control channel port | 15501 |
| `--tensor-size-mb` | Tensor size in MB | 100 |
| `--num-transfers` | Number of transfers | 3 |
| `--mode` | `copy`, `zerocopy`, or `gpu` | `copy` |
| `--gpu-id` | GPU ID for GPU mode | 0 |
| `--benchmark` | Skip MD5, pure performance test | off |
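
The options table maps naturally onto an `argparse` definition. The sketch below mirrors the table for illustration and is not the script's actual parser: `build_parser` is a hypothetical name, and the `int` argument types are assumptions.

```python
import argparse


def build_parser():
    """Build a parser whose flags and defaults mirror the options table."""
    parser = argparse.ArgumentParser(description="Sketch of the cross-node test options")
    parser.add_argument("--role", required=True, choices=["producer", "consumer"])
    parser.add_argument("--local-host", required=True, help="Local RDMA IP address")
    parser.add_argument("--remote-host", required=True, help="Remote RDMA IP address")
    parser.add_argument("--local-port", type=int, default=15500)
    parser.add_argument("--remote-port", type=int, default=15500)
    parser.add_argument("--ctrl-port", type=int, default=15501)
    parser.add_argument("--tensor-size-mb", type=int, default=100)
    parser.add_argument("--num-transfers", type=int, default=3)
    parser.add_argument("--mode", choices=["copy", "zerocopy", "gpu"], default="copy")
    parser.add_argument("--gpu-id", type=int, default=0)
    parser.add_argument("--benchmark", action="store_true", help="Skip MD5 verification")
    return parser


args = build_parser().parse_args(
    ["--role", "producer", "--local-host", "10.0.0.1", "--remote-host", "10.0.0.2"]
)
print(args.mode, args.local_port, args.ctrl_port, args.benchmark)  # copy 15500 15501 False
```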

---

## Troubleshooting

### 1. "Failed to modify QP to RTR" Error

**Cause**: QP handshake failed, usually due to device configuration mismatch.

**Solution**:
```bash
# Force using the same device
export RDMA_DEVICE_NAME='mlx5_0'
```

### 2. "Mooncake TransferEngine is not available"

**Cause**: Mooncake not installed or import failed.

**Solution**:
```bash
# Check Mooncake installation
python -c "from mooncake.engine import TransferEngine; print('OK')"

# Reinstall if needed
pip install mooncake-transfer-engine
# Or using uv
uv pip install mooncake-transfer-engine
```
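
The same availability check can be written as a reusable function, for example to skip tests cleanly when Mooncake is absent. This is a minimal sketch using only the standard library; `module_available` is a hypothetical helper, and the module path `mooncake.engine` is taken from the command above.

```python
import importlib.util


def module_available(name):
    """Return True if `name` can be located.

    Parent packages are imported during the lookup; the module itself is not.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package (for example "mooncake") is missing
        # or fails to import.
        return False


if __name__ == "__main__":
    print("mooncake.engine available:", module_available("mooncake.engine"))
```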

### 3. "Permission denied" accessing /dev/infiniband

**Cause**: Container lacks IB device access permissions.

**Solution**:
```bash
docker run --device=/dev/infiniband --cap-add=IPC_LOCK ...
```

### 4. Test Timeout

**Cause**: RDMA connection establishment failed, or network latency is high.

**Solution**:
```bash
# Check network status
ibstat
ibstatus
```

### 5. GPU Test Failed "CUDA is not available"

**Cause**: CUDA environment not configured or GPU unavailable.

**Solution**:
```bash
# Check CUDA
python -c "import torch; print(torch.cuda.is_available())"

# Docker needs NVIDIA runtime
docker run --gpus all ...
```

---

## Environment Variables Reference

| Variable | Description | Example |
|----------|-------------|---------|
| `RDMA_DEVICE_NAME` | Specify RDMA device name | `mlx5_0` |
| `RDMA_TEST_HOST` | Specify test host IP | `10.0.0.1` |
| `MC_TE_METRIC` | Enable Mooncake metrics | `1` |
| `MC_IB_PCI_RELAXED_ORDERING` | Enable PCIe relaxed ordering | `1` |
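
When a test run misbehaves, it helps to record which of these variables were actually set in the process. A minimal sketch for that purpose; `rdma_test_env` is a hypothetical helper that only reads the variables and does not reproduce how the tests or Mooncake interpret them.

```python
import os

# Names taken from the table above.
RDMA_ENV_VARS = (
    "RDMA_DEVICE_NAME",
    "RDMA_TEST_HOST",
    "MC_TE_METRIC",
    "MC_IB_PCI_RELAXED_ORDERING",
)


def rdma_test_env(environ=os.environ):
    """Return a dict of the documented variables; unset ones map to None."""
    return {name: environ.get(name) for name in RDMA_ENV_VARS}


if __name__ == "__main__":
    for name, value in rdma_test_env().items():
        print(f"{name}={value if value is not None else '<unset>'}")
```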

---

## Test Files Overview

| File | Description | Auto-discovered by pytest |
|------|-------------|--------------------------|
| `test_mooncake_transfer_engine_rdma.py` | Integration tests for MooncakeTransferEngineConnector (basic, E2E, lifecycle, GPU, stress) | Yes |
| `test_mooncake_transfer_engine_buffer.py` | Memory pool and buffer management unit tests | Yes |
| `cross_node_mooncake_transfer_engine.py` | Cross-node (multi-machine) testing script, run manually | No (filename does not start with `test_`) |

### test_mooncake_transfer_engine_rdma.py: Test Classes

| Test Class | Memory Pool | Marker | Description |
|------------|-------------|--------|-------------|
| `TestBasicConnector` | CPU | | Initialization, put tensor/bytes/object, cleanup, pool exhaustion |
| `TestEndToEnd` | CPU | | E2E RDMA transfer: tensor, bytes, object, zero-copy, large payload (100MB), mixed types, concurrency |
| `TestLifecycle` | CPU | | Close, context manager, double-close safety |
| `TestGPUPool` | GPU | | GPU pool init, put CPU/GPU tensor, GPU E2E transfer |
| `TestStressCorrectness` | CPU | `slow` | Concurrent put+get with MD5 integrity, bidirectional concurrency, edge cases (1-element tensor, empty bytes), 500MB payload, rapid alloc/free cycles |

### test_mooncake_transfer_engine_buffer.py: Test Classes

| Test Class | Marker | Description |
|------------|--------|-------------|
| `TestBufferAllocator` | | Basic alloc/free, alignment, exhaustion/recovery, thread safety |
| `TestAllocatorInvariants` | `slow` | Double-free safety, overlap corruption detection, adjacent-block merging, fragmentation/defrag |
| `TestManagedBuffer` | | Tensor views, context manager |
