38 changes: 14 additions & 24 deletions benchmark/nixlbench/README.md
@@ -528,6 +528,7 @@ NIXL Benchmark uses an ETCD key-value store for coordination between benchmark w

1. Ensure the ETCD server is running (e.g., `docker run -p 2379:2379 quay.io/coreos/etcd`)
2. Launch multiple nixlbench instances pointing to the same ETCD server
3. Multiple instances should be launched within the default timeout of 60s.
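
The first step above can be scripted by waiting for etcd to answer before any benchmark process starts. The sketch below is a hypothetical helper, not part of nixlbench: `wait_for_etcd`, `etcd_probe`, and the `ETCD_PROBE` override are names introduced here for illustration, and it assumes `curl` is installed and that the server exposes etcd's HTTP `/health` endpoint on its client port.

```bash
# Hypothetical helper (not shipped with nixlbench): poll etcd until it responds.

# Default probe: succeeds when GET <endpoint>/health returns an HTTP success code.
etcd_probe() {
    curl -fsS --max-time 2 "$1/health" >/dev/null 2>&1
}

# wait_for_etcd ENDPOINT [TIMEOUT_SECONDS]
# Returns 0 as soon as the probe succeeds, 1 if the deadline passes first.
# ETCD_PROBE may name an alternative probe command (an assumption of this sketch).
wait_for_etcd() {
    local endpoint="$1" timeout="${2:-30}" probe="${ETCD_PROBE:-etcd_probe}"
    local deadline=$(( SECONDS + timeout ))
    while (( SECONDS < deadline )); do
        if "$probe" "$endpoint"; then
            return 0
        fi
        sleep 1
    done
    return 1
}
```

A possible use is `wait_for_etcd http://etcd-server:2379 30 && ./nixlbench --etcd_endpoints http://etcd-server:2379 --backend UCX`, so the benchmark only starts once the coordination store is reachable.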

**For single-instance storage benchmarks:**
```bash
@@ -538,21 +539,15 @@ NIXL Benchmark uses an ETCD key-value store for coordination between benchmark w
./nixlbench --etcd_endpoints http://etcd-server:2379 --backend GDS --filepath /mnt/storage/testfile
```

Note: etcd can also be installed directly on the host:
```bash
apt install etcd-server
```

Example:
**For multi-instance storage benchmarks where ETCD is required:**
```bash
# On host 1
./nixlbench --etcd_endpoints http://etcd-server:2379 --backend UCX --initiator_seg_type VRAM --target_seg_type VRAM

# On host 2
./nixlbench --etcd_endpoints http://etcd-server:2379 --backend UCX --initiator_seg_type VRAM --target_seg_type VRAM
```

The workers automatically coordinate ranks through ETCD as they connect.
The workers automatically coordinate ranks through ETCD as they connect. Start the second nixlbench instance within 60 seconds of the first; otherwise the first instance times out at the barrier and exits with an error.
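
For a quick smoke test where both instances run from one shell, the 60-second window is easiest to respect by starting them from a single script. The following is a minimal sketch under that assumption: `launch_pair`, `NIXLBENCH`, and `LAUNCH_GAP` are hypothetical names introduced here, and the function only starts two copies of the same command in the background and reports whether both exited cleanly.

```bash
# Hypothetical helper (not shipped with nixlbench): start two instances close
# together so the second one joins well inside the 60 s barrier timeout.
# NIXLBENCH  - path to the binary (defaults to ./nixlbench)
# LAUNCH_GAP - seconds to pause after each launch (defaults to 2)
launch_pair() {
    local bin="${NIXLBENCH:-./nixlbench}"
    local -a pids=()
    local pid rc=0

    for _ in 1 2; do
        "$bin" "$@" &          # run each instance in the background
        pids+=("$!")
        sleep "${LAUNCH_GAP:-2}"
    done

    # Collect both exit statuses; report failure if either instance failed.
    for pid in "${pids[@]}"; do
        wait "$pid" || rc=1
    done
    return "$rc"
}
```

It could be called as `launch_pair --etcd_endpoints http://etcd-server:2379 --backend UCX`. For instances on separate hosts, the same timing constraint applies, but each host runs its own command as in the example above.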

### Backend-Specific Examples

@@ -562,9 +557,12 @@ The workers automatically coordinate ranks through ETCD as they connect.
```bash
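# Basic UCX benchmark: the first instance runs in the background so that
# the second can be started from the same shell within the 60s timeout
./nixlbench --etcd_endpoints http://etcd-server:2379 --backend UCX &
sleep 2
./nixlbench --etcd_endpoints http://etcd-server:2379 --backend UCX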

# UCX with specific devices
./nixlbench --etcd_endpoints http://etcd-server:2379 --backend UCX --device_list mlx5_0,mlx5_1
# On host 1
./nixlbench --etcd_endpoints http://etcd-server:2379 --backend UCX --device_list mlx5_0,mlx5_1
# On host 2
./nixlbench --etcd_endpoints http://etcd-server:2379 --backend UCX --device_list mlx5_0,mlx5_1
```

**GPUNETIO Backend:**
@@ -706,20 +704,6 @@ Transfer times are higher than local storage, so consider reducing iterations:
- Test read operations: `--op_type READ`
- Validate data consistency: `--check_consistency`

### Multi-Node Coordination

Launch multiple nixlbench instances pointing to the same ETCD server:

```bash
# On host 1
./nixlbench --etcd_endpoints http://etcd-server:2379 --backend UCX --initiator_seg_type VRAM --target_seg_type VRAM

# On host 2
./nixlbench --etcd_endpoints http://etcd-server:2379 --backend UCX --initiator_seg_type VRAM --target_seg_type VRAM
```

The workers automatically coordinate ranks through ETCD as they connect.

## Troubleshooting

### Common Build Issues
@@ -814,6 +798,12 @@ export UCX_LOG_LEVEL=DEBUG # Verbose UCX logging
export UCX_PROTO_INFO=y # See transport used by UCX
```

#### Etcd Cleanup
```bash
# If a nixlbench instance fails, clean up the etcd state before starting nixlbench again
ETCDCTL_API=3 etcdctl del "xferbench" --prefix=true
```
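
The cleanup can also be tied to the benchmark's exit status so that a failed run does not leave stale keys behind. This is a sketch only: `run_nixlbench`, `cleanup_etcd`, and the `NIXLBENCH` and `ETCDCTL` overrides are hypothetical names introduced here, and it assumes `etcdctl` can reach the same etcd server that nixlbench used (for a remote server, etcdctl's `--endpoints` option would be needed).

```bash
# Hypothetical wrapper (not shipped with nixlbench): run the benchmark and,
# if it fails, delete the "xferbench" keys so the next run starts clean.

cleanup_etcd() {
    # Same command as in the cleanup snippet above; ETCDCTL may point to another binary.
    ETCDCTL_API=3 "${ETCDCTL:-etcdctl}" del "xferbench" --prefix=true
}

run_nixlbench() {
    local rc=0
    "${NIXLBENCH:-./nixlbench}" "$@" || rc=$?
    if (( rc != 0 )); then
        echo "nixlbench exited with status $rc; clearing xferbench keys" >&2
        cleanup_etcd || echo "warning: etcd cleanup failed" >&2
    fi
    return "$rc"   # preserve the benchmark's own exit status
}
```

Called as `run_nixlbench --etcd_endpoints http://etcd-server:2379 --backend UCX`, the wrapper leaves etcd untouched after a successful run. When several instances share one etcd server, cleaning up while other instances are still running would likely disrupt them, so a wrapper like this fits best after all instances have stopped.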

### Performance Tuning

#### CPU Affinity
@@ -842,4 +832,4 @@ sudo sysctl -p

---

*This guide covers NIXLBench build and usage procedures as of 2025. For the latest updates, please refer to the official repository.*