Commit 8e27c74

dstaay-fb authored and meta-codesync[bot] committed
NIC autodetection based on PCI distance for cuda:X and/or cpu:Y (meta-pytorch#1305)
Summary:

Pull Request resolved: meta-pytorch#1305

Implements an automated RDMA Network Interface Card (NIC) selection system that pairs compute devices (CUDA GPUs, CPU/NUMA nodes) with the optimal RDMA NICs based on PCI topology distance.

### **Key Components:**

1. **PCI Topology Discovery** (`parse_pci_topology()`)
   * Parses `/sys/bus/pci/devices/` to build a hierarchical tree of PCI devices
   * Establishes parent-child relationships by following symlinks in the filesystem
   * Creates a complete topology map of the system's PCI bus structure
2. **Distance Calculation Algorithm** (`PCIDevice::distance_to()`)
   * **Intra-NUMA Distance**: For devices on the same PCI tree, calculates hop count to the lowest common ancestor
   * **Cross-NUMA Penalty**: Uses penalty-based scoring for devices across different NUMA domains:
     * Base penalty: 20.0 (higher than typical intra-NUMA distances of 0-8 hops)
     * Cross-domain penalty: 1000.0 for different PCI domains
     * Bus distance scaling: 0.1 factor for tie-breaking between different bus numbers
3. **Device Resolution Strategy**:
   * **CUDA devices** (`cuda:N`): Maps GPU index to PCI address via `/proc/driver/nvidia/gpus`
   * **CPU/NUMA devices** (`cpu:N`): Finds a representative PCI device for NUMA node N
   * **Direct NIC selection** (`nic:mlx5_N`): Bypasses topology and uses the specified NIC directly
4. **Unified Selection Interface** (`select_optimal_rdma_device()`)
   * Supports format: `"type:id"` (e.g., `"cuda:0"`, `"cpu:1"`, `"nic:mlx5_3"`)
   * Automatically finds the RDMA NIC with minimum PCI distance to the specified compute device
   * Falls back to the first available device if topology resolution fails

### **Algorithm Benefits:**

* **Performance Optimization**: Ensures data paths use the shortest PCI routes, minimizing latency and maximizing bandwidth
* **NUMA Awareness**: Heavily penalizes cross-NUMA communication to prefer local NICs
* **Hardware Agnostic**: Works across different server configurations and PCI topologies
* **Automatic Fallback**: Gracefully handles edge cases with sensible defaults

### **Validation Results:**

The algorithm was validated on GT20 hardware and produces the same GPU-to-NIC mappings as the proven Python reference implementation, demonstrating correctness and reliability.

This topology algorithm is critical for high-performance RDMA workloads where optimal device pairing can significantly impact communication performance in distributed computing scenarios.

Reviewed By: zdevito

Differential Revision: D83008226

fbshipit-source-id: cd94bca32815d4c10fbbdde2e93b23fd2cb44ef7
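The distance scoring and nearest-NIC selection described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: `PciNode`, `node`, `domain_and_bus`, `distance`, and `select_nearest` are hypothetical names, each device is assumed to carry its root-to-device list of PCI addresses, and the way the three constants are combined is an assumption. Only the constant values (20.0, 1000.0, 0.1) and the fallback to the first device come from the summary.

```rust
/// Hypothetical, simplified stand-in for the commit's `PCIDevice`.
#[derive(Debug, Clone)]
struct PciNode {
    /// PCI addresses from the root of the tree down to this device.
    path: Vec<String>,
}

// Constant values quoted in the commit summary; how they combine is assumed.
const CROSS_NUMA_BASE_PENALTY: f64 = 20.0;
const CROSS_DOMAIN_PENALTY: f64 = 1000.0;
const BUS_DISTANCE_SCALE: f64 = 0.1;

/// Build a node from a root-to-device list of addresses (example helper).
fn node(path: &[&str]) -> PciNode {
    PciNode { path: path.iter().map(|s| s.to_string()).collect() }
}

/// Parse the hex domain and bus fields out of a "dddd:bb:dd.f" address.
fn domain_and_bus(addr: &str) -> Option<(u32, u32)> {
    let mut parts = addr.split(':');
    let domain = u32::from_str_radix(parts.next()?, 16).ok()?;
    let bus = u32::from_str_radix(parts.next()?, 16).ok()?;
    Some((domain, bus))
}

/// Score the distance between two devices; lower is closer.
fn distance(a: &PciNode, b: &PciNode) -> f64 {
    // Length of the shared prefix = depth of the lowest common ancestor.
    let common = a.path.iter().zip(&b.path).take_while(|(x, y)| x == y).count();
    if common > 0 {
        // Same PCI tree: hops from each device up to the common ancestor.
        return ((a.path.len() - common) + (b.path.len() - common)) as f64;
    }
    // Different trees: fall back to penalty-based scoring.
    let leaf = |n: &PciNode| n.path.last().and_then(|s| domain_and_bus(s));
    let (Some((da, ba)), Some((db, bb))) = (leaf(a), leaf(b)) else {
        return f64::INFINITY; // unparsable address: treat as unreachable
    };
    if da != db {
        CROSS_DOMAIN_PENALTY
    } else {
        // Bus-number difference only breaks ties between cross-NUMA NICs.
        CROSS_NUMA_BASE_PENALTY + BUS_DISTANCE_SCALE * ba.abs_diff(bb) as f64
    }
}

/// Pick the NIC closest to `target`; if the target could not be resolved,
/// fall back to the first available NIC.
fn select_nearest<'a>(
    target: Option<&PciNode>,
    nics: &'a [(String, PciNode)],
) -> Option<&'a str> {
    let Some(target) = target else {
        return nics.first().map(|(name, _)| name.as_str());
    };
    nics.iter()
        .min_by(|(_, x), (_, y)| distance(target, x).total_cmp(&distance(target, y)))
        .map(|(name, _)| name.as_str())
}
```

In this sketch, a GPU and a NIC that sit under the same PCI switch are two hops apart (one hop each to the shared parent), so they always score below the 20.0 cross-NUMA base penalty, which in turn stays far below the 1000.0 cross-domain penalty. Keeping the tiers this far apart is what lets a single `f64` comparison express "same tree first, then same domain, then anything".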
1 parent 0c0b420 commit 8e27c74

7 files changed: +865 -82 lines changed

monarch_rdma/Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@ cuda-sys = { path = "../cuda-sys" }
 hyperactor = { version = "0.0.0", path = "../hyperactor" }
 rand = { version = "0.8", features = ["small_rng"] }
 rdmaxcel-sys = { path = "../rdmaxcel-sys" }
+regex = "1.11.1"
 serde = { version = "1.0.219", features = ["derive", "rc"] }
 tracing = { version = "0.1.41", features = ["attributes", "valuable"] }
