Measured PCIe peer-to-peer bandwidth, latency, and NCCL AllReduce performance on RTX PRO 6000 Blackwell systems. These numbers are the foundation for understanding inference throughput limits.
- P2P Bandwidth Measurements
- P2P Latency Measurements
- NCCL AllReduce Bus Bandwidth
- BAR1 Configuration
- How PCIe Bandwidth Affects Inference
- GRUB Kernel Parameters for PCIe Stability
- Debugging Tools
All measurements were taken using the CUDA p2pBandwidthLatencyTest sample or luke's p2pmark tool.
Unidirectional bandwidth, measured with P2P writes over PCIe Gen5 x16 links:
| Source | Setup | Same-NUMA | Cross-NUMA |
|---|---|---|---|
| purplepow3r | Dual Turin, 4x 6000 Pro WS + 2x 5090 | ~55-56 GB/s | ~51 GB/s |
| orangezed | Dual EPYC 9374F, 8x 6000 Pro MaxQ | ~54 GB/s | ~39 GB/s |
| Festr | Dual Turin, 8x 6000 Pro Server | ~53 GB/s | -- |
| luke | Switches, 8x 6000 Pro MaxQ | ~54 GB/s | N/A (single CPU) |
| voipmonitor | ASUS ESC8000A-E13P, Broadcom PEX890xx switches, 8x 6000 Pro Server | ~54 GB/s | ~50 GB/s |
Theoretical maximum: PCIe Gen5 x16 = 63 GB/s unidirectional. The ~56 GB/s measured represents ~89% efficiency.
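The 63 GB/s figure and the ~89% efficiency can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the standard Gen5 signalling rate (32 GT/s per lane) and 128b/130b line encoding; it ignores TLP/DLLP protocol overhead, which accounts for part of the gap between theoretical and measured numbers. The function names are illustrative.

```python
def pcie_unidirectional_gbs(gt_per_s: float, lanes: int) -> float:
    """Unidirectional link bandwidth in GB/s after 128b/130b line encoding."""
    raw_gbit = gt_per_s * lanes            # one bit per transfer per lane
    return raw_gbit * (128 / 130) / 8      # remove encoding overhead, bits -> bytes


def link_efficiency(measured_gbs: float, theoretical_gbs: float) -> float:
    """Fraction of the theoretical link bandwidth that was actually measured."""
    return measured_gbs / theoretical_gbs


gen5_x16 = pcie_unidirectional_gbs(32.0, 16)
print(f"Gen5 x16 theoretical: {gen5_x16:.1f} GB/s")
print(f"56 GB/s measured: {link_efficiency(56.0, gen5_x16):.0%} of theoretical")
```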
Bidirectional bandwidth:

| Source | Setup | Same-NUMA | Cross-NUMA |
|---|---|---|---|
| purplepow3r | Dual Turin | ~111 GB/s | ~99 GB/s |
| orangezed | Dual EPYC 9374F | ~103 GB/s | ~64 GB/s |
| voipmonitor | ASUS ESC8000A-E13P + Broadcom switches (ACS off) | ~103 GB/s | ~95 GB/s |
From purplepow3r's dual Turin system (4x RTX 6000 Pro WS + 2x RTX 5090):
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3 4 5
0 1488.10 56.57 56.57 50.97 55.61 55.59
1 56.57 1416.77 56.57 51.01 55.60 55.64
2 56.57 56.57 1415.11 50.97 55.62 55.64
3 50.97 50.97 50.97 1375.60 50.16 50.45
4 55.61 55.60 55.62 50.16 1408.87 55.60
5 55.60 55.64 55.64 50.45 55.61 1489.91
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5
0 1485.22 111.38 111.39 99.42 111.02 111.10
1 111.38 1416.86 111.38 99.59 111.06 111.15
2 111.39 111.39 1415.11 99.53 111.04 111.12
3 99.42 99.59 99.53 1377.73 99.09 99.43
4 111.02 111.06 111.04 99.09 1409.15 111.13
5 111.10 111.15 111.12 99.43 111.13 1487.77
Note: Device 3 shows lower bandwidth to every peer (~51 GB/s unidirectional, ~99 GB/s bidirectional) -- this is attributed to the cross-NUMA path. Devices 0-2 are on NUMA0, devices 3-5 are on NUMA1.
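One way to spot a slow device in a matrix like the unidirectional one above is to average each row while skipping the diagonal (the diagonal is local memory bandwidth, not PCIe). The sketch below pastes the matrix rows as text; the helper names are made up for illustration.

```python
MATRIX_TEXT = """\
0 1488.10 56.57 56.57 50.97 55.61 55.59
1 56.57 1416.77 56.57 51.01 55.60 55.64
2 56.57 56.57 1415.11 50.97 55.62 55.64
3 50.97 50.97 50.97 1375.60 50.16 50.45
4 55.61 55.60 55.62 50.16 1408.87 55.60
5 55.60 55.64 55.64 50.45 55.61 1489.91
"""


def parse_matrix(text):
    """Parse p2pBandwidthLatencyTest-style rows: device index, then one value per peer."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        rows.append([float(x) for x in fields[1:]])  # drop the leading device index
    return rows


def mean_peer_bandwidth(matrix):
    """Average off-diagonal bandwidth per device (the diagonal is on-device copy speed)."""
    means = []
    for i, row in enumerate(matrix):
        peers = [v for j, v in enumerate(row) if j != i]
        means.append(sum(peers) / len(peers))
    return means


matrix = parse_matrix(MATRIX_TEXT)
means = mean_peer_bandwidth(matrix)
slowest = min(range(len(means)), key=means.__getitem__)
for dev, mean in enumerate(means):
    print(f"GPU {dev}: {mean:.2f} GB/s average to peers")
print(f"Slowest device: GPU {slowest}")
```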
luke's p2pmark tool provides a standardized comparison. 8-GPU results:
| System | PCIe Link Score | Dense Interconnect Score | Effective Latency |
|---|---|---|---|
| luke (3x switches, single CPU) | 0.86 (54.3 GB/s) | 0.44 (191.8 / 434.7 GB/s) | 6.79 us |
| Festr (dual Turin, direct-attach) | 0.84 (52.7 GB/s) | 0.41 (173.1 / 421.3 GB/s) | 6.03 us |
| Grimulkan (4x switches, single CPU) | 0.86 (53.9 GB/s) | 0.38 (164.3 / 431.2 GB/s) | 7.04 us |
| voipmonitor (ASUS ESC8000A-E13P, Broadcom, ACS off) | 0.85 (53.7 GB/s) | 0.12 (51.7 / 429.2 GB/s) | 7.39 us |
Note: voipmonitor's 8-GPU all-to-all score (0.12) is low because the dual-socket Infinity Fabric saturates under 56 concurrent flows. Same-NUMA 4-GPU performance is excellent (0.58, comparable to Festr). See detailed page for full analysis.
4-GPU results (single switch or same NUMA node):

| System | PCIe Link Score | Dense Interconnect Score | Effective Latency |
|---|---|---|---|
| luke (1 switch) | 0.86 | 0.64 (138.3 / 217.7 GB/s) | 4.10 us |
| Festr (Turin, same NUMA) | 0.88 | 0.59 (129.7 / 220.6 GB/s) | 2.28 us |
| voipmonitor (ASUS ESC8000A-E13P, Broadcom, ACS off) | 0.88 | 0.58 (127.6 / 221.0 GB/s) | 2.12 us |
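The scores in the tables above are consistent with simple ratios: the dense interconnect score matches measured aggregate bandwidth divided by the ideal aggregate shown next to it, and the PCIe link score matches measured link bandwidth divided by the 63 GB/s Gen5 x16 theoretical figure. This reading is inferred from the numbers in the tables and is not taken from p2pmark's own documentation, so treat the sketch as an assumption.

```python
PCIE_GEN5_X16_GBS = 63.0  # theoretical unidirectional figure quoted earlier on this page


def score(measured_gbs: float, ideal_gbs: float) -> float:
    """Ratio of achieved to ideal bandwidth (assumed definition, see lead-in)."""
    return measured_gbs / ideal_gbs


# (measured aggregate, ideal aggregate) in GB/s, copied from the 8-GPU table
dense_8gpu = {
    "luke": (191.8, 434.7),
    "Festr": (173.1, 421.3),
    "Grimulkan": (164.3, 431.2),
    "voipmonitor": (51.7, 429.2),
}

for name, (measured, ideal) in dense_8gpu.items():
    print(f"{name}: dense interconnect score {score(measured, ideal):.2f}")

print(f"luke link score: {score(54.3, PCIE_GEN5_X16_GBS):.2f}")
```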
| Condition | Cross-GPU Latency |
|---|---|
| P2P Enabled (same NUMA) | 0.36-0.45 us |
| P2P Enabled (cross-NUMA, Turin) | 0.44 us |
| P2P Disabled | ~14 us |
Enabling P2P cuts cross-GPU latency by roughly 30x (from ~14 us down to 0.36-0.45 us). This is why the nvidia_uvm uvm_disable_hmm=1 fix and proper NCCL P2P configuration are critical.
From purplepow3r's system:
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2 3 4 5
0 2.07 0.37 0.36 0.38 0.44 0.36
1 0.37 2.07 0.36 0.38 0.44 0.36
2 0.37 0.37 2.07 0.38 0.44 0.36
3 0.38 0.38 0.38 2.07 0.38 0.38
4 0.44 0.44 0.44 0.38 2.07 0.36
5 0.36 0.36 0.36 0.38 0.36 2.07
From luke's 8x RTX 6000 Pro on switches:
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2 3 4 5 6 7
0 2.07 0.45 0.51 0.45 0.45 0.45 0.45 0.44
1 0.52 2.07 0.44 0.45 0.52 0.45 0.45 0.44
...
AllReduce is the dominant collective operation in tensor-parallel inference. Higher bus bandwidth means faster decode.
| Setup | NCCL Config | Message Size | Bus BW (GB/s) |
|---|---|---|---|
| luke (3x switches) | MIN_NCHANNELS=8 | Sweep 8M-2G | 41.1 |
| Grimulkan (4x switches) | MIN_NCHANNELS=8 | Sweep 8M-2G | 39.4 |
| Grimulkan (4x switches) | Default | 32M fixed | 40.1 |
| Festr (dual Turin) | MIN_NCHANNELS=8 | Sweep 8M-2G | 37.6 |
| Festr (dual Turin) | Default | 32M fixed | 22.2 |
| purplepow3r (7 GPUs, dual Turin) | P2P_LEVEL=SYS | 4G fixed | 41.3-41.7 |
# Basic test (32M message, 8 GPUs)
NCCL_P2P_LEVEL=SYS NCCL_NET_GDR_LEVEL=SYS ./all_reduce_perf -b 32M -g 8 -c 0
# Sweep test with tuned channels
NCCL_NET_GDR_LEVEL=SYS NCCL_MIN_NCHANNELS=8 ./all_reduce_perf -b 8M -e 2G -f 2 -g 8 -n 50
# Large message test
NCCL_P2P_LEVEL=SYS NCCL_IB_DISABLE=1 ./build/all_reduce_perf -b 4G -e 4G -f 2 -g 7 -n 20 -N 100

NCCL tests are located at /usr/src/nccl-tests in NVIDIA containers.
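When reading all_reduce_perf output, keep in mind that nccl-tests reports two numbers: algorithm bandwidth (message size divided by time) and bus bandwidth, which for AllReduce is the algorithm bandwidth scaled by 2(n-1)/n. The sketch below converts between them; the function names are illustrative, and the example values are taken from the table above.

```python
def allreduce_bus_factor(n_ranks: int) -> float:
    """Scaling factor nccl-tests applies to AllReduce: 2 * (n - 1) / n."""
    return 2 * (n_ranks - 1) / n_ranks


def busbw_from_time(size_bytes: float, seconds: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s from a measured AllReduce time."""
    algbw = size_bytes / seconds / 1e9
    return algbw * allreduce_bus_factor(n_ranks)


def algbw_from_busbw(busbw_gbs: float, n_ranks: int) -> float:
    """Invert the scaling to recover algorithm bandwidth in GB/s."""
    return busbw_gbs / allreduce_bus_factor(n_ranks)


# luke's 8-GPU sweep: 41.1 GB/s bus bandwidth
print(f"8 GPUs, 41.1 GB/s busbw -> {algbw_from_busbw(41.1, 8):.1f} GB/s algbw")
# purplepow3r's 7-GPU run: 41.3 GB/s bus bandwidth
print(f"7 GPUs, 41.3 GB/s busbw -> {algbw_from_busbw(41.3, 7):.1f} GB/s algbw")
```

Because the factor depends on the GPU count, bus bandwidth is the number to compare across the 7-GPU and 8-GPU rows.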
luke's custom allreduce kernel (for PCIe switch topologies) compared to NCCL:
| Size | Custom (us) | NCCL (us) | Winner |
|---|---|---|---|
| 256 B | 7.5 | 24.6 | Custom 3.3x |
| 1 KB | 7.5 | 24.1 | Custom 3.2x |
| 8 KB | 9.2 | 24.2 | Custom 2.6x |
| 32 KB | 16.5 | 24.5 | Custom 1.5x |
| 64 KB | 25.9 | 24.1 | NCCL 1.1x |
| 256 KB | 73.6 | 28.0 | NCCL 2.6x |
Custom allreduce wins big for inference-relevant message sizes (<32 KB) but loses at larger sizes. This kernel is only effective on densely-interconnected PCIe switch topologies -- it was actually slower than NCCL on dual-CPU systems without switches.
Custom allreduce repo: github.com/lukealonso/sglang/commits/custom_ar/
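The "Winner" column in the table above is just the ratio of the two latencies. A small sketch that recomputes it and finds the largest message size at which the custom kernel still wins (the latencies are copied from the table; names are illustrative):

```python
# (message size, custom allreduce latency in us, NCCL latency in us)
RESULTS = [
    ("256 B", 7.5, 24.6),
    ("1 KB", 7.5, 24.1),
    ("8 KB", 9.2, 24.2),
    ("32 KB", 16.5, 24.5),
    ("64 KB", 25.9, 24.1),
    ("256 KB", 73.6, 28.0),
]


def compare(custom_us: float, nccl_us: float):
    """Return (winner, speedup), where speedup is slower time / faster time."""
    if custom_us < nccl_us:
        return "Custom", nccl_us / custom_us
    return "NCCL", custom_us / nccl_us


last_custom_win = None
for size, custom_us, nccl_us in RESULTS:
    winner, speedup = compare(custom_us, nccl_us)
    if winner == "Custom":
        last_custom_win = size
    print(f"{size:>7}: {winner} {speedup:.1f}x")

print(f"Largest size where the custom kernel wins: {last_custom_win}")
```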
BAR1 (Base Address Register 1) maps GPU memory into the PCIe address space so that the CPU and peer GPUs can access it directly. P2P transfers go through this aperture.
- Resizable BAR: Must be enabled in BIOS (default on most boards)
- Expected size: 96 GB (matching VRAM) per GPU
- Common issue: Some BIOS configurations default to 256 MB BAR1, which cripples P2P performance
- GPU Display Mode: Set to headless via NVIDIA Display Mode Selector for largest BAR1 allocation. Required for 16-GPU setups.
Verify BAR1 in nvidia-smi:
$ nvidia-smi -q | grep -A 1 "BAR1"
BAR1 Memory Usage
Total : 98304 MiB
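On multi-GPU systems it is easy to miss a single GPU with a small BAR1. The sketch below parses text in the shape shown above and flags any GPU below an expected size. The sample text, the threshold default, and the function names are assumptions for illustration; in practice you would feed it the captured output of nvidia-smi -q.

```python
import re

# Matches the "BAR1 Memory Usage" header followed by its "Total : N MiB" line.
BAR1_TOTAL_RE = re.compile(r"BAR1 Memory Usage\s*\n\s*Total\s*:\s*(\d+)\s*MiB")


def bar1_totals_mib(nvidia_smi_q_output: str):
    """Return the BAR1 total (MiB) of each GPU, in the order they appear."""
    return [int(m) for m in BAR1_TOTAL_RE.findall(nvidia_smi_q_output)]


def undersized_gpus(totals_mib, min_mib=98304):
    """Indices of GPUs whose BAR1 is smaller than min_mib (96 GiB by default)."""
    return [i for i, total in enumerate(totals_mib) if total < min_mib]


# Hypothetical two-GPU sample: the second GPU has the 256 MiB BAR1 problem.
SAMPLE = """\
    BAR1 Memory Usage
        Total                             : 98304 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
"""

totals = bar1_totals_mib(SAMPLE)
print("BAR1 totals (MiB):", totals)
print("GPUs with undersized BAR1:", undersized_gpus(totals))
```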
During decode (token generation), each TP step requires:
- AllReduce after attention layer (~32-256 KB per layer)
- AllReduce after MoE/FFN layer (~32-256 KB per layer)
For a 397B MoE model with ~60 layers, that is ~120 AllReduce operations per token (two per layer). At 25 us per AllReduce, that is 3 ms of pure communication overhead per token, which caps single-stream decode at ~333 tok/s even with zero compute time.
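The arithmetic behind this ceiling is simple enough to parameterize, which makes it easy to see how the limit moves with AllReduce latency. A minimal sketch using the figures from the paragraph above (function names are illustrative):

```python
def comm_time_per_token_ms(layers: int, allreduces_per_layer: int, allreduce_us: float) -> float:
    """Total AllReduce time per generated token, in milliseconds."""
    return layers * allreduces_per_layer * allreduce_us / 1000


def decode_ceiling_tok_s(layers: int, allreduces_per_layer: int, allreduce_us: float) -> float:
    """Upper bound on single-stream decode speed if compute took zero time."""
    return 1000 / comm_time_per_token_ms(layers, allreduces_per_layer, allreduce_us)


# ~60 layers, one AllReduce after attention and one after MoE/FFN, 25 us each
print(f"{comm_time_per_token_ms(60, 2, 25):.1f} ms of communication per token")
print(f"ceiling: {decode_ceiling_tok_s(60, 2, 25):.0f} tok/s")
```

Halving the per-AllReduce latency doubles the ceiling, which is why the latency measurements above matter more for decode than the bandwidth ones.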
Inference AllReduce messages are small (32-256 KB). At these sizes:
- Bandwidth is almost irrelevant (at 56 GB/s, 256 KB transfers in about 4.6 us)
- Latency dominates (NCCL ring setup, protocol negotiation, synchronization)
This is why:
- NCCL's LL (Low Latency) protocol gives 1.5-1.9x speedup
- Custom allreduce kernels that bypass NCCL overhead give 2-3x speedup
- PCIe switch cut-through latency (~100 ns) matters more than raw bandwidth
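To put numbers on "latency dominates", compare the pure wire time of a message with the measured NCCL latency at the same size. The sketch assumes decimal units (1 KB = 1000 bytes) and uses the 28.0 us NCCL figure for 256 KB from the custom-allreduce comparison table. It models a single point-to-point transfer, not the full traffic pattern of an allreduce, so it is only a rough bound.

```python
def wire_time_us(size_bytes: float, link_gbs: float) -> float:
    """Time to push size_bytes over a link of link_gbs GB/s, in microseconds."""
    return size_bytes / (link_gbs * 1e9) * 1e6


LINK_GBS = 56.0              # measured unidirectional P2P bandwidth
NCCL_256KB_US = 28.0         # measured NCCL allreduce latency at 256 KB

transfer_us = wire_time_us(256_000, LINK_GBS)
fraction = transfer_us / NCCL_256KB_US
print(f"256 KB wire time: {transfer_us:.2f} us")
print(f"share of measured NCCL latency: {fraction:.0%}")
```

Even at the largest inference-relevant message size, the transfer itself accounts for well under a quarter of the measured time in this model; the rest is fixed overhead.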
For high-concurrency workloads (many simultaneous requests), disabling P2P and routing through DRAM can actually be faster:
| Workload | P2P Enabled | P2P Disabled | Winner |
|---|---|---|---|
| Single batch (low latency) | 90 tok/s | 70 tok/s | P2P |
| 100 concurrent requests | 5000 tok/s | 10000 tok/s | No-P2P |
This is because DRAM routing can use more channels and higher aggregate bandwidth for large batched operations.
Add to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
pcie_aspm=off pcie_port_pm=off

| Parameter | Purpose |
|---|---|
| `pcie_aspm=off` | Disables Active State Power Management on all PCIe links |
| `pcie_port_pm=off` | Disables PCIe port runtime power management. CRITICAL: prevents root port from suspending during GPU link retrain Gen1<->Gen5 |
Without pcie_port_pm=off, GPU DynamicPowerManagement=3 causes "Surprise Link Down" errors (aer_uncor_status: 0x00000020) leading to system lockups.
# Full recommended GRUB line (Festr's Turin system):
GRUB_CMDLINE_LINUX="rd.auto=1 rd.md=1 rd.md.conf=1 mitigations=off spectre_v2=off spec_store_bypass_disable=off l1tf=off mds=off tsx_async_abort=off srbds=off mmio_stale_data=off retbleed=off amd_iommu=off iommu=off"

| Parameter | Purpose |
|---|---|
| `iommu=off` | Prevents NCCL hangs. Without this, NCCL P2P may deadlock. |
| `amd_iommu=off` | AMD-specific IOMMU disable |
| `mitigations=off` | Disable CPU security mitigations for maximum performance |
| `nvme_core.default_ps_max_latency_us=0` | Prevent NVMe power state transitions (stability) |
After editing, run:
update-grub
reboot
# Verify:
cat /proc/cmdline

# /etc/modprobe.d/uvm.conf
options nvidia_uvm uvm_disable_hmm=1

Without this, NCCL P2P operations lock up. This is required on virtually all RTX PRO 6000 multi-GPU setups.
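Eyeballing /proc/cmdline after a reboot is error-prone when the line is long. The sketch below checks a kernel command line for a set of expected parameters. The REQUIRED tuple is an illustrative selection from the tables above, not an authoritative list; adjust it for your platform (for example, amd_iommu=off only applies to AMD systems).

```python
# Illustrative selection of parameters from this page; edit to match your system.
REQUIRED = ("pcie_aspm=off", "pcie_port_pm=off", "iommu=off", "amd_iommu=off")


def missing_kernel_params(cmdline: str, required=REQUIRED):
    """Return the required parameters that do not appear as whole tokens in cmdline."""
    present = set(cmdline.split())
    return [param for param in required if param not in present]


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read()


# Hypothetical command line with two of the four parameters present.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 pcie_aspm=off iommu=off"
print("missing:", missing_kernel_params(example))
```

Matching whole tokens rather than substrings matters here: a plain substring search for iommu=off would also match amd_iommu=off and hide a missing parameter. On a live system, call `missing_kernel_params(read_cmdline())`.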
- Resizable BAR: Enabled (default)
- Above 4G Decoding: Enabled
- SR-IOV: Enabled
- Slot 6 Warning: On WRX90E-SAGE SE, slot 6 is limited to Gen5 x8 speed
ZFS caused system freezes for some users on these GPU workloads. EXT4 is recommended for the OS filesystem.
| Tool | Purpose |
|---|---|
| `nvidia-smi topo -m` | Display GPU topology matrix |
| `nvidia-smi -q` | Detailed GPU info including BAR1, power, thermals |
| p2pmark | P2P bandwidth, latency, and allreduce benchmarks |
| `p2pBandwidthLatencyTest` | CUDA samples P2P test |
| amd-epyc-gpu-fabric-monitor | Real-time AMD xGMI fabric transfer monitoring |
| `memtest86` | RAM testing |
| Intel MLC | Memory Latency Checker (works on AMD) |
| `rasdaemon` | AER error monitoring |
| `nvitop` / `nvtop` | GPU monitoring |
| PCIe AER counters | Check aer_dev_correctable and aer_dev_fatal in sysfs |
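The AER counter files in sysfs are plain text with one "NAME count" pair per line, so they are easy to scan across all PCI devices. The sketch below reports only non-zero counters. It assumes the usual /sys/bus/pci/devices layout; aer_dev_nonfatal is included in addition to the two files named above, and the function names are illustrative.

```python
from pathlib import Path

AER_FILES = ("aer_dev_correctable", "aer_dev_nonfatal", "aer_dev_fatal")


def parse_aer_stats(text: str) -> dict:
    """Parse 'NAME count' lines from an aer_dev_* file into a dict."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts


def devices_with_errors(sysfs_root: str = "/sys/bus/pci/devices") -> dict:
    """Map PCI address -> {file name -> non-zero counters} for devices reporting errors."""
    report = {}
    for dev in sorted(Path(sysfs_root).glob("*")):
        for name in AER_FILES:
            stats_file = dev / name
            if not stats_file.is_file():
                continue  # device does not expose AER statistics
            nonzero = {k: v for k, v in parse_aer_stats(stats_file.read_text()).items() if v}
            if nonzero:
                report.setdefault(dev.name, {})[name] = nonzero
    return report


# Hypothetical file contents showing three BadTLP errors.
sample = "RxErr 0\nBadTLP 3\nBadDLLP 0\nTOTAL_ERR_COR 3\n"
print(parse_aer_stats(sample))
```

On a live system, `devices_with_errors()` returns an empty dict when all counters are zero.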