GPU Configurations & Rig Builds
Practical hardware configurations for RTX PRO 6000 Blackwell (96 GB GDDR7) inference rigs, including model sizing, power management, and specific community builds.
VRAM Requirements by GPU Count
GPUs
NVFP4 Models That Fit
FP8 Models That Fit
1x 96GB
Qwen3.5-27B, smaller models
--
2x 96GB
MiniMax-M2.5 NVFP4, Qwen3.5-122B NVFP4
--
4x 96GB
Qwen3.5-397B NVFP4, GLM-4.7 NVFP4, MiniMax-M2.5 FP8
MiniMax-M2.5 FP8
6x 96GB
GLM-5 NVFP4 (TP2 PP3)
--
8x 96GB
All current models incl. Kimi K2.5, GLM-5
GLM-4.7 FP8, Qwen3.5-397B FP8
16x 96GB
All models with massive KV cache headroom
GLM-5 FP8
Per-GPU VRAM Breakdown Examples
Model
Quant
GPUs
Weights/GPU
KV Cache/GPU
Total/GPU
Qwen3.5-397B
NVFP4
4x
~82 GB
varies
~82 GB at max ctx
MiniMax-M2.5
FP8
4x
~90 GB
varies
~90 GB
GLM-5 (744B)
NVFP4
8x
57.06 GB
29.32 GB (bf16)
~86.38 GB
Kimi K2.5 (530B)
INT4
8x
~60 GB
varies
depends on KV dtype
Tensor Parallel Constraints
TP must be power-of-2 for vLLM/SGLang: 2, 4, 8, 16
For odd GPU counts, use Pipeline Parallel: e.g., TP2 PP3 = 6 GPUs
ExLlama v2/v3 supports arbitrary GPU counts
16 cards: only slightly slower single-batch than 8 cards due to comms overhead, but ~6x more KV cache tokens
The most common setup for running large MoE models at NVFP4 quantization.
Model
Quant
Status
Notes
Qwen3.5-397B-A17B
NVFP4
Works well
~82 GB/GPU, TP=4
MiniMax-M2.5
FP8
Works well
~90 GB/GPU, TP=4
MiniMax-M2.5
NVFP4
Works well
Fits on 2x, TP=2
GLM-4.7
NVFP4
Works well
TP=4
GLM-4.7
FP8
Works
Tight on VRAM
Qwen3.5-122B-A10B
NVFP4
Works
Fits on 2x
What Does NOT Fit on 4x 96GB
Model
Quant
Why
Kimi K2.5 (530B)
INT4 (native)
Already INT4 quantized, still needs 8 cards
GLM-5 (744B)
NVFP4
~440 GB weights alone exceeds 384 GB
GLM-5 (744B)
FP8
Far too large
Qwen3.5-397B
FP8
Needs 8x
Required for the largest models and for FP8 precision on 397B-class models.
What Fits on 8x 96GB (768 GB total)
Model
Quant
KV Cache Dtype
KV Cache Capacity
Notes
Kimi K2.5
INT4 (native)
BF16
~190K tokens
Fastest decode (90 tok/s)
Kimi K2.5
INT4 (native)
FP8
~450K tokens
Slightly slower, much more context
Kimi K2.5
INT4 + DCP=8
FP8
~3.6M tokens
Best for long context
GLM-5
NVFP4 + MTP
BF16
~314K tokens
FP8 KV broken on SM120
Qwen3.5-397B
FP8
auto
large
TP=8
Qwen3.5-397B
NVFP4
FP8
large
TP=4 + DP=2 possible
2 nodes x 4x RTX 5090 connected via 2x 10GbE running Qwen3-235B-A22B-NVFP4 via vLLM Ray
Occasional instability every 2-3 days requiring restart
Required building Ray from source with PR #58866 patch
For maximum KV cache capacity and running the largest models in higher precision.
Grimulkan runs 16x RTX 6000 Pro on 4x PCIe switches on a single CPU
Only slightly slower single-batch than 8 cards due to communication overhead
~6x more KV cache tokens than 8 cards (no weight replication overhead in TP)
GPU display mode must be set to headless for 16 GPUs (BAR1 allocation)
RTX PRO 6000 Blackwell Variants
Variant
TDP
Form Factor
Notes
Workstation Edition
600W
Dual-slot, blower
Standard desktop/workstation card
Server Edition
600W
Passive cooling
Requires chassis airflow
MaxQ
300W
Compact
Designed for dense packing
MaxQ vs Workstation Edition Performance
Metric
MaxQ (300W)
Workstation (600W)
Delta
Prefill speed
Baseline
~20% faster
Compute-bound
Decode speed (single user)
~96% of 600W
Baseline
VRAM/PCIe limited
Decode speed (64 concurrent)
~70% of 600W
Baseline
Significant loss
Power Limit Scaling (MiniMax-M2.5 NVFP4, 4x cards)
Power Limit
4 Concurrent
16 Concurrent
32 Concurrent
64 Concurrent
300W
Baseline
Baseline
Baseline
1206 tok/s
400W
~+2%
~+10%
~+18%
~+22%
500W
~+3%
~+16%
~+25%
1558 tok/s (+29%)
600W
~+4%
~+17%
~+26%
~+30%
Key findings:
Power-limiting 600W to 300W loses only ~4% single-user performance
Loss increases to ~30% at 64 concurrent users
500W performs nearly identically to 600W
Performance drop from 400W to 300W is significant at high concurrency
Detailed wattage-performance analysis: shihanqu.github.io/Blackwell-Wattage-Performance
Memory Overclocking (MaxQ Only)
MaxQ cards accept memory clock offset. Server Edition cards return an error.
import pynvml
pynvml .nvmlInit ()
count = pynvml .nvmlDeviceGetCount ()
for i in range (count ):
h = pynvml .nvmlDeviceGetHandleByIndex (i )
pynvml .nvmlDeviceSetMemClkVfOffset (h , 4000 )
pynvml .nvmlShutdown ()
"fwiw - overclocking my GDDR7 got me a few percentage points, it's what got me over the hill to 101." -- luke
Community Rig Builds
Festr -- Dual Turin Server (8x Server Edition)
Component
Spec
CPUs
2x AMD EPYC 9575F 64-Core (5 GHz boost)
Motherboard
K15PG-D24 Series, 60SB0D94-SB0A01
RAM
24x96 GB Samsung DDR5-6400 (2.2 TB total, all 12 channels/CPU)
GPUs
8x RTX PRO 6000 Blackwell Server Edition
Topology
Direct-attach, 4 GPUs per NUMA node
xGMI
3x links (192 GB/s fabric)
OS
Ubuntu 24.04.3 LTS, Kernel 6.17.0-14-generic
CUDA
13.1.1, Driver 590.48.01
luke -- PCIe Switch Setup (8x MaxQ)
Component
Spec
CPU
AMD Threadripper Pro (single socket)
Motherboard
ASUS WRX90E (7 PCIe slots)
Switches
3x c-payne Microchip Switchtec PM50100 (100-lane Gen5)
GPUs
8x RTX 6000 Pro MaxQ
Topology
2 leaf switches (partitioned, 2x x16 uplinks each), 1 root switch
Case
Open-air mining chassis
Special
Overclocked GDDR7 memory (+4000 offset)
Grimulkan -- 16-GPU Switch Setup
Component
Spec
CPU
AMD Turin (single socket)
Switches
4x c-payne PCIe Gen5 switches (star topology)
GPUs
16x RTX PRO 6000 Blackwell
Notes
Highest total GPU count in community
orangezed -- Budget Dual EPYC (8x MaxQ)
Component
Spec
CPUs
2x AMD EPYC 9374F 32-Core (128 threads)
Motherboard
ASRockRack TURIN2D24G-2L+/500W
RAM
10x48 GB DDR5-4800 (472 GB, only 5 channels/CPU)
GPUs
8x RTX PRO 6000 Blackwell MaxQ Workstation Edition
xGMI
2x links only
Notes
Lower cross-NUMA bandwidth (~64 GB/s bidir vs ~99+ on Turin)
Ixtrix -- Desktop 4-GPU Build
Component
Spec
Motherboard
ASUS PRO WS WRX90E-SAGE SE
GPUs
4x RTX 6000 Pro MaxQ
Virtualization
Proxmox
Notes
Provided critical BIOS/GRUB stability settings
Qu (shihanqu) -- 4-GPU Workstation
Component
Spec
GPUs
4x RTX 6000 Pro Workstation (600W)
Notes
Wattage-performance benchmark creator
Power Consumption & Electrical
Per-Card Power by Workload
Phase
Power per Card
Notes
Idle
~30-50W
With VLLM_SLEEP_WHEN_IDLE=1
Decode
~300W
Memory/PCIe bound
Prefill
400-600W
Compute bound
Peak (GLM-5 prefill)
640W observed
All 8 cards simultaneously
Circuit: 220V 30A recommended for 8-GPU rigs
Power distribution: Boards with 14x 12VHPWR connections (1000A @ 12V)
8x 600W cards: 4800W GPU power alone, plus CPU/RAM/fans/overhead = ~5500-6000W total
8x 300W cards (MaxQ): 2400W GPU power, ~3000-3500W total
Power-limit 600W cards to 500W: nearly identical performance, saves 800W across 8 cards
Power-limit to 300W: only ~4% single-user loss, saves 2400W across 8 cards
Use nvidia-smi -pl 500 or pynvml to set power limits
Threshold
Impact
< 80C
Optimal for longevity
80-85C
Acceptable for sustained operation
88C
PCIe slot temps can cause system instability
95C
Thermal throttle point (per PNY)
Noctua IndustrialPPC fans: Standard choice for air cooling, 6x keeps cards under 90C
Watercooling: For single-slot density configurations
uCoustic 24U active cooling cabinet: Model 9210i for silent rackmount cooling
Despite PNY's claim that fan control requires a GUI/X server, headless fan control is possible:
pynvml library (no GUI required):
import pynvml
# Fan control via pynvml
pynvml .nvmlInit ()
handle = pynvml .nvmlDeviceGetHandleByIndex (0 )
# Set fan speed as needed
LACT tool (fan curves via config file):
# /etc/lact/config.yaml
profiles :
Max :
gpus :
10DE:2BB1-10DE:204B-0000:01:00.0 :
fan_control_enabled : true
fan_control_settings :
mode : curve
static_speed : 0.5
temperature_key : gpu
interval_ms : 1000
curve :
70 : 0.6
75 : 0.7
80 : 0.75
82 : 0.80
85 : 1.0
spindown_delay_ms : 10000
change_threshold : 3
auto_threshold : 70
power_cap : 600.0
Supermicro AS-4124GS-TNR: 8-GPU rackmount server
ASUS ESC8000A-E13P: 8-GPU rackmount (reported working with GLM-5)
Desktop / Open-Air Options
Chinese 8/10/12/16 GPU cases from Alibaba:
Open-air mining chassis: Used by luke for 8x MaxQ with switches
Custom aluminum rigs: Built by purplepow3r for mixed GPU setups
Motherboard
Slots
CPU
Notes
ASUS PRO WS WRX90E-SAGE SE
7x PCIe
Threadripper
Slot 6 limited to Gen5 x8
Supermicro AS-4124GS-TNR
8x PCIe
Dual EPYC
Rackmount
ASRockRack TURIN2D24G-2L+
8x PCIe
Dual Turin EPYC
Budget option, only 2x xGMI
K15PG-D24 Series
8x PCIe
Dual Turin EPYC
Festr's board, 3x xGMI
CPU
Sockets
PCIe Lanes
xGMI BW
Best For
AMD EPYC 9575F Turin
Dual
128/socket
192-256 GB/s
Best dual-CPU direct-attach
AMD EPYC 9374F Genoa
Dual
128/socket
~96 GB/s
Budget dual-CPU
AMD Threadripper 9985WX
Single
128
N/A
Single-CPU + switches
AMD Threadripper 5975WX
Single
128
N/A
Budget single-CPU
Turin: DDR5-6400, populate all 12 channels per CPU
Genoa: DDR5-4800, populate all 12 channels per CPU
Minimum: 256 GB for 4-GPU rigs, 512 GB+ for 8-GPU rigs
Festr's recommendation: 2+ TB for KV cache offload experiments
Warning: Under-populating DRAM channels causes severe cross-NUMA performance degradation