NVFP4 is NVIDIA's native 4-bit floating-point format for Blackwell GPUs (SM120). It enables running 400B+ parameter models on just 4x RTX PRO 6000 Blackwell cards with minimal quality loss.
- What is NVFP4
- Available NVFP4 Models
- Calibration and Quantization Process
- Performance: NVFP4 vs FP8
- KV Cache Considerations
- CARVE: Unlocked/Abliterated Models
- SM120f Compilation and GEMM Backends
- Known Issues
## What is NVFP4

NVFP4 (NVIDIA FP4) is a 4-bit floating-point format native to the Blackwell architecture (SM120). It uses the E2M1 (2-bit exponent, 1-bit mantissa) format with blockwise quantization and FP8 scaling factors.
Key properties:
- 4 bits per weight element (vs 8 bits for FP8, 16 bits for BF16)
- Blockwise quantization with calibrated FP8 scales
- Native hardware support on SM120 via the `cvt.rn.satfinite.e2m1x2.f32` PTX instruction
- Requires SM120f family-conditional compilation for optimal performance
- Typically quantized using NVIDIA ModelOpt toolkit
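Because E2M1 has only a 2-bit exponent and 1-bit mantissa, each 4-bit code encodes a sign plus one of just eight magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, 6). A minimal sketch of round-to-nearest E2M1 conversion (illustrative only, not the exact PTX semantics):

```python
# The eight non-negative magnitudes representable in E2M1 (sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_e2m1(x: float) -> float:
    """Round x to the nearest representable E2M1 value, saturating at +/-6."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), 6.0)  # "satfinite": clamp to the largest finite value
    return sign * min(E2M1_VALUES, key=lambda v: abs(v - mag))

print(quantize_e2m1(2.4))    # -> 2.0
print(quantize_e2m1(-0.6))   # -> -0.5
print(quantize_e2m1(100.0))  # -> 6.0 (saturated)
```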
Why it matters: A 397B-parameter MoE model that requires 8 GPUs at FP8 fits on just 4 GPUs at NVFP4, halving hardware cost while maintaining ~99% of FP8 quality.
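The arithmetic behind that claim, as a quick sketch. This counts weights only (96 GB per RTX PRO 6000 Blackwell is an assumption); real serving also needs headroom for KV cache and activations, which is why the practical counts are 8 and 4 GPUs:

```python
def weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for params_b billion parameters."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

GPU_GB = 96  # assumed VRAM per RTX PRO 6000 Blackwell
for name, bits in [("BF16", 16), ("FP8", 8), ("NVFP4", 4.5)]:
    # NVFP4 ~4.5 bits/weight: 4-bit codes plus one FP8 scale per 16-element block.
    gb = weight_gb(397, bits)
    print(f"{name:6s} {gb:6.1f} GB weights -> needs >= {gb / GPU_GB:.1f} GPUs")
```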
## Available NVFP4 Models

Qwen3.5-397B-A17B checkpoints:

| Checkpoint | Source | KV Cache Scales | Notes |
|---|---|---|---|
| `nvidia/Qwen3.5-397B-A17B-NVFP4` | NVIDIA ModelOpt | Yes (FP8 calibrated) | Official, best quality |
| `vincentzed-hf/Qwen3.5-397B-A17B-NVFP4` | Community | No | Early quant |
| `Sehyo/Qwen3.5-397B-A17B-NVFP4` | llm-compressor | No (defaults to bf16 KV) | 2x KV cache memory |
| `vpyn/Qwen3.5-397B-A17B-CARVE-v1-NVFP4` | Abliterated | Yes | Uncensored, better >300K context |
Smaller Qwen3.5 checkpoints:

| Checkpoint | Notes |
|---|---|
| `Sehyo/Qwen3.5-27B-NVFP4` | Multimodal + MTP support |
| `Sehyo/Qwen3.5-35B-A3B-NVFP4` | Multimodal + MTP support |
| `Sehyo/Qwen3.5-122B-A10B-NVFP4` | Multimodal + MTP support |
Other models:

| Checkpoint | Notes |
|---|---|
| `nvidia/Kimi-K2.5-NVFP4` | Slower than native INT4 Marlin on this model |
| `lukealonso/GLM-5-NVFP4` | No MTP weights |
| `festr2/GLM-5-NVFP4-MTP` | With MTP layer 78 in BF16 (~19 GB) |
The key difference is KV cache calibration:
| Property | NVIDIA (ModelOpt) | Sehyo (llm-compressor) |
|---|---|---|
| KV cache scheme | `{num_bits: 8, type: float, dynamic: false}` | `null` |
| KV cache scales | Calibrated k_scale/v_scale tensors | None (defaults to scale=1.0) |
| Runtime KV dtype | FP8 with proper calibration | BF16 (2x memory) or uncalibrated FP8 |
| VRAM for KV cache | 1x | 2x (if bf16) |
Recommendation: Use NVIDIA ModelOpt checkpoints when available for the best VRAM efficiency and quality.
It's possible to replace quality-sensitive layers with full-precision BF16 from the original model. No SGLang patches required — layer exclusion is handled entirely through config.json ignore patterns.
What to keep in BF16:
- Shared expert (all 60 layers): Runs on every token, outsized quality impact. +1 GB (+0.4%).
- Layer 0 routed experts (512 experts): First layer, sets representations for all subsequent layers. +3 GB (+1.3%).
The key is adding glob patterns to config.json → quantization_config → ignore:
"ignore": ["*.mlp.shared_expert.*", "*.layers.0.mlp.experts*", "..."]SGLang's is_layer_excluded() converts these to regex and returns UnquantizedLinearMethod / UnquantizedFusedMoEMethod for matching layers, which allocate BF16 buffers and load weights correctly.
See Hybrid NVFP4 Assembly Guide for the assembly script, config.json setup, and full instructions.
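A minimal sketch of the exclusion logic, using Python's `fnmatch` as a stand-in for SGLang's glob-to-regex conversion (the layer names here are hypothetical):

```python
import fnmatch

# Glob patterns as they would appear in quantization_config.ignore
IGNORE = ["*.mlp.shared_expert.*", "*.layers.0.mlp.experts*"]

def is_layer_excluded(prefix: str, ignore=IGNORE) -> bool:
    """Return True if the layer name matches any ignore glob (stays BF16)."""
    return any(fnmatch.fnmatch(prefix, pat) for pat in ignore)

print(is_layer_excluded("model.layers.0.mlp.experts.7.gate_proj"))    # True
print(is_layer_excluded("model.layers.12.mlp.shared_expert.up_proj")) # True
print(is_layer_excluded("model.layers.12.mlp.experts.7.gate_proj"))   # False
```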
## Calibration and Quantization Process

NVIDIA's official NVFP4 checkpoints are produced using the ModelOpt toolkit, which performs:
- Weight quantization: BF16 weights -> NVFP4 (E2M1) with blockwise FP8 scales
- KV cache calibration: Runs calibration data through the model to compute per-layer k_scale and v_scale tensors for FP8 KV cache
- Activation calibration: Computes scaling factors for activations
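A toy sketch of the weight-quantization step — blockwise E2M1 with one scale per block. The block size and amax/6 scaling here are simplifying assumptions; ModelOpt's actual algorithm and FP8 scale storage differ in detail:

```python
import numpy as np

# Non-negative magnitudes representable in E2M1
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block: np.ndarray):
    """Quantize one block of weights to E2M1 codes plus a single scale."""
    amax = float(np.abs(block).max())
    scale = amax / 6.0 if amax > 0 else 1.0  # map block amax onto E2M1's max (6)
    scaled = block / scale
    idx = np.abs(np.abs(scaled)[:, None] - E2M1).argmin(axis=1)  # nearest code
    deq = np.sign(scaled) * E2M1[idx] * scale  # values the GPU reconstructs
    return deq, scale

x = np.array([0.1, -0.4, 2.0, 3.1])
deq, scale = quantize_block(x)
print(deq, scale)  # the largest-magnitude element is reconstructed exactly
```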
The CARVE (abliterated) model was created by:
- Starting from BF16 weights
- Applying abliteration (removing refusal behavior)
- Quantizing back to NVFP4 using ModelOpt
This preserves KV cache calibration quality while removing censorship.
SGLang:

```bash
--quantization modelopt_fp4
--kv-cache-dtype fp8_e4m3  # Only with NVIDIA checkpoint
```

vLLM:

```bash
--quantization modelopt  # or leave unset for auto-detection
--kv-cache-dtype fp8     # Only with NVIDIA checkpoint
```

## Performance: NVFP4 vs FP8

| Model | Quant | GPUs | Decode tok/s | Notes |
|---|---|---|---|---|
| Qwen3.5-397B | NVFP4 | 4x | 70-86 | vLLM, no MTP |
| Qwen3.5-397B | NVFP4 + MTP=2 | 4x | 130 | vLLM |
| Qwen3.5-397B | FP8 | 8x | 75-125 | SGLang |
| GLM-4.7 | FP8 | 4x | 90-120 | Fastest |
| GLM-4.7 | NVFP4 | 4x | 60-90 | 20-30 tok/s slower than FP8 |
| Kimi K2.5 | INT4 (native) | 8x | 90 | Faster than NVFP4 variant |
| Kimi K2.5 | NVFP4 | 8x | 53-55 | Slower due to PTQ overhead |
General rule: NVFP4 is consistently 20-30 tok/s slower than FP8 for the same model, due to slower fused MoE kernels and GEMM operations. The trade-off is halved GPU count.
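Per-GPU throughput makes the trade-off concrete; a quick check using midpoints of the ranges in the table above:

```python
# (configuration, GPU count, decode tok/s midpoint) from the table above
rows = [
    ("Qwen3.5-397B NVFP4 4x", 4, (70 + 86) / 2),
    ("Qwen3.5-397B FP8   8x", 8, (75 + 125) / 2),
]
for name, gpus, toks in rows:
    print(f"{name}: {toks / gpus:.1f} tok/s per GPU")
# NVFP4 delivers more tok/s per GPU despite lower absolute decode speed.
```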
| Model | NVFP4 | FP8 | Delta |
|---|---|---|---|
| Qwen3.5-397B (MMLU-Pro) | 90.0% | ~90% | Within noise |
| GLM-5 (MMLU) | 0.873 | 0.877 (official BF16) | -0.004 |
| MiniMax-M2.5 (MMLU-Pro) | Higher than FP8 | Baseline | NVFP4 +0.4% |
"nvfp4 has 1% degradation and sometimes its even better than fp8 so the differences are really within noise probability" -- Festr
Where NVFP4 wins:

- MiniMax-M2.5: NVFP4 outperformed official FP8 by 0.4% on MMLU-Pro
- NVFP4 on 2 GPUs competes well with FP8 on 4 GPUs for throughput
- Half the hardware cost for similar quality
- Higher decode throughput when you have enough GPUs

Where NVFP4 loses:

- Kimi K2.5: native INT4 Marlin is faster than NVFP4 PTQ
- GLM-4.7: FP8 is noticeably faster in decode
## KV Cache Considerations

NVIDIA ModelOpt checkpoints include calibrated `k_scale` and `v_scale` tensors, enabling FP8 KV cache with proper quality:
```bash
# SGLang
--kv-cache-dtype fp8_e4m3

# vLLM
--kv-cache-dtype fp8
```

This uses half the memory of a BF16 KV cache, allowing roughly 2x the context length.
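A rough sketch of why FP8 KV cache doubles usable context. The model dimensions below are hypothetical placeholders, not Qwen3.5's actual config:

```python
def kv_cache_gb(seq_len: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int) -> float:
    """Per-sequence KV cache size: K and V tensors for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len / 1e9

# Hypothetical dimensions for illustration
dims = dict(seq_len=262_144, layers=60, kv_heads=8, head_dim=128)
print(f"BF16 KV: {kv_cache_gb(**dims, bytes_per_elem=2):.1f} GB per sequence")
print(f"FP8  KV: {kv_cache_gb(**dims, bytes_per_elem=1):.1f} GB per sequence")
```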
Checkpoints without calibrated KV scales default to BF16 KV cache:
- 2x memory usage for KV cache
- Shorter maximum context length
- No quality risk from uncalibrated scales
GLM-5: FP8 KV cache is broken on SM120 -- produces garbled output or emits 1 token and stops. Only BF16 KV cache works.
Kimi K2.5 (SGLang): FP8 KV on the original INT4 checkpoint drops to 16 tok/s. The NVFP4 checkpoint supports FP8 KV at 55 tok/s.
## CARVE: Unlocked/Abliterated Models

CARVE models have been "abliterated" -- their refusal behavior has been surgically removed without retraining. The model is first converted to BF16, abliteration is applied, then it is quantized back to NVFP4.
Benchmarks: `vpyn/Qwen3.5-397B-A17B-CARVE-v1-NVFP4` (CARVE) vs the NVIDIA reference NVFP4 checkpoint (REF):
| Context Length | CARVE tok/s | REF tok/s | Winner |
|---|---|---|---|
| 10K | 76.9 | 92.3 | REF +20% |
| 50K | 75.5 | 91.3 | REF +21% |
| 100K | 74.8 | 73.6 | ~tied |
| 200K | 74.3 | 95.5 | REF +29% |
| 300K | 73.3 | 43.8 | CARVE +67% |
| 400K | 67.9 | 42.3 | CARVE +61% |
| 500K | 67.0 | 42.2 | CARVE +59% |
Key finding: CARVE maintains much better performance at >300K context than the NVIDIA reference NVFP4.
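The Winner column follows directly from the throughput ratios; recomputing from the table:

```python
# (context in K tokens, CARVE tok/s, REF tok/s) from the table above
rows = [(10, 76.9, 92.3), (50, 75.5, 91.3), (100, 74.8, 73.6),
        (200, 74.3, 95.5), (300, 73.3, 43.8), (400, 67.9, 42.3), (500, 67.0, 42.2)]
for ctx, carve, ref in rows:
    winner, ratio = ("CARVE", carve / ref) if carve > ref else ("REF", ref / carve)
    print(f"{ctx}K: {winner} +{(ratio - 1) * 100:.0f}%")
```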
CARVE supports YaRN rope scaling up to ~900K context:
```bash
--hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}'
--max-model-len 921600
```

Requires `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1`.
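The max-model-len follows from the YaRN factor applied to the base context window:

```python
original_max = 262_144  # original_max_position_embeddings from the override
factor = 4.0            # YaRN scaling factor from the override
yarn_max = int(original_max * factor)
print(yarn_max)             # 1048576 positions after scaling
print(921_600 <= yarn_max)  # True: --max-model-len 921600 (~900K) fits
```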
Caveats:

- Do NOT use MTP with CARVE -- "MTP was trained on censored content" and the model is abliterated
- Slightly slower than reference at short context (<200K)
## SM120f Compilation and GEMM Backends

NVFP4 on vLLM was initially slower than INT4 AWQ because kernels lacked the `cvt.rn.satfinite.e2m1x2.f32` PTX instruction for FP32->FP4 conversion. The instruction is available on the SM120 family, but only when code is compiled for sm120f (family-conditional instructions).
FlashInfer initially did not compile for sm120f, resulting in suboptimal NVFP4 performance. This was fixed in FlashInfer PR #2650 and #2716.
SGLang FP4 GEMM backends:

| Backend | Notes |
|---|---|
| `flashinfer_cutlass` | Default. Has a race condition bug causing silent memory corruption at high concurrency. |
| `flashinfer_cudnn` | Faster and more stable. Recommended. |
```bash
# Recommended:
--fp4-gemm-backend flashinfer_cudnn

# Requires:
pip install nvidia-cudnn-cu13==9.19.1.2
```

vLLM FP4 GEMM backends:

```bash
VLLM_NVFP4_GEMM_BACKEND=cutlass
```

vLLM's internal CUTLASS GEMMs compiled for sm120f achieve ~67 tok/s vs FlashInfer with CUDA 13.1 at ~65 tok/s.
vLLM MoE backends:

| Backend | Speed | Notes |
|---|---|---|
| `cutlass` | Fastest for MTP | Only compatible SM120 MoE backend |
| `flashinfer_cutlass` | Default, slightly slower | |
| `deep_gemm` | Falls back to cutlass on SM120 | DeepGemm requires WGMMA/TCGEN05 |
## Known Issues

The default `flashinfer_cutlass` FP4 GEMM backend has a race condition that silently corrupts memory, leading to crashes or token degradation under high concurrency.
Fix: Use `--fp4-gemm-backend flashinfer_cudnn` instead.
Reference: flashinfer#2708
Some vLLM nightly builds exhibit NVFP4 cache corruption.
Fix: Pin to known-good builds or use the orthozany/vllm-qwen35-mtp Docker image.
There are subjective reports of NVFP4 producing lower-quality code than FP8, though this has not been formally benchmarked.
On SM120, the DeepGemm scale format detection incorrectly assumes ue8m0 scales. NVFP4 uses float8_e4m3fn scales, causing NaN output.
Fix:
```bash
sed -i "s/DEEPGEMM_SCALE_UE8M0 = DEEPGEMM_BLACKWELL/DEEPGEMM_SCALE_UE8M0 = False/" \
  /sgl-workspace/sglang/python/sglang/srt/layers/deep_gemm_wrapper/configurer.py
```

For Kimi K2.5, the NVFP4 variant (PTQ from BF16) is slower than the native INT4 checkpoint with Marlin kernels.
"There's no point in doing nvfp4 kimi imo, the source weights were int4." -- luke
NVIDIA's NVFP4 Qwen3.5 checkpoint requires adding "mtp.fc" to the quantization_config.ignore list in config.json:
"ignore": [
"...existing entries...",
"mtp.fc"
]Also add "model.language_model.layers..mlp.gate" to both config.json and hf_quant_config.json.
Related PRs: vLLM #35156, #35675.
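A small helper sketch for patching the ignore list (the key layout follows the config fragment above; treat paths and behavior as illustrative):

```python
import json

def add_ignore_entry(config_path: str, entry: str = "mtp.fc") -> None:
    """Append an entry to quantization_config.ignore in config.json, if missing."""
    with open(config_path) as f:
        cfg = json.load(f)
    ignore = cfg.setdefault("quantization_config", {}).setdefault("ignore", [])
    if entry not in ignore:  # idempotent: safe to run repeatedly
        ignore.append(entry)
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)
```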