Improving the accuracy of NVFP4 quantization with Adaptive Block Scaling.
This repository contains kernels for efficient NVFP4 quantization and matrix multiplication, along with code for fast post-training quantization with our method, 4/6 (Four Over Six). If you have any questions, please get in touch or submit an issue.
Requirements:
- Python version 3.10 or newer
- CUDA toolkit 12.8 or newer
- PyTorch version 2.8 or newer
Install dependencies:
pip install ninja packaging psutil "setuptools>=77.0.3"
Install fouroversix:
pip install fouroversix --no-build-isolation
Alternatively, you can compile from source:
pip install --no-build-isolation -e .
To speed up build times, set CUDA_ARCHS=100 to compile kernels only for B-series GPUs (e.g. B200, GB200, GB300), or CUDA_ARCHS=120 for RTX 50 and 60 Series GPUs (e.g. RTX 5090, RTX 6000).
Also, if you don't have a Blackwell GPU, you may use our reference implementation, which is slow but helpful for testing, by setting SKIP_CUDA_BUILD=1 before running pip install.
To run PTQ experiments, make sure to install our test dependencies using either:
pip install "fouroversix[evals]" --no-build-isolation
# Or, if installing from source:
pip install --no-build-isolation -e ".[evals]"Also, make sure all submodules are pulled and up to date:
git submodule update --init
Then, install dependencies for each PTQ method as needed, following the instructions here.
from fouroversix import ModelQuantizationConfig, quantize_model
from transformers import AutoModelForCausalLM
# NVFP4 using 4/6 with MSE block selection
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
quantize_model(model)
# Standard NVFP4 round-to-nearest quantization
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
config = ModelQuantizationConfig(scale_rule="static_6")
quantize_model(model, config)
Check the quantize_to_fp4 arguments for more details about how you can enable certain features during quantization, such as stochastic rounding or 2D block quantization.
import torch
from fouroversix import QuantizationConfig, quantize_to_fp4
x = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")
x_quantized = quantize_to_fp4(x)
# Standard NVFP4 round-to-nearest quantization
config = QuantizationConfig(scale_rule="static_6")
x_quantized = quantize_to_fp4(x, config)
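The optional features mentioned earlier (such as stochastic rounding or 2D block scaling) are enabled through the same config. The sketch below only illustrates the pattern; the argument names stochastic_rounding and block_scale_2d are placeholders, not the library's actual parameters, so check the quantize_to_fp4 and QuantizationConfig signatures for the real names.

import torch
from fouroversix import QuantizationConfig, quantize_to_fp4

x = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")

# NOTE: "stochastic_rounding" and "block_scale_2d" are placeholder argument
# names used only to illustrate the pattern; see the actual quantize_to_fp4
# and QuantizationConfig signatures for the real names and defaults.
config = QuantizationConfig(stochastic_rounding=True, block_scale_2d=True)
x_quantized = quantize_to_fp4(x, config)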
from fouroversix import fp4_matmul
# a and b can be either high-precision BF16 tensors, in which case they will be
# quantized, or low-precision QuantizedTensors if you've already quantized them
# yourself.
out = fp4_matmul(a, b)
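As a minimal sketch of both usage modes (assuming the default quantization settings are appropriate for both operands):

import torch
from fouroversix import fp4_matmul, quantize_to_fp4

a = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")
b = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")

# Option 1: pass BF16 tensors directly and let fp4_matmul quantize them.
out = fp4_matmul(a, b)

# Option 2: quantize ahead of time (e.g. to reuse quantized weights across
# many matmuls) and pass the resulting QuantizedTensors.
a_q = quantize_to_fp4(a)
b_q = quantize_to_fp4(b)
out = fp4_matmul(a_q, b_q)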
# Round-to-nearest quantization with 4/6:
python -m scripts.ptq --model-name meta-llama/Llama-3.2-1B --ptq-method rtn --task wikitext
# Standard NVFP4 round-to-nearest (RTN) quantization:
python -m scripts.ptq --model-name meta-llama/Llama-3.2-1B --ptq-method rtn --task wikitext --a-scale-rule static_6 --w-scale-rule static_6
# AWQ with 4/6:
python -m scripts.ptq --model-name meta-llama/Llama-3.2-1B --ptq-method awq --task wikitext
# High-precision baseline, no NVFP4 quantization:
python -m scripts.ptq --model-name meta-llama/Llama-3.2-1B --ptq-method high_precision --task wikitext
If you would prefer not to worry about setting up your local environment, or about acquiring a Blackwell GPU to run your experiments faster, you may run PTQ experiments on Modal by adding the --modal flag, and optionally the --detach flag, which lets you press CTRL+C without stopping the experiment.
The first time you launch experiments on Modal, it may take several minutes to build everything, but subsequent commands will reuse the cached images.
This repository contains three implementations of NVFP4 quantization, each with its own limitations:
- CUDA: Supports most but not all operations needed for efficient NVFP4 training. More operations will be added soon. Requires a Blackwell GPU.
- Triton: Supports all operations needed for efficient NVFP4 training, including stochastic rounding, the random Hadamard transform, transposed inputs, and 2D block scaling. Requires a Blackwell GPU.
- PyTorch: A reference implementation written in PyTorch that can run on any GPU. May have some educational value, but is not intended for real-world use.
When used with 4/6, these implementations have subtle numerical differences which can cause results to differ slightly, but not in a way that should cause uniformly worse performance for any of them. For more details, see here.
Our quantize_to_fp4 function will automatically select one of these backends based on your GPU and the quantization parameters you select.
If you would like to force selection of a specific backend, you may specify it by setting backend=QuantizeBackend.cuda in the quantization config passed to quantize_to_fp4, or quantize_backend=QuantizeBackend.cuda in the layer and model configs passed to quantize_model.
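For example, a sketch along these lines would force the CUDA backend for both tensor-level and model-level quantization (this assumes QuantizeBackend is importable from the top-level fouroversix package):

import torch
from fouroversix import (
    ModelQuantizationConfig,
    QuantizationConfig,
    QuantizeBackend,  # assumed to be exported from the top-level package
    quantize_model,
    quantize_to_fp4,
)
from transformers import AutoModelForCausalLM

# Force the CUDA backend for a single tensor.
x = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")
x_quantized = quantize_to_fp4(x, QuantizationConfig(backend=QuantizeBackend.cuda))

# Force the CUDA backend when quantizing a whole model.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
config = ModelQuantizationConfig(quantize_backend=QuantizeBackend.cuda)
quantize_model(model, config)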
We welcome contributions to our repository, but please get in touch before making any substantial changes. Also, please make sure any code changes pass our linter:
ruff check
Please use the following BibTeX entry to cite this work:
@misc{cook2025sixaccuratenvfp4quantization,
  title={Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling},
  author={Jack Cook and Junxian Guo and Guangxuan Xiao and Yujun Lin and Song Han},
  year={2025},
  eprint={2512.02010},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.02010},
}
This repository is available under the MIT license. See the LICENSE.md file for details.
