[WC][OV] NVFP4 support #3967
Pull request overview
This PR introduces NVFP4 (NVIDIA FP4) as a new weight compression dtype for OpenVINO backend only. NVFP4 uses f4e2m1 (E2M1) values with a fixed group size of 16, where each group has an f8e4m3 (E4M3) scale that is further quantized using a per-weight FP32 second-degree scale.
Changes:
- `CompressWeightsMode.NVFP4` is added as a new mode with group size 16, f4e2m1 compressed weights, and a two-level scale (f8e4m3 group scale + FP32 per-weight second-degree scale).
- `do_float_quantization`/`do_float_dequantization` and `do_integer_quantization`/`do_integer_dequantization` are refactored to return/accept `CompressedWeight` instead of tuples of tensors, and `CompressedWeight` is extended with a `second_degree_scale` field.
- The OpenVINO backend's `_create_compression_subgraph` inserts the NVFP4 two-scale dequantization subgraph.
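To make the two-level scheme concrete, here is a minimal numpy sketch of an NVFP4-style quantize-dequantize round trip. This is an illustration of the format described above, not NNCF's implementation: the f8e4m3 emulation is simplified to range clamping (real code would also round the scale's mantissa), and all names are hypothetical.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # non-negative f4e2m1 values
E2M1_MAX = 6.0    # largest magnitude representable in f4e2m1
E4M3_MAX = 448.0  # largest magnitude representable in f8e4m3
GROUP_SIZE = 16   # fixed for NVFP4

def round_to_e2m1(x):
    # Round each magnitude to the nearest representable E2M1 value, keeping the sign.
    sign = np.sign(x)
    idx = np.abs(np.abs(x)[..., None] - E2M1_GRID).argmin(axis=-1)
    return sign * E2M1_GRID[idx]

def nvfp4_quantize_dequantize(weight):
    """Quantize-dequantize round trip for the two-level NVFP4-style scheme (sketch)."""
    g = weight.reshape(weight.shape[0], -1, GROUP_SIZE)
    group_scale = np.abs(g).max(axis=-1, keepdims=True) / E2M1_MAX  # fp32 per-group scales
    # Second-degree (per-weight) fp32 scale chosen so group scales fit the E4M3 range.
    second_degree_scale = max(group_scale.max() / E4M3_MAX, np.finfo(np.float32).tiny)
    # Simplified E4M3 emulation: clamp to the representable range only.
    e4m3_scale = np.clip(group_scale / second_degree_scale, 0.0, E4M3_MAX)
    eff_scale = e4m3_scale * second_degree_scale
    safe = np.where(eff_scale > 0.0, eff_scale, 1.0)  # avoid division by zero for all-zero groups
    q = round_to_e2m1(g / safe)                       # f4e2m1 compressed weight
    return (q * eff_scale).reshape(weight.shape)
```

Since each group's rounding error is at most one E2M1 step times the group scale, the round-trip error stays below `amax(weight) / 6` under this sketch's assumptions.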
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/nncf/parameters.py | Adds `NVFP4 = "nvfp4"` to `CompressWeightsMode` enum |
| src/nncf/quantization/algorithms/weight_compression/parameters.py | Adds `second_degree_scale` field to `CompressedWeight`; removes `is_codebook()` method |
| src/nncf/quantization/algorithms/weight_compression/config.py | Maps NVFP4 to f4e2m1 compression dtype; marks it as non-integer |
| src/nncf/quantization/algorithms/weight_compression/weight_lowering.py | Implements NVFP4 two-scale quantization/dequantization; refactors `do_float/integer_quantization/dequantization` return types |
| src/nncf/quantization/algorithms/weight_compression/algorithm.py | Sets default group size 16 and validates NVFP4 group size constraint |
| src/nncf/quantization/algorithms/weight_compression/openvino_backend.py | Inserts two-scale dequantization subgraph for NVFP4 |
| src/nncf/quantization/algorithms/weight_compression/scale_estimation.py | Updates call sites to use new `CompressedWeight` return types |
| src/nncf/quantization/algorithms/weight_compression/lora_correction.py | Updates dequantization call sites |
| src/nncf/quantization/quantize_model.py | Adds NVFP4 to unsupported mode lists for Torch/ONNX backends |
| src/nncf/openvino/optimized_functions/functions.py | Updates optimized quantization functions to return `CompressedWeight` |
| tests/openvino/native/quantization/test_weights_compression.py | Adds comprehensive NVFP4 tests: subgraph check, scale range check, unsupported params, mixed precision |
| tests/openvino/native/data/2026.0/reference_scales/IntegerModel_compressed_weights_nvfp4.json | Reference data for NVFP4 compressed weight test |
| tests/openvino/optimized_functions/test_compression_functions.py | Updates to use `CompressedWeight` object in assertions |
| tests/torch/function_hook/quantization/test_weights_compression.py | Adds NVFP4 to unsupported modes for Torch |
| tests/onnx/quantization/test_weights_compression.py | Adds NVFP4 to unsupported modes for ONNX |
| docs/usage/post_training_compression/weights_compression/Usage.md | Documents NVFP4 format in the modes table |
| docs/Algorithms.md | Mentions NVFP4 in supported types |
| .ci/cspell_dict.txt | Adds "nvfp" to the spellcheck dictionary |
Comments suppressed due to low confidence (3)
src/nncf/quantization/algorithms/weight_compression/weight_lowering.py:451
- The docstring for `do_integer_quantization` still says `:return: A tuple containing the compressed weights, scale, and zero point tensors.`, but the return type was changed from `tuple[Tensor, Tensor, Tensor]` to `CompressedWeight`. The return description should be updated to match the new return type.

```python
) -> CompressedWeight:
    """
    Performs integer quantization on the given weight tensor.

    :param weight: The weight tensor to quantize.
    :param config: The weight compression configuration.
    :param reduction_axes: Axes along which to reduce (collect) statistics (e.g., min, max). Not required if
        precomputed scale (and zero point) are provided.
    :param precomputed_scale: Optional precomputed scale tensor.
    :param precomputed_zero_point: Optional precomputed zero point tensor.
    :return: A tuple containing the compressed weights, scale, and zero point tensors.
```
src/nncf/quantization/algorithms/weight_compression/weight_lowering.py:515
- The docstring for `integer_quantize_dequantize_weight` still says `:return: Dequantized weight tensor or a tuple containing the decompressed weight, compressed weight, scale, (and zero point).`, but the return type was changed to `Tensor | tuple[Tensor, CompressedWeight]`. The description should be updated to reflect that when `return_compressed_weight=True`, only a `(decompressed_weight, CompressedWeight)` tuple is returned.

```python
"""
First quantizes the given weight tensor to integer dtype and then dequantizes it back to obtain float32 values.

:param weight: The weight tensor to quantize-dequantize.
:param config: Compression configuration.
:param reduction_axes: Axes along which to reduce (collect) statistics (e.g., min, max). Not required if
    precomputed scale (and zero point) are provided.
:param precomputed_scale: Optional precomputed scale tensor.
:param precomputed_zero_point: Optional precomputed zero point tensor.
:param return_compressed_weight: If True, besides decompressed weight will also return compressed weight, scale,
    (and zero point).
:return: Dequantized weight tensor or a tuple containing the decompressed weight, compressed weight, scale,
    (and zero point).
```
src/nncf/quantization/algorithms/weight_compression/weight_lowering.py:250
- The docstring for `float_quantize_dequantize_weight` still says `:return: Dequantized weight tensor or a tuple containing the decompressed weight, compressed weight and scale.`, but the return type was changed to `Tensor | tuple[Tensor, CompressedWeight]`. The description should be updated to reflect that the tuple now contains `(decompressed_weight, CompressedWeight)` rather than `(decompressed_weight, compressed_weight, scale)`.

```python
) -> Tensor | tuple[Tensor, CompressedWeight]:
    """
    First quantizes the given weight tensor to float dtype and then dequantizes it back to obtain float32 values.

    :param weight: The weight tensor to quantize-dequantize.
    :param config: Compression configuration.
    :param reduction_axes: Axes along which to reduce statistics. Not required if precomputed scale are provided.
    :param precomputed_scale: Optional precomputed scale tensor.
    :param return_compressed_weight: If True, besides decompressed weight will also return compressed weight and scale.
    :return: Dequantized weight tensor or a tuple containing the decompressed weight, compressed weight and scale.
```
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
| """ | ||
| The method dequantizes the given integer weights to float point data type in accordance with the scale and | ||
| zero_point data type. | ||
|
|
There was a problem hiding this comment.
The docstring for `do_integer_dequantization` (line 415) documents `:param compressed_weight:`, but the actual function parameter is named `compressed_weights` (note the plural). This is a mismatch between the docstring and the function signature.
```python
:return: CompressedWeight instance containing the compressed weight tensor, scale,
    and optionally second degree scale or codebook with indexes.
```
The docstring at lines 167-168 says `:return: Returns quantized (for codebook normalized) weight tensor and corresponding scale tensor, optional second degree scale and optional indexes for codebook.` This return description describes individual tuple elements, but the function now returns a single `CompressedWeight` object. The docstring should be updated to accurately reflect the new return type.
```diff
-:return: CompressedWeight instance containing the compressed weight tensor, scale,
-    and optionally second degree scale or codebook with indexes.
+:return: CompressedWeight instance encapsulating the compressed weight tensor and associated scale data.
```
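For orientation, the container under discussion can be sketched roughly as the dataclass below. Field names follow the PR description (tensor, scale, zero_point, codebook, second_degree_scale); the types and exact layout of NNCF's real `CompressedWeight` may differ.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class CompressedWeight:
    """Illustrative container; fields follow the PR description, not the exact NNCF code."""
    tensor: Any                                # compressed weight values (f4e2m1 for NVFP4)
    scale: Any                                 # per-group decompression scale (f8e4m3 for NVFP4)
    zero_point: Optional[Any] = None           # integer quantization only
    codebook: Optional[Any] = None             # vector quantization (LUT) only
    second_degree_scale: Optional[Any] = None  # fp32 scale applied to `scale` itself (NVFP4)
```

Returning one such object instead of a 5-tuple lets call sites read named attributes and lets new fields be added without breaking positional unpacking.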
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.
```python
compressed_weight = do_integer_quantization(w, config, -1)

assert np.allclose(np.abs(compressed_weight.tensor.data), np.abs(w.data))
```
I'd check that such a 2-scale decompression subgraph can be inferred by OpenVINO on CPU. A single-layer test would be enough.
Made a small sanity test with a reference output, please check
```python
:param zero_point: The zero-point, it is the value of the compression type corresponding to the value 0
    in the non-compression realm. Applicable for INT quantization.
:param codebook: The codebook (LUT) for the weight compression. Applicable for vector quantization
:param second_degree_scale: The second degree scale used when the decompression scale itself is compressed.
```
Is it official name for this kind of scale? I've seen the terms "super-scale" or "super-block-scale" before.
With a code search I found:
NVIDIA Model Optimizer (NVIDIA/Model-Optimizer) — Implementation of NVFP4 quantization showing global_amax, global_scale, weights_scaling_factor_2, and _double_scale:
https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/quantization/qtensor/nvfp4_tensor.py
https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/quantization/triton/fp4_kernel_hopper.py
https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/quantization/nn/modules/tensor_quantizer.py (class NVFP4StaticQuantizer)
I like the global scale name, what do you think?
Yes, "global scale" looks good to me.
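As a side note, the Model Optimizer links above derive that global scale from the per-tensor amax; a common formulation (an assumption here, paraphrasing NVIDIA's recipe rather than quoting NNCF code) picks it so that the largest per-group fp32 scale lands exactly on the E4M3 maximum:

```python
import numpy as np

E2M1_MAX = 6.0    # f4e2m1 max magnitude
E4M3_MAX = 448.0  # f8e4m3 max magnitude

def nvfp4_global_scale(weight):
    # Per-tensor second-degree ("global") scale, sketched after NVIDIA's recipe:
    # the largest per-group fp32 scale is amax / 6, and dividing by this value
    # maps it onto the E4M3 maximum of 448.
    return float(np.abs(weight).max() / (E4M3_MAX * E2M1_MAX))
```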
| MXFP8_E4M3 | E4M3 | E8M0 | Group-wise (32) | [MX-compliant FP8](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) |
| FP8_E4M3 | E4M3 | FP16 | Per-channel / Group-wise | [FP8](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) |
| FP4 | E2M1 | FP16 | Per-channel / Group-wise | [FP4](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) |
| NVFP4 | E2M1 | E4M3 per group / FP32 per weight | Group-wise (16) | [NVFP4](https://www.arxiv.org/pdf/2602.14582) |
This is probably a link to the wrong paper:
"YOLO26: A Comprehensive Architecture Overview and Key Improvements"
Ohh, nice catch!
NVFP4 dtype is introduced:
- f4e2m1 weight compression with constant group size 16
- Scale is compressed to f8e4m3 using a single fp32 second-degree scale

Changes
- `CompressedWeight` container is extended with a `second_degree_scale` attribute
- Functions return `CompressedWeight` instead of a list of tensors to simplify their output (instead of returning 5 tensors, a container with named attributes is returned)

Reason for changes
Related tickets
Tests
- `test_compare_compressed_weights` checks the subgraph is correct and scales/compressed weight are calculated correctly
- `test_float_compressed_weighs_range` checks that `do_float_quantization` and `do_float_dequantization` are correct with NVFP4
- `TestUnsupportedParams` (+ `test_nvfp4_precomputed_scales`) checks that no algorithm / group_size != 16 / fallback mode / precomputed scales are supported with NVFP4
- `test_mixed_precision_fp` checks the correctness of the mixed precision algorithm with NVFP4 (and the correctness of the group_size=16 param)
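A range check like the one above typically verifies that every quantized value lands on the representable f4e2m1 grid. A hedged sketch of such a helper (the name and shape are hypothetical, not the test in this PR):

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # non-negative f4e2m1 values

def on_e2m1_grid(values, atol=1e-7):
    # True if every value (up to sign) is a representable f4e2m1 number.
    mag = np.abs(np.asarray(values, dtype=np.float64)).ravel()
    return bool(np.abs(mag[:, None] - E2M1_GRID).min(axis=1).max() <= atol)
```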