[WC][OV] NVFP4 support #3967
Pull request overview
This PR introduces NVFP4 (NVIDIA FP4) as a new weight compression dtype for OpenVINO backend only. NVFP4 uses f4e2m1 (E2M1) values with a fixed group size of 16, where each group has an f8e4m3 (E4M3) scale that is further quantized using a per-weight FP32 second-degree scale.
Changes:
- `CompressWeightsMode.NVFP4` is added as a new mode with group size 16, f4e2m1 compressed weights, and a two-level scale (f8e4m3 group scale + FP32 per-weight second-degree scale).
- `do_float_quantization`/`do_float_dequantization` and `do_integer_quantization`/`do_integer_dequantization` are refactored to return/accept `CompressedWeight` instead of tuples of tensors, and `CompressedWeight` is extended with a `second_degree_scale` field.
- The OpenVINO backend's `_create_compression_subgraph` inserts the NVFP4 two-scale dequantization subgraph.
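To make the two-level scheme concrete, here is a minimal numpy sketch of an NVFP4-style quantize-dequantize round trip. This is an illustration of the format described above, not NNCF's implementation: the f8e4m3 emulation is simplified to range clamping (real code would also round the scale's mantissa), and all names are hypothetical.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # non-negative f4e2m1 values
E2M1_MAX = 6.0    # largest magnitude representable in f4e2m1
E4M3_MAX = 448.0  # largest magnitude representable in f8e4m3
GROUP_SIZE = 16   # fixed for NVFP4

def round_to_e2m1(x):
    # Round each magnitude to the nearest representable E2M1 value, keeping the sign.
    sign = np.sign(x)
    idx = np.abs(np.abs(x)[..., None] - E2M1_GRID).argmin(axis=-1)
    return sign * E2M1_GRID[idx]

def nvfp4_quantize_dequantize(weight):
    """Quantize-dequantize round trip for the two-level NVFP4-style scheme (sketch)."""
    g = weight.reshape(weight.shape[0], -1, GROUP_SIZE)
    group_scale = np.abs(g).max(axis=-1, keepdims=True) / E2M1_MAX  # fp32 per-group scales
    # Second-degree (per-weight) fp32 scale chosen so group scales fit the E4M3 range.
    second_degree_scale = max(group_scale.max() / E4M3_MAX, np.finfo(np.float32).tiny)
    # Simplified E4M3 emulation: clamp to the representable range only.
    e4m3_scale = np.clip(group_scale / second_degree_scale, 0.0, E4M3_MAX)
    eff_scale = e4m3_scale * second_degree_scale
    safe = np.where(eff_scale > 0.0, eff_scale, 1.0)  # avoid division by zero for all-zero groups
    q = round_to_e2m1(g / safe)                       # f4e2m1 compressed weight
    return (q * eff_scale).reshape(weight.shape)
```

Since each group's rounding error is at most one E2M1 step times the group scale, the round-trip error stays below `amax(weight) / 6` under this sketch's assumptions.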
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/nncf/parameters.py | Adds `NVFP4 = "nvfp4"` to `CompressWeightsMode` enum |
| src/nncf/quantization/algorithms/weight_compression/parameters.py | Adds `second_degree_scale` field to `CompressedWeight`; removes `is_codebook()` method |
| src/nncf/quantization/algorithms/weight_compression/config.py | Maps NVFP4 to f4e2m1 compression dtype; marks it as non-integer |
| src/nncf/quantization/algorithms/weight_compression/weight_lowering.py | Implements NVFP4 two-scale quantization/dequantization; refactors `do_float/integer_quantization/dequantization` return types |
| src/nncf/quantization/algorithms/weight_compression/algorithm.py | Sets default group size 16 and validates NVFP4 group size constraint |
| src/nncf/quantization/algorithms/weight_compression/openvino_backend.py | Inserts two-scale dequantization subgraph for NVFP4 |
| src/nncf/quantization/algorithms/weight_compression/scale_estimation.py | Updates call sites to use new `CompressedWeight` return types |
| src/nncf/quantization/algorithms/weight_compression/lora_correction.py | Updates dequantization call sites |
| src/nncf/quantization/quantize_model.py | Adds NVFP4 to unsupported mode lists for Torch/ONNX backends |
| src/nncf/openvino/optimized_functions/functions.py | Updates optimized quantization functions to return `CompressedWeight` |
| tests/openvino/native/quantization/test_weights_compression.py | Adds comprehensive NVFP4 tests: subgraph check, scale range check, unsupported params, mixed precision |
| tests/openvino/native/data/2026.0/reference_scales/IntegerModel_compressed_weights_nvfp4.json | Reference data for NVFP4 compressed weight test |
| tests/openvino/optimized_functions/test_compression_functions.py | Updates to use `CompressedWeight` object in assertions |
| tests/torch/function_hook/quantization/test_weights_compression.py | Adds NVFP4 to unsupported modes for Torch |
| tests/onnx/quantization/test_weights_compression.py | Adds NVFP4 to unsupported modes for ONNX |
| docs/usage/post_training_compression/weights_compression/Usage.md | Documents NVFP4 format in the modes table |
| docs/Algorithms.md | Mentions NVFP4 in supported types |
| .ci/cspell_dict.txt | Adds "nvfp" to the spellcheck dictionary |
Comments suppressed due to low confidence (3)
src/nncf/quantization/algorithms/weight_compression/weight_lowering.py:451
- The docstring for `do_integer_quantization` still says `:return: A tuple containing the compressed weights, scale, and zero point tensors.`, but the return type was changed from `tuple[Tensor, Tensor, Tensor]` to `CompressedWeight`. The return description should be updated to match the new return type.

```python
) -> CompressedWeight:
    """
    Performs integer quantization on the given weight tensor.

    :param weight: The weight tensor to quantize.
    :param config: The weight compression configuration.
    :param reduction_axes: Axes along which to reduce (collect) statistics (e.g., min, max). Not required if
        precomputed scale (and zero point) are provided.
    :param precomputed_scale: Optional precomputed scale tensor.
    :param precomputed_zero_point: Optional precomputed zero point tensor.
    :return: A tuple containing the compressed weights, scale, and zero point tensors.
```
src/nncf/quantization/algorithms/weight_compression/weight_lowering.py:515
- The docstring for `integer_quantize_dequantize_weight` still says `:return: Dequantized weight tensor or a tuple containing the decompressed weight, compressed weight, scale, (and zero point).`, but the return type was changed to `Tensor | tuple[Tensor, CompressedWeight]`. The description should be updated to reflect that when `return_compressed_weight=True`, only a `(decompressed_weight, CompressedWeight)` tuple is returned.

```python
"""
First quantizes the given weight tensor to integer dtype and then dequantizes it back to obtain float32 values.

:param weight: The weight tensor to quantize-dequantize.
:param config: Compression configuration.
:param reduction_axes: Axes along which to reduce (collect) statistics (e.g., min, max). Not required if
    precomputed scale (and zero point) are provided.
:param precomputed_scale: Optional precomputed scale tensor.
:param precomputed_zero_point: Optional precomputed zero point tensor.
:param return_compressed_weight: If True, besides decompressed weight will also return compressed weight, scale,
    (and zero point).
:return: Dequantized weight tensor or a tuple containing the decompressed weight, compressed weight, scale,
    (and zero point).
```
src/nncf/quantization/algorithms/weight_compression/weight_lowering.py:250
- The docstring for `float_quantize_dequantize_weight` still says `:return: Dequantized weight tensor or a tuple containing the decompressed weight, compressed weight and scale.`, but the return type was changed to `Tensor | tuple[Tensor, CompressedWeight]`. The description should be updated to reflect that the tuple now contains `(decompressed_weight, CompressedWeight)` rather than `(decompressed_weight, compressed_weight, scale)`.

```python
) -> Tensor | tuple[Tensor, CompressedWeight]:
    """
    First quantizes the given weight tensor to float dtype and then dequantizes it back to obtain float32 values.

    :param weight: The weight tensor to quantize-dequantize.
    :param config: Compression configuration.
    :param reduction_axes: Axes along which to reduce statistics. Not required if precomputed scale are provided.
    :param precomputed_scale: Optional precomputed scale tensor.
    :param return_compressed_weight: If True, besides decompressed weight will also return compressed weight and scale.
    :return: Dequantized weight tensor or a tuple containing the decompressed weight, compressed weight and scale.
```
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
| """ | ||
| The method dequantizes the given integer weights to float point data type in accordance with the scale and | ||
| zero_point data type. | ||
|
|
There was a problem hiding this comment.
The docstring for `do_integer_dequantization` (line 415) documents `:param compressed_weight:`, but the actual function parameter is named `compressed_weights` (note the plural). This is a mismatch between the docstring and the function signature.
```python
:return: CompressedWeight instance containing the compressed weight tensor, scale,
    and optionally second degree scale or codebook with indexes.
```
The docstring at lines 167-168 says `:return: Returns quantized (for codebook normalized) weight tensor and corresponding scale tensor, optional second degree scale and optional indexes for codebook.` This return description describes individual tuple elements, but the function now returns a single `CompressedWeight` object. The docstring should be updated to accurately reflect the new return type.
```diff
-:return: CompressedWeight instance containing the compressed weight tensor, scale,
-    and optionally second degree scale or codebook with indexes.
+:return: CompressedWeight instance encapsulating the compressed weight tensor and associated scale data.
```
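For orientation, the container under discussion can be sketched roughly as the dataclass below. Field names follow the PR description (tensor, scale, zero_point, codebook, second_degree_scale); the types and exact layout of NNCF's real `CompressedWeight` may differ.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class CompressedWeight:
    """Illustrative container; fields follow the PR description, not the exact NNCF code."""
    tensor: Any                                # compressed weight values (f4e2m1 for NVFP4)
    scale: Any                                 # per-group decompression scale (f8e4m3 for NVFP4)
    zero_point: Optional[Any] = None           # integer quantization only
    codebook: Optional[Any] = None             # vector quantization (LUT) only
    second_degree_scale: Optional[Any] = None  # fp32 scale applied to `scale` itself (NVFP4)
```

Returning one such object instead of a 5-tuple lets call sites read named attributes and lets new fields be added without breaking positional unpacking.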
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.
```python
compressed_weight = do_integer_quantization(w, config, -1)

assert np.allclose(np.abs(compressed_weight.tensor.data), np.abs(w.data))
```
I'd check that such a 2-scale decompression subgraph can be inferred by OpenVINO on CPU. A single-layer test would be enough.
Made a small sanity test with a reference output, please check
```python
:param zero_point: The zero-point, it is the value of the compression type corresponding to the value 0
    in the non-compression realm. Applicable for INT quantization.
:param codebook: The codebook (LUT) for the weight compression. Applicable for vector quantization
:param second_degree_scale: The second degree scale used when the decompression scale itself is compressed.
```
Is it official name for this kind of scale? I've seen the terms "super-scale" or "super-block-scale" before.
With a code search I found:
NVIDIA Model Optimizer (NVIDIA/Model-Optimizer) — Implementation of NVFP4 quantization showing global_amax, global_scale, weights_scaling_factor_2, and _double_scale:
https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/quantization/qtensor/nvfp4_tensor.py
https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/quantization/triton/fp4_kernel_hopper.py
https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/quantization/nn/modules/tensor_quantizer.py (class NVFP4StaticQuantizer)
I like the global scale name, what do you think?
Yes, "global scale" looks good to me.
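As a side note, the Model Optimizer links above derive that global scale from the per-tensor amax; a common formulation (an assumption here, paraphrasing NVIDIA's recipe rather than quoting NNCF code) picks it so that the largest per-group fp32 scale lands exactly on the E4M3 maximum:

```python
import numpy as np

E2M1_MAX = 6.0    # f4e2m1 max magnitude
E4M3_MAX = 448.0  # f8e4m3 max magnitude

def nvfp4_global_scale(weight):
    # Per-tensor second-degree ("global") scale, sketched after NVIDIA's recipe:
    # the largest per-group fp32 scale is amax / 6, and dividing by this value
    # maps it onto the E4M3 maximum of 448.
    return float(np.abs(weight).max() / (E4M3_MAX * E2M1_MAX))
```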
| MXFP8_E4M3 | E4M3 | E8M0 | Group-wise (32) | [MX-compliant FP8](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) |
| FP8_E4M3 | E4M3 | FP16 | Per-channel / Group-wise | [FP8](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) |
| FP4 | E2M1 | FP16 | Per-channel / Group-wise | [FP4](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) |
| NVFP4 | E2M1 | E4M3 per group / FP32 per weight | Group-wise (16) | [NVFP4](https://www.arxiv.org/pdf/2602.14582) |
This is probably a link to the wrong paper:
"YOLO26: A Comprehensive Architecture Overview and Key Improvements"
Ohh, nice catch!
NVFP4 dtype is introduced:
- f4e2m1 weight compression with constant group size 16
- Scale is compressed to f8e4m3 using a single fp32 second-degree scale

Changes
- `CompressedWeight` container is extended with a `second_degree_scale` attribute
- Functions return `CompressedWeight` instead of a list of tensors to simplify their output (instead of returning 5 tensors, a container with named attributes is returned)

Reason for changes
Related tickets
Tests
- `test_compare_compressed_weights` checks the subgraph is correct and scales/compressed weight are calculated correctly
- `test_float_compressed_weighs_range` checks that `do_float_quantization` and `do_float_dequantization` are correct with NVFP4
- `TestUnsupportedParams` (+ `test_nvfp4_precomputed_scales`) checks that no algorithm / group_size != 16 / fallback mode / precomputed scales are supported with NVFP4
- `test_mixed_precision_fp` checks the correctness of the mixed precision algorithm with NVFP4 (and the correctness of the group_size=16 param)
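A range check like the one above typically verifies that every quantized value lands on the representable f4e2m1 grid. A hedged sketch of such a helper (the name and shape are hypothetical, not the test in this PR):

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # non-negative f4e2m1 values

def on_e2m1_grid(values, atol=1e-7):
    # True if every value (up to sign) is a representable f4e2m1 number.
    mag = np.abs(np.asarray(values, dtype=np.float64)).ravel()
    return bool(np.abs(mag[:, None] - E2M1_GRID).min(axis=1).max() <= atol)
```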