[PyTorch Debug] NVFP4 debug stats support #2296
base: main
Conversation
/te-ci pytorch
```diff
 @Registry.register_feature(namespace="transformer_engine")
-class DisableFP8Layer:
+class DisableFP8Layer(DisableQuantizationLayer):
```
It may be worth raising a deprecation warning in the constructor or something. DisableFP8GEMM would also benefit from this.
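One possible shape for that warning is sketched below. The import path and the assumption that the feature base class can be subclassed with a plain `__init__` follow this PR's new file names; this is not the final implementation.

```python
import warnings

# Assumed import path, based on the new disable_quantization_layer.py added in this PR.
from transformer_engine.debug.features.disable_quantization_layer import DisableQuantizationLayer


class DisableFP8Layer(DisableQuantizationLayer):
    """Deprecated alias kept for backward compatibility with existing configs."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "DisableFP8Layer is deprecated; use DisableQuantizationLayer instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(*args, **kwargs)
```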
```python
    return total_elements - first_zeros - second_zeros


def add_nvfp4_underflows_stats():
```
With RHT, it's possible that the NVFP4 data has fewer zeros than the high-precision data, so this stat would be negative. It's not possible yet (we currently only apply RHT to the NVFP4 column-wise data) and it's probably beyond the scope of this PR, but it's something to consider if we ever generalize.
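For context, the underflow stat compares zero counts before and after quantization, which is why transforms like RHT that change the number of zeros matter. Below is a minimal sketch of the nibble-unpacking idea behind `count_nonzero_nvfp4`, assuming two E2M1 values are packed per `uint8` byte and that only the 0b0000 encoding is treated as zero; the normalization of underflows% (here, by total element count) is also a guess, not the PR's exact formula.

```python
import torch


def count_nonzero_nvfp4_sketch(packed_fp4_data: torch.Tensor) -> int:
    """Count non-zero FP4 values in a uint8 tensor packing two E2M1 nibbles per byte."""
    low = packed_fp4_data & 0x0F          # first value of each packed pair
    high = (packed_fp4_data >> 4) & 0x0F  # second value of each packed pair
    total_elements = packed_fp4_data.numel() * 2
    first_zeros = int((low == 0).sum())
    second_zeros = int((high == 0).sum())
    return total_elements - first_zeros - second_zeros


def nvfp4_underflows_percent_sketch(packed_fp4_data: torch.Tensor, original: torch.Tensor) -> float:
    """underflows% ~ share of values that are non-zero in high precision but zero in NVFP4."""
    nonzero_hp = int((original != 0).sum())
    nonzero_fp4 = count_nonzero_nvfp4_sketch(packed_fp4_data)
    return 100.0 * (nonzero_hp - nonzero_fp4) / max(original.numel(), 1)
```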
Greptile Overview
Greptile Summary
This PR adds NVFP4 statistics support (underflows% and MSE) to the debug features and refactors FP8-specific naming to quantization-agnostic terminology. The implementation includes a new LogNvfp4TensorStats feature class, NVFP4-specific statistics computation functions, and comprehensive tests.
Major changes:
- New `LogNvfp4TensorStats` feature for logging NVFP4 tensor statistics (underflows%, mse)
- New quantization-agnostic features `DisableQuantizationGEMM` and `DisableQuantizationLayer`
- Deprecated `DisableFP8GEMM` and `DisableFP8Layer` with proper warnings and backward compatibility
- Refactored variable names from "fp8" to "quantization" throughout the debug infrastructure
- Added NVFP4-specific stat computation functions (`count_nonzero_nvfp4`, `add_nvfp4_underflows_stats`)
- Updated `LogFp8TensorStats` to detect NVFP4 quantizers and provide clear error messages
- Comprehensive test coverage for NVFP4 numerics and what-if analysis scenarios
Issues found:
- Minor: Error message inconsistency in `log_nvfp4_tensor_stats.py` line 169 (says "QuantizedTensor" but checks for "NVFP4TensorStorage")
Confidence Score: 4/5
- This PR is safe to merge with minor style improvements recommended
- The implementation is well-structured with proper error handling, comprehensive tests, and good backward compatibility. The NVFP4 statistics computation logic appears correct after thorough analysis of the nibble unpacking and zero-value detection. The refactoring from FP8-specific to quantization-agnostic naming is clean and consistent. Only one minor style issue was found (error message inconsistency). The deprecation warnings are properly implemented with clear migration paths.
- No files require special attention - all changes are well-tested and properly implemented
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| transformer_engine/debug/features/log_nvfp4_tensor_stats.py | 4/5 | New file adds NVFP4 tensor statistics logging. Minor error message inconsistency found. |
| transformer_engine/debug/features/utils/stats_computation.py | 3/5 | Adds NVFP4-specific stats computation. Function appears correct after thorough analysis. |
| transformer_engine/debug/features/log_fp8_tensor_stats.py | 4/5 | Updated to handle NVFP4 quantizer detection. Good defensive error handling added. |
| transformer_engine/debug/features/disable_quantization_gemm.py | 5/5 | New quantization-agnostic feature, clean implementation with backward compatibility. |
| transformer_engine/debug/features/disable_quantization_layer.py | 5/5 | New quantization-agnostic feature, clean implementation with proper naming. |
| tests/pytorch/debug/test_log.py | 5/5 | Comprehensive NVFP4 tests added with proper numeric validation. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant User as Training Script
    participant API as debug_api
    participant Feature as LogNvfp4TensorStats
    participant Buffer as STATS_BUFFERS
    participant Compute as stats_computation
    User->>API: autocast(recipe=NVFP4BlockScaling())
    User->>User: forward/backward pass
    User->>API: debug_api.step()
    Note over API,Feature: For each layer with NVFP4 quantization
    API->>Feature: inspect_tensor_enabled()
    Feature->>Feature: check_if_stat_is_supported()
    Feature-->>API: return (run_current, next_iter)
    alt Feature enabled for this iteration
        API->>Feature: inspect_tensor()
        Feature->>Feature: verify NVFP4Quantizer & NVFP4TensorStorage
        Feature->>Feature: get_stat_with_prefix() (add nvfp4_ prefix)
        Feature->>Buffer: try_add_buffer()
        Feature->>Feature: update_aux_dict()
        Note over Feature: aux_dict = {"nvfp4": quantized_tensor, "original_tensor": tensor}
        Feature->>Buffer: feed(aux_dict)
        Buffer->>Compute: compute nvfp4_underflows%
        Compute->>Compute: count_nonzero_nvfp4(packed_fp4_data)
        Note over Compute: Unpack nibbles, count non-zeros
        Buffer->>Compute: compute nvfp4_mse
        Compute->>Compute: dequantize & MSE loss
        Buffer-->>Feature: stats collected
    end
    API->>Buffer: flush statistics to log
    Buffer-->>User: statistics logged to file
```
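An illustrative training-script view of this flow is sketched below. The `autocast(recipe=NVFP4BlockScaling())` call follows the diagram above, and the `debug_api` entry points follow the Transformer Engine precision-debug docs; the config file name, feature directory, and import paths are assumptions rather than something verified against this PR.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import NVFP4BlockScaling  # import path assumed
import nvdlfw_inspect.api as debug_api

# Point the debug API at a YAML config that enables LogNvfp4TensorStats
# (hypothetical file name and feature directory).
debug_api.initialize(
    config_file="nvfp4_stats_config.yaml",
    feature_dirs=["transformer_engine/debug/features"],
)

layer = te.Linear(1024, 1024).cuda()
inp = torch.randn(32, 1024, device="cuda", requires_grad=True)

for _ in range(10):
    with te.autocast(recipe=NVFP4BlockScaling()):  # signature taken from the diagram
        out = layer(inp)
    out.sum().backward()
    debug_api.step()  # flush nvfp4_underflows% / nvfp4_mse stats for this iteration
```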
```python
assert isinstance(
    quantized_tensor, NVFP4TensorStorage
), "[NVTORCH INSPECT ERROR] LogNvfp4TensorStats quantized_tensor must be a QuantizedTensor."
```
Error message inconsistency: the assertion message says "quantized_tensor must be a QuantizedTensor" but the type check is for NVFP4TensorStorage.
Suggested change:
```diff
 assert isinstance(
     quantized_tensor, NVFP4TensorStorage
-), "[NVTORCH INSPECT ERROR] LogNvfp4TensorStats quantized_tensor must be a QuantizedTensor."
+), "[NVTORCH INSPECT ERROR] LogNvfp4TensorStats quantized_tensor must be a NVFP4TensorStorage."
```
Description
This PR adds support for NVFP4 statistics: underflows and mse. I add them as a separate feature because we may want to add many NVFP4-specific features later.
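For mse, the NVFP4 tensor is dequantized and compared with the original high-precision tensor, matching the "dequantize & MSE loss" step in the diagram above. A rough sketch, not the exact implementation (the `dequantize()` method name on the quantized tensor is an assumption):

```python
import torch


def nvfp4_mse_sketch(quantized_tensor, original_tensor: torch.Tensor) -> float:
    """MSE between the dequantized NVFP4 tensor and the original high-precision tensor."""
    dequantized = quantized_tensor.dequantize()  # assumed API on the NVFP4 tensor storage
    return torch.nn.functional.mse_loss(
        dequantized.float(), original_tensor.float()
    ).item()
```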
Also, I renamed a few variables from "fp8"-like to "quantization"-like names. I cannot rename all of them (for example, "is_fp8_gemm_enabled" is an API call), so I left some of them.
Fixes # (issue)
Type of change
Checklist: