
Conversation

@timmoon10 (Collaborator) commented Dec 6, 2025

Description

All of the supported block-scaled tensor formats (MXFP8, NVFP4, DSv3 FP8) have two ways of ordering their scaling factors:

  • "Compact" ordering for quantization, dequantization, and communication
  • "Swizzled" ordering for GEMM

The core infrastructure handles this in an ad hoc way, blindly assuming that the "right" scale ordering is used for each operation. The PyTorch infrastructure only supports MXFP8 and NVFP4 scales in compact order, although the DSv3 FP8 implementation is aware of both "compact" and "GEMM-ready" formats. This situation makes it hard to implement fused kernels that bypass the swizzle kernel.

This PR adds a with_gemm_swizzled_scales field in the C++ tensor class so that the core infrastructure can distinguish between the different scale orderings. It also adds this field to the PyTorch quantized tensor classes and exposes an optimize_for_gemm option in the quantizer, so that tensors which do not need communication or checkpointing can be created directly with GEMM-swizzled scales. Finally, it rips out all the DSv3 FP8 infrastructure for the compact format, which is no longer necessary.
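As a rough illustration (not code from this PR), the intended usage from the PyTorch side might look like the sketch below. The optimize_for_gemm and with_gemm_swizzled_scales names come from the description above; the quantizer class, the dtype enum, the exact way the option is passed, and the attribute name are assumptions and may differ from the real API.

import torch
import transformer_engine_torch as tex
from transformer_engine.pytorch.tensor.mxfp8_tensor import MXFP8Quantizer

x = torch.randn(1024, 1024, device="cuda", dtype=torch.bfloat16)

# Hypothetical: ask the quantizer for a tensor whose scales are already in the
# swizzled GEMM ordering, so no separate swizzle kernel is needed before the GEMM.
quantizer = MXFP8Quantizer(fp8_dtype=tex.DType.kFloat8E4M3, optimize_for_gemm=True)
x_mxfp8 = quantizer(x)

# Hypothetical attribute mirroring the new C++ field; weights and other tensors
# that never need communication or checkpointing could be kept in this form.
assert x_mxfp8._with_gemm_swizzled_scales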

Progress

  • MXFP8
  • DSv3 FP8
  • NVFP4
  • Add option to pre-swizzle weights
  • Pre-swizzle activations
  • Fused MXFP8 quantize + swizzle

Closes #2446.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Support GEMM swizzled scales in C++ tensor class
  • Support GEMM swizzled scales in PyTorch quantized tensor classes
  • Support optimize_for_gemm option in PyTorch quantizer
  • Expose PyTorch function to swizzle scales
  • Support MXFP8 quantization with pre-swizzled scales
  • Enable fused quantize+swizzle kernels in linear module and related
  • Remove DSv3 FP8 compact data format. It was used to avoid all-gather interleaving, which we can now fix with the swap-first-dims kernel (see the sketch below).
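The last bullet deserves a short illustration. The snippet below is plain PyTorch, not TE code: it shows why all-gathering column-wise (transposed) shards leaves the rank dimension in the wrong place, and why swapping the two leading dimensions of the gathered buffer is enough to fix it, which is what makes a separate compact format unnecessary.

import torch

world, m_shard, n = 4, 8, 16
full = torch.arange(world * m_shard * n).reshape(world * m_shard, n)

# Each rank holds a transposed copy of its row shard: shape [n, m_shard].
shards_t = [full[r * m_shard:(r + 1) * m_shard].t().contiguous() for r in range(world)]

# All-gather concatenates along dim 0, giving [world * n, m_shard], which is
# not the transpose of the full tensor: the rank dimension sits in front.
gathered = torch.cat(shards_t, dim=0)

# Swap the two leading dims ([world, n, m_shard] -> [n, world, m_shard]) and
# flatten to recover the transpose of the full tensor.
fixed = gathered.reshape(world, n, m_shard).transpose(0, 1).reshape(n, world * m_shard)
assert torch.equal(fixed, full.t())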

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@timmoon10 force-pushed the tmoon/pre-swizzled-scales branch from d274220 to 52ce3a4 on December 6, 2025 02:53
@timmoon10 added the enhancement and refactor labels on Dec 6, 2025
@timmoon10 force-pushed the tmoon/pre-swizzled-scales branch from 4925b63 to 1de4b5e on December 10, 2025 07:19

}
}

void nvte_set_tensor_param_v2(NVTETensor tensor, NVTETensorParam param, const void *buf,
Collaborator:

Why do we need a v2 here?

Collaborator Author:

I don't want to break the existing APIs. That said, this PR isn't fully backward-compatible because the GEMM no longer secretly assumes that MXFP8 scales are swizzled.

Update copyright years. Tweak comments. Fix various complaints from @greptile-apps.

Signed-off-by: Tim Moon <[email protected]>

@pytest.mark.parametrize("quant_dtype", [torch.float8_e4m3fn, torch.float8_e5m2], ids=str)
@pytest.mark.parametrize("eps", [0], ids=["eps_0"])
@pytest.mark.parametrize("pow_2_scales", [True], ids=["pow2scales"])
def test_quantization_1D_block_tiling_with_compact_data_and_scales(
Member:

Why don't we need this test anymore?

Collaborator Author:

FP8 block-scaling doesn't require a compact format anymore. Now it's always GEMM-ready.

as when chaining multiple modules it is hard to validate
numerical accuracy.
"""
"""LayerNorm/RMSNorm + Linear + SwiGLU + Linear"""
Member:

Not sure how this change is relevant here to be honest.

Collaborator Author:

Before these changes, we didn't have a test for the SwiGLU kernel with quantized, GEMM-ready output. We test the unquantized SwiGLU kernel here:

def test_layernorm_mlp_accuracy(dtype, bs, model, activation, normalization, return_bias, bias):

We test the quantized SwiGLU kernel with non-GEMM-ready output here:

It's also a good sanity-check for te.Sequential.

Float8BlockScaleTensorFormat::COMPACT);
rowwise_option = rowwise_compact ? FP8BlockwiseRowwiseOption::ROWWISE_COMPACT
: FP8BlockwiseRowwiseOption::ROWWISE_GEMM_READY;
rowwise_option = FP8BlockwiseRowwiseOption::ROWWISE_GEMM_READY;
Member:

Why are you always choosing the GEMM-ready version here?

Member:

Shouldn't it have similar logic to check with_gemm_swizzled_scales?

Collaborator Author (@timmoon10, Jan 14, 2026):

The way FP8 block-scaling was implemented, the only advantage of the compact format over the GEMM-ready format was support for all-gathers. However, all-gather support only required a simple change to use the swap-first-dims kernel. Instead of propagating this PR's changes throughout the FP8 block-scaling logic, I found it simpler to just remove the compact format entirely.


const size_t shmem_size = in_mem + out_mem + TMA_SHMEM_ALIGNMENT;

// Zero out swizzled scales if padding is needed
Member:

Don't we already do this though when we create the scaling factor tensors? This seems like a pessimization, as we will do this now every time instead of once - we should instead note the requirement for the scaling factors to be zeroed out before the quantization in the docs of the quantization call.

Collaborator Author (@timmoon10, Jan 15, 2026):

From what I can see, we just allocate scales with at::empty:

if (rowwise_usage) {
  const std::vector<int64_t> scale_inv_shape_int64(rowwise_scale_inv_shape.begin(),
                                                   rowwise_scale_inv_shape.end());
  rowwise_data_tensor = at::empty(shape_int64, uint8_tensor_opts);
  rowwise_scale_inv_tensor = at::empty(scale_inv_shape_int64, uint8_tensor_opts);
}
if (columnwise_usage) {
  const std::vector<int64_t> scale_inv_shape_int64(columnwise_scale_inv_shape.begin(),
                                                   columnwise_scale_inv_shape.end());
  columnwise_data_tensor = at::empty(shape_int64, uint8_tensor_opts);
  columnwise_scale_inv_tensor = at::empty(scale_inv_shape_int64, uint8_tensor_opts);
}

Requiring zeroing out in this case is unintuitive, especially since it's not needed in the unpadded case. I figure that if the model is small enough for the padding to be relevant, then it's probably too small to see the full perf benefit of MXFP8 anyways.
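For context, here is a rough sketch (not TE code) of the buffer sizes involved. It assumes the usual MXFP8 convention of one scale per 32 elements and a swizzled scale matrix padded to 128-row by 4-column tiles for the GEMM; only the padded fringe is never written by quantization, so that is the region the fused kernel zeroes when the buffer comes from at::empty.

# Rough sketch of swizzled MXFP8 scale-buffer sizes, assuming one scale per 32
# elements and padding of the scale matrix to 128 x 4 tiles for the GEMM.
def round_up(value: int, multiple: int) -> int:
    return ((value + multiple - 1) // multiple) * multiple

def swizzled_scale_shape(rows: int, cols: int, block_size: int = 32):
    return round_up(rows, 128), round_up(cols // block_size, 4)

# e.g. a 1000 x 2048 tensor: the quantization writes 1000 x 64 scale entries
# inside a 1024 x 64 buffer, leaving 24 rows of padding to be zeroed.
print(swizzled_scale_shape(1000, 2048))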


const size_t dshmem_size = in_mem + out_mem + TMA_SHMEM_ALIGNMENT;

// Zero out swizzled scales if padding is needed
Member:

Same comment as for the gated kernel.

sizeof(NVTEBasicTensor), // kNVTERowwiseScaleInv
sizeof(NVTEBasicTensor), // kNVTEColumnwiseScaleInv
sizeof(NVTEBasicTensor), // kNVTEColumnwiseAmax
sizeof(bool) // kNVTEWithGEMMSwizzledScales
Member:

This is implementation-defined, so we should not rely on it for our API - please use a type with a defined size for this.

Collaborator Author:

Replaced bool with uint8_t and int with int32_t.


namespace {

void reset_tensor_data(transformer_engine::TensorWrapper &tensor, bool rowwise, bool columnwise) {
Member:

Why do we need that? We cache the tensors anyway, so there should not be any overhead from just creating a new TensorWrapper.


} // namespace

std::tuple<std::optional<at::Tensor>, std::optional<at::Tensor>> swizzle_scales_for_gemm(
Member:

We could just output a new TensorWrapper here with the new scaling factors, and then we would not need that reset, I think?

return ((value + multiple - 1) / multiple) * multiple;
}

size_t ceildiv(size_t numer, size_t denom) { return (numer + denom - 1) / denom; }
Member:

Don't we have a divup function already?

Collaborator Author:

I prefer if the PyTorch extensions minimize their reliance on internal headers from the core lib (mainly common.h), but I don't have a strong opinion.


@greptile-apps[bot] (Contributor) left a comment:


Additional Comments (1)

  1. transformer_engine/common/gemm/config.h, lines 17-30

    logic: Type mismatch between struct field declarations and attr_sizes array. The struct declares bool fields but attr_sizes specifies sizeof(uint8_t), and sm_count is int but array uses sizeof(int32_t). This will cause incorrect serialization/memory calculations if the attr_sizes array is used for byte layout.

67 files reviewed, 6 comments


Comment on lines +57 to +70
if (is_fp8_dtype(output->dtype()) && output->has_columnwise_data()) {
  // FP8 kernel only populates row-wise data, so perform
  // transpose separately if needed
  Tensor transpose_in, transpose_out, dummy;
  transpose_in.scaling_mode = NVTE_DELAYED_TENSOR_SCALING;
  transpose_in.data.dptr = output->data.dptr;
  transpose_in.data.shape = {output->flat_first_dim(), output->flat_last_dim()};
  transpose_in.data.dtype = output->data.dtype;
  transpose_out.scaling_mode = NVTE_DELAYED_TENSOR_SCALING;
  transpose_out.data.dptr = output->columnwise_data.dptr;
  transpose_out.data.shape = {output->flat_last_dim(), output->flat_first_dim()};
  transpose_out.data.dtype = output->data.dtype;
  detail::transpose(transpose_in, /*noop=*/dummy, &transpose_out, stream);
}
Contributor:

style: This transpose logic is duplicated identically in the backward pass (lines 147-160). Consider extracting it into a helper function to reduce code duplication and improve maintainability.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Collaborator Author (@timmoon10, Jan 15, 2026):

The extra layer of abstraction would hurt maintainability more than just duplicating <20 LoC.

const size_t scales_offset_X_colwise = scales_block_offset_X_colwise + tid_X_colwise;

const bool rowwise_scale_is_within_bounds = scales_offset_X_rowwise < cols;
const bool rowwise_scale_is_within_bounds = SCALE_DIM_X * scales_offset_X_rowwise < cols;
Contributor:

logic: Fixed potential out-of-bounds access by multiplying SCALE_DIM_X * scales_offset_X_rowwise instead of just scales_offset_X_rowwise when checking column bounds.

Collaborator Author:

Changing the kernel implementation is outside of scope for this PR. The tests already pass, so this is probably already correct, albeit with unclear variable names.

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Tim Moon <[email protected]>
@timmoon10 (Collaborator Author):

/te-ci L1

@jberchtold-nvidia (Collaborator) left a comment:

LGTM from a TE/JAX perspective, thanks! I can't see if TE/JAX CI passed since a new pipeline was recently launched, but looks like you're running JAX tests as part of your launch so all good

@timmoon10 (Collaborator Author):

/te-ci pytorch


Labels

2.12.0, enhancement (New feature or request), MoE, performance (Performance issues), refactor


Development

Successfully merging this pull request may close these issues.

Support MXFP8/NVFP4 tensors with pre-swizzled scales

4 participants