
Conversation

danielvegamyhre (Contributor) commented Oct 16, 2025

Stacked PRs:

  • [mxfp8 moe training] add triton kernel for mxfp8 dequantization

Summary

  • Traces show the dequantization kernel in the mxfp8 all-to-all (a2a) is slow. This PR adds a Triton kernel for dequantization that is much faster at large "M" (local_batch_size * seq_len), which is the regime that matters for MoE training. A reference sketch of the dequantization math follows below.
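
For context, mxfp8 dequantization is conceptually simple: each contiguous block of 32 fp8 (e4m3) values shares one e8m0 scale, a biased power-of-two exponent, and dequantizing multiplies the block by 2^(exponent - 127). Below is a minimal reference sketch in plain PyTorch, assuming row-major data with one raw uint8 exponent byte per 32-element block along the last dimension; the actual torchao layout, scale dtype, and the Triton kernel added in this PR may differ.

import torch

def mxfp8_dequant_reference(x_data, x_scales, out_dtype=torch.bfloat16, block_size=32):
    # x_data:   (M, K) tensor of torch.float8_e4m3fn quantized values
    # x_scales: (M, K // block_size) tensor of uint8 e8m0 biased exponents (bias 127)
    scales = torch.exp2(x_scales.to(torch.float32) - 127.0)
    # Broadcast each scale over its block of 32 contiguous elements.
    scales = scales.repeat_interleave(block_size, dim=-1)
    return (x_data.to(torch.float32) * scales).to(out_dtype)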

Test plan

  • pytest test/prototype/mx_formats/test_kernels.py -k mxfp8_dequant

Benchmarks

input_shape        torch_us    triton_us    torch_gbps    triton_gbps  triton_speedup
---------------  ----------  -----------  ------------  -------------  ----------------
(1, 8192, 7168)      36.864       39.968       4828.44        4453.46  0.922x
(2, 8192, 7168)     287.712       78.88        1237.32        4513.08  3.647x
(4, 8192, 7168)     560.32       150.56        1270.67        4728.9   3.722x
(8, 8192, 7168)    1110.9        297.984       1281.82        4778.67  3.728x
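
A hedged note on how the GB/s columns are likely derived (the formula below reproduces the table values): total bytes moved by the dequant, counting the fp8 input, one e8m0 scale byte per 32-element block, and the bf16 output, divided by the measured kernel time.

def dequant_gbps(input_shape, time_us, block_size=32):
    # Bytes read: 1 byte per fp8 element plus 1 scale byte per block of 32 elements.
    # Bytes written: 2 bytes per bf16 output element.
    n_elem = 1
    for d in input_shape:
        n_elem *= d
    total_bytes = n_elem + n_elem // block_size + 2 * n_elem
    return total_bytes / (time_us * 1e-6) / 1e9

# e.g. dequant_gbps((2, 8192, 7168), 78.88) ≈ 4513 GB/s, matching the triton_gbps column.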

pytorch-bot bot commented Oct 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3195

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 82ded0b with merge base b644211:

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

danielvegamyhre added a commit that referenced this pull request Oct 16, 2025
stack-info: PR: #3195, branch: danielvegamyhre/stack/78
meta-cla bot added the CLA Signed label on Oct 16, 2025
danielvegamyhre force-pushed the danielvegamyhre/stack/78 branch from 61a6f60 to 8ce6f0d on October 16, 2025 23:39
danielvegamyhre added the mx, topic: not user facing, and moe labels on Oct 17, 2025
danielvegamyhre added a commit that referenced this pull request Oct 17, 2025
stack-info: PR: #3195, branch: danielvegamyhre/stack/78
danielvegamyhre force-pushed the danielvegamyhre/stack/78 branch from 8ce6f0d to 357d20f on October 17, 2025 00:13
danielvegamyhre added a commit that referenced this pull request Oct 17, 2025
stack-info: PR: #3195, branch: danielvegamyhre/stack/78
danielvegamyhre force-pushed the danielvegamyhre/stack/78 branch from 357d20f to b0e5061 on October 17, 2025 00:33
torch.bfloat16,
)
hp_t = triton_mxfp8_dequant_dim0(x_data, x_scales, torch.bfloat16, block_size)
torch.testing.assert_close(hp_t, hp_ref, rtol=0, atol=0)
A reviewer (Contributor) commented on the test excerpt above:

lgtm, didn't look at the rest too closely

danielvegamyhre added a commit that referenced this pull request Oct 17, 2025
stack-info: PR: #3195, branch: danielvegamyhre/stack/78
danielvegamyhre force-pushed the danielvegamyhre/stack/78 branch from b0e5061 to ba81844 on October 17, 2025 00:34
stack-info: PR: #3195, branch: danielvegamyhre/stack/78
danielvegamyhre force-pushed the danielvegamyhre/stack/78 branch from ba81844 to 82ded0b on October 17, 2025 00:39
drisspg (Contributor) commented Oct 17, 2025

Looks like there is some ptx for going to bf16

__CUDA_HOSTDEVICE_FP8_DECL__
__nv_bfloat16_raw __nv_cvt_e8m0_to_bf16raw(const __nv_fp8_storage_t x)
{
    __nv_bfloat16_raw res;

#if (__CUDA_FP8_INTERNAL_CAN_RELY_ON_PTX_FOR_SHORTTYPESCVT__)
    unsigned short in = (unsigned short)x;
    unsigned hr = 0U;
    asm("{cvt.rn.bf16x2.ue8m0x2 %0, %1;}\n"
                : "=r"(hr)
                : "h"(in));

    res.x = (unsigned short)hr;
#else
    res.x = __internal_e8m0_to_bf16(x);
#endif

    return res;
}
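
A hedged aside on what this conversion does: e8m0 is a bare biased exponent with the same bias (127) as bf16, so for in-range values the bf16 result is just that exponent placed in bf16's exponent field, with sign and mantissa zero. A simplified sketch of the bit trick in PyTorch, assuming raw uint8 scale bytes and ignoring the two edge cases the intrinsic handles (exponent 0, which needs the bf16 subnormal 2^-127, and 0xFF, which encodes NaN):

import torch

def e8m0_to_bf16_simplified(scale_bytes):
    # Place the 8-bit biased exponent into bf16's exponent field (bits [14:7]);
    # sign and mantissa stay zero, giving exactly 2^(exponent - 127).
    return (scale_bytes.to(torch.int16) << 7).view(torch.bfloat16)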

danielvegamyhre (Contributor, Author) commented Oct 17, 2025

> Looks like there is some ptx for going to bf16

Sweet, where is this from? I actually tried looking in TE for PTX examples for this, but all I could find was casting fp32 -> e8m0 (for computing the scale): https://github.com/NVIDIA/TransformerEngine/blob/dd9433e7ad28c12f27da9770be54c9c584e85fa0/transformer_engine/common/util/ptx.cuh#L134

Will try it out later
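
A hedged aside on that fp32 -> e8m0 direction (used when computing scales): one simple way to cast is truncation toward zero, which for a positive, normal fp32 value amounts to extracting its 8 biased exponent bits. A minimal sketch, ignoring zero, subnormals, saturation, and whatever rounding mode the linked TE PTX actually uses:

import torch

def fp32_to_e8m0_truncate(x):
    # For positive normal fp32 inputs, the largest power of two <= x is 2^(exp - 127),
    # so the e8m0 encoding is just the biased exponent field of the fp32 bit pattern.
    return ((x.view(torch.int32) >> 23) & 0xFF).to(torch.uint8)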


Labels

  • CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.)
  • moe
  • mx
  • topic: not user facing (Use this tag if you don't want this PR to show up in release notes)

3 participants