[Quantization] Add cutlass kernel for FP8 #43304
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
    _quantization_kernel = get_kernel("RedHatAI/quantization")
except Exception as e:
    logger.warning_once(f"Failed to load CUTLASS quantization kernel: {e}. Falling back to Triton.")
```
Do we want to also log that we're using the RedHat kernel in case it was successfully loaded?
Yes we can do that I think
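A minimal sketch of what that could look like, assuming the loader shown later in this PR (the success-path `logger.info` line is an illustrative addition, not code from the PR):

```python
try:
    from .hub_kernels import get_kernel

    _quantization_kernel = get_kernel("RedHatAI/quantization")
    # Hypothetical addition: report which backend was selected.
    logger.info("Using CUTLASS quantization kernel from RedHatAI/quantization.")
except Exception as e:
    logger.warning_once(f"Failed to load CUTLASS quantization kernel: {e}. Falling back to Triton.")
```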
```python
Check if CUTLASS blockwise FP8 matmul is supported for the given block size.

CUTLASS blockwise kernels require:
- SM90+ (Hopper or newer)
```
This is fine IMO. When hardware is available, users should be able to max it out!
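For context, a minimal sketch of such a capability check (the helper name and exact logic are assumptions, not the PR's implementation):

```python
import torch


def cutlass_blockwise_supported(block_size) -> bool:
    """Return True if CUTLASS blockwise FP8 matmul can run on this device."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    # SM90+ (Hopper or newer); the PR benchmarks block size (128, 128).
    return major >= 9 and tuple(block_size) == (128, 128)
```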
SunMarc left a comment:
Thanks! Just a few nits.
```python
kernel = _get_quantization_kernel()
if kernel is None:
    return False
```
I think we also need to check that `kernels` is installed, no?
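A minimal sketch of such a guard using only the standard library (the helper name is hypothetical):

```python
import importlib.util


def _kernels_available() -> bool:
    """Return True if the `kernels` package is importable."""
    return importlib.util.find_spec("kernels") is not None
```

This could be checked before calling `_get_quantization_kernel()`, so the fallback to Triton happens without even attempting a Hub download.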
```python
# Global for the CUTLASS quantization kernel (lazily loaded)
_quantization_kernel = None


def _get_quantization_kernel():
    """Lazily load the CUTLASS quantization kernel from HuggingFace Hub."""
    global _quantization_kernel
    if _quantization_kernel is None:
        try:
            from .hub_kernels import get_kernel

            _quantization_kernel = get_kernel("RedHatAI/quantization")
        except Exception as e:
            logger.warning_once(f"Failed to load CUTLASS quantization kernel: {e}. Falling back to Triton.")
```
Instead of a single kernel, can we try to create a dict where we store multiple kernels? We can leave this for a follow-up PR, but the idea would be to also move the Triton kernels into `kernels`.
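A minimal sketch of that follow-up idea, reusing the same `get_kernel` loader (the registry and helper names are hypothetical):

```python
# Hypothetical registry: cache Hub kernels by repo id instead of one global.
_kernel_cache = {}


def _get_hub_kernel(repo_id: str):
    """Lazily load and cache a kernel from the HuggingFace Hub."""
    if repo_id not in _kernel_cache:
        from .hub_kernels import get_kernel

        _kernel_cache[repo_id] = get_kernel(repo_id)
    return _kernel_cache[repo_id]


# e.g. _get_hub_kernel("RedHatAI/quantization")
```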
SunMarc left a comment:
Thanks! Let's merge this.
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43304&sha=cfd4b9
What does this PR do?
Adds the CUTLASS kernel for scaled matmul. Performance is much better than Triton for the specific block size (128, 128).
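For reference, a generic CUDA-event timing harness of the kind used for such comparisons (a sketch, not the benchmark code from this PR):

```python
import torch


def time_matmul_ms(fn, a, b, iters=50, warmup=10):
    """Average milliseconds per call for a CUDA matmul backend `fn`."""
    for _ in range(warmup):
        fn(a, b)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(a, b)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```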
All FP8 tests passing!