
Conversation

@MekkCyber (Contributor) commented Jan 15, 2026

What does this PR do?

Adds the CUTLASS kernel for scaled matmul. Performance is much better than Triton for the specific block size (128, 128):

================================================================================
COMPARISON: CUTLASS vs Triton Speedup
================================================================================

     M x      N x      K |  Triton (ms) | CUTLASS (ms) |    Speedup
----------------------------------------------------------------------
     1 x   4096 x   4096 |       0.0571 |       0.0317 |       1.80x
     4 x   4096 x   4096 |       0.0571 |       0.0516 |       1.11x
    16 x   4096 x   4096 |       0.0563 |       0.0511 |       1.10x
    32 x   4096 x   4096 |       0.0573 |       0.0509 |       1.13x
    64 x   4096 x   4096 |       0.0581 |       0.0509 |       1.14x
   128 x   4096 x   4096 |       0.0807 |       0.0509 |       1.59x
   256 x   4096 x   4096 |       0.0798 |       0.0511 |       1.56x
   512 x   4096 x   4096 |       0.0826 |       0.0509 |       1.62x
  1024 x   4096 x   4096 |       0.1069 |       0.0510 |       2.10x
  2048 x   4096 x   4096 |       0.2187 |       0.0660 |       3.31x
  4096 x   4096 x   4096 |       0.4310 |       0.1256 |       3.43x

All FP8 tests passing!
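
The benchmark script itself isn't included in the PR; a minimal sketch of how such per-shape timings are typically collected with CUDA events (the triton_matmul/cutlass_matmul callables below are placeholders, not names from this PR):

import torch

def time_kernel(fn, warmup=10, iters=100):
    """Time a CUDA callable with events; returns milliseconds per call."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Hypothetical usage, comparing the two backends on one (M, N, K) shape:
# triton_ms = time_kernel(lambda: triton_matmul(a, b, a_scales, b_scales))
# cutlass_ms = time_kernel(lambda: cutlass_matmul(a, b, a_scales, b_scales))
# print(f"{triton_ms / cutlass_ms:.2f}x")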

@MekkCyber MekkCyber requested a review from SunMarc January 15, 2026 12:37
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


        _quantization_kernel = get_kernel("RedHatAI/quantization")
    except Exception as e:
        logger.warning_once(f"Failed to load CUTLASS quantization kernel: {e}. Falling back to Triton.")
Member

Do we want to also log that we're using the Redhat kernel in case it was successfully loaded?

MekkCyber (Contributor, Author)

Yes, we can do that I think.
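
A minimal sketch of what that could look like, reusing the try/except from the excerpt above (the info log line is an assumed addition, not the merged code):

    try:
        from .hub_kernels import get_kernel

        _quantization_kernel = get_kernel("RedHatAI/quantization")
        # Assumed addition: report once which backend was picked up,
        # mirroring the existing warning on the failure path.
        logger.info("Loaded RedHatAI/quantization CUTLASS kernel for scaled matmul.")
    except Exception as e:
        logger.warning_once(f"Failed to load CUTLASS quantization kernel: {e}. Falling back to Triton.")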

Check if CUTLASS blockwise FP8 matmul is supported for the given block size.
CUTLASS blockwise kernels require:
- SM90+ (Hopper or newer)
Member

This is fine IMO. When hardware is available, users should be able to max them out!
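
The helper under review isn't quoted in full; a sketch of the kind of check the docstring describes, assuming the requirements are SM90+ plus the (128, 128) block size benchmarked above (name and signature are illustrative):

import torch

def is_blockwise_fp8_matmul_supported(block_size) -> bool:
    """Check if CUTLASS blockwise FP8 matmul is supported for the given block size."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    # CUTLASS blockwise kernels require SM90+ (Hopper or newer).
    if major < 9:
        return False
    # The CUTLASS path currently targets the (128, 128) block size.
    return tuple(block_size) == (128, 128)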

@SunMarc SunMarc left a comment


Thanks! Just a few nits.

Comment on lines +74 to +76
kernel = _get_quantization_kernel()
if kernel is None:
    return False
Member

I think we also need to check that kernels is installed, no?
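
One way to guard that, sketched with importlib (whether the merged code used this or an existing is_kernels_available() helper isn't shown in the thread):

import importlib.util

def _is_kernels_available() -> bool:
    """Return True if the kernels package is importable."""
    return importlib.util.find_spec("kernels") is not None

# Hypothetical guard before attempting the Hub load:
# if not _is_kernels_available():
#     return False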

Comment on lines +30 to +43
# Global for the CUTLASS quantization kernel (lazily loaded)
_quantization_kernel = None


def _get_quantization_kernel():
    """Lazily load the CUTLASS quantization kernel from HuggingFace Hub."""
    global _quantization_kernel
    if _quantization_kernel is None:
        try:
            from .hub_kernels import get_kernel

            _quantization_kernel = get_kernel("RedHatAI/quantization")
        except Exception as e:
            logger.warning_once(f"Failed to load CUTLASS quantization kernel: {e}. Falling back to Triton.")
Member

Instead of a single kernel, can we try to create a dict where we store multiple kernels? We can leave this for a follow-up PR, but the idea would be to also move the Triton kernels into the kernels package.
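
A rough sketch of that follow-up idea, with illustrative names and assuming the module-level logger from the snippet above (this is not code from the PR; the Triton repo id is left as a placeholder):

# Hypothetical registry mapping a backend name to its Hub repo id.
_KERNEL_REPOS = {
    "cutlass": "RedHatAI/quantization",
    # "triton": to be added once the Triton kernels move to the Hub
}
_loaded_kernels = {}


def _get_kernel(name: str):
    """Lazily load and cache a kernel by name; returns None if loading fails."""
    if name not in _loaded_kernels:
        try:
            from .hub_kernels import get_kernel

            _loaded_kernels[name] = get_kernel(_KERNEL_REPOS[name])
        except Exception as e:
            logger.warning_once(f"Failed to load {name} kernel: {e}.")
            _loaded_kernels[name] = None
    return _loaded_kernels[name]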

@SunMarc SunMarc left a comment


Thanks! Let's merge this.

@github-actions (Contributor)

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43304&sha=cfd4b9
