A Call To Action!
The Hugging Face community needs you! 🫵
The GPT-OSS model was recently added using mxfp4 weights. These 4-bit weights are tiny and highly performant on H100/B100/50xx GPUs, but we only have kernels for forward passes. This means that users can't train GPT-OSS unless they convert the weights to bfloat16. This uses 4X more memory and reduces speed enormously!

We want native mxfp4 training, but that means we need backward kernels too. Ideally, these kernels should also support GPUs that don't have FP4 hardware, so users can still benefit from reduced memory usage during training, even if the computation has to be done in FP8 or bfloat16.
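To make the memory gap concrete, here's a back-of-the-envelope comparison. The parameter count is arbitrary and purely illustrative, and it ignores the small per-block scales that mxfp4 also stores:

```python
# Rough memory comparison: mxfp4 stores ~4 bits per weight, bfloat16 stores 16 bits.
# The parameter count below is illustrative, not the exact GPT-OSS MoE size.
num_moe_params = 100e9

mxfp4_gib = num_moe_params * 0.5 / 2**30   # 4 bits  = 0.5 bytes per weight
bf16_gib  = num_moe_params * 2.0 / 2**30   # 16 bits = 2.0 bytes per weight

print(f"mxfp4:    ~{mxfp4_gib:.0f} GiB")
print(f"bfloat16: ~{bf16_gib:.0f} GiB ({bf16_gib / mxfp4_gib:.0f}x larger)")
```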
Custom Kernels: An extremely short introduction
transformers uses the Kernels library to load custom kernels. The critical kernel is the MoE kernel, because for GPT-OSS the attention weights are stored in bfloat16 and only the MoE weights use mxfp4. This kernel lives on the Hub, in the triton-kernels repo. The forward kernel is in the matmul_ogs file, but the backward kernel should probably go in its own file.
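For reference, loading a kernel from the Hub looks roughly like this. This is a minimal sketch; the repo id below is an assumption for illustration, so check the actual triton-kernels repo on the Hub for the real name and layout:

```python
# Minimal sketch of loading a Hub-hosted kernel with the Kernels library.
# The repo id below is assumed for illustration only.
from kernels import get_kernel

triton_kernels = get_kernel("kernels-community/triton_kernels")  # assumed repo id

# Inspect what the loaded kernel module exposes (e.g. the matmul_ogs forward op).
print(dir(triton_kernels))
```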
This won't be easy - the kernels are written in Triton, so this is not a good issue for beginners! It will probably be overwhelming if you don't already have some experience with raw CUDA or Triton programming.
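To show what a "backward kernel" means in practice, here's a minimal, self-contained sketch of the usual pattern: a Triton forward kernel wrapped in a torch.autograd.Function, with an explicit backward kernel registered alongside it. The kernel below is a trivial elementwise stand-in, not the MoE matmul, but a real mxfp4 backward would be wired into autograd the same way:

```python
# Autograd can't differentiate through a raw Triton launch, so the backward pass
# has to be written and registered explicitly. Trivial elementwise example:
import torch
import triton
import triton.language as tl


@triton.jit
def scale_fwd_kernel(x_ptr, out_ptr, scale, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * scale, mask=mask)


@triton.jit
def scale_bwd_kernel(grad_out_ptr, grad_in_ptr, scale, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    g = tl.load(grad_out_ptr + offs, mask=mask)
    # d(x * scale)/dx = scale, so grad_in = grad_out * scale.
    tl.store(grad_in_ptr + offs, g * scale, mask=mask)


class Scale(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        out = torch.empty_like(x)
        n = x.numel()
        scale_fwd_kernel[(triton.cdiv(n, 1024),)](x, out, scale, n, BLOCK=1024)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        grad_out = grad_out.contiguous()
        grad_in = torch.empty_like(grad_out)
        n = grad_out.numel()
        scale_bwd_kernel[(triton.cdiv(n, 1024),)](grad_out, grad_in, ctx.scale, n, BLOCK=1024)
        return grad_in, None  # no gradient for the scale argument


x = torch.randn(4096, device="cuda", requires_grad=True)
y = Scale.apply(x, 2.0)
y.sum().backward()
print(torch.allclose(x.grad, torch.full_like(x, 2.0)))  # True
```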
Once the hard work of writing the kernel is done, we can help with integrating it into Transformers (there's a work-in-progress blogpost about this). We're happy to support serious attempts, but please don't just throw a code agent at the problem. You can use a code agent to help with writing and testing the kernel, but only if you're competent enough to evaluate and bugfix its outputs!