A Call To Action!
The Hugging Face community needs you! 🫵
The GPT-OSS model was recently added using mxfp4 weights. These 4-bit weights are tiny and highly performant on H100/B100/50xx GPUs, but we only have kernels for forward passes. This means that users can't train GPT-OSS unless they convert the weights to bfloat16. This uses 4X more memory and reduces speed enormously!

We want native mxfp4 training, but that means we need backward kernels too. Ideally, these kernels should also support GPUs that don't have FP4 hardware, so users can still benefit from reduced memory usage during training, even if the computation has to be done in FP8 or bfloat16.
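To make the memory gap concrete, here's a back-of-the-envelope comparison. The parameter count is arbitrary and purely illustrative, and it ignores the small per-block scales that mxfp4 also stores:

```python
# Rough memory comparison: mxfp4 stores ~4 bits per weight, bfloat16 stores 16 bits.
# The parameter count below is illustrative, not the exact GPT-OSS MoE size.
num_moe_params = 100e9

mxfp4_gib = num_moe_params * 0.5 / 2**30   # 4 bits  = 0.5 bytes per weight
bf16_gib  = num_moe_params * 2.0 / 2**30   # 16 bits = 2.0 bytes per weight

print(f"mxfp4:    ~{mxfp4_gib:.0f} GiB")
print(f"bfloat16: ~{bf16_gib:.0f} GiB ({bf16_gib / mxfp4_gib:.0f}x larger)")
```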
Custom Kernels: An extremely short introduction
transformers uses the Kernels library to load custom kernels. The critical kernel is the MoE kernel, because for GPT-OSS the attention weights are stored in bfloat16 and only the MoE weights use mxfp4. This kernel lives on the Hub, in the triton-kernels repo. The forward kernel is in the matmul_ogs file, but the backward kernel should probably go in its own file.
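For reference, loading a kernel from the Hub looks roughly like this. This is a minimal sketch; the repo id below is an assumption for illustration, so check the actual triton-kernels repo on the Hub for the real name and layout:

```python
# Minimal sketch of loading a Hub-hosted kernel with the Kernels library.
# The repo id below is assumed for illustration only.
from kernels import get_kernel

triton_kernels = get_kernel("kernels-community/triton_kernels")  # assumed repo id

# Inspect what the loaded kernel module exposes (e.g. the matmul_ogs forward op).
print(dir(triton_kernels))
```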
This won't be easy - the kernels are written in Triton, so this is not a good issue for beginners! It will probably be overwhelming if you don't already have some experience with raw CUDA or Triton programming.
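To show what a "backward kernel" means in practice, here's a minimal, self-contained sketch of the usual pattern: a Triton forward kernel wrapped in a torch.autograd.Function, with an explicit backward kernel registered alongside it. The kernel below is a trivial elementwise stand-in, not the MoE matmul, but a real mxfp4 backward would be wired into autograd the same way:

```python
# Autograd can't differentiate through a raw Triton launch, so the backward pass
# has to be written and registered explicitly. Trivial elementwise example:
import torch
import triton
import triton.language as tl


@triton.jit
def scale_fwd_kernel(x_ptr, out_ptr, scale, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * scale, mask=mask)


@triton.jit
def scale_bwd_kernel(grad_out_ptr, grad_in_ptr, scale, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    g = tl.load(grad_out_ptr + offs, mask=mask)
    # d(x * scale)/dx = scale, so grad_in = grad_out * scale.
    tl.store(grad_in_ptr + offs, g * scale, mask=mask)


class Scale(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        out = torch.empty_like(x)
        n = x.numel()
        scale_fwd_kernel[(triton.cdiv(n, 1024),)](x, out, scale, n, BLOCK=1024)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        grad_out = grad_out.contiguous()
        grad_in = torch.empty_like(grad_out)
        n = grad_out.numel()
        scale_bwd_kernel[(triton.cdiv(n, 1024),)](grad_out, grad_in, ctx.scale, n, BLOCK=1024)
        return grad_in, None  # no gradient for the scale argument


x = torch.randn(4096, device="cuda", requires_grad=True)
y = Scale.apply(x, 2.0)
y.sum().backward()
print(torch.allclose(x.grad, torch.full_like(x, 2.0)))  # True
```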
Once the hard work of writing the kernel is done, we can help with integrating it into Transformers (there's a work-in-progress blogpost about this). We're happy to support serious attempts, but please don't just throw a code agent at the problem. You can use a code agent to help with writing and testing the kernel, but only if you're competent enough to evaluate and bugfix its outputs!