Commit 0bc0595

[gpt-oss] mxfp4 easier installation, better support (#3025)
1 parent bd6e225 commit 0bc0595

1 file changed

welcome-openai-gpt-oss.md

Lines changed: 8 additions & 16 deletions

@@ -126,25 +126,17 @@ print(response)

### Using Transformers

-You need to install the latest `transformers` release (v4.55 or later), as well as `accelerate` and `kernels`:
+You need to install the latest `transformers` release (v4.55.1 or later), as well as `accelerate` and `kernels`. We also recommend installing triton 3.4 or later, as it enables `mxfp4` quantization on CUDA hardware:

```shell
-pip install --upgrade accelerate transformers kernels
+pip install --upgrade transformers kernels accelerate "triton>=3.4"
```

-The model weights are quantized in `mxfp4` format, which is compatible with GPUs of the Hopper or Blackwell families. This includes data-center cards such as H100, H200 or GB200, as well as the latest consumer GPUs in the 50xx family. If you have one of these cards, `mxfp4` will yield the best results in terms of speed and memory consumption. To use it, you need `triton 3.4` and `triton_kernels`. If these libraries are not installed (or you don’t have a compatible GPU), loading the model will fall back to `bfloat16`, unpacked from the quantized weights.
+The model weights are quantized in `mxfp4` format, which was originally only supported on GPUs of the Hopper or Blackwell families but now also works on earlier CUDA architectures (including Ada, Ampere, and Tesla). Installing triton 3.4 together with the `kernels` library lets Transformers download optimized `mxfp4` kernels on first use, yielding large memory savings. With these components in place, you can run the 20B model on GPUs with 16 GB of RAM. This includes many consumer cards (3090, 4090, 5080) as well as Colab and Kaggle!

-In our tests, Triton 3.4 works fine with the latest PyTorch version (2.7.x). You may optionally want to install PyTorch 2.8 instead – it’s a pre-release version at the time of writing ([although it should be released soon](https://github.com/pytorch/pytorch/milestone/53)), but it’s the one that’s been prepared alongside triton 3.4, so they are stable together. Here’s how to install PyTorch 2.8 (comes with triton 3.4) and the triton kernels:
+If these libraries are not installed (or you don’t have a compatible GPU), loading the model will fall back to `bfloat16`, unpacked from the quantized weights.

-```shell
-# Optional step if you want PyTorch 2.8, otherwise just `pip install torch`
-pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/test/cu128
-
-# Install triton kernels for mxfp4 support
-pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
-```
-
-The following snippet shows simple inference with the 20B model. It runs on 16 GB GPUs when using `mxfp4`, or \~48 GB in `bfloat16`.
+The following snippet shows simple inference with the 20B model. As explained, it runs on 16 GB GPUs when using `mxfp4`, or \~48 GB in `bfloat16`.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer
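
# --------------------------------------------------------------------------
# The diff hunk is truncated here. The rest of this snippet is an editorial
# sketch (not part of the commit), assuming the `openai/gpt-oss-20b` checkpoint
# and the standard Transformers chat-template API.
# --------------------------------------------------------------------------

model_id = "openai/gpt-oss-20b"  # assumed model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # mxfp4 when triton + kernels are available, bfloat16 fallback otherwise (see above)
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain MXFP4 quantization in one paragraph."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:]))
```
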

@@ -210,7 +202,7 @@ This snippet will download the optimized, pre-compiled kernel code from `kernels`

#### Other optimizations

-If you have a Hopper GPU or better, we recommend you use `mxfp4` for the reasons explained above. If you can additionally use Flash Attention 3, then by all means do enable it!
+We recommend you use `mxfp4` if your GPU supports it. If you can additionally use Flash Attention 3, then by all means do enable it!

> [!TIP]
> If your GPU is not compatible with `mxfp4`, then we recommend you use MegaBlocks MoE kernels for a nice speed bump. To do so, you just need to adjust your inference code like this:
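
The code adjustment itself is truncated by the hunk above. Going by the `use_kernels` switch named in the recommendations table below, a minimal sketch (not part of the commit; the model id is assumed) could look like this:

```py
from transformers import AutoModelForCausalLM

# Hedged sketch: enable MegaBlocks MoE kernels on GPUs without mxfp4 support.
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",   # assumed model id
    torch_dtype="auto",     # unpacks to bfloat16 when mxfp4 is unavailable
    device_map="auto",
    use_kernels=True,       # download MegaBlocks MoE kernels via the `kernels` library
)
```
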

@@ -261,10 +253,10 @@ At the time of writing, this table summarizes our _recommendations_ based on GPU

| | mxfp4 | Flash Attention 3 (w/ sink attention) | MegaBlocks MoE kernels |
| :---- | :---- | :---- | :---- |
| Hopper GPUs (H100, H200) ||||
-| Blackwell GPUs (GB200, 50xx, RTX Pro 6000\) ||||
+| CUDA GPUs with 16+ GB of RAM ||||
| Other CUDA GPUs ||||
| AMD Instinct (MI3XX) ||||
-| *How to enable* | Install triton 3.4 + triton kernels | Use vllm-flash-attn3 from kernels-community" | `use_kernels` |
+| *How to enable* | triton 3.4 + kernels library | Use vllm-flash-attn3 from kernels-community | `use_kernels` |

Even though the 120B model fits on a single H100 GPU (using `mxfp4`), you can also run it easily on multiple GPUs using `accelerate` or `torchrun`. Transformers provides a default parallelization plan, and you can leverage optimized attention kernels as well. The following snippet can be run with `torchrun --nproc_per_node=4 generate.py` on a system with 4 GPUs:

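The `generate.py` snippet is likewise cut off by the hunk. A minimal sketch of what it might contain follows (not part of the commit); the model id, the `tp_plan="auto"` argument, and the commented-out attention-kernel string are assumptions based on the surrounding text and the table above:

```py
# generate.py -- hedged sketch of multi-GPU inference.
# Launch with: torchrun --nproc_per_node=4 generate.py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-120b"  # assumed model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    tp_plan="auto",  # Transformers' default tensor-parallel plan across the torchrun processes
    # attn_implementation="kernels-community/vllm-flash-attn3",  # optional attention kernel named in the table
)

messages = [{"role": "user", "content": "Write a haiku about tensor parallelism."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:]))  # every rank prints; guard on the rank index in practice
```
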