welcome-openai-gpt-oss.md (8 additions, 16 deletions)
@@ -126,25 +126,17 @@ print(response)
### Using Transformers
- You need to install the latest `transformers` release (v4.55 or later), as well as `accelerate` and `kernels`:
+ You need to install the latest `transformers` release (v4.55.1 or later), as well as `accelerate` and `kernels`. We also recommend installing triton 3.4 or later, as it unblocks support for `mxfp4` quantization on CUDA hardware:
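The install commands themselves fall outside the extracted hunk; a minimal sketch of what they might look like, assuming pip and the package names mentioned above (the version pins are illustrative, not taken from the post):

```shell
# Illustrative sketch only, not the post's exact commands.
pip install -U "transformers>=4.55.1" accelerate kernels
pip install -U "triton>=3.4"  # unblocks mxfp4 quantization on CUDA GPUs
```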
- The model weights are quantized in `mxfp4` format, which is compatible with GPUs of the Hopper or Blackwell families. This includes data-center cards such as H100, H200 or GB200, as well as the latest consumer GPUs in the 50xx family. If you have one of these cards, `mxfp4` will yield the best results in terms of speed and memory consumption. To use it, you need `triton 3.4` and `triton_kernels`. If these libraries are not installed (or you don’t have a compatible GPU), loading the model will fall back to `bfloat16`, unpacked from the quantized weights.
+ The model weights are quantized in `mxfp4` format, which was originally available on GPUs of the Hopper or Blackwell families, but now works on previous CUDA architectures (including Ada, Ampere, and Tesla). Installing triton 3.4, together with the `kernels` library, makes it possible to download optimized `mxfp4` kernels on first use, achieving large memory savings. With these components in place, you can run the 20B model on GPUs with 16 GB of RAM. This includes many consumer cards (3090, 4090, 5080) as well as Colab and Kaggle!
- In our tests, Triton 3.4 works fine with the latest PyTorch version (2.7.x). You may optionally want to install PyTorch 2.8 instead – it’s a pre-release version at the time of writing ([although it should be released soon](https://github.com/pytorch/pytorch/milestone/53)), but it’s the one that’s been prepared alongside triton 3.4, so they are stable together. Here’s how to install PyTorch 2.8 (comes with triton 3.4) and the triton kernels:
+ If the previous libraries are not installed (or you don’t have a compatible GPU), loading the model will fall back to `bfloat16`, unpacked from the quantized weights.
- ```shell
- # Optional step if you want PyTorch 2.8, otherwise just `pip install torch`
- The following snippet shows simple inference with the 20B model. It runs on 16 GB GPUs when using `mxfp4`, or \~48 GB in `bfloat16`.
+ The following snippet shows simple inference with the 20B model. As explained, it runs on 16 GB GPUs when using `mxfp4`, or \~48 GB in `bfloat16`.
```py
from transformers import AutoModelForCausalLM, AutoTokenizer
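# --- Editor's illustrative continuation; the rest of the post's snippet is
# --- outside this hunk. Model id and generation settings are assumptions.
model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # mxfp4 if triton 3.4 + kernels are available, else bfloat16
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain MXFP4 quantization in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:]))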
@@ -210,7 +202,7 @@ This snippet will download the optimized, pre-compiled kernel code from `kernels
#### Other optimizations
- If you have a Hopper GPU or better, we recommend you use `mxfp4` for the reasons explained above. If you can additionally use Flash Attention 3, then by all means do enable it!
+ We recommend you use `mxfp4` if your GPU supports it. If you can additionally use Flash Attention 3, then by all means do enable it!
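As a rough illustration of the Flash Attention 3 route (the `kernels-community/vllm-flash-attn3` name comes from the recommendations table below; the exact arguments are our assumption, not the post's snippet):

```py
# Hedged sketch: load the model with the hub-hosted Flash Attention 3 kernel.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",               # model id assumed
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="kernels-community/vllm-flash-attn3",
)
```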
> [!TIP]
> If your GPU is not compatible with `mxfp4`, then we recommend you use MegaBlocks MoE kernels for a nice speed bump. To do so, you just need to adjust your inference code like this:
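The adjusted code itself is outside this hunk; as a hedged sketch (the `use_kernels` flag is taken from the recommendations table below, everything else is an assumption):

```py
# Hedged sketch, not the post's exact snippet.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",     # model id assumed
    torch_dtype="bfloat16",   # MegaBlocks path applies when mxfp4 is unavailable
    device_map="auto",
    use_kernels=True,         # fetch the MegaBlocks MoE kernels via the kernels library
)
```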
@@ -261,10 +253,10 @@ At the time of writing, this table summarizes our _recommendations_ based on GPU
- |*How to enable*|Install triton 3.4 + triton kernels | Use vllm-flash-attn3 from kernels-community"|`use_kernels`|
+ |*How to enable*| triton 3.4 + kernels library | Use vllm-flash-attn3 from kernels-community |`use_kernels`|
Even though the 120B model fits on a single H100 GPU (using `mxfp4`), you can also run it easily on multiple GPUs using `accelerate` or `torchrun`. Transformers provides a default parallelization plan, and you can leverage optimized attention kernels as well. The following snippet can be run with `torchrun --nproc_per_node=4 generate.py` on a system with 4 GPUs:
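The `generate.py` snippet is outside this hunk; a hedged sketch of what such a script might contain, assuming the default tensor-parallel plan is requested with `tp_plan="auto"` (that argument and the generation settings are our assumptions, not the post's exact code):

```py
# Hedged sketch of a multi-GPU generate.py, launched with:
#   torchrun --nproc_per_node=4 generate.py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-120b"   # model id assumed

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    tp_plan="auto",   # use the default parallelization plan mentioned above
)

messages = [{"role": "user", "content": "Write a haiku about GPUs."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:]))
```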