# How to run gpt-oss with Hugging Face Transformers
The Transformers library by Hugging Face provides a flexible way to load and run large language models locally or on a server. This guide will walk you through running [OpenAI gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) or [OpenAI gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) using Transformers, either with a high-level pipeline or via low-level `generate` calls with raw token IDs.
We'll cover the use of [OpenAI gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) or [OpenAI gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) with the high-level pipeline abstraction, low-level `generate` calls, and serving models locally with `transformers serve`, in a way compatible with the Responses API.
In this guide we’ll run through various optimised ways to run the **gpt-oss** models via Transformers.
Bonus: You can also fine-tune models via Transformers; [check out our fine-tuning guide here](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transformers).
## Pick your model
Both **gpt-oss** models are available on Hugging Face:
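- [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)
- [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b)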
Both are **MXFP4 quantized** by default. Please note that MXFP4 is supported on Hopper or later architectures. This includes data center GPUs such as H100 or GB200, as well as the latest RTX 50xx family of consumer cards.
If you use `bfloat16` instead of MXFP4, memory consumption will be larger (~48 GB for the 20b model).
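If you do want `bfloat16` weights, one way to request them at load time is shown below (a minimal sketch; the loading arguments are standard Transformers usage rather than anything this guide prescribes):

```python
import torch
from transformers import AutoModelForCausalLM

# Load gpt-oss-20b with bfloat16 weights instead of the default MXFP4 path.
# Expect roughly 48 GB of memory for the 20b model, as noted above.
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```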
## Quick setup
1. **Install dependencies**
It’s recommended to create a fresh Python environment. Install transformers, accelerate, as well as the Triton kernels for MXFP4 compatibility:
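For example (a sketch of the install step; `torch` is assumed as the backend, and exact version pins may differ):

```bash
pip install -U transformers accelerate torch triton kernels
```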
Additional use cases, like integrating `transformers serve` with Cursor and other tools, are detailed in [the documentation](https://huggingface.co/docs/transformers/main/serving).
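For a quick taste of that workflow, here is a minimal sketch that queries a server started with `transformers serve` through the OpenAI Python client; the `localhost:8000` base URL, the placeholder API key, and the prompt are assumptions about a default local setup:

```python
# Prerequisite, in another terminal: transformers serve
from openai import OpenAI

# Point the client at the local server (the port is an assumed default).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.responses.create(
    model="openai/gpt-oss-20b",
    input="Explain MXFP4 quantization in one or two sentences.",
)
print(response.output_text)
```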
## Quick inference with pipeline
The easiest way to run the gpt-oss models is with the Transformers high-level `pipeline` API:
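A minimal sketch (the prompt and generation settings here are illustrative, not prescribed by this guide):

```python
from transformers import pipeline

# Build a text-generation pipeline. device_map="auto" spreads the model
# across available accelerators; torch_dtype="auto" keeps the checkpoint's
# native precision.
generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain what MXFP4 quantization is."},
]

result = generator(messages, max_new_tokens=200)
# For chat-style inputs the pipeline returns the whole conversation;
# the assistant's reply is the last message.
print(result[0]["generated_text"][-1]["content"])
```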
OpenAI gpt-oss models use the [harmony response format](https://cookbook.openai.com/article/harmony) for structuring messages, including reasoning and tool calls.
To construct prompts you can use the built-in chat template of Transformers. Alternatively, you can install and use the [openai-harmony library](https://github.com/openai/harmony) for more control.
To use the chat template:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "Always respond in riddles"},
    {"role": "user", "content": "What is the weather like in Madrid?"},
]

# Render the chat with the model's built-in template and generate a reply.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
```
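For more control, here is a sketch of building the same prompt with the [openai-harmony library](https://github.com/openai/harmony) (assuming `pip install openai-harmony`; the rendered token IDs can then be passed directly to `model.generate` as the prefill):

```python
from openai_harmony import (
    Conversation,
    DeveloperContent,
    HarmonyEncodingName,
    Message,
    Role,
    SystemContent,
    load_harmony_encoding,
)

# Load the harmony encoding used by the gpt-oss models.
encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

convo = Conversation.from_messages([
    Message.from_role_and_content(Role.SYSTEM, SystemContent.new()),
    Message.from_role_and_content(
        Role.DEVELOPER,
        DeveloperContent.new().with_instructions("Always respond in riddles"),
    ),
    Message.from_role_and_content(Role.USER, "What is the weather like in Madrid?"),
])

# Token IDs for the rendered conversation, ready to prefill generation.
prefill_ids = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)
```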