Commit 8a230e5 (parent 3e4bf74): update transformers guide

1 file changed: articles/gpt-oss/run-transformers.md (64 additions, 56 deletions)

# How to run gpt-oss with Hugging Face Transformers

The Transformers library by Hugging Face provides a flexible way to load and run large language models locally or on a server. This guide will walk you through running [OpenAI gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) or [OpenAI gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) using Transformers, either with a high-level pipeline or via low-level `generate` calls with raw token IDs.

We'll cover the use of these models with the high-level pipeline abstraction, low-level `generate` calls, and serving them locally with `transformers serve` in a way that is compatible with the Responses API.

In this guide we'll run through various optimised ways to run the **gpt-oss models via Transformers**.

Bonus: You can also fine-tune models via transformers, [check out our fine-tuning guide here](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transformers).

## Pick your model

Both **gpt-oss** models are available on Hugging Face:

- **`openai/gpt-oss-20b`**
  - ~16GB VRAM requirement when using MXFP4
  - Great for single high-end consumer GPUs
- **`openai/gpt-oss-120b`**
  - Requires ≥60GB VRAM or a multi-GPU setup
  - Ideal for H100-class hardware

Both are **MXFP4 quantized** by default. Note that MXFP4 is supported on Hopper or later architectures: this includes data center GPUs such as the H100 or GB200, as well as the latest RTX 50xx family of consumer cards.

If you use `bfloat16` instead of MXFP4, memory consumption will be larger (~48 GB for the 20b model).
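
If you are unsure which path your hardware will take, a quick capability check can help. The snippet below is an illustrative sketch using PyTorch's device capability API, not part of the official setup (Hopper reports compute capability 9.x, and newer architectures report higher values):

```py
import torch

if torch.cuda.is_available():
    # (major, minor) compute capability of the current GPU
    major, _ = torch.cuda.get_device_capability()
    if major >= 9:
        print("Hopper or newer: MXFP4 is supported, expect ~16 GB VRAM for gpt-oss-20b.")
    else:
        print("Older architecture: weights run in bfloat16, expect ~48 GB for gpt-oss-20b.")
else:
    print("No CUDA GPU detected.")
```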
## Quick setup

1. **Install dependencies**
   It's recommended to create a fresh Python environment. Install transformers, accelerate, as well as the Triton kernels for MXFP4 compatibility:

```bash
pip install -U transformers accelerate torch triton kernels
pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
```

2. **(Optional) Enable multi-GPU**
   If you're running large models, use Accelerate or torchrun to handle device mapping automatically.

## Create an OpenAI Responses / Chat Completions endpoint

To launch a server, simply use the `transformers serve` CLI command:

```bash
transformers serve
```

The simplest way to interact with the server is through the `transformers chat` CLI:

```bash
transformers chat localhost:8000 --model-name-or-path openai/gpt-oss-20b
```

or by sending an HTTP request with cURL, e.g.

```bash
curl -X POST http://localhost:8000/v1/responses -H "Content-Type: application/json" -d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 0.9, "max_tokens": 1000, "stream": true, "model": "openai/gpt-oss-20b"}'
```

Additional use cases, like integrating `transformers serve` with Cursor and other tools, are detailed in [the documentation](https://huggingface.co/docs/transformers/main/serving).
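
Because the server exposes OpenAI-compatible endpoints, you can also point the official `openai` Python SDK at it. The snippet below is a minimal sketch; the base URL, placeholder API key, and request parameters are assumptions for a default local setup:

```py
from openai import OpenAI

# Point the client at the local `transformers serve` instance.
# The API key is a placeholder; a locally served model does not check it.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "hello"}],
    max_tokens=200,
)

print(response.choices[0].message.content)
```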

## Quick inference with pipeline

The easiest way to run the gpt-oss models is with the Transformers high-level `pipeline` API:

```py
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain what MXFP4 quantization is."},
]

result = generator(messages, max_new_tokens=200)

print(result[0]["generated_text"])
```

## Advanced inference with `.generate()`

If you want more control, you can load the model and tokenizer manually and invoke the `.generate()` method:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer
98100
device_map="auto"
99101
)
100102

101-
inputs = tokenizer(
102-
"Explain what MXFP4 quantization is.",
103-
return_tensors="pt"
103+
messages = [
104+
{"role": "user", "content": "Explain what MXFP4 quantization is."},
105+
]
106+
107+
inputs = tokenizer.apply_chat_template(
108+
messages,
109+
add_generation_prompt=True,
110+
return_tensors="pt",
111+
return_dict=True,
104112
).to(model.device)
105113

106114
outputs = model.generate(
@@ -109,14 +117,14 @@ outputs = model.generate(
109117
temperature=0.7
110118
)
111119

112-
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
120+
print(tokenizer.decode(outputs[0]))
113121
```
114122

115123
## Chat template and tool calling
116124

117-
gpt-oss models use the [harmony response format](https://cookbook.openai.com/article/harmony) for structuring messages, incl. reasoning and tool calls.
125+
OpenAI gpt-oss models use the [harmony response format](https://cookbook.openai.com/article/harmony) for structuring messages, including reasoning and tool calls.
118126

119-
To construct prompts you can use the built-in chat template of Transformers or alternatively for more control you can use the [openai-harmony library](https://github.com/openai/harmony).
127+
To construct prompts you can use the built-in chat template of Transformers. Alternatively, you can install and use the [openai-harmony library](https://github.com/openai/harmony) for more control.
120128

121129
To use the chat template:
122130

@@ -133,13 +141,13 @@ model = AutoModelForCausalLM.from_pretrained(
133141
)
134142

135143
messages = [
136-
{"role": "user", "content": "Who are you?"},
144+
{"role": "system", "content": "Always respond in riddles"},
145+
{"role": "user", "content": "What is the weather like in Madrid?"},
137146
]
138147

139148
inputs = tokenizer.apply_chat_template(
140149
messages,
141150
add_generation_prompt=True,
142-
tokenize=True,
143151
return_tensors="pt",
144152
return_dict=True,
145153
).to(model.device)
@@ -148,13 +156,13 @@ generated = model.generate(**inputs, max_new_tokens=100)
148156
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1] :]))
149157
```
150158

151-
To integrate the [`openai-harmony`](https://github.com/openai/harmony) library to prepare prompts and parse responses first install the library:
159+
To integrate the [`openai-harmony`](https://github.com/openai/harmony) library to prepare prompts and parse responses, first install it like this:
152160

153161
```bash
154162
pip install openai-harmony
155163
```
156164

157-
Here’s an example of how to then use the library to construct your prompts and encode them to tokens:
165+
Here’s an example of how to use the library to build your prompts and encode them to tokens:
158166

159167
```py
160168
import json
@@ -205,59 +213,59 @@ for message in entries:
205213
print(json.dumps(message.to_dict(), indent=2))
206214
```
207215

216+
Note that the `Developer` role in Harmony maps to the `system` prompt in the chat template.
217+
208218
## Multi-GPU & distributed inference
209219

210-
For large models like gpt-oss-120b, you can:
220+
The large gpt-oss-120b fits on a single H100 GPU when using MXFP4. If you want to run it on multiple GPUs, you can:
211221

212222
- Use `tp_plan="auto"` for automatic placement and tensor parallelism
213223
- Launch with `accelerate launch or torchrun` for distributed setups
214-
- Leverage Expert Parallelism and specialised Flash attention kernels for faster inference
224+
- Leverage Expert Parallelism
225+
- Use specialised Flash attention kernels for faster inference
215226

216227
```py
217-
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
218-
import torch
228+
from transformers import AutoModelForCausalLM, AutoTokenizer
219229
from transformers.distributed import DistributedConfig
230+
import torch
220231

221-
model_path = ""
232+
model_path = "openai/gpt-oss-120b"
222233
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")
223234

224-
# Set up chat template
225-
messages = [
226-
{"role": "user", "content": "Explain how expert parallelism works in large language models."}
227-
]
228-
chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
229-
230-
generation_config = GenerationConfig(
231-
max_new_tokens=1024,
232-
do_sample=True,
233-
)
234-
235235
device_map = {
236-
"distributed_config": DistributedConfig(enable_expert_parallel=1), # Enable Expert Parallelism
237-
"tp_plan": "auto", # Enables Tensor Parallelism
236+
# Enable Expert Parallelism
237+
"distributed_config": DistributedConfig(enable_expert_parallel=1),
238+
# Enable Tensor Parallelism
239+
"tp_plan": "auto",
238240
}
239241

240242
model = AutoModelForCausalLM.from_pretrained(
241243
model_path,
242244
torch_dtype="auto",
243-
attn_implementation="vllm-flash-attn3:flash_attn_varlen_func",
245+
attn_implementation="kernels-community/vllm-flash-attn3",
244246
**device_map,
245247
)
246248

247-
model.eval()
249+
messages = [
250+
{"role": "user", "content": "Explain how expert parallelism works in large language models."}
251+
]
252+
253+
inputs = tokenizer.apply_chat_template(
254+
messages,
255+
add_generation_prompt=True,
256+
return_tensors="pt",
257+
return_dict=True,
258+
).to(model.device)
248259

249-
# Tokenize and generate
250-
inputs = tokenizer(chat_prompt, return_tensors="pt").to("cuda")
251-
outputs = model.generate(**inputs, generation_config=generation_config)
260+
outputs = model.generate(**inputs, max_new_tokens=1000)
252261

253262
# Decode and print
254-
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
255-
print("Model response:", response.split("assistant\n")[-1].strip())
256-
263+
response = tokenizer.decode(outputs[0])
264+
print("Model response:", response.split("<|channel|>final<|message|>")[-1].strip())
257265
```
258266

259-
You can then run this generation via
267+
You can then run this on a node with four GPUs via
260268

261269
```bash
262-
torchrun --nproc_per_node=2 generate.py
270+
torchrun --nproc_per_node=4 generate.py
263271
```

0 commit comments

Comments
 (0)