---
layout: post
title: "vLLM Now Supports gpt-oss"
author: "The vLLM Team"
image: /assets/logos/vllm-logo-text-light.png
---
We're thrilled to announce that vLLM now supports gpt-oss on NVIDIA Blackwell and Hopper GPUs, as well as AMD MI300x and MI355x GPUs. In this blog post, we’ll explore the efficient model architecture of gpt-oss and how vLLM supports it.
To quickly get started with gpt-oss, you can try our container:
See [vLLM User Guide](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/gpt-oss.md) for more detail.
### **MXFP4 MoE**
gpt-oss is a sparse MoE model with 128 experts (120B) or 32 experts (20B), where each token is routed to 4 experts (with no shared expert). For the MoE weights, it uses [MXFP4](https://arxiv.org/abs/2310.10537), a novel group-quantized floating-point format, while the attention and other layers use standard bfloat16. Since the MoE layers account for the majority of the parameters, using MXFP4 for the MoE weights alone reduces the model sizes to 63 GB (120B) and 14 GB (20B), making them runnable on a single GPU (though this is often not recommended for best performance)!
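
To make the routing step concrete, here is a minimal sketch of top-4 expert selection (our own illustration, not vLLM's fused MoE kernel; the tensor names and the hidden size are assumptions, and whether the routing weights are normalized before or after the top-k selection is an implementation detail we gloss over):

```python
import torch

def route_tokens(hidden: torch.Tensor, router_weight: torch.Tensor, top_k: int = 4):
    """Toy top-k MoE routing: each token picks its top_k experts (no shared expert).

    hidden:        [num_tokens, hidden_size]
    router_weight: [num_experts, hidden_size]
    """
    router_logits = hidden @ router_weight.t()          # [num_tokens, num_experts]
    topk_logits, topk_ids = router_logits.topk(top_k, dim=-1)
    topk_weights = torch.softmax(topk_logits, dim=-1)   # normalize over the 4 chosen experts
    return topk_ids, topk_weights

# Example: the 120B model has 128 experts (the 20B model has 32); hidden size is illustrative.
hidden = torch.randn(3, 2880)
router = torch.randn(128, 2880)
ids, weights = route_tokens(hidden, router)
print(ids.shape, weights.shape)  # torch.Size([3, 4]) torch.Size([3, 4])
```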
In MXFP4, each weight is represented as a 4-bit floating-point value (fp4 e2m1). Additionally, MXFP4 introduces a power-of-two scaling factor for each group of 32 consecutive fp4 values to represent a wide numerical range. On hardware, two fp4 values are packed into a single 8-bit unit in memory and unpacked on the fly within the matmul kernel for computation.
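
As a rough illustration of this storage layout, here is a NumPy sketch of unpacking and dequantizing MXFP4 weights (a simplified model of the format, not vLLM's kernel; the nibble order and scale encoding are assumptions):

```python
import numpy as np

# The 8 non-negative magnitudes representable by fp4 e2m1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def dequantize_mxfp4(packed: np.ndarray, scale_exponents: np.ndarray) -> np.ndarray:
    """Unpack MXFP4 weights: two fp4 codes per byte, one power-of-two scale per 32 values.

    packed:          uint8 array; each byte holds two 4-bit codes (low nibble first, which
                     is an assumption here; the real layout is kernel-specific).
    scale_exponents: int8 array with one shared exponent per group of 32 fp4 values.
    """
    low, high = packed & 0x0F, packed >> 4
    codes = np.empty(packed.size * 2, dtype=np.uint8)
    codes[0::2], codes[1::2] = low, high
    signs = np.where(codes & 0x8, -1.0, 1.0).astype(np.float32)
    values = signs * E2M1_MAGNITUDES[codes & 0x7]
    scales = np.exp2(scale_exponents.astype(np.float32)).repeat(32)
    return values * scales[: values.size]

# 64 fp4 values -> 32 packed bytes; 2 groups of 32 -> 2 shared scale exponents.
packed = np.random.randint(0, 256, size=32, dtype=np.uint8)
scales = np.array([-1, 2], dtype=np.int8)
print(dequantize_mxfp4(packed, scales).shape)  # (64,)
```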
To efficiently run MXFP4 MoE, vLLM has integrated two specialized GPU kernels via …
### **Efficient Attention**
gpt-oss has a highly efficient attention design. It uses GQA with 64 query heads and 8 KV heads. Importantly, the model interleaves full attention and sliding window attention (with window size **128**) at a 1:1 ratio. Furthermore, the model's head size is 64, half the standard 128. Finally, each query head has a trained “attention sink” vector.
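
To see what the attention sink does, here is a simplified single-head sketch (our own illustration, not the actual kernel): the trained sink acts as an extra logit in the softmax denominator, which lets a head down-weight all real tokens when none of them is relevant.

```python
import torch

def attention_with_sink(q, k, v, sink_logit, scale):
    """Single-head attention with a learned sink: the sink contributes to the softmax
    denominator but produces no value, so attention mass can be 'parked' on it."""
    scores = (q @ k.transpose(-2, -1)) * scale          # [q_len, kv_len]
    sink = sink_logit.expand(scores.shape[0], 1)        # broadcast the per-head sink per query
    probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    return probs[:, :-1] @ v                            # drop the sink column before the value matmul

head_size = 64                       # gpt-oss uses head size 64
q = torch.randn(4, head_size)
k = torch.randn(10, head_size)
v = torch.randn(10, head_size)
sink = torch.zeros(1, 1)             # a trained parameter in the real model
out = attention_with_sink(q, k, v, sink, scale=head_size ** -0.5)
print(out.shape)                     # torch.Size([4, 64])
```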
To efficiently support this attention scheme, vLLM has integrated specialized GPU kernels from FlashInfer (Blackwell) and FlashAttention 3 (Hopper). We also enhanced our Triton attention kernel to support it on AMD GPUs.
Furthermore, to efficiently manage the KV cache across the two attention types (full and sliding window), vLLM has integrated the [hybrid KV cache allocator](https://arxiv.org/abs/2503.18292), a novel technique proposed by the vLLM team. With the hybrid KV cache manager, vLLM dynamically shares the KV cache space between the full attention layers and the sliding window attention layers, eliminating potential memory fragmentation.
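
A back-of-the-envelope comparison (using the head counts and window size above, and assuming a bf16 KV cache; exact numbers depend on vLLM's paged allocation) shows why sharing the space matters:

```python
# Per-token KV cache footprint for one gpt-oss layer (bf16 assumed: 2 bytes per element).
kv_heads, head_size, bytes_per_elem = 8, 64, 2
kv_bytes_per_token = 2 * kv_heads * head_size * bytes_per_elem  # K and V -> 2048 bytes

context_len, window = 100_000, 128                       # illustrative context length
full_layer = context_len * kv_bytes_per_token            # grows with the context length
sliding_layer = min(context_len, window) * kv_bytes_per_token  # capped by the window
print(f"full attention layer : {full_layer / 1e6:.0f} MB per sequence")
print(f"sliding window layer : {sliding_layer / 1e3:.0f} KB per sequence")
```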
### **Built-in Tool Support: Agent Loop & Tool Server via MCP**
gpt-oss includes built-in support for powerful tools such as web browsing and a Python code interpreter. When enabled, the model autonomously decides when and how to invoke these tools and interprets the results seamlessly.
vLLM natively supports these capabilities by integrating the [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses) and the gpt-oss toolkit. Through this integration, vLLM implements an agent loop that parses the model’s tool calls, invokes the search and code interpreter tools, parses their outputs, and sends them back to the model.
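
For example, a client can drive this loop through the standard OpenAI SDK pointed at a vLLM server (a minimal sketch; the server URL, model name, and tool configuration below are assumptions that depend on how the server was launched):

```python
from openai import OpenAI

# Assumes a local vLLM server launched with a gpt-oss model and its built-in tools enabled.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.responses.create(
    model="openai/gpt-oss-120b",
    input="Find the latest vLLM release and summarize its headline features.",
    # Built-in tool selection follows the Responses API schema; whether a given tool is
    # available depends on the server-side setup (gpt-oss toolkit or an MCP tool server).
    tools=[{"type": "web_search_preview"}],
)

# The final text is aggregated by the SDK; intermediate tool calls are handled server-side.
print(response.output_text)
```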
Alternatively, users can launch an MCP-compliant external tool server and let vLLM use it instead of directly leveraging the gpt-oss toolkit. This modular architecture simplifies the creation of scalable tool-calling libraries and services, requiring no internal changes to vLLM.
### **Looking Ahead**
This announcement is just the beginning of vLLM’s continued optimization for gpt-oss. Our ongoing roadmap includes: