
Commit 9322723

lowercase

Signed-off-by: youkaichao <[email protected]>
1 parent e425ffc

1 file changed: +8 -8 lines changed

_posts/2025-08-05-gpt-oss.md (8 additions, 8 deletions)
@@ -1,13 +1,13 @@
 ---
 layout: post
-title: "vLLM Now Supports GPT-OSS"
+title: "vLLM Now Supports gpt-oss"
 author: "The vLLM Team"
 image: /assets/logos/vllm-logo-text-light.png
 ---

-We're thrilled to announce that vLLM now supports GPT-OSS on NVIDIA Blackwell and Hopper GPUs, as well as AMD MI300x and MI355x GPUs. In this blog post, we’ll explore the efficient model architecture of GPT-OSS and how vLLM supports it.
+We're thrilled to announce that vLLM now supports gpt-oss on NVIDIA Blackwell and Hopper GPUs, as well as AMD MI300x and MI355x GPUs. In this blog post, we’ll explore the efficient model architecture of gpt-oss and how vLLM supports it.

-To quickly get started with GPT-OSS, you can try our container:
+To quickly get started with gpt-oss, you can try our container:
 ```
 docker run --gpus all \
 -p 8000:8000 \
@@ -24,12 +24,12 @@ uv pip install --pre vllm==0.10.1+gptoss \

 vllm serve openai/gpt-oss-120b
 ```
-See [vLLM User Guide](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.md) for more detail.
+See [vLLM User Guide](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/gpt-oss.md) for more detail.


 ### **MXFP4 MoE**

-GPT-OSS is a sparse MoE model with 128 experts (120B) or 32 experts (20B), where each token is routed to 4 experts (with no shared expert). For the MoE weights, it uses [MXFP4](https://arxiv.org/abs/2310.10537), a novel group-quantized floating-point format, while it uses the standard bfloat16 for attention and other layers. Since MoE takes the majority of the model parameters, using MXFP4 for MoE weights alone reduces the model sizes to 63 GB (120B) and 14 GB (20B), making them runnable on a single GPU (while often not recommended for the best performance)!
+gpt-oss is a sparse MoE model with 128 experts (120B) or 32 experts (20B), where each token is routed to 4 experts (with no shared expert). For the MoE weights, it uses [MXFP4](https://arxiv.org/abs/2310.10537), a novel group-quantized floating-point format, while it uses the standard bfloat16 for attention and other layers. Since MoE takes the majority of the model parameters, using MXFP4 for MoE weights alone reduces the model sizes to 63 GB (120B) and 14 GB (20B), making them runnable on a single GPU (while often not recommended for the best performance)!

 In MXFP4, each weight is represented as a 4-bit floating-point (fp4 e2m1). Additionally, MXFP4 introduces a power-of-two scaling factor for each group of 32 consecutive fp4 values, to represent a wide numerical range. When it runs on hardware, two fp4 values are packed into a single 8-bit unit in memory, and then unpacked on the fly within the matmul kernel for computation.
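Editor's aside on the MXFP4 paragraphs above: the size arithmetic works out because MXFP4 spends 4 bits per weight plus one shared 8-bit power-of-two scale per 32 weights, about 4.25 bits per MoE weight. As a concrete illustration of the format, here is a minimal NumPy sketch of decoding one MXFP4 group, following the OCP microscaling spec; this is not vLLM's kernel code, and the nibble ordering within each packed byte is an assumption.

```python
import numpy as np

# The 16 values representable by fp4 e2m1: 1 sign, 2 exponent, 1 mantissa bit.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                 -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
                dtype=np.float32)

def decode_mxfp4_group(packed: np.ndarray, scale_e8m0: int) -> np.ndarray:
    """Decode one MXFP4 group: 16 packed bytes -> 32 weights.

    `packed` holds two fp4 codes per byte; `scale_e8m0` is the group's
    shared power-of-two scale, stored as a biased 8-bit exponent.
    """
    assert packed.shape == (16,) and packed.dtype == np.uint8
    codes = np.empty(32, dtype=np.uint8)
    codes[0::2] = packed & 0x0F          # low nibble first (assumed ordering)
    codes[1::2] = (packed >> 4) & 0x0F   # high nibble second
    scale = np.float32(2.0) ** (int(scale_e8m0) - 127)  # e8m0 bias = 127
    return E2M1[codes] * scale
```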
@@ -40,23 +40,23 @@ To efficiently run MXFP4 MoE, vLLM has integrated two specialized GPU kernels vi

 ### **Efficient Attention**

-GPT-OSS has a highly efficient attention design. It uses GQA with 64 query heads and 8 KV heads. Importantly, the model interleaves full attention and sliding window attention (with window size **128**) at a 1:1 ratio. Furthermore, the head size of the model is 64, 50% of the standard head size 128. Finally, each query head has a trained “attention sink” vector.
+gpt-oss has a highly efficient attention design. It uses GQA with 64 query heads and 8 KV heads. Importantly, the model interleaves full attention and sliding window attention (with window size **128**) at a 1:1 ratio. Furthermore, the head size of the model is 64, 50% of the standard head size 128. Finally, each query head has a trained “attention sink” vector.

 To efficiently support this attention, vLLM has integrated special GPU kernels from FlashInfer (Blackwell) and FlashAttention 3 (Hopper). Also, we enhanced our Triton attention kernel to support this on AMD GPUs.

 Furthermore, to efficiently manage the KV cache with different types of attention (i.e., full and sliding window), vLLM has integrated the [hybrid KV cache allocator](https://arxiv.org/abs/2503.18292), a novel technique proposed by the vLLM team. With the hybrid KV cache manager, vLLM can dynamically share the KV cache space between the full attention layers and sliding window attention layers, reducing the potential memory fragmentation down to zero.

 ### **Built-in Tool Support: Agent Loop & Tool Server via MCP**

-GPT-OSS includes built-in support for powerful tools, such as web browsing and a Python code interpreter. When enabled, the model autonomously decides when and how to invoke these tools, interpreting the results seamlessly.
+gpt-oss includes built-in support for powerful tools, such as web browsing and a Python code interpreter. When enabled, the model autonomously decides when and how to invoke these tools, interpreting the results seamlessly.

 vLLM natively supports these capabilities by integrating the [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses) and the gpt-oss toolkit. Through this integration, vLLM implements a loop to parse the model’s tool call, actually invoke the search and code interpreter tools, parse their outputs, and send them back to the model.

 Alternatively, users can launch an MCP-compliant external tool server, to let vLLM use the tool server instead of directly leveraging the gpt-oss toolkit. This modular architecture simplifies the creation of scalable tool-calling libraries and services, requiring no internal changes to vLLM.

 ### **Looking Ahead**

-This announcement is just the beginning of vLLM’s continued optimization for GPT-OSS. Our ongoing roadmap includes:
+This announcement is just the beginning of vLLM’s continued optimization for gpt-oss. Our ongoing roadmap includes:

 * Hardening the Responses API
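To see why the attention design quoted in the diff above is memory-friendly, here is a back-of-the-envelope sketch; it is an editor's illustration, and the layer count is left as a parameter rather than taken from the model configs.

```python
def kv_bytes_per_token_layer(num_kv_heads: int = 8, head_size: int = 64,
                             dtype_bytes: int = 2) -> int:
    """Bytes one token costs in one layer: K and V in bf16, 8 KV heads of size 64."""
    return 2 * num_kv_heads * head_size * dtype_bytes  # = 2048 bytes

def kv_cache_bytes(context_len: int, num_layers: int, window: int = 128) -> int:
    """Total KV bytes with full and sliding-window layers interleaved 1:1."""
    per_token = kv_bytes_per_token_layer()
    full_layers = num_layers // 2
    swa_layers = num_layers - full_layers
    # Full-attention layers cache the entire context; sliding-window layers
    # never hold more than the last `window` tokens.
    return per_token * (full_layers * context_len
                        + swa_layers * min(window, context_len))
```

Relative to a design with head size 128, this halves the per-token cost, and half of the layers cap out at 128 cached tokens regardless of context length.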
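The hybrid KV cache allocator can be pictured as a single page pool that both layer types share. This toy sketch (not vLLM's actual data structures) shows the core idea:

```python
class SharedPagePool:
    """Toy model of a KV cache page pool shared by all attention layers."""

    def __init__(self, num_pages: int) -> None:
        self.free_pages = list(range(num_pages))

    def alloc(self) -> int:
        # Both full-attention and sliding-window layers draw pages from here.
        return self.free_pages.pop()

    def release(self, page: int) -> None:
        # Pages that slide out of a 128-token window return immediately,
        # becoming available to full-attention layers of any request.
        self.free_pages.append(page)
```

Because no capacity is statically reserved per attention type, pages freed by sliding-window layers are immediately reusable by full-attention layers, which is the sense in which fragmentation between the two drops to zero.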
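Since vLLM exposes an OpenAI-compatible server, the agent loop described above can be driven with the official openai client. A hypothetical usage sketch follows; the endpoint and model name assume the `vllm serve openai/gpt-oss-120b` command from the diff, and tool behavior depends on how the server is launched.

```python
# Hypothetical client-side sketch: talk to a local vLLM server through the
# OpenAI SDK's Responses API. Assumes the server from the post is running.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.responses.create(
    model="openai/gpt-oss-120b",
    input="Summarize the key ideas behind MXFP4 quantization.",
)
# If built-in tools are enabled server-side, the model may have browsed or
# run code before producing this final answer; vLLM handles that loop.
print(resp.output_text)
```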