
Commit 93fb74b

add blurb

Signed-off-by: simon-mo <[email protected]>

1 parent 12c3f8c commit 93fb74b

File tree

1 file changed: +20 -0 lines changed


_posts/2025-08-06-gpt-oss.md

Lines changed: 20 additions & 0 deletions
@@ -7,6 +7,26 @@ image: /assets/figures/v1/vLLM_V1_Logo.png
We're thrilled to announce that vLLM now supports GPT-OSS on NVIDIA Blackwell and Hopper GPUs, as well as AMD MI300x and MI355x GPUs. In this blog post, we’ll explore the efficient model architecture of GPT-OSS and how vLLM supports it.

To quickly get started with GPT-OSS, you can try our container:
```
12+
docker run --gpus all \
13+
-p 8000:8000 \
14+
--ipc=host \
15+
vllm/vllm-openai:gptoss \
16+
--model openai/gpt-oss-20b
17+
```
or install it in your virtual environment:
```
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match

vllm serve openai/gpt-oss-120b
```
See the [vLLM User Guide](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.md) for more details.
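Once the server is running, it exposes the standard OpenAI-compatible API on port 8000. As a minimal sketch (assuming the container example above, serving `openai/gpt-oss-20b` locally), you can query it with the official `openai` Python client:

```python
# Minimal sketch: talk to a local vLLM server through its OpenAI-compatible API.
# Assumes the server started above is listening on http://localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM does not require a real key by default

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",  # match the model name you passed to `--model` or `vllm serve`
    messages=[{"role": "user", "content": "Explain mixture-of-experts in one sentence."}],
)
print(response.choices[0].message.content)
```

The same call works against the 120B checkpoint; only the `model` field needs to change to the name you served.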
### **MXFP4 MoE**
GPT-OSS is a sparse MoE model with 128 experts (120B) or 32 experts (20B), where each token is routed to 4 experts (with no shared expert). For the MoE weights, it uses [MXFP4](https://arxiv.org/abs/2310.10537), a novel group-quantized floating-point format, while the attention and other layers use standard bfloat16. Since the MoE layers account for the majority of the model parameters, using MXFP4 for the MoE weights alone reduces the model sizes to 63 GB (120B) and 14 GB (20B), making them runnable on a single GPU (though a single GPU is often not recommended for the best performance)!
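To make the size reduction concrete, here is a back-of-the-envelope sketch. It assumes the usual MX layout for MXFP4 (4-bit elements sharing one 8-bit scale per block of 32 elements); the parameter split in the example is an illustrative placeholder, not the model's exact breakdown:

```python
# Back-of-the-envelope arithmetic for MXFP4-quantized MoE weights.
# Assumption (standard MX layout): each weight is a 4-bit element, and every block
# of 32 elements shares one 8-bit scale, so the effective cost per MoE weight is
# 4 + 8/32 = 4.25 bits, versus 16 bits for bfloat16.

MXFP4_BITS = 4 + 8 / 32  # 4.25 effective bits per MoE weight
BF16_BITS = 16           # attention and other layers stay in bfloat16

print(f"compression vs bf16: {BF16_BITS / MXFP4_BITS:.2f}x")  # ~3.76x

def approx_checkpoint_gb(moe_params: float, other_params: float) -> float:
    """Rough checkpoint size in GB: MXFP4 MoE weights plus bf16 for everything else."""
    return (moe_params * MXFP4_BITS + other_params * BF16_BITS) / 8 / 1e9

# Illustrative split only (not the official breakdown): with MoE weights dominating,
# the 120B model comes out in the mid-60s of GB, the same ballpark as the ~63 GB above.
print(f"~{approx_checkpoint_gb(114e9, 3e9):.0f} GB")
```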
