GPT-OSS is a sparse MoE model with 128 experts (120B) or 32 experts (20B), where each token is routed to 4 experts, with no shared expert. For the MoE weights, it uses [MXFP4](https://arxiv.org/abs/2310.10537), a novel group-quantized floating-point format, while keeping standard bfloat16 for attention and the other layers. Since the MoE layers account for the majority of the model's parameters, quantizing the MoE weights alone shrinks the models to 63 GB (120B) and 14 GB (20B), making them runnable on a single GPU (though this is often not recommended for best performance)!
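To make the routing concrete, here is a minimal sketch of top-4 expert routing under the assumptions above (top-k selection over the expert pool, no shared expert). It is a plain PyTorch illustration, not the actual GPT-OSS implementation, and the names (`route_tokens`, `gate_weight`, `experts`) are hypothetical:

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, gate_weight, experts, top_k=4):
    """Illustrative top-k MoE routing with no shared expert.

    hidden:      [num_tokens, d_model] token activations
    gate_weight: [num_experts, d_model] router projection
    experts:     list of num_experts callables, each mapping [n, d_model] -> [n, d_model]
    """
    logits = hidden @ gate_weight.T                       # [num_tokens, num_experts]
    weights, expert_ids = torch.topk(logits, top_k, dim=-1)
    weights = F.softmax(weights, dim=-1)                  # normalize over the chosen experts only
    out = torch.zeros_like(hidden)
    for slot in range(top_k):
        for e, expert in enumerate(experts):
            mask = expert_ids[:, slot] == e               # tokens whose slot-th choice is expert e
            if mask.any():
                out[mask] += weights[mask, slot, None] * expert(hidden[mask])
    return out

# Toy usage: 32 experts (as in the 20B config), top-4 routing
d_model, num_experts = 16, 32
experts = [torch.nn.Linear(d_model, d_model) for _ in range(num_experts)]
gate_weight = torch.randn(num_experts, d_model)
tokens = torch.randn(8, d_model)
print(route_tokens(tokens, gate_weight, experts).shape)   # torch.Size([8, 16])
```

In a real deployment the expert weights themselves would be stored in MXFP4 and dequantized (or consumed directly by MXFP4-aware kernels) inside the expert computation; the sketch above only shows the routing logic.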