From 12c3f8c5bb3170e6cc6f57d5d550ea6957ba0ee5 Mon Sep 17 00:00:00 2001 From: youkaichao Date: Wed, 6 Aug 2025 00:59:38 +0800 Subject: [PATCH 1/7] add gpt oss --- _posts/2025-08-06-gpt-oss.md | 61 ++++++++++++++++++++++++++++++++++++ 1 file changed, 61 insertions(+) create mode 100644 _posts/2025-08-06-gpt-oss.md diff --git a/_posts/2025-08-06-gpt-oss.md b/_posts/2025-08-06-gpt-oss.md new file mode 100644 index 0000000..3909e55 --- /dev/null +++ b/_posts/2025-08-06-gpt-oss.md @@ -0,0 +1,61 @@ +--- +layout: post +title: "vLLM Now Supports GPT-OSS" +author: "The vLLM Team" +image: /assets/figures/v1/vLLM_V1_Logo.png +--- + +We're thrilled to announce that vLLM now supports GPT-OSS on NVIDIA Blackwell and Hopper GPUs, as well as AMD MI300x and MI355x GPUs. In this blog post, we’ll explore the efficient model architecture of GPT-OSS and how vLLM supports it. + +### **MXFP4 MoE** + +GPT-OSS is a sparse MoE model with 128 experts (120B) or 32 experts (20B), where each token is routed to 4 experts (with no shared expert). For the MoE weights, it uses [MXFP4](https://arxiv.org/abs/2310.10537), a novel group-quantized floating-point format, while it uses the standard bfloat16 for attention and other layers. Since MoE takes the majority of the model parameters, using MXFP4 for MoE weights alone reduces the model sizes to 63 GB (120B) and 14 GB (20B), making them runnable on a single GPU (while often not recommended for the best performance)! + +In MXFP4, each weight is represented as a 4-bit floating-point (fp4 e2m1). Additionally, MXFP4 introduces a power-of-two scaling factor for each group of 32 consecutive fp4 values, to represent a wide numerical range. When it runs on hardware, two fp4 values are packed into a single 8-bit unit in memory, and then unpacked on the fly within the matmul kernel for computation. + +To efficiently run MXFP4 MoE, vLLM has integrated two specialized GPU kernels via collaboration with OpenAI and NVIDIA: + +* **Blackwell GPUs (e.g., B200):** A new MoE kernel from [FlashInfer](https://github.com/flashinfer-ai/flashinfer). This kernel is implemented by NVIDIA and uses Blackwell’s native MXFP4 tensor cores for maximum performance. +* **Hopper GPUs (e.g., H100, H200):** Triton [`matmul_ogs` kernel](https://github.com/triton-lang/triton/tree/main/python/triton_kernels), officially implemented by the OpenAI Triton team. This kernel is optimized specifically for Hopper architectures, includes the [swizzling](https://en.wikipedia.org/wiki/Swizzling_\(computer_graphics\)) optimization and built-in heuristics, removing the need for manual tuning. + +### **Efficient Attention** + +GPT-OSS has a highly efficient attention design. It uses GQA with 64 query heads and 8 KV heads. Importantly, the model interleaves full attention and sliding window attention (with window size **128**) with 1:1 ratio. Furthermore, the head size of the model is 64, 50% of the standard head size 128. Finally, each query head has a trained “attention sink” vector. + +To efficiently support this attention, vLLM has integrated special GPU kernels from FlashInfer (Blackwell) and FlashAttention 3 (Hopper). Also, we enhanced our Triton attention kernel to support this on AMD GPUs. + +Furthermore, to efficiently manage the KV cache with different types of attention (i.e., full and sliding window), vLLM has integrated the [hybrid KV cache allocator](https://arxiv.org/abs/2503.18292), a novel technique proposed by the vLLM team. 
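To see why sharing one pool of KV cache blocks across both layer types matters, here is a back-of-the-envelope sketch. It only uses shapes quoted in this post (8 KV heads, head size 64, window size 128); the bf16 KV cache and the 32k-token context length are assumptions for illustration:

```
# Rough per-sequence KV cache footprint, per layer (illustrative sketch only).
kv_heads, head_size, bytes_per_elem = 8, 64, 2   # bf16 KV cache assumed
window, context_len = 128, 32_768                # 32k-token context assumed

per_token = 2 * kv_heads * head_size * bytes_per_elem     # K and V: 2048 B
full_attn_layer = context_len * per_token                 # keeps every token
sliding_layer = min(window, context_len) * per_token      # keeps only the window

print(f"full attention layer:  {full_attn_layer / 2**20:.0f} MiB")  # ~64 MiB
print(f"sliding window layer:  {sliding_layer / 2**10:.0f} KiB")    # ~256 KiB
```

With a 1:1 interleave, a static per-layer-type partition of GPU memory would either starve the full-attention layers or leave most of the sliding-window budget idle.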
With the hybrid KV cache manager, vLLM can dynamically share the KV cache space between the full attention layers and sliding window attention layers, reducing the potential memory fragmentation down to zero. + +### **Built-in Tool Support: Agent Loop & Tool Server via MCP** + +GPT-OSS includes built-in support for powerful tools, such as web browsing and Python code interpreter. When enabled, the model autonomously decides when and how to invoke these tools, interpreting the results seamlessly. + +vLLM natively supports these capabilities by integrating the [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses) and the gpt-oss toolkit. Through this integration, vLLM implements a loop to parse the model’s tool call, actually invoke the search and code interpreter tools, parse their outputs, and send them back to the model. + +Alternatively, users can launch an MCP-compliant external tool server, to let vLLM use the tool server instead of directly leveraging the gpt-oss toolkit. This modular architecture simplifies the creation of scalable tool-calling libraries and services, requiring no internal changes to vLLM. + +### **Looking Ahead** + +This announcement is just the beginning of vLLM’s continued optimization for GPT-OSS. Our ongoing roadmap includes: + +* Hardening the Responses API + +* Further enhancing attention DP and MoE EP support + +* Reducing CPU overhead to maximize throughput + +## Acknowledgement + +vLLM team members who contributed to this effort are: Yongye Zhu, Woosuk Kwon, Chen Zhang, Simon Mo, Kaichao You. + +Jay Shah from Colfax International implemented the necessary changes to adapt to attention sinks and uncovered optimizations in the FA3 algorithm for gpt-oss. + +We want to thank OpenAI for the amazing partnership: Zhuohan Li, Xiaoxuan Liu, Philippe Tillet, Mario Lezcano-Casado, Dominik Kundel, Casey Dvorak. + +NVIDIA and vLLM worked closely to develop and verify both performance and accuracy on NVIDIA Blackwell architecture: Duncan Moss, Grace Ho, Julien Demouth, Minseok Lee, Siyuan Fu, Zihao Ye, Pen Chung Li. + +The AMD team contributed significantly to the integration of the model on their devices: Hongxia Yang, Ali Zaidy, with great support from Peng Sun, Vinayak Gokhale, Andy Luo + +The Hugging Face team continues to be amazing at building an open source ecosystem: Lysandre, Hugo, Marc, vb, Arthur, Mohamed, Andrien. + +Finally, we want to thank all the partners that leveraged vLLM in some way and delivered valuable feedback and improvements to this effort: AWS, Cloudflare, Snowflake, Databricks, Together, Fireworks, Cerebras. From 93fb74b385ce09e802b66535c1f0df4e325dfb83 Mon Sep 17 00:00:00 2001 From: simon-mo Date: Tue, 5 Aug 2025 10:08:44 -0700 Subject: [PATCH 2/7] add blurb Signed-off-by: simon-mo --- _posts/2025-08-06-gpt-oss.md | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/_posts/2025-08-06-gpt-oss.md b/_posts/2025-08-06-gpt-oss.md index 3909e55..dd3e0d8 100644 --- a/_posts/2025-08-06-gpt-oss.md +++ b/_posts/2025-08-06-gpt-oss.md @@ -7,6 +7,26 @@ image: /assets/figures/v1/vLLM_V1_Logo.png We're thrilled to announce that vLLM now supports GPT-OSS on NVIDIA Blackwell and Hopper GPUs, as well as AMD MI300x and MI355x GPUs. In this blog post, we’ll explore the efficient model architecture of GPT-OSS and how vLLM supports it. 
+To quickly get started with GPT-OSS, you try our container: +``` +docker run --gpus all \ + -p 8000:8000 \ + --ipc=host \ + vllm/vllm-openai:gptoss \ + --model openai/gpt-oss-20b +``` +or install it in your virtual environment +``` +uv pip install --pre vllm==0.10.1+gptoss \ + --extra-index-url https://wheels.vllm.ai/gpt-oss/ \ + --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \ + --index-strategy unsafe-best-match + +vllm serve openai/gpt-oss-120b +``` +See [vLLM User Guide](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.md) for more detail. + + ### **MXFP4 MoE** GPT-OSS is a sparse MoE model with 128 experts (120B) or 32 experts (20B), where each token is routed to 4 experts (with no shared expert). For the MoE weights, it uses [MXFP4](https://arxiv.org/abs/2310.10537), a novel group-quantized floating-point format, while it uses the standard bfloat16 for attention and other layers. Since MoE takes the majority of the model parameters, using MXFP4 for MoE weights alone reduces the model sizes to 63 GB (120B) and 14 GB (20B), making them runnable on a single GPU (while often not recommended for the best performance)! From 6934afde0f46547b92236876f734f7c56ff7a0aa Mon Sep 17 00:00:00 2001 From: youkaichao Date: Wed, 6 Aug 2025 01:09:45 +0800 Subject: [PATCH 3/7] fix image Signed-off-by: youkaichao --- _posts/2025-08-06-gpt-oss.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2025-08-06-gpt-oss.md b/_posts/2025-08-06-gpt-oss.md index 3909e55..8f5edb6 100644 --- a/_posts/2025-08-06-gpt-oss.md +++ b/_posts/2025-08-06-gpt-oss.md @@ -2,7 +2,7 @@ layout: post title: "vLLM Now Supports GPT-OSS" author: "The vLLM Team" -image: /assets/figures/v1/vLLM_V1_Logo.png +image: /assets/logos/vllm-logo-text-light.png --- We're thrilled to announce that vLLM now supports GPT-OSS on NVIDIA Blackwell and Hopper GPUs, as well as AMD MI300x and MI355x GPUs. In this blog post, we’ll explore the efficient model architecture of GPT-OSS and how vLLM supports it. From 79ba945c3ba46ad7460c1683460e92ceb07e8940 Mon Sep 17 00:00:00 2001 From: youkaichao Date: Wed, 6 Aug 2025 01:11:12 +0800 Subject: [PATCH 4/7] add people Signed-off-by: youkaichao --- _posts/2025-08-06-gpt-oss.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2025-08-06-gpt-oss.md b/_posts/2025-08-06-gpt-oss.md index 7faa4dd..b18782d 100644 --- a/_posts/2025-08-06-gpt-oss.md +++ b/_posts/2025-08-06-gpt-oss.md @@ -70,7 +70,7 @@ vLLM team members who contributed to this effort are: Yongye Zhu, Woosuk Kwon, C Jay Shah from Colfax International implemented the necessary changes to adapt to attention sinks and uncovered optimizations in the FA3 algorithm for gpt-oss. -We want to thank OpenAI for the amazing partnership: Zhuohan Li, Xiaoxuan Liu, Philippe Tillet, Mario Lezcano-Casado, Dominik Kundel, Casey Dvorak. +We want to thank OpenAI for the amazing partnership: Zhuohan Li, Xiaoxuan Liu, Philippe Tillet, Mario Lezcano-Casado, Dominik Kundel, Casey Dvorak, Vol Kyrylov. NVIDIA and vLLM worked closely to develop and verify both performance and accuracy on NVIDIA Blackwell architecture: Duncan Moss, Grace Ho, Julien Demouth, Minseok Lee, Siyuan Fu, Zihao Ye, Pen Chung Li. From 0f132e67ac1a41edfe544992e1b9510afeab77d8 Mon Sep 17 00:00:00 2001 From: youkaichao Date: Wed, 6 Aug 2025 01:15:25 +0800 Subject: [PATCH 5/7] try? 
--- _posts/2025-08-05-gpt-oss.md | 9 ++++ _posts/2025-08-06-gpt-oss.md | 81 ------------------------------------ 2 files changed, 9 insertions(+), 81 deletions(-) create mode 100644 _posts/2025-08-05-gpt-oss.md delete mode 100644 _posts/2025-08-06-gpt-oss.md diff --git a/_posts/2025-08-05-gpt-oss.md b/_posts/2025-08-05-gpt-oss.md new file mode 100644 index 0000000..3b3766f --- /dev/null +++ b/_posts/2025-08-05-gpt-oss.md @@ -0,0 +1,9 @@ +--- +layout: post +title: "vLLM Now Supports GPT-OSS" +author: "The vLLM Team" +image: /assets/logos/vllm-logo-text-light.png +--- + +We're thrilled to announce that vLLM now supports GPT-OSS on NVIDIA Blackwell and Hopper GPUs, as well as AMD MI300x and MI355x GPUs. In this blog post, we’ll explore the efficient model architecture of GPT-OSS and how vLLM supports it. + diff --git a/_posts/2025-08-06-gpt-oss.md b/_posts/2025-08-06-gpt-oss.md deleted file mode 100644 index b18782d..0000000 --- a/_posts/2025-08-06-gpt-oss.md +++ /dev/null @@ -1,81 +0,0 @@ ---- -layout: post -title: "vLLM Now Supports GPT-OSS" -author: "The vLLM Team" -image: /assets/logos/vllm-logo-text-light.png ---- - -We're thrilled to announce that vLLM now supports GPT-OSS on NVIDIA Blackwell and Hopper GPUs, as well as AMD MI300x and MI355x GPUs. In this blog post, we’ll explore the efficient model architecture of GPT-OSS and how vLLM supports it. - -To quickly get started with GPT-OSS, you try our container: -``` -docker run --gpus all \ - -p 8000:8000 \ - --ipc=host \ - vllm/vllm-openai:gptoss \ - --model openai/gpt-oss-20b -``` -or install it in your virtual environment -``` -uv pip install --pre vllm==0.10.1+gptoss \ - --extra-index-url https://wheels.vllm.ai/gpt-oss/ \ - --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \ - --index-strategy unsafe-best-match - -vllm serve openai/gpt-oss-120b -``` -See [vLLM User Guide](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.md) for more detail. - - -### **MXFP4 MoE** - -GPT-OSS is a sparse MoE model with 128 experts (120B) or 32 experts (20B), where each token is routed to 4 experts (with no shared expert). For the MoE weights, it uses [MXFP4](https://arxiv.org/abs/2310.10537), a novel group-quantized floating-point format, while it uses the standard bfloat16 for attention and other layers. Since MoE takes the majority of the model parameters, using MXFP4 for MoE weights alone reduces the model sizes to 63 GB (120B) and 14 GB (20B), making them runnable on a single GPU (while often not recommended for the best performance)! - -In MXFP4, each weight is represented as a 4-bit floating-point (fp4 e2m1). Additionally, MXFP4 introduces a power-of-two scaling factor for each group of 32 consecutive fp4 values, to represent a wide numerical range. When it runs on hardware, two fp4 values are packed into a single 8-bit unit in memory, and then unpacked on the fly within the matmul kernel for computation. - -To efficiently run MXFP4 MoE, vLLM has integrated two specialized GPU kernels via collaboration with OpenAI and NVIDIA: - -* **Blackwell GPUs (e.g., B200):** A new MoE kernel from [FlashInfer](https://github.com/flashinfer-ai/flashinfer). This kernel is implemented by NVIDIA and uses Blackwell’s native MXFP4 tensor cores for maximum performance. -* **Hopper GPUs (e.g., H100, H200):** Triton [`matmul_ogs` kernel](https://github.com/triton-lang/triton/tree/main/python/triton_kernels), officially implemented by the OpenAI Triton team. 
This kernel is optimized specifically for Hopper architectures, includes the [swizzling](https://en.wikipedia.org/wiki/Swizzling_\(computer_graphics\)) optimization and built-in heuristics, removing the need for manual tuning. - -### **Efficient Attention** - -GPT-OSS has a highly efficient attention design. It uses GQA with 64 query heads and 8 KV heads. Importantly, the model interleaves full attention and sliding window attention (with window size **128**) with 1:1 ratio. Furthermore, the head size of the model is 64, 50% of the standard head size 128. Finally, each query head has a trained “attention sink” vector. - -To efficiently support this attention, vLLM has integrated special GPU kernels from FlashInfer (Blackwell) and FlashAttention 3 (Hopper). Also, we enhanced our Triton attention kernel to support this on AMD GPUs. - -Furthermore, to efficiently manage the KV cache with different types of attention (i.e., full and sliding window), vLLM has integrated the [hybrid KV cache allocator](https://arxiv.org/abs/2503.18292), a novel technique proposed by the vLLM team. With the hybrid KV cache manager, vLLM can dynamically share the KV cache space between the full attention layers and sliding window attention layers, reducing the potential memory fragmentation down to zero. - -### **Built-in Tool Support: Agent Loop & Tool Server via MCP** - -GPT-OSS includes built-in support for powerful tools, such as web browsing and Python code interpreter. When enabled, the model autonomously decides when and how to invoke these tools, interpreting the results seamlessly. - -vLLM natively supports these capabilities by integrating the [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses) and the gpt-oss toolkit. Through this integration, vLLM implements a loop to parse the model’s tool call, actually invoke the search and code interpreter tools, parse their outputs, and send them back to the model. - -Alternatively, users can launch an MCP-compliant external tool server, to let vLLM use the tool server instead of directly leveraging the gpt-oss toolkit. This modular architecture simplifies the creation of scalable tool-calling libraries and services, requiring no internal changes to vLLM. - -### **Looking Ahead** - -This announcement is just the beginning of vLLM’s continued optimization for GPT-OSS. Our ongoing roadmap includes: - -* Hardening the Responses API - -* Further enhancing attention DP and MoE EP support - -* Reducing CPU overhead to maximize throughput - -## Acknowledgement - -vLLM team members who contributed to this effort are: Yongye Zhu, Woosuk Kwon, Chen Zhang, Simon Mo, Kaichao You. - -Jay Shah from Colfax International implemented the necessary changes to adapt to attention sinks and uncovered optimizations in the FA3 algorithm for gpt-oss. - -We want to thank OpenAI for the amazing partnership: Zhuohan Li, Xiaoxuan Liu, Philippe Tillet, Mario Lezcano-Casado, Dominik Kundel, Casey Dvorak, Vol Kyrylov. - -NVIDIA and vLLM worked closely to develop and verify both performance and accuracy on NVIDIA Blackwell architecture: Duncan Moss, Grace Ho, Julien Demouth, Minseok Lee, Siyuan Fu, Zihao Ye, Pen Chung Li. - -The AMD team contributed significantly to the integration of the model on their devices: Hongxia Yang, Ali Zaidy, with great support from Peng Sun, Vinayak Gokhale, Andy Luo - -The Hugging Face team continues to be amazing at building an open source ecosystem: Lysandre, Hugo, Marc, vb, Arthur, Mohamed, Andrien. 
- -Finally, we want to thank all the partners that leveraged vLLM in some way and delivered valuable feedback and improvements to this effort: AWS, Cloudflare, Snowflake, Databricks, Together, Fireworks, Cerebras. From e425ffce26074d0818e73731503562ebdd904467 Mon Sep 17 00:00:00 2001 From: youkaichao Date: Wed, 6 Aug 2025 01:16:42 +0800 Subject: [PATCH 6/7] restore --- _posts/2025-08-05-gpt-oss.md | 72 ++++++++++++++++++++++++++++++++++++ 1 file changed, 72 insertions(+) diff --git a/_posts/2025-08-05-gpt-oss.md b/_posts/2025-08-05-gpt-oss.md index 3b3766f..b18782d 100644 --- a/_posts/2025-08-05-gpt-oss.md +++ b/_posts/2025-08-05-gpt-oss.md @@ -7,3 +7,75 @@ image: /assets/logos/vllm-logo-text-light.png We're thrilled to announce that vLLM now supports GPT-OSS on NVIDIA Blackwell and Hopper GPUs, as well as AMD MI300x and MI355x GPUs. In this blog post, we’ll explore the efficient model architecture of GPT-OSS and how vLLM supports it. +To quickly get started with GPT-OSS, you try our container: +``` +docker run --gpus all \ + -p 8000:8000 \ + --ipc=host \ + vllm/vllm-openai:gptoss \ + --model openai/gpt-oss-20b +``` +or install it in your virtual environment +``` +uv pip install --pre vllm==0.10.1+gptoss \ + --extra-index-url https://wheels.vllm.ai/gpt-oss/ \ + --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \ + --index-strategy unsafe-best-match + +vllm serve openai/gpt-oss-120b +``` +See [vLLM User Guide](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.md) for more detail. + + +### **MXFP4 MoE** + +GPT-OSS is a sparse MoE model with 128 experts (120B) or 32 experts (20B), where each token is routed to 4 experts (with no shared expert). For the MoE weights, it uses [MXFP4](https://arxiv.org/abs/2310.10537), a novel group-quantized floating-point format, while it uses the standard bfloat16 for attention and other layers. Since MoE takes the majority of the model parameters, using MXFP4 for MoE weights alone reduces the model sizes to 63 GB (120B) and 14 GB (20B), making them runnable on a single GPU (while often not recommended for the best performance)! + +In MXFP4, each weight is represented as a 4-bit floating-point (fp4 e2m1). Additionally, MXFP4 introduces a power-of-two scaling factor for each group of 32 consecutive fp4 values, to represent a wide numerical range. When it runs on hardware, two fp4 values are packed into a single 8-bit unit in memory, and then unpacked on the fly within the matmul kernel for computation. + +To efficiently run MXFP4 MoE, vLLM has integrated two specialized GPU kernels via collaboration with OpenAI and NVIDIA: + +* **Blackwell GPUs (e.g., B200):** A new MoE kernel from [FlashInfer](https://github.com/flashinfer-ai/flashinfer). This kernel is implemented by NVIDIA and uses Blackwell’s native MXFP4 tensor cores for maximum performance. +* **Hopper GPUs (e.g., H100, H200):** Triton [`matmul_ogs` kernel](https://github.com/triton-lang/triton/tree/main/python/triton_kernels), officially implemented by the OpenAI Triton team. This kernel is optimized specifically for Hopper architectures, includes the [swizzling](https://en.wikipedia.org/wiki/Swizzling_\(computer_graphics\)) optimization and built-in heuristics, removing the need for manual tuning. + +### **Efficient Attention** + +GPT-OSS has a highly efficient attention design. It uses GQA with 64 query heads and 8 KV heads. Importantly, the model interleaves full attention and sliding window attention (with window size **128**) with 1:1 ratio. 
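To make the interleaving concrete, the sketch below shows which key positions a query can attend to under each of the two attention types. It is illustrative pseudologic rather than vLLM code, and the exact window boundary convention (whether the window includes the current token) is an assumption:

```
# Illustrative only: allowed key positions for a query at position q under the
# two attention types gpt-oss interleaves in a 1:1 pattern.
def visible_keys(q: int, kind: str, window: int = 128) -> range:
    if kind == "full":      # causal full attention: all tokens up to q
        return range(0, q + 1)
    if kind == "sliding":   # causal sliding window: only the last `window` tokens
        return range(max(0, q - window + 1), q + 1)
    raise ValueError(f"unknown attention kind: {kind}")

# At position 1000, a sliding-window layer sees 128 keys; a full layer sees 1001.
print(len(visible_keys(1000, "sliding")), len(visible_keys(1000, "full")))
```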
Furthermore, the head size of the model is 64, 50% of the standard head size 128. Finally, each query head has a trained “attention sink” vector. + +To efficiently support this attention, vLLM has integrated special GPU kernels from FlashInfer (Blackwell) and FlashAttention 3 (Hopper). Also, we enhanced our Triton attention kernel to support this on AMD GPUs. + +Furthermore, to efficiently manage the KV cache with different types of attention (i.e., full and sliding window), vLLM has integrated the [hybrid KV cache allocator](https://arxiv.org/abs/2503.18292), a novel technique proposed by the vLLM team. With the hybrid KV cache manager, vLLM can dynamically share the KV cache space between the full attention layers and sliding window attention layers, reducing the potential memory fragmentation down to zero. + +### **Built-in Tool Support: Agent Loop & Tool Server via MCP** + +GPT-OSS includes built-in support for powerful tools, such as web browsing and Python code interpreter. When enabled, the model autonomously decides when and how to invoke these tools, interpreting the results seamlessly. + +vLLM natively supports these capabilities by integrating the [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses) and the gpt-oss toolkit. Through this integration, vLLM implements a loop to parse the model’s tool call, actually invoke the search and code interpreter tools, parse their outputs, and send them back to the model. + +Alternatively, users can launch an MCP-compliant external tool server, to let vLLM use the tool server instead of directly leveraging the gpt-oss toolkit. This modular architecture simplifies the creation of scalable tool-calling libraries and services, requiring no internal changes to vLLM. + +### **Looking Ahead** + +This announcement is just the beginning of vLLM’s continued optimization for GPT-OSS. Our ongoing roadmap includes: + +* Hardening the Responses API + +* Further enhancing attention DP and MoE EP support + +* Reducing CPU overhead to maximize throughput + +## Acknowledgement + +vLLM team members who contributed to this effort are: Yongye Zhu, Woosuk Kwon, Chen Zhang, Simon Mo, Kaichao You. + +Jay Shah from Colfax International implemented the necessary changes to adapt to attention sinks and uncovered optimizations in the FA3 algorithm for gpt-oss. + +We want to thank OpenAI for the amazing partnership: Zhuohan Li, Xiaoxuan Liu, Philippe Tillet, Mario Lezcano-Casado, Dominik Kundel, Casey Dvorak, Vol Kyrylov. + +NVIDIA and vLLM worked closely to develop and verify both performance and accuracy on NVIDIA Blackwell architecture: Duncan Moss, Grace Ho, Julien Demouth, Minseok Lee, Siyuan Fu, Zihao Ye, Pen Chung Li. + +The AMD team contributed significantly to the integration of the model on their devices: Hongxia Yang, Ali Zaidy, with great support from Peng Sun, Vinayak Gokhale, Andy Luo + +The Hugging Face team continues to be amazing at building an open source ecosystem: Lysandre, Hugo, Marc, vb, Arthur, Mohamed, Andrien. + +Finally, we want to thank all the partners that leveraged vLLM in some way and delivered valuable feedback and improvements to this effort: AWS, Cloudflare, Snowflake, Databricks, Together, Fireworks, Cerebras. 
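For readers who want to try the Responses API integration described above, a minimal client-side sketch follows. It assumes a server started with `vllm serve openai/gpt-oss-120b` listening on port 8000, the official `openai` Python SDK installed, and that the placeholder API key is accepted by the local deployment:

```
from openai import OpenAI

# Point the official OpenAI SDK at the local vLLM server (URL and dummy API key
# are assumptions; adjust them to match your deployment).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.responses.create(
    model="openai/gpt-oss-120b",
    input="In two sentences, what does MXFP4 quantization do?",
)
print(response.output_text)
```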
From 9322723f85fe1303a7e7446214fbf978a3ae96cf Mon Sep 17 00:00:00 2001 From: youkaichao Date: Wed, 6 Aug 2025 01:19:56 +0800 Subject: [PATCH 7/7] lowercase Signed-off-by: youkaichao --- _posts/2025-08-05-gpt-oss.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/_posts/2025-08-05-gpt-oss.md b/_posts/2025-08-05-gpt-oss.md index b18782d..9407313 100644 --- a/_posts/2025-08-05-gpt-oss.md +++ b/_posts/2025-08-05-gpt-oss.md @@ -1,13 +1,13 @@ --- layout: post -title: "vLLM Now Supports GPT-OSS" +title: "vLLM Now Supports gpt-oss" author: "The vLLM Team" image: /assets/logos/vllm-logo-text-light.png --- -We're thrilled to announce that vLLM now supports GPT-OSS on NVIDIA Blackwell and Hopper GPUs, as well as AMD MI300x and MI355x GPUs. In this blog post, we’ll explore the efficient model architecture of GPT-OSS and how vLLM supports it. +We're thrilled to announce that vLLM now supports gpt-oss on NVIDIA Blackwell and Hopper GPUs, as well as AMD MI300x and MI355x GPUs. In this blog post, we’ll explore the efficient model architecture of gpt-oss and how vLLM supports it. -To quickly get started with GPT-OSS, you try our container: +To quickly get started with gpt-oss, you try our container: ``` docker run --gpus all \ -p 8000:8000 \ @@ -24,12 +24,12 @@ uv pip install --pre vllm==0.10.1+gptoss \ vllm serve openai/gpt-oss-120b ``` -See [vLLM User Guide](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.md) for more detail. +See [vLLM User Guide](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/gpt-oss.md) for more detail. ### **MXFP4 MoE** -GPT-OSS is a sparse MoE model with 128 experts (120B) or 32 experts (20B), where each token is routed to 4 experts (with no shared expert). For the MoE weights, it uses [MXFP4](https://arxiv.org/abs/2310.10537), a novel group-quantized floating-point format, while it uses the standard bfloat16 for attention and other layers. Since MoE takes the majority of the model parameters, using MXFP4 for MoE weights alone reduces the model sizes to 63 GB (120B) and 14 GB (20B), making them runnable on a single GPU (while often not recommended for the best performance)! +gpt-oss is a sparse MoE model with 128 experts (120B) or 32 experts (20B), where each token is routed to 4 experts (with no shared expert). For the MoE weights, it uses [MXFP4](https://arxiv.org/abs/2310.10537), a novel group-quantized floating-point format, while it uses the standard bfloat16 for attention and other layers. Since MoE takes the majority of the model parameters, using MXFP4 for MoE weights alone reduces the model sizes to 63 GB (120B) and 14 GB (20B), making them runnable on a single GPU (while often not recommended for the best performance)! In MXFP4, each weight is represented as a 4-bit floating-point (fp4 e2m1). Additionally, MXFP4 introduces a power-of-two scaling factor for each group of 32 consecutive fp4 values, to represent a wide numerical range. When it runs on hardware, two fp4 values are packed into a single 8-bit unit in memory, and then unpacked on the fly within the matmul kernel for computation. @@ -40,7 +40,7 @@ To efficiently run MXFP4 MoE, vLLM has integrated two specialized GPU kernels vi ### **Efficient Attention** -GPT-OSS has a highly efficient attention design. It uses GQA with 64 query heads and 8 KV heads. Importantly, the model interleaves full attention and sliding window attention (with window size **128**) with 1:1 ratio. Furthermore, the head size of the model is 64, 50% of the standard head size 128. 
Finally, each query head has a trained “attention sink” vector. +gpt-oss has a highly efficient attention design. It uses GQA with 64 query heads and 8 KV heads. Importantly, the model interleaves full attention and sliding window attention (with window size **128**) with 1:1 ratio. Furthermore, the head size of the model is 64, 50% of the standard head size 128. Finally, each query head has a trained “attention sink” vector. To efficiently support this attention, vLLM has integrated special GPU kernels from FlashInfer (Blackwell) and FlashAttention 3 (Hopper). Also, we enhanced our Triton attention kernel to support this on AMD GPUs. @@ -48,7 +48,7 @@ Furthermore, to efficiently manage the KV cache with different types of attentio ### **Built-in Tool Support: Agent Loop & Tool Server via MCP** -GPT-OSS includes built-in support for powerful tools, such as web browsing and Python code interpreter. When enabled, the model autonomously decides when and how to invoke these tools, interpreting the results seamlessly. +gpt-oss includes built-in support for powerful tools, such as web browsing and Python code interpreter. When enabled, the model autonomously decides when and how to invoke these tools, interpreting the results seamlessly. vLLM natively supports these capabilities by integrating the [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses) and the gpt-oss toolkit. Through this integration, vLLM implements a loop to parse the model’s tool call, actually invoke the search and code interpreter tools, parse their outputs, and send them back to the model. @@ -56,7 +56,7 @@ Alternatively, users can launch an MCP-compliant external tool server, to let vL ### **Looking Ahead** -This announcement is just the beginning of vLLM’s continued optimization for GPT-OSS. Our ongoing roadmap includes: +This announcement is just the beginning of vLLM’s continued optimization for gpt-oss. Our ongoing roadmap includes: * Hardening the Responses API