
Commit c6fd13e

zhiweiz committed: update recipe with container tool
1 parent 2b294fe commit c6fd13e


OpenAI/GPT-OSS.md

Lines changed: 53 additions & 27 deletions
@@ -1,15 +1,15 @@
## `gpt-oss` vLLM Usage Guide

`gpt-oss-20b` and `gpt-oss-120b` are powerful reasoning models open-sourced by OpenAI.
In vLLM, you can run them on NVIDIA H100, H200, and B200, as well as AMD MI300x, MI325x, MI355x, and Radeon AI PRO R9700.
We are actively working on ensuring these models also work on Ampere, Ada Lovelace, and RTX 5090.

Specifically, vLLM optimizes for the `gpt-oss` family of models with:

* **Flexible parallelism options**: the model can be sharded across 2, 4, or 8 GPUs, scaling throughput.
* **High-performance attention and MoE kernels**: the attention kernel is specifically optimized for the attention-sinks mechanism and sliding-window shapes.
* **Asynchronous scheduling**: maximizes utilization and throughput by overlapping CPU operations with GPU operations.

This is a living document and we welcome contributions, corrections, and the creation of new recipes!

## Quickstart

@@ -41,7 +41,7 @@ GPT-OSS works on Ampere devices by default, using the `TRITON_ATTN` attention ba
```
# openai/gpt-oss-20b should run on a single A100
vllm serve openai/gpt-oss-20b --async-scheduling

# gpt-oss-120b will fit on a single A100 (80GB), but scaling it to higher TP sizes can help with throughput
vllm serve openai/gpt-oss-120b --async-scheduling
@@ -54,11 +54,11 @@ vllm serve openai/gpt-oss-120b --tensor-parallel-size 4 --async-scheduling
GPT-OSS works on Hopper devices by default, using the FlashAttention3 backend and Marlin MXFP4 MoE:

* `--async-scheduling` can be enabled for higher performance. Currently it is not compatible with structured output.
* We recommend TP=2 for H100 and H200 as the best performance tradeoff point.

```
# openai/gpt-oss-20b should run on a single GPU
vllm serve openai/gpt-oss-20b --async-scheduling

# gpt-oss-120b will fit on a single H100/H200, but scaling it to higher TP sizes can help with throughput
vllm serve openai/gpt-oss-120b --async-scheduling
@@ -74,19 +74,19 @@ NVIDIA Blackwell requires installation of [FlashInfer library](https://github.co
uv pip install vllm[flashinfer]==0.10.1 --torch-backend=auto
```

We recommend TP=1 as a starting point for a performant option. We are actively working on the performance of vLLM on Blackwell.

```
# Pick only one of the two MoE implementations
# bf16 activation for MoE, matching reference precision (default)
export VLLM_USE_FLASHINFER_MXFP4_BF16_MOE=1
# mxfp8 activation for MoE: faster, but with a higher risk of accuracy loss
export VLLM_USE_FLASHINFER_MXFP4_MOE=1

# openai/gpt-oss-20b
vllm serve openai/gpt-oss-20b --async-scheduling

# gpt-oss-120b
vllm serve openai/gpt-oss-120b --async-scheduling
vllm serve openai/gpt-oss-120b --tensor-parallel-size 2 --async-scheduling
vllm serve openai/gpt-oss-120b --tensor-parallel-size 4 --async-scheduling
@@ -96,8 +96,8 @@ vllm serve openai/gpt-oss-120b --tensor-parallel-size 4 --async-scheduling
ROCm supports the OpenAI gpt-oss-120b and gpt-oss-20b models on these three GPU families on day one, along with pre-built Docker containers:

* gfx950: MI350x series, `rocm/vllm-dev:open-mi355-08052025`
* gfx942: MI300x/MI325 series, `rocm/vllm-dev:open-mi300-08052025`
* gfx1201: Radeon AI PRO R9700, `rocm/vllm-dev:open-r9700-08052025`

To run the container:
@@ -115,7 +115,7 @@ export VLLM_ROCM_USE_AITER=1
export VLLM_USE_AITER_UNIFIED_ATTENTION=1
export VLLM_ROCM_USE_AITER_MHA=0

vllm serve openai/gpt-oss-120b --compilation-config '{"full_cuda_graph": true}'
```

For MI355x:
@@ -130,22 +130,22 @@ export VLLM_USE_AITER_UNIFIED_ATTENTION=1
export VLLM_ROCM_USE_AITER_MHA=0
export TRITON_HIP_PRESHUFFLE_SCALES=1

vllm serve openai/gpt-oss-120b --compilation-config '{"compile_sizes": [1, 2, 4, 8, 16, 24, 32, 64, 128, 256, 4096, 8192], "full_cuda_graph": true}' --block-size 64
```

#### Known Issues
- If you encounter the error `The link interface of target "torch::nvtoolsext" contains: CUDA::nvToolsExt but the target was not found.`, double-check that your PyTorch version has the `+cu128` suffix.
- If the output you see is garbage, it might be because `CUDA_HOME` is not set properly. The CUDA version needs to be greater than or equal to 12.8 and must be the same for installation and serving.

## Usage

Once `vllm serve` is running and `INFO: Application startup complete` has been displayed, you can send requests via HTTP or the OpenAI SDK to the following endpoints:

* `/v1/responses` endpoint can perform tool use (browsing, python, mcp) in between chain-of-thought and deliver a final response. This endpoint leverages the `openai-harmony` library for input rendering and output parsing. Stateful operation and the full streaming API are work in progress. The Responses API is recommended by OpenAI as the way to interact with this model.
* `/v1/chat/completions` endpoint offers a familiar interface to this model. No tool will be invoked, but reasoning and final text output are returned structurally. Function calling is work in progress. You can also set `include_reasoning: false` in the request parameters to exclude the CoT from the output.
* `/v1/completions` endpoint is a simple input/output interface without any template rendering.

All endpoints accept `stream: true` to enable incremental token streaming. Please note that vLLM currently does not cover the full scope of the Responses API; for more detail, please see the Limitations section below.

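For illustration, here is a minimal sketch of querying these endpoints with the OpenAI Python SDK, assuming the server was started with one of the commands above and is listening on vLLM's default port 8000. The prompts are placeholders, and passing `include_reasoning` through `extra_body` is an assumption about how this request parameter is forwarded by the SDK:

```python
# Minimal sketch: query a local vLLM gpt-oss server with the OpenAI SDK.
from openai import OpenAI

# vLLM does not require an API key unless one is configured at serve time.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Responses API (recommended by OpenAI for this model).
resp = client.responses.create(
    model="openai/gpt-oss-120b",
    input="Explain attention sinks in two sentences.",
)
print(resp.output_text)

# Chat Completions: reasoning and final text are returned structurally.
stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "What is MXFP4 quantization?"}],
    extra_body={"include_reasoning": False},  # assumption: skip CoT in the output, as described above
    stream=True,                              # `stream: true` is accepted by all endpoints
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```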
### Tool Use

@@ -159,8 +159,8 @@ uv pip install gpt-oss
vllm serve ... --tool-server demo
```

* Please note that the default options are simply for demo purposes. For production usage, vLLM itself can act as an MCP client to multiple services.
  Here is an [example tool server](https://github.com/openai/gpt-oss/tree/main/gpt-oss-mcp-server) that vLLM can work with; these servers wrap the demo tools:

```
mcp run -t sse browser_server.py:mcp
@@ -169,7 +169,33 @@ mcp run -t sse python_server.py:mcp
vllm serve ... --tool-server ip-1:port-1,ip-2:port-2
```

The URLs are expected to be MCP SSE servers that implement `instructions` in their server info and expose well-documented tools. The tools will be injected into the system prompt so the model can use them.

GPT-OSS also expects a built-in tool called `container`. It does not yet have an exposed tool type in the OpenAI types.
For reference, the container tool is a stateful Docker container that can be used to run command-line tools.
The enabled tool namespace is `container`, and the tool name used most often is `exec`.
The MCP server needs to implement the following function to support the container tool:
```
- tool name: exec
- args:
    {
      "cmd": List[str]               # command to execute
      "workdir": Optional[str]       # current working directory
      "env": Optional[dict]          # environment variables
      "session_name": Optional[str]  # session name
      "timeout": Optional[int]       # timeout in seconds
      "user": Optional[str]          # user name
    }
- signature:
    async def exec(ctx: Context, ...) -> str   # "..." stands for the args listed above
- ctx is expected to contain a session id that identifies the container session and makes it stateful
```
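For illustration, here is a minimal sketch of such a container MCP server built with the FastMCP helper from the MCP Python SDK. The Docker image, the per-session `docker exec` bookkeeping, and the use of `ctx.request_id` as a fallback session key are assumptions made for the sketch, not part of vLLM or gpt-oss; the SWE-ReX runtime linked below is a more complete reference.

```python
# Hypothetical `container` MCP server exposing an `exec` tool (sketch only).
import asyncio
from typing import Optional

from mcp.server.fastmcp import Context, FastMCP

mcp = FastMCP("container", instructions="Stateful container for running command line tools.")

IMAGE = "python:3.12-slim"        # placeholder image; pick one with the tools your agent needs
_sessions: dict[str, str] = {}    # session key -> docker container name


async def _ensure_container(session: str) -> str:
    """Start (or reuse) a long-lived container for this session."""
    if session not in _sessions:
        name = f"gpt-oss-{session}"
        proc = await asyncio.create_subprocess_exec(
            "docker", "run", "-d", "--name", name, IMAGE, "sleep", "infinity"
        )
        await proc.wait()
        _sessions[session] = name
    return _sessions[session]


@mcp.tool()
async def exec(
    ctx: Context,
    cmd: list[str],
    workdir: Optional[str] = None,
    env: Optional[dict] = None,
    session_name: Optional[str] = None,
    timeout: Optional[int] = None,
    user: Optional[str] = None,
) -> str:
    """Execute a command inside the session's container and return its combined output."""
    # Assumption: fall back to the request context as the session key when no name is given.
    session = session_name or str(ctx.request_id)
    container = await _ensure_container(session)
    args = ["docker", "exec"]
    if workdir:
        args += ["-w", workdir]
    if user:
        args += ["-u", user]
    for key, value in (env or {}).items():
        args += ["-e", f"{key}={value}"]
    proc = await asyncio.create_subprocess_exec(
        *args, container, *cmd,
        stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.STDOUT,
    )
    out, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    return out.decode()


if __name__ == "__main__":
    # Served the same way as the demo tools above, e.g. `mcp run -t sse container_server.py:mcp`.
    mcp.run(transport="sse")
```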
A reference implementation of the container tool runtime can be found at https://github.com/SWE-agent/SWE-ReX.
The Docker image may need to provide features similar to those Codex supports.
To enable the container tool in vLLM before the OpenAI types include it, set:
```
export VLLM_ENABLE_CONTAINER_TOOL=1
```

## Accuracy Evaluation Panels

@@ -233,15 +259,15 @@ vllm serve openai/gpt-oss-120b --gpu-memory-utilization 0.95 --max-num-batched-t
* Streaming is fairly barebones at the moment. For example:
  * Item ids and indexing need more work.
  * Tool invocations and their output are not properly streamed, but rather batched.
  * Proper error handling is missing.

## Troubleshooting

- Attention sink dtype error on Blackwell:

```
ERROR 08-05 07:31:10 [multiproc_executor.py:559] assert sinks.dtype == torch.float32, "Sinks must be of type float32"
**(VllmWorker TP0 pid=174579)** ERROR 08-05 07:31:10 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**(VllmWorker TP0 pid=174579)** ERROR 08-05 07:31:10 [multiproc_executor.py:559] AssertionError: Sinks must be of type float32
```
