## `gpt-oss` vLLM Usage Guide

`gpt-oss-20b` and `gpt-oss-120b` are powerful reasoning models open-sourced by OpenAI.
In vLLM, you can run them on NVIDIA H100, H200, and B200, as well as AMD MI300x, MI325x, MI355x, and Radeon AI PRO R9700.
We are actively working on ensuring these models also work on Ampere, Ada Lovelace, and RTX 5090.

Specifically, vLLM optimizes for the `gpt-oss` family of models with:

* **Flexible parallelism options**: the model can be sharded across 2, 4, or 8 GPUs, scaling throughput (see the sketch below).
* **High performance attention and MoE kernels**: the attention kernel is specifically optimized for the attention sinks mechanism and sliding window shapes.
* **Asynchronous scheduling**: optimizing for maximum utilization and high throughput by overlapping CPU operations with GPU operations.
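
As a hedged sketch of the parallelism options above, using vLLM's offline `LLM` API (the GPU count and sampling settings here are illustrative assumptions, not recommendations):

```
# Sketch: shard gpt-oss-120b across multiple GPUs with tensor parallelism.
# Assumes a node with 4 GPUs; tensor_parallel_size can be 2, 4, or 8.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",
    tensor_parallel_size=4,  # shard the model across 4 GPUs
)
params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain attention sinks in one paragraph."], params)
print(outputs[0].outputs[0].text)
```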

This is a living document and we welcome contributions, corrections, and the creation of new recipes!

## Quickstart

GPT-OSS works on Ampere devices by default, using the `TRITON_ATTN` attention backend.

```
# openai/gpt-oss-20b should run on a single A100
vllm serve openai/gpt-oss-20b --async-scheduling
# gpt-oss-120b will fit on a single A100 (80GB), but scaling it to higher TP sizes can help with throughput
vllm serve openai/gpt-oss-120b --async-scheduling
```

- When you encounter the error `The link interface of target "torch::nvtoolsext" contains: CUDA::nvToolsExt but the target was not found.`, please double-check that your PyTorch version has the suffix `+cu128`.
- If the output you see is garbage, that might be because you haven't properly set `CUDA_HOME`. The CUDA version needs to be greater than or equal to 12.8 and must be the same for installation and serving.
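
A minimal sketch to sanity-check both conditions above (assuming a standard PyTorch install):

```
# Verify the installed PyTorch build carries the +cu128 suffix and that
# its CUDA version matches the toolkit pointed to by CUDA_HOME.
import os
import torch

print(torch.__version__)    # expect something like 2.x.x+cu128
print(torch.version.cuda)   # expect 12.8 or newer
print(os.environ.get("CUDA_HOME", "CUDA_HOME is not set"))
```
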
## Usage

Once `vllm serve` is running and `INFO: Application startup complete` has been displayed, you can send requests over HTTP or via the OpenAI SDK to the following endpoints:

* `/v1/responses` endpoint can perform tool use (browsing, python, mcp) in between chain-of-thought and deliver a final response. This endpoint leverages the `openai-harmony` library for input rendering and output parsing. Stateful operation and the full streaming API are works in progress. The Responses API is recommended by OpenAI as the way to interact with this model.
* `/v1/chat/completions` endpoint offers a familiar interface to this model. No tools will be invoked, but reasoning and final text output will be returned structurally. Function calling is a work in progress. You can also set `include_reasoning: false` in the request parameters to keep the CoT out of the output.
* `/v1/completions` endpoint is a simple input-output interface without any sort of template rendering.

All endpoints accept `stream: true` in the request to enable incremental token streaming. Please note that vLLM currently does not cover the full scope of the Responses API; for more detail, please see the Limitations section below.
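
For example, a minimal streaming sketch with the OpenAI Python SDK (the base URL, API key placeholder, and model name are assumptions; match them to your `vllm serve` setup):

```
# Stream a chat completion from a local vLLM server; assumes the server
# from the Quickstart is listening on localhost:8000 serving gpt-oss-20b.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What is an attention sink?"}],
    stream=True,  # all three endpoints accept stream: true
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # reasoning-only chunks may carry no content
        print(delta, end="", flush=True)
```
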
### Tool Use

```
uv pip install gpt-oss
vllm serve ... --tool-server demo
```

* Please note that the default options are simply for demo purposes. For production usage, vLLM itself can act as an MCP client to multiple services.

Here is an [example tool server](https://github.com/openai/gpt-oss/tree/main/gpt-oss-mcp-server) that vLLM can work with; it wraps the demo tools:

```
mcp run -t sse browser_server.py:mcp
mcp run -t sse python_server.py:mcp
```

The URLs are expected to be MCP SSE servers that implement `instructions` in their server info and expose well-documented tools. The tools will be injected into the system prompt to enable them for the model.
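
As a hedged sketch of such a server, using `FastMCP` from the `mcp` Python SDK (the server name and `echo` tool are hypothetical placeholders):

```
# Minimal MCP SSE server: server-level instructions plus one documented tool.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP(
    "demo-tools",
    instructions="Use the echo tool to repeat text back to the user.",
)

@mcp.tool()
def echo(text: str) -> str:
    """Repeat the given text back verbatim."""
    return text

if __name__ == "__main__":
    mcp.run(transport="sse")  # serve over SSE, as vLLM expects
```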

GPT-OSS also expects a built-in tool called `container`; it does not have an exposed tool type in the OpenAI types.
For reference, the container tool is a stateful Docker container that can be used to run command-line tools.
The enabled tool namespace is `container`, and the tool name used the most is `exec`.
An MCP server needs to implement the following functions to support the container tool:

```
- for tool name: exec
  - args:
    {
      "cmd": List[str] "command to execute",
      "workdir": optional[str] "current working directory",
```
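
A hedged sketch of an MCP server exposing the `exec` tool described above (running commands via `subprocess` on the host is a simplifying assumption; the real tool is a stateful Docker container):

```
# Hypothetical sketch of the container `exec` tool as an MCP SSE server,
# covering only the cmd and workdir arguments documented above.
import subprocess
from typing import List, Optional

from mcp.server.fastmcp import FastMCP

mcp = FastMCP(
    "container",
    instructions="Execute command line tools inside the container.",
)

@mcp.tool()
def exec(cmd: List[str], workdir: Optional[str] = None) -> str:
    """Execute a command; `cmd` is the argv list, `workdir` the working directory."""
    result = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
    return result.stdout + result.stderr

if __name__ == "__main__":
    mcp.run(transport="sse")
```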