Commit 84008ad

[gpt-oss] Fix command for running eval (#24)
Signed-off-by: Chen Zhang <[email protected]>
1 parent 8e2ab81 commit 84008ad

1 file changed, +12 -9 lines changed

OpenAI/GPT-OSS.md

Lines changed: 12 additions & 9 deletions
@@ -173,15 +173,9 @@ The URLs are expected to be MCP SSE servers that implement `instructions` in ser
 
 ## Accuracy Evaluation Panels
 
-OpenAI recommends using the gpt-oss reference library to perform evaluation. For example,
+OpenAI recommends using the gpt-oss reference library to perform evaluation.
 
-```
-python -m gpt_oss.evals --model 120b-low --eval gpqa --n-threads 128
-python -m gpt_oss.evals --model 120b --eval gpqa --n-threads 128
-python -m gpt_oss.evals --model 120b-high --eval gpqa --n-threads 128
-```
-To eval on AIME2025, change `gpqa` to `aime25`.
-With vLLM deployed:
+First, deploy the model with vLLM:
 
 ```
 # Example deployment on 8xH100
@@ -194,9 +188,18 @@ vllm serve openai/gpt-oss-120b \
 --no-enable-prefix-caching
 ```
 
+Then, run the evaluation with gpt-oss. The following command will run all the 3 reasoning effort levels.
+
+```
+mkdir -p /tmp/gpqa_openai
+OPENAI_API_KEY=empty python -m gpt_oss.evals --model openai/gpt-oss-120b --eval gpqa --n-threads 128
+```
+
+To eval on AIME2025, change `gpqa` to `aime25`.
+
 Here is the score we were able to reproduce without tool use, and we encourage you to try reproducing it as well!
 We’ve observed that the numbers may vary slightly across runs, so feel free to run the evaluation multiple times to get a sense of the variance.
-For a quick correctness check, we recommend starting with the low reasoning effort setting (120b-low), which should complete within minutes.
+For a quick correctness check, we recommend starting with the low reasoning effort setting (`--reasoning-effort low`), which should complete within minutes.
 
 Model: 120B
 
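The updated instructions in this diff describe running `gpqa` and then swapping in `aime25`. A minimal shell sketch of that two-benchmark workflow might look like the following. It reuses only the flags shown in the commit; the `eval_cmd` helper and the `RUN` switch are hypothetical names introduced here for illustration, not part of gpt-oss. It assumes the vLLM server from the deployment step is already running, and by default it only prints the commands rather than executing them.

```shell
#!/bin/sh
# Sketch: print (or, with RUN=1, execute) the eval command from the diff
# for each benchmark it mentions. MODEL and THREADS mirror the commit.
MODEL="openai/gpt-oss-120b"
THREADS=128

# Hypothetical helper: build the command line for one benchmark name.
eval_cmd() {
  printf 'python -m gpt_oss.evals --model %s --eval %s --n-threads %s\n' \
    "$MODEL" "$1" "$THREADS"
}

for name in gpqa aime25; do
  cmd=$(eval_cmd "$name")
  if [ "${RUN:-0}" = "1" ]; then
    # Directory and placeholder API key are taken from the diff as-is.
    mkdir -p /tmp/gpqa_openai
    OPENAI_API_KEY=empty sh -c "$cmd"
  else
    # Dry run: show what would be executed.
    echo "$cmd"
  fi
done
```

Running the script with no environment set prints the two commands; setting `RUN=1` executes them in sequence against the deployed server.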