
[Bug]: GPT-OSS-120B One Spec Path - Quality Eval Issue #9603

@harrisonlimh

Description

System Info

Using nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1 on an 8x B200 machine

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Start the official TensorRT-LLM Docker container
docker run --privileged -itd --ipc=host \
-v /tmp:/tmp --gpus=all --network=host \
--name=${CONTAINER_NAME} \
--entrypoint="bash" \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1
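The container runs detached, so the remaining steps are executed from a shell inside it (using the same ${CONTAINER_NAME} as above), for example:
docker exec -it ${CONTAINER_NAME} bash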
  2. Download weights from HF
HF_MODEL_PATH="openai/gpt-oss-120b"
hf download ${HF_MODEL_PATH} --local-dir=/tmp/${HF_MODEL_PATH}
HF_MODEL_PATH="nvidia/gpt-oss-120b-Eagle3"
hf download ${HF_MODEL_PATH} --local-dir=/tmp/${HF_MODEL_PATH}
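Optionally, verify that both checkpoints landed where the config below expects them:
ls /tmp/openai/gpt-oss-120b /tmp/nvidia/gpt-oss-120b-Eagle3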
  3. Create the runtime config and start the server
cat > gpt-oss-120b-eagle3-trtllm-bs-256-xgrammar.yaml  << 'EOF'
trust_remote_code: true
kv_cache_config:
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.8
speculative_config:
  decoding_type: Eagle
  max_draft_len: 3
  speculative_model_dir: /tmp/nvidia/gpt-oss-120b-Eagle3
cuda_graph_config:
  max_batch_size: 256
moe_config:
  backend: TRTLLM
guided_decoding_backend: xgrammar
EOF

config_file_name=gpt-oss-120b-eagle3-trtllm-bs-256-xgrammar.yaml 
port=30000
tp=8
ep=4

export TRTLLM_ENABLE_PDL=1
trtllm-serve /tmp/openai/gpt-oss-120b \
  --host 0.0.0.0 --port ${port} \
  --max_batch_size 256 \
  --tp_size ${tp} --ep_size ${ep} \
  --trust_remote_code \
  --extra_llm_api_options ${config_file_name} \
  --max_num_tokens 131072 --max_seq_len 131072
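
Once the server is up, a quick sanity request against the OpenAI-compatible endpoint (the same one the evals below target) confirms it is serving:
curl http://0.0.0.0:${port}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'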
  4. Install gpt_oss and run the evals below
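One way to install the eval harness (a sketch; the exact package name and extras are assumptions, and installing directly from the openai/gpt-oss repository should also work):
pip install gpt-oss
# or from source:
pip install "git+https://github.com/openai/gpt-oss.git"
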
python -m gpt_oss.evals --model openai/gpt-oss-120b --eval aime25 --base-url http://0.0.0.0:30000/v1/ --sampler=chat_completions --n-threads 64 --reasoning-effort high 

python -m gpt_oss.evals --model openai/gpt-oss-120b --eval gpqa --base-url http://0.0.0.0:30000/v1/ --sampler=chat_completions --n-threads 64 --reasoning-effort high 

Expected behavior

Based on Artificial Analysis, we expect roughly 93% on AIME 25 and 78% on GPQA.

actual behavior

However, we are observing 80.42% for AIME 25 and 72.60% for GPQA (logs below).

AIME 25
Writing report to /tmp/aime25_openai__gpt-oss-120b-high_temp1.0_20251125_201315.html
{'chars': 1798.6375, 'chars:std': 1430.1271089523534, 'score': 0.8041666666666667, 'score:std': 0.3968408231128558}
Writing results to /tmp/aime25_openai__gpt-oss-120b-high_temp1.0_20251125_201315.json
Writing all results to /tmp/aime25_openai__gpt-oss-120b-high_temp1.0_20251125_201315_allresults.json
[{'eval_name': 'aime25', 'model_name': 'openai__gpt-oss-120b-high_temp1.0_20251125_201315', 'metric': 0.8041666666666667}]

GPQA
Writing report to /tmp/gpqa_openai__gpt-oss-120b-high_temp1.0_20251125_221731.html
{'chars': 483.42171717171715, 'chars:std': 10694.679383758872, 'score': 0.726010101010101, 'score:std': 0.44600385002979953}
Writing results to /tmp/gpqa_openai__gpt-oss-120b-high_temp1.0_20251125_221731.json
Writing all results to /tmp/gpqa_openai__gpt-oss-120b-high_temp1.0_20251125_221731_allresults.json
[{'eval_name': 'gpqa', 'model_name': 'openai__gpt-oss-120b-high_temp1.0_20251125_221731', 'metric': 0.726010101010101}]

additional notes

  • What is the expected range of eval scores (aime25, gpqa, if_bench) for gpt-oss running on TensorRT-LLM with one-model (Eagle3 head) speculative decoding?
  • If the expected range of 93% for AIME and 78% for GPQA is reachable, could you share configurations that reproduce those scores with one-model speculative decoding?

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Labels

  • Speculative Decoding (<NV>MTP/Eagle/Medusa/Lookahead/Prompt-Lookup-Decoding/Draft-Target-Model/ReDrafter)
  • bug (Something isn't working)
  • waiting for feedback
