
[Bug]: GPT-OSS-120B One Spec Path - Quality Eval Issue #9603

@harrisonlimh

Description

System Info

Using nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1 on an 8x B200 machine

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Start the official TensorRT-LLM Docker container
docker run --privileged -itd --ipc=host \
-v /tmp:/tmp --gpus=all --network=host \
--name=${CONTAINER_NAME} \
--entrypoint="bash" \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1
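The container runs detached, so the remaining steps are executed from a shell inside it (using the same ${CONTAINER_NAME} as above), for example:
docker exec -it ${CONTAINER_NAME} bash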
  2. Download weights from HF
HF_MODEL_PATH="openai/gpt-oss-120b"
hf download ${HF_MODEL_PATH} --local-dir=/tmp/${HF_MODEL_PATH}
HF_MODEL_PATH="nvidia/gpt-oss-120b-Eagle3"
hf download ${HF_MODEL_PATH} --local-dir=/tmp/${HF_MODEL_PATH}
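Optionally, verify that both checkpoints landed where the config below expects them:
ls /tmp/openai/gpt-oss-120b /tmp/nvidia/gpt-oss-120b-Eagle3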
  3. Create the runtime config and start the server
cat > gpt-oss-120b-eagle3-trtllm-bs-256-xgrammar.yaml  << 'EOF'
trust_remote_code: true
kv_cache_config:
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.8
speculative_config:
  decoding_type: Eagle
  max_draft_len: 3
  speculative_model_dir: /tmp/nvidia/gpt-oss-120b-Eagle3
cuda_graph_config:
  max_batch_size: 256
moe_config:
  backend: TRTLLM
guided_decoding_backend: xgrammar
EOF

config_file_name=gpt-oss-120b-eagle3-trtllm-bs-256-xgrammar.yaml 
port=30000
tp=8
ep=4

export TRTLLM_ENABLE_PDL=1
trtllm-serve /tmp/openai/gpt-oss-120b \
  --host 0.0.0.0 --port ${port} \
  --max_batch_size 256 \
  --tp_size ${tp} --ep_size ${ep} \
  --trust_remote_code \
  --extra_llm_api_options ${config_file_name} \
  --max_num_tokens 131072 --max_seq_len 131072
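
Once the server is up, a quick sanity request against the OpenAI-compatible endpoint (the same one the evals below target) confirms it is serving:
curl http://0.0.0.0:${port}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'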
  4. Install gpt_oss and run the evals below
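One way to install the eval harness (a sketch; the exact package name and extras are assumptions, and installing directly from the openai/gpt-oss repository should also work):
pip install gpt-oss
# or from source:
pip install "git+https://github.com/openai/gpt-oss.git"
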
python -m gpt_oss.evals --model openai/gpt-oss-120b --eval aime25 --base-url http://0.0.0.0:30000/v1/ --sampler=chat_completions --n-threads 64 --reasoning-effort high 

python -m gpt_oss.evals --model openai/gpt-oss-120b --eval gpqa --base-url http://0.0.0.0:30000/v1/ --sampler=chat_completions --n-threads 64 --reasoning-effort high 

Expected behavior

Based on Artificial Analysis, we expect roughly 93% on AIME 25 and 78% on GPQA.

actual behavior

However, we are observing 80.42% for AIME 25 and 72.60% for GPQA (logs below).

AIME 25
Writing report to /tmp/aime25_openai__gpt-oss-120b-high_temp1.0_20251125_201315.html
{'chars': 1798.6375, 'chars:std': 1430.1271089523534, 'score': 0.8041666666666667, 'score:std': 0.3968408231128558}
Writing results to /tmp/aime25_openai__gpt-oss-120b-high_temp1.0_20251125_201315.json
Writing all results to /tmp/aime25_openai__gpt-oss-120b-high_temp1.0_20251125_201315_allresults.json
[{'eval_name': 'aime25', 'model_name': 'openai__gpt-oss-120b-high_temp1.0_20251125_201315', 'metric': 0.8041666666666667}]

GPQA
Writing report to /tmp/gpqa_openai__gpt-oss-120b-high_temp1.0_20251125_221731.html
{'chars': 483.42171717171715, 'chars:std': 10694.679383758872, 'score': 0.726010101010101, 'score:std': 0.44600385002979953}
Writing results to /tmp/gpqa_openai__gpt-oss-120b-high_temp1.0_20251125_221731.json
Writing all results to /tmp/gpqa_openai__gpt-oss-120b-high_temp1.0_20251125_221731_allresults.json
[{'eval_name': 'gpqa', 'model_name': 'openai__gpt-oss-120b-high_temp1.0_20251125_221731', 'metric': 0.726010101010101}]

additional notes

  • What is the expected range of eval scores (aime25, gpqa, if_bench) for gpt-oss running on TensorRT-LLM with one-model (Eagle3 head) speculative decoding?
  • If the expected range of 93% for AIME and 78% for GPQA is reachable, could you share configurations that reproduce those scores with one-model speculative decoding?

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Labels

  • Speculative Decoding (<NV>MTP/Eagle/Medusa/Lookahead/Prompt-Lookup-Decoding/Draft-Target-Model/ReDrafter)
  • bug (Something isn't working)
  • waiting for feedback
