Labels
Speculative Decoding, bug, waiting for feedback
Description
System Info
Using nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1 on an 8×B200 machine.
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Start the official TensorRT-LLM docker container

```shell
docker run --privileged -itd --ipc=host \
  -v /tmp:/tmp --gpus=all --network=host \
  --name=${CONTAINER_NAME} \
  --entrypoint="bash" \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1
```
- Download weights from HF

```shell
HF_MODEL_PATH="openai/gpt-oss-120b"
hf download ${HF_MODEL_PATH} --local-dir=/tmp/${HF_MODEL_PATH}
HF_MODEL_PATH="nvidia/gpt-oss-120b-Eagle3"
hf download ${HF_MODEL_PATH} --local-dir=/tmp/${HF_MODEL_PATH}
```
- Create the runtime config and start the server

```shell
cat > gpt-oss-120b-eagle3-trtllm-bs-256-xgrammar.yaml << 'EOF'
trust_remote_code: true
kv_cache_config:
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.8
speculative_config:
  decoding_type: Eagle
  max_draft_len: 3
  speculative_model_dir: /tmp/nvidia/gpt-oss-120b-Eagle3
cuda_graph_config:
  max_batch_size: 256
moe_config:
  backend: TRTLLM
guided_decoding_backend: xgrammar
EOF

config_file_name=gpt-oss-120b-eagle3-trtllm-bs-256-xgrammar.yaml
port=30000
tp=8
ep=4
export TRTLLM_ENABLE_PDL=1
trtllm-serve /tmp/openai/gpt-oss-120b --host 0.0.0.0 --port ${port} \
  --max_batch_size 256 --tp_size ${tp} --ep_size ${ep} --trust_remote_code \
  --extra_llm_api_options ${config_file_name} \
  --max_num_tokens 131072 --max_seq_len 131072
```
- Install gpt_oss and run the evals below

```shell
python -m gpt_oss.evals --model openai/gpt-oss-120b --eval aime25 --base-url http://0.0.0.0:30000/v1/ --sampler=chat_completions --n-threads 64 --reasoning-effort high
python -m gpt_oss.evals --model openai/gpt-oss-120b --eval gpqa --base-url http://0.0.0.0:30000/v1/ --sampler=chat_completions --n-threads 64 --reasoning-effort high
```
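Before kicking off the full evals, it can help to confirm the endpoint answers a single chat request. A minimal sketch against the server started above, using only the stdlib; the payload field set mirrors the OpenAI chat-completions schema and is an assumption about what the `chat_completions` sampler sends:

```python
import json
import urllib.request

BASE_URL = "http://0.0.0.0:30000/v1"  # matches the --port 30000 used above


def build_chat_request(prompt: str) -> dict:
    """Minimal OpenAI-style chat-completions payload (field set is an assumption)."""
    return {
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "temperature": 1.0,
    }


def send(payload: dict) -> dict:
    """POST the payload to the local trtllm-serve endpoint and parse the JSON reply."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# With the server up:
#   reply = send(build_chat_request("What is 2 + 2?"))
#   print(reply["choices"][0]["message"]["content"])
```

If this round-trips, any score gap is more likely in sampling/config than in basic serving.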
Expected behavior
Based on Artificial Analysis, we expect roughly 93% on AIME25 and 78% on GPQA.
Actual behavior
However, we observe 80.42% on AIME25 and 72.60% on GPQA.
AIME25

```
Writing report to /tmp/aime25_openai__gpt-oss-120b-high_temp1.0_20251125_201315.html
{'chars': 1798.6375, 'chars:std': 1430.1271089523534, 'score': 0.8041666666666667, 'score:std': 0.3968408231128558}
Writing results to /tmp/aime25_openai__gpt-oss-120b-high_temp1.0_20251125_201315.json
Writing all results to /tmp/aime25_openai__gpt-oss-120b-high_temp1.0_20251125_201315_allresults.json
[{'eval_name': 'aime25', 'model_name': 'openai__gpt-oss-120b-high_temp1.0_20251125_201315', 'metric': 0.8041666666666667}]
```
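As a rough sanity check on whether the AIME gap could be sampling noise, a binomial standard-error estimate; the attempt count n=240 is an assumption (0.80417 ≈ 193/240), not something the log states:

```python
import math


def binomial_stderr(score: float, n: int) -> float:
    """Standard error of a pass rate estimated from n graded attempts."""
    return math.sqrt(score * (1.0 - score) / n)


observed = 0.8041666666666667  # AIME25 score from the log above
expected = 0.93                # Artificial Analysis reference number
n = 240                        # assumed attempt count (not stated in the log)

se = binomial_stderr(observed, n)
print(f"stderr ~= {se:.4f}; gap = {(expected - observed) / se:.1f} sigma")
```

Under that assumption the gap is several standard errors wide, so it looks systematic rather than run-to-run noise.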
GPQA

```
Writing report to /tmp/gpqa_openai__gpt-oss-120b-high_temp1.0_20251125_221731.html
{'chars': 483.42171717171715, 'chars:std': 10694.679383758872, 'score': 0.726010101010101, 'score:std': 0.44600385002979953}
Writing results to /tmp/gpqa_openai__gpt-oss-120b-high_temp1.0_20251125_221731.json
Writing all results to /tmp/gpqa_openai__gpt-oss-120b-high_temp1.0_20251125_221731_allresults.json
[{'eval_name': 'gpqa', 'model_name': 'openai__gpt-oss-120b-high_temp1.0_20251125_221731', 'metric': 0.726010101010101}]
```
Additional notes
- What is the expected range of eval scores for aime25, gpqa, and if_bench for gpt-oss running on TensorRT-LLM with one-model (Eagle head) speculative decoding?
- If the expected range of 93% for AIME and 78% for GPQA is reachable, could you share a configuration that reproduces those scores with speculative decoding enabled?
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.