[Eagle3] Accuracy Issue #5791

@xq25478

Description

System Info

  • CPU Architecture: x86_64
  • GPU Properties:
    • Name: H100
    • Memory: 80GB
  • TensorRT-LLM Branch: v1.0.0rc1
  • Versions: CUDA 12.8

Who can help?

@kaiyux

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  • My Script
export TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
base_model=/mnt/modelops/models/Qwen3-30B-A3B/Qwen__Qwen3-30B-A3B
eagle_model=/mnt/modelops/models/qwen3_30b_moe_eagle3_8b1ecfe164a4efb3

max_batch_size=32
max_num_tokens=32768
tp_size=8

cat <<EOF > extra_llm_api_options.yaml
kv_cache_config:
  enable_block_reuse: False
  free_gpu_memory_fraction: 0.85
use_cuda_graph: True
cuda_graph_max_batch_size: 1
cuda_graph_padding_enabled: True
attn_backend: TRTLLM
enable_iter_perf_stats: False
enable_iter_req_stats: False
print_iter_log: False
enable_chunked_prefill: False
disable_overlap_scheduler: False
dtype: auto
# EAGLE-3
speculative_config:
    decoding_type: Eagle
    max_draft_len: 4
    pytorch_weights_path: ${eagle_model}

#
EOF

trtllm-serve \
    ${base_model} \
    --host 127.0.0.1 \
    --port 9122 \
    --tp_size ${tp_size} \
    --max_batch_size ${max_batch_size} \
    --max_num_tokens ${max_num_tokens} \
    --kv_cache_free_gpu_memory_fraction 0.85 \
    --log_level info \
    --trust_remote_code \
    --backend pytorch \
    --extra_llm_api_options extra_llm_api_options.yaml
  • curl command
curl http://127.0.0.1:9122/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "test",
        "messages":[{"role": "user", "content": "Python bubble sort code."}],
        "max_tokens": 128,
        "stream":false
    }'
  • bad case
{"id":"chatcmpl-18967b0c68294089bd6188da8b8eac77","object":"chat.completion","created":1751878126,"model":"test","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n[…]\nOkay,颧, I控制 the user is askingDTV for Python bubble[method for bubble \nOkay, dri, Iyre, I need \n\n\nOkay,licted, the user ///</span> \n\nOkay, vk, the user站 is asking|\n\nOkay,一个职业, the \n\nOkay,ANTA, the站 is|\n\nOkay \nOkay, Tomas, the cott, theiris, the \n\nOkay,[method for bubble奶粉, \n\nOkay ///</span> \n\nOkay站 is|\n\nOkay \nOkay,iris, \n\nOkay,OAD, \n\nOkay站 is|\n\nOkay \n\nOkay,[method for \n\nOkay站 is","reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":13,"total_tokens":137,"completion_tokens":124},"prompt_token_ids":null}

Expected behavior

right output:
{"id":"chatcmpl-eb60fe9f6ea44a079a0affc9b63044a8","object":"chat.completion","created":1751879316,"model":"test","choices":[{"index":0,"message":{"role":"assistant","content":"\nOkay, I need to write a Python code for bubble sort. Let me think about how bubble sort works. From what I remember, bubble sort is a simple sorting algorithm that repeatedly steps through the list, compares adjacent elements, and swaps them if they are in the wrong order. The pass through the list is repeated until the list is sorted.\n\nSo the basic steps are: iterate through the list, compare each pair of adjacent elements, and swap them if they are not in the correct order. Each pass moves the largest unsorted element to its correct position at the end of the list. That's why it's called bubble sort","reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":13,"total_tokens":141,"completion_tokens":128},"prompt_token_ids":null}

actual behavior

{"id":"chatcmpl-18967b0c68294089bd6188da8b8eac77","object":"chat.completion","created":1751878126,"model":"test","choices":[{"index":0,"message":{"role":"assistant","content":"\n[…]\nOkay,颧, I控制 the user is askingDTV for Python bubble[method for bubble \nOkay, dri, Iyre, I need \n\n\nOkay,licted, the user /// \n\nOkay, vk, the user站 is asking|\n\nOkay,一个职业, the \n\nOkay,ANTA, the站 is|\n\nOkay \nOkay, Tomas, the cott, theiris, the \n\nOkay,[method for bubble奶粉, \n\nOkay /// \n\nOkay站 is|\n\nOkay \nOkay,iris, \n\nOkay,OAD, \n\nOkay站 is|\n\nOkay \n\nOkay,[method for \n\nOkay站 is","reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":13,"total_tokens":137,"completion_tokens":124},"prompt_token_ids":null}

additional notes

none

Metadata

Labels

  • Investigating
  • Speculative Decoding<NV>: MTP/Eagle/Medusa/Lookahead/Prompt-Lookup-Decoding/Draft-Target-Model/ReDrafter
  • Testing<NV>: Continuous integration, build system, and testing infrastructure issues
  • bug: Something isn't working
  • triaged: Issue has been triaged by maintainers
  • waiting for feedback
