
[Bug]: Qwen3 + Eagle3 generates different result with greedy decoding compared to "no eagle version" #10309

@BigRabbit71


System Info

  • CPU: x86_64
  • GPU: NVIDIA H100
  • OS: Ubuntu 22.04
  • docker: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6
  • tensorrt_llm Version: 1.2.0rc6
  • tensorrt Version: 10.13.3.9
  • torch Version: 2.9.0a0+145a3a7bda.nv25.10
  • nvidia-smi: NVIDIA-SMI 580.65.06 Driver Version: 580.65.06 CUDA Version: 13.0
  • nvcc --version: (output was attached as an image in the original issue)

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Start the Docker container:
    sudo docker run -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all --name=trtllm nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 /bin/bash

  2. Download the models from Hugging Face:
    huggingface-cli download Qwen/Qwen3-4B --local-dir Qwen/Qwen3-4B
    huggingface-cli download zhuyksir/EAGLE3-Qwen3-4B-DenseHead --local-dir zhuyksir/EAGLE3-Qwen3-4B-DenseHead

  3. Launch the Qwen baseline and get the result.
    Launch with: trtllm-serve Qwen/Qwen3-4B/
    Test with:

curl -X POST "http://localhost:8000/v1/chat/completions" \
     -H "Content-Type: application/json" \
     -d '{
           "model": " Qwen3-4B",
           "messages": [{"role": "user", "content": "Tell me a long story"}],
           "max_tokens": 500,
           "temperature": 0
         }'
  4. Launch Qwen + Eagle3 and get the result.
    Launch with: trtllm-serve Qwen/Qwen3-4B/ --extra_llm_api_options zhuyksir/EAGLE3-Qwen3-4B-DenseHead/extra-llm-api-config.yml
    Contents of extra-llm-api-config.yml:
speculative_config:
    decoding_type: Eagle
    max_draft_len: 4
    speculative_model_dir:  zhuyksir/EAGLE3-Qwen3-4B-DenseHead/
    eagle3_one_model: true

Test with:

curl -X POST "http://localhost:8000/v1/chat/completions" \
     -H "Content-Type: application/json" \
     -d '{
           "model": " Qwen3-4B",
           "messages": [{"role": "user", "content": "Tell me a long story"}],
           "max_tokens": 500,
           "temperature": 0
         }'
  5. Compare the generated results.
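
The comparison in the last step can be scripted. The following is a minimal sketch, assuming the two response bodies were saved to files (the names `baseline.json` and `eagle3.json` are hypothetical, as are the helper names); it reports the first character at which the two `content` fields differ.

```python
import json
from typing import Optional


def first_divergence(a: str, b: str) -> Optional[int]:
    """Return the index of the first differing character, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # Either identical, or one string is a strict prefix of the other.
    return None if len(a) == len(b) else min(len(a), len(b))


def content_of(response_json: str) -> str:
    """Extract the assistant text from a /v1/chat/completions response body."""
    # strict=False tolerates raw newlines inside strings, which can appear
    # when a response is copied by hand from a terminal or a web page.
    body = json.loads(response_json, strict=False)
    return body["choices"][0]["message"]["content"]


def report(baseline_json: str, eagle_json: str, context: int = 40) -> str:
    """Describe where the two outputs diverge, with some surrounding text."""
    a, b = content_of(baseline_json), content_of(eagle_json)
    i = first_divergence(a, b)
    if i is None:
        return "outputs are identical"
    start = max(0, i - context)
    return (
        f"diverge at char {i}\n"
        f"baseline: {a[start:i + context]!r}\n"
        f"eagle3:   {b[start:i + context]!r}"
    )
```

Usage, under the same file-name assumption: `print(report(open("baseline.json").read(), open("eagle3.json").read()))`. Comparing characters rather than token IDs is a simplification, since the responses shown here do not include token IDs.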

Expected behavior

Eagle3 should NOT change the output when we do greedy decoding.
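
This expectation follows from how greedy verification works in speculative decoding in general: a draft token is kept only if it equals the target model's own argmax at that position. The toy sketch below illustrates the idea with deterministic stand-in functions (`target_next` and `draft_next` are hypothetical and this is not TensorRT-LLM code); it assumes the target produces the same argmax during verification as during plain decoding.

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]  # returns the greedy (argmax) next token


def greedy_decode(target_next: NextFn, prompt: List[Token], n: int) -> List[Token]:
    """Plain greedy decoding: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(target_next(seq))
    return seq[len(prompt):]


def speculative_greedy(target_next: NextFn, draft_next: NextFn,
                       prompt: List[Token], n: int,
                       max_draft_len: int = 4) -> List[Token]:
    """Greedy speculative decoding with exact-match verification."""
    seq = list(prompt)
    while len(seq) - len(prompt) < n:
        # 1. The draft model proposes up to max_draft_len tokens.
        draft: List[Token] = []
        for _ in range(max_draft_len):
            draft.append(draft_next(seq + draft))
        # 2. The target verifies them. A real engine would do this in one
        #    batched forward pass; here it is done position by position.
        accepted: List[Token] = []
        for tok in draft:
            expected = target_next(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: use the target token
                break
            accepted.append(tok)
        else:
            # All draft tokens accepted: the target adds one bonus token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n]


def target_next(seq: List[Token]) -> Token:
    """Toy deterministic 'target model'."""
    return (sum(seq) * 31 + len(seq)) % 17


def draft_next(seq: List[Token]) -> Token:
    """Toy 'draft model' that guesses right only some of the time."""
    return target_next(seq) if len(seq) % 3 else 0
```

In this sketch `speculative_greedy(target_next, draft_next, [1, 2, 3], 50)` returns the same tokens as `greedy_decode(target_next, [1, 2, 3], 50)`, whatever the draft proposes. In a real engine the guarantee additionally depends on the batched verification pass yielding the same argmax as token-by-token decoding.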

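Actual behavior

With Eagle3 enabled, greedy decoding produces output that differs from the baseline.

For example, the two outputs below are identical up to "Maybe Lira learns" and diverge from that point on (the divergent sentences were shown in bold in the original issue):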

  • Qwen baseline without Eagle output:
    {"id":"chatcmpl-xxxxx","object":"chat.completion","created":1766737434,"model":"Qwen3-4B","choices":[{"index":0,"message":{"role":"assistant","content":"\nOkay, the user wants a long story. Let me think about what kind of story would be engaging. Maybe something with a unique setting and characters. I should start with a strong opening to hook the reader. Maybe a fantasy or sci-fi element? Or perhaps a more realistic story with a twist.\n\nHmm, a fantasy story could be interesting. Let's set it in a world with some magical elements. Maybe a kingdom with a unique feature. Oh, what about a place where time is a currency? That's an original concept. Time as a resource... that could lead to interesting conflicts.\n\nSo, the main character could be someone who discovers this secret. Maybe a young person, like a girl named Lira. She's curious and adventurous. She finds a hidden village where time is traded. The villagers use time to grow food, create art, and even age. But there's a catch—time is finite, and the village is in danger of running out.\n\nConflict: The village is facing a crisis because their time supply is depleting. They need to find a way to replenish it. Lira has to go on a quest to find the source of time. Maybe there's a magical entity or a hidden place where time is stored. Along the way, she meets allies and faces challenges.\n\nThemes could include the value of time, the consequences of greed, and the importance of community. Maybe the villagers are hoarding time, leading to their downfall. Lira has to teach them to value time differently, perhaps by sharing it or finding a sustainable source.\n\nI need to build the world around this. The village's society, how they use time, the magic system. Maybe the time is drawn from a sacred place, like a crystal or a tree. The climax could involve a ritual or a battle to restore the balance. 
The resolution would be about harmony and understanding.\n\nI should make sure the story has a clear beginning, middle, and end. Develop the characters, show their growth, and include some emotional moments. Maybe Lira learns something about herself and the importance of time in her life. Also, include some descriptive details to make the world vivid.\n\nCheck for plot holes. How does time work in this world? What are the rules? Make sure the magic system is consistent. Also, the conflict needs to be resolved in a satisfying way. Maybe the villagers realize they need to use time wisely, not hoard it. The ending could be hopeful","reasoning_content":"","reasoning":null,"tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null,"mm_embedding_handle":null,"disaggregated_params":null,"avg_decoded_tokens_per_iter":1.0}],"usage":{"prompt_tokens":13,"total_tokens":513,"completion_tokens":500,"prompt_tokens_details":{"cached_tokens":0}},"prompt_token_ids":null}

  • Qwen + Eagle3 output:
    {"id":"chatcmpl-xxxxx","object":"chat.completion","created":1766737371,"model":"Qwen3-4B","choices":[{"index":0,"message":{"role":"assistant","content":"\nOkay, the user wants a long story. Let me think about what kind of story would be engaging. Maybe something with a unique setting and characters. I should start with a strong opening to hook the reader. Maybe a fantasy or sci-fi element? Or perhaps a more realistic story with a twist.\n\nHmm, a fantasy story could be interesting. Let's set it in a world with some magical elements. Maybe a kingdom with a unique feature. Oh, what about a place where time is a currency? That's an original concept. Time as a resource... that could lead to interesting conflicts.\n\nSo, the main character could be someone who discovers this secret. Maybe a young person, like a girl named Lira. She's curious and adventurous. She finds a hidden village where time is traded. The villagers use time to grow food, create art, and even age. But there's a catch—time is finite, and the village is in danger of running out.\n\nConflict: The village is facing a crisis because their time supply is depleting. They need to find a way to replenish it. Lira has to go on a quest to find the source of time. Maybe there's a magical entity or a hidden place where time is stored. Along the way, she meets allies and faces challenges.\n\nThemes could include the value of time, the consequences of greed, and the importance of community. Maybe the villagers are hoarding time, leading to their downfall. Lira has to teach them to value time differently, perhaps by sharing it or finding a sustainable source.\n\nI need to build the world around this. The village's society, how they use time, the magic system. Maybe the time is drawn from a sacred place, like a crystal or a tree. The climax could involve a ritual or a battle to restore the balance. 
The resolution would be about harmony and understanding.\n\nI should make sure the story has a clear beginning, middle, and end. Develop the characters, show their growth, and include some emotional moments. Maybe Lira learns to appreciate time more, and the village changes for the better. Add some suspense and adventure elements to keep it engaging.\n\nLet me outline the story structure. Start with Lira discovering the village, then the problem they face, her journey to find the solution, the challenges she encounters, and the resolution. Include some magical elements and maybe a mentor figure or a rival.\n\nI need to check for consistency in the world-building","reasoning_content":"","reasoning":null,"tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null,"mm_embedding_handle":null,"disaggregated_params":null,"avg_decoded_tokens_per_iter":1.5873016119003296}],"usage":{"prompt_tokens":13,"total_tokens":513,"completion_tokens":500,"prompt_tokens_details":{"cached_tokens":12}},"prompt_token_ids":null}

Additional notes

Not sure whether this is related to the Eagle-specific parallel verification kernels.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.


Labels

  • Decoding<NV>: Token sampling algorithms in TRTLLM for text gen (top-k, top-p, beam).
  • Speculative Decoding<NV>: MTP/Eagle/Medusa/Lookahead/Prompt-Lookup-Decoding/Draft-Target-Model/ReDrafter
  • bug: Something isn't working
