
[Bug]: Inconsistent Inference Results Between Eagle3 Speculative Decoding and Standard Inference #8285


Description

@0xd8b

System Info

  • TensorRT-LLM Version: 1.1.0rc4, 1.1.0rc5, and 1.2.0rc0
  • Model: Qwen3-14B
  • Hardware: CUDA-enabled GPU
  • Backend: HuggingFace backend with TensorRT-LLM serve

Who can help?

@pathorn @syuoni @nvzhou

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Setup Details

Service Configuration

Port 8000 (Eagle3 Enabled):

export CUDA_VISIBLE_DEVICES=1
trtllm-serve /data/algorithm/david/model/Qwen3-14B/ \
    --host 0.0.0.0 \
    --port 8000 \
    --trust_remote_code \
    --kv_cache_free_gpu_memory_fraction 0.9 \
    --log_level info \
    --extra_llm_api_options eagle3_config.yaml

Port 8001 (Standard Inference):

export CUDA_VISIBLE_DEVICES=0
trtllm-serve /data/algorithm/david/model/Qwen3-14B/ \
    --host 0.0.0.0 \
    --port 8001 \
    --trust_remote_code \
    --kv_cache_free_gpu_memory_fraction 0.9 \
    --log_level info \
    --extra_llm_api_options debug_config_compare.yaml

Eagle3 Configuration (eagle3_config.yaml)

speculative_config:
  decoding_type: Eagle
  max_draft_len: 8
  max_concurrency: 32
  speculative_model_dir: /data/algorithm/david/model/Qwen3-14B_eagle3/
  eagle3_one_model: true

enable_attention_dp: false
enable_chunked_prefill: true
sampler_type: TorchSampler
return_perf_metrics: true
print_iter_log: false

Standard Configuration (debug_config_compare.yaml)

enable_attention_dp: false
enable_chunked_prefill: true
sampler_type: TorchSampler
return_perf_metrics: true
print_iter_log: false

Issue Details

Test Parameters

  • Model: "Qwen3-14B"
  • Temperature: 0 (deterministic)
  • Top_p: 1.0
  • Seed: 12345 (fixed)
  • Stream: True
  • Prompt: "tell me a story in 1000 words"
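For reference, the identical request body sent to both ports (matching the parameters above) can be written as a plain JSON payload for any HTTP client; `chat_template_kwargs` is merged into the top level of the body, which is what the comparison script in this report does via `extra_body`:

```python
import json

# The identical request body sent to both ports; only the target port differs.
payload = {
    "model": "Qwen3-14B",
    "messages": [{"role": "user", "content": "tell me a story in 1000 words"}],
    "chat_template_kwargs": {"enable_thinking": False},
    "temperature": 0,
    "top_p": 1.0,
    "seed": 12345,
    "stream": True,
}

# POST this body to http://localhost:<port>/v1/chat/completions
# for port 8000 (Eagle3) and port 8001 (standard inference).
body = json.dumps(payload)
print(body)
```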

Observed Behavior

  1. Initial Consistency: Both services generate identical content for the first several paragraphs
  2. Divergence Point: After approximately 1000-1500 characters, the outputs start to diverge
  3. Different Lengths: Final outputs have significantly different lengths
    • Eagle3 (Port 8000): 4380 characters
    • Standard (Port 8001): 5537 characters

Example Test Results

Common Beginning (Both Outputs):

Once upon a time, in a quiet village nestled between emerald hills and whispering pines, there lived a young girl named Elara. She was known for her curious mind and her love for stories. Every evening, she would sit by the fire with her grandmother, who would spin tales of dragons, lost cities, and stars that whispered secrets to those who listened closely.

One night, as the fire crackled and the stars shimmered above, Elara asked her grandmother, "What is the most magical place in the world?"

Her grandmother smiled, her eyes twinkling like the stars. "There is a place called the Vale of Echoes. It is hidden deep in the heart of the forest, where time moves differently. Some say it is a place where dreams come to life, and others say it is a trap for those who lose their way."

Divergence Point:
After the initial setup, the stories take completely different narrative paths:

  • Eagle3 version focuses on self-reflection and inner journey
  • Standard version continues as a fantasy adventure introducing a character named "Kael"
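The divergence point can be located mechanically rather than by eye. A minimal stdlib-only sketch; the two string literals are stand-ins for the captured responses (e.g. `result_8000['content']` and `result_8001['content']` from the comparison script in this report):

```python
import os.path

def first_divergence(a: str, b: str) -> int:
    """Return the index of the first differing character, or -1 if identical."""
    if a == b:
        return -1
    # commonprefix does a plain character-wise comparison, which is what we want here.
    return len(os.path.commonprefix([a, b]))

# Stand-in outputs; replace with the two captured responses.
eagle3_out = "Her grandmother smiled... Elara reflected on her own journey."
standard_out = "Her grandmother smiled... Elara set out with a traveler."
idx = first_divergence(eagle3_out, standard_out)
print(f"Outputs diverge at character {idx}:")
print("Eagle3:  ", eagle3_out[idx:idx + 40])
print("Standard:", standard_out[idx:idx + 40])
```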

Performance Differences

  • Eagle3 (Port 8000): 7.610s total, 0.162s TTFT
  • Standard (Port 8001): 11.645s total, 0.034s TTFT

Expected Behavior

With identical parameters (temperature=0, fixed seed), both configurations should produce identical outputs regardless of the underlying acceleration method. Speculative decoding should only affect inference speed, not the actual generation results.
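For context on why this can still diverge (a hypothesis, not a confirmed diagnosis of this report): greedy decoding is only bit-deterministic if the logits are bit-identical, and the speculative verification pass batches tokens differently from step-by-step decoding, so floating-point reduction order can differ between the two paths. When two logits are nearly tied, a perturbation on the order of 1e-7 is enough to flip the argmax, and every token after the flip compounds the divergence. A toy illustration with made-up numbers:

```python
# Hypothetical logits for the same position, differing only by a tiny
# floating-point perturbation (e.g. a different summation order).
logits_standard = [2.5000001, 2.4999999, 1.0]
logits_speculative = [2.4999999, 2.5000001, 1.0]

# temperature=0 means pure argmax over the logits.
pick_standard = max(range(3), key=logits_standard.__getitem__)
pick_speculative = max(range(3), key=logits_speculative.__getitem__)

# The two runs select different tokens at this position, after which
# the generations take entirely different paths.
print(pick_standard, pick_speculative)
```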

Reproduction Steps

  1. Start both services with the configurations provided above
  2. Send identical requests with fixed temperature and seed
  3. Compare the generated outputs
  4. Observe divergence in content after initial paragraphs

Test Script

A comparison script is available that demonstrates the issue by sending identical requests to both ports and highlighting the differences.

Comparison Script (compare_port.py)

import openai
import time
from typing import Dict, List

def test_port(port: int, test_name: str) -> Dict:
    """测试指定端口的API并返回结果"""
    client = openai.OpenAI(
        base_url=f"http://localhost:{port}/v1",
        api_key=""
    )
    
    # 记录开始时间
    start_time = time.time()
    print(f"\n{'='*60}")
    print(f"{test_name} - 端口 {port}")
    print(f"{'='*60}")
    print(f"开始时间: {time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(start_time))}")
    
    try:
        response = client.chat.completions.create(
            model="Qwen3-14B",
            messages=[
                {"role": "user", "content": "tell me a story in 1000 words"}
            ],
            extra_body={
                "chat_template_kwargs": {"enable_thinking": False}
            },
            temperature=0,
            top_p=1.0,
            stream=True,
            seed=12345
        )
        
        # Process the streaming response
        print("Streaming output started:")
        first_token_time = None
        usage_info = None
        content_tokens = []
        
        for chunk in response:
            # Check whether this chunk carries usage info
            if hasattr(chunk, 'usage') and chunk.usage is not None:
                usage_info = chunk.usage
            
            if chunk.choices[0].delta.content is not None:
                if first_token_time is None:
                    first_token_time = time.time()
                content = chunk.choices[0].delta.content
                content_tokens.append(content)
                print(content, end='', flush=True)
        
        end_time = time.time()
        total_time = end_time - start_time
        ttft = first_token_time - start_time if first_token_time else 0
        
        full_content = ''.join(content_tokens)
        
        print(f"\n\n流式输出完成!")
        print(f"结束时间: {time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(end_time))}")
        print(f"总耗时: {total_time:.3f}秒")
        print(f"首token延迟 (TTFT): {ttft:.3f}秒")
        
        if usage_info:
            print(f"\nToken使用情况:")
            print(f"输入token数量: {usage_info.prompt_tokens}")
            print(f"输出token数量: {usage_info.completion_tokens}")
            print(f"总token数量: {usage_info.total_tokens}")
        
        return {
            'port': port,
            'success': True,
            'content': full_content,
            'total_time': total_time,
            'ttft': ttft,
            'usage': usage_info,
            'error': None
        }
        
    except Exception as e:
        print(f"\n错误: {str(e)}")
        return {
            'port': port,
            'success': False,
            'content': None,
            'total_time': None,
            'ttft': None,
            'usage': None,
            'error': str(e)
        }

def compare_results(result1: Dict, result2: Dict):
    """比较两个端口的结果"""
    print(f"\n{'='*60}")
    print("比较结果")
    print(f"{'='*60}")
    
    # Check that both requests succeeded
    if not result1['success']:
        print(f"⚠️  Port {result1['port']} request failed: {result1['error']}")
    if not result2['success']:
        print(f"⚠️  Port {result2['port']} request failed: {result2['error']}")
    
    if not result1['success'] or not result2['success']:
        print("\nCannot perform a full comparison because at least one request failed")
        return
    
    # Compare content
    print(f"\n1. Content comparison:")
    if result1['content'] == result2['content']:
        print(f"   ✅ Content is identical")
        print(f"   Content length: {len(result1['content'])} characters")
    else:
        print(f"   ❌ Content differs")
        print(f"   Port {result1['port']} content length: {len(result1['content'])} characters")
        print(f"   Port {result2['port']} content length: {len(result2['content'])} characters")
        
        # Show the first 500 characters from each port
        print(f"\n   Port {result1['port']} first 500 characters:")
        print(f"   {result1['content'][:500]}")
        print(f"\n   Port {result2['port']} first 500 characters:")
        print(f"   {result2['content'][:500]}")
    
    # Compare performance
    print(f"\n2. Performance comparison:")
    print(f"   Port {result1['port']}:")
    print(f"      Total time: {result1['total_time']:.3f}s")
    print(f"      TTFT: {result1['ttft']:.3f}s")
    print(f"   Port {result2['port']}:")
    print(f"      Total time: {result2['total_time']:.3f}s")
    print(f"      TTFT: {result2['ttft']:.3f}s")
    
    time_diff = abs(result1['total_time'] - result2['total_time'])
    ttft_diff = abs(result1['ttft'] - result2['ttft'])
    print(f"\n   Time differences:")
    print(f"      Total time difference: {time_diff:.3f}s")
    print(f"      TTFT difference: {ttft_diff:.3f}s")
    
    # Compare token usage
    print(f"\n3. Token usage comparison:")
    if result1['usage'] and result2['usage']:
        print(f"   Port {result1['port']}:")
        print(f"      Prompt: {result1['usage'].prompt_tokens}, Completion: {result1['usage'].completion_tokens}, Total: {result1['usage'].total_tokens}")
        print(f"   Port {result2['port']}:")
        print(f"      Prompt: {result2['usage'].prompt_tokens}, Completion: {result2['usage'].completion_tokens}, Total: {result2['usage'].total_tokens}")
        
        if (result1['usage'].prompt_tokens == result2['usage'].prompt_tokens and
            result1['usage'].completion_tokens == result2['usage'].completion_tokens):
            print(f"   ✅ Token usage is identical")
        else:
            print(f"   ❌ Token usage differs")
    else:
        print(f"   ⚠️  At least one port did not return usage info")
    
    # Summary
    print(f"\n{'='*60}")
    print("Summary:")
    is_identical = (result1['content'] == result2['content'] and
                    result1['usage'] and result2['usage'] and
                    result1['usage'].completion_tokens == result2['usage'].completion_tokens)
    
    if is_identical:
        print("✅ The two ports produced identical outputs (same content and token counts)")
    else:
        print("❌ The two ports produced different outputs")
    print(f"{'='*60}\n")

if __name__ == "__main__":
    print("开始比较端口 8000 和 8001 的输出结果...\n")
    
    # 测试8000端口 (Eagle3)
    result_8000 = test_port(8000, "Eagle3测试")
    
    # 等待一小段时间
    time.sleep(1)
    
    # 测试8001端口 (Standard)
    result_8001 = test_port(8001, "标准推理测试")
    
    # 比较结果
    compare_results(result_8000, result_8001)

Usage

python compare_port.py

Expected behavior

Eagle3 speculative decoding should produce identical results to standard inference when using the same sampling parameters and random seed, ensuring that the acceleration technique does not compromise output determinism.

Actual behavior

After the initial setup, the stories take completely different narrative paths:

  • Eagle3 version focuses on self-reflection and inner journey
  • Standard version continues as a fantasy adventure introducing a character named "Kael"

Additional notes

None.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Assignees

None

Labels

Investigating · Speculative Decoding<NV> (MTP/Eagle/Medusa/Lookahead/Prompt-Lookup-Decoding/Draft-Target-Model/ReDrafter) · not a bug (Some known limitation, but not a bug.)
