
[Bug]: Inconsistent Inference Results Between Eagle3 Speculative Decoding and Standard Inference #8285


Description

@0xd8b

System Info

  • TensorRT-LLM Version: 1.1.0rc4, 1.1.0rc5, and 1.2.0rc0
  • Model: Qwen3-14B
  • Hardware: CUDA-enabled GPU
  • Backend: HuggingFace backend with TensorRT-LLM serve

Who can help?

@pathorn @syuoni @nvzhou

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Setup Details

Service Configuration

Port 8000 (Eagle3 Enabled):

export CUDA_VISIBLE_DEVICES=1
trtllm-serve /data/algorithm/david/model/Qwen3-14B/ \
    --host 0.0.0.0 \
    --port 8000 \
    --trust_remote_code \
    --kv_cache_free_gpu_memory_fraction 0.9 \
    --log_level info \
    --extra_llm_api_options eagle3_config.yaml

Port 8001 (Standard Inference):

export CUDA_VISIBLE_DEVICES=0
trtllm-serve /data/algorithm/david/model/Qwen3-14B/ \
    --host 0.0.0.0 \
    --port 8001 \
    --trust_remote_code \
    --kv_cache_free_gpu_memory_fraction 0.9 \
    --log_level info \
    --extra_llm_api_options debug_config_compare.yaml

Eagle3 Configuration (eagle3_config.yaml)

speculative_config:
  decoding_type: Eagle
  max_draft_len: 8
  max_concurrency: 32
  speculative_model_dir: /data/algorithm/david/model/Qwen3-14B_eagle3/
  eagle3_one_model: true

enable_attention_dp: false
enable_chunked_prefill: true
sampler_type: TorchSampler
return_perf_metrics: true
print_iter_log: false

Standard Configuration (debug_config_compare.yaml)

enable_attention_dp: false
enable_chunked_prefill: true
sampler_type: TorchSampler
return_perf_metrics: true
print_iter_log: false

Issue Details

Test Parameters

  • Model: "Qwen3-14B"
  • Temperature: 0 (deterministic)
  • Top_p: 1.0
  • Seed: 12345 (fixed)
  • Stream: True
  • Prompt: "tell me a story in 1000 words"
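For reference, the identical request body sent to both ports (matching the parameters above) can be written as a plain JSON payload for any HTTP client; `chat_template_kwargs` is merged into the top level of the body, which is what the comparison script in this report does via `extra_body`:

```python
import json

# The identical request body sent to both ports; only the target port differs.
payload = {
    "model": "Qwen3-14B",
    "messages": [{"role": "user", "content": "tell me a story in 1000 words"}],
    "chat_template_kwargs": {"enable_thinking": False},
    "temperature": 0,
    "top_p": 1.0,
    "seed": 12345,
    "stream": True,
}

# POST this body to http://localhost:<port>/v1/chat/completions
# for port 8000 (Eagle3) and port 8001 (standard inference).
body = json.dumps(payload)
print(body)
```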

Observed Behavior

  1. Initial Consistency: Both services generate identical content for the first several paragraphs
  2. Divergence Point: After approximately 1000-1500 characters, the outputs start to diverge
  3. Different Lengths: Final outputs have significantly different lengths
    • Eagle3 (Port 8000): 4380 characters
    • Standard (Port 8001): 5537 characters

Example Test Results

Common Beginning (Both Outputs):

Once upon a time, in a quiet village nestled between emerald hills and whispering pines, there lived a young girl named Elara. She was known for her curious mind and her love for stories. Every evening, she would sit by the fire with her grandmother, who would spin tales of dragons, lost cities, and stars that whispered secrets to those who listened closely.

One night, as the fire crackled and the stars shimmered above, Elara asked her grandmother, "What is the most magical place in the world?"

Her grandmother smiled, her eyes twinkling like the stars. "There is a place called the Vale of Echoes. It is hidden deep in the heart of the forest, where time moves differently. Some say it is a place where dreams come to life, and others say it is a trap for those who lose their way."

Divergence Point:
After the initial setup, the stories take completely different narrative paths:

  • Eagle3 version focuses on self-reflection and inner journey
  • Standard version continues as a fantasy adventure introducing a character named "Kael"
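The divergence point can be located mechanically rather than by eye. A minimal stdlib-only sketch; the two string literals are stand-ins for the captured responses (e.g. `result_8000['content']` and `result_8001['content']` from the comparison script in this report):

```python
import os.path

def first_divergence(a: str, b: str) -> int:
    """Return the index of the first differing character, or -1 if identical."""
    if a == b:
        return -1
    # commonprefix does a plain character-wise comparison, which is what we want here.
    return len(os.path.commonprefix([a, b]))

# Stand-in outputs; replace with the two captured responses.
eagle3_out = "Her grandmother smiled... Elara reflected on her own journey."
standard_out = "Her grandmother smiled... Elara set out with a traveler."
idx = first_divergence(eagle3_out, standard_out)
print(f"Outputs diverge at character {idx}:")
print("Eagle3:  ", eagle3_out[idx:idx + 40])
print("Standard:", standard_out[idx:idx + 40])
```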

Performance Differences

  • Eagle3 (Port 8000): 7.610s total, 0.162s TTFT
  • Standard (Port 8001): 11.645s total, 0.034s TTFT

Expected Behavior

With identical parameters (temperature=0, fixed seed), both configurations should produce identical outputs regardless of the underlying acceleration method. Speculative decoding should only affect inference speed, not the actual generation results.
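For context on why this can still diverge (a hypothesis, not a confirmed diagnosis of this report): greedy decoding is only bit-deterministic if the logits are bit-identical, and the speculative verification pass batches tokens differently from step-by-step decoding, so floating-point reduction order can differ between the two paths. When two logits are nearly tied, a perturbation on the order of 1e-7 is enough to flip the argmax, and every token after the flip compounds the divergence. A toy illustration with made-up numbers:

```python
# Hypothetical logits for the same position, differing only by a tiny
# floating-point perturbation (e.g. a different summation order).
logits_standard = [2.5000001, 2.4999999, 1.0]
logits_speculative = [2.4999999, 2.5000001, 1.0]

# temperature=0 means pure argmax over the logits.
pick_standard = max(range(3), key=logits_standard.__getitem__)
pick_speculative = max(range(3), key=logits_speculative.__getitem__)

# The two runs select different tokens at this position, after which
# the generations take entirely different paths.
print(pick_standard, pick_speculative)
```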

Reproduction Steps

  1. Start both services with the configurations provided above
  2. Send identical requests with fixed temperature and seed
  3. Compare the generated outputs
  4. Observe divergence in content after initial paragraphs

Test Script

A comparison script is available that demonstrates the issue by sending identical requests to both ports and highlighting the differences.

Comparison Script (compare_port.py)

import openai
import time
from typing import Dict, List

def test_port(port: int, test_name: str) -> Dict:
    """测试指定端口的API并返回结果"""
    client = openai.OpenAI(
        base_url=f"http://localhost:{port}/v1",
        api_key=""
    )
    
    # 记录开始时间
    start_time = time.time()
    print(f"\n{'='*60}")
    print(f"{test_name} - 端口 {port}")
    print(f"{'='*60}")
    print(f"开始时间: {time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(start_time))}")
    
    try:
        response = client.chat.completions.create(
            model="Qwen3-14B",
            messages=[
                {"role": "user", "content": "tell me a story in 1000 words"}
            ],
            extra_body={
                "chat_template_kwargs": {"enable_thinking": False}
            },
            temperature=0,
            top_p=1.0,
            stream=True,
            seed=12345
        )
        
        # Process the streaming response
        print("Streaming output started:")
        first_token_time = None
        usage_info = None
        content_tokens = []
        
        for chunk in response:
            # Check whether this chunk carries usage info
            if hasattr(chunk, 'usage') and chunk.usage is not None:
                usage_info = chunk.usage
            
            if chunk.choices[0].delta.content is not None:
                if first_token_time is None:
                    first_token_time = time.time()
                content = chunk.choices[0].delta.content
                content_tokens.append(content)
                print(content, end='', flush=True)
        
        end_time = time.time()
        total_time = end_time - start_time
        ttft = first_token_time - start_time if first_token_time else 0
        
        full_content = ''.join(content_tokens)
        
        print(f"\n\n流式输出完成!")
        print(f"结束时间: {time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(end_time))}")
        print(f"总耗时: {total_time:.3f}秒")
        print(f"首token延迟 (TTFT): {ttft:.3f}秒")
        
        if usage_info:
            print(f"\nToken使用情况:")
            print(f"输入token数量: {usage_info.prompt_tokens}")
            print(f"输出token数量: {usage_info.completion_tokens}")
            print(f"总token数量: {usage_info.total_tokens}")
        
        return {
            'port': port,
            'success': True,
            'content': full_content,
            'total_time': total_time,
            'ttft': ttft,
            'usage': usage_info,
            'error': None
        }
        
    except Exception as e:
        print(f"\n错误: {str(e)}")
        return {
            'port': port,
            'success': False,
            'content': None,
            'total_time': None,
            'ttft': None,
            'usage': None,
            'error': str(e)
        }

def compare_results(result1: Dict, result2: Dict):
    """比较两个端口的结果"""
    print(f"\n{'='*60}")
    print("比较结果")
    print(f"{'='*60}")
    
    # Check that both requests succeeded
    if not result1['success']:
        print(f"⚠️  Port {result1['port']} request failed: {result1['error']}")
    if not result2['success']:
        print(f"⚠️  Port {result2['port']} request failed: {result2['error']}")
    
    if not result1['success'] or not result2['success']:
        print("\nCannot perform a full comparison because at least one request failed")
        return
    
    # Compare content
    print(f"\n1. Content comparison:")
    if result1['content'] == result2['content']:
        print(f"   ✅ Content is identical")
        print(f"   Content length: {len(result1['content'])} characters")
    else:
        print(f"   ❌ Content differs")
        print(f"   Port {result1['port']} content length: {len(result1['content'])} characters")
        print(f"   Port {result2['port']} content length: {len(result2['content'])} characters")
        
        # Show the first 500 characters from each port
        print(f"\n   Port {result1['port']} first 500 characters:")
        print(f"   {result1['content'][:500]}")
        print(f"\n   Port {result2['port']} first 500 characters:")
        print(f"   {result2['content'][:500]}")
    
    # Compare performance
    print(f"\n2. Performance comparison:")
    print(f"   Port {result1['port']}:")
    print(f"      Total time: {result1['total_time']:.3f}s")
    print(f"      TTFT: {result1['ttft']:.3f}s")
    print(f"   Port {result2['port']}:")
    print(f"      Total time: {result2['total_time']:.3f}s")
    print(f"      TTFT: {result2['ttft']:.3f}s")
    
    time_diff = abs(result1['total_time'] - result2['total_time'])
    ttft_diff = abs(result1['ttft'] - result2['ttft'])
    print(f"\n   Time differences:")
    print(f"      Total time difference: {time_diff:.3f}s")
    print(f"      TTFT difference: {ttft_diff:.3f}s")
    
    # Compare token usage
    print(f"\n3. Token usage comparison:")
    if result1['usage'] and result2['usage']:
        print(f"   Port {result1['port']}:")
        print(f"      Prompt: {result1['usage'].prompt_tokens}, Completion: {result1['usage'].completion_tokens}, Total: {result1['usage'].total_tokens}")
        print(f"   Port {result2['port']}:")
        print(f"      Prompt: {result2['usage'].prompt_tokens}, Completion: {result2['usage'].completion_tokens}, Total: {result2['usage'].total_tokens}")
        
        if (result1['usage'].prompt_tokens == result2['usage'].prompt_tokens and
            result1['usage'].completion_tokens == result2['usage'].completion_tokens):
            print(f"   ✅ Token usage is identical")
        else:
            print(f"   ❌ Token usage differs")
    else:
        print(f"   ⚠️  At least one port did not return usage info")
    
    # Summary
    print(f"\n{'='*60}")
    print("Summary:")
    is_identical = (result1['content'] == result2['content'] and
                    result1['usage'] and result2['usage'] and
                    result1['usage'].completion_tokens == result2['usage'].completion_tokens)
    
    if is_identical:
        print("✅ The two ports produced identical outputs (same content and token counts)")
    else:
        print("❌ The two ports produced different outputs")
    print(f"{'='*60}\n")

if __name__ == "__main__":
    print("开始比较端口 8000 和 8001 的输出结果...\n")
    
    # 测试8000端口 (Eagle3)
    result_8000 = test_port(8000, "Eagle3测试")
    
    # 等待一小段时间
    time.sleep(1)
    
    # 测试8001端口 (Standard)
    result_8001 = test_port(8001, "标准推理测试")
    
    # 比较结果
    compare_results(result_8000, result_8001)

Usage

python compare_port.py

Expected behavior

Eagle3 speculative decoding should produce identical results to standard inference when using the same sampling parameters and random seed, ensuring that the acceleration technique does not compromise output determinism.

Actual behavior

After the initial setup, the stories take completely different narrative paths:

  • Eagle3 version focuses on self-reflection and inner journey
  • Standard version continues as a fantasy adventure introducing a character named "Kael"

Additional notes

None.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Assignees

None

Labels

Investigating · Speculative Decoding<NV> (MTP/Eagle/Medusa/Lookahead/Prompt-Lookup-Decoding/Draft-Target-Model/ReDrafter) · not a bug (Some known limitation, but not a bug.)
