Description
System Info
- TensorRT-LLM Version: 1.1.0rc4 & 1.1.0rc5 & v1.2.0rc0
- Model: Qwen3-14B
- Hardware: CUDA-enabled GPU
- Backend: HuggingFace backend with TensorRT-LLM serve
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Setup Details
Service Configuration
Port 8000 (Eagle3 Enabled):
export CUDA_VISIBLE_DEVICES=1
trtllm-serve /data/algorithm/david/model/Qwen3-14B/ \
--host 0.0.0.0 \
--port 8000 \
--trust_remote_code \
--kv_cache_free_gpu_memory_fraction 0.9 \
--log_level info \
--extra_llm_api_options eagle3_config.yaml
Port 8001 (Standard Inference):
export CUDA_VISIBLE_DEVICES=0
trtllm-serve /data/algorithm/david/model/Qwen3-14B/ \
--host 0.0.0.0 \
--port 8001 \
--trust_remote_code \
--kv_cache_free_gpu_memory_fraction 0.9 \
--log_level info \
--extra_llm_api_options debug_config_compare.yaml
Eagle3 Configuration (eagle3_config.yaml)
speculative_config:
  decoding_type: Eagle
  max_draft_len: 8
  max_concurrency: 32
  speculative_model_dir: /data/algorithm/david/model/Qwen3-14B_eagle3/
  eagle3_one_model: true
enable_attention_dp: false
enable_chunked_prefill: True
sampler_type: TorchSampler
return_perf_metrics: true
print_iter_log: False
Standard Configuration (debug_config_compare.yaml)
enable_attention_dp: false
enable_chunked_prefill: True
sampler_type: TorchSampler
return_perf_metrics: true
print_iter_log: False
Issue Details
Test Parameters
- Model: "Qwen3-14B"
- Temperature: 0 (deterministic)
- Top_p: 1.0
- Seed: 12345 (fixed)
- Stream: True
- Prompt: "tell me a story in 1000 words"
Observed Behavior
- Initial Consistency: Both services generate identical content for the first several paragraphs
- Divergence Point: After approximately 1000-1500 characters, the outputs start to diverge
- Different Lengths: Final outputs have significantly different lengths
- Eagle3 (Port 8000): 4380 characters
- Standard (Port 8001): 5537 characters
Example Test Results
Common Beginning (Both Outputs):
Once upon a time, in a quiet village nestled between emerald hills and whispering pines, there lived a young girl named Elara. She was known for her curious mind and her love for stories. Every evening, she would sit by the fire with her grandmother, who would spin tales of dragons, lost cities, and stars that whispered secrets to those who listened closely.
One night, as the fire crackled and the stars shimmered above, Elara asked her grandmother, "What is the most magical place in the world?"
Her grandmother smiled, her eyes twinkling like the stars. "There is a place called the Vale of Echoes. It is hidden deep in the heart of the forest, where time moves differently. Some say it is a place where dreams come to life, and others say it is a trap for those who lose their way."
Divergence Point:
After the initial setup, the stories take completely different narrative paths:
- Eagle3 version focuses on self-reflection and inner journey
- Standard version continues with a fantasy adventure with character "Kael"
Performance Differences
- Eagle3 (Port 8000): 7.610s total, 0.162s TTFT
- Standard (Port 8001): 11.645s total, 0.034s TTFT
Expected Behavior
With identical parameters (temperature=0, fixed seed), both configurations should produce identical outputs regardless of the underlying acceleration method. Speculative decoding should only affect inference speed, not the actual generation results.
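This expectation can be illustrated with a toy sketch of greedy speculative decoding (purely illustrative, not TensorRT-LLM internals; all names here are invented): the draft model proposes tokens, the target model verifies them, and every emitted token is still the target model's argmax, so the speculative trajectory should match plain greedy decoding token for token.

```python
import hashlib

def make_model(salt):
    """Toy deterministic 'model': maps a token prefix to scores over an
    8-token vocabulary (hypothetical stand-in for real model logits)."""
    def model(prefix):
        h = hashlib.md5((salt + "|".join(map(str, prefix))).encode()).digest()
        return {t: h[t] for t in range(8)}
    return model

def greedy_step(model, prefix):
    """Argmax next token for `prefix` under `model`."""
    scores = model(tuple(prefix))
    return max(scores, key=scores.get)

def speculative_greedy(target, draft, prompt, max_draft_len, num_tokens):
    """Greedy speculative decoding: draft proposes, target verifies.
    Every appended token is the target's own argmax, so the result
    must match plain greedy decoding of the target model."""
    out = list(prompt)
    while len(out) - len(prompt) < num_tokens:
        # Draft proposes up to max_draft_len tokens greedily.
        proposal = []
        for _ in range(max_draft_len):
            proposal.append(greedy_step(draft, out + proposal))
        # Target verifies: always emit its own argmax; stop at first mismatch.
        for tok in proposal:
            expected = greedy_step(target, out)
            out.append(expected)
            if tok != expected:
                break
    return out

target, draft = make_model("target"), make_model("draft")
spec = speculative_greedy(target, draft, [0], max_draft_len=8, num_tokens=32)

# Plain greedy decoding of the target model for comparison.
plain = [0]
while len(plain) - 1 < 32:
    plain.append(greedy_step(target, plain))

assert spec[:len(plain)] == plain  # lossless under greedy decoding
```

In this idealized setting the two trajectories agree exactly, which is why the observed divergence suggests numerical differences (e.g. batching or kernel nondeterminism) rather than the speculative algorithm itself.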
Reproduction Steps
- Start both services with the configurations provided above
- Send identical requests with fixed temperature and seed
- Compare the generated outputs
- Observe divergence in content after initial paragraphs
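To pinpoint exactly where the two transcripts split, a small helper (the function name is ours, not part of the repro script) can report the first differing character index and print context around it:

```python
def first_divergence(a: str, b: str) -> int:
    """Return the index of the first differing character between a and b,
    or -1 if the strings are identical. If one string is a strict prefix
    of the other, the divergence index is the shorter length."""
    if a == b:
        return -1
    for i in range(min(len(a), len(b))):
        if a[i] != b[i]:
            return i
    return min(len(a), len(b))

# Example with made-up outputs standing in for the two captured transcripts.
eagle3_out = "Once upon a time, the hero set out alone."
standard_out = "Once upon a time, the hero met Kael."
i = first_divergence(eagle3_out, standard_out)
print(f"first divergence at char {i}")  # → first divergence at char 27
print("eagle3  :", eagle3_out[max(0, i - 40):i + 40])
print("standard:", standard_out[max(0, i - 40):i + 40])
```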
Test Script
A comparison script is available that demonstrates the issue by sending identical requests to both ports and highlighting the differences.
Comparison Script (compare_port.py)
import openai
import time
from typing import Dict

def test_port(port: int, test_name: str) -> Dict:
    """Query the API on the given port and return the results."""
    client = openai.OpenAI(
        base_url=f"http://localhost:{port}/v1",
        api_key=""
    )
    # Record the start time
    start_time = time.time()
    print(f"\n{'='*60}")
    print(f"{test_name} - port {port}")
    print(f"{'='*60}")
    print(f"Start time: {time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(start_time))}")
    try:
        response = client.chat.completions.create(
            model="Qwen3-14B",
            messages=[
                {"role": "user", "content": "tell me a story in 1000 words"}
            ],
            extra_body={
                "chat_template_kwargs": {"enable_thinking": False}
            },
            temperature=0,
            top_p=1.0,
            stream=True,
            seed=12345
        )
        # Consume the streaming response
        print("Streaming output:")
        first_token_time = None
        usage_info = None
        content_tokens = []
        for chunk in response:
            # Check whether the chunk carries usage information
            if hasattr(chunk, 'usage') and chunk.usage is not None:
                usage_info = chunk.usage
            if chunk.choices[0].delta.content is not None:
                if first_token_time is None:
                    first_token_time = time.time()
                content = chunk.choices[0].delta.content
                content_tokens.append(content)
                print(content, end='', flush=True)
        end_time = time.time()
        total_time = end_time - start_time
        ttft = first_token_time - start_time if first_token_time else 0
        full_content = ''.join(content_tokens)
        print(f"\n\nStreaming finished!")
        print(f"End time: {time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(end_time))}")
        print(f"Total time: {total_time:.3f}s")
        print(f"Time to first token (TTFT): {ttft:.3f}s")
        if usage_info:
            print(f"\nToken usage:")
            print(f"Prompt tokens: {usage_info.prompt_tokens}")
            print(f"Completion tokens: {usage_info.completion_tokens}")
            print(f"Total tokens: {usage_info.total_tokens}")
        return {
            'port': port,
            'success': True,
            'content': full_content,
            'total_time': total_time,
            'ttft': ttft,
            'usage': usage_info,
            'error': None
        }
    except Exception as e:
        print(f"\nError: {str(e)}")
        return {
            'port': port,
            'success': False,
            'content': None,
            'total_time': None,
            'ttft': None,
            'usage': None,
            'error': str(e)
        }

def compare_results(result1: Dict, result2: Dict):
    """Compare the results from the two ports."""
    print(f"\n{'='*60}")
    print("Comparison")
    print(f"{'='*60}")
    # Check that both requests succeeded
    if not result1['success']:
        print(f"⚠️ Request to port {result1['port']} failed: {result1['error']}")
    if not result2['success']:
        print(f"⚠️ Request to port {result2['port']} failed: {result2['error']}")
    if not result1['success'] or not result2['success']:
        print("\nCannot run the full comparison because at least one request failed")
        return
    # Compare the generated content
    print(f"\n1. Content comparison:")
    if result1['content'] == result2['content']:
        print(f"   ✅ Content is identical")
        print(f"   Content length: {len(result1['content'])} characters")
    else:
        print(f"   ❌ Content differs")
        print(f"   Port {result1['port']} content length: {len(result1['content'])} characters")
        print(f"   Port {result2['port']} content length: {len(result2['content'])} characters")
        # Show the first 500 characters from each port
        print(f"\n   Port {result1['port']} first 500 characters:")
        print(f"   {result1['content'][:500]}")
        print(f"\n   Port {result2['port']} first 500 characters:")
        print(f"   {result2['content'][:500]}")
    # Compare performance
    print(f"\n2. Performance comparison:")
    print(f"   Port {result1['port']}:")
    print(f"     Total time: {result1['total_time']:.3f}s")
    print(f"     TTFT: {result1['ttft']:.3f}s")
    print(f"   Port {result2['port']}:")
    print(f"     Total time: {result2['total_time']:.3f}s")
    print(f"     TTFT: {result2['ttft']:.3f}s")
    time_diff = abs(result1['total_time'] - result2['total_time'])
    ttft_diff = abs(result1['ttft'] - result2['ttft'])
    print(f"\n   Time deltas:")
    print(f"     Total time delta: {time_diff:.3f}s")
    print(f"     TTFT delta: {ttft_diff:.3f}s")
    # Compare token usage
    print(f"\n3. Token usage comparison:")
    if result1['usage'] and result2['usage']:
        print(f"   Port {result1['port']}:")
        print(f"     Prompt: {result1['usage'].prompt_tokens}, Completion: {result1['usage'].completion_tokens}, Total: {result1['usage'].total_tokens}")
        print(f"   Port {result2['port']}:")
        print(f"     Prompt: {result2['usage'].prompt_tokens}, Completion: {result2['usage'].completion_tokens}, Total: {result2['usage'].total_tokens}")
        if (result1['usage'].prompt_tokens == result2['usage'].prompt_tokens and
                result1['usage'].completion_tokens == result2['usage'].completion_tokens):
            print(f"   ✅ Token usage is identical")
        else:
            print(f"   ❌ Token usage differs")
    else:
        print(f"   ⚠️ At least one port did not return usage information")
    # Summary
    print(f"\n{'='*60}")
    print("Summary:")
    is_identical = (result1['content'] == result2['content'] and
                    result1['usage'] and result2['usage'] and
                    result1['usage'].completion_tokens == result2['usage'].completion_tokens)
    if is_identical:
        print("✅ The two ports produced identical output (same content and token counts)")
    else:
        print("❌ The two ports produced different output")
    print(f"{'='*60}\n")

if __name__ == "__main__":
    print("Comparing the output of ports 8000 and 8001...\n")
    # Test port 8000 (Eagle3)
    result_8000 = test_port(8000, "Eagle3 test")
    # Short pause between requests
    time.sleep(1)
    # Test port 8001 (Standard)
    result_8001 = test_port(8001, "Standard inference test")
    # Compare the results
    compare_results(result_8000, result_8001)
Usage
python compare_port.py
Expected behavior
Eagle3 speculative decoding should produce identical results to standard inference when using the same sampling parameters and random seed, ensuring that the acceleration technique does not compromise output determinism.
Actual behavior
After the initial setup, the stories take completely different narrative paths:
- Eagle3 version focuses on self-reflection and inner journey
- Standard version continues with a fantasy adventure with character "Kael"
Additional notes
None.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.