[Bug]: 【trtllm-serve】Unexpected TTFT Behavior: Longer Input Shows Lower Time-to-First-Token #9470

@kurosakiharachan

Description

System Info

  • TensorRT-LLM Version: 1.2.0rc2
  • GPU: A100-80GB-PCIe
  • Driver/CUDA Version: 580.82.09 / CUDA 13.0
  • Model: Qwen2.5-7B-Instruct FP16

Who can help?

@AdamzNV

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I observed unexpected Time-to-First-Token (TTFT) behavior when benchmarking TensorRT-LLM serving with different input lengths. Contrary to theoretical expectations, the shorter input sequence produced a higher TTFT than the longer one.

Server Command:

trtllm-serve ./engine/build_Qwen2.5-7B-Instruct_fp16_kvfp16_tp1_b1_in16383_out16383 \
--tokenizer=/home/qwen/Qwen2.5-7B-Instruct \
--backend=tensorrt \
--max_batch_size=1 \
--max_num_tokens=16383 \
--port=8099

Test Case 1: 128 input tokens, 16256 output tokens

python3 -m tensorrt_llm.serve.scripts.benchmark_serving \
    --model=/home/qwen/Qwen2.5-7B-Instruct \
    --backend=openai \
    --dataset-name=random \
    --random-prefix-len=26 \
    --random-input-len=102 \
    --random-output-len=16256 \
    --percentile-metrics=ttft,tpot,itl,e2el \
    --num-prompts=1 \
    --ignore-eos \
    --random-ids \
    --tokenize-on-client \
    --seed=16 \
    --port=8099

Results:
Mean TTFT: 36.62 ms
Total input tokens: 128
Total output tokens: 16256

Complete results:

============ Serving Benchmark Result ============
Total requests:                          1         
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  178.72    
Total input tokens:                      128       
Total generated tokens:                  16256     
Request throughput (req/s):              0.01      
Output token throughput (tok/s):         90.96     
Total Token throughput (tok/s):          91.68     
User throughput (tok/s):                 90.96     
Avg Decoded Tokens per Iter:             1.00      
---------------Time to First Token----------------
Mean TTFT (ms):                          36.62     
Median TTFT (ms):                        36.62     
P99 TTFT (ms):                           36.62     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.99     
Median TPOT (ms):                        10.99     
P99 TPOT (ms):                           10.99     
---------------Inter-token Latency----------------
Mean ITL (ms):                           10.99     
Median ITL (ms):                         10.97     
P99 ITL (ms):                            11.69     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          178712.62 
Median E2EL (ms):                        178712.62 
P99 E2EL (ms):                           178712.62 
==================================================

Test Case 2: 1024 input tokens, 15360 output tokens

python3 -m tensorrt_llm.serve.scripts.benchmark_serving \
    --model=/home/qwen/Qwen2.5-7B-Instruct \
    --backend=openai \
    --dataset-name=random \
    --random-prefix-len=205 \
    --random-input-len=819 \
    --random-output-len=15360 \
    --percentile-metrics=ttft,tpot,itl,e2el \
    --num-prompts=1 \
    --ignore-eos \
    --random-ids \
    --tokenize-on-client \
    --seed=19 \
    --port=8099

Results:
Mean TTFT: 27.86 ms
Total input tokens: 1024
Total output tokens: 15360

Complete results:

============ Serving Benchmark Result ============
Total requests:                          1         
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  169.32    
Total input tokens:                      1024      
Total generated tokens:                  15360     
Request throughput (req/s):              0.01      
Output token throughput (tok/s):         90.71     
Total Token throughput (tok/s):          96.76     
User throughput (tok/s):                 90.72     
Avg Decoded Tokens per Iter:             1.00      
---------------Time to First Token----------------
Mean TTFT (ms):                          27.86     
Median TTFT (ms):                        27.86     
P99 TTFT (ms):                           27.86     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.02     
Median TPOT (ms):                        11.02     
P99 TPOT (ms):                           11.02     
---------------Inter-token Latency----------------
Mean ITL (ms):                           11.02     
Median ITL (ms):                         10.98     
P99 ITL (ms):                            11.97     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          169321.43 
Median E2EL (ms):                        169321.43 
P99 E2EL (ms):                           169321.43 
==================================================

Expected behavior

Theoretically, shorter input sequences should result in lower TTFT, since:

  • Less computation is required to process the input tokens
  • Fewer attention operations are needed in the prefill phase
  • The KV cache initialized for the prompt has a smaller memory footprint
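As a rough back-of-the-envelope check of that expectation, the sketch below estimates prefill cost for the two prompt lengths. The model dimensions are assumptions (approximate Qwen2.5-7B values: hidden size 3584, 28 layers, taken from the public model card, not from the engine config in this issue), and the FLOP constants are deliberately crude; the point is only that the 1024-token prefill should cost several times more compute than the 128-token one, so its TTFT would be expected to be higher, not lower.

```python
# Rough prefill-cost scaling sketch. Dimensions are ASSUMED approximate
# Qwen2.5-7B values (d=3584 hidden size, 28 layers), not read from the engine.
def prefill_flops(n, d=3584, layers=28):
    """Very rough per-request FLOP estimate for the prefill phase:
    O(n * d^2) for the linear/MLP projections plus O(n^2 * d) for attention."""
    linear = 8 * n * d * d    # QKV/output/MLP projections (crude constant)
    attention = 2 * n * n * d  # QK^T scores and attention-weighted V
    return layers * (linear + attention)

short = prefill_flops(128)
long_ = prefill_flops(1024)
print(f"128-token prefill:  {short:.2e} FLOPs")
print(f"1024-token prefill: {long_:.2e} FLOPs ({long_ / short:.1f}x)")
```

Under these assumptions the 1024-token prefill costs roughly 8-9x the compute of the 128-token prefill, which makes the observed inversion of TTFT all the more surprising.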

Actual behavior

The shorter input (128 tokens) shows a higher TTFT (36.62 ms) than the longer input (1024 tokens, 27.86 ms), approximately 31% higher.
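For reference, the 31% figure follows directly from the two measured TTFT values above:

```python
# Relative TTFT difference between the two runs reported above.
ttft_short_input = 36.62  # ms, 128-token prompt (Test Case 1)
ttft_long_input = 27.86   # ms, 1024-token prompt (Test Case 2)

relative_increase = (ttft_short_input - ttft_long_input) / ttft_long_input
print(f"{relative_increase:.1%}")  # -> 31.4%
```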

Additional notes

  • Both tests use the same engine configuration
  • Tensor parallelism is TP=1 in both cases
  • The max-tokens setting (16383) is sufficient for both test cases
  • --ignore-eos is enabled so both runs generate their full requested output length

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels

  • Inference runtime<NV>: General operational aspects of TRTLLM execution not in other categories.
  • bug: Something isn't working
