Description
System Info
- TensorRT-LLM Version: 1.2.0rc2
- GPU: A100-80GB-PCIe
- Driver/CUDA Version: Driver Version: 580.82.09 CUDA Version: 13.0
- Model: Qwen2.5-7B-Instruct FP16
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I observed an unexpected behavior in Time-to-First-Token (TTFT) measurements when benchmarking TensorRT-LLM serving with different input lengths. Contrary to theoretical expectations, shorter input sequences resulted in higher TTFT compared to longer input sequences.
Server Command:
trtllm-serve ./engine/build_Qwen2.5-7B-Instruct_fp16_kvfp16_tp1_b1_in16383_out16383 \
--tokenizer=/home/qwen/Qwen2.5-7B-Instruct \
--backend=tensorrt \
--max_batch_size=1 \
--max_num_tokens=16383 \
--port=8099
Test Case 1: 128 input tokens, 16256 output tokens
python3 -m tensorrt_llm.serve.scripts.benchmark_serving \
--model=/home/qwen/Qwen2.5-7B-Instruct \
--backend=openai \
--dataset-name=random \
--random-prefix-len=26 \
--random-input-len=102 \
--random-output-len=16256 \
--percentile-metrics=ttft,tpot,itl,e2el \
--num-prompts=1 \
--ignore-eos \
--random-ids \
--tokenize-on-client \
--seed=16 \
--port=8099
Results:
Mean TTFT: 36.62 ms
Total input tokens: 128
Total output tokens: 16256
Complete results:
============ Serving Benchmark Result ============
Total requests: 1
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 178.72
Total input tokens: 128
Total generated tokens: 16256
Request throughput (req/s): 0.01
Output token throughput (tok/s): 90.96
Total Token throughput (tok/s): 91.68
User throughput (tok/s): 90.96
Avg Decoded Tokens per Iter: 1.00
---------------Time to First Token----------------
Mean TTFT (ms): 36.62
Median TTFT (ms): 36.62
P99 TTFT (ms): 36.62
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 10.99
Median TPOT (ms): 10.99
P99 TPOT (ms): 10.99
---------------Inter-token Latency----------------
Mean ITL (ms): 10.99
Median ITL (ms): 10.97
P99 ITL (ms): 11.69
----------------End-to-end Latency----------------
Mean E2EL (ms): 178712.62
Median E2EL (ms): 178712.62
P99 E2EL (ms): 178712.62
==================================================
Test Case 2: 1024 input tokens, 15360 output tokens
python3 -m tensorrt_llm.serve.scripts.benchmark_serving \
--model=/home/qwen/Qwen2.5-7B-Instruct \
--backend=openai \
--dataset-name=random \
--random-prefix-len=205 \
--random-input-len=819 \
--random-output-len=15360 \
--percentile-metrics=ttft,tpot,itl,e2el \
--num-prompts=1 \
--ignore-eos \
--random-ids \
--tokenize-on-client \
--seed=19 \
--port=8099
Results:
Mean TTFT: 27.86 ms
Total input tokens: 1024
Total output tokens: 15360
Complete results:
============ Serving Benchmark Result ============
Total requests: 1
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 169.32
Total input tokens: 1024
Total generated tokens: 15360
Request throughput (req/s): 0.01
Output token throughput (tok/s): 90.71
Total Token throughput (tok/s): 96.76
User throughput (tok/s): 90.72
Avg Decoded Tokens per Iter: 1.00
---------------Time to First Token----------------
Mean TTFT (ms): 27.86
Median TTFT (ms): 27.86
P99 TTFT (ms): 27.86
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 11.02
Median TPOT (ms): 11.02
P99 TPOT (ms): 11.02
---------------Inter-token Latency----------------
Mean ITL (ms): 11.02
Median ITL (ms): 10.98
P99 ITL (ms): 11.97
----------------End-to-end Latency----------------
Mean E2EL (ms): 169321.43
Median E2EL (ms): 169321.43
P99 E2EL (ms): 169321.43
==================================================
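As a consistency check on the reported figures in both runs, end-to-end latency should roughly decompose into prefill plus decode, i.e. E2EL ≈ TTFT + TPOT × (output_tokens − 1). A quick sketch using the numbers above:

```python
# Sanity check: E2EL ~ TTFT + TPOT * (n_output_tokens - 1), all in ms.
# Small residuals are expected because TPOT is reported rounded to 0.01 ms.
def expected_e2el_ms(ttft_ms: float, tpot_ms: float, n_output_tokens: int) -> float:
    return ttft_ms + tpot_ms * (n_output_tokens - 1)

case1 = expected_e2el_ms(36.62, 10.99, 16256)  # vs reported 178712.62 ms
case2 = expected_e2el_ms(27.86, 11.02, 15360)  # vs reported 169321.43 ms
print(f"case 1: {case1:.2f} ms, case 2: {case2:.2f} ms")
```

Both estimates land within ~40 ms (about 0.02%) of the reported E2EL, so the decode-side numbers are internally consistent; the anomaly is confined to TTFT.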
Expected behavior
Theoretically, shorter input sequences should result in lower TTFT since:
- Less computation is required to process the input tokens
- Fewer attention operations are needed in the prefill phase
- A smaller memory footprint is needed for KV-cache initialization
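As a rough back-of-envelope supporting the first point: prefill compute for a dense decoder scales approximately linearly with input length (≈ 2 × params × tokens FLOPs, ignoring the attention term, which is small at these lengths), so the 1024-token prompt should cost about 8x the compute of the 128-token one, not less:

```python
# Back-of-envelope prefill FLOPs for a dense ~7B-parameter decoder.
# Assumption: prefill compute ~ 2 * n_params * n_input_tokens FLOPs.
N_PARAMS = 7e9  # Qwen2.5-7B

def prefill_flops(n_input_tokens: int) -> float:
    return 2 * N_PARAMS * n_input_tokens

short = prefill_flops(128)   # ~1.8e12 FLOPs
long_ = prefill_flops(1024)  # ~1.4e13 FLOPs
print(f"ratio: {long_ / short:.0f}x")  # prints "ratio: 8x"
```

By this estimate the longer prompt does 8x the prefill work, yet measures a *lower* TTFT, which is what makes the result surprising.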
Actual behavior
The shorter input (128 tokens) has a higher TTFT (36.62 ms) than the longer input (1024 tokens, TTFT 27.86 ms), approximately 31% higher.
Additional notes
- Both tests use the same engine configuration
- TP1 (tensor parallelism = 1) configuration
- The maximum-tokens setting (16383) is sufficient for both test cases
- EOS-token ignoring (--ignore-eos) is enabled for consistent output-length measurement
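One caveat worth noting: each test case runs a single request (--num-prompts=1), so every TTFT figure is a single sample and can be dominated by run-to-run noise (warm-up, scheduler jitter, host-side overhead). A hedged sketch for collecting repeated TTFT samples by streaming directly against the OpenAI-compatible endpoint; the URL and model path are assumptions taken from the commands above:

```python
import json
import statistics
import time
import urllib.request

# Assumed from the reproduction commands above (trtllm-serve on port 8099).
URL = "http://localhost:8099/v1/completions"
MODEL = "/home/qwen/Qwen2.5-7B-Instruct"

def measure_ttft_ms(prompt: str, max_tokens: int = 16) -> float:
    """Time from sending a streaming request until the first SSE chunk arrives."""
    body = json.dumps({"model": MODEL, "prompt": prompt,
                       "max_tokens": max_tokens, "stream": True}).encode()
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.readline()  # first streamed line carries the first token
    return (time.perf_counter() - start) * 1000.0

def summarize(samples: list[float]) -> tuple[float, float]:
    """Mean and standard deviation of TTFT samples, in ms."""
    return statistics.mean(samples), statistics.stdev(samples)

# Usage (requires the server from the reproduction to be running):
#   samples = [measure_ttft_ms("hello " * 512) for _ in range(20)]
#   mean_ms, stdev_ms = summarize(samples)
```

If the ~9 ms gap persists with the mean over 20+ runs (and the standard deviation is small relative to it), that would rule out single-sample noise as the explanation.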
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.