On many occasions, the throughput value measured and displayed by the benchmark is not the same as the value seen in the vLLM trace.
┌─────────────────┬────────────────────────────────────────────────────────────────┐
│ Parameter │ Value │
├─────────────────┼────────────────────────────────────────────────────────────────┤
│ Max VUs │ 128 │
│ Duration │ 60 │
│ Warmup Duration │ 30 │
│ Benchmark Kind │ Sweep │
│ Rates │ N/A │
│ Num Rates │ 1 │
│ Prompt Options │ N/A │
│ Decode Options │ num_tokens=Some(800),min_tokens=50,max_tokens=800,variance=100 │
│ Tokenizer │ deepseek-ai/DeepSeek-R1-Distill-Llama-8B │
│ Extra Metadata │ N/A │
└─────────────────┴────────────────────────────────────────────────────────────────┘
Run 1:
│ Benchmark          │ QPS        │ E2E Latency (avg) │ TTFT (avg) │ ITL (avg) │ Throughput         │
├────────────────────┼────────────┼───────────────────┼────────────┼───────────┼────────────────────┤
│ warmup             │ 0.07 req/s │ 13.95 sec         │ 3268.06 ms │ 15.22 ms  │ 50.42 tokens/sec   │
│ throughput         │ 3.84 req/s │ 24.96 sec         │ 307.17 ms  │ 36.85 ms  │ 2560.47 tokens/sec │
Run 2:
│ Benchmark          │ QPS        │ E2E Latency (avg) │ TTFT (avg) │ ITL (avg) │ Throughput         │
├────────────────────┼────────────┼───────────────────┼────────────┼───────────┼────────────────────┤
│ warmup             │ 0.08 req/s │ 13.30 sec         │ 1554.01 ms │ 14.70 ms  │ 60.15 tokens/sec   │
│ throughput         │ 2.41 req/s │ 38.43 sec         │ 665.35 ms  │ 56.19 ms  │ 1596.76 tokens/sec │
Trace from vLLM (similar in both runs):
INFO 06-13 10:23:08 [loggers.py:111] Engine 000: Avg prompt throughput: 548.9 tokens/s, Avg generation throughput: 2475.0 tokens/s, Running: 128 reqs,
The vLLM trace just above corresponds to Run 2. I would have expected to see a value of around 2475 tokens/sec for Run 2 instead of 1596.76 tokens/sec.
Am I misunderstanding how that works?
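
My current guess is that the two tools simply average over different windows: the benchmark presumably divides the total generated tokens by the wall-clock time of the whole step (including ramp-up and drain), while vLLM's log line is an engine-side average over its short logging interval at steady state with 128 requests in flight. That is only an assumption on my part, not something I have confirmed in the benchmarker's source. A toy sketch of that arithmetic (all numbers below are made up, not measured):

```python
# Hypothetical illustration only: toy numbers, not measured data.
# Assumption: benchmarker "Throughput" = total generated tokens / wall-clock time
# of the whole step (including ramp-up and drain), while vLLM's
# "Avg generation throughput" is averaged over a short logging window
# at steady state.

total_generated_tokens = 60_000   # hypothetical total for the whole step
step_wall_clock_s = 37.5          # hypothetical wall-clock time, incl. ramp-up/drain
window_tokens = 24_750            # hypothetical tokens generated in one logging window
window_s = 10.0                   # assumed vLLM logging interval

end_to_end = total_generated_tokens / step_wall_clock_s   # benchmarker-style figure
steady_state = window_tokens / window_s                   # vLLM-style figure

print(f"end-to-end:   {end_to_end:.0f} tokens/s")    # ~1600, lower
print(f"steady-state: {steady_state:.0f} tokens/s")  # ~2475, higher
```

If that is roughly what is happening, the two numbers would only match when the run spends almost all of its wall-clock time at steady state.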