User-Centric Timing for KV Cache Benchmarking
Use user-centric timing when you need to:
- Control per-user turn gaps precisely: each user waits at least `num_users / QPS` seconds between turns, which enables controlled cache TTL testing
- Simulate steady state from the start: virtual history creates an immediate mix of new and continuing users (no cold-start transient)
- Keep per-user timing independent: each user maintains their own schedule, unaffected by other users' response times
- Measure prefix caching benefits: quantify TTFT improvements when a shared system prompt is cached across all users
Imagine a customer support chatbot serving 15 concurrent users. Each user:
- Sends a question
- Reads the response (takes ~15 seconds)
- Sends a follow-up question
- Repeats for ~20 turns until their issue is resolved
User-centric timing recreates this pattern with controlled, consistent timing. You can test whether your KV cache retains entries for exactly 15 seconds, 30 seconds, or any specific gap—something request-rate mode doesn't guarantee because continuation turns are issued at the next available rate interval rather than after a fixed per-user delay.
| Mode | Turn Timing | Startup Behavior | Best For |
|---|---|---|---|
| User-centric rate | Fixed per-user gap (`num_users / QPS`) | Steady-state via virtual history | KV cache TTL testing, controlled multi-turn timing |
| Request rate | Next turn at next rate interval (variable per-user gap) | Cold start (all sessions start fresh) | Throughput testing, arrival pattern simulation |
| Concurrency | Immediate (maintain N in-flight) | Cold start | Max throughput discovery, stress testing |
```bash
aiperf profile \
  --model your-model \
  --url localhost:8000 \
  --endpoint-type chat \
  --streaming \
  --user-centric-rate 1.0 \
  --num-users 15 \
  --session-turns-mean 20 \
  --shared-system-prompt-length 1000 \
  --user-context-prompt-length 20000 \
  --synthetic-input-tokens-mean 26 \
  --osl 100 \
  --num-dataset-entries 1000 \
  --benchmark-duration 100
```

This configures 15 simulated users with sessions averaging 20 turns:
- Turn gap: 15 users / 1.0 req/s = 15 seconds between each user's turns
- System throughput: ~1.0 requests/second across all users
- Shared system prompt: 1000 tokens shared across ALL users (KV cache prefix)
- User context: 20000 tokens unique per user (synthetic padding to simulate context length)
- Per-turn input: 26 tokens (the new question each turn)
| Parameter | Description |
|---|---|
| `--user-centric-rate` | Target requests per second (QPS) across all users (enables user-centric mode) |
| `--num-users` | Number of concurrent simulated users |
| `--session-turns-mean` | Mean number of conversation turns per user (must be >= 2) |
In request-rate mode, after a turn completes, the next turn is queued and issued at the next rate interval. This means per-user turn gaps vary depending on when the previous turn finished relative to the rate clock—making it hard to test specific cache TTL thresholds.
User-centric timing solves this with fixed per-user turn gaps:
| Feature | Why It Matters for Cache Measurement |
|---|---|
| Fixed turn gap per user | Each user's turns are spaced at least num_users / QPS seconds apart (exactly this interval when responses complete before the scheduled time). A 15-second gap tests whether your cache retains entries for 15+ seconds. |
| Per-user independent scheduling | User A's timing isn't affected by User B's slow response. Each user maintains their own schedule. |
| Deterministic scheduling | Same benchmark configuration = same request timing = reproducible results across runs. |
| Steady-state from t=0 | Virtual history simulates an already-running system, so metrics aren't skewed by cold-start transients from all users starting at Turn 0 simultaneously. |
The gap between each user's requests is:
turn_gap = num_users / user_centric_rate
| Users | Request Rate | Turn Gap |
|---|---|---|
| 15 | 1.0 req/s | 15.0s |
| 15 | 0.5 req/s | 30.0s |
| 15 | 4.0 req/s | 3.75s |
| 15 | 8.0 req/s | 1.875s |
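The rows above follow directly from the formula. A minimal Python sketch reproduces them; the function name `turn_gap_seconds` is illustrative and not part of AIPerf:

```python
def turn_gap_seconds(num_users: int, user_centric_rate: float) -> float:
    """Minimum gap, in seconds, between one user's consecutive turns."""
    if num_users <= 0 or user_centric_rate <= 0:
        raise ValueError("num_users and user_centric_rate must be positive")
    return num_users / user_centric_rate


if __name__ == "__main__":
    # Same (users, rate) pairs as the table above.
    for rate in (1.0, 0.5, 4.0, 8.0):
        print(f"15 users at {rate} req/s -> {turn_gap_seconds(15, rate)}s gap")
```

Doubling the rate halves the gap, and adding users at a fixed rate lengthens it, so either knob can be used to reach a target gap.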
User-centric mode uses "virtual history" to simulate steady-state behavior immediately. Instead of all users starting at turn 0 simultaneously, users are assigned virtual "ages" at startup—creating an immediate mix of new users and continuations that simulates joining an already-running system.
```text
Evaluate: Benchmark Execution Timeline (t=0 to t=30s)
---------------------------------------------------------------------
TIME (s) >>>  0  1  2  3  4  5  6  7  8  9  10  11  12 ...

EVENT:
  t=0: User 1 (virtually done) LEAVES instantly.
  t=0: User 16 ENTERS instantly to replace User 1.

ACTUAL TURNS REMAINING (Visualized):
  User 16 (New): ████████████████████████████████████████ (20 turns)
  User 5       : ████████████ (6 turns)
  User 9       : ██████████████████████ (11 turns)
  User 13      : ████████████████████████████████ (16 turns)
  User 2       : ████ (2 turns - finishes quickly)
  User 6       : ██████████████ (7 turns)
  User 10      : ████████████████████████ (12 turns)
  User 14      : ██████████████████████████████████ (17 turns)
  ... (remaining users follow staggered pattern) ...

RESULT:
  Immediate mix of fresh sessions (User 16) and deep sessions (User 14),
  with users finishing and churning naturally from t=6s onwards.
```
When a response takes longer than the turn gap, the scheduler:
- Sends the next turn immediately when the response arrives
- Resets the timing baseline to "now" for subsequent turns
- Maintains the turn gap minimum going forward
This avoids burst load from catching up to the original schedule.
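The three rules above can be sketched as a small simulation. This is a simplified model rather than AIPerf's actual scheduler: the function name `simulate_send_times` and the choice to measure the gap from the previous send time are assumptions made for illustration.

```python
def simulate_send_times(latencies: list[float], turn_gap: float) -> list[float]:
    """Send times for one user, given each turn's response latency in seconds.

    Assumption: the next turn is scheduled turn_gap seconds after the previous
    send. If the response arrives later than that, the turn goes out as soon as
    the response arrives and the baseline resets to that moment.
    """
    send_times = [0.0]
    for latency in latencies:
        sent = send_times[-1]
        scheduled = sent + turn_gap      # planned time of the next turn
        response_done = sent + latency   # when the previous response finished
        # On time: keep the schedule. Late: send now, with no catch-up burst.
        send_times.append(max(scheduled, response_done))
    return send_times


if __name__ == "__main__":
    # The second response (20s) overruns the 15s gap, so turn 2 slips to t=35
    # and turn 3 follows a full 15s later at t=50.
    print(simulate_send_times([3.0, 20.0, 4.0], turn_gap=15.0))
```

Had the original schedule (0, 15, 30, 45) been kept, turn 3 would have gone out only 10 seconds after turn 2. Resetting the baseline keeps every gap at or above the configured minimum.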
For effective KV cache benchmarking, configure prompts to create realistic prefix sharing patterns:
```text
┌─────────────────────────────────────────────────────────────┐
│ Shared System Prompt (1000 tokens)                          │ ← Same across ALL users
│ "You are a helpful assistant..."                            │   (KV cache shared prefix)
├─────────────────────────────────────────────────────────────┤
│ User Context Prompt (20000 tokens)                          │ ← Unique per user
│ [synthetic text representing prior conversation context]    │   (unique prefix per user)
├─────────────────────────────────────────────────────────────┤
│ Per-Turn Input (26 tokens)                                  │ ← New content each turn
│ "What is the weather today?"                                │   (the actual question)
└─────────────────────────────────────────────────────────────┘
```
Note: In multi-turn conversations, previous turns (inputs + responses) also accumulate in the request, growing the total prompt size with each turn. The user context prompt is synthetic padding separate from this accumulated history—both contribute to the total context length.
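To get a feel for how the prompt grows, the sketch below estimates the prompt size at a given turn for the example configuration. It is a rough approximation: it assumes every response is exactly `--osl` tokens long and ignores chat-template overhead, and the function name `prompt_tokens_at_turn` is illustrative.

```python
def prompt_tokens_at_turn(
    turn_index: int,
    shared_system: int = 1000,
    user_context: int = 20000,
    input_tokens: int = 26,
    output_tokens: int = 100,
) -> int:
    """Approximate prompt size for a zero-based turn index.

    Assumption: each earlier turn adds its input plus a full-length response
    to the accumulated history.
    """
    if turn_index < 0:
        raise ValueError("turn_index must be non-negative")
    history = turn_index * (input_tokens + output_tokens)
    return shared_system + user_context + history + input_tokens


if __name__ == "__main__":
    for turn in (0, 1, 19):
        print(f"turn {turn}: ~{prompt_tokens_at_turn(turn)} prompt tokens")
```

Under these assumptions the prompt grows by 126 tokens per turn, so the 21000-token fixed prefix dominates the total throughout a 20-turn session.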
| Option | Description | Typical Value |
|---|---|---|
| `--shared-system-prompt-length` | System prompt shared across ALL users (enables prefix sharing) | 1000 |
| `--user-context-prompt-length` | Per-user unique prefix (synthetic text representing conversation history) | 20000 |
| `--synthetic-input-tokens-mean` | Per-turn input tokens (the question) | 26 |
| `--osl` | Output sequence length (answer tokens) | 100 |
| `--num-dataset-entries` | Required when using `--user-context-prompt-length` | ≥1000 (recommended) |
Important: User-centric mode does NOT automatically limit concurrency. While the timing model spaces out requests, slow server responses can cause request buildup.
To prevent overwhelming the server, you can cap concurrency with --concurrency. If you set this, use a value at least equal to --num-users to avoid constraining user sessions.
```bash
# Cap concurrency to num_users
aiperf profile \
  --user-centric-rate 1.0 \
  --num-users 15 \
  --concurrency 15 \
  --model your-model \
  --url localhost:8000
```

```bash
aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --url localhost:8000 \
  --endpoint-type chat \
  --streaming \
  --user-centric-rate 1.0 \
  --num-users 15 \
  --session-turns-mean 20 \
  --shared-system-prompt-length 1000 \
  --user-context-prompt-length 20000 \
  --synthetic-input-tokens-mean 26 \
  --osl 100 \
  --num-dataset-entries 1000 \
  --benchmark-duration 100 \
  --random-seed 42
```

Sample Output (Successful Run):
```text
INFO Starting AIPerf System
INFO User-centric mode: 15 users, 1.0 req/s (15.0s turn gap per user)
INFO Shared system prompt: 1000 tokens
INFO User context: 20000 tokens per user
INFO AIPerf System is PROFILING
Profiling: [01:40] - Running for 100 seconds...
INFO Benchmark completed successfully
INFO Results saved to: artifacts/Qwen_Qwen3-0.6B-chat-rate1.0/

NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Metric                      ┃     avg ┃     min ┃     max ┃     p99 ┃     p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ Request Latency (ms)        │ 3456.78 │ 2890.34 │ 4123.45 │ 3998.67 │ 3423.12 │
│ Time to First Token (ms)    │ 1234.56 │  987.89 │ 1567.90 │ 1498.23 │ 1212.34 │
│ Inter Token Latency (ms)    │   21.45 │   17.89 │   28.34 │   27.12 │   21.01 │
│ Output Token Count (tokens) │  100.00 │   90.00 │  110.00 │  109.00 │   99.00 │
│ Request Throughput (req/s)  │    0.98 │       - │       - │       - │       - │
└─────────────────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

JSON Export: artifacts/Qwen_Qwen3-0.6B-chat-rate1.0/profile_export_aiperf.json
```
- 15-second gaps between each user's turns (15 / 1.0 = 15s)
- 1000-token shared system prompt (prefix shared across ALL users)
- 20000-token user context (unique per user)
Test with higher QPS (shorter per-user gaps):
```bash
aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --url localhost:8000 \
  --endpoint-type chat \
  --streaming \
  --user-centric-rate 4.0 \
  --num-users 15 \
  --session-turns-mean 20 \
  --shared-system-prompt-length 1000 \
  --user-context-prompt-length 20000 \
  --synthetic-input-tokens-mean 26 \
  --osl 100 \
  --num-dataset-entries 1000 \
  --benchmark-duration 100
```

Sample Output (Successful Run):
```text
INFO Starting AIPerf System
INFO User-centric mode: 15 users, 4.0 req/s (3.75s turn gap per user)
INFO Shared system prompt: 1000 tokens
INFO User context: 20000 tokens per user
INFO AIPerf System is PROFILING
Profiling: [01:40] - Running for 100 seconds...
INFO Benchmark completed successfully
INFO Results saved to: artifacts/Qwen_Qwen3-0.6B-chat-rate4.0/

NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Metric                      ┃     avg ┃     min ┃     max ┃     p99 ┃     p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ Request Latency (ms)        │ 3234.56 │ 2678.90 │ 3890.12 │ 3798.45 │ 3198.67 │
│ Time to First Token (ms)    │ 1145.67 │  912.34 │ 1456.89 │ 1389.23 │ 1123.45 │
│ Inter Token Latency (ms)    │   20.34 │   16.78 │   26.90 │   25.67 │   20.01 │
│ Output Token Count (tokens) │  100.00 │   90.00 │  110.00 │  109.00 │   99.00 │
│ Request Throughput (req/s)  │    3.89 │       - │       - │       - │       - │
└─────────────────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

JSON Export: artifacts/Qwen_Qwen3-0.6B-chat-rate4.0/profile_export_aiperf.json
```
Gap = 15 / 4.0 = 3.75 seconds between each user's requests.
Test cache TTL limits with 30-second per-user gaps:
```bash
aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --url localhost:8000 \
  --endpoint-type chat \
  --streaming \
  --user-centric-rate 0.5 \
  --num-users 15 \
  --session-turns-mean 20 \
  --shared-system-prompt-length 1000 \
  --user-context-prompt-length 20000 \
  --synthetic-input-tokens-mean 26 \
  --osl 100 \
  --num-dataset-entries 1000 \
  --benchmark-duration 300
```

Sample Output (Successful Run):
```text
INFO Starting AIPerf System
INFO User-centric mode: 15 users, 0.5 req/s (30.0s turn gap per user)
INFO Shared system prompt: 1000 tokens
INFO User context: 20000 tokens per user
INFO AIPerf System is PROFILING
Profiling: [05:00] - Running for 300 seconds...
INFO Benchmark completed successfully
INFO Results saved to: artifacts/Qwen_Qwen3-0.6B-chat-rate0.5/

NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Metric                      ┃     avg ┃     min ┃     max ┃     p99 ┃     p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ Request Latency (ms)        │ 3567.89 │ 2956.78 │ 4234.56 │ 4098.23 │ 3512.34 │
│ Time to First Token (ms)    │ 1289.45 │ 1023.67 │ 1598.90 │ 1534.12 │ 1267.89 │
│ Inter Token Latency (ms)    │   21.89 │   18.23 │   29.12 │   28.01 │   21.56 │
│ Output Token Count (tokens) │  100.00 │   90.00 │  110.00 │  109.00 │   99.00 │
│ Request Throughput (req/s)  │    0.49 │       - │       - │       - │       - │
└─────────────────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

JSON Export: artifacts/Qwen_Qwen3-0.6B-chat-rate0.5/profile_export_aiperf.json
```
Gap = 15 / 0.5 = 30 seconds between each user's requests.
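When sweeping several TTL thresholds, it can help to work backwards from the gap you want to test. The sketch below inverts the gap formula to derive `--user-centric-rate` for each target gap. The duration rule (run for roughly `turns_per_user` gaps) is a rule of thumb assumed here, not an AIPerf requirement, and `plan_ttl_sweep` is an illustrative name.

```python
import math


def plan_ttl_sweep(
    num_users: int, target_gaps_s: list[float], turns_per_user: int = 10
) -> list[dict]:
    """Derive a rate and a rough benchmark duration for each target turn gap."""
    plans = []
    for gap in target_gaps_s:
        if gap <= 0:
            raise ValueError("target gaps must be positive")
        plans.append(
            {
                "gap_s": gap,
                # Inverse of turn_gap = num_users / user_centric_rate
                "user_centric_rate": num_users / gap,
                # Assumption: long enough for about turns_per_user gaps
                "benchmark_duration_s": math.ceil(gap * turns_per_user),
            }
        )
    return plans


if __name__ == "__main__":
    for plan in plan_ttl_sweep(15, [3.75, 15.0, 30.0]):
        print(
            f"gap {plan['gap_s']}s: --num-users 15 "
            f"--user-centric-rate {plan['user_centric_rate']} "
            f"--benchmark-duration {plan['benchmark_duration_s']}"
        )
```

For 15 users, gaps of 3.75, 15, and 30 seconds map to rates of 4.0, 1.0, and 0.5 req/s, which matches the three runs shown in this section.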
| Metric | What It Tells You |
|---|---|
| TTFT (Time to First Token) | Lower TTFT on subsequent turns indicates cache hits |
| TTFT by Turn Index | Compare Turn 0 vs Turn 1+ to measure cache benefit |
| Throughput | Higher throughput with caching enabled indicates cache effectiveness |
With effective caching:
- Turn 0 (first turn): Higher TTFT (cache miss, full prefill)
- Turn 1+: Lower TTFT (cache hit, reduced prefill)
Without caching or cache misses:
- Similar TTFT across all turns
- Higher variance in TTFT
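One way to run the Turn 0 versus Turn 1+ comparison is to group TTFT samples by turn index. The sketch below works on plain `(turn_index, ttft_ms)` pairs. That record shape is hypothetical, so extracting such pairs from your own results is left to you, and the sample values are made up purely to exercise the code.

```python
from collections import defaultdict
from statistics import mean


def mean_ttft_by_turn(records: list[tuple[int, float]]) -> dict[int, float]:
    """Average TTFT (ms) for each turn index."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for turn_index, ttft_ms in records:
        buckets[turn_index].append(ttft_ms)
    return {turn: mean(values) for turn, values in sorted(buckets.items())}


def first_turn_vs_later_ratio(records: list[tuple[int, float]]) -> float:
    """Mean Turn 0 TTFT divided by mean TTFT over all later turns."""
    first = [ttft for turn, ttft in records if turn == 0]
    later = [ttft for turn, ttft in records if turn >= 1]
    if not first or not later:
        raise ValueError("need samples for turn 0 and for at least one later turn")
    return mean(first) / mean(later)


if __name__ == "__main__":
    # Made-up samples for illustration only, not measured results.
    samples = [(0, 1200.0), (0, 1000.0), (1, 300.0), (2, 250.0), (1, 350.0)]
    print(mean_ttft_by_turn(samples))
    print(f"ratio: {first_turn_vs_later_ratio(samples):.2f}")
```

A ratio well above 1 suggests later turns are benefiting from cached prefixes, while a ratio near 1 matches the "similar TTFT across all turns" pattern described above.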
- Verify `--user-centric-rate` is set (not `--request-rate`)
- Confirm `--num-users` is specified
- Check if response latencies exceed the turn gap (triggers schedule reset)
Possible causes:
- Cache TTL shorter than your gap interval
- Cache not enabled on the server
- No shared system prompt configured
Solutions:
- Reduce the gap by increasing `--user-centric-rate` or decreasing `--num-users`
- Verify server cache configuration
- Use `--shared-system-prompt-length` to enable prefix sharing

- Use `--random-seed` for reproducible dataset sampling
- Increase `--benchmark-duration` for more samples
- Ensure the server is warmed up before benchmarking
| Option | Reason |
|---|---|
| `--request-rate` | Use `--user-centric-rate` instead |
| `--arrival-pattern` | User-centric mode uses deterministic scheduling |
- Multi-Turn Tutorial — General multi-turn conversation benchmarking
