| sidebar-title | Arrival Patterns: Simulating Realistic Traffic |
|---|
When benchmarking with --request-rate, AIPerf can vary how requests arrive over time. The --arrival-pattern option controls the distribution of inter-arrival times, letting you simulate everything from perfectly regular traffic to bursty real-world patterns.
Real traffic doesn't arrive at perfectly regular intervals. Traffic comes in bursts—quiet periods followed by sudden spikes. How your server handles this variance affects real-world performance.
Constant Pattern:          Poisson Pattern:           Gamma (bursty):
| | | | | | | |            ||  |    |  |              |||  |        |||  |
└──────────────────▶       └──────────────────▶       └──────────────────▶
Perfect spacing            Natural variance           Clustered bursts
(unrealistic)              (typical traffic)          (stress testing)
# Default: Poisson (realistic)
aiperf profile --request-rate 50 ...
# Explicit: Constant (deterministic)
aiperf profile --request-rate 50 --arrival-pattern constant ...
# Bursty: Gamma with low smoothness
aiperf profile --request-rate 50 --arrival-pattern gamma --arrival-smoothness 0.5 ...
--arrival-pattern constant
Requests arrive at perfectly regular intervals: exactly 1/rate seconds apart.
Inter-arrival times:
10 QPS → every 100ms: |····|····|····|····|····|····|
0 100 200 300 400 500 600 ms
Use cases:
- Baseline measurements with no variance
- Debugging timing issues
- Comparing against variable patterns
- Deterministic, reproducible tests
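
As a rough illustration of the constant pattern, the sketch below computes send times spaced exactly 1/rate apart. The function name `constant_arrival_times` is hypothetical and is not part of AIPerf; it only reproduces the arithmetic described above.

```python
def constant_arrival_times(rate: float, count: int) -> list[float]:
    """Return `count` send times in seconds, spaced exactly 1/rate apart."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    interval = 1.0 / rate
    return [i * interval for i in range(count)]


if __name__ == "__main__":
    times = constant_arrival_times(rate=10, count=7)
    # Send times in milliseconds: [0, 100, 200, 300, 400, 500, 600]
    print([round(t * 1000) for t in times])
```

At 10 QPS this gives one request every 100 ms, matching the timeline above.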
--arrival-pattern poisson
Requests arrive according to a Poisson process, the mathematical model for random events at a constant average rate. Inter-arrival times follow an exponential distribution.
Inter-arrival times (exponential):
10 QPS average: |··|······|·|···|····|··|·······|···|
Varied gaps, same average rate over time
Characteristics:
- Mean inter-arrival = 1/rate (same as constant)
- Variance = (1/rate)² (natural randomness)
- Sometimes requests cluster, sometimes gaps appear
- Models real user behavior where arrivals are independent
Use cases:
- Default realistic traffic simulation
- Standard load testing
- Comparing to theoretical queueing models
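
To see these properties numerically, here is a minimal sketch that draws exponential gaps with Python's standard `random.expovariate` and checks the mean and coefficient of variation. The helper `poisson_inter_arrivals` is a hypothetical name used for illustration, not AIPerf's scheduler.

```python
import random
import statistics


def poisson_inter_arrivals(rate: float, count: int, seed: int = 0) -> list[float]:
    """Draw `count` exponential gaps (seconds) with mean 1/rate."""
    rng = random.Random(seed)
    return [rng.expovariate(rate) for _ in range(count)]


if __name__ == "__main__":
    gaps = poisson_inter_arrivals(rate=10, count=100_000)
    mean = statistics.fmean(gaps)
    cv = statistics.pstdev(gaps) / mean  # std dev / mean
    print(f"mean gap = {mean * 1000:.1f} ms, CV = {cv:.2f}")
```

With enough samples the mean gap should land close to 100 ms and the CV close to 1, even though individual gaps vary widely.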
--arrival-pattern gamma --arrival-smoothness <value>
The Gamma distribution generalizes Poisson with a smoothness parameter that controls how bursty or regular arrivals are:

| Smoothness | Behavior | Variance | Use Case |
|---|---|---|---|
| < 1.0 | Bursty: clustered arrivals with gaps | Higher | Stress testing, worst-case scenarios |
| = 1.0 | Poisson: natural randomness | Medium | Same as --arrival-pattern poisson |
| > 1.0 | Smooth: more regular arrivals | Lower | Controlled testing, less noise |
Smoothness = 0.5 (bursty):
|||| ||| ||||| ||
Clusters of requests with quiet gaps
Smoothness = 1.0 (Poisson):
| || | | | || | | || |
Natural variance
Smoothness = 2.0 (smooth):
| | | | | | | | | | | | | |
More regular, approaches constant
Mathematical note: The smoothness parameter is the Gamma distribution's shape parameter (k). Scale is automatically computed to maintain the correct mean rate.
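
The note above can be sketched with Python's standard `random.gammavariate(alpha, beta)`, where `alpha` is the shape and `beta` is the scale. This is an illustration under the assumption that the scale is 1/(smoothness × rate), which is the value that keeps the mean gap at 1/rate; the function name is hypothetical and the code is not taken from AIPerf.

```python
import math
import random
import statistics


def gamma_inter_arrivals(rate: float, smoothness: float, count: int, seed: int = 0) -> list[float]:
    """Draw `count` Gamma-distributed gaps (seconds) with mean 1/rate."""
    rng = random.Random(seed)
    scale = 1.0 / (smoothness * rate)  # assumed: chosen so shape * scale == 1/rate
    return [rng.gammavariate(smoothness, scale) for _ in range(count)]


if __name__ == "__main__":
    for k in (0.5, 1.0, 2.0):
        gaps = gamma_inter_arrivals(rate=10, smoothness=k, count=100_000)
        mean = statistics.fmean(gaps)
        cv = statistics.pstdev(gaps) / mean
        print(f"smoothness={k}: mean gap={mean * 1000:.1f} ms, "
              f"CV={cv:.2f} (theory {1 / math.sqrt(k):.2f})")
```

The mean gap stays near 100 ms for every smoothness value; only the spread around it changes.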
# No --request-rate, just --concurrency
aiperf profile --concurrency 50 ...
When you omit --request-rate and only specify --concurrency, AIPerf uses burst mode: zero delay between request dispatches, limited only by the concurrency semaphore.
Burst mode (concurrency=3):
[Req1]────────────────────────────▶
[Req2]────────────────────────────▶
[Req3]────────────────────────────▶
[Req4]──────────────────────▶ ← Starts when any slot frees
Use cases:
- Maximum throughput discovery
- Saturation testing
- Finding server capacity limits
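
The idea of dispatching with no delay while a semaphore caps in-flight requests can be sketched with `asyncio`. This is a simplified model, not AIPerf's implementation: `run_burst` and `fake_request` are hypothetical names, and `asyncio.sleep` stands in for the real server call.

```python
import asyncio


async def run_burst(total_requests: int, concurrency: int, service_time: float = 0.01) -> int:
    """Dispatch requests back to back; return the peak number in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def fake_request() -> None:
        nonlocal in_flight, peak
        async with semaphore:  # wait only for a free slot, no pacing delay
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(service_time)  # stand-in for the server call
            in_flight -= 1

    await asyncio.gather(*(fake_request() for _ in range(total_requests)))
    return peak


if __name__ == "__main__":
    print(asyncio.run(run_burst(total_requests=10, concurrency=3)))  # prints 3
```

A new request starts as soon as any slot frees, so the number in flight never exceeds the concurrency limit.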
AIPerf's --arrival-smoothness is compatible with vLLM's --burstiness parameter:
# Same distribution as vLLM with --burstiness 0.5
aiperf profile \
--request-rate 50 \
--arrival-pattern gamma \
--arrival-smoothness 0.5 \
...
This allows direct comparison between AIPerf and vLLM benchmark results when using the same smoothness/burstiness value.
Compare how your server handles ideal vs realistic traffic:
# Run 1: Constant (baseline)
aiperf profile \
--model your-model \
--url localhost:8000 \
--endpoint-type chat \
--streaming \
--request-rate 100 \
--arrival-pattern constant \
--benchmark-duration 60 \
--output-dir results/constant
**Expected Output (Run 1):**
INFO Starting AIPerf System
INFO Using Request_Rate strategy with constant arrival pattern
INFO AIPerf System is PROFILING
Profiling: [01:00] - Running for 60 seconds...
INFO Benchmark completed successfully
INFO Results saved to: results/constant/
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Request Latency (ms) │ 178.45 │ 156.23 │ 212.34 │ 205.67 │ 176.89 │
│ Time to First Token (ms) │ 45.67 │ 38.12 │ 58.34 │ 56.23 │ 44.90 │
│ Inter Token Latency (ms) │ 11.23 │ 9.45 │ 14.67 │ 14.12 │ 11.01 │
│ Request Throughput (req/s) │ 98.45 │ - │ - │ - │ - │
└────────────────────────────┴────────┴────────┴────────┴────────┴────────┘
JSON Export: results/constant/profile_export_aiperf.json
# Run 2: Poisson (realistic)
aiperf profile \
--model your-model \
--url localhost:8000 \
--endpoint-type chat \
--streaming \
--request-rate 100 \
--arrival-pattern poisson \
--benchmark-duration 60 \
--output-dir results/poisson
Expected Output (Run 2):
INFO Starting AIPerf System
INFO Using Request_Rate strategy with poisson arrival pattern
INFO AIPerf System is PROFILING
Profiling: [01:00] - Running for 60 seconds...
INFO Benchmark completed successfully
INFO Results saved to: results/poisson/
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Request Latency (ms) │ 182.34 │ 148.56 │ 267.89 │ 245.67 │ 179.12 │
│ Time to First Token (ms) │ 47.89 │ 35.67 │ 78.23 │ 72.45 │ 46.34 │
│ Inter Token Latency (ms) │ 11.67 │ 8.90 │ 19.34 │ 17.89 │ 11.23 │
│ Request Throughput (req/s) │ 96.78 │ - │ - │ - │ - │
└────────────────────────────┴────────┴────────┴────────┴────────┴────────┘
JSON Export: results/poisson/profile_export_aiperf.json
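Compare TTFT and throughput between the two runs. In the sample output above, the averages stay close while the Poisson run shows higher max and p99 latencies; a wider latency spread under Poisson than under constant arrivals indicates that the server is sensitive to arrival variance.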
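
One simple way to quantify the difference is the ratio of p99 to p50 latency. The sketch below uses only the sample Request Latency values from the two expected-output tables; the `tail_ratio` helper is a hypothetical name, and the choice of p99/p50 as the comparison metric is an assumption rather than something AIPerf reports.

```python
# Sample values copied from the two tables above (Request Latency, ms).
runs = {
    "constant": {"p50": 176.89, "p99": 205.67},
    "poisson": {"p50": 179.12, "p99": 245.67},
}


def tail_ratio(stats: dict[str, float]) -> float:
    """p99 divided by p50: values near 1.0 mean a short tail."""
    return stats["p99"] / stats["p50"]


for name, stats in runs.items():
    print(f"{name}: p99/p50 = {tail_ratio(stats):.2f}")
# constant: p99/p50 = 1.16
# poisson: p99/p50 = 1.37
```

The median barely moves between the runs, but the tail grows noticeably once arrivals are randomized.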
Test how your server handles request bursts:
aiperf profile \
--model your-model \
--url localhost:8000 \
--endpoint-type chat \
--streaming \
--request-rate 100 \
--arrival-pattern gamma \
--arrival-smoothness 0.3 \
--benchmark-duration 120
Sample Output (Successful Run):
INFO Starting AIPerf System
INFO Using Request_Rate strategy with gamma arrival pattern (smoothness: 0.3)
INFO AIPerf System is PROFILING
Profiling: [02:00] - Running for 120 seconds...
INFO Benchmark completed successfully
INFO Results saved to: artifacts/your-model-chat-rate100/
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Request Latency (ms) │ 198.67 │ 142.34 │ 398.12 │ 356.78 │ 189.45 │
│ Time to First Token (ms) │ 52.34 │ 34.56 │ 112.34 │ 98.67 │ 49.23 │
│ Inter Token Latency (ms) │ 12.89 │ 8.23 │ 28.45 │ 24.67 │ 12.01 │
│ Request Throughput (req/s) │ 93.45 │ - │ - │ - │ - │
└────────────────────────────┴────────┴────────┴────────┴────────┴────────┘
JSON Export: artifacts/your-model-chat-rate100/profile_export_aiperf.json
Smoothness of 0.3 creates highly bursty traffic—several requests arrive nearly simultaneously, then quiet periods.
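
To get a feel for how much clustering a given smoothness produces, the sketch below estimates the share of inter-arrival gaps shorter than 10% of the mean gap. It assumes Gamma-distributed gaps with shape equal to the smoothness and scale 1/(smoothness × rate); the function name and the 10% threshold are illustrative choices, not AIPerf settings.

```python
import random


def fraction_of_short_gaps(smoothness: float, rate: float = 100.0,
                           count: int = 100_000, seed: int = 0) -> float:
    """Share of gaps shorter than 10% of the mean gap (1/rate)."""
    rng = random.Random(seed)
    scale = 1.0 / (smoothness * rate)  # assumed: keeps the mean gap at 1/rate
    threshold = 0.1 / rate             # 1 ms when rate is 100 req/s
    short = sum(rng.gammavariate(smoothness, scale) < threshold for _ in range(count))
    return short / count


if __name__ == "__main__":
    for k in (0.3, 1.0, 5.0):
        print(f"smoothness {k}: {fraction_of_short_gaps(k):.1%} of gaps under 1 ms")
```

For smoothness 1.0 the theoretical share is 1 − e^(−0.1), about 9.5%. Lower smoothness pushes the share well above that, and higher smoothness pushes it toward zero.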
Reduce variance in measurements for controlled experiments:
aiperf profile \
--model your-model \
--url localhost:8000 \
--endpoint-type chat \
--streaming \
--request-rate 50 \
--arrival-pattern gamma \
--arrival-smoothness 5.0 \
--benchmark-duration 60
Sample Output (Successful Run):
INFO Starting AIPerf System
INFO Using Request_Rate strategy with gamma arrival pattern (smoothness: 5.0)
INFO AIPerf System is PROFILING
Profiling: [01:00] - Running for 60 seconds...
INFO Benchmark completed successfully
INFO Results saved to: artifacts/your-model-chat-rate50/
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Request Latency (ms) │ 165.23 │ 148.90 │ 189.45 │ 184.56 │ 164.12 │
│ Time to First Token (ms) │ 42.67 │ 36.89 │ 52.34 │ 50.12 │ 42.01 │
│ Inter Token Latency (ms) │ 10.89 │ 9.23 │ 13.45 │ 13.01 │ 10.67 │
│ Request Throughput (req/s) │ 49.23 │ - │ - │ - │ - │
└────────────────────────────┴────────┴────────┴────────┴────────┴────────┘
JSON Export: artifacts/your-model-chat-rate50/profile_export_aiperf.json
Smoothness of 5.0 produces very regular arrivals, reducing measurement noise while still having some natural variance.
Run multiple benchmarks with increasing burstiness to find where performance degrades:
for smoothness in 2.0 1.0 0.7 0.5 0.3; do
  aiperf profile \
    --model your-model \
    --url localhost:8000 \
    --endpoint-type chat \
    --streaming \
    --request-rate 100 \
    --arrival-pattern gamma \
    --arrival-smoothness "$smoothness" \
    --benchmark-duration 60 \
    --output-dir "results/smoothness_$smoothness"
done
Use constant arrivals during warmup, then realistic patterns for profiling:
aiperf profile \
--model your-model \
--url localhost:8000 \
--endpoint-type chat \
--streaming \
--request-rate 100 \
--arrival-pattern gamma \
--arrival-smoothness 0.8 \
--warmup-arrival-pattern constant \
--warmup-duration 30 \
--benchmark-duration 120

| Option | Type | Default | Description |
|---|---|---|---|
| --arrival-pattern | str | poisson | Pattern for request arrivals: constant, poisson, gamma |
| --arrival-smoothness | float | None | Gamma smoothness: <1 = bursty, 1 = Poisson, >1 = smooth. Defaults to 1.0 when using gamma pattern. |
| --warmup-arrival-pattern | str | Inherits | Override pattern for warmup phase |

Constraints:
- --arrival-pattern requires --request-rate to be set
- --arrival-smoothness only applies when --arrival-pattern gamma
- Cannot use with --user-centric-rate (deterministic per-user scheduling)
- Cannot use with --fixed-schedule (timestamp-based scheduling)
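
The constraints above can be restated as a small pre-flight check for a wrapper script. This is a hypothetical helper, not AIPerf's own validation: the function name, its parameters, and the message wording are assumptions, and AIPerf may report or handle these conflicts differently.

```python
def check_arrival_options(request_rate=None, arrival_pattern=None,
                          arrival_smoothness=None, user_centric_rate=None,
                          fixed_schedule=False) -> list[str]:
    """Return a list of problems for a hypothetical set of CLI options."""
    problems = []
    if arrival_pattern is not None:
        if request_rate is None:
            problems.append("--arrival-pattern requires --request-rate")
        if user_centric_rate is not None:
            problems.append("--arrival-pattern cannot be used with --user-centric-rate")
        if fixed_schedule:
            problems.append("--arrival-pattern cannot be used with --fixed-schedule")
    if arrival_smoothness is not None and arrival_pattern != "gamma":
        problems.append("--arrival-smoothness only applies with --arrival-pattern gamma")
    return problems


if __name__ == "__main__":
    # A valid combination produces no problems.
    print(check_arrival_options(request_rate=50, arrival_pattern="gamma",
                                arrival_smoothness=0.5))  # prints []
```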
| Goal | Pattern | Smoothness |
|---|---|---|
| Reproducible baseline | constant | N/A |
| Realistic traffic simulation | poisson | N/A |
| Match vLLM benchmark | gamma | Same as vLLM --burstiness |
| Stress test burst handling | gamma | 0.3 - 0.7 |
| Reduce measurement noise | gamma | 2.0 - 5.0 |
| Maximum throughput | N/A (burst mode) | N/A |
For those who want to understand the statistical properties:
| Pattern | Distribution | Mean | Variance | CV (Coeff. of Variation) |
|---|---|---|---|---|
| Constant | Degenerate | 1/λ | 0 | 0 |
| Poisson | Exponential | 1/λ | 1/λ² | 1 |
| Gamma(k) | Gamma | 1/λ | 1/(k·λ²) | 1/√k |
Where λ = request rate and k = smoothness.
- CV (Coefficient of Variation) = standard deviation / mean
- Lower CV = more regular arrivals
- Gamma with k=1 equals Poisson (CV=1)
- As k→∞, Gamma approaches Constant (CV→0)
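
The Gamma row follows from the standard moments of a Gamma distribution with shape k and scale θ. A short derivation, assuming the scale is set to θ = 1/(kλ) so that the mean matches the requested rate (the symbol θ is introduced here for illustration):

```latex
\begin{aligned}
\theta &= \frac{1}{k\lambda}, \\
\mathbb{E}[X] &= k\theta = \frac{1}{\lambda}, \\
\operatorname{Var}[X] &= k\theta^{2} = \frac{1}{k\lambda^{2}}, \\
\mathrm{CV} &= \frac{\sqrt{\operatorname{Var}[X]}}{\mathbb{E}[X]} = \frac{1}{\sqrt{k}}.
\end{aligned}
```

For example, smoothness 0.5 gives CV ≈ 1.41, smoothness 2.0 gives CV ≈ 0.71, and smoothness 5.0 gives CV ≈ 0.45.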
- Request Rate with Concurrency — Combining rate and concurrency
- Warmup Phase — Configuring warmup with different patterns
- Timing Modes Reference — Complete CLI compatibility matrix