# Multi-Run Confidence Reporting
Multi-run confidence reporting allows you to run the same benchmark configuration multiple times to quantify measurement variance, assess repeatability, and compute confidence intervals for key metrics. This helps answer the critical question: "Is this performance difference real or just noise?"
When you run a single benchmark, the results can vary due to:
- System jitter (GPU clocks, background tasks)
- Network variance
- Server internal scheduling and batching dynamics
- Periodic stalls or transient errors
By running multiple trials of the same benchmark, you can:
- Quantify variance: Understand how much results vary between runs
- Assess repeatability: Determine if your measurements are stable
- Compute confidence intervals: Get honest uncertainty estimates
- Make informed decisions: Know if performance differences are statistically meaningful
Multi-run mode automatically uses the simple UI by default for the best experience. The dashboard UI is not supported due to terminal control limitations.
When using `--num-profile-runs > 1`, AIPerf automatically sets `--ui simple` unless you explicitly specify a different UI:

```bash
# These are equivalent - simple UI is auto-selected
aiperf profile --num-profile-runs 5 ...
aiperf profile --num-profile-runs 5 --ui simple ...
```

You'll see an informational message:

```
Multi-run mode: UI automatically set to 'simple' (use '--ui none' to disable UI output)
```
**Simple UI (Default)**

```bash
aiperf profile \
  --num-profile-runs 5 \
  ...
```

Shows progress bars for each run; works well with multi-run mode.

**No UI**

```bash
aiperf profile \
  --num-profile-runs 5 \
  --ui none \
  ...
```

Minimal output and fastest execution; ideal for automated runs or CI/CD pipelines.
The dashboard UI (`--ui dashboard`) is incompatible with multi-run mode due to terminal control constraints. If you explicitly try to use it, you'll get an error:

```bash
aiperf profile --num-profile-runs 5 --ui dashboard ...
```

```
ValueError: Dashboard UI is not supported with multi-run mode (--num-profile-runs > 1)
due to terminal control limitations. Please use '--ui simple' or '--ui none' instead.
```
This is a fundamental architectural limitation - Textual requires exclusive terminal control, which isn't possible when the orchestrator coordinates multiple subprocess runs.
If you need live dashboard updates, run benchmarks individually:
```bash
# Run each benchmark separately with live dashboard
aiperf profile --output-artifact-dir ./run1 --ui dashboard ...
aiperf profile --output-artifact-dir ./run2 --ui dashboard ...
aiperf profile --output-artifact-dir ./run3 --ui dashboard ...
```

Run the same benchmark 5 times:

```bash
aiperf profile \
  --model llama-3-8b \
  --endpoint-type openai_chat \
  --url http://localhost:8000/v1/chat/completions \
  --num-profile-runs 5 \
  --concurrency 10 \
  --num-prompts 1000
```

Use 99% confidence intervals instead of the default 95%:
```bash
aiperf profile \
  --model llama-3-8b \
  --endpoint-type openai_chat \
  --url http://localhost:8000/v1/chat/completions \
  --num-profile-runs 5 \
  --confidence-level 0.99 \
  --concurrency 10 \
  --num-prompts 1000
```

Add a 10-second cooldown between runs to reduce correlation:
```bash
aiperf profile \
  --model llama-3-8b \
  --endpoint-type openai_chat \
  --url http://localhost:8000/v1/chat/completions \
  --num-profile-runs 5 \
  --profile-run-cooldown-seconds 10.0 \
  --concurrency 10 \
  --num-prompts 1000
```

When `--num-profile-runs > 1`, AIPerf creates a hierarchical output structure with an auto-generated directory name:
```
artifacts/
  llama-3-8b-openai-chat-concurrency_10/
    profile_runs/
      run_0001/
        profile_export_aiperf.json
        profile_export_aiperf.csv
        profile_export.jsonl
        inputs.json
      run_0002/
        ...
      run_0005/
        ...
    aggregate/
      profile_export_aiperf_aggregate.json
      profile_export_aiperf_aggregate.csv
      profile_export_aiperf_collated.json   # only with adaptive convergence + records/raw export
```
The directory name is automatically generated based on your benchmark configuration:
- Model name: e.g., `llama-3-8b` (from `--model`)
- Service kind and endpoint type: e.g., `openai-chat` (from `--endpoint-type`)
- Stimulus: e.g., `concurrency_10` (from `--concurrency`) or `request_rate_100` (from `--request-rate`)
Examples:
- `artifacts/gpt-4-openai-chat-concurrency_50/`
- `artifacts/mistral-7b-openai-completions-request_rate_10/`
- `artifacts/llama-2-13b-nim-embeddings-concurrency_20/`
Each run's artifacts are stored in separate directories (`run_0001`, `run_0002`, etc.) and include:
- `profile_export_aiperf.json` - Complete metrics for that run
- `profile_export_aiperf.csv` - CSV export for that run
- `profile_export.jsonl` - Per-request records
- `inputs.json` - Input prompts used
This allows you to:
- Debug outliers by examining specific runs
- Compare individual runs
- Investigate anomalies
The `aggregate/` directory contains statistics computed across all runs:
- `profile_export_aiperf_aggregate.json` - Aggregated statistics
- `profile_export_aiperf_aggregate.csv` - Tabular view of aggregated metrics
For each metric, the aggregate output includes:
- `mean`: Average value across all runs
- `std`: Standard deviation (measure of spread)
- `min`: Minimum value observed
- `max`: Maximum value observed
- `cv`: Coefficient of Variation (normalized variability)
- `se`: Standard Error (uncertainty in the mean)
- `ci_low`, `ci_high`: Confidence interval bounds
- `t_critical`: t-distribution critical value used
Example aggregate output:

```json
{
  "metadata": {
    "aggregation_type": "confidence",
    "num_profile_runs": 5,
    "num_successful_runs": 5,
    "confidence_level": 0.95,
    "run_labels": ["run_0001", "run_0002", "run_0003", "run_0004", "run_0005"]
  },
  "metrics": {
    "request_throughput_avg": {
      "mean": 255.4,
      "std": 12.3,
      "min": 240.1,
      "max": 270.2,
      "cv": 0.048,
      "se": 5.5,
      "ci_low": 243.2,
      "ci_high": 267.6,
      "t_critical": 2.776,
      "unit": "requests/sec"
    },
    "ttft_p99_ms": {
      "mean": 152.7,
      "std": 12.4,
      "min": 138.2,
      "max": 168.9,
      "cv": 0.081,
      "se": 5.55,
      "ci_low": 140.3,
      "ci_high": 165.1,
      "t_critical": 2.776,
      "unit": "ms"
    }
  }
}
```

This section provides detailed mathematical definitions for each aggregate statistic computed across multiple runs.
**Mean**

Type: Aggregate Statistic

The average value of the metric across all successful runs.

Formula:

```
mean = sum(values) / n
```

Example: If TTFT p99 values across 5 runs are [150ms, 152ms, 148ms, 155ms, 151ms], the mean is 151.2ms.
**Standard Deviation (std)**

Type: Aggregate Statistic

Measures the spread or dispersion of metric values across runs. Uses the sample standard deviation (n - 1 degrees of freedom).

Formula:

```
std = sqrt(sum((x - mean)^2) / (n - 1))
```

Example: For the TTFT values above, std ≈ 2.59ms, indicating low variability.
**Min**

Type: Aggregate Statistic

The smallest value observed across all runs.

Example: For the TTFT values above, min = 148ms.
**Max**

Type: Aggregate Statistic

The largest value observed across all runs.

Example: For the TTFT values above, max = 155ms.
**Coefficient of Variation (cv)**

Type: Aggregate Statistic

A normalized measure of variability, expressed as a ratio (not a percentage). Useful for comparing variability across metrics with different scales.

Formula:

```
cv = std / abs(mean)
```

Notes:
- Uses `abs(mean)` to handle metrics that can be negative
- Returns `inf` when the mean is zero (division by zero)
- Lower CV indicates more consistent measurements

Example: For the TTFT values above, CV = 2.59 / 151.2 ≈ 0.017 (1.7%), indicating excellent repeatability.
**Standard Error (se)**

Type: Aggregate Statistic

Measures the uncertainty in the estimated mean. Decreases as the sample size increases.

Formula:

```
se = std / sqrt(n)
```

Example: For the TTFT values above with n=5, SE = 2.59 / sqrt(5) ≈ 1.16ms.

Notes:
- Smaller SE indicates a more precise estimate of the true mean
- SE decreases proportionally to 1/sqrt(n)
**Confidence Interval (ci_low, ci_high)**

Type: Aggregate Statistic

A range that likely contains the true population mean at a specified confidence level (default 95%).

Formula:

```
ci_low  = mean - t_critical * se
ci_high = mean + t_critical * se
```

where `t_critical` is the critical value from the t-distribution with (n - 1) degrees of freedom.

Example: For the TTFT values above with 95% confidence:
- t_critical ≈ 2.776 (for n=5, df=4)
- CI = [151.2 - 2.776 * 1.16, 151.2 + 2.776 * 1.16] = [148.0ms, 154.4ms]

We're 95% confident the true mean TTFT is between 148.0ms and 154.4ms.

Notes:
- Uses the t-distribution (not the normal distribution), which gives correct critical values for small sample sizes
- Confidence level is configurable via `--confidence-level` (default 0.95)
- CI width decreases with more runs (larger n)
**t-Critical Value (t_critical)**

Type: Aggregate Statistic

The critical value from the t-distribution used to compute confidence intervals. It depends on the sample size and confidence level.

Formula:

```
t_critical = t.ppf(1 - alpha/2, df)
```

where:
- `alpha = 1 - confidence_level`
- `df = n - 1` (degrees of freedom)
- `t.ppf` is the percent point function (inverse CDF) of the t-distribution

Examples:
- For n=5 runs and 95% confidence: t_critical ≈ 2.776
- For n=10 runs and 95% confidence: t_critical ≈ 2.262
- For n=5 runs and 99% confidence: t_critical ≈ 4.604

Notes:
- Computed using `scipy.stats.t.ppf()`
- Larger sample sizes yield smaller t-critical values (approaching the normal distribution)
- Higher confidence levels yield larger t-critical values (wider intervals)
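As a concrete reference, the sketch below computes every aggregate field from a list of run-level values. It is an illustrative reimplementation of the formulas above, not AIPerf source code; it assumes `scipy` is installed.

```python
import math

from scipy import stats

def aggregate_stats(values: list[float], confidence_level: float = 0.95) -> dict:
    """Compute the aggregate fields defined above for one metric."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (n - 1 degrees of freedom)
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    cv = std / abs(mean) if mean != 0 else float("inf")
    se = std / math.sqrt(n)
    # Two-sided critical value from the t-distribution
    t_critical = stats.t.ppf(1 - (1 - confidence_level) / 2, df=n - 1)
    return {
        "mean": mean,
        "std": std,
        "min": min(values),
        "max": max(values),
        "cv": cv,
        "se": se,
        "ci_low": mean - t_critical * se,
        "ci_high": mean + t_critical * se,
        "t_critical": t_critical,
    }

# Reproduces the worked TTFT example:
# mean=151.2, std≈2.59, se≈1.16, t_critical≈2.776, CI≈[148.0, 154.4]
print(aggregate_stats([150, 152, 148, 155, 151]))
```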
**Interpreting the Coefficient of Variation**

The CV is a normalized measure of variability: `cv = std / abs(mean)`.
Interpretation Guidelines:

- **CV < 0.05 (5%)**: Excellent repeatability, low noise
  - Results are very stable
  - High confidence in measurements
  - Small differences are likely meaningful
- **CV 0.05-0.10 (5-10%)**: Good repeatability, acceptable noise
  - Results are reasonably stable
  - Moderate confidence in measurements
  - Medium-sized differences are likely meaningful
- **CV 0.10-0.20 (10-20%)**: Fair repeatability, moderate variance
  - Results show noticeable variation
  - Consider running more trials
  - Only large differences are clearly meaningful
- **CV > 0.20 (>20%)**: High variability
  - Results are unstable
  - Investigate sources of variance
  - Increase the number of runs or use a cooldown
  - Be cautious about drawing conclusions
Example:

```
ttft_p99_ms: mean=152.7ms, cv=0.081 (8.1%)
```

This indicates good repeatability. The p99 TTFT varies by about 8% between runs, which is acceptable for most use cases.
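If you check CVs in scripts, the guidelines above translate directly into code. This is an illustrative helper, not part of AIPerf:

```python
def cv_verdict(cv: float) -> str:
    """Map a coefficient of variation to the repeatability tiers above."""
    if cv < 0.05:
        return "excellent repeatability, low noise"
    if cv < 0.10:
        return "good repeatability, acceptable noise"
    if cv < 0.20:
        return "fair repeatability, consider more trials"
    return "high variability, investigate sources of variance"

print(cv_verdict(0.081))  # good repeatability, acceptable noise
```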
**Interpreting Confidence Intervals**

The confidence interval tells you: if we repeated this experiment many times, X% of the intervals computed this way would contain the true mean.
Interpretation Guidelines:

- **Narrow CI**: High precision, confident in the estimate
  - The true mean is likely very close to the measured mean
  - A small sample size may still be sufficient
- **Wide CI**: Lower precision, more uncertainty
  - The true mean could be anywhere in a broad range
  - Consider increasing `--num-profile-runs`
  - May need to investigate sources of variance
Example:

```
ttft_p99_ms: mean=152.7ms, 95% CI=[140.3, 165.1]
```

We're 95% confident the true mean p99 TTFT is between 140.3ms and 165.1ms. The 24.8ms width suggests moderate uncertainty with 5 runs.
When comparing two configurations, consider:

- **Do the confidence intervals overlap?**
  - No overlap → Strong evidence of a real difference
  - Partial overlap → Likely a real difference, but less certain
  - Complete overlap → The difference may not be meaningful
- **Is the difference larger than the CV?**
  - If Config A has mean=100ms (CV=10%) and Config B has mean=120ms, the difference is 20%, which is 2× the CV
  - This suggests a real difference
Example:

```
Config A: mean=150ms, CI=[145, 155]
Config B: mean=180ms, CI=[175, 185]
```

No overlap in CIs → Strong evidence that Config B is slower.
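A quick script can apply the overlap heuristic to CI bounds pulled from two aggregate files. This is an illustrative sketch, not AIPerf functionality:

```python
def cis_overlap(ci_a: tuple[float, float], ci_b: tuple[float, float]) -> bool:
    """True if the two confidence intervals share any points."""
    (a_low, a_high), (b_low, b_high) = ci_a, ci_b
    return a_low <= b_high and b_low <= a_high

config_a = (145.0, 155.0)
config_b = (175.0, 185.0)
# No overlap -> strong evidence of a real difference
print(cis_overlap(config_a, config_b))  # False
```

Note that non-overlapping CIs are strong evidence of a difference, but overlapping CIs do not prove equivalence; two means can still differ significantly when their intervals overlap slightly. For a formal comparison, you could run a two-sample t-test (e.g., `scipy.stats.ttest_ind`) on the run-level values.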
**Choosing the Number of Runs**

- **Quick check: 3 runs**
  - Minimum for basic statistics
  - Good for initial exploration
- **Standard benchmarking: 5 runs**
  - Good balance of time and precision
  - Recommended for most use cases
- **High-precision: 10 runs**
  - When you need very precise estimates
  - When comparing small differences
  - When variance is high
Consider adding more runs when you see:
- High CV (>10%): More runs will reduce uncertainty
- Wide confidence intervals: More runs will narrow the CI
- Overlapping CIs when comparing: More runs may separate them
- Inconsistent results: More runs will clarify the true mean
Important: All runs use the same workload (prompts, ordering, scheduling) to ensure a fair comparison.

AIPerf automatically:
- Sets `--random-seed 42` if not specified (for multi-run consistency)
- Uses the same prompts in the same order for all runs
- Uses the same request timing patterns

This ensures that observed variance is due to real system noise, not artificial differences in the workload.
You can specify your own seed:

```bash
aiperf profile \
  --num-profile-runs 5 \
  --random-seed 123 \
  ...
```

All 5 runs will use seed 123, ensuring identical workloads.
When using multi-run with warmup:

```bash
aiperf profile \
  --num-profile-runs 5 \
  --warmup-request-count 100 \
  ...
```

- By default, warmup runs once, before the first profile run only
- Subsequent profile runs (2-5) measure steady-state performance without warmup
- Warmup metrics are automatically excluded from results
- Use `--profile-run-disable-warmup-after-first false` to run warmup before each run (useful for long cooldown periods)

This default behavior is more efficient and provides more accurate aggregate statistics by measuring steady-state performance.
**Troubleshooting: High Variance Between Runs**

Possible causes:
- System is under load from other processes
- Network instability
- Server batching/scheduling dynamics
- Insufficient warmup
Solutions:
- Use `--profile-run-cooldown-seconds` to reduce correlation
- Increase `--warmup-request-count` to stabilize the server
- Run benchmarks during low-load periods
- Investigate the server configuration
- Increase `--num-profile-runs` to better characterize variance
**Troubleshooting: Failed Runs**

If some runs fail, AIPerf will:
- Continue with the remaining runs
- Compute statistics over successful runs only
- Report failed runs in the aggregate metadata

Example output:

```json
{
  "metadata": {
    "num_profile_runs": 5,
    "num_successful_runs": 4,
    "failed_runs": [
      {"label": "run_0003", "error": "Connection timeout"}
    ]
  }
}
```

If fewer than 2 runs succeed, you'll get an error:

```
ValueError: Insufficient successful runs for confidence intervals.
Got 1 successful run(s), but need at least 2.
Consider increasing --num-profile-runs or investigating why runs are failing.
```

Solution: Increase `--num-profile-runs` or fix the underlying issue causing failures.
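For automation, the metadata block shown above can be checked programmatically before trusting the statistics. An illustrative sketch (the file path is a placeholder for your aggregate output):

```python
import json

# Placeholder path - point this at your run's aggregate output
with open("profile_export_aiperf_aggregate.json") as f:
    meta = json.load(f)["metadata"]

print(f"{meta['num_successful_runs']}/{meta['num_profile_runs']} runs succeeded")
for failed in meta.get("failed_runs", []):
    print(f"  {failed['label']}: {failed['error']}")
```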
If `--num-profile-runs` is large and each run takes a long time:

- **Reduce run duration:**
  - Use fewer prompts: `--num-prompts 500` instead of `--num-prompts 5000`
  - Use shorter prompts: `--synthetic-input-tokens-mean 100`
- **Use cooldown strategically:**
  - Only add a cooldown if you see high correlation between runs
  - Start without a cooldown and add one if needed
- **Run overnight:**
  - For production validation with many runs

Starting with the standard 5 runs and no cooldown provides a good balance of precision and time investment.
After running, look at the CV for your key metrics:
- CV < 10%: Results are trustworthy
- CV > 10%: Consider more runs or investigate variance
Always use warmup to eliminate cold-start effects: `--warmup-request-count 100`

For reproducible experiments, pin the seed: `--random-seed 42`

Save your command and results for future reference:

```bash
aiperf profile ... | tee benchmark_log.txt
```

When comparing configurations:
- Use the same `--num-profile-runs`
- Use the same `--random-seed`
- Use the same workload parameters
**Adaptive Convergence**

Instead of always running a fixed number of trials, you can specify a convergence criterion that stops benchmarking early once metrics stabilize. This saves time when results converge quickly and runs to the maximum when they don't.
When `--convergence-metric` is set, AIPerf switches from `FixedTrialsStrategy` to `AdaptiveStrategy`:
- Runs at least `min(3, num_profile_runs)` trials
- After each run, checks whether the convergence criterion is satisfied
- Stops early if converged, otherwise continues up to `--num-profile-runs`
The IID property of runs is preserved — convergence operates on independent run-level statistics and never feeds aggregated data back into the decision loop.
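Conceptually, the adaptive strategy wraps the fixed-trials loop with an early-exit check. The sketch below shows the control flow only; the helper names are hypothetical, not AIPerf internals.

```python
def run_adaptive(num_profile_runs, run_benchmark, check_converged):
    """Hypothetical sketch: run up to num_profile_runs trials, stopping early."""
    min_runs = min(3, num_profile_runs)
    results = []
    for i in range(num_profile_runs):
        results.append(run_benchmark(i))  # each trial is an independent run
        # Only run-level statistics feed the decision;
        # aggregated data never loops back into it
        if len(results) >= min_runs and check_converged(results):
            break
    return results
```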
Three statistical methods are available via `--convergence-mode`:

**CI Width (default)**: Stops when the Student's t confidence interval width, relative to the mean, falls below the threshold. Operates on run-level summary statistics.

```bash
aiperf profile \
  --num-profile-runs 10 \
  --convergence-metric request_latency \
  --convergence-mode ci_width \
  --convergence-threshold 0.10 \
  --convergence-stat avg \
  --concurrency 10 \
  ...
```

**CV (Coefficient of Variation)**: Stops when the CV (std/mean) across run-level values drops below the threshold.
```bash
aiperf profile \
  --num-profile-runs 10 \
  --convergence-metric time_to_first_token \
  --convergence-mode cv \
  --convergence-threshold 0.05 \
  --convergence-stat p99 \
  --concurrency 10 \
  ...
```

**Distribution (KS Test)**: Uses a two-sample Kolmogorov-Smirnov test on per-request JSONL data to detect when the latest run's distribution matches prior runs. Catches bimodal behavior and tail shifts that summary statistics miss.
```bash
aiperf profile \
  --num-profile-runs 10 \
  --convergence-metric inter_token_latency \
  --convergence-mode distribution \
  --convergence-threshold 0.10 \
  --export-level records \
  --concurrency 10 \
  ...
```

Note: Distribution mode requires `--export-level records` or `--export-level raw` because it reads per-request JSONL data. It is rejected with `--export-level summary`.
For `ci_width` and `cv`, a lower threshold is stricter (harder to converge). For `distribution`, the threshold is a KS-test p-value: convergence triggers when `p_value > threshold`, so a higher threshold is stricter. AIPerf logs this at runtime:

```
Note: distribution mode converges when KS p-value > threshold
(higher threshold = stricter, opposite of ci_width/cv)
```
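The two threshold directions are easy to mix up, so here is an illustrative sketch of both checks (not AIPerf source; assumes `numpy` and `scipy`):

```python
import numpy as np
from scipy import stats

def converged_cv(run_values: list[float], threshold: float = 0.05) -> bool:
    # Lower threshold = stricter: converge when run-to-run spread is small
    cv = np.std(run_values, ddof=1) / abs(np.mean(run_values))
    return cv < threshold

def converged_distribution(latest_run, prior_runs, threshold: float = 0.10) -> bool:
    # Higher threshold = stricter: converge only when the KS test cannot
    # distinguish the latest run's per-request values from the prior runs
    p_value = stats.ks_2samp(latest_run, np.concatenate(prior_runs)).pvalue
    return p_value > threshold
```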
| Flag | Description | Default |
|---|---|---|
| `--convergence-metric` | Target metric name (e.g., `request_latency`, `time_to_first_token`) | None (disabled) |
| `--convergence-mode` | Statistical method: `ci_width`, `cv`, or `distribution` | `ci_width` |
| `--convergence-threshold` | Convergence threshold (0–1) | 0.10 |
| `--convergence-stat` | Statistic to evaluate: `avg`, `p50`, `p90`, `p95`, `p99`, `min`, `max` | `avg` |
All convergence flags require `--num-profile-runs > 1`. The `--convergence-stat` flag applies to `ci_width` and `cv` modes only (not `distribution`).
When `--num-profile-runs` is 2, AIPerf adjusts the minimum runs for convergence checks accordingly and logs a warning about reduced statistical power:

```
WARNING: --num-profile-runs=2 is below the recommended minimum of 3.
Convergence checks will have reduced statistical power.
For meaningful convergence, 3+ runs is recommended.
```
If `--convergence-metric` contains a typo, AIPerf warns after the minimum runs complete with zero matching values:

```
WARNING: Convergence metric 'tttf' (stat 'avg') not found in any run's summary metrics;
convergence will never trigger. Check --convergence-metric spelling.
Available metrics: ['inter_token_latency', 'request_latency', 'request_throughput', 'time_to_first_token']
```
When adaptive convergence is enabled and `--export-level` is `records` or `raw`, AIPerf produces an additional `profile_export_aiperf_collated.json` in the aggregate directory. This reads per-request JSONL from all runs, combines them into a single population per metric, and computes true combined percentiles (p50, p90, p95, p99).

This complements the standard confidence aggregation: confidence aggregation operates on run-level summary stats, while detailed aggregation gives a combined distribution view over all requests.
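The key difference from confidence aggregation is that percentiles are taken over the pooled per-request population, not averaged across per-run percentiles. An illustrative sketch with stand-in data:

```python
import numpy as np

# Stand-in for per-request latency values read from each run's JSONL
per_run_values = [np.random.lognormal(mean=3.8, sigma=0.3, size=1000) for _ in range(3)]

# Pool all requests into one population, then compute true combined percentiles
combined = np.concatenate(per_run_values)
summary = {
    "mean": round(float(combined.mean()), 2),
    "std": round(float(combined.std(ddof=1)), 2),
    **{f"p{q}": round(float(np.percentile(combined, q)), 2) for q in (50, 90, 95, 99)},
    "count": int(combined.size),
}
print(summary)
```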
The resulting directory layout:

```
artifacts/
  llama-3-8b-openai-chat-concurrency_10/
    profile_runs/
      run_0001/
        profile_export_aiperf.json
        profile_export_aiperf.csv
        profile_export.jsonl
        inputs.json
      run_0002/
        ...
      run_0003/   # may stop here if converged
        ...
    aggregate/
      profile_export_aiperf_aggregate.json
      profile_export_aiperf_aggregate.csv
      profile_export_aiperf_collated.json   # per-request combined percentiles
```
Example `profile_export_aiperf_collated.json`:

```json
{
  "schema_version": "1.0.0",
  "aiperf_version": "0.5.0",
  "metadata": {
    "aggregation_type": "detailed",
    "num_profile_runs": 3,
    "num_successful_runs": 3,
    "failed_runs": [],
    "run_labels": ["run_0001", "run_0002", "run_0003"]
  },
  "metrics": {
    "time_to_first_token": {
      "combined": {
        "mean": 45.23,
        "std": 12.8,
        "p50": 42.1,
        "p90": 58.7,
        "p95": 65.3,
        "p99": 78.9,
        "count": 3000
      },
      "per_run": [
        {"label": "run_0001", "mean": 44.8, "count": 1000},
        {"label": "run_0002", "mean": 45.1, "count": 1000},
        {"label": "run_0003", "mean": 45.8, "count": 1000}
      ]
    }
  }
}
```

Multi-run works with all AIPerf features:
With GPU telemetry:

```bash
aiperf profile \
  --num-profile-runs 5 \
  --gpu-telemetry-url http://localhost:9400/metrics \
  ...
```

With server metrics:

```bash
aiperf profile \
  --num-profile-runs 5 \
  --server-metrics-url http://localhost:8000/metrics \
  ...
```

With trace replay:

```bash
aiperf profile \
  --num-profile-runs 5 \
  --trace-file my_trace.jsonl \
  ...
```

Load aggregate results in Python:
```python
import json

# Path follows the auto-generated layout: artifacts/<run-name>/aggregate/
with open('artifacts/llama-3-8b-openai-chat-concurrency_10/aggregate/profile_export_aiperf_aggregate.json') as f:
    agg = json.load(f)

# Get throughput statistics
throughput = agg['metrics']['request_throughput_avg']
print(f"Mean: {throughput['mean']:.2f} req/s")
print(f"CV: {throughput['cv']:.1%}")
print(f"95% CI: [{throughput['ci_low']:.2f}, {throughput['ci_high']:.2f}]")
```

Multi-run confidence reporting helps you:
- ✅ Quantify measurement variance
- ✅ Assess repeatability with CV
- ✅ Compute confidence intervals
- ✅ Make statistically informed decisions
- ✅ Debug outliers with per-run artifacts
Quick Start:

```bash
aiperf profile --num-profile-runs 5 [other options]
```

With Adaptive Convergence:

```bash
aiperf profile --num-profile-runs 10 --convergence-metric request_latency --convergence-mode ci_width [other options]
```

Key Metrics:
- CV < 10%: Good repeatability
- Narrow CI: High precision
- No CI overlap: Strong evidence of difference
For more details, see:
- CLI Options - Full parameter reference
- Metrics Reference - Detailed metric descriptions
- Architecture - How multi-run orchestration works