Commit 6ebc186

More benchmarking tool
Signed-off-by: Adam Gutglick <[email protected]>
1 parent 0182ef9 commit 6ebc186

6 files changed: +452 -358 lines changed


bench-orchestrator/README.md

Lines changed: 47 additions & 34 deletions
@@ -16,20 +16,24 @@ This installs the `vx-bench` command.
 
 ```bash
 # Run TPC-H benchmarks with DataFusion and DuckDB
+# A comparison table is automatically displayed after the run
 vx-bench run tpch --engine datafusion,duckdb --format parquet,vortex
 
 # List recent benchmark runs
 vx-bench list
 
-# Compare the two most recent runs
-vx-bench compare --runs latest,<previous-run-id>
+# Compare engine:format combinations within a single run
+vx-bench compare --run latest
+
+# Compare multiple runs (2 or more)
+vx-bench compare --runs run1,run2,run3
 ```
 
 ## Commands
 
 ### `run` - Execute Benchmarks
 
-Run benchmark suites across multiple engines and formats.
+Run benchmark suites across multiple engines and formats. After completion, a comparison table is automatically displayed if there are multiple engine:format combinations.
 
 ```bash
 vx-bench run <benchmark> [options]
@@ -53,19 +57,25 @@ vx-bench run <benchmark> [options]
 
 ### `compare` - Compare Results
 
-Compare benchmark results between runs or specific configurations.
+Compare benchmark results within a run or across multiple runs. Results are displayed in a pivot table format.
 
 ```bash
 vx-bench compare [options]
 ```
 
 **Options:**
 
-- `--runs, -r`: Two run IDs to compare, comma-separated
-- `--base, -b`: Base reference (`engine:format@run`)
-- `--target, -t`: Target reference (`engine:format@run`)
+- `--run`: Single run for within-run comparison (compares different engine:format combinations)
+- `--runs, -r`: Multiple runs to compare, comma-separated (2 or more)
+- `--baseline`: Baseline for comparison (engine:format for within-run, or run label for multi-run)
+- `--engine`: Filter results to a specific engine
+- `--format`: Filter results to a specific format
 - `--threshold`: Significance threshold (default: 0.10 = 10%)
 
+**Within-run comparison** (`--run`): Compares different engine:format combinations within a single run. Output shows one row per query, with columns for each engine:format combo.
+
+**Multi-run comparison** (`--runs`): Compares the same benchmarks across multiple runs. Output shows one row per (query, engine, format) combination, with columns for each run.
+
 ### `list` - List Benchmark Runs
 
 ```bash
@@ -149,10 +159,11 @@ Compare performance across different query engines:
 
 ```bash
 # Run all engines on the same data
+# Comparison table is displayed automatically after the run
 vx-bench run tpch -e datafusion,duckdb -f parquet -l engine-comparison
 
-# View results
-vx-bench show latest
+# Or compare within the run later
+vx-bench compare --run engine-comparison
 ```
 
 ### 4. Format Performance Analysis
@@ -167,10 +178,11 @@ vx-bench run tpch \
   -i 10 \
   -l format-analysis
 
-# Compare specific format pairs
-vx-bench compare \
-  --base "datafusion:parquet@format-analysis" \
-  --target "datafusion:vortex@format-analysis" \
+# Compare within the run (table shown automatically after run too)
+vx-bench compare --run format-analysis
+
+# Use a specific baseline
+vx-bench compare --run format-analysis --baseline datafusion:parquet
 ```
 
 ### 5. Memory Usage Analysis
@@ -246,33 +258,34 @@ vx-bench clean --older-than "30 days" --no-keep-labeled
 | duckdb | parquet, vortex, vortex-compact, duckdb |
 | lance | lance |
 
-## Target Reference Syntax
+## Output Format
 
-When using `--base` and `--target` options, use this format:
+Comparison results are displayed in a pivot table format:
 
+**Within-run comparison** (`--run`):
 ```
-engine:format@run
+┌───────┬──────────────────────┬────────────────────────┐
+│ Query │ duckdb:parquet (base)│ duckdb:vortex          │
+├───────┼──────────────────────┼────────────────────────┤
+│ 1     │ 100.5ms              │ 80.2ms (0.80x)         │
+│ 2     │ 200.1ms              │ 150.0ms (0.75x)        │
+└───────┴──────────────────────┴────────────────────────┘
 ```
 
-- `engine`: Engine name (`datafusion`, `duckdb`, `lance`) or `*` for wildcard
-- `format`: Format name (`parquet`, `vortex`, etc.) or `*` for wildcard
-- `run`: Run ID, label, or `latest`
-
-Examples:
-
-- `duckdb:parquet@latest` - DuckDB with Parquet from the latest run
-- `*:vortex@baseline` - All engines with Vortex from the "baseline" run
-- `datafusion:*@2025-01-15` - All formats with DataFusion from a specific run
-
-## Output Formats
-
-### Terminal Output
-
-Default output uses rich formatting with color-coded ratios:
+**Multi-run comparison** (`--runs`):
+```
+┌───────┬────────┬─────────┬──────────────┬──────────────────┐
+│ Query │ Engine │ Format  │ run1 (base)  │ run2             │
+├───────┼────────┼─────────┼──────────────┼──────────────────┤
+│ 1     │ duckdb │ parquet │ 100ms        │ 95ms (0.95x)     │
+│ 1     │ duckdb │ vortex  │ 80ms         │ 75ms (0.94x)     │
+└───────┴────────┴─────────┴──────────────┴──────────────────┘
+```
 
-- Green (with up arrow): Improvement (>10% faster)
-- Red (with down arrow): Regression (>10% slower)
-- Yellow: Neutral (within 10%)
+Ratios are color-coded:
+- **Green**: Improvement (>10% faster, ratio < 0.9)
+- **Red**: Regression (>10% slower, ratio > 1.1)
+- **Yellow**: Neutral (within 10%)
 
 ## Data Storage
 
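The pivot layout documented above maps naturally onto a pandas pivot table. A minimal sketch of the idea follows; the committed `analyzer.compare_within_run` lives in `comparison/analyzer.py`, which is not expanded in this view, so the column names `query`, `engine`, `format`, and `duration_ms` below are assumptions, not the actual schema.

```python
import pandas as pd


def within_run_pivot(df: pd.DataFrame, baseline: str) -> pd.DataFrame:
    """Sketch: one row per query, one column per engine:format combination.

    Assumes `df` has (hypothetical) columns query, engine, format, duration_ms.
    Returned values are ratios against the `baseline` column, e.g.
    "duckdb:parquet"; a ratio below 1.0 means faster than the baseline.
    """
    combo = df["engine"] + ":" + df["format"]
    pivot = df.assign(combo=combo).pivot_table(
        index="query", columns="combo", values="duration_ms", aggfunc="median"
    )
    # Divide every column by the baseline column, row by row.
    return pivot.div(pivot[baseline], axis=0)
```

With the sample numbers in the table above, query 1 for `duckdb:vortex` comes out at 80.2 / 100.5 ≈ 0.80, matching the `(0.80x)` shown.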

bench-orchestrator/bench_orchestrator/cli.py

Lines changed: 96 additions & 70 deletions
@@ -6,12 +6,13 @@
 from datetime import datetime, timedelta
 from typing import Annotated
 
+import pandas as pd
 import typer
 from rich.console import Console
 from rich.table import Table
 
-from .comparison.analyzer import BenchmarkAnalyzer, TargetRef
-from .comparison.reporter import BenchmarkReporter
+from .comparison import analyzer
+from .comparison.reporter import pivot_comparison_table
 from .config import (
     ENGINE_FORMATS,
     Benchmark,
@@ -49,6 +50,10 @@ def parse_queries(value: str | None) -> list[int] | None:
     return [int(q.strip()) for q in value.split(",")]
 
 
+def run_ref_auto_complete() -> list[str]:
+    return list(map(lambda x: x.run_id, ResultStore().list_runs(limit=None)))
+
+
 @app.command()
 def run(
     benchmark: Annotated[Benchmark, typer.Argument(help="Benchmark suite to run")],
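The new `run_ref_auto_complete` helper feeds typer's shell completion with every stored run ID. Typer also passes the partially typed value to completion callbacks that declare a string parameter, so a prefix-filtered variant is possible. A hedged sketch of that variant, not part of this commit, reusing `ResultStore.list_runs` and `run_id` exactly as above:

```python
# Illustrative variant only: filter completion candidates by the text typed so far.
def run_ref_auto_complete_filtered(incomplete: str) -> list[str]:
    runs = ResultStore().list_runs(limit=None)
    return [r.run_id for r in runs if r.run_id.startswith(incomplete)]
```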
@@ -154,103 +159,124 @@ def run(
 
     console.print(f"\n[green]Results saved to run: {ctx.metadata.run_id}[/green]")
 
+    # Show comparison table if we have multiple engine:format combinations
+    df = store.load_results(ctx.metadata.run_id)
+    if not df.empty:
+        try:
+            pivot = analyzer.compare_within_run(df)
+            table = pivot_comparison_table(pivot)
+            console.print()
+            console.print(table)
+        except ValueError:
+            # Not enough combinations to compare
+            pass
+
 
 @app.command()
 def compare(
-    base: Annotated[
+    runs: Annotated[
         str | None,
-        typer.Option("--base", "-b", help="Base reference (engine:format@run)"),
+        typer.Option("--runs", "-r", help="Runs to compare (comma-separated, 2 or more)"),
     ] = None,
-    target: Annotated[
+    run: Annotated[
         str | None,
-        typer.Option("--target", "-t", help="Target reference (engine:format@run)"),
+        typer.Option("--run", help="Single run for within-run comparison", autocompletion=run_ref_auto_complete),
     ] = None,
-    runs: Annotated[
+    baseline: Annotated[
         str | None,
-        typer.Option("--runs", "-r", help="Two runs to compare (comma-separated)"),
+        typer.Option("--baseline", help="Baseline engine:format for within-run comparison"),
     ] = None,
     threshold: Annotated[float, typer.Option("--threshold", help="Significance threshold (default 10%)")] = 0.10,
+    filter_engine: Annotated[
+        str | None, typer.Option("--engine", help="Filter only for results that use a specific engine")
+    ] = None,
+    filter_format: Annotated[
+        str | None, typer.Option("--format", help="Filter only for results that use a specific file format")
+    ] = None,
 ) -> None:
     """Compare benchmark results."""
     store = ResultStore()
 
-    if runs:
-        # Compare two full runs
-        run_refs = [r.strip() for r in runs.split(",")]
-        if len(run_refs) != 2:
-            console.print("[red]--runs requires exactly two run references[/red]")
+    if run:
+        # Within-run comparison
+        run_meta = store.get_run(run)
+        if not run_meta:
+            console.print(f"[red]Run not found: {run}[/red]")
             raise typer.Exit(1)
 
-        base_run = store.get_run(run_refs[0])
-        target_run = store.get_run(run_refs[1])
+        df = store.load_results(run_meta.run_id)
 
-        if not base_run:
-            console.print(f"[red]Run not found: {run_refs[0]}[/red]")
+        if df.empty:
+            console.print("[red]No results found[/red]")
             raise typer.Exit(1)
-        if not target_run:
-            console.print(f"[red]Run not found: {run_refs[1]}[/red]")
-            raise typer.Exit(1)
-
-        base_df = store.load_results(base_run.run_id)
-        target_df = store.load_results(target_run.run_id)
 
-        base_label = base_run.label or base_run.run_id[:20]
-        target_label = target_run.label or target_run.run_id[:20]
+        # Parse baseline if provided
+        baseline_engine = None
+        baseline_format = None
+        if baseline:
+            if ":" in baseline:
+                baseline_engine, baseline_format = baseline.split(":", 1)
+            else:
+                console.print("[red]--baseline must be engine:format[/red]")
+                raise typer.Exit(1)
 
-    elif base and target:
-        # Compare specific configurations
-        base_ref = TargetRef.parse(base)
-        target_ref = TargetRef.parse(target)
+        try:
+            pivot = analyzer.compare_within_run(df, baseline_engine, baseline_format, filter_engine, filter_format)
+        except ValueError as e:
+            console.print(f"[red]{e}[/red]")
+            raise typer.Exit(1)
 
-        base_run = store.get_run(base_ref.run)
-        target_run = store.get_run(target_ref.run)
+        table = pivot_comparison_table(pivot, threshold)
+        console.print(table)
+        return
 
-        if not base_run:
-            console.print(f"[red]Run not found: {base_ref.run}[/red]")
-            raise typer.Exit(1)
-        if not target_run:
-            console.print(f"[red]Run not found: {target_ref.run}[/red]")
+    elif runs:
+        # Compare multiple runs (2 or more)
+        run_refs = [r.strip() for r in runs.split(",")]
+        if len(run_refs) < 2:
+            console.print("[red]--runs requires at least two run references[/red]")
             raise typer.Exit(1)
 
-        base_df = store.load_results(base_run.run_id)
-        target_df = store.load_results(target_run.run_id)
-
-        # Apply filters
-        base_analyzer = BenchmarkAnalyzer(base_df)
-        target_analyzer = BenchmarkAnalyzer(target_df)
+        # Load all runs
+        run_data: list[tuple[str, pd.DataFrame]] = []
+        for ref in run_refs:
+            run_meta = store.get_run(ref)
+            if not run_meta:
+                console.print(f"[red]Run not found: {ref}[/red]")
+                raise typer.Exit(1)
+            label = run_meta.label or run_meta.run_id[:16]
+            df = store.load_results(run_meta.run_id)
+            if df.empty:
+                console.print(f"[red]No results for run: {ref}[/red]")
+                raise typer.Exit(1)
+            run_data.append((label, df))
+
+        # Use baseline option if provided, otherwise first run
+        baseline_label = None
+        if baseline:
+            # Find matching label
+            for label, _ in run_data:
+                if baseline in label:
+                    baseline_label = label
+                    break
+            if baseline_label is None:
+                console.print(f"[red]Baseline not found: {baseline}[/red]")
+                raise typer.Exit(1)
 
-        base_df = base_analyzer.filter_by_ref(base_ref)
-        target_df = target_analyzer.filter_by_ref(target_ref)
+        try:
+            pivot = analyzer.compare_runs(run_data, baseline_label, filter_engine, filter_format)
+        except ValueError as e:
+            console.print(f"[red]{e}[/red]")
+            raise typer.Exit(1)
 
-        base_label = base
-        target_label = target
+        table = pivot_comparison_table(pivot, threshold, row_keys=["query", "engine", "format"])
+        console.print(table)
+        return
 
     else:
-        console.print("[red]Must specify either --runs or --base/--target[/red]")
-        raise typer.Exit(1)
-
-    if base_df.empty:
-        console.print("[red]No results found for base[/red]")
-        raise typer.Exit(1)
-    if target_df.empty:
-        console.print("[red]No results found for target[/red]")
+        console.print("[red]Must specify either --runs or --run[/red]")
         raise typer.Exit(1)
 
-    # Perform comparison
-    analyzer = BenchmarkAnalyzer(base_df)
-    comparison = analyzer.compare(base_df, target_df)
-    stats = analyzer.summary_stats(comparison)
-
-    reporter = BenchmarkReporter(comparison, stats, threshold)
-
-    table = reporter.to_rich_table(
-        title="Benchmark Comparison",
-        base_label=base_label,
-        target_label=target_label,
-    )
-    console.print(table)
-    reporter.print_summary()
-
 
 @app.command("list")
 def list_runs(
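Both branches hand the pivot to `pivot_comparison_table`, which comes from `.comparison.reporter`, one of the changed files not expanded here. A minimal sketch of how a pivot of ratios could be rendered with rich, following the color rules documented in the README; the function name, the ratio-only pivot shape, and the single `Query` index are assumptions:

```python
import pandas as pd
from rich.table import Table


def render_ratio_pivot(pivot: pd.DataFrame, threshold: float = 0.10) -> Table:
    """Sketch: color each ratio cell green/red/yellow around the threshold."""
    table = Table(title="Benchmark Comparison")
    table.add_column("Query")
    for col in pivot.columns:
        table.add_column(str(col))
    for query, row in pivot.iterrows():
        cells = []
        for ratio in row:
            if ratio < 1.0 - threshold:
                cells.append(f"[green]{ratio:.2f}x[/green]")    # >10% faster
            elif ratio > 1.0 + threshold:
                cells.append(f"[red]{ratio:.2f}x[/red]")        # >10% slower
            else:
                cells.append(f"[yellow]{ratio:.2f}x[/yellow]")  # within 10%
        table.add_row(str(query), *cells)
    return table
```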
@@ -305,7 +331,7 @@ def list_runs(
 
 @app.command()
 def show(
-    run_ref: Annotated[str, typer.Argument(help="Run ID, label, or 'latest'")],
+    run_ref: Annotated[str, typer.Argument(help="Run ID, label, or 'latest'", autocompletion=run_ref_auto_complete)],
 ) -> None:
     """Show details of a specific run."""
     store = ResultStore()
