Commit 6ebc186

More benchmarking tool
Signed-off-by: Adam Gutglick <[email protected]>
1 parent 0182ef9 commit 6ebc186

6 files changed: +452 -358 lines changed


bench-orchestrator/README.md

Lines changed: 47 additions & 34 deletions
@@ -16,20 +16,24 @@ This installs the `vx-bench` command.
 
 ```bash
 # Run TPC-H benchmarks with DataFusion and DuckDB
+# A comparison table is automatically displayed after the run
 vx-bench run tpch --engine datafusion,duckdb --format parquet,vortex
 
 # List recent benchmark runs
 vx-bench list
 
-# Compare the two most recent runs
-vx-bench compare --runs latest,<previous-run-id>
+# Compare engine:format combinations within a single run
+vx-bench compare --run latest
+
+# Compare multiple runs (2 or more)
+vx-bench compare --runs run1,run2,run3
 ```
 
 ## Commands
 
 ### `run` - Execute Benchmarks
 
-Run benchmark suites across multiple engines and formats.
+Run benchmark suites across multiple engines and formats. After completion, a comparison table is automatically displayed if there are multiple engine:format combinations.
 
 ```bash
 vx-bench run <benchmark> [options]
@@ -53,19 +57,25 @@ vx-bench run <benchmark> [options]
 
 ### `compare` - Compare Results
 
-Compare benchmark results between runs or specific configurations.
+Compare benchmark results within a run or across multiple runs. Results are displayed in a pivot table format.
 
 ```bash
 vx-bench compare [options]
 ```
 
 **Options:**
 
-- `--runs, -r`: Two run IDs to compare, comma-separated
-- `--base, -b`: Base reference (`engine:format@run`)
-- `--target, -t`: Target reference (`engine:format@run`)
+- `--run`: Single run for within-run comparison (compares different engine:format combinations)
+- `--runs, -r`: Multiple runs to compare, comma-separated (2 or more)
+- `--baseline`: Baseline for comparison (engine:format for within-run, or run label for multi-run)
+- `--engine`: Filter results to a specific engine
+- `--format`: Filter results to a specific format
 - `--threshold`: Significance threshold (default: 0.10 = 10%)
 
+**Within-run comparison** (`--run`): Compares different engine:format combinations within a single run. Output shows one row per query, with columns for each engine:format combo.
+
+**Multi-run comparison** (`--runs`): Compares the same benchmarks across multiple runs. Output shows one row per (query, engine, format) combination, with columns for each run.
+
 ### `list` - List Benchmark Runs
 
 ```bash
@@ -149,10 +159,11 @@ Compare performance across different query engines:
 
 ```bash
 # Run all engines on the same data
+# Comparison table is displayed automatically after the run
 vx-bench run tpch -e datafusion,duckdb -f parquet -l engine-comparison
 
-# View results
-vx-bench show latest
+# Or compare within the run later
+vx-bench compare --run engine-comparison
 ```
 
 ### 4. Format Performance Analysis
@@ -167,10 +178,11 @@ vx-bench run tpch \
   -i 10 \
   -l format-analysis
 
-# Compare specific format pairs
-vx-bench compare \
-  --base "datafusion:parquet@format-analysis" \
-  --target "datafusion:vortex@format-analysis" \
+# Compare within the run (table shown automatically after run too)
+vx-bench compare --run format-analysis
+
+# Use a specific baseline
+vx-bench compare --run format-analysis --baseline datafusion:parquet
 ```
 
 ### 5. Memory Usage Analysis
@@ -246,33 +258,34 @@ vx-bench clean --older-than "30 days" --no-keep-labeled
 | duckdb | parquet, vortex, vortex-compact, duckdb |
 | lance | lance |
 
-## Target Reference Syntax
+## Output Format
 
-When using `--base` and `--target` options, use this format:
+Comparison results are displayed in a pivot table format:
 
+**Within-run comparison** (`--run`):
 ```
-engine:format@run
+┌───────┬──────────────────────┬────────────────────────┐
+│ Query │ duckdb:parquet (base)│ duckdb:vortex          │
+├───────┼──────────────────────┼────────────────────────┤
+│ 1     │ 100.5ms              │ 80.2ms (0.80x)         │
+│ 2     │ 200.1ms              │ 150.0ms (0.75x)        │
+└───────┴──────────────────────┴────────────────────────┘
 ```
 
-- `engine`: Engine name (`datafusion`, `duckdb`, `lance`) or `*` for wildcard
-- `format`: Format name (`parquet`, `vortex`, etc.) or `*` for wildcard
-- `run`: Run ID, label, or `latest`
-
-Examples:
-
-- `duckdb:parquet@latest` - DuckDB with Parquet from the latest run
-- `*:vortex@baseline` - All engines with Vortex from the "baseline" run
-- `datafusion:*@2025-01-15` - All formats with DataFusion from a specific run
-
-## Output Formats
-
-### Terminal Output
-
-Default output uses rich formatting with color-coded ratios:
+**Multi-run comparison** (`--runs`):
+```
+┌───────┬────────┬─────────┬──────────────┬──────────────────┐
+│ Query │ Engine │ Format  │ run1 (base)  │ run2             │
+├───────┼────────┼─────────┼──────────────┼──────────────────┤
+│ 1     │ duckdb │ parquet │ 100ms        │ 95ms (0.95x)     │
+│ 1     │ duckdb │ vortex  │ 80ms         │ 75ms (0.94x)     │
+└───────┴────────┴─────────┴──────────────┴──────────────────┘
+```
 
-- Green (with up arrow): Improvement (>10% faster)
-- Red (with down arrow): Regression (>10% slower)
-- Yellow: Neutral (within 10%)
+Ratios are color-coded:
+- **Green**: Improvement (>10% faster, ratio < 0.9)
+- **Red**: Regression (>10% slower, ratio > 1.1)
+- **Yellow**: Neutral (within 10%)
 
 ## Data Storage
 
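The pivot layout documented above maps naturally onto a pandas pivot table. A minimal sketch of the idea follows; the committed `analyzer.compare_within_run` lives in `comparison/analyzer.py`, which is not expanded in this view, so the column names `query`, `engine`, `format`, and `duration_ms` below are assumptions, not the actual schema.

```python
import pandas as pd


def within_run_pivot(df: pd.DataFrame, baseline: str) -> pd.DataFrame:
    """Sketch: one row per query, one column per engine:format combination.

    Assumes `df` has (hypothetical) columns query, engine, format, duration_ms.
    Returned values are ratios against the `baseline` column, e.g.
    "duckdb:parquet"; a ratio below 1.0 means faster than the baseline.
    """
    combo = df["engine"] + ":" + df["format"]
    pivot = df.assign(combo=combo).pivot_table(
        index="query", columns="combo", values="duration_ms", aggfunc="median"
    )
    # Divide every column by the baseline column, row by row.
    return pivot.div(pivot[baseline], axis=0)
```

With the sample numbers in the table above, query 1 for `duckdb:vortex` comes out at 80.2 / 100.5 ≈ 0.80, matching the `(0.80x)` shown.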

bench-orchestrator/bench_orchestrator/cli.py

Lines changed: 96 additions & 70 deletions
@@ -6,12 +6,13 @@
 from datetime import datetime, timedelta
 from typing import Annotated
 
+import pandas as pd
 import typer
 from rich.console import Console
 from rich.table import Table
 
-from .comparison.analyzer import BenchmarkAnalyzer, TargetRef
-from .comparison.reporter import BenchmarkReporter
+from .comparison import analyzer
+from .comparison.reporter import pivot_comparison_table
 from .config import (
     ENGINE_FORMATS,
     Benchmark,
@@ -49,6 +50,10 @@ def parse_queries(value: str | None) -> list[int] | None:
     return [int(q.strip()) for q in value.split(",")]
 
 
+def run_ref_auto_complete() -> list[str]:
+    return list(map(lambda x: x.run_id, ResultStore().list_runs(limit=None)))
+
+
 @app.command()
 def run(
     benchmark: Annotated[Benchmark, typer.Argument(help="Benchmark suite to run")],
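The new `run_ref_auto_complete` helper feeds typer's shell completion with every stored run ID. Typer also passes the partially typed value to completion callbacks that declare a string parameter, so a prefix-filtered variant is possible. A hedged sketch of that variant, not part of this commit, reusing `ResultStore.list_runs` and `run_id` exactly as above:

```python
# Illustrative variant only: filter completion candidates by the text typed so far.
def run_ref_auto_complete_filtered(incomplete: str) -> list[str]:
    runs = ResultStore().list_runs(limit=None)
    return [r.run_id for r in runs if r.run_id.startswith(incomplete)]
```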
@@ -154,103 +159,124 @@ def run(
 
     console.print(f"\n[green]Results saved to run: {ctx.metadata.run_id}[/green]")
 
+    # Show comparison table if we have multiple engine:format combinations
+    df = store.load_results(ctx.metadata.run_id)
+    if not df.empty:
+        try:
+            pivot = analyzer.compare_within_run(df)
+            table = pivot_comparison_table(pivot)
+            console.print()
+            console.print(table)
+        except ValueError:
+            # Not enough combinations to compare
+            pass
+
 
 @app.command()
 def compare(
-    base: Annotated[
+    runs: Annotated[
         str | None,
-        typer.Option("--base", "-b", help="Base reference (engine:format@run)"),
+        typer.Option("--runs", "-r", help="Runs to compare (comma-separated, 2 or more)"),
     ] = None,
-    target: Annotated[
+    run: Annotated[
         str | None,
-        typer.Option("--target", "-t", help="Target reference (engine:format@run)"),
+        typer.Option("--run", help="Single run for within-run comparison", autocompletion=run_ref_auto_complete),
     ] = None,
-    runs: Annotated[
+    baseline: Annotated[
         str | None,
-        typer.Option("--runs", "-r", help="Two runs to compare (comma-separated)"),
+        typer.Option("--baseline", help="Baseline engine:format for within-run comparison"),
     ] = None,
     threshold: Annotated[float, typer.Option("--threshold", help="Significance threshold (default 10%)")] = 0.10,
+    filter_engine: Annotated[
+        str | None, typer.Option("--engine", help="Filter only for results that use a specific engine")
+    ] = None,
+    filter_format: Annotated[
+        str | None, typer.Option("--format", help="Filter only for results that use a specific file format")
+    ] = None,
 ) -> None:
     """Compare benchmark results."""
     store = ResultStore()
 
-    if runs:
-        # Compare two full runs
-        run_refs = [r.strip() for r in runs.split(",")]
-        if len(run_refs) != 2:
-            console.print("[red]--runs requires exactly two run references[/red]")
+    if run:
+        # Within-run comparison
+        run_meta = store.get_run(run)
+        if not run_meta:
+            console.print(f"[red]Run not found: {run}[/red]")
             raise typer.Exit(1)
 
-        base_run = store.get_run(run_refs[0])
-        target_run = store.get_run(run_refs[1])
+        df = store.load_results(run_meta.run_id)
 
-        if not base_run:
-            console.print(f"[red]Run not found: {run_refs[0]}[/red]")
+        if df.empty:
+            console.print("[red]No results found[/red]")
             raise typer.Exit(1)
-        if not target_run:
-            console.print(f"[red]Run not found: {run_refs[1]}[/red]")
-            raise typer.Exit(1)
-
-        base_df = store.load_results(base_run.run_id)
-        target_df = store.load_results(target_run.run_id)
 
-        base_label = base_run.label or base_run.run_id[:20]
-        target_label = target_run.label or target_run.run_id[:20]
+        # Parse baseline if provided
+        baseline_engine = None
+        baseline_format = None
+        if baseline:
+            if ":" in baseline:
+                baseline_engine, baseline_format = baseline.split(":", 1)
+            else:
+                console.print("[red]--baseline must be engine:format[/red]")
+                raise typer.Exit(1)
 
-    elif base and target:
-        # Compare specific configurations
-        base_ref = TargetRef.parse(base)
-        target_ref = TargetRef.parse(target)
+        try:
+            pivot = analyzer.compare_within_run(df, baseline_engine, baseline_format, filter_engine, filter_format)
+        except ValueError as e:
+            console.print(f"[red]{e}[/red]")
+            raise typer.Exit(1)
 
-        base_run = store.get_run(base_ref.run)
-        target_run = store.get_run(target_ref.run)
+        table = pivot_comparison_table(pivot, threshold)
+        console.print(table)
+        return
 
-        if not base_run:
-            console.print(f"[red]Run not found: {base_ref.run}[/red]")
-            raise typer.Exit(1)
-        if not target_run:
-            console.print(f"[red]Run not found: {target_ref.run}[/red]")
+    elif runs:
+        # Compare multiple runs (2 or more)
+        run_refs = [r.strip() for r in runs.split(",")]
+        if len(run_refs) < 2:
+            console.print("[red]--runs requires at least two run references[/red]")
             raise typer.Exit(1)
 
-        base_df = store.load_results(base_run.run_id)
-        target_df = store.load_results(target_run.run_id)
-
-        # Apply filters
-        base_analyzer = BenchmarkAnalyzer(base_df)
-        target_analyzer = BenchmarkAnalyzer(target_df)
+        # Load all runs
+        run_data: list[tuple[str, pd.DataFrame]] = []
+        for ref in run_refs:
+            run_meta = store.get_run(ref)
+            if not run_meta:
+                console.print(f"[red]Run not found: {ref}[/red]")
+                raise typer.Exit(1)
+            label = run_meta.label or run_meta.run_id[:16]
+            df = store.load_results(run_meta.run_id)
+            if df.empty:
+                console.print(f"[red]No results for run: {ref}[/red]")
+                raise typer.Exit(1)
+            run_data.append((label, df))
+
+        # Use baseline option if provided, otherwise first run
+        baseline_label = None
+        if baseline:
+            # Find matching label
+            for label, _ in run_data:
+                if baseline in label:
+                    baseline_label = label
+                    break
+            if baseline_label is None:
+                console.print(f"[red]Baseline not found: {baseline}[/red]")
+                raise typer.Exit(1)
 
-        base_df = base_analyzer.filter_by_ref(base_ref)
-        target_df = target_analyzer.filter_by_ref(target_ref)
+        try:
+            pivot = analyzer.compare_runs(run_data, baseline_label, filter_engine, filter_format)
+        except ValueError as e:
+            console.print(f"[red]{e}[/red]")
+            raise typer.Exit(1)
 
-        base_label = base
-        target_label = target
+        table = pivot_comparison_table(pivot, threshold, row_keys=["query", "engine", "format"])
+        console.print(table)
+        return
 
     else:
-        console.print("[red]Must specify either --runs or --base/--target[/red]")
-        raise typer.Exit(1)
-
-    if base_df.empty:
-        console.print("[red]No results found for base[/red]")
-        raise typer.Exit(1)
-    if target_df.empty:
-        console.print("[red]No results found for target[/red]")
+        console.print("[red]Must specify either --runs or --run[/red]")
         raise typer.Exit(1)
 
-    # Perform comparison
-    analyzer = BenchmarkAnalyzer(base_df)
-    comparison = analyzer.compare(base_df, target_df)
-    stats = analyzer.summary_stats(comparison)
-
-    reporter = BenchmarkReporter(comparison, stats, threshold)
-
-    table = reporter.to_rich_table(
-        title="Benchmark Comparison",
-        base_label=base_label,
-        target_label=target_label,
-    )
-    console.print(table)
-    reporter.print_summary()
-
 
 @app.command("list")
 def list_runs(
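Both branches hand the pivot to `pivot_comparison_table`, which comes from `.comparison.reporter`, one of the changed files not expanded here. A minimal sketch of how a pivot of ratios could be rendered with rich, following the color rules documented in the README; the function name, the ratio-only pivot shape, and the single `Query` index are assumptions:

```python
import pandas as pd
from rich.table import Table


def render_ratio_pivot(pivot: pd.DataFrame, threshold: float = 0.10) -> Table:
    """Sketch: color each ratio cell green/red/yellow around the threshold."""
    table = Table(title="Benchmark Comparison")
    table.add_column("Query")
    for col in pivot.columns:
        table.add_column(str(col))
    for query, row in pivot.iterrows():
        cells = []
        for ratio in row:
            if ratio < 1.0 - threshold:
                cells.append(f"[green]{ratio:.2f}x[/green]")    # >10% faster
            elif ratio > 1.0 + threshold:
                cells.append(f"[red]{ratio:.2f}x[/red]")        # >10% slower
            else:
                cells.append(f"[yellow]{ratio:.2f}x[/yellow]")  # within 10%
        table.add_row(str(query), *cells)
    return table
```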
@@ -305,7 +331,7 @@ def list_runs(
 
 @app.command()
 def show(
-    run_ref: Annotated[str, typer.Argument(help="Run ID, label, or 'latest'")],
+    run_ref: Annotated[str, typer.Argument(help="Run ID, label, or 'latest'", autocompletion=run_ref_auto_complete)],
 ) -> None:
     """Show details of a specific run."""
     store = ResultStore()
