bench-orchestrator

A Python CLI tool for orchestrating Vortex benchmark runs, storing results, and comparing performance across different engines and formats.

Installation

Install the orchestrator as a uv tool:

uv tool install "bench_orchestrator @ ./bench-orchestrator/"

This installs the vx-bench command.

Quick Start

# Run TPC-H benchmarks with DataFusion and DuckDB
# A comparison table is automatically displayed after the run
vx-bench run tpch --engine datafusion,duckdb --format parquet,vortex

# List recent benchmark runs
vx-bench list

# Compare engine:format combinations within a single run
vx-bench compare --run latest

# Compare multiple runs (2 or more)
vx-bench compare --runs run1,run2,run3

Commands

run - Execute Benchmarks

Run benchmark suites across multiple engines and formats. After completion, a comparison table is automatically displayed if there are multiple engine:format combinations.

vx-bench run <benchmark> [options]

Arguments:

  • benchmark: Benchmark suite to run (tpch, tpcds, clickbench, fineweb, gh-archive, public-bi, statpopgen)

Options:

  • --engine, -e: Engines to benchmark, comma-separated (default: datafusion,duckdb)
  • --format, -f: Formats to benchmark, comma-separated (default: parquet,vortex)
  • --queries, -q: Specific queries to run (e.g., 1,2,5)
  • --exclude-queries: Queries to skip
  • --iterations, -i: Iterations per query (default: 5)
  • --label, -l: Label for this run (useful for later reference)
  • --track-memory: Enable memory usage tracking
  • --build/--no-build: Build binaries before running (default: build)

compare - Compare Results

Compare benchmark results within a run or across multiple runs. Results are displayed in a pivot table format.

vx-bench compare [options]

Options:

  • --run: Single run for within-run comparison (compares different engine:format combinations)
  • --runs, -r: Multiple runs to compare, comma-separated (2 or more)
  • --baseline: Baseline for comparison (engine:format for within-run, or run label for multi-run)
  • --engine: Filter results to a specific engine
  • --format: Filter results to a specific format
  • --threshold: Significance threshold (default: 0.10 = 10%)

Within-run comparison (--run): Compares different engine:format combinations within a single run. Output shows one row per query, with columns for each engine:format combo.

Multi-run comparison (--runs): Compares the same benchmarks across multiple runs. Output shows one row per (query, engine, format) combination, with columns for each run.

list - List Benchmark Runs

vx-bench list [options]

Options:

  • --benchmark, -b: Filter by benchmark suite
  • --since: Time filter (e.g., "7 days", "2 weeks")
  • --limit, -n: Maximum runs to show (default: 20)

show - Show Run Details

vx-bench show <run-ref>

Arguments:

  • run-ref: Run ID, label, or latest

build - Build Binaries

Build benchmark binaries without running benchmarks.

vx-bench build [options]

Options:

  • --engine, -e: Engines to build (default: all)

clean - Clean Old Results

vx-bench clean --older-than "30 days" [options]

Options:

  • --older-than: Delete runs older than the given age, e.g. "30 days" (required)
  • --keep-labeled: Don't delete labeled runs (default: true)
  • --dry-run, -n: Show what would be deleted

Example Workflows

1. Basic Performance Comparison

Run benchmarks on your current branch and compare against a baseline:

# First, run benchmarks on your baseline (e.g., main branch)
git checkout main
vx-bench run tpch -e datafusion -f parquet,vortex -l baseline

# Switch to your feature branch and run again
git checkout feature/my-optimization
vx-bench run tpch -e datafusion -f parquet,vortex -l feature

# Compare the runs
vx-bench compare --runs baseline,feature

2. Quick Regression Check

Run a subset of queries to quickly check for regressions:

# Run only queries 1, 6, and 12 (fast queries)
vx-bench run tpch -q 1,6,12 -i 3 -l quick-check

# Compare against previous run
vx-bench compare --runs latest,<previous-run-id>

3. Cross-Engine Comparison

Compare performance across different query engines:

# Run all engines on the same data
# Comparison table is displayed automatically after the run
vx-bench run tpch -e datafusion,duckdb -f parquet -l engine-comparison

# Or compare within the run later
vx-bench compare --run engine-comparison

4. Format Performance Analysis

Analyze how different storage formats perform:

# Run comprehensive format comparison
vx-bench run tpch \
  -e datafusion \
  -f parquet,vortex,vortex-compact \
  -i 10 \
  -l format-analysis

# Compare within the run (a table is also shown automatically after the run)
vx-bench compare --run format-analysis

# Use a specific baseline
vx-bench compare --run format-analysis --baseline datafusion:parquet

5. Memory Usage Analysis

Track memory usage alongside performance:

vx-bench run tpch \
  -e datafusion \
  -f vortex \
  --track-memory \
  -l memory-profiling

vx-bench show memory-profiling

6. Scale Factor Testing

Test performance at different data scales:

# Run at SF1
vx-bench run tpch -s 1 -l sf1

# Run at SF10
vx-bench run tpch -s 10 -l sf10

# Compare scaling behavior
vx-bench compare --runs sf1,sf10

7. Excluding Problematic Queries

Skip queries that are known to fail or take too long:

# Exclude queries 15 and 21 (complex queries)
vx-bench run tpch --exclude-queries 15,21 -l partial-run

8. Historical Analysis

Find runs from the past week and compare trends:

# List recent runs
vx-bench list --since "7 days" --benchmark tpch

# Compare two specific historical runs
vx-bench compare --runs <run-id-1>,<run-id-2>

9. Cleanup Old Results

Keep your results directory manageable:

# Preview what would be deleted
vx-bench clean --older-than "30 days" --dry-run

# Delete old runs but keep labeled ones
vx-bench clean --older-than "30 days" --keep-labeled

# Delete all old runs including labeled
vx-bench clean --older-than "30 days" --no-keep-labeled

Supported Engines and Formats

Engine        Supported Formats
datafusion    parquet, vortex, vortex-compact, lance
duckdb        parquet, vortex, vortex-compact, duckdb
lance         lance

Output Format

Comparison results are displayed in a pivot table format:

Within-run comparison (--run):

┌───────┬──────────────────────┬────────────────────────┐
│ Query │ duckdb:parquet (base)│ duckdb:vortex          │
├───────┼──────────────────────┼────────────────────────┤
│     1 │ 100.5ms              │ 80.2ms (0.80x)         │
│     2 │ 200.1ms              │ 150.0ms (0.75x)        │
└───────┴──────────────────────┴────────────────────────┘

Multi-run comparison (--runs):

┌───────┬────────┬─────────┬──────────────┬──────────────────┐
│ Query │ Engine │ Format  │ run1 (base)  │ run2             │
├───────┼────────┼─────────┼──────────────┼──────────────────┤
│     1 │ duckdb │ parquet │ 100ms        │ 95ms (0.95x)     │
│     1 │ duckdb │ vortex  │ 80ms         │ 75ms (0.94x)     │
└───────┴────────┴─────────┴──────────────┴──────────────────┘

Ratios are color-coded (boundaries shown for the default --threshold of 0.10):

  • Green: Improvement (>10% faster, ratio < 0.9)
  • Red: Regression (>10% slower, ratio > 1.1)
  • Yellow: Neutral (within 10%)
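
The classification rule can be sketched in a few lines. The classify helper below is ours for illustration, not part of vx-bench:

```shell
# Classify a timing ratio against a threshold (default 0.10 = 10%).
classify() {
  awk -v r="$1" -v t="${2:-0.10}" 'BEGIN {
    if (r < 1 - t)      print "improvement"
    else if (r > 1 + t) print "regression"
    else                print "neutral"
  }'
}

classify 0.80   # improvement: more than 10% faster
classify 1.05   # neutral: within the threshold
classify 1.25   # regression: more than 10% slower
```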

Data Storage

Results are stored in <workspace>/target/vortex-bench/runs/. Each run creates a directory containing:

  • metadata.json: Run configuration and environment info
  • results.jsonl: Raw benchmark results (JSON lines format)
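
Because results.jsonl is JSON lines, ordinary line-oriented tools work on it directly. A self-contained sketch with toy data; the field names in the sample rows are assumptions about the schema, not documented fields:

```shell
# Build a toy run directory mirroring the layout described above.
RUN_DIR=$(mktemp -d)
cat > "$RUN_DIR/results.jsonl" <<'EOF'
{"query": 1, "engine": "duckdb", "format": "parquet", "duration_ms": 100.5}
{"query": 1, "engine": "duckdb", "format": "vortex", "duration_ms": 80.2}
EOF

# One self-contained JSON object per line, so counting and filtering
# need nothing more than wc and grep:
wc -l < "$RUN_DIR/results.jsonl"
grep '"format": "vortex"' "$RUN_DIR/results.jsonl"
```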

Build Configuration

Benchmarks are built with:

  • Profile: release_debug
  • RUSTFLAGS: -C target-cpu=native -C force-frame-pointers=yes

This enables native CPU optimizations while preserving debug symbols for profiling.
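
Assuming a standard Cargo workspace with a release_debug profile defined, the build step presumably amounts to an invocation along these lines (a sketch, not the orchestrator's exact command):

```shell
# Build with native CPU optimizations and frame pointers for profiling.
RUSTFLAGS="-C target-cpu=native -C force-frame-pointers=yes" \
  cargo build --profile release_debug
```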