A Python CLI tool for orchestrating Vortex benchmark runs, storing results, and comparing performance across different engines and formats.
Install the orchestrator with uv:

```sh
uv tool install "bench_orchestrator @ ./bench-orchestrator/"
```

This installs the `vx-bench` command.
```sh
# Run TPC-H benchmarks with DataFusion and DuckDB
# A comparison table is automatically displayed after the run
vx-bench run tpch --engine datafusion,duckdb --format parquet,vortex

# List recent benchmark runs
vx-bench list

# Compare engine:format combinations within a single run
vx-bench compare --run latest

# Compare multiple runs (2 or more)
vx-bench compare --runs run1,run2,run3
```

Run benchmark suites across multiple engines and formats. After completion, a comparison table is automatically displayed if there are multiple engine:format combinations.
```sh
vx-bench run <benchmark> [options]
```

Arguments:
- `benchmark`: Benchmark suite to run (`tpch`, `tpcds`, `clickbench`, `fineweb`, `gh-archive`, `public-bi`, `statpopgen`)

Options:
- `--engine, -e`: Engines to benchmark, comma-separated (default: `datafusion,duckdb`)
- `--format, -f`: Formats to benchmark, comma-separated (default: `parquet,vortex`)
- `--queries, -q`: Specific queries to run (e.g., `1,2,5`)
- `--exclude-queries`: Queries to skip
- `--iterations, -i`: Iterations per query (default: 5)
- `--label, -l`: Label for this run (useful for later reference)
- `--track-memory`: Enable memory usage tracking
- `--build/--no-build`: Build binaries before running (default: build)
Compare benchmark results within a run or across multiple runs. Results are displayed in a pivot table format.
```sh
vx-bench compare [options]
```

Options:
- `--run`: Single run for within-run comparison (compares different engine:format combinations)
- `--runs, -r`: Multiple runs to compare, comma-separated (2 or more)
- `--baseline`: Baseline for comparison (engine:format for within-run, or run label for multi-run)
- `--engine`: Filter results to a specific engine
- `--format`: Filter results to a specific format
- `--threshold`: Significance threshold (default: 0.10 = 10%)
Within-run comparison (--run): Compares different engine:format combinations within a single run. Output shows one row per query, with columns for each engine:format combo.
Multi-run comparison (--runs): Compares the same benchmarks across multiple runs. Output shows one row per (query, engine, format) combination, with columns for each run.
```sh
vx-bench list [options]
```

Options:
- `--benchmark, -b`: Filter by benchmark suite
- `--since`: Time filter (e.g., `7 days`, `2 weeks`)
- `--limit, -n`: Maximum runs to show (default: 20)
```sh
vx-bench show <run-ref>
```

Arguments:
- `run-ref`: Run ID, label, or `latest`
Build benchmark binaries without running benchmarks.
```sh
vx-bench build [options]
```

Options:
- `--engine, -e`: Engines to build (default: all)
```sh
vx-bench clean --older-than "30 days" [options]
```

Options:
- `--older-than`: Delete runs older than this age (required)
- `--keep-labeled`: Don't delete labeled runs (default: true)
- `--dry-run, -n`: Show what would be deleted without deleting anything
Run benchmarks on your current branch and compare against a baseline:
```sh
# First, run benchmarks on your baseline (e.g., main branch)
git checkout main
vx-bench run tpch -e datafusion -f parquet,vortex -l baseline

# Switch to your feature branch and run again
git checkout feature/my-optimization
vx-bench run tpch -e datafusion -f parquet,vortex -l feature

# Compare the runs
vx-bench compare --runs baseline,feature
```

Run a subset of queries to quickly check for regressions:
```sh
# Run only queries 1, 6, and 12 (fast queries)
vx-bench run tpch -q 1,6,12 -i 3 -l quick-check

# Compare against the previous run
vx-bench compare --runs latest,<previous-run-id>
```

Compare performance across different query engines:
```sh
# Run all engines on the same data
# Comparison table is displayed automatically after the run
vx-bench run tpch -e datafusion,duckdb -f parquet -l engine-comparison

# Or compare within the run later
vx-bench compare --run engine-comparison
```

Analyze how different storage formats perform:
```sh
# Run a comprehensive format comparison
vx-bench run tpch \
  -e datafusion \
  -f parquet,vortex,vortex-compact \
  -i 10 \
  -l format-analysis

# Compare within the run (a table is also shown automatically after the run)
vx-bench compare --run format-analysis

# Use a specific baseline
vx-bench compare --run format-analysis --baseline datafusion:parquet
```

Track memory usage alongside performance:
```sh
vx-bench run tpch \
  -e datafusion \
  -f vortex \
  --track-memory \
  -l memory-profiling

vx-bench show memory-profiling
```

Test performance at different data scales:
```sh
# Run at SF1
vx-bench run tpch -s 1 -l sf1

# Run at SF10
vx-bench run tpch -s 10 -l sf10

# Compare scaling behavior
vx-bench compare --runs sf1,sf10
```

Skip queries that are known to fail or take too long:
```sh
# Exclude queries 15 and 21 (complex queries)
vx-bench run tpch --exclude-queries 15,21 -l partial-run
```

Find runs from the past week and compare trends:
```sh
# List recent runs
vx-bench list --since "7 days" --benchmark tpch

# Compare two specific historical runs
vx-bench compare --runs <run-id-1>,<run-id-2>
```

Keep your results directory manageable:
```sh
# Preview what would be deleted
vx-bench clean --older-than "30 days" --dry-run

# Delete old runs but keep labeled ones
vx-bench clean --older-than "30 days" --keep-labeled

# Delete all old runs, including labeled ones
vx-bench clean --older-than "30 days" --no-keep-labeled
```

| Engine | Supported Formats |
|---|---|
| datafusion | parquet, vortex, vortex-compact, lance |
| duckdb | parquet, vortex, vortex-compact, duckdb |
| lance | lance |
Comparison results are displayed in a pivot table format:
Within-run comparison (--run):
```
┌───────┬──────────────────────┬────────────────────────┐
│ Query │ duckdb:parquet (base)│ duckdb:vortex          │
├───────┼──────────────────────┼────────────────────────┤
│ 1     │ 100.5ms              │ 80.2ms (0.80x)         │
│ 2     │ 200.1ms              │ 150.0ms (0.75x)        │
└───────┴──────────────────────┴────────────────────────┘
```
Multi-run comparison (--runs):
```
┌───────┬────────┬─────────┬──────────────┬──────────────────┐
│ Query │ Engine │ Format  │ run1 (base)  │ run2             │
├───────┼────────┼─────────┼──────────────┼──────────────────┤
│ 1     │ duckdb │ parquet │ 100ms        │ 95ms (0.95x)     │
│ 1     │ duckdb │ vortex  │ 80ms         │ 75ms (0.94x)     │
└───────┴────────┴─────────┴──────────────┴──────────────────┘
```
Ratios are color-coded:
- Green: Improvement (>10% faster, ratio < 0.9)
- Red: Regression (>10% slower, ratio > 1.1)
- Yellow: Neutral (within 10%)
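The thresholding above can be sketched as a small function. This is illustrative only, not the tool's actual code; `classify` is a hypothetical name, and the default mirrors `--threshold 0.10`:

```python
def classify(ratio: float, threshold: float = 0.10) -> str:
    """Classify a timing ratio (candidate / baseline) against a threshold."""
    if ratio < 1.0 - threshold:
        return "improvement"   # rendered green
    if ratio > 1.0 + threshold:
        return "regression"    # rendered red
    return "neutral"           # rendered yellow

print(classify(0.80))  # improvement
print(classify(1.25))  # regression
print(classify(1.05))  # neutral
```

A ratio of exactly 0.90 or 1.10 falls in the neutral band, matching the strict `< 0.9` / `> 1.1` cutoffs listed above.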
Results are stored in `<workspace>/target/vortex-bench/runs/`. Each run creates a directory containing:
- `metadata.json`: Run configuration and environment info
- `results.jsonl`: Raw benchmark results (JSON Lines format)
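Because `results.jsonl` is plain JSON Lines, it is easy to post-process outside the CLI. A sketch, assuming hypothetical field names (`query`, `engine`, `format`, `duration_ms`) since the actual record schema isn't documented here; it builds a sample file and averages timings per (query, engine, format):

```python
import json
import statistics
import tempfile
from collections import defaultdict
from pathlib import Path

# Sample records standing in for a real run directory's results.jsonl.
# Field names here are assumptions, not the tool's documented schema.
sample = [
    {"query": 1, "engine": "duckdb", "format": "parquet", "duration_ms": 100.5},
    {"query": 1, "engine": "duckdb", "format": "parquet", "duration_ms": 99.5},
]

run_dir = Path(tempfile.mkdtemp())
path = run_dir / "results.jsonl"
path.write_text("\n".join(json.dumps(r) for r in sample))

# Group iteration timings by (query, engine, format), then report the mean.
timings = defaultdict(list)
for line in path.read_text().splitlines():
    r = json.loads(line)
    timings[(r["query"], r["engine"], r["format"])].append(r["duration_ms"])

for key, values in timings.items():
    print(key, statistics.mean(values))  # (1, 'duckdb', 'parquet') 100.0
```

The same grouping works for any per-iteration field, e.g. memory figures when `--track-memory` was enabled.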
Benchmarks are built with:
- Profile: `release_debug`
- RUSTFLAGS: `-C target-cpu=native -C force-frame-pointers=yes`
This enables native CPU optimizations while preserving debug symbols for profiling.
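In Cargo terms, a custom profile like `release_debug` is typically declared in the workspace manifest. The following is only a guess at its shape, not the project's actual `Cargo.toml`:

```toml
# Hypothetical sketch of the release_debug profile; the real manifest may differ.
[profile.release_debug]
inherits = "release"   # release-level optimizations...
debug = true           # ...plus debug symbols for profilers
```

The RUSTFLAGS above are passed via the environment (or `.cargo/config.toml`) rather than the profile, since per-profile rustflags are not a stable Cargo feature.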