SpiceBench is an open-source benchmark for data and AI platforms. It measures the full operational data lifecycle - ingestion, acceleration, and query serving - under the conditions AI applications and agents actually face. Unlike static benchmarks (ClickBench, TPC-H) that run queries against pre-created datasets, SpiceBench exercises concurrent ingestion, acceleration/materialization, and query execution over a pre-generated data archive, capturing the real tension between ingestion throughput, materialization freshness, and query latency.
┌──────────────────────────────────────────────────────────────┐
│ GitHub Actions / CI │
│ (schedule, manual dispatch, or PR trigger) │
└──────────────────┬───────────────────────────────────────────┘
│ starts
▼
┌──────────────────────────────────────────────────────────────┐
│ SpiceBench Run │
│ │
│ ┌──────────┐ ┌────────────────┐ ┌──────────────────┐ │
│ │ 1. Setup │──▶│ 2. Benchmark │──▶│ 3. Teardown │ │
│ │ (JSON-RPC│ │ (timed) │ │ (JSON-RPC) │ │
│ │ adapter)│ │ │ │ │ │
│ └──────────┘ │ concurrent ETL │ └──────────────────┘ │
│ │ + query load │ │
│ └────────────────┘ │
└──────────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌─────────────────┐ ┌──────────────────┐
│ System Under │ │ ETL Pipeline │ │ Metrics (OTel) │
│ Test │ │ (S3 → SUT) │ │ → telemetry │
│ (via ADBC) │ │ │ │ .spiceai.io │
└──────────────┘ └─────────────────┘ └──────────────────┘
A Run is a single end-to-end execution of the benchmark targeting one system. Every Run proceeds through three sequential phases:
SpiceBench connects to a system adapter via JSON-RPC 2.0 (over stdio or HTTP) and calls:
`setup(run_id, metadata, datasets, etl_sink_type)` - Returns ADBC driver configuration (driver name + connection kwargs) for query execution, and can optionally provision the System Under Test (SUT) or create/register benchmark tables.
The adapter's `setup` response tells SpiceBench which ADBC driver to use and how to connect. For manually prepared systems, this phase can be limited to returning connection details.
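For illustration, a `setup` response carrying the ADBC driver configuration might look like the following. The field names inside `result` are hypothetical, not the normative protocol schema:

```python
import json

# Hypothetical setup() result for a Postgres-backed SUT. The driver name and
# option keys are illustrative, not taken from the SpiceBench protocol spec.
setup_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        # Which ADBC driver SpiceBench should load for the query path
        "adbc_driver": "adbc_driver_postgresql",
        # Connection kwargs passed through to the ADBC driver
        "adbc_options": {
            "uri": "postgresql://bench:bench@localhost:5432/spicebench",
        },
    },
}

print(json.dumps(setup_response, indent=2))
```

A manually prepared system would return only this connection block and skip any provisioning work.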
The current main benchmark path runs a single timed stage. SpiceBench starts the query workload, starts ETL, and keeps both running until the pipeline completes, fails, or is cancelled.
Included in the timer:
- concurrent query execution
- ETL processing and sink writes
- checkpoint pause and validation windows when `--validate-results` is enabled
- final shutdown after ETL completion
Excluded from the timer:
- archive download and extraction
- adapter `setup` and `teardown`
The current main binary exports p99 latency as telemetry for comparison across runs, but it does not run a separate baseline stage or a baseline-regression fail gate.
SpiceBench calls teardown(run_id) on the adapter so it can optionally deprovision resources, drop tables, or perform any final cleanup.
Teardown always runs, even if the benchmark phase encounters errors, and adapters may implement it as a no-op when cleanup is handled externally or artifacts should be retained.
┌─────────────────┐ ┌────────────────┐ ┌──────────────────┐
│ spicebench │────▶│ S3 (archive) │────▶│ ETL Pipeline │
│ generate │ │ .tar.zst │ │ download + │
│ │ │ │ │ extract + │
└─────────────────┘ └────────────────┘ │ rehydrate │
└────────┬─────────┘
│
┌──────────┬──────────┐
▼ ▼
┌──────────┐ ┌──────────┐
│ ADBC │ │ Null │
│ Bulk │ │ Sink │
│ Ingest │ │ (/dev/ │
│ │ │ null) │
└──────────┘ └──────────┘
The spicebench generate subcommand produces versioned datasets and writes the resulting archive to S3 or a local file. It supports:
- Configurable scale factors (SF1, SF10, SF100, etc.)
- Multi-step generation for simulating streaming data arrival
- Version metadata (`version.json`) for downstream ETL
The shipped `spicebench generate` subcommand currently emits create-only batches and records zero mutation ratios in `version.json`.
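As a sketch of the metadata the ETL stage consumes, a create-only archive's `version.json` might record something like this. The field names are hypothetical; the generated archive defines the authoritative schema:

```python
import json

# Hypothetical version.json contents for a create-only SF1 archive.
# Field names are illustrative only.
version = {
    "scale_factor": 1,
    "steps": 10,                 # multi-step generation for streaming arrival
    "mutation_ratios": {         # shipped generate path is create-only today
        "create": 1.0,
        "update": 0.0,
        "delete": 0.0,
    },
}

print(json.dumps(version, indent=2))
```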
The ETL pipeline downloads the archive from S3, extracts it locally, and processes the raw batches:
- Download the `.tar.zst` archive from S3 and extract it locally
- Read raw Parquet batches from the extracted archive
- Rehydrate records (restore from columnar + apply mutations)
- Split by operation type (create/update/delete)
- Append `__created_at` timestamps for freshness tracking
- Strip internal columns (`__op`, `__key_*`)
- Write to the configured sink
See Data Generation & ETL for details.
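The per-batch steps can be sketched in pure Python, standing in for the real Parquet-based pipeline (record shapes here are illustrative):

```python
import time

def process_batch(records):
    """Sketch of the per-batch ETL steps: split by operation type, stamp
    __created_at for freshness tracking, and strip the internal __op and
    __key_* columns before the sink write. Record shapes are illustrative."""
    out = {"create": [], "update": [], "delete": []}
    now = time.time()
    for rec in records:
        op = rec["__op"]                      # operation type from the archive
        row = {k: v for k, v in rec.items()
               if k != "__op" and not k.startswith("__key_")}
        row["__created_at"] = now             # freshness-tracking timestamp
        out[op].append(row)
    return out

batches = process_batch([
    {"__op": "create", "__key_id": 1, "id": 1, "name": "a"},
    {"__op": "delete", "__key_id": 2, "id": 2, "name": "b"},
])
```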
The current benchmark path uses the ADBC driver returned by the adapter's setup() response to execute queries directly against the SUT. Adapter transport is still JSON-RPC over stdio or HTTP, but the benchmark data plane is ADBC-based.
SpiceBench currently measures ingestion-to-query behavior. A planned extension is an ingestion-to-prompt/RAG benchmark pipeline that adds evaluation stages above SQL execution:
- Text-to-SQL stage
  - natural language request → SQL generation
  - evaluate generation validity, execution success, and result correctness
- Search & retrieval stage
  - keyword/vector/hybrid retrieval over continuously ingested data
  - evaluate retrieval quality (`recall@k`, `nDCG`) and retrieval latency
- Context engineering stage
  - chunk/rank/assemble context for model prompts
  - evaluate context quality, citation grounding, and token efficiency
- End-to-end AI freshness stage
  - measure time from source event creation to retrievable context and answer inclusion
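Of the retrieval-quality metrics named above, `recall@k` has the simplest definition; the sketch below is the standard information-retrieval formulation, not SpiceBench code:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k retrieved
    results. Standard IR definition; not SpiceBench-specific code."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

# Two of the three relevant docs appear in the top 3
score = recall_at_k(["d1", "d4", "d2", "d9"], ["d1", "d2", "d3"], k=3)
```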
This extends SpiceBench from an operational SQL benchmark into an AI-native data benchmark for application and agent workloads.
┌────────────────────────────────────────────────────────────┐
│ SpiceBench Process │
│ │
│ ┌──────────────┐ ┌───────────────┐ ┌────────────────┐ │
│ │ Query Driver │ │ SUT Metrics │ │ Health Monitor │ │
│ │ (per-query │ │ Scraper │ │ (/health, │ │
│ │ stats) │ │ (every 5s) │ │ /v1/ready) │ │
│ └──────┬───────┘ └───────┬───────┘ └───────┬────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ OpenTelemetry SDK │ │
│ │ (17+ instruments: gauges, counters, histograms) │ │
│ └──────────┬──────────────────────────┬───────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────┐ ┌───────────────────────┐ │
│ │ OtelArrowExporter│ │ StreamingOtlpExporter │ │
│ │ (Arrow Flight) │ │ (OTLP, every 5s) │ │
│ └────────┬─────────┘ └──────────┬────────────┘ │
└───────────┼────────────────────────────┼──────────────────┘
│ │
▼ ▼
telemetry.spiceai.io --otlp-endpoint
(Arrow Flight ingest) (custom OTLP collector)
│
▼
SpiceBench.com
(leaderboard + details)
Metrics are collected from three sources:
- Query driver - Per-query latency statistics (median, min, max, p99), iteration counts, pass/fail status
- SUT metrics scraper - Resource usage (CPU, memory, disk I/O, IOPS) and ingestion progress (rows, bytes, throughput) obtained by periodically calling the adapter's `metrics()` JSON-RPC method
- Health monitor - Endpoint latency for `/health` and `/v1/ready` probes
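The query-driver statistics are plain order statistics over raw latency samples; here is a nearest-rank p99 sketch (the exact estimator SpiceBench uses is an assumption):

```python
def percentile(samples, p):
    """Nearest-rank percentile over raw latency samples (in ms).
    SpiceBench's exact estimator may differ; this is a sketch."""
    ordered = sorted(samples)
    # nearest-rank: the ceil(p/100 * n)-th value, 1-indexed
    rank = max(1, -(-len(ordered) * p // 100))
    return ordered[int(rank) - 1]

# One slow outlier dominates the tail statistic
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 12, 15, 13]
p99 = percentile(latencies_ms, 99)
median = percentile(latencies_ms, 50)
```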
See Metrics & Telemetry for the full instrument list.
The system adapter protocol is a JSON-RPC 2.0 interface that decouples SpiceBench from any specific data platform. The core runtime methods are:
| Method | Purpose |
|---|---|
| `setup(run_id, metadata, datasets, etl_sink_type)` | Provision the SUT, create tables, return ADBC driver config |
| `teardown(run_id)` | Deprovision resources |
| `metrics(run_id)` | Return resource usage and ingestion stats (optional) |
| `rpc.methods` | Report supported JSON-RPC methods |
Adapters communicate over stdio (SpiceBench spawns the adapter as a child process) or HTTP (SpiceBench connects to a running adapter server).
See System Adapters for the full protocol specification.
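To make the stdio transport concrete, here is a minimal adapter message loop in Python. The handlers and connection details are hypothetical; the protocol specification defines the required behavior and error handling:

```python
import json
import sys

# Hypothetical minimal adapter: one JSON-RPC 2.0 request per line on stdin,
# one response per line on stdout. Handler bodies are illustrative.
HANDLERS = {
    "rpc.methods": lambda params: ["setup", "teardown", "rpc.methods"],
    "setup": lambda params: {
        "adbc_driver": "adbc_driver_postgresql",
        "adbc_options": {"uri": "postgresql://localhost:5432/bench"},
    },
    "teardown": lambda params: None,   # cleanup handled externally (no-op)
}

def handle(line):
    req = json.loads(line)
    result = HANDLERS[req["method"]](req.get("params"))
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})

def main():
    for line in sys.stdin:
        if line.strip():
            print(handle(line), flush=True)
```

SpiceBench would spawn this as a child process and drive it over stdin/stdout; an HTTP adapter exposes the same methods behind a server endpoint instead.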
The CLI still accepts `--system-adapter-execution-mode`, but the current main `spicebench` binary does not branch on it yet.
In the shipped benchmark path, SpiceBench:
- calls adapter `setup`
- builds the ADBC query path from the returned driver configuration
- runs ETL and query execution itself
- optionally scrapes adapter `metrics`
- calls adapter `teardown`
Custom adapters can still expose additional RPC methods for their own workflows, but those are outside the core benchmark path currently driven by spicebench.
SpiceBench supports checkpoint-based result validation to verify query correctness during active data ingestion:
- The `spicebench checkpoint` subcommand pre-computes expected query results at specific ETL steps and stores them as Parquet files in S3
- During a benchmark run, when the ETL pipeline reaches a checkpoint step, it pauses ingestion
- SpiceBench runs the scenario workload and compares results against the stored expected results
- After validation, ETL resumes
This ensures the SUT returns correct results under concurrent read/write load.
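The comparison step can be sketched on plain row lists instead of Parquet; treating results as an unordered multiset is an assumption about the comparison semantics here:

```python
from collections import Counter

def results_match(actual_rows, expected_rows):
    """Order-insensitive multiset comparison of query results.
    Whether SpiceBench compares ordered or unordered rows is an
    assumption in this sketch."""
    return Counter(map(tuple, actual_rows)) == Counter(map(tuple, expected_rows))

expected = [(1, "a"), (2, "b")]
ok = results_match([(2, "b"), (1, "a")], expected)       # same rows, reordered
bad = results_match([(1, "a"), (2, "stale")], expected)  # stale value detected
```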
spicebench (binary)
├── test-framework Core benchmark engine
├── system-adapter-protocol JSON-RPC client/server
├── adbc_client ADBC connection pooling
├── flight_client Arrow Flight client
├── telemetry OTel metrics + export
│ └── otel-arrow OTel → Arrow conversion
├── etl ETL pipeline + sinks
│ └── data-generation Dataset generation
├── checkpointer Checkpoint capture
└── util Shared utilities
See Crate Reference for per-crate API details.