SpiceBench uses a two-stage data pipeline: `spicebench generate` produces versioned raw archives, and the ETL pipeline reads those archives, rehydrates records, and ingests them into the System Under Test.
The main `spicebench` binary benchmarks ingestion and querying against a pre-generated archive. Archive download and extraction happen before the timed benchmark starts.
The data-generation crate (accessed via `spicebench generate`) produces versioned datasets locally and then packages them into a `.tar.zst` archive. The archive is either written to a local path or uploaded to S3.
Only the finalized archive is uploaded to S3. Individual Parquet files are never uploaded directly.
```
s3://{bucket}/{prefix}/{scenario}/{version}/
└── data.tar.zst    # Archive uploaded by the generator
```
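A minimal sketch of the key construction, assuming the layout above; the bucket, prefix, and version values here are illustrative, not real deployment names:

```python
def raw_archive_key(bucket: str, prefix: str, scenario: str, version: str) -> str:
    # Only the finalized data.tar.zst exists under the versioned prefix;
    # individual Parquet files are never uploaded directly.
    return f"s3://{bucket}/{prefix}/{scenario}/{version}/data.tar.zst"

print(raw_archive_key("my-benchmark-data", "raw", "tpch", "1.0"))
# → s3://my-benchmark-data/raw/tpch/1.0/data.tar.zst
```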
The archive contains the generated data in the following layout:
```
version.json
tables/{table_name}/batch-000000.parquet
tables/{table_name}/batch-000001.parquet
tables/{table_name}/batch-000002-part-000.parquet
tables/{table_name}/batch-000002-part-001.parquet
...
```
Abbreviated `version.json` example:

```json
{
  "version": "1.0",
  "scenario": "tpch",
  "scale_factor": 1.0,
  "num_steps": 10,
  "dataset_type": "tpch",
  "mutations": {
    "update_ratio": 0.0,
    "delete_ratio": 0.0
  }
}
```

The full file also includes per-table metadata used by ETL.
Per-table metadata is embedded in `version.json` and includes:

- Schema
- Primary key columns
- Time column
- Batch IDs
- Batch part counts
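A minimal sketch of reading the top-level fields from the abbreviated example above; the per-table metadata block is omitted here because its exact field names are not shown in this document:

```python
import json

# Parse the abbreviated version.json shown earlier in this document.
raw = """
{
  "version": "1.0",
  "scenario": "tpch",
  "scale_factor": 1.0,
  "num_steps": 10,
  "dataset_type": "tpch",
  "mutations": {"update_ratio": 0.0, "delete_ratio": 0.0}
}
"""
meta = json.loads(raw)
print(meta["scenario"], meta["num_steps"])  # tpch 10
```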
| Dataset | Type | Description |
|---|---|---|
| TPC-H | `tpch` | 8 standard TPC-H benchmark tables |
| Simple Sequence | `simple_sequence` | Simple integer sequence tables for testing |
| Table | Primary Key |
|---|---|
| `customer` | `c_custkey` |
| `lineitem` | `l_orderkey`, `l_linenumber` |
| `nation` | `n_nationkey` |
| `orders` | `o_orderkey` |
| `part` | `p_partkey` |
| `partsupp` | `ps_partkey`, `ps_suppkey` |
| `region` | `r_regionkey` |
| `supplier` | `s_suppkey` |
The ETL pipeline understands three raw operation codes:
| Operation | Internal Column | Description |
|---|---|---|
| Create | `__op = "c"` | New row insertion |
| Update | `__op = "u"` | Modify existing row (tracked by primary key) |
| Delete | `__op = "d"` | Remove existing row (tracked by primary key) |
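An illustrative sketch of partitioning a raw batch by the three operation codes; rows here are plain dicts standing in for Parquet records, not the pipeline's actual types:

```python
# Group each record by its __op marker so creates, updates, and deletes
# can be handled separately downstream.
def split_by_op(rows: list[dict]) -> dict[str, list[dict]]:
    buckets: dict[str, list[dict]] = {"c": [], "u": [], "d": []}
    for row in rows:
        buckets[row["__op"]].append(row)
    return buckets

batch = [
    {"__op": "c", "c_custkey": 1, "c_name": "alice"},
    {"__op": "u", "c_custkey": 1, "c_name": "alicia"},
    {"__op": "d", "c_custkey": 2},
]
parts = split_by_op(batch)
print(len(parts["c"]), len(parts["u"]), len(parts["d"]))  # 1 1 1
```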
The current `spicebench generate` CLI emits create-only batches and records zero mutation ratios in `version.json` by default.

```shell
spicebench generate \
  --scale-factor 1 \
  --bucket my-benchmark-data \
  --region us-west-2 \
  --prefix raw \
  --num-steps 10
```

To write a local archive instead of uploading to S3, use `--output-archive ./tpch-sf1.tar.zst`.
Pre-generated TPC-H datasets are maintained on MinIO (`spicebench` bucket, `us-east-1`) for benchmark runs. Two dataset families are available, both generated via the `data_generation_run.yml` GitHub Actions workflow.
Standard TPC-H data with inserts only — no updates or deletes. Currently used as the default for benchmark runs.
| S3 Path | Scale Factor | Steps | Checkpoint Interval |
|---|---|---|---|
| `data-gen/tpch/0.01` | 0.01 | 20 | 10 |
| `data-gen/tpch/0.1` | 0.1 | 20 | 10 |
| `data-gen/tpch/1.0` | 1.0 | 20 | 10 |
| `data-gen/tpch/10.0` | 10.0 | 20 | 10 |
TPC-H data with insert, update, and delete operations mixed into each step. This dataset will become the default once mutation support is fully validated, replacing the insert-only dataset above.
| S3 Path | Scale Factor | Steps | Checkpoint Interval | Update Ratio | Delete Ratio |
|---|---|---|---|---|---|
| `data-gen-mut-v6/tpch/1.0` | 1.0 | 20 | 5 | 0.1 | 0.05 |
The ETL pipeline reads a generated archive, processes raw batches, and writes to a configurable sink.
- Read raw Parquet batches from the extracted archive
- Rehydrate records and append the time column
- Split rows by operation type (`__op`)
- Append `__created_at` for freshness tracking
- Strip internal columns (`__op`, `__key_*`)
- Write the resulting batches to the configured sink
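The last two transform steps can be sketched as follows. The column names come from this document; the dict-based logic is a hedged illustration, not the pipeline's actual implementation:

```python
# Append __created_at for freshness tracking, then strip the internal
# __op / __key_* columns before rows reach the sink.
def prepare_for_sink(rows: list[dict], created_at: int) -> list[dict]:
    out = []
    for row in rows:
        clean = {k: v for k, v in row.items()
                 if k != "__op" and not k.startswith("__key_")}
        clean["__created_at"] = created_at
        out.append(clean)
    return out

rows = [{"__op": "c", "__key_0": 7, "l_orderkey": 7, "l_linenumber": 1}]
print(prepare_for_sink(rows, 1700000000))
# → [{'l_orderkey': 7, 'l_linenumber': 1, '__created_at': 1700000000}]
```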
Instead of downloading from S3, standalone ETL can read a local archive directly:
```shell
spicebench etl \
  --scenario tpch \
  --scale-factor 1 \
  --archive-file ./tpch-sf1.tar.zst \
  --sink null
```

The `adbc` sink writes directly to the SUT via ADBC bulk ingest.
```shell
spicebench etl \
  --scenario tpch \
  --scale-factor 1 \
  --bucket my-data \
  --prefix raw \
  --sink adbc \
  --adbc-driver flightsql \
  --adbc-uri "grpcs://my-platform.example.com:443" \
  --adbc-option username="" \
  --adbc-option password="$API_KEY" \
  --adbc-create-tables
```

When using FlightSQL, ETL automatically sets `adbc.flight.sql.client_option.with_max_msg_size` to 78643200 (75 MiB) unless you explicitly override that option with `--adbc-option`.
Databricks example:

```shell
spicebench etl \
  --scenario tpch \
  --scale-factor 1 \
  --bucket my-data \
  --prefix raw \
  --sink adbc \
  --adbc-driver databricks \
  --adbc-uri "databricks://token:${DATABRICKS_TOKEN}@${DATABRICKS_ENDPOINT}:443/${DATABRICKS_HTTP_PATH}" \
  --adbc-catalog main \
  --adbc-schema tpch \
  --adbc-create-tables
```

The `null` sink discards all writes. It is useful for measuring source and ETL throughput without sink overhead.
```shell
spicebench etl \
  --scenario tpch \
  --scale-factor 1 \
  --bucket my-data \
  --prefix raw \
  --sink null
```

The ETL pipeline transitions through these states:

```
NotStarted -> Initialized -> Running -> Paused -> Running -> ... -> Stopped
```
| State | Description |
|---|---|
| `NotStarted` | Pipeline created but not initialized |
| `Initialized` | Storage connected and metadata loaded |
| `Running` | Actively processing batches |
| `Paused` | Temporarily paused for checkpoint validation |
| `Stopped(Completed)` | All batches processed successfully |
| `Stopped(Cancelled)` | Pipeline cancelled by user or system |
| `Stopped(Error)` | Pipeline stopped due to an error |
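The state machine above can be sketched as a transition table. This follows the diagram, with the three `Stopped(...)` variants collapsed into one terminal `Stopped` state; it is an illustration, not the pipeline's actual code:

```python
# Legal transitions taken from the state diagram; Stopped is terminal.
ALLOWED: dict[str, set[str]] = {
    "NotStarted": {"Initialized"},
    "Initialized": {"Running"},
    "Running": {"Paused", "Stopped"},
    "Paused": {"Running", "Stopped"},
    "Stopped": set(),
}

def can_transition(src: str, dst: str) -> bool:
    return dst in ALLOWED.get(src, set())

print(can_transition("Running", "Paused"))     # True
print(can_transition("NotStarted", "Running")) # False
```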
In the current main benchmark path:

- SpiceBench downloads and extracts the data archive before timed execution
- SpiceBench calls adapter `setup` and prepares the ADBC query path
- The timed benchmark starts, then ETL runs concurrently with query execution
- At checkpoint boundaries, ETL can pause for result validation when `--validate-results` is enabled and checkpoints are available
- After ETL completes, SpiceBench stops the benchmark and then calls adapter `teardown`
The ETL sink type is selected with `--etl-sink`:

- `adbc`: direct ADBC ingest. The adapter's `setup` response provides write-side ADBC config
The `spicebench checkpoint` subcommand captures expected query results at specific ETL steps so benchmark runs can validate correctness while ingestion is active.
- Generate checkpoints by replaying ETL into DuckDB
- Execute the scenario's query workload at configured checkpoint intervals
- Write each checkpoint result set as Parquet
- Upload checkpoint files and a manifest to S3
- During benchmark runs, pause ETL at checkpoint boundaries and compare live results against the stored checkpoint data
```
s3://{bucket}/{prefix}/
├── checkpoints.json
└── checkpoints/
    └── {scenario}/
        └── {checkpoint_idx}/
            ├── {query_idx_0}.parquet
            ├── {query_idx_1}.parquet
            └── ...
```
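A hypothetical helper expanding a manifest into the Parquet object keys under this layout; the field names follow the `checkpoints.json` example in this document, and the helper itself is not a SpiceBench API:

```python
# Enumerate every checkpoint result file implied by the manifest:
# one Parquet object per (scenario, checkpoint index, query index).
def checkpoint_keys(prefix: str, manifest: dict) -> list[str]:
    keys = []
    for scenario, meta in manifest["scenarios"].items():
        for ckpt in meta["checkpoint_indexes"]:
            for q in meta["query_indexes"]:
                keys.append(f"{prefix}/checkpoints/{scenario}/{ckpt}/{q}.parquet")
    return keys

manifest = {"scenarios": {"tpch": {"checkpoint_indexes": [5],
                                   "query_indexes": [0, 1]}}}
print(checkpoint_keys("bench", manifest))
# → ['bench/checkpoints/tpch/5/0.parquet', 'bench/checkpoints/tpch/5/1.parquet']
```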
The `checkpoints.json` manifest looks like:

```json
{
  "scenarios": {
    "tpch": {
      "checkpoint_indexes": [5, 10],
      "query_indexes": [0, 1, 2, 3, 4],
      "checkpoint_interval_steps": 5
    }
  }
}
```

Enable checkpoint validation with `--validate-results`:
```shell
spicebench run \
  --scenario tpch \
  --system-adapter-name myplatform \
  --system-adapter-http-url http://127.0.0.1:8080/jsonrpc \
  --validate-results
```

During the benchmark, when ETL reaches a checkpoint step:
- ETL pauses
- SpiceBench runs the scenario workload against the current system state
- Results are compared against stored checkpoint output
- ETL resumes if validation succeeds
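The four steps above amount to a pause/compare/resume loop, sketched here with stand-in types; `Etl` and the `run_workload` callable are illustrative, not SpiceBench internals:

```python
# Minimal model of checkpoint validation: pause ingestion, run the workload,
# compare against stored results, and resume only on success.
class Etl:
    def __init__(self) -> None:
        self.paused = False
    def pause(self) -> None:
        self.paused = True
    def resume(self) -> None:
        self.paused = False

def validate_at_checkpoint(etl: Etl, run_workload, stored_results) -> bool:
    etl.pause()                  # 1. ETL pauses
    live = run_workload()        # 2. run scenario workload on current state
    ok = live == stored_results  # 3. compare against stored checkpoint output
    if ok:
        etl.resume()             # 4. resume only if validation succeeds
    return ok

etl = Etl()
print(validate_at_checkpoint(etl, lambda: [42], [42]))  # True
print(etl.paused)                                       # False
```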