Data Generation & ETL

SpiceBench uses a two-stage data pipeline: spicebench generate produces versioned raw archives, and the ETL pipeline reads those archives, rehydrates records, and ingests them into the System Under Test (SUT).

The main spicebench binary benchmarks ingestion and querying against a pre-generated archive. Archive download and extraction happen before the timed benchmark starts.

Data Generation

The data-generation crate (accessed via spicebench generate) produces versioned datasets locally and then packages them into a .tar.zst archive. The archive is either written to a local path or uploaded to S3.

S3 Layout

Only the finalized archive is uploaded to S3. Individual Parquet files are never uploaded directly.

s3://{bucket}/{prefix}/{scenario}/{version}/
└── data.tar.zst                    # Archive uploaded by the generator
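The object key for an archive follows directly from the layout above. A minimal sketch in Python (the helper name is illustrative, not part of the tool):

```python
def archive_key(prefix: str, scenario: str, version: str) -> str:
    """Build the S3 object key for a generated archive,
    following the layout {prefix}/{scenario}/{version}/data.tar.zst."""
    return f"{prefix}/{scenario}/{version}/data.tar.zst"

key = archive_key("raw", "tpch", "1.0")
# -> "raw/tpch/1.0/data.tar.zst"
```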

Archive Contents

The archive contains the generated data in the following layout:

version.json
tables/{table_name}/batch-000000.parquet
tables/{table_name}/batch-000001.parquet
tables/{table_name}/batch-000002-part-000.parquet
tables/{table_name}/batch-000002-part-001.parquet
...

Version Metadata (version.json)

Abbreviated example:

{
    "version": "1.0",
    "scenario": "tpch",
    "scale_factor": 1.0,
    "num_steps": 10,
    "dataset_type": "tpch",
    "mutations": {
        "update_ratio": 0.0,
        "delete_ratio": 0.0
    }
}

The full file also includes per-table metadata used by ETL.
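A consumer can read the abbreviated fields above with plain JSON parsing. A sketch, using only the fields shown in the example (per-table metadata omitted):

```python
import json

# Parse the top-level version metadata written by the generator.
raw = """
{
    "version": "1.0",
    "scenario": "tpch",
    "scale_factor": 1.0,
    "num_steps": 10,
    "dataset_type": "tpch",
    "mutations": {"update_ratio": 0.0, "delete_ratio": 0.0}
}
"""
meta = json.loads(raw)

# Zero mutation ratios indicate an insert-only dataset.
insert_only = (
    meta["mutations"]["update_ratio"] == 0.0
    and meta["mutations"]["delete_ratio"] == 0.0
)
```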

Table Metadata

Per-table metadata is embedded in version.json and includes:

  • Schema
  • Primary key columns
  • Time column
  • Batch IDs
  • Batch part counts

Supported Datasets

Dataset          Type             Description
TPC-H            tpch             8 standard TPC-H benchmark tables
Simple Sequence  simple_sequence  Simple integer sequence tables for testing

TPC-H Tables

Table     Primary Key
customer  c_custkey
lineitem  l_orderkey, l_linenumber
nation    n_nationkey
orders    o_orderkey
part      p_partkey
partsupp  ps_partkey, ps_suppkey
region    r_regionkey
supplier  s_suppkey

Operation Codes

The ETL pipeline understands three raw operation codes:

Operation  Internal Column  Description
Create     __op = "c"       Insert a new row
Update     __op = "u"       Modify an existing row (tracked by primary key)
Delete     __op = "d"       Remove an existing row (tracked by primary key)

The current spicebench generate CLI emits create-only batches and records zero mutation ratios in version.json by default.
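The routing implied by the operation codes can be sketched as follows. Rows are plain dicts here for illustration; the real pipeline operates on Arrow record batches:

```python
# Group rows by the internal __op column before routing
# creates, updates, and deletes to their respective paths.
def split_by_op(rows):
    groups = {"c": [], "u": [], "d": []}
    for row in rows:
        groups[row["__op"]].append(row)
    return groups

batch = [
    {"__op": "c", "c_custkey": 1},
    {"__op": "u", "c_custkey": 1},
    {"__op": "d", "c_custkey": 2},
]
groups = split_by_op(batch)
```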

Running Data Generation

spicebench generate \
    --scale-factor 1 \
    --bucket my-benchmark-data \
    --region us-west-2 \
    --prefix raw \
    --num-steps 10

To write a local archive instead of uploading to S3, use --output-archive ./tpch-sf1.tar.zst.

Production Datasets

Pre-generated TPC-H datasets are maintained on MinIO (spicebench bucket, us-east-1) for benchmark runs. Two dataset families are available, both generated via the data_generation_run.yml GitHub Actions workflow.

Insert-Only (data-gen prefix)

Standard TPC-H data with inserts only — no updates or deletes. Currently used as the default for benchmark runs.

S3 Path             Scale Factor  Steps  Checkpoint Interval
data-gen/tpch/0.01  0.01          20     10
data-gen/tpch/0.1   0.1           20     10
data-gen/tpch/1.0   1.0           20     10
data-gen/tpch/10.0  10.0          20     10

With Mutations (data-gen-mut-v6 prefix)

TPC-H data with insert, update, and delete operations mixed into each step. This dataset will become the default once mutation support is fully validated, replacing the insert-only dataset above.

S3 Path                   Scale Factor  Steps  Checkpoint Interval  Update Ratio  Delete Ratio
data-gen-mut-v6/tpch/1.0  1.0           20     5                    0.1           0.05

ETL Pipeline

The ETL pipeline reads a generated archive, processes raw batches, and writes to a configurable sink.

Processing Steps

  1. Read raw Parquet batches from the extracted archive
  2. Rehydrate records and append the time column
  3. Split rows by operation type (__op)
  4. Append __created_at for freshness tracking
  5. Strip internal columns (__op, __key_*)
  6. Write the resulting batches to the configured sink
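Steps 4 and 5 can be sketched on a single dict-shaped row (the real pipeline does this column-wise on Arrow batches):

```python
from datetime import datetime, timezone

# Stamp __created_at for freshness tracking, then strip internal
# columns (__op and any __key_* columns) before handing the row
# to the sink. __created_at itself is retained.
def finalize_row(row: dict) -> dict:
    row = dict(row, __created_at=datetime.now(timezone.utc).isoformat())
    return {
        k: v
        for k, v in row.items()
        if k != "__op" and not k.startswith("__key_")
    }

out = finalize_row({"__op": "c", "__key_0": 1, "c_custkey": 1})
```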

Local Archive Mode

Instead of downloading from S3, standalone ETL can read a local archive directly:

spicebench etl \
    --scenario tpch \
    --scale-factor 1 \
    --archive-file ./tpch-sf1.tar.zst \
    --sink null

Sinks

ADBC Sink

Writes directly to the SUT via ADBC bulk ingest.

spicebench etl \
    --scenario tpch \
    --scale-factor 1 \
    --bucket my-data \
    --prefix raw \
    --sink adbc \
    --adbc-driver flightsql \
    --adbc-uri "grpcs://my-platform.example.com:443" \
    --adbc-option username="" \
    --adbc-option password="$API_KEY" \
    --adbc-create-tables

When using FlightSQL, ETL automatically sets adbc.flight.sql.client_option.with_max_msg_size to 78643200 (75 MiB) unless you explicitly override that option with --adbc-option.
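The default-unless-overridden behavior amounts to option merging. A sketch of that logic (the function name is illustrative; only the option key and value match the text above):

```python
# The FlightSQL max-message-size option defaults to 75 MiB unless
# the caller already supplied it via --adbc-option.
MAX_MSG_SIZE_KEY = "adbc.flight.sql.client_option.with_max_msg_size"
DEFAULT_MAX_MSG_SIZE = 75 * 1024 * 1024  # 78643200 bytes

def apply_flightsql_defaults(user_options: dict) -> dict:
    options = dict(user_options)
    options.setdefault(MAX_MSG_SIZE_KEY, str(DEFAULT_MAX_MSG_SIZE))
    return options
```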

Databricks example:

spicebench etl \
    --scenario tpch \
    --scale-factor 1 \
    --bucket my-data \
    --prefix raw \
    --sink adbc \
    --adbc-driver databricks \
    --adbc-uri "databricks://token:${DATABRICKS_TOKEN}@${DATABRICKS_ENDPOINT}:443/${DATABRICKS_HTTP_PATH}" \
    --adbc-catalog main \
    --adbc-schema tpch \
    --adbc-create-tables

Null Sink

Discards all writes. Useful for measuring source and ETL throughput without sink overhead.

spicebench etl \
    --scenario tpch \
    --scale-factor 1 \
    --bucket my-data \
    --prefix raw \
    --sink null

Pipeline States

The ETL pipeline transitions through these states:

NotStarted -> Initialized -> Running -> Paused -> Running -> ... -> Stopped

State               Description
NotStarted          Pipeline created but not initialized
Initialized         Storage connected and metadata loaded
Running             Actively processing batches
Paused              Temporarily paused for checkpoint validation
Stopped(Completed)  All batches processed successfully
Stopped(Cancelled)  Pipeline cancelled by user or system
Stopped(Error)      Pipeline stopped due to an error
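The transitions above can be captured in a small state machine. A sketch, with the Stopped variants collapsed into one state for brevity (the transition table is illustrative, not the actual implementation):

```python
from enum import Enum

class State(Enum):
    NOT_STARTED = "NotStarted"
    INITIALIZED = "Initialized"
    RUNNING = "Running"
    PAUSED = "Paused"
    STOPPED = "Stopped"  # Completed, Cancelled, or Error

# Legal transitions per the diagram above; a pipeline may
# alternate Running <-> Paused any number of times.
ALLOWED = {
    State.NOT_STARTED: {State.INITIALIZED},
    State.INITIALIZED: {State.RUNNING},
    State.RUNNING: {State.PAUSED, State.STOPPED},
    State.PAUSED: {State.RUNNING, State.STOPPED},
    State.STOPPED: set(),
}

def transition(current: State, target: State) -> State:
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```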

ETL within SpiceBench

In the current main benchmark path:

  1. SpiceBench downloads and extracts the data archive before timed execution
  2. SpiceBench calls adapter setup and prepares the ADBC query path
  3. The timed benchmark starts, then ETL runs concurrently with query execution
  4. At checkpoint boundaries, ETL can pause for result validation when --validate-results is enabled and checkpoints are available
  5. After ETL completes, SpiceBench stops the benchmark and then calls adapter teardown

The ETL sink type is selected with --etl-sink:

  • adbc: direct ADBC ingest. The adapter's setup response provides write-side ADBC config

Checkpointing

The spicebench checkpoint subcommand captures expected query results at specific ETL steps so benchmark runs can validate correctness while ingestion is active.

How It Works

  1. Generate checkpoints by replaying ETL into DuckDB
  2. Execute the scenario's query workload at configured checkpoint intervals
  3. Write each checkpoint result set as Parquet
  4. Upload checkpoint files and a manifest to S3
  5. During benchmark runs, pause ETL at checkpoint boundaries and compare live results against the stored checkpoint data

S3 Checkpoint Layout

s3://{bucket}/{prefix}/
├── checkpoints.json
└── checkpoints/
    └── {scenario}/
        └── {checkpoint_idx}/
            ├── {query_idx_0}.parquet
            ├── {query_idx_1}.parquet
            └── ...
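The key for one checkpoint result file follows from the layout above. A sketch (the helper name is illustrative):

```python
def checkpoint_key(prefix: str, scenario: str,
                   checkpoint_idx: int, query_idx: int) -> str:
    """Build the S3 key for one checkpoint result file, following
    {prefix}/checkpoints/{scenario}/{checkpoint_idx}/{query_idx}.parquet."""
    return f"{prefix}/checkpoints/{scenario}/{checkpoint_idx}/{query_idx}.parquet"
```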

Checkpoint Manifest

{
    "scenarios": {
        "tpch": {
            "checkpoint_indexes": [5, 10],
            "query_indexes": [0, 1, 2, 3, 4],
            "checkpoint_interval_steps": 5
        }
    }
}
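A consumer can enumerate the checkpoint files implied by the manifest with a simple cross product of checkpoint and query indexes. A sketch using the example manifest above:

```python
import json

# Parse the manifest and list the (checkpoint_idx, query_idx)
# pairs it implies, one per stored Parquet result file.
manifest = json.loads("""
{
    "scenarios": {
        "tpch": {
            "checkpoint_indexes": [5, 10],
            "query_indexes": [0, 1, 2, 3, 4],
            "checkpoint_interval_steps": 5
        }
    }
}
""")
tpch = manifest["scenarios"]["tpch"]
files = [
    (ck, q)
    for ck in tpch["checkpoint_indexes"]
    for q in tpch["query_indexes"]
]
```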

Using Checkpoints in SpiceBench

Enable checkpoint validation with --validate-results:

spicebench run \
    --scenario tpch \
    --system-adapter-name myplatform \
    --system-adapter-http-url http://127.0.0.1:8080/jsonrpc \
    --validate-results

During the benchmark, when ETL reaches a checkpoint step:

  1. ETL pauses
  2. SpiceBench runs the scenario workload against the current system state
  3. Results are compared against stored checkpoint output
  4. ETL resumes if validation succeeds
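The comparison in step 3 can be sketched as an order-insensitive row comparison. The real validation operates on Parquet result sets; sorted tuples stand in for that here:

```python
# Compare a live result set against stored checkpoint rows,
# ignoring row order.
def results_match(live_rows, checkpoint_rows) -> bool:
    return sorted(map(tuple, live_rows)) == sorted(map(tuple, checkpoint_rows))

ok = results_match([(1, "a"), (2, "b")], [(2, "b"), (1, "a")])
```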