A rigorous, reproducible, open benchmark framework for evaluating event sourcing databases and append-only log systems.
This project exists to define a credible performance standard for event stores — one that measures real-world behavior under realistic workloads, not synthetic best-case scenarios.
The benchmark is built with:
- Rust — high-precision workload execution and measurement
- Python — analysis, visualization, and reporting
Run the `run_and_report.sh` script to execute the benchmark. You may need to install Rust and Python first.

```bash
# Set up the Python virtual environment
python3 -m venv .venv && source .venv/bin/activate
pip install -r python/requirements.txt

# Run the benchmark
./scripts/run_and_report.sh
```

Then open `results/published/index.html` in your browser to view the results.
Most existing benchmarks for event stores:
- Measure only peak append throughput
- Ignore latency percentiles
- Skip recovery and crash behavior
- Do not model realistic workload shapes
- Are difficult to reproduce
- Favor a specific implementation
This project aims to correct that.
We treat benchmarking as an engineering discipline — not a marketing exercise.
This benchmark suite is built around the following principles:
Benchmarks must model real event-sourced applications:
- Many small streams
- Some hot streams
- Heavy-tailed (Zipf-like) distributions
- Tag/category filtering
- Concurrent writers
- Catch-up subscribers
- Mixed read/write workloads
Synthetic “write 1 million events to one stream” tests are insufficient.
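A heavy-tailed stream distribution can be sketched as below. This is illustrative only: the exponent, stream count, and sample size are assumptions for demonstration, not values mandated by the suite, but the fixed seed shows how such a workload stays deterministic.

```python
import numpy as np

# Fixed seed -> the same stream-ID sequence on every run (determinism).
rng = np.random.default_rng(42)

unique_streams = 10_000
exponent = 1.2  # assumed Zipf skew; steeper = hotter hot streams

# Zipf-like weights: probability of stream k is proportional to 1/k^s.
ranks = np.arange(1, unique_streams + 1, dtype=float)
weights = ranks ** -exponent
weights /= weights.sum()

# Draw 1 million appends; a small set of "hot" streams dominates.
stream_ids = rng.choice(unique_streams, size=1_000_000, p=weights)

counts = np.bincount(stream_ids, minlength=unique_streams)
hot_share = np.sort(counts)[::-1][: unique_streams // 100].sum() / counts.sum()
```

With these assumed parameters, the hottest 1% of streams receive well over half of all appends, which is the contention shape real event-sourced systems face.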
We measure:
- p50
- p95
- p99
- p99.9
Average throughput alone is misleading.
Latency distribution under contention is what matters.
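A minimal sketch of why the suite reports percentiles rather than averages, using made-up latency samples:

```python
import numpy as np

# Illustrative sample only; values are invented for demonstration.
latencies_ms = np.array([0.4, 0.6, 0.8, 0.9, 1.1, 1.3, 2.0, 2.5, 9.0, 30.0])

summary = {
    "p50": float(np.percentile(latencies_ms, 50)),
    "p95": float(np.percentile(latencies_ms, 95)),
    "p99": float(np.percentile(latencies_ms, 99)),
    "p999": float(np.percentile(latencies_ms, 99.9)),
}
mean_ms = float(latencies_ms.mean())
# The mean (~4.9 ms) sits far above the median and far below the tail:
# averaged alone, it says nothing about the slowest requests.
```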
We explicitly test:
- Crash during write
- Restart and recovery
- WAL replay cost
- Index rebuild time
- Checkpoint recovery
If a store claims durability, it must survive termination mid-transaction.
All benchmarks must be:
- Deterministic (fixed random seeds)
- Configurable via versioned YAML definitions
- Run on documented hardware
- Run with a documented OS and fsync mode
- Repeatable across environments
Raw results must be published alongside summarized results.
The benchmark must not favor a specific implementation.
Adapters are used to interface with different systems, but workloads are defined independently of implementation details.
```text
event-store-benchmark/
├── README.md
├── SPEC.md
├── workloads/
│   ├── append_only.yaml
│   ├── concurrent_writers.yaml
│   ├── mixed_read_write.yaml
│   ├── tag_queries.yaml
│   ├── replay.yaml
│   └── crash_recovery.yaml
├── rust/
│   ├── bench-core/
│   ├── adapters/
│   │   ├── eventstore/
│   │   ├── postgres/
│   │   ├── sqlite/
│   │   └── kafka/
│   └── cli/
├── python/
│   ├── analysis/
│   ├── plotting/
│   ├── notebooks/
│   └── report_generator.py
├── results/
│   ├── raw/
│   ├── processed/
│   └── published/
└── docs/
    ├── methodology.md
    ├── environment.md
    └── reproducibility.md
```
The Rust measurement engine is responsible for:
- Workload execution
- Concurrency control
- Precise latency measurement
- Resource usage tracking
- Crash injection
- Raw metrics output
The Rust engine produces structured output (JSON or CSV):

```json
{
  "workload": "concurrent_writers",
  "events_written": 1000000,
  "throughput_eps": 425000,
  "latency": {
    "p50": 0.8,
    "p95": 2.1,
    "p99": 5.4,
    "p999": 11.2
  },
  "cpu_percent": 78,
  "memory_mb": 512,
  "recovery_time_ms": 1420
}
```

No analysis logic lives in Rust — only measurement.
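On the Python side, consuming this output is plain JSON parsing. A sketch, using the field names from the example summary (the derived "tail inflation" figure is an illustration, not a metric the suite defines):

```python
import json

# Field names mirror the example summary emitted by the Rust engine.
raw = (
    '{"workload": "concurrent_writers", "events_written": 1000000, '
    '"throughput_eps": 425000, '
    '"latency": {"p50": 0.8, "p95": 2.1, "p99": 5.4, "p999": 11.2}, '
    '"cpu_percent": 78, "memory_mb": 512, "recovery_time_ms": 1420}'
)
run = json.loads(raw)

# Example derived figure: how much worse the p99.9 tail is than the median.
tail_inflation = run["latency"]["p999"] / run["latency"]["p50"]  # ~14x
```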
The Python analysis layer is responsible for:
- Aggregating benchmark runs
- Computing statistical comparisons
- Plotting latency distributions
- Generating tables for publication
- Producing PDF/HTML reports
- Detecting regressions between runs
This separation prevents analytical overhead from contaminating benchmark execution.
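A hypothetical sketch of the regression check between two runs: flag a candidate run whose p99 latency worsens beyond a tolerance relative to a baseline. The 10% tolerance and the function name are assumptions for illustration, not suite policy.

```python
def is_regression(baseline_p99_ms: float, candidate_p99_ms: float,
                  tolerance: float = 0.10) -> bool:
    """True if the candidate's p99 exceeds the baseline by more than
    `tolerance` (a fraction, e.g. 0.10 = 10%)."""
    return candidate_p99_ms > baseline_p99_ms * (1.0 + tolerance)
```

For example, against a baseline p99 of 5.4 ms, a candidate at 5.6 ms passes while one at 6.5 ms is flagged.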
All workloads are defined declaratively in YAML.
Example:

```yaml
name: concurrent_writers
duration_seconds: 60
writers: 8
event_size_bytes: 1024
streams:
  distribution: zipf
  unique_streams: 100000
conflict_rate: 0.05
durability: fsync_on
```

Each workload defines:
- Event size
- Stream distribution
- Writer concurrency
- Conflict behavior
- Tag cardinality
- Read/write ratio
- Durability mode
- Duration or target event count
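A parsed workload definition can be sanity-checked before a run. This is a minimal sketch: the keys follow the `concurrent_writers` example above, and the required-key set is an assumption for illustration.

```python
# Assumed minimal set of keys every workload must carry.
REQUIRED_KEYS = {"name", "duration_seconds", "writers",
                 "event_size_bytes", "streams", "durability"}

# A workload as it would look after YAML parsing (concurrent_writers).
workload = {
    "name": "concurrent_writers",
    "duration_seconds": 60,
    "writers": 8,
    "event_size_bytes": 1024,
    "streams": {"distribution": "zipf", "unique_streams": 100_000},
    "conflict_rate": 0.05,
    "durability": "fsync_on",
}

missing = REQUIRED_KEYS - workload.keys()
if missing:
    raise ValueError(f"workload is missing keys: {sorted(missing)}")
```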
The suite includes:
- `append_only`: single writer, sequential appends.
- `concurrent_writers`: multiple concurrent writers with configurable conflict injection.
- `mixed_read_write`: write-heavy workloads with background reads.
- `tag_queries`: high- and low-cardinality tag distributions, including intersection queries.
- `replay`: a subscriber catching up while writes continue.
- `crash_recovery`: random process termination; measures recovery time and consistency.
- Long-running soak: sustained load (hours); measures drift, fragmentation, and index growth.
Each benchmark run captures:
- Throughput (events/sec)
- Latency percentiles
- CPU utilization
- Memory usage
- Disk I/O
- Index size growth
- Write amplification (if available)
- Recovery time
- Error/conflict rates
Each published result must document:
- CPU model
- Core count
- RAM
- Disk type (NVMe, SSD, HDD)
- Filesystem
- OS version
- Fsync configuration
- Kernel tuning (if any)
- Store configuration
Benchmarks must be run on isolated machines.
Each target system implements a common Rust trait:
```rust
trait EventStoreAdapter {
    fn append(&self, stream: &str, events: &[Event]) -> Result<()>;
    fn read_stream(&self, stream: &str, from: u64) -> Result<Vec<Event>>;
    fn query_by_tag(&self, tag: &str) -> Result<Vec<Event>>;
    fn crash(&self);
    fn recover(&self);
}
```

This allows the same workload to run across different systems.
The benchmark CLI is `esbs`. You can run it from `./target/release/esbs` after building, or via Cargo:

```bash
cargo run -p esbs -- <command> [options]
```

Global options:
- `--log <level>`: set log verbosity (`trace`, `debug`, `info`, `warn`, `error`). Default: `info`.
- `-h, --help`: show help
- `-V, --version`: show version
The `run` command executes a workload against a specific adapter and writes raw results to a timestamped folder.
Usage:

```bash
esbs run \
  --store <adapter> \
  --workload <path/to/workload.yaml> \
  [--output results/raw] \
  [--uri <connection-uri>] \
  [--option key=value ...] \
  [--seed <u64>]
```

Parameters:
- `--store <adapter>`: adapter name. Currently supported: `umadb`.
- `--workload <file>`: path to a YAML workload definition (see `workloads/`).
- `--output <dir>`: base directory for raw results. Default: `results/raw`.
- `--uri <string>`: connection URI for the store (e.g., `http://localhost:50051`).
- `--option key=value`: repeatable key/value options for the adapter. For UmaDB:
  - `api_key=umadb:...`
  - `ca_path=/path/to/server.pem` (TLS with self-signed certs)
  - `batch_size=1000` (read hint)
- `--seed <u64>`: deterministic RNG seed. Default: `42`.
Examples:
```bash
# Basic UmaDB run (insecure gRPC)
esbs run --store umadb \
  --workload workloads/concurrent_writers.yaml \
  --uri http://localhost:50051 \
  --seed 42

# UmaDB with TLS and API key
esbs run --store umadb \
  --workload workloads/concurrent_writers.yaml \
  --uri https://localhost:50051 \
  --option ca_path=server.pem \
  --option api_key=umadb:example-api-key \
  --option batch_size=1000

# Custom results location
esbs run --store umadb \
  --workload workloads/concurrent_writers.yaml \
  --uri http://localhost:50051 \
  --output results/raw/lab-a
```

Outputs:
- `summary.json`: overall stats (duration, throughput, latency percentiles, adapter/workload IDs)
- `samples.jsonl`: per-append samples (timestamp, latency, ok/error)
- `run.meta.json`: minimal locator for the Python reporting layer
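Consuming the per-append samples is one JSON object per line. A sketch of reading a few `samples.jsonl`-style records; the exact field names (`ts`, `latency_ms`, `ok`) are assumptions about the format, not documented keys.

```python
import json

# Three invented records in a samples.jsonl-like shape:
# one JSON object per line, with timestamp, latency, and ok/error.
lines = [
    '{"ts": 0.001, "latency_ms": 0.8, "ok": true}',
    '{"ts": 0.002, "latency_ms": 2.3, "ok": true}',
    '{"ts": 0.003, "latency_ms": 9.7, "ok": false}',
]
samples = [json.loads(line) for line in lines]

# Separate successful latencies from the error rate.
ok_latencies = [s["latency_ms"] for s in samples if s["ok"]]
error_rate = sum(not s["ok"] for s in samples) / len(samples)
```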
The `list-workloads` command lists workload YAML files in a directory (defaults to `workloads/`).

Usage:

```bash
esbs list-workloads [--path <dir>]
```

Example:

```bash
esbs list-workloads
```

The `list-stores` command lists the available adapters.

Usage:

```bash
esbs list-stores
```

Current output:

```text
umadb
```
Tips:
- Prefer `--seed` to make runs comparable across machines.
- Use `--log debug` for more detailed progress during development.
- Start UmaDB quickly with the provided `docker-compose.yml` (`docker compose up -d umadb`).
Published benchmark reports must include:
- Workload definition
- Raw metrics
- Summary tables
- Latency distribution graphs
- Environment specification
- Exact commit hash of benchmark suite
- Exact version of target system
Transparency is mandatory.
This benchmark suite does not:
- Optimize systems for artificial workloads
- Hide durability settings
- Benchmark in-memory-only configurations
- Publish results without reproducibility metadata
- Declare “winners”
The goal is measurement, not marketing.
Contributions are welcome for:
- New workload definitions
- New system adapters
- Improved statistical analysis
- Improved reporting templates
- Environment automation scripts
All contributions must preserve:
- Determinism
- Reproducibility
- Neutrality
This project aims to become:
- A reference benchmark for event sourcing systems
- A research-grade measurement framework
- A regression detection tool for event store developers
- A shared standard for comparing durability and performance trade-offs
If adopted broadly, this could meaningfully improve the quality of performance claims in the event sourcing ecosystem.
- Install Rust (stable toolchain)
- Install Python 3.11+
- Clone the repository
- Run sample workloads
- Generate example reports
- Validate results on your hardware
Full setup instructions are in docs/reproducibility.md.
Open source under the MIT License.
