feat(db): static file storage for immutable block/transaction data#506

Draft
kariy wants to merge 12 commits into `main` from `feat/database-optimization`

Conversation


@kariy kariy commented Mar 24, 2026

Summary

This PR introduces a static file storage layer for immutable, append-only block and transaction data. Heavy values (headers, transactions, receipts, execution traces, state updates) are moved from MDBX B-trees to sequential flat .dat files, while MDBX retains the role of authoritative index — storing offset pointers and all mutable/random-access data.

Motivation

Katana's MDBX database stores 37 tables, but several contain data that is written once and never modified — headers, transactions, receipts, traces, state updates. Storing these large values in MDBX B-trees causes:

  • Write amplification: B-tree page splits and copy-on-write overhead for what is fundamentally sequential, append-only data
  • MDBX file bloat: Large values (transactions ~200B–10KB, traces ~10–100KB) inflate the MDBX file, reducing OS page cache effectiveness for the hot mutable state tables

Flat file appends eliminate these costs while preserving ACID consistency through MDBX-gated reads.

Architecture

Old design: everything in MDBX

```
Read:   MDBX get(key) → decompress → value
Write:  MDBX put(key, compress(value)) → B-tree insert → page splits
```

New design: MDBX pointers + static file data

```
Read:   MDBX get(key) → StaticFileRef { offset, length }
        → mmap/pread from .dat file → decompress → value

Write:  append(compress(value)) → .dat file → (offset, length)
        → MDBX put(key, StaticFileRef::pointer(offset, length))
```

MDBX is the single source of truth for what data exists. A reader's MDBX transaction snapshot determines exactly which static file data is visible. Static files have no independent index — they are raw data blobs addressed solely by MDBX pointers.

What moved to static files

| Static File | Contents | Size per entry |
|---|---|---|
| `headers.dat` | Compressed block headers | ~300–500B |
| `block_state_updates.dat` | Compressed state updates | ~1–100KB |
| `transactions.dat` | Compressed transactions | ~200B–10KB |
| `receipts.dat` | Compressed receipts | ~100–500B |
| `tx_traces.dat` | Compressed execution traces | ~10–100KB |
| `block_hashes.dat` | Block hashes (fixed 32B) | 32B |
| `tx_hashes.dat` | Tx hashes (fixed 32B) | 32B |
| `tx_blocks.dat` | Tx-to-block mapping (fixed 8B) | 8B |

What stays in MDBX

| Table | Reason |
|---|---|
| `Headers`, `BlockStateUpdates`, `Transactions`, `Receipts`, `TxTraces` | Now store `StaticFileRef` pointers (13 bytes) instead of the actual data |
| `BlockBodyIndices` | Too small (~10B) — pointer would be larger than the data |
| `BlockNumbers`, `TxNumbers` | Random-key reverse indexes (keyed by hash) |
| `BlockStatusses` | Mutable (finality status changes) |
| `BlockHashes`, `TxHashes`, `TxBlocks` | Also in static files; kept in MDBX as fallback for fork mode |
| All state/history/trie/class tables | Mutable or random-key access |

StaticFileRef — the MDBX-stored pointer

```rust
enum StaticFileRef {
    StaticFile { offset: u64, length: u32 }, // 13 bytes — points to .dat file
    Inline(Vec<u8>),                         // compressed data inline in MDBX
}
```
  • Sequential mode (production): data appended to .dat files, MDBX stores StaticFile pointers
  • Fork mode (non-sequential block numbers): data stored Inline in MDBX directly, since static file appends require sequential keys

Key Design Decisions

1. MDBX gates all static file reads

Static file reads are not independent — they always go through an MDBX transaction snapshot first. This ensures:

  • No stale reads (reader sees a consistent snapshot)
  • No phantom reads (if the pointer exists, the data is durable)
  • ACID consistency without a separate transaction coordinator

2. Offset pointers in MDBX, not in .idx files

Early iterations stored offsets in separate .idx files alongside .dat files. This was moved to MDBX because .idx files are outside the MDBX transaction boundary — a reader could see an offset written by an uncommitted transaction, breaking snapshot isolation.

3. BlockBodyIndices stays in MDBX

At ~10 bytes per entry, the 13-byte StaticFileRef pointer is larger than the data itself. Benchmarks showed a ~10% read regression with pointer indirection, which disappeared once the table was moved back to direct MDBX storage.

4. No fsync per block

Static files are not fsynced on every write. On crash, MDBX rolls back uncommitted pointers, and orphaned static file data is truncated on next startup. This matches MDBX's own durability model.

5. Fork mode uses inline storage

Fork providers insert blocks at non-sequential numbers. Since static files require sequential appends, fork mode stores compressed data inline in MDBX (StaticFileRef::Inline). The index tables (BlockHashes, TxHashes, TxBlocks) are also written to MDBX in fork mode as fallback.

6. Classes not moved

Contract classes (100KB–2MB JSON) are keyed by ClassHash (random), not sequential numbers. The crash recovery mechanism relies on cursor-last to find the truncation point, which doesn't work for random keys. Classes are also cached by the executor, so the extra I/O hop matters less. Considered for a future iteration.

Crash Recovery

Static file appends happen before the MDBX transaction commits. A crash between these steps leaves orphaned data:

```
t0: Append to headers.dat        ← on disk
t1: Append to transactions.dat   ← on disk
t2: MDBX put (pointers)          ← uncommitted
t3: --- CRASH ---
t4: MDBX rolls back              ← pointers gone
    Static files have orphaned tail data
```

On every Db::open(), recover_static_files() reads the last committed MDBX pointer for each table and truncates the corresponding .dat file to offset + length. Fixed-size files are truncated to count * record_size. This is idempotent and runs before any writes.

I/O Optimizations

| Optimization | Impact |
|---|---|
| pread/pwrite | Lock-free reads via `pread` (no file offset sharing); writes serialized with `pwrite` under a mutex |
| mmap for reads | Memory-mapped reads from the kernel page cache — zero syscall overhead; falls back to `pread` for data not yet remapped |
| Buffered writes | Small appends accumulate in a 64KB buffer, flushed as a single `pwrite` at 256KB; reduces syscalls from ~54/block to a few per batch |
| Two-phase batch insert | `insert_block_data_batch()`: phase 1 appends all data to static files (sequential I/O), phase 2 writes all MDBX pointers (grouped B-tree inserts); pre-sizes buffers based on batch dimensions |
| Conditional dual-write removal | `BlockHashes`, `TxHashes`, `TxBlocks` only written to MDBX in fork mode; sequential mode skips 3 MDBX puts per block |

Builder API

```rust
// Production — default settings
let db = DbBuilder::new().write().build("path/to/db")?;

// Custom static file tuning for heavy sync
let sf = StaticFilesBuilder::new()
    .write_buffer_size(1024 * 1024)
    .flush_threshold(4 * 1024 * 1024);
let db = DbBuilder::new().write().static_files(sf).build("path/to/db")?;

// Tests
let db = DbBuilder::new().in_memory().build_ephemeral()?;
```

Benchmark Results (file-backed)

Writes — per-block vs batch (new)

| Workload | per-block | batch | Speedup |
|---|---|---|---|
| 100 blocks × 1 tx | 12.0 ms | 2.9 ms | 4.2× faster |
| 100 blocks × 10 txs | 30.3 ms | 18.7 ms | 1.6× faster |
| 500 blocks × 100 txs | 1.12 s | 907 ms | 1.2× faster |

Writes — per-block vs MDBX-only baseline

| Workload | Baseline (MDBX only) | New (per-block) | Change |
|---|---|---|---|
| 100 blocks × 100 txs | 201 ms | 199 ms | ~0% |
| 500 blocks × 100 txs | 1.23 s | 1.10 s | -11% |

Reads — vs MDBX-only baseline (500 blocks × 100 txs)

| Operation | Baseline | New | Change |
|---|---|---|---|
| `latest_number` | 268 ns | 253 ns | -6% |
| `latest_hash` | 377 ns | 307 ns | -19% |
| `header_by_number` | 489 ns | 493 ns | ~0% |
| `block_body_indices` | 287 ns | 306 ns | +7% |
| `transaction_by_hash` | 1.90 µs | 1.93 µs | +2% |
| `receipt_by_hash` | 1.89 µs | 1.90 µs | ~0% |
| `full_block` (100 txs) | 164.5 µs | 155.0 µs | -6% |

Reads are at parity or faster. The write advantage grows with data volume as MDBX B-tree page splits become more expensive while flat file appends stay O(1).

Advantages

  • Reduced MDBX write amplification for large immutable values
  • Smaller MDBX file → better OS page cache for mutable state tables
  • Faster batch writes via two-phase insertion (4.2× for small blocks)
  • Sequential disk layout for block/tx data → better read locality
  • ACID consistency preserved through MDBX-gated reads
  • Configurable via DbBuilder / StaticFilesBuilder

Disadvantages

  • Added complexity — two storage backends instead of one
  • Per-block writes are ~15–25% slower for small blocks due to dual-write overhead (mitigated by the batch API)
  • BlockBodyIndices reads +7% — pointer indirection overhead for very small values (kept in MDBX to minimize)
  • Fork mode stores data inline — no static file benefit for forked providers
  • Crash recovery required — orphaned static file data must be truncated on startup
  • Not applicable to random-key tables (classes, contract storage) — only sequential append-only data benefits

kariy and others added 12 commits March 20, 2026 19:36
Introduce a static file storage layer for append-only, immutable block
and transaction data. This moves heavy data (headers, transactions,
receipts, traces, state updates) and sequential indexes (block hashes,
tx hashes, body indices, tx blocks) to flat files, reducing MDBX write
amplification for data that is never modified after insertion.

Key changes:
- New `static_files` module in katana-db with generic `StaticStore` trait
  (`FileStore` for production, `MemoryStore` for tests)
- `FixedColumn` (O(1) lookup) and `IndexedColumn` (variable-size with
  .dat/.idx) abstractions for static data
- `StaticFiles` container with typed read/write APIs using existing
  Compress/Decompress codecs
- `Db` struct extended with `Arc<StaticFiles<AnyStore>>`
- `DbProvider` updated to dual-write (static files + MDBX) with
  static-file-first reads and MDBX fallback
- Sequential detection: only writes to static files when block numbers
  are sequential (production mode), falls back to MDBX-only for fork mode
- Crash recovery via manifest-based truncation on startup
- DB version bumped to 10

This is the initial scaffolding. A follow-up will move offset pointers
into MDBX tables to ensure static file reads are gated by MDBX
transaction snapshots for full ACID consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move offset pointers (.idx) into MDBX tables so all static file reads
are gated by the MDBX transaction snapshot, ensuring ACID consistency.

Key design changes:
- New `StaticFileRef` enum stored as MDBX table value: `StaticFile {
  offset, length }` for production sequential writes, or `Inline(bytes)`
  for fork mode where static files aren't used
- MDBX tables Headers, BlockStateUpdates, BlockBodyIndices, Transactions,
  Receipts, TxTraces now store `StaticFileRef` pointers instead of
  the actual data
- `IndexedColumn` replaced by `DataColumn` (no .idx files) — the caller
  provides offset+length from MDBX when reading
- Fixed-size index tables (BlockHashes, TxHashes, TxBlocks) kept in both
  MDBX and static files — MDBX serves as fallback for fork mode
- Write path: append to static files → fsync → write pointers to MDBX →
  MDBX commit makes everything atomically visible
- Read path: read pointer from MDBX snapshot → fetch data from static
  file (or decompress inline) — no data visible until MDBX commits
- Manifest removed as authority — MDBX is the single source of truth
- Crash recovery: MDBX state determines what exists, orphaned static
  file data is harmless (truncated on next startup)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Optimize the static file storage layer based on benchmark results:

1. **pread/pwrite**: Replace seek+read/write with pread/pwrite syscalls.
   Reads no longer need the mutex, writes use pwrite under a lightweight
   lock. Cached file length in AtomicU64 avoids fstat syscalls.

2. **mmap for reads**: Memory-map static files for zero-copy reads from
   the kernel page cache. Reads within the mapped region skip syscalls
   entirely. Falls back to pread for data written after the last remap.
   Remap is called on commit() to make new data visible.

3. **Remove redundant dual-writes**: BlockHashes, TxHashes, TxBlocks are
   only written to MDBX in fork (non-sequential) mode. In sequential
   mode, they exist only in static files, saving 3 MDBX puts per block.

4. **Move fsync out of insert_block_data**: Fsync no longer happens per
   block insert. Static files rely on MDBX's durability model — on crash,
   orphaned data is truncated to match MDBX state on next startup.

5. **Add file-backed benchmark**: New criterion benchmark measuring write
   and read performance with real disk I/O using tempdir-backed databases.

Benchmark results (file-backed, vs MDBX-only baseline):
- Reads: at parity or faster (latest_hash -19%, full_block -9%)
- Writes: +21-27% overhead from dual-store architecture

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Buffer small pwrite calls in memory and flush as a single pwrite when
the buffer exceeds 256KB or on remap/sync. This reduces the number of
syscalls per block insert from ~54 (for 10 txs/block) to a handful.

Reads check the write buffer for recently-appended data that hasn't
been flushed yet, so no data is lost between writes and reads.

Benchmark improvement (file-backed, vs previous commit):
- Writes: -11% to -20% faster (500×10tx now -2% vs MDBX baseline)
- Reads: unchanged (mmap path unaffected)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document benchmark results for each optimization step, comparing
file-backed static file storage against the MDBX-only baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ndirection)

BlockBodyIndices is ~5-10 bytes (two postcard varints). Storing a 13-byte
StaticFileRef pointer in MDBX to reference it adds overhead — the pointer
is larger than the data. Direct MDBX storage avoids the extra indirection.

block_body_indices read: 325ns → 306ns (-6%)
Writes also improved slightly (one fewer static file append per block).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `insert_block_data_batch()` method optimized for the sync pipeline:

Phase 1: Append ALL block/tx data to static files (sequential I/O),
collecting the resulting pointers in memory. Write buffers are
pre-sized based on the batch dimensions to avoid reallocations.

Phase 2: Write ALL MDBX entries (pointers + indexes) in one pass.
This groups B-tree inserts together for better cache locality.

The batch method uses a single MDBX transaction for the entire chunk
instead of per-block transactions, eliminating per-block tx overhead.

Benchmark results (file-backed):
- 100 blocks x 1 tx:   per_block 12.0ms → batch 2.9ms  (4.2x faster)
- 100 blocks x 10 txs: per_block 30.3ms → batch 18.7ms (1.6x faster)
- 100 blocks x 50 txs: per_block 105ms  → batch 89.7ms (1.2x faster)
- 500 blocks x 100 txs: per_block 1.12s → batch 907ms  (1.2x faster)

The batch method at 100x1tx (2.9ms) is 3.9x faster than the MDBX-only
baseline (11.4ms) — the static file architecture pays off when batching.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On startup, truncate static files to match the MDBX-committed state.
After a crash, static files may contain orphaned data beyond what MDBX
committed (since static file appends happen before MDBX commit). This
orphaned tail data would cause subsequent appends to write at wrong
offsets.

Recovery reads the last committed pointer from each MDBX table
(Headers, BlockStateUpdates, Transactions, Receipts, TxTraces) and
truncates the corresponding .dat file to offset+length. Fixed-size
columns (block_hashes, tx_hashes, tx_blocks) are truncated to the
committed entry count.

Called automatically in Db::new(), Db::open(), and Db::open_no_sync().
Skipped for Db::in_memory() (no crash to recover from).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document the complete static files storage design including:
- Architecture overview with MDBX as sole authority
- StaticFileRef enum (pointer vs inline) and when each is used
- Why specific tables stay in MDBX (too small, random-key, mutable)
- Write path for sequential, non-sequential, and batch modes
- Read path with MDBX-gated access pattern
- Crash recovery: why it's needed, how orphaned data occurs, what
  the recovery process does, and what it does NOT handle
- FileStore I/O strategy (mmap reads, buffered pwrite, no per-write fsync)
- Concurrency model (pread lock-free, mmap RwLock, write Mutex)
- All assumptions made by the implementation
- Directory layout

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add comprehensive doc comments to all public types, traits, methods, and
key invariants in the static files storage layer:

- StaticStore trait: append-only contract, error behavior, sync/remap semantics
- FileStore: invariants for cached_len, mmap_len, write buffer relationship
- FixedColumn/DataColumn: constructor contracts, sync/remap/reserve docs
- StaticFiles segments: corrected field docs (BlockBodyIndices stored in MDBX,
  not via pointer), sequential vs fork mode explanation
- StaticFileRef enum: role description, MDBX authority gate invariant
- Db struct: combined MDBX + static files architecture, recovery model
- insert_block_data: sequential vs fork mode documentation
- resolve_static_ref: inline fallback explanation
- Fixed Db::open/open_ro to use /// doc comments instead of //
- Removed unused `path` field from FileStore and unused import in recovery

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uration

Replace scattered Db constructors with a unified builder pattern:

- `FileStoreConfig`: configurable write_buffer_size, flush_threshold,
  and use_mmap (previously hardcoded constants)
- `StaticFilesBuilder`: fluent API for static file configuration with
  `.file(path)`, `.memory()`, `.write_buffer_size()`, `.flush_threshold()`,
  `.no_mmap()`, and `.build()`
- `DbBuilder`: unified builder combining MDBX and static file config with
  `.write()`, `.sync()`, `.max_size()`, `.static_files(|sf| ...)`,
  `.in_memory()`, `.build(path)`, and `.build_ephemeral()`

Db::new() and Db::in_memory() now delegate to DbBuilder. Existing
constructors (open, open_ro, open_no_sync) preserved for backward
compatibility.

Example usage:
  // Production with custom buffers
  let db = DbBuilder::new()
      .write()
      .static_files(|sf| sf.write_buffer_size(1 << 20))
      .build(path)?;

  // Tests
  let db = DbBuilder::new().in_memory().build_ephemeral()?;

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uilder directly

Replace the closure-based `static_files(|sf| sf.write_buffer_size(...))` with
`static_files(builder)` that accepts a pre-configured StaticFilesBuilder.

This allows the static files builder to be configured separately and passed in:

  let sf = StaticFilesBuilder::new().write_buffer_size(1 << 20);
  let db = DbBuilder::new().write().static_files(sf).build(path)?;

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kariy kariy marked this pull request as draft March 24, 2026 18:47