feat(db): static file storage for immutable block/transaction data#506

Draft
kariy wants to merge 12 commits into `main` from `feat/database-optimization`

Conversation


@kariy kariy commented Mar 24, 2026

Summary

This PR introduces a static file storage layer for immutable, append-only block and transaction data. Heavy values (headers, transactions, receipts, execution traces, state updates) are moved from MDBX B-trees to sequential flat .dat files, while MDBX retains the role of authoritative index — storing offset pointers and all mutable/random-access data.

Motivation

Katana's MDBX database stores 37 tables, but several contain data that is written once and never modified — headers, transactions, receipts, traces, state updates. Storing these large values in MDBX B-trees causes:

  • Write amplification: B-tree page splits and copy-on-write overhead for what is fundamentally sequential, append-only data
  • MDBX file bloat: Large values (transactions ~200B–10KB, traces ~10–100KB) inflate the MDBX file, reducing OS page cache effectiveness for the hot mutable state tables

Flat file appends eliminate these costs while preserving ACID consistency through MDBX-gated reads.

Architecture

Old design: everything in MDBX

```
Read:   MDBX get(key) → decompress → value
Write:  MDBX put(key, compress(value)) → B-tree insert → page splits
```

New design: MDBX pointers + static file data

```
Read:   MDBX get(key) → StaticFileRef { offset, length }
        → mmap/pread from .dat file → decompress → value

Write:  append(compress(value)) → .dat file → (offset, length)
        → MDBX put(key, StaticFileRef::pointer(offset, length))
```

MDBX is the single source of truth for what data exists. A reader's MDBX transaction snapshot determines exactly which static file data is visible. Static files have no independent index — they are raw data blobs addressed solely by MDBX pointers.

What moved to static files

| Static File | Contents | Size per entry |
|---|---|---|
| `headers.dat` | Compressed block headers | ~300–500B |
| `block_state_updates.dat` | Compressed state updates | ~1–100KB |
| `transactions.dat` | Compressed transactions | ~200B–10KB |
| `receipts.dat` | Compressed receipts | ~100–500B |
| `tx_traces.dat` | Compressed execution traces | ~10–100KB |
| `block_hashes.dat` | Block hashes (fixed 32B) | 32B |
| `tx_hashes.dat` | Tx hashes (fixed 32B) | 32B |
| `tx_blocks.dat` | Tx-to-block mapping (fixed 8B) | 8B |

What stays in MDBX

| Table | Reason |
|---|---|
| `Headers`, `BlockStateUpdates`, `Transactions`, `Receipts`, `TxTraces` | Now store `StaticFileRef` pointers (13 bytes) instead of the actual data |
| `BlockBodyIndices` | Too small (~10B) — pointer would be larger than the data |
| `BlockNumbers`, `TxNumbers` | Random-key reverse indexes (keyed by hash) |
| `BlockStatusses` | Mutable (finality status changes) |
| `BlockHashes`, `TxHashes`, `TxBlocks` | Also in static files; kept in MDBX as fallback for fork mode |
| All state/history/trie/class tables | Mutable or random-key access |

StaticFileRef — the MDBX-stored pointer

```rust
enum StaticFileRef {
    StaticFile { offset: u64, length: u32 }, // 13 bytes — points to .dat file
    Inline(Vec<u8>),                         // compressed data inline in MDBX
}
```
  • Sequential mode (production): data appended to .dat files, MDBX stores StaticFile pointers
  • Fork mode (non-sequential block numbers): data stored Inline in MDBX directly, since static file appends require sequential keys

Key Design Decisions

1. MDBX gates all static file reads

Static file reads are not independent — they always go through an MDBX transaction snapshot first. This ensures:

  • No stale reads (reader sees a consistent snapshot)
  • No phantom reads (if the pointer exists, the data is durable)
  • ACID consistency without a separate transaction coordinator

2. Offset pointers in MDBX, not in .idx files

Early iterations stored offsets in separate .idx files alongside .dat files. This was moved to MDBX because .idx files are outside the MDBX transaction boundary — a reader could see an offset written by an uncommitted transaction, breaking snapshot isolation.

3. BlockBodyIndices stays in MDBX

At ~10 bytes per entry, the 13-byte StaticFileRef pointer is larger than the data itself. Benchmarks showed a ~10% read regression with pointer indirection, which disappeared once the table was moved back to direct MDBX storage.

4. No fsync per block

Static files are not fsynced on every write. On crash, MDBX rolls back uncommitted pointers, and orphaned static file data is truncated on next startup. This matches MDBX's own durability model.

5. Fork mode uses inline storage

Fork providers insert blocks at non-sequential numbers. Since static files require sequential appends, fork mode stores compressed data inline in MDBX (StaticFileRef::Inline). The index tables (BlockHashes, TxHashes, TxBlocks) are also written to MDBX in fork mode as fallback.

6. Classes not moved

Contract classes (100KB–2MB JSON) are keyed by ClassHash (random), not sequential numbers. The crash recovery mechanism relies on cursor-last to find the truncation point, which doesn't work for random keys. Classes are also cached by the executor, so the extra I/O hop matters less. Considered for a future iteration.

Crash Recovery

Static file appends happen before the MDBX transaction commits. A crash between these steps leaves orphaned data:

```
t0: Append to headers.dat        ← on disk
t1: Append to transactions.dat   ← on disk
t2: MDBX put (pointers)          ← uncommitted
t3: --- CRASH ---
t4: MDBX rolls back              ← pointers gone
    Static files have orphaned tail data
```

On every Db::open(), recover_static_files() reads the last committed MDBX pointer for each table and truncates the corresponding .dat file to offset + length. Fixed-size files are truncated to count * record_size. This is idempotent and runs before any writes.

I/O Optimizations

| Optimization | Impact |
|---|---|
| pread/pwrite | Lock-free reads via `pread` (no file offset sharing); writes serialized with `pwrite` under a mutex |
| mmap for reads | Memory-mapped reads from the kernel page cache — zero syscall overhead; falls back to `pread` for data not yet remapped |
| Buffered writes | Small appends accumulate in a 64KB buffer, flushed as a single `pwrite` at 256KB; reduces syscalls from ~54/block to a few per batch |
| Two-phase batch insert | `insert_block_data_batch()`: phase 1 appends all data to static files (sequential I/O), phase 2 writes all MDBX pointers (grouped B-tree inserts); pre-sizes buffers based on batch dimensions |
| Conditional dual-write removal | `BlockHashes`, `TxHashes`, `TxBlocks` only written to MDBX in fork mode; sequential mode skips 3 MDBX puts per block |

Builder API

```rust
// Production — default settings
let db = DbBuilder::new().write().build("path/to/db")?;

// Custom static file tuning for heavy sync
let sf = StaticFilesBuilder::new()
    .write_buffer_size(1024 * 1024)
    .flush_threshold(4 * 1024 * 1024);
let db = DbBuilder::new().write().static_files(sf).build("path/to/db")?;

// Tests
let db = DbBuilder::new().in_memory().build_ephemeral()?;
```

Benchmark Results (file-backed)

Writes — per-block vs batch (new)

| Workload | per-block | batch | Speedup |
|---|---|---|---|
| 100 blocks × 1 tx | 12.0 ms | 2.9 ms | 4.2× faster |
| 100 blocks × 10 txs | 30.3 ms | 18.7 ms | 1.6× faster |
| 500 blocks × 100 txs | 1.12 s | 907 ms | 1.2× faster |

Writes — per-block vs MDBX-only baseline

| Workload | Baseline (MDBX only) | New (per-block) | Change |
|---|---|---|---|
| 100 blocks × 100 txs | 201 ms | 199 ms | ~0% |
| 500 blocks × 100 txs | 1.23 s | 1.10 s | -11% |

Reads — vs MDBX-only baseline (500 blocks × 100 txs)

| Operation | Baseline | New | Change |
|---|---|---|---|
| `latest_number` | 268 ns | 253 ns | -6% |
| `latest_hash` | 377 ns | 307 ns | -19% |
| `header_by_number` | 489 ns | 493 ns | ~0% |
| `block_body_indices` | 287 ns | 306 ns | +7% |
| `transaction_by_hash` | 1.90 µs | 1.93 µs | +2% |
| `receipt_by_hash` | 1.89 µs | 1.90 µs | ~0% |
| `full_block` (100 txs) | 164.5 µs | 155.0 µs | -6% |

Reads are at parity or faster. The write advantage grows with data volume as MDBX B-tree page splits become more expensive while flat file appends stay O(1).

Advantages

  • Reduced MDBX write amplification for large immutable values
  • Smaller MDBX file → better OS page cache for mutable state tables
  • Faster batch writes via two-phase insertion (4.2× for small blocks)
  • Sequential disk layout for block/tx data → better read locality
  • ACID consistency preserved through MDBX-gated reads
  • Configurable via DbBuilder / StaticFilesBuilder

Disadvantages

  • Added complexity — two storage backends instead of one
  • Per-block writes are ~15–25% slower for small blocks due to dual-write overhead (mitigated by the batch API)
  • BlockBodyIndices reads +7% — pointer indirection overhead for very small values (kept in MDBX to minimize)
  • Fork mode stores data inline — no static file benefit for forked providers
  • Crash recovery required — orphaned static file data must be truncated on startup
  • Not applicable to random-key tables (classes, contract storage) — only sequential append-only data benefits

kariy and others added 12 commits March 20, 2026 19:36
Introduce a static file storage layer for append-only, immutable block
and transaction data. This moves heavy data (headers, transactions,
receipts, traces, state updates) and sequential indexes (block hashes,
tx hashes, body indices, tx blocks) to flat files, reducing MDBX write
amplification for data that is never modified after insertion.

Key changes:
- New `static_files` module in katana-db with generic `StaticStore` trait
  (`FileStore` for production, `MemoryStore` for tests)
- `FixedColumn` (O(1) lookup) and `IndexedColumn` (variable-size with
  .dat/.idx) abstractions for static data
- `StaticFiles` container with typed read/write APIs using existing
  Compress/Decompress codecs
- `Db` struct extended with `Arc<StaticFiles<AnyStore>>`
- `DbProvider` updated to dual-write (static files + MDBX) with
  static-file-first reads and MDBX fallback
- Sequential detection: only writes to static files when block numbers
  are sequential (production mode), falls back to MDBX-only for fork mode
- Crash recovery via manifest-based truncation on startup
- DB version bumped to 10

This is the initial scaffolding. A follow-up will move offset pointers
into MDBX tables to ensure static file reads are gated by MDBX
transaction snapshots for full ACID consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move offset pointers (.idx) into MDBX tables so all static file reads
are gated by the MDBX transaction snapshot, ensuring ACID consistency.

Key design changes:
- New `StaticFileRef` enum stored as MDBX table value: `StaticFile {
  offset, length }` for production sequential writes, or `Inline(bytes)`
  for fork mode where static files aren't used
- MDBX tables Headers, BlockStateUpdates, BlockBodyIndices, Transactions,
  Receipts, TxTraces now store `StaticFileRef` pointers instead of
  the actual data
- `IndexedColumn` replaced by `DataColumn` (no .idx files) — the caller
  provides offset+length from MDBX when reading
- Fixed-size index tables (BlockHashes, TxHashes, TxBlocks) kept in both
  MDBX and static files — MDBX serves as fallback for fork mode
- Write path: append to static files → fsync → write pointers to MDBX →
  MDBX commit makes everything atomically visible
- Read path: read pointer from MDBX snapshot → fetch data from static
  file (or decompress inline) — no data visible until MDBX commits
- Manifest removed as authority — MDBX is the single source of truth
- Crash recovery: MDBX state determines what exists, orphaned static
  file data is harmless (truncated on next startup)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Optimize the static file storage layer based on benchmark results:

1. **pread/pwrite**: Replace seek+read/write with pread/pwrite syscalls.
   Reads no longer need the mutex, writes use pwrite under a lightweight
   lock. Cached file length in AtomicU64 avoids fstat syscalls.

2. **mmap for reads**: Memory-map static files for zero-copy reads from
   the kernel page cache. Reads within the mapped region skip syscalls
   entirely. Falls back to pread for data written after the last remap.
   Remap is called on commit() to make new data visible.

3. **Remove redundant dual-writes**: BlockHashes, TxHashes, TxBlocks are
   only written to MDBX in fork (non-sequential) mode. In sequential
   mode, they exist only in static files, saving 3 MDBX puts per block.

4. **Move fsync out of insert_block_data**: Fsync no longer happens per
   block insert. Static files rely on MDBX's durability model — on crash,
   orphaned data is truncated to match MDBX state on next startup.

5. **Add file-backed benchmark**: New criterion benchmark measuring write
   and read performance with real disk I/O using tempdir-backed databases.

Benchmark results (file-backed, vs MDBX-only baseline):
- Reads: at parity or faster (latest_hash -19%, full_block -9%)
- Writes: +21-27% overhead from dual-store architecture

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Buffer small pwrite calls in memory and flush as a single pwrite when
the buffer exceeds 256KB or on remap/sync. This reduces the number of
syscalls per block insert from ~54 (for 10 txs/block) to a handful.

Reads check the write buffer for recently-appended data that hasn't
been flushed yet, so no data is lost between writes and reads.

Benchmark improvement (file-backed, vs previous commit):
- Writes: -11% to -20% faster (500×10tx now -2% vs MDBX baseline)
- Reads: unchanged (mmap path unaffected)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document benchmark results for each optimization step, comparing
file-backed static file storage against the MDBX-only baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ndirection)

BlockBodyIndices is ~5-10 bytes (two postcard varints). Storing a 13-byte
StaticFileRef pointer in MDBX to reference it adds overhead — the pointer
is larger than the data. Direct MDBX storage avoids the extra indirection.

block_body_indices read: 325ns → 306ns (-6%)
Writes also improved slightly (one fewer static file append per block).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `insert_block_data_batch()` method optimized for the sync pipeline:

Phase 1: Append ALL block/tx data to static files (sequential I/O),
collecting the resulting pointers in memory. Write buffers are
pre-sized based on the batch dimensions to avoid reallocations.

Phase 2: Write ALL MDBX entries (pointers + indexes) in one pass.
This groups B-tree inserts together for better cache locality.

The batch method uses a single MDBX transaction for the entire chunk
instead of per-block transactions, eliminating per-block tx overhead.

Benchmark results (file-backed):
- 100 blocks x 1 tx:   per_block 12.0ms → batch 2.9ms  (4.2x faster)
- 100 blocks x 10 txs: per_block 30.3ms → batch 18.7ms (1.6x faster)
- 100 blocks x 50 txs: per_block 105ms  → batch 89.7ms (1.2x faster)
- 500 blocks x 100 txs: per_block 1.12s → batch 907ms  (1.2x faster)

The batch method at 100x1tx (2.9ms) is 3.9x faster than the MDBX-only
baseline (11.4ms) — the static file architecture pays off when batching.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On startup, truncate static files to match the MDBX-committed state.
After a crash, static files may contain orphaned data beyond what MDBX
committed (since static file appends happen before MDBX commit). This
orphaned tail data would cause subsequent appends to write at wrong
offsets.

Recovery reads the last committed pointer from each MDBX table
(Headers, BlockStateUpdates, Transactions, Receipts, TxTraces) and
truncates the corresponding .dat file to offset+length. Fixed-size
columns (block_hashes, tx_hashes, tx_blocks) are truncated to the
committed entry count.

Called automatically in Db::new(), Db::open(), and Db::open_no_sync().
Skipped for Db::in_memory() (no crash to recover from).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document the complete static files storage design including:
- Architecture overview with MDBX as sole authority
- StaticFileRef enum (pointer vs inline) and when each is used
- Why specific tables stay in MDBX (too small, random-key, mutable)
- Write path for sequential, non-sequential, and batch modes
- Read path with MDBX-gated access pattern
- Crash recovery: why it's needed, how orphaned data occurs, what
  the recovery process does, and what it does NOT handle
- FileStore I/O strategy (mmap reads, buffered pwrite, no per-write fsync)
- Concurrency model (pread lock-free, mmap RwLock, write Mutex)
- All assumptions made by the implementation
- Directory layout

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add comprehensive doc comments to all public types, traits, methods, and
key invariants in the static files storage layer:

- StaticStore trait: append-only contract, error behavior, sync/remap semantics
- FileStore: invariants for cached_len, mmap_len, write buffer relationship
- FixedColumn/DataColumn: constructor contracts, sync/remap/reserve docs
- StaticFiles segments: corrected field docs (BlockBodyIndices stored in MDBX,
  not via pointer), sequential vs fork mode explanation
- StaticFileRef enum: role description, MDBX authority gate invariant
- Db struct: combined MDBX + static files architecture, recovery model
- insert_block_data: sequential vs fork mode documentation
- resolve_static_ref: inline fallback explanation
- Fixed Db::open/open_ro to use /// doc comments instead of //
- Removed unused `path` field from FileStore and unused import in recovery

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uration

Replace scattered Db constructors with a unified builder pattern:

- `FileStoreConfig`: configurable write_buffer_size, flush_threshold,
  and use_mmap (previously hardcoded constants)
- `StaticFilesBuilder`: fluent API for static file configuration with
  `.file(path)`, `.memory()`, `.write_buffer_size()`, `.flush_threshold()`,
  `.no_mmap()`, and `.build()`
- `DbBuilder`: unified builder combining MDBX and static file config with
  `.write()`, `.sync()`, `.max_size()`, `.static_files(|sf| ...)`,
  `.in_memory()`, `.build(path)`, and `.build_ephemeral()`

Db::new() and Db::in_memory() now delegate to DbBuilder. Existing
constructors (open, open_ro, open_no_sync) preserved for backward
compatibility.

Example usage:
  // Production with custom buffers
  let db = DbBuilder::new()
      .write()
      .static_files(|sf| sf.write_buffer_size(1 << 20))
      .build(path)?;

  // Tests
  let db = DbBuilder::new().in_memory().build_ephemeral()?;

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uilder directly

Replace the closure-based `static_files(|sf| sf.write_buffer_size(...))` with
`static_files(builder)` that accepts a pre-configured StaticFilesBuilder.

This allows the static files builder to be configured separately and passed in:

  let sf = StaticFilesBuilder::new().write_buffer_size(1 << 20);
  let db = DbBuilder::new().write().static_files(sf).build(path)?;

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kariy kariy marked this pull request as draft March 24, 2026 18:47