feat(db): static file storage for immutable block/transaction data#506
Draft
Conversation
Introduce a static file storage layer for append-only, immutable block and transaction data. This moves heavy data (headers, transactions, receipts, traces, state updates) and sequential indexes (block hashes, tx hashes, body indices, tx blocks) to flat files, reducing MDBX write amplification for data that is never modified after insertion.

Key changes:
- New `static_files` module in katana-db with a generic `StaticStore` trait (`FileStore` for production, `MemoryStore` for tests)
- `FixedColumn` (O(1) lookup) and `IndexedColumn` (variable-size with .dat/.idx) abstractions for static data
- `StaticFiles` container with typed read/write APIs using the existing Compress/Decompress codecs
- `Db` struct extended with `Arc<StaticFiles<AnyStore>>`
- `DbProvider` updated to dual-write (static files + MDBX) with static-file-first reads and MDBX fallback
- Sequential detection: writes to static files only when block numbers are sequential (production mode); falls back to MDBX-only for fork mode
- Crash recovery via manifest-based truncation on startup
- DB version bumped to 10

This is the initial scaffolding. A follow-up will move offset pointers into MDBX tables to ensure static file reads are gated by MDBX transaction snapshots for full ACID consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
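The `FixedColumn` idea above can be sketched in a few lines: because every record has the same width, lookup is a single offset computation with no index file. This is a hedged in-memory sketch; `FixedColumn` here and its methods are illustrative stand-ins, not the PR's actual API.

```rust
// Hypothetical in-memory stand-in for a fixed-width static column.
struct FixedColumn {
    data: Vec<u8>,     // stands in for the backing .dat file
    record_size: usize,
}

impl FixedColumn {
    fn new(record_size: usize) -> Self {
        Self { data: Vec::new(), record_size }
    }

    /// Append one record; its length must match the fixed width.
    fn append(&mut self, record: &[u8]) {
        assert_eq!(record.len(), self.record_size);
        self.data.extend_from_slice(record);
    }

    /// O(1) lookup: offset = index * record_size.
    fn get(&self, index: u64) -> Option<&[u8]> {
        let start = index as usize * self.record_size;
        self.data.get(start..start + self.record_size)
    }
}
```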
Move offset pointers (.idx) into MDBX tables so all static file reads
are gated by the MDBX transaction snapshot, ensuring ACID consistency.
Key design changes:
- New `StaticFileRef` enum stored as MDBX table value: `StaticFile {
offset, length }` for production sequential writes, or `Inline(bytes)`
for fork mode where static files aren't used
- MDBX tables Headers, BlockStateUpdates, BlockBodyIndices, Transactions,
Receipts, TxTraces now store `StaticFileRef` pointers instead of
the actual data
- `IndexedColumn` replaced by `DataColumn` (no .idx files) — the caller
provides offset+length from MDBX when reading
- Fixed-size index tables (BlockHashes, TxHashes, TxBlocks) kept in both
MDBX and static files — MDBX serves as fallback for fork mode
- Write path: append to static files → fsync → write pointers to MDBX →
MDBX commit makes everything atomically visible
- Read path: read pointer from MDBX snapshot → fetch data from static
file (or decompress inline) — no data visible until MDBX commits
- Manifest removed as authority — MDBX is the single source of truth
- Crash recovery: MDBX state determines what exists, orphaned static
file data is harmless (truncated on next startup)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
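The `StaticFileRef` value described above might look like the following sketch. The enum shape (offset/length pointer vs inline bytes) follows the commit message; `resolve` and its `dat` argument are illustrative stand-ins for the real read path, not the PR's code.

```rust
// Hedged sketch of the MDBX-stored value described in the commit message.
#[derive(Debug, Clone, PartialEq)]
enum StaticFileRef {
    /// Sequential (production) mode: the data lives in a .dat static file.
    StaticFile { offset: u64, length: u32 },
    /// Fork mode: compressed bytes stored inline in MDBX.
    Inline(Vec<u8>),
}

/// Dereference a pointer read from an MDBX snapshot; `dat` stands in for
/// the static file contents.
fn resolve(r: &StaticFileRef, dat: &[u8]) -> Vec<u8> {
    match r {
        StaticFileRef::StaticFile { offset, length } => {
            let start = *offset as usize;
            dat[start..start + *length as usize].to_vec()
        }
        StaticFileRef::Inline(bytes) => bytes.clone(),
    }
}
```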
Optimize the static file storage layer based on benchmark results:

1. **pread/pwrite**: Replace seek+read/write with pread/pwrite syscalls. Reads no longer need the mutex; writes use pwrite under a lightweight lock. The file length cached in an AtomicU64 avoids fstat syscalls.
2. **mmap for reads**: Memory-map static files for zero-copy reads from the kernel page cache. Reads within the mapped region skip syscalls entirely. Falls back to pread for data written after the last remap. Remap is called on commit() to make new data visible.
3. **Remove redundant dual-writes**: BlockHashes, TxHashes, TxBlocks are only written to MDBX in fork (non-sequential) mode. In sequential mode they exist only in static files, saving 3 MDBX puts per block.
4. **Move fsync out of insert_block_data**: Fsync no longer happens per block insert. Static files rely on MDBX's durability model — on crash, orphaned data is truncated to match MDBX state on next startup.
5. **Add file-backed benchmark**: New criterion benchmark measuring write and read performance with real disk I/O using tempdir-backed databases.

Benchmark results (file-backed, vs MDBX-only baseline):
- Reads: at parity or faster (latest_hash -19%, full_block -9%)
- Writes: +21-27% overhead from the dual-store architecture

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Buffer small pwrite calls in memory and flush as a single pwrite when the buffer exceeds 256KB or on remap/sync. This reduces the number of syscalls per block insert from ~54 (for 10 txs/block) to a handful. Reads check the write buffer for recently-appended data that hasn't been flushed yet, so no data is lost between writes and reads.

Benchmark improvement (file-backed, vs previous commit):
- Writes: -11% to -20% faster (500×10tx now -2% vs MDBX baseline)
- Reads: unchanged (mmap path unaffected)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
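The buffering scheme above can be illustrated with an in-memory sketch: appends accumulate until a threshold, and reads consult the unflushed tail first. `WriteBuffer` and its fields are hypothetical stand-ins for the real pwrite-backed implementation, and the sketch assumes a read falls entirely within either the flushed file or the pending buffer.

```rust
// Illustrative stand-in for the buffered-append scheme; `file` plays the
// role of the on-disk .dat file, `buf` holds not-yet-flushed bytes.
const FLUSH_THRESHOLD: usize = 256 * 1024;

struct WriteBuffer {
    file: Vec<u8>,
    buf: Vec<u8>,
}

impl WriteBuffer {
    fn new() -> Self {
        Self { file: Vec::new(), buf: Vec::new() }
    }

    /// Append bytes, flushing with a single write once the threshold is hit.
    fn append(&mut self, data: &[u8]) {
        self.buf.extend_from_slice(data);
        if self.buf.len() >= FLUSH_THRESHOLD {
            self.flush();
        }
    }

    /// One pwrite syscall in the real code.
    fn flush(&mut self) {
        self.file.extend_from_slice(&self.buf);
        self.buf.clear();
    }

    /// Reads check the unflushed buffer for recently appended data, so no
    /// data is lost between writes and reads.
    fn read(&self, offset: usize, len: usize) -> Vec<u8> {
        let flushed = self.file.len();
        if offset + len <= flushed {
            self.file[offset..offset + len].to_vec()
        } else {
            self.buf[offset - flushed..offset - flushed + len].to_vec()
        }
    }
}
```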
Document benchmark results for each optimization step, comparing file-backed static file storage against the MDBX-only baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ndirection)

BlockBodyIndices is ~5-10 bytes (two postcard varints). Storing a 13-byte StaticFileRef pointer in MDBX to reference it adds overhead — the pointer is larger than the data. Direct MDBX storage avoids the extra indirection.

block_body_indices read: 325ns → 306ns (-6%)

Writes also improved slightly (one fewer static file append per block).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `insert_block_data_batch()` method optimized for the sync pipeline:

Phase 1: Append ALL block/tx data to static files (sequential I/O), collecting the resulting pointers in memory. Write buffers are pre-sized based on the batch dimensions to avoid reallocations.

Phase 2: Write ALL MDBX entries (pointers + indexes) in one pass. This groups B-tree inserts together for better cache locality.

The batch method uses a single MDBX transaction for the entire chunk instead of per-block transactions, eliminating per-block transaction overhead.

Benchmark results (file-backed):
- 100 blocks × 1 tx: per_block 12.0ms → batch 2.9ms (4.2x faster)
- 100 blocks × 10 txs: per_block 30.3ms → batch 18.7ms (1.6x faster)
- 100 blocks × 50 txs: per_block 105ms → batch 89.7ms (1.2x faster)
- 500 blocks × 100 txs: per_block 1.12s → batch 907ms (1.2x faster)

The batch method at 100×1tx (2.9ms) is 3.9x faster than the MDBX-only baseline (11.4ms) — the static file architecture pays off when batching.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
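A minimal sketch of the two-phase idea, with the `.dat` file modeled as a `Vec<u8>`. All names here (`Pointer`, `batch_insert`) are illustrative, and phase 2 is indicated only by a comment:

```rust
// Hypothetical pointer type: where one payload landed in the flat file.
struct Pointer {
    offset: u64,
    length: u32,
}

fn batch_insert(dat: &mut Vec<u8>, payloads: &[&[u8]]) -> Vec<Pointer> {
    // Phase 1: sequential appends, collecting pointers in memory.
    let mut pointers = Vec::with_capacity(payloads.len());
    for p in payloads {
        pointers.push(Pointer {
            offset: dat.len() as u64,
            length: p.len() as u32,
        });
        dat.extend_from_slice(p);
    }
    // Phase 2 (not shown): write all pointers to MDBX in one transaction,
    // grouping the B-tree inserts for cache locality.
    pointers
}
```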
On startup, truncate static files to match the MDBX-committed state.

After a crash, static files may contain orphaned data beyond what MDBX committed (since static file appends happen before MDBX commit). This orphaned tail data would cause subsequent appends to write at wrong offsets.

Recovery reads the last committed pointer from each MDBX table (Headers, BlockStateUpdates, Transactions, Receipts, TxTraces) and truncates the corresponding .dat file to offset+length. Fixed-size columns (block_hashes, tx_hashes, tx_blocks) are truncated to the committed entry count.

Called automatically in Db::new(), Db::open(), and Db::open_no_sync(). Skipped for Db::in_memory() (no crash to recover from).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
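The truncation rule reduces to a one-line computation: the file must end exactly at `offset + length` of the last committed record, and any orphaned tail beyond that is cut off. A hedged sketch with the file modeled in memory (all names illustrative):

```rust
// Where the .dat file must end, given the last committed MDBX pointer
// as an (offset, length) pair, or None when nothing was committed.
fn committed_len(last_committed: Option<(u64, u64)>) -> u64 {
    match last_committed {
        Some((offset, length)) => offset + length,
        None => 0, // no committed data: the file should be empty
    }
}

// Drop any orphaned tail left by a crash between the static file append
// and the MDBX commit.
fn truncate_to_committed(file: &mut Vec<u8>, last_committed: Option<(u64, u64)>) {
    file.truncate(committed_len(last_committed) as usize);
}
```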
Document the complete static files storage design, including:
- Architecture overview with MDBX as sole authority
- StaticFileRef enum (pointer vs inline) and when each is used
- Why specific tables stay in MDBX (too small, random-key, mutable)
- Write path for sequential, non-sequential, and batch modes
- Read path with MDBX-gated access pattern
- Crash recovery: why it's needed, how orphaned data occurs, what the recovery process does, and what it does NOT handle
- FileStore I/O strategy (mmap reads, buffered pwrite, no per-write fsync)
- Concurrency model (pread lock-free, mmap RwLock, write Mutex)
- All assumptions made by the implementation
- Directory layout

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add comprehensive doc comments to all public types, traits, methods, and key invariants in the static files storage layer:
- StaticStore trait: append-only contract, error behavior, sync/remap semantics
- FileStore: invariants for cached_len, mmap_len, write buffer relationship
- FixedColumn/DataColumn: constructor contracts, sync/remap/reserve docs
- StaticFiles segments: corrected field docs (BlockBodyIndices stored in MDBX, not via pointer), sequential vs fork mode explanation
- StaticFileRef enum: role description, MDBX authority gate invariant
- Db struct: combined MDBX + static files architecture, recovery model
- insert_block_data: sequential vs fork mode documentation
- resolve_static_ref: inline fallback explanation
- Fixed Db::open/open_ro to use /// doc comments instead of //
- Removed unused `path` field from FileStore and unused import in recovery

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uration
Replace scattered Db constructors with a unified builder pattern:
- `FileStoreConfig`: configurable write_buffer_size, flush_threshold,
and use_mmap (previously hardcoded constants)
- `StaticFilesBuilder`: fluent API for static file configuration with
`.file(path)`, `.memory()`, `.write_buffer_size()`, `.flush_threshold()`,
`.no_mmap()`, and `.build()`
- `DbBuilder`: unified builder combining MDBX and static file config with
`.write()`, `.sync()`, `.max_size()`, `.static_files(|sf| ...)`,
`.in_memory()`, `.build(path)`, and `.build_ephemeral()`
Db::new() and Db::in_memory() now delegate to DbBuilder. Existing
constructors (open, open_ro, open_no_sync) preserved for backward
compatibility.
Example usage:
```rust
// Production with custom buffers
let db = DbBuilder::new()
    .write()
    .static_files(|sf| sf.write_buffer_size(1 << 20))
    .build(path)?;

// Tests
let db = DbBuilder::new().in_memory().build_ephemeral()?;
```
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uilder directly

Replace the closure-based `static_files(|sf| sf.write_buffer_size(...))` with `static_files(builder)`, which accepts a pre-configured StaticFilesBuilder. This allows the static files builder to be configured separately and passed in:

```rust
let sf = StaticFilesBuilder::new().write_buffer_size(1 << 20);
let db = DbBuilder::new().write().static_files(sf).build(path)?;
```

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
This PR introduces a static file storage layer for immutable, append-only block and transaction data. Heavy values (headers, transactions, receipts, execution traces, state updates) are moved from MDBX B-trees to sequential flat `.dat` files, while MDBX retains the role of authoritative index — storing offset pointers and all mutable/random-access data.

Motivation
Katana's MDBX database stores 37 tables, but several contain data that is written once and never modified — headers, transactions, receipts, traces, state updates. Storing these large values in MDBX B-trees causes write amplification and B-tree page-split overhead.
Flat file appends eliminate these costs while preserving ACID consistency through MDBX-gated reads.
Architecture
Old design: everything in MDBX
New design: MDBX pointers + static file data
MDBX is the single source of truth for what data exists. A reader's MDBX transaction snapshot determines exactly which static file data is visible. Static files have no independent index — they are raw data blobs addressed solely by MDBX pointers.
What moved to static files
- `headers.dat`
- `block_state_updates.dat`
- `transactions.dat`
- `receipts.dat`
- `tx_traces.dat`
- `block_hashes.dat`
- `tx_hashes.dat`
- `tx_blocks.dat`

What stays in MDBX

- `Headers`, `BlockStateUpdates`, `Transactions`, `Receipts`, `TxTraces` — now store `StaticFileRef` pointers (13 bytes) instead of the actual data
- `BlockBodyIndices`
- `BlockNumbers`, `TxNumbers`
- `BlockStatusses`
- `BlockHashes`, `TxHashes`, `TxBlocks`

`StaticFileRef` is the MDBX-stored pointer: for data in `.dat` files, MDBX stores `StaticFile` pointers; fork mode stores `Inline` in MDBX directly, since static file appends require sequential keys.

Key Design Decisions
1. MDBX gates all static file reads
Static file reads are not independent — they always go through an MDBX transaction snapshot first. This ensures snapshot isolation: no static file data becomes visible until the MDBX transaction that wrote its pointer has committed.
2. Offset pointers in MDBX, not in `.idx` files

Early iterations stored offsets in separate `.idx` files alongside `.dat` files. This was moved to MDBX because `.idx` files are outside the MDBX transaction boundary — a reader could see an offset written by an uncommitted transaction, breaking snapshot isolation.

3. `BlockBodyIndices` stays in MDBX

At ~10 bytes per entry, the 13-byte `StaticFileRef` pointer is larger than the data itself. Benchmarks confirmed a +10% read regression with pointer indirection, which disappeared when moved back to direct MDBX storage.

4. No fsync per block
Static files are not fsynced on every write. On crash, MDBX rolls back uncommitted pointers, and orphaned static file data is truncated on next startup. This matches MDBX's own durability model.
5. Fork mode uses inline storage
Fork providers insert blocks at non-sequential numbers. Since static files require sequential appends, fork mode stores compressed data inline in MDBX (`StaticFileRef::Inline`). The index tables (`BlockHashes`, `TxHashes`, `TxBlocks`) are also written to MDBX in fork mode as a fallback.

6. Classes not moved

Contract classes (100KB–2MB JSON) are keyed by `ClassHash` (random), not sequential numbers. The crash recovery mechanism relies on cursor-last to find the truncation point, which doesn't work for random keys. Classes are also cached by the executor, so the extra I/O hop matters less. Considered for a future iteration.

Crash Recovery
Static file appends happen before the MDBX transaction commits. A crash between these steps leaves orphaned data:
On every `Db::open()`, `recover_static_files()` reads the last committed MDBX pointer for each table and truncates the corresponding `.dat` file to `offset + length`. Fixed-size files are truncated to `count * record_size`. This is idempotent and runs before any writes.

I/O Optimizations
- Reads use `pread` (no file offset sharing). Writes are serialized with `pwrite` under a mutex.
- Reads within the mmap region are zero-copy; the store falls back to `pread` for data not yet remapped.
- Small writes are buffered and flushed as a single `pwrite` at 256KB. Reduces syscalls from ~54/block to a few per batch.
- `insert_block_data_batch()`: Phase 1 appends all data to static files (sequential I/O), Phase 2 writes all MDBX pointers (grouped B-tree inserts). Pre-sizes buffers based on batch dimensions.
- `BlockHashes`, `TxHashes`, `TxBlocks` are only written to MDBX in fork mode. Sequential mode skips 3 MDBX puts per block.
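The mmap-with-fallback read path can be sketched under the assumption that the mapped region covers only data up to the last remap; names and the signature are illustrative, with the file contents modeled as a slice:

```rust
// Serve a read from the mapped region when possible; otherwise fall back
// to the file (a pread syscall in the real code) for bytes appended after
// the last remap.
fn read_at(mmap: &[u8], file: &[u8], offset: usize, len: usize) -> Vec<u8> {
    if offset + len <= mmap.len() {
        // Zero-copy region: within what was mapped at the last remap,
        // served from the kernel page cache with no syscall.
        mmap[offset..offset + len].to_vec()
    } else {
        // Data written after the last remap: fall back to the full file.
        file[offset..offset + len].to_vec()
    }
}
```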
Benchmark Results (file-backed)
Writes — per-block vs batch (new)
Writes — per-block vs MDBX-only baseline
Reads — vs MDBX-only baseline (500 blocks × 100 txs)
Reads are at parity or faster. The write advantage grows with data volume as MDBX B-tree page splits become more expensive while flat file appends stay O(1).
Advantages
- Configurable via `DbBuilder` / `StaticFilesBuilder`

Disadvantages
- `BlockBodyIndices` reads +7% — pointer indirection overhead for very small values (kept in MDBX to minimize it)