Skip to content

Latest commit

 

History

History
1280 lines (1022 loc) · 94 KB

File metadata and controls

1280 lines (1022 loc) · 94 KB

[WIP] SurrealKV Architecture

SurrealKV is a versioned, embedded key-value database built on an LSM (Log-Structured Merge) tree architecture. It is designed specifically for use within SurrealDB, with the goal of reducing dependency on external storage engine (RocksDB). This approach allows the storage layer to evolve in alignment with SurrealDB’s requirements and access patterns.

This document provides a comprehensive overview of the internal design, data flow, and key subsystems.

Table of Contents

  1. Overall Architecture
  2. Write Path
  3. Read Path
  4. Versioning and MVCC
  5. Compaction
  6. Value Log (VLog)
  7. VLog Garbage Collection
  8. Recovery
  9. Checkpoint and Restore

Overall Architecture

SurrealKV is organized into four main layers: the Client API, Core Components, Storage Layer, and Background Tasks.

Data Flow

Data flows through the layers as follows:

  1. Writes: Client -> Transaction (buffer) -> CommitPipeline -> WAL + MemTable -> Background Flush -> SSTables
  2. Reads: Client -> Transaction -> Snapshot -> MemTable(s) -> SSTables -> (VLog if pointer)
  3. Compaction: Background task merges SSTables, respecting snapshot visibility, updating VLog discard stats
┌─────────────────────────────────────────────────────────────────────────────┐
│                              CLIENT LAYER                                   │
│  ┌─────────────┐      ┌─────────────────────────────────────────────────┐   │
│  │    Tree     │──────│                  Transaction                    │   │
│  │    API      │      │  ┌─────────┐  ┌─────────┐  ┌─────────────────┐  │   │
│  └─────────────┘      │  │write_set│  │ get()   │  │ commit()        │  │   │
│                       │  │ buffer  │  │ set()   │  │ rollback()      │  │   │
│                       │  └─────────┘  └─────────┘  └─────────────────┘  │   │
│                       └─────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                              CORE COMPONENTS                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │
│  │   Oracle     │  │   Commit     │  │   Active     │  │  Immutable   │    │
│  │  (conflict   │  │   Pipeline   │  │  MemTable    │  │  MemTables   │    │
│  │  detection)  │  │ (lock-free)  │  │  (skiplist)  │  │   (queue)    │    │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘    │
│         │                 │                 │                 │            │
│         └────────────────-┼─────────────────┼─────────────────┘             │
│                          ▼                 ▼                               │
│                    ┌──────────────────────────────────────┐                │
│                    │           Snapshot                   │                │
│                    │   (point-in-time consistent view)    │                │
│                    └──────────────────────────────────────┘                │
└────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                              STORAGE LAYER                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │
│  │     WAL      │  │   SSTables   │  │  Value Log   │  │   Manifest   │    │
│  │  (durability)│  │   (L0..Ln)   │  │  (WiscKey)   │  │   (levels)   │    │
│  │  32KB blocks │  │  sorted runs │  │ large values │  │  table meta  │    │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘    │
└────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                            BACKGROUND TASKS                                │
│  ┌──────────────────┐  ┌───────────────────┐  ┌──────────────────┐         │
│  │  MemTable Flush  │  │ Level Compaction  │  │     VLog GC      │         │
│  │  (imm -> L0)     │  │ (merge SSTables)  │  │ (reclaim space)  │         │
│  └──────────────────┘  └───────────────────┘  └──────────────────┘         │
└────────────────────────────────────────────────────────────────────────────┘

Layer Details

Client Layer: The Tree struct is the main entry point, providing methods to begin transactions, create checkpoints, and manage the database lifecycle. Each Transaction maintains its own write buffer (write_set) for staging changes before commit, enabling atomic multi-key updates and rollback support.

Core Components: These in-memory structures coordinate between the client API and persistent storage:

  • The Oracle assigns monotonically increasing sequence numbers and detects write-write conflicts between concurrent transactions
  • The CommitPipeline queues pending commits and ensures WAL writes are serialized while allowing concurrent memtable updates
  • The MemTable is a lock-free skip list that holds recent writes before they are flushed to disk
  • Snapshots capture the visible sequence number at transaction start, providing isolation from concurrent commits

Storage Layer: All durable state lives here:

  • The WAL ensures committed data survives crashes by persisting entries before they become visible
  • SSTables are immutable sorted files organized into levels, with Level 0 containing recently flushed data
  • The VLog stores large values separately, reducing SSTable size and compaction cost
  • The Manifest tracks which SSTables exist at each level and their key ranges

Background Tasks: These run asynchronously to maintain system health:

  • Flush converts immutable memtables into Level 0 SSTables
  • Compaction merges overlapping SSTables to reduce read amplification and reclaim space
  • VLog GC removes stale entries from value log files when their discard ratio exceeds a threshold

Component Overview

Component Purpose Key Files
Tree Public API for database operations src/lsm.rs
Transaction ACID transaction with MVCC src/transaction.rs
Oracle Transaction timestamp and conflict detection src/oracle.rs
CommitPipeline Lock-free commit queue (inspired by Pebble) src/commit.rs
MemTable In-memory skip list for recent writes src/memtable/
Snapshot Point-in-time consistent view src/snapshot.rs
WAL Write-ahead log for durability src/wal/
SSTable Sorted string table on disk src/sstable/
VLog Value log for large value separation src/vlog.rs
Manifest Level metadata and SSTable tracking src/levels/

Directory Structure

database_path/
├── wal/                    # Write-ahead log segments
│   ├── 00000000000000000001.wal
│   └── 00000000000000000002.wal
├── sstables/               # SSTable files
│   ├── 00000000000000000001.sst
│   └── 00000000000000000002.sst
├── manifest/               # Level manifest files
│   └── 00000000000000000001.manifest
├── vlog/                   # Value log files (if enabled)
│   ├── 00000000000000000001.vlog
│   └── 00000000000000000002.vlog
├── versioned_index/        # B+tree index (if enabled)
└── LOCK                    # Lock file

Write Path

This section covers how data flows from a transaction's set() call through to durable storage.

Overview

A write operation proceeds through three main phases:

  1. Transaction Phase: The application buffers writes in the transaction's local write_set. No locks are held, no disk I/O occurs, and other transactions cannot see these changes.

  2. Commit Phase: When the application calls commit(), the transaction coordinates with the Oracle for conflict detection, assigns sequence numbers, and enters the commit pipeline.

  3. Background Phase: Asynchronously, full memtables are rotated and flushed to Level 0 SSTables.

The design minimizes contention during concurrent writes. The WAL write is kept as short as possible, while memtable insertion and visibility publishing can proceed in parallel.

┌────────────────────────────────────────────────────────────────────────────┐
│                              TRANSACTION PHASE                             │
│                                                                            │
│  txn.set(key, value)  ──►  Buffer in write_set  ──►  txn.commit()          │
│                            (BTreeMap)                                      │
└────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                                 COMMIT PHASE                               │
│                                                                            │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐         │
│  │     Oracle      │───►│  Assign SeqNum  │───►│  Create Batch   │         │
│  │ (conflict check)│    │  (atomic inc)   │    │  (entries)      │         │
│  └─────────────────┘    └─────────────────┘    └─────────────────┘         │
└────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                         COMMIT PIPELINE (Serialized)                       │
│                                                                            │
│  ┌─────────────────┐    ┌─────────────────┐                                │
│  │ Acquire Write   │───►│   Write to WAL  │                                │
│  │     Mutex       │    │  (durability)   │                                │
│  └─────────────────┘    └─────────────────┘                                │
│                                  │                                         │
│                                  ▼                                         │
│                    ┌─────────────────────────────┐                         │
│                    │  Value Size > Threshold?    │                         │
│                    └─────────────────────────────┘                         │
│                         │                  │                               │
│                    Yes  ▼                  ▼  No                           │
│            ┌─────────────────┐    ┌─────────────────┐                      │
│            │ Append to VLog  │    │  Store Inline   │                      │
│            │ Create Pointer  │    │  (in SSTable)   │                      │
│            └─────────────────┘    └─────────────────┘                      │
│                         │                  │                               │
│                         └────────┬─────────┘                               │
│                                  ▼                                         │
│                    ┌─────────────────────────────┐                         │
│                    │   Apply to Active MemTable  │                         │
│                    │       (skip list insert)    │                         │
│                    └─────────────────────────────┘                         │
│                                  │                                         │
│                                  ▼                                         │
│                    ┌─────────────────────────────┐                         │
│                    │  Publish Visibility         │                         │
│                    │  (update visible_seq_num)   │                         │
│                    └─────────────────────────────┘                         │
└────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                               BACKGROUND TASKS                             │
│                                                                            │
│  ┌─────────────────────────┐    ┌─────────────────────────────────────┐    │
│  │  MemTable Full?         │───►│  Rotate: Active → Immutable Queue   │    │
│  │  (size > threshold)     │Yes │  Create new empty Active MemTable   │    │
│  └─────────────────────────┘    └─────────────────────────────────────┘    │
│                                              │                             │
│                                              ▼                             │
│                                 ┌─────────────────────────────────────┐    │
│                                 │  Flush Immutable to L0 SSTable      │    │
│                                 │  (background task, async)           │    │
│                                 └─────────────────────────────────────┘    │
└────────────────────────────────────────────────────────────────────────────┘

Key Implementation Details

Transaction Buffering

All writes within a transaction are buffered in a BTreeMap<Key, Vec<Entry>> called write_set. The BTreeMap maintains keys in sorted order, which enables efficient range scans during read-your-own-writes checks. Multiple writes to the same key within a transaction are stored as a vector, preserving the write order.

This buffering approach provides:

  • Read-your-own-writes (RYOW): Uncommitted changes are visible to subsequent reads within the same transaction
  • Savepoint support: The write set can be partially rolled back to a previous savepoint
  • Atomic commit: All buffered writes are applied together or not at all

Commit Pipeline

The commit pipeline is inspired by Pebble's commit design. The key insight is that while WAL writes must be serialized (to maintain a consistent log order), other commit steps can proceed concurrently.

The pipeline operates in three stages:

  1. Write: Acquire the write mutex, append entries to WAL, release mutex
  2. Apply: Insert entries into the active memtable (lock-free skip list)
  3. Publish: Update visible_seq_num to make entries visible to readers

A semaphore limits concurrent commits (default: 8) to prevent memory exhaustion from too many in-flight transactions.

Value Separation (WiscKey)

When the Value Log is enabled, the commit path checks each value's size against vlog_value_threshold (default: 1KB):

  • Large values (> threshold): Written to VLog file, a 25-byte ValuePointer is stored in the memtable/SSTable
  • Small values (<= threshold): Stored inline in the memtable/SSTable

This trade-off reduces write amplification during compaction (fewer bytes to merge) at the cost of an extra read for large values. The threshold can be tuned based on workload characteristics.

Durability Modes

SurrealKV offers two durability levels:

  • Eventual (default): Data is written to the OS buffer cache but not fsynced. This is fast but data may be lost if the system crashes before the OS flushes to disk.
  • Immediate: Each commit calls fsync before returning. This guarantees durability but reduces throughput.

Sequence Numbers

Every entry receives a monotonically increasing 56-bit sequence number (the upper 56 bits of the 64-bit trailer). Sequence numbers serve multiple purposes:

  • MVCC visibility: Readers with snapshot seq=N see only entries with seq <= N
  • Ordering: During compaction, newer entries supersede older ones with the same key
  • Conflict detection: The Oracle tracks which keys were written at which sequence numbers

MemTable Rotation

When the active memtable exceeds max_memtable_size, it is rotated:

  1. The current memtable is marked immutable and added to a flush queue
  2. A new empty memtable becomes the active one
  3. A background task asynchronously flushes immutable memtables to Level 0 SSTables

This rotation is transparent to writers and ensures the active memtable never grows unbounded.


Read Path

The read path describes how data is retrieved from a transaction's get() call, traversing all storage layers. Understanding the read path helps explain SurrealKV's performance characteristics and the trade-offs involved in LSM-tree designs.

Overview

In an LSM tree, reads must check multiple locations because the same key may exist in several places simultaneously:

  • The transaction's uncommitted write buffer
  • The active memtable (most recent committed writes)
  • One or more immutable memtables (pending flush)
  • Level 0 SSTables (recently flushed, may overlap)
  • Level 1+ SSTables (compacted, non-overlapping)

The search proceeds from newest to oldest, stopping at the first match. This ensures that the most recent version of a key is always returned.

Read Amplification

Read amplification measures how many locations must be checked to find a key. In the worst case (key doesn't exist), a read may check all levels. LSM trees trade write performance for read performance: writes are fast (sequential appends) but reads may require multiple lookups.

SurrealKV minimizes read amplification through:

  • Bloom filters: Quickly eliminate SSTables that definitely don't contain a key
  • Block cache: Keep frequently accessed data blocks in memory
  • Non-overlapping levels: L1+ SSTables don't overlap, enabling binary search
┌─────────────────────────────────────────────────────────────────────────────┐
│                           TRANSACTION READ                                  │
│                                                                             │
│  txn.get(key)  ──►  Check write_set (RYOW)  ──►  Found? Return immediately  │
│                            │                                                │
│                            ▼ Not in write_set                               │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           SNAPSHOT READ                                     │
│                                                                             │
│  Create Snapshot (seq_num = visible_seq_num at txn start)                   │
│                            │                                                │
│                            ▼                                                │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │ Search Order (stop on first match with seq_num <= snapshot.seq_num): │   │
│  │                                                                      │   │
│  │   1. Active MemTable (newest writes, in-memory skiplist)             │   │
│  │              │                                                       │   │
│  │              ▼                                                       │   │
│  │   2. Immutable MemTables (newest first, pending flush)               │   │
│  │              │                                                       │   │
│  │              ▼                                                       │   │
│  │   3. Level 0 SSTables (ALL tables, may overlap)                      │   │
│  │              │                                                       │   │
│  │              ▼                                                       │   │
│  │   4. Level 1...N SSTables (binary search, non-overlapping)           │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────-───┐
│                         SSTABLE LOOKUP DETAIL                               |
│                                                                             │
│  For each SSTable candidate:                                                │
│                                                                             │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐          │
│  │  Bloom Filter   │───►│  Index Block    │───►│  Data Block     │          │
│  │  (may_contain?) │ Yes│  (binary search)│    │  (with cache)   │          │
│  └─────────────────┘    └─────────────────┘    └─────────────────┘          │
│         │ No                                                                │
│         ▼                                                                   │
│    Skip to next SSTable                                                     │
└─────────────────────────────────────────────────────────────────────────-───┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         VALUE RESOLUTION                                    │
│                                                                             │
│                    ┌─────────────────────────────-┐                         │
│                    │  Is ValueLocation a Pointer? │                         │
│                    └─────────────────────────────-┘                         │
│                         │                  │                                │
│                    Yes  ▼                  ▼  No                            │
│            ┌────────────────-─┐    ┌─────────────────┐                      │
│            │  Read from VLog  │    │ Return Inline   │                      │
│            │  (file_id,offset)│    │    Value        │                      │
│            └─────────────────-┘    └─────────────────┘                      │
│                         │                  │                                │
│                         └────────┬─────────┘                                │
│                                  ▼                                          │
│                         Return Value to Caller                              │
└─────────────────────────────────────────────────────────────────────────────┘

Key Implementation Details

RYOW (Read Your Own Writes)

Before checking persistent storage, a read first consults the transaction's local write_set buffer. This ensures uncommitted changes are visible to the transaction that made them, which is needed for multi-statement transactions that read data they just wrote.

Snapshot Isolation

When a transaction begins, it captures the current visible_seq_num as its snapshot. All reads within that transaction see a consistent view of the database as of that moment:

  • Entries with seq_num <= snapshot.seq_num are visible
  • Entries with seq_num > snapshot.seq_num are invisible (committed after the transaction started)

This provides repeatable reads: the same query will return the same results throughout the transaction, regardless of concurrent commits.

Level 0 vs Level 1+

Level 0 is special because it contains recently flushed memtables whose key ranges may overlap. When searching L0, all SSTables must be checked because any of them might contain the target key.

Levels 1 and above maintain a sorted, non-overlapping invariant: each SSTable covers a distinct key range. This enables binary search to find the single SSTable that might contain a key, reducing read amplification for deeper levels.

Bloom Filters

Each SSTable includes a Bloom filter, a probabilistic data structure that can definitively say "key is NOT in this table" or "key MIGHT be in this table." With a 1% false positive rate (tunable), Bloom filters eliminate ~99% of unnecessary SSTable reads.

The filter is loaded into memory when the SSTable is opened, so checking it is extremely fast compared to reading from disk.

Block Cache

SSTable data is organized into blocks (default: 4KB). When a block is read from disk, it is cached in an LRU (Least Recently Used) cache. Subsequent reads for keys in the same block are served from memory.

The cache key is (file_id, block_offset), ensuring that blocks from different SSTables don't collide.

VLog Resolution

When value separation is enabled, large values are not stored directly in SSTables. Instead, a ValuePointer (25 bytes) indicates where to find the actual value in the VLog.

The read path checks the BIT_VALUE_POINTER flag in the value's metadata byte:

  • If set: Read the value from VLog at the specified (file_id, offset)
  • If not set: Return the inline value directly

This extra indirection adds latency for large value reads but reduces SSTable size and compaction cost.


Versioning and MVCC

SurrealKV implements Multi-Version Concurrency Control (MVCC) with snapshot isolation for concurrent reads and writes. MVCC stores multiple versions of each key, allowing readers and writers to operate concurrently without blocking each other.

Why MVCC?

MVCC provides the following:

  1. Non-blocking reads: Readers never block writers, and writers never block readers. Each transaction sees a consistent snapshot.

  2. Time-travel queries: Applications can query historical data at specific timestamps, enabling auditing, debugging, and undo functionality.

  3. Optimistic concurrency: Transactions proceed without locks and check for conflicts at commit time, which works well for read-heavy workloads.

  4. Isolation guarantees: Each transaction sees a consistent view of the database, unaffected by concurrent modifications.

InternalKey Structure

Every key stored in SurrealKV is an InternalKey that combines the user-provided key with version metadata. This encoding is central to how MVCC works:

┌──────────────────┬───────────────────┬──────────────────┐
│    user_key      │    trailer (8B)   │  timestamp (8B)  │
│   (variable)     │  seq_num | kind   │   nanoseconds    │
└──────────────────┴───────────────────┴──────────────────┘

trailer = (seq_num << 8) | kind

kind values:
  0 = Delete (hard delete, tombstone)
  1 = SoftDelete (versioned delete)
  2 = Set (normal write)
  6 = Replace (replaces all previous versions)

The trailer packs the sequence number and operation kind into 8 bytes. The sequence number occupies the upper 56 bits, and the kind occupies the lower 8 bits. This encoding allows efficient comparison: keys are sorted first by user key, then by sequence number (descending), so the newest version appears first.

The timestamp (optional, enabled with versioning) stores the application-provided timestamp in nanoseconds. This enables point-in-time queries using actual wall-clock times rather than internal sequence numbers.

Operation Kinds Explained:

  • Delete (0): A hard delete that removes the key permanently. During compaction at the bottom level, this tombstone can be dropped.
  • SoftDelete (1): A versioned delete that preserves history. The key appears deleted to current readers but historical queries can still see previous versions.
  • Set (2): A normal write that creates a new version of the key.
  • Replace (6): A special write that supersedes all previous versions, allowing aggressive garbage collection.

Snapshot Isolation

Snapshot isolation ensures that each transaction sees a consistent view of the database. When a transaction starts, it captures the current visible_seq_num. All operations within that transaction see exactly the state as of that moment.

Timeline:
─────────────────────────────────────────────────────────────►
     │           │           │           │
  seq=1       seq=2       seq=3       seq=4
  Set(A,1)    Set(A,2)    Set(A,3)    Delete(A)
     │           │           │           │
     │           │           │           │
     ▼           ▼           ▼           ▼
┌─────────────────────────────────────────────────────────┐
│ Snapshot at seq=2 sees: A=2                             │
│ Snapshot at seq=3 sees: A=3                             │
│ Snapshot at seq=4 sees: A deleted (tombstone)           │
└─────────────────────────────────────────────────────────┘

How visibility works:

When reading key A with snapshot seq=2:

  1. Find all versions of A: (seq=4, Delete), (seq=3, Set), (seq=2, Set), (seq=1, Set)
  2. Filter to versions with seq <= 2: (seq=2, Set), (seq=1, Set)
  3. Return the newest visible version: seq=2, value=2

The Oracle

The Oracle is responsible for:

  • Sequence number allocation: Atomically incrementing the global sequence counter for each commit
  • Conflict detection: Tracking which keys were written at which sequence numbers to detect write-write conflicts

When two transactions both modify the same key, the Oracle ensures only one can commit. The first to call commit wins; the second receives a conflict error and must retry.

Version Retention

By default, compaction drops old versions as soon as they are no longer visible to any active snapshot. When versioning is enabled (with_versioning(true, retention_ns)), the behavior changes:

  • All versions are preserved during compaction, not just the latest
  • Retention period: Versions older than retention_ns nanoseconds may be garbage collected
  • Unlimited retention: Setting retention_ns = 0 keeps all versions forever

This enables time-travel queries where applications can read the database state at any historical timestamp within the retention window.

Versioned Storage Architecture

When versioning is enabled, SurrealKV automatically configures the storage layer to support efficient historical queries. The key insight is that all values are stored in the VLog, and indexes (both LSM and optional B+tree) store only value pointers. This architecture enables multiple index structures on top of a single value store.

┌─────────────────────────────────────────────────────────────────────────────┐
│                     VERSIONED STORAGE ARCHITECTURE                          │
│                                                                             │
│  When versioning is enabled, all values are stored in VLog.                 │
│  Indexes store only value pointers, enabling multiple index structures.     │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                        INDEX LAYER                                  │    │
│  │                                                                     │    │
│  │   ┌───────────────────┐         ┌───────────────────┐               │    │
│  │   │  LSM Tree Index   │         │  B+Tree Index     │               │    │
│  │   │  (always present) │         │  (optional)       │               │    │
│  │   │                   │         │                   │               │    │
│  │   │  InternalKey ──►  │         │  InternalKey ──►  │               │    │
│  │   │  ValuePointer     │         │  ValuePointer     │               │    │
│  │   └─────────┬─────────┘         └─────────┬─────────┘               │    │
│  │             │                             │                         │    │
│  │             └──────────────┬──────────────┘                         │    │
│  │                            │                                        │    │
│  └────────────────────────────┼────────────────────────────────────────┘    │
│                               │                                             │
│                               ▼                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                        VALUE LOG (VLog)                             │    │
│  │                                                                     │    │
│  │   All versioned entries stored here (vlog_value_threshold = 0)      │    │
│  │                                                                     │    │
│  │   ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐                   │    │
│  │   │ Entry 1 │ │ Entry 2 │ │ Entry 3 │ │ Entry 4 │ ...               │    │
│  │   │ v1 of A │ │ v2 of A │ │ v1 of B │ │ v3 of A │                   │    │
│  │   └─────────┘ └─────────┘ └─────────┘ └─────────┘                   │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘

Configuration:

When with_versioning(true, retention_ns) is called:

  • VLog is automatically enabled
  • vlog_value_threshold is set to 0, so all values go to VLog
  • The LSM tree stores (InternalKey → ValuePointer) instead of inline values

Two Query Modes:

Mode Configuration Query Method
LSM-only with_versioning(true, retention_ns) K-way merge across all LSM components
B+tree indexed + with_versioned_index(true) Direct B+tree lookup

LSM-based Versioned Queries:

When using LSM-only versioning, the history() API uses a KMergeIterator that gathers iterators from:

  • Active memtable
  • All immutable memtables (pending flush)
  • All SSTable levels (L0, L1, L2, ...)

These are merged using a k-way merge that returns entries sorted by (user_key, seq_num descending). Finding all versions of a key requires scanning all levels.

B+tree Versioned Queries:

When with_versioned_index(true) is also enabled, versioned queries go directly to the B+tree which stores all (InternalKey → ValuePointer) entries. The B+tree keeps entries sorted, so all versions of a key are contiguous and can be retrieved with a single seek.

The B+tree implementation uses a disk-based page management approach, with the handling of large entries drawing inspiration from SQLite's overflow page design.

Why Use B+tree Index:

The motivation for the optional B+tree index:

  • LSM stores keys across multiple SSTables; finding all versions requires checking every level
  • B+tree consolidates all versions in a single sorted structure
  • Single index lookup vs. merging N sources (where N = memtables + SSTables)

B+tree Trade-offs:

Pros Cons
Single index lookup instead of k-way merge Slow insertion (in-place modified tree on disk)
Fast point and range queries for history Insert performance slows during LSM writes
All versions of a key are contiguous Every LSM write also updates B+tree

Timestamp Ordering Limitation:

When using LSM-only versioning (without B+tree index), timestamps inserted "back in time" will not be read correctly. This occurs because:

  • LSM orders entries by (user_key ASC, seq_num DESC)
  • Point-in-time queries (get_at) find the first entry with seq_num <= snapshot_seq where timestamp <= query_timestamp
  • A later-inserted entry with an earlier timestamp will have a higher sequence number, causing it to be returned instead of the correct historical value

To support out-of-order timestamp inserts, enable the B+tree index with with_versioned_index(true). The B+tree stores entries sorted by (user_key, timestamp) and supports in-place updates, correctly handling historical data insertion.

Read-after-Write Consistency:

Currently, the B+tree index is updated synchronously during LSM writes, providing read-after-write consistency for versioned queries. A recently written version is immediately visible via the history() API.

Future Work:

Async indexing could improve write performance by decoupling B+tree updates from the LSM write path. However, this would require relaxing the read-after-write consistency guarantee (recently written versions may not be immediately visible via B+tree queries). This is a potential optimization for future versions.


Compaction

Compaction is a background process that merges SSTables to reduce read amplification, reclaim space from obsolete versions, and maintain the leveled structure of the LSM tree.

Why Compaction is Necessary

Without compaction, an LSM tree would degrade over time:

  • Read amplification grows: More SSTables means more files to check for each read
  • Space amplification grows: Obsolete versions and tombstones accumulate
  • Level 0 bloats: Overlapping SSTables in L0 hurt read performance

Compaction addresses these issues by:

  1. Merging overlapping SSTables into non-overlapping ones
  2. Dropping obsolete versions that are no longer visible
  3. Removing tombstones at the bottom level
  4. Maintaining size ratios between levels

Leveled Compaction Strategy

SurrealKV uses leveled compaction, similar to RocksDB's default strategy. Each level has a target size, with each level being approximately 10x larger than the previous:

  • Level 0: ~64MB (4 files × 16MB each, overlapping allowed)
  • Level 1: ~256MB (non-overlapping)
  • Level 2: ~2.5GB (non-overlapping)
  • Level N: 10x Level N-1
┌─────────────────────────────────────────────────────────────────────────────┐
│                         LSM TREE STRUCTURE                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  LEVEL 0 (Overlapping - recently flushed memtables)                         │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐                                        │
│  │ SST A   │ │ SST B   │ │ SST C   │  ← May have overlapping key ranges     │
│  │ a───z   │ │ d───m   │ │ b───k   │                                        │
│  └─────────┘ └─────────┘ └─────────┘                                        │
│       │           │           │                                             │
│       └───────────┼───────────┘                                             │
│                   ▼                                                         │
│  LEVEL 1 (Non-overlapping, sorted)                                          │
│  ┌─────────┬─────────┬─────────┐                                            │
│  │ SST 1   │ SST 2   │ SST 3   │  ← Non-overlapping, can binary search      │
│  │ a───f   │ g───m   │ n───z   │                                            │
│  └─────────┴─────────┴─────────┘                                            │
│                   │                                                         │
│                   ▼                                                         │
│  LEVEL 2 (Non-overlapping, 10x larger)                                      │
│  ┌──────┬──────┬──────┬──────┐                                              │
│  │SST 1 │SST 2 │SST 3 │SST 4 │                                              │
│  │ a─c  │ d─h  │ i─o  │ p─z  │                                              │
│  └──────┴──────┴──────┴──────┘                                              │
│                   │                                                         │
│                   ▼                                                         │
│  LEVEL N (Bottom level)                                                     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                         COMPACTION TRIGGER                                  │
│                                                                             │
│  Calculate scores for each level:                                           │
│    L0 score = max(file_count / 4, total_bytes / max_bytes)                  │
│    Ln score = level_bytes / target_bytes_for_level                          │
│                                                                             │
│  Pick level with highest score >= 1.0                                       │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                          FILE SELECTION                                     │
│                                                                             │
│  1. Select file by priority:                                                │
│     - ByCompensatedSize: file_size * (1 + delete_ratio * 0.5)               │
│     - OldestSmallestSeqFirst: oldest sequence number                        │
│                                                                             │
│  2. Expand to include all files sharing boundary keys                       │
│                                                                             │
│  3. Select overlapping files from target level                              │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                      COMPACTION ITERATOR                                    │
│                                                                             │
│  For each key in merged iterator:                                           │
│                                                                             │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                   SNAPSHOT-AWARE DECISION                             │  │
│  │                                                                       │  │
│  │  1. Find earliest snapshot that can see this version                  │  │
│  │                                                                       │  │
│  │  2. If same visibility boundary as newer version → DROP (superseded)  │  │
│  │                                                                       │  │
│  │  3. If visible to any active snapshot → KEEP                          │  │
│  │                                                                       │  │
│  │  4. If versioning enabled and within retention → KEEP                 │  │
│  │                                                                       │  │
│  │  5. Tombstone at bottom level with no snapshots → DROP                │  │
│  │                                                                       │  │
│  │  6. Otherwise → DROP                                                  │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  Track VLog discard stats for garbage collection                            │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           FINISH                                            │
│                                                                             │
│  1. Write new SSTable to disk                                               │
│  2. Update manifest atomically (add new, remove old)                        │
│  3. Delete old SSTable files                                                │
│  4. Flush VLog discard stats                                                │
└─────────────────────────────────────────────────────────────────────────────┘

Compaction Scoring

Each level receives a compaction score that determines priority:

Level 0 Score:

L0_score = max(file_count / L0_compaction_trigger, total_bytes / max_bytes)

L0 is special because overlapping files hurt read performance. The score considers both file count and total size.

Level N Score (N > 0):

Ln_score = level_bytes / target_bytes_for_level

Higher levels are compacted when they exceed their target size.

The level with the highest score >= 1.0 is selected for compaction.

File Selection Algorithms

When a level is selected for compaction, individual files are chosen based on priority:

  • ByCompensatedSize: Prioritizes files with high delete ratios: score = file_size * (1 + delete_ratio * 0.5). This helps reclaim space from files with many tombstones.
  • OldestSmallestSeqFirst: Prioritizes files with the oldest sequence numbers, which helps propagate old data to deeper levels.

After selecting the initial file, the compaction expands to include:

  1. All files in the source level that share boundary keys (to avoid key range gaps)
  2. All overlapping files in the target level

Write Amplification

Write amplification measures how many times data is written to disk over its lifetime. In leveled compaction:

  • Each entry is written once per level it passes through
  • With a 10x size ratio and N levels, write amplification is approximately N
  • Typical values: 10-30x for write-heavy workloads

SurrealKV's value separation (VLog) reduces write amplification by keeping large values out of the compaction process entirely.


Value Log (VLog)

SurrealKV implements WiscKey-style value separation to reduce write amplification for large values. This design is based on the WiscKey paper from FAST '16.

The Write Amplification Problem

In a traditional LSM tree, values are stored alongside keys in SSTables. During compaction, entire key-value pairs are read, merged, and rewritten. For large values, this leads to significant write amplification:

  • A 1MB value written once may be rewritten 10-30 times as it moves through levels
  • SSD lifespan is reduced by excessive writes
  • Compaction becomes I/O bound on value size rather than key count

WiscKey Solution

WiscKey separates keys from values:

  • Keys (typically small) remain in the LSM tree and participate in compaction
  • Values (potentially large) are stored separately in an append-only log

This separation provides:

  • Reduced write amplification: Only keys are compacted, not values
  • Faster compaction: Less data to read and write
  • Better SSD utilization: Sequential writes to VLog, random reads amortized by SSD parallelism

Trade-offs:

  • Read overhead: Large value reads require an extra I/O to the VLog
  • Garbage collection: Stale VLog entries must be reclaimed separately
  • Complexity: Additional component to manage

When to Enable VLog

VLog is beneficial when:

  • Values are large (> 1KB on average)
  • Write volume is high
  • SSD longevity is a concern

VLog may hurt performance when:

  • Values are small (overhead of pointer > benefit)
  • Workload is read-heavy (extra VLog I/O per read)
  • Random access patterns (VLog reads are sequential within a file)

Value Separation Decision

┌─────────────────────────────────────────────────────────────────────────────┐
│                      VALUE SEPARATION DECISION                              │
│                                                                             │
│  During commit, for each value:                                             │
│                                                                             │
│                    ┌─────────────────────────────┐                          │
│                    │  value.len() > threshold?   │                          │
│                    │  (default: 1KB)             │                          │
│                    └─────────────────────────────┘                          │
│                         │                  │                                │
│                    Yes  ▼                  ▼  No                            │
│            ┌─────────────────┐    ┌─────────────────┐                       │
│            │  Store in VLog  │    │  Store Inline   │                       │
│            │                 │    │                 │                       │
│            │ ┌─────────────┐ │    │ ┌─────────────┐ │                       │
│            │ │ValuePointer │ │    │ │ Raw Value   │ │                       │
│            │ │ (25 bytes)  │ │    │ │ (variable)  │ │                       │
│            │ │ in SSTable  │ │    │ │ in SSTable  │ │                       │
│            │ └─────────────┘ │    │ └─────────────┘ │                       │
│            └─────────────────┘    └─────────────────┘                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

ValuePointer Structure

┌─────────────────────────────────────────────────────────────────┐
│                    ValuePointer (25 bytes)                       │
├─────────┬──────────┬──────────┬────────────┬────────────┬───────┤
│ version │ file_id  │  offset  │  key_size  │ value_size │  crc  │
│  (1B)   │  (4B)    │  (8B)    │   (4B)     │   (4B)     │ (4B)  │
└─────────┴──────────┴──────────┴────────────┴────────────┴───────┘

ValueLocation Encoding

┌─────────────────────────────────────────────────────────────────┐
│                      ValueLocation                              │
├─────────┬─────────┬─────────────────────────────────────────────┤
│  meta   │ version │               value                         │
│  (1B)   │  (1B)   │   (inline bytes OR ValuePointer)            │
└─────────┴─────────┴─────────────────────────────────────────────┘

meta bit 0 = BIT_VALUE_POINTER
  If set: value contains ValuePointer (25 bytes)
  If not set: value contains inline bytes

VLog File Format

┌─────────────────────────────────────────────────────────────────────────────┐
│                          VLOG FILE FORMAT                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │                      FILE HEADER (31 bytes)                        │     │
│  ├─────────┬─────────┬─────────┬────────────┬────────────┬────────────┤     │
│  │  magic  │ version │ file_id │ created_at │  max_size  │ compressed │     │
│  │  (4B)   │  (2B)   │  (4B)   │   (8B)     │    (8B)    │    (1B)    │     │
│  │ "VLOG"  │         │         │            │            │            │     │
│  └─────────┴─────────┴─────────┴────────────┴────────────┴────────────┘     │
│                                                                             │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │                         ENTRY 1                                    │     │
│  ├───────────┬────────────┬──────────────┬──────────────┬─────────────┤     │
│  │  key_len  │ value_len  │     key      │    value     │    crc32    │     │
│  │   (4B)    │   (4B)     │  (variable)  │  (variable)  │    (4B)     │     │
│  └───────────┴────────────┴──────────────┴──────────────┴─────────────┘     │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                         ENTRY 2                                     │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                ...                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                         ENTRY N                                     │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Entry total size = 8 + key_len + value_len + 4 bytes
File rotation when size >= vlog_max_file_size (default: 256MB)

VLog Configuration

Key configuration options:

Option Default Description
enable_vlog false Enable value separation
vlog_value_threshold 1KB Values larger than this go to VLog
vlog_max_file_size 256MB Maximum size before rotating to new file
vlog_gc_discard_ratio 0.5 Trigger GC when 50% of file is garbage

File Management

VLog files are managed as follows:

  1. Active file: New values are appended to the current active file
  2. Rotation: When the active file reaches vlog_max_file_size, it is closed and a new file is created
  3. Reference counting: Each VLog file maintains a reference count of active iterators to prevent deletion during reads
  4. Garbage collection: Files with high discard ratios are compacted to reclaim space (see next section)

VLog Garbage Collection

VLog files accumulate stale entries as values are overwritten or deleted.

The Garbage Problem

When a key is updated or deleted:

  1. The new value is written to a new VLog file (values are append-only)
  2. The LSM tree's pointer is updated to reference the new location
  3. The old VLog entry becomes unreferenced

Over time, old VLog files may contain only stale entries that are no longer referenced by any SSTable. These files can be safely deleted to reclaim disk space.

How Garbage is Tracked

The key insight is that a VLog file is safe to delete when no live SSTable references it. SurrealKV tracks this using the oldest_vlog_file_id metadata in each SSTable:

During SSTable Creation:

When writing an SSTable (during flush or compaction), the TableWriter tracks the minimum VLog file ID referenced by any value pointer in that SSTable. This is stored in the SSTable's properties as oldest_vlog_file_id.

Computing the Global Minimum:

The LevelManifest.min_oldest_vlog_file_id() method computes the minimum oldest_vlog_file_id across all live SSTables:

// Pseudocode
min_oldest_vlog = manifest
    .iter()
    .filter(|sst| sst.oldest_vlog_file_id > 0)
    .map(|sst| sst.oldest_vlog_file_id)
    .min()
    .unwrap_or(0)

Any VLog file with file_id < min_oldest_vlog is safe to delete because no SSTable references it.

┌─────────────────────────────────────────────────────────────────────────────┐
│                          SST METADATA TRACKING                              │
│                                                                             │
│  During SSTable creation (flush or compaction):                             │
│                                                                             │
│  For each key-value pair:                                                   │
│    if value is a VLog pointer:                                              │
│      min_vlog_file_id = min(min_vlog_file_id, pointer.file_id)              │
│                                                                             │
│  Store min_vlog_file_id as oldest_vlog_file_id in SST properties            │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                    AFTER FLUSH OR COMPACTION                                │
│                                                                             │
│  1. Compute global minimum:                                                 │
│     min_oldest_vlog = manifest.min_oldest_vlog_file_id()                    │
│                                                                             │
│  2. If min_oldest_vlog == 0: No SSTs reference VLog, skip cleanup           │
│                                                                             │
│  3. Otherwise: Delete VLog files with file_id < min_oldest_vlog             │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              ITERATOR SAFETY                                │
│                                                                             │
│  Before deleting a VLog file, check iterator_count:                         │
│    - If iterator_count > 0: Defer deletion (readers may access file)        │
│    - If iterator_count == 0: Safe to delete immediately                     │
│                                                                             │
│  This prevents deleting files that in-flight reads may access.              │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         VERSIONED INDEX CLEANUP                             │
│                                                                             │
│  If versioned_index (B+tree) is enabled:                                    │
│    - Scan B+tree entries for stale VLog pointers                            │
│    - Remove entries where pointer.file_id < min_oldest_vlog                 │
│    - Uses read-scan + batched-write-delete for concurrency safety           │
└─────────────────────────────────────────────────────────────────────────────┘

GC Components

Component Purpose Location
oldest_vlog_file_id SST metadata tracking oldest VLog file referenced SST properties
min_oldest_vlog_file_id() Computes global minimum across all SSTs LevelManifest method
iterator_count Safety counter for active readers VLog struct
cleanup_stale_versioned_index() Removes stale B+tree entries CoreInner method

GC Trigger Points

VLog garbage collection runs automatically at two points:

  1. After Flush: When a memtable is flushed to an L0 SSTable
  2. After Compaction: When SSTables are merged during compaction

Both paths call vlog.cleanup_obsolete_files(min_oldest_vlog) after successfully updating the manifest.

Algorithm Details

Why Global Minimum Works:

The global minimum approach is correct because:

  • VLog file IDs are monotonically increasing
  • Each SSTable records the oldest VLog file it references
  • If min_oldest_vlog = N, then all live SSTables reference only VLog files >= N
  • Therefore, VLog files < N have no live references and can be deleted

Iterator Safety:

VLog files cannot be deleted while readers might access them. The iterator_count provides a safety net:

  • Incremented when a snapshot/iterator is created
  • Decremented when the snapshot/iterator is dropped
  • If iterator_count > 0, deletion is deferred to startup cleanup

This prevents the race condition where GC deletes a file that an in-flight read is about to access.

Versioned Index Cleanup:

When the optional B+tree versioned index is enabled, it stores (InternalKey → ValuePointer) entries. After VLog files are deleted, stale entries pointing to those files must be removed:

  1. Read Phase: Scan the B+tree under a read lock, collecting keys with pointer.file_id < min_oldest_vlog
  2. Delete Phase: Remove collected keys in batches under brief write locks

This two-phase approach avoids holding locks during the full scan, allowing concurrent writes.


Recovery

When SurrealKV starts, it recovers the database state from disk by loading the manifest and replaying the WAL. This process ensures that all committed data is restored, even after crashes or power failures.

Crash Recovery Guarantees

SurrealKV provides the following guarantees:

  • Committed data is durable: Any transaction that received a successful commit response will be recovered
  • Uncommitted data may be lost: Transactions that crashed before commit are not recovered
  • Atomic recovery: Either all of a transaction's writes are recovered, or none are

The strength of the durability guarantee depends on the Durability mode used at commit time:

  • Immediate: Data was fsynced before commit returned; it will survive any crash
  • Eventual: Data was written to OS buffers; it may be lost if the OS crashes before flushing

Recovery Process Overview

Recovery proceeds in three phases:

  1. Initialization: Load metadata and acquire exclusive access
  2. WAL Replay: Reconstruct in-memory state from the write-ahead log
  3. Finalization: Clean up orphaned files and start background tasks
┌─────────────────────────────────────────────────────────────────────────────┐
│                                INITIALIZATION                               │
│                                                                             │
│  1. Acquire LOCK file (prevents concurrent access)                          │
│                                                                             │
│  2. Load Level Manifest                                                     │
│     - SSTable locations per level                                           │
│     - min_wal_number (WALs older than this are flushed)                     │
│     - last_sequence number                                                  │
│                                                                             │
│  3. Open/create required directories                                        │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                                  WAL REPLAY                                 │
│                                                                             │
│  For each WAL segment with id >= min_wal_number:                            │
│                                                                             │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  Read records sequentially:                                           │  │
│  │                                                                       │  │
│  │  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐              │  │
│  │  │ Read Header │ ──► │ Verify CRC  │ ──► │ Decompress  │              │  │
│  │  │ (7 bytes)   │     │             │     │ (if needed) │              │  │
│  │  └─────────────┘     └─────────────┘     └─────────────┘              │  │
│  │         │                   │                   │                     │  │
│  │         ▼                   ▼                   ▼                     │  │
│  │  On corruption:       CRC mismatch:       Apply to MemTable           │  │
│  │  - TolerateWithRepair: truncate & continue                            │  │
│  │  - AbsoluteConsistency: return error                                  │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  If MemTable fills during replay → Flush to L0 SSTable                      │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                                 FINALIZATION                                │
│                                                                             │
│  1. Set sequence numbers:                                                   │
│     visible_seq_num = max(manifest_last_seq, wal_max_seq)                   │
│     log_seq_num = visible_seq_num + 1                                       │
│                                                                             │
│  2. Cleanup orphaned files:                                                 │
│     - SST files not in manifest → delete                                    │
│     - WAL files < min_wal_number → delete                                   │
│                                                                             │
│  3. Start background tasks:                                                 │
│     - MemTable flush task                                                   │
│     - Level compaction task                                                 │
│     - VLog GC task (if VLog enabled)                                        │
└─────────────────────────────────────────────────────────────────────────────┘

WAL Record Format

The WAL uses a block-based format inspired by LevelDB/RocksDB. Each block is 32KB, and records that don't fit in a single block are fragmented across multiple blocks.

┌───────────────────────────────────────────────────────────────────────────┐
│                         WAL BLOCK (32KB)                                  │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌───────┬───────┬───────┬─────────────────────────────────────────────┐  │
│  │  CRC  │ Size  │ Type  │                 Payload                     │  │
│  │ (4B)  │ (2B)  │ (1B)  │              (variable)                     │  │
│  └───────┴───────┴───────┴─────────────────────────────────────────────┘  │
│                                                                           │
│  Record Types:                                                            │
│    0 = Empty (padding)                                                    │
│    1 = Full (complete record in one block)                                │
│    2 = First (start of fragmented record)                                 │
│    3 = Middle (continuation of fragmented record)                         │
│    4 = Last (end of fragmented record)                                    │
│                                                                           │
│  CRC covers: Type byte + Payload                                          │
└───────────────────────────────────────────────────────────────────────────┘

Record Fragmentation:

Large records that don't fit in the remaining space of a block are split:

  • First: Contains the beginning of the record
  • Middle: Contains continuation data (may be multiple)
  • Last: Contains the final portion

During replay, fragments are reassembled before processing.

Recovery Modes

SurrealKV supports two recovery modes that control behavior when WAL corruption is detected:

Mode Behavior
AbsoluteConsistency Return error on any corruption; do not start
TolerateWithRepair Truncate corrupted tail and continue recovery

TolerateWithRepair is useful for recovering from crashes that left partial writes. The corrupted data represents uncommitted transactions (they crashed before completing), so truncating them maintains consistency.

Manifest's Role in Recovery

The manifest tracks the durable state of the LSM tree:

  • Which SSTables exist at each level
  • Their key ranges and file sizes
  • The min_wal_number: WALs older than this have been fully flushed to SSTables

During recovery, SurrealKV only replays WALs with id >= min_wal_number. Older WALs contain data that is already in SSTables and can be safely deleted.

Orphan File Cleanup

After recovery, the database may contain orphaned files:

  • SST files not in manifest: Created but crashed before manifest update; safe to delete
  • WAL files < min_wal_number: Already flushed to SSTables; safe to delete
  • Temporary files: Incomplete compaction outputs; safe to delete

This cleanup ensures disk space is reclaimed after crashes.


Checkpoint and Restore

Checkpoints create consistent point-in-time snapshots of the database for backup and recovery. Checkpoints capture the entire database state at a specific moment, enabling point-in-time recovery.

Use Cases

Backup and Disaster Recovery:

  • Create periodic checkpoints to external storage (S3, NFS, etc.)
  • Restore from checkpoint after catastrophic failure
  • Geographic distribution of backup copies

Testing and Development:

  • Checkpoint production data for test environments
  • Restore to known state between test runs
  • Debug issues with production data snapshots

Migration:

  • Checkpoint before major upgrades
  • Transfer database between machines
  • Clone database for read replicas

Why Flush Before Checkpoint?

A checkpoint must capture a consistent state. The active memtable contains uncommitted and in-flight data that may be partially applied. By flushing to SSTables first:

  1. All committed data is durably stored in files
  2. The checkpoint captures exactly what was committed at that moment
  3. Recovery from checkpoint is deterministic

Checkpoint Process

┌─────────────────────────────────────────────────────────────────────────┐
│                         CHECKPOINT CREATION                             │
│                                                                         │
│  1. PREPARATION                                                         │
│     - Flush active memtable to SSTable                                  │
│     - Wait for all immutable memtables to flush                         │
│     - Capture current sequence number                                   │
│                                                                         │
│  2. COPY FILES TO CHECKPOINT DIRECTORY                                  │
│     checkpoint_path/                                                    │
│     ├── sstables/         ← Copy all SSTable files                      │
│     ├── wal/              ← Copy WAL segments                           │
│     ├── manifest/         ← Copy manifest file                          │
│     ├── vlog/             ← Copy VLog files (if enabled)                │
│     ├── discard_stats/    ← Copy discard stats (if VLog)                │
│     ├── delete_list/      ← Copy delete list (if VLog)                  │
│     └── metadata.json     ← Checkpoint metadata                         │
│                                                                         │
│  3. METADATA                                                            │
│     {                                                                   │
│       "timestamp": 1234567890,                                          │
│       "sequence_number": 12345,                                         │
│       "sstable_count": 42,                                              │
│       "total_size": 1073741824                                          │
│     }                                                                   │
└─────────────────────────────────────────────────────────────────────────┘

Restore Flow

┌─────────────────────────────────────────────────────────────────────────┐
│                            RESTORE PROCESS                              │
│                                                                         │
│  1. VALIDATION                                                          │
│     - Verify checkpoint metadata exists                                 │
│     - Verify all required files present                                 │
│                                                                         │
│  2. REPLACE DATABASE FILES                                              │
│     - Clear current sstables/, wal/, manifest/, vlog/ directories       │
│     - Copy files from checkpoint                                        │
│                                                                         │
│  3. RELOAD STATE                                                        │
│     - Load manifest from checkpoint                                     │
│     - Clear in-memory memtables (discards uncommitted data)             │
│     - Replay WAL segments from checkpoint                               │
│     - Set sequence numbers from checkpoint                              │
│                                                                         │
│  4. RESUME OPERATIONS                                                   │
│     - Database returns to checkpoint state                              │
│     - Any data written after checkpoint is discarded                    │
└─────────────────────────────────────────────────────────────────────────┘