- Status: Draft
- Authors: Tonbo team
- Created: 2026-02-03
- Area: Compaction, Read Path, Testing
Define a correctness gate for compaction that validates read consistency and MVCC semantics before and after compaction. The gate uses a small reference oracle plus deterministic scenarios as a minimal phase, and provides a structure for later expansion into model-based randomized sequences and broader invariants.
Compaction rewrites physical layout and can silently break logical correctness. A dedicated gate is needed to prove that compaction does not change visible results for a fixed snapshot and does not resurrect deleted or superseded values. This RFC establishes the semantic contract, the minimal gate, and a path to expand coverage as the compaction pipeline evolves.
- Specify compaction correctness invariants in terms of visible read results.
- Deliver a minimal, deterministic gate that is reproducible and debuggable.
- Provide an oracle model that aligns with Tonbo MVCC semantics.
- Define a path to expand from deterministic scenarios to model-based randomized tests.
- Performance benchmarking or tuning.
- Long-running fuzzing as a required CI blocker.
- Full coverage of crash recovery or manifest/GC interactions (follow-on work).
- Guaranteeing correctness for future range tombstones or remote compaction (follow-on work).
The correctness gate validates that logical results are unchanged by compaction for a fixed snapshot. Invariants are grouped by phase to allow a minimal starting point.
Phase 0 (minimal gate):
-
Snapshot consistency
For a fixed logical snapshot, reads before and after compaction return identical results. -
No resurrection + last-write-wins
A key deleted (or overwritten) before a snapshot does not reappear, and the latest visible version at that snapshot is returned. -
Range completeness
Range scans at a fixed snapshot return the same key set and values before and after compaction.
Phase 1+ (expanded invariants):
-
Tombstone pruning safety
Pruning tombstones does not expose older versions that should remain hidden. -
Iterator/seek stability
Iterator ordering and seek semantics remain stable across compaction. -
Reopen + snapshot durability
Snapshot consistency remains valid across DB reopen boundaries.
The gate uses a small in-memory MVCC oracle to represent logical truth:
- State model: key -> ordered versions. Each version stores
(commit_ts, tombstone, value). - Visibility: for snapshot
ts, the visible version is the highestcommit_ts <= ts. - Tie-breaks: delete wins on equal
commit_ts(consistent with SST merge semantics). - Read semantics:
get(key, ts)returns the visible version or not-found.scan(range, ts)returns all visible keys in order.
The oracle does not model compaction mechanics; it models only logical results so the compaction pipeline can change freely as long as results match.
Deterministic scenarios (Phase 0):
| Scenario | Purpose | Invariants |
|---|---|---|
| Overwrite chain | Validate last-write-wins at fixed snapshot | 1, 2 |
| Delete-heavy | Validate no resurrection | 1, 2 |
| Range scan with deletes | Validate range completeness | 1, 3 |
| Cross-segment overlap | Validate compaction rewrite invariance | 1, 2, 3 |
Each scenario:
- Build a controlled write/delete sequence.
- Take snapshot
ts. - Read (get/scan) before compaction; compare to oracle.
- Force compaction.
- Read after compaction at the same
ts; compare to oracle.
Model-based randomized sequences (Phase 1+):
- Generate sequences of operations: put, delete, flush, compact, get, scan.
- Use seeded RNG; log seed and operation trace for reproduction.
- Validate after each read, and at periodic pre/post compaction checkpoints.
- TigerBeetle uses a model-based fuzzer for its LSM tree and compares scan/get results against a reference model.
- LevelDB maintains a ModelDB (in-memory map) and compares iterator results at intervals and across snapshots.
- RocksDB stress tests maintain an expected-state oracle that verifies latest values across runs.
- Turso uses proptest and differential oracles to compare results against SQLite, plus explicit property checks in concurrency simulation.
Tonbo adopts the same principle: compare logical results to a minimal oracle, then expand via randomized sequences once the baseline gate is validated.
The gate must emit enough context to replay failures:
- RNG seed (if randomized).
- Operation trace (human-readable and machine-parsable).
- Snapshot timestamps used for assertions.
- Diff of expected vs actual results (keys and versions).
Initial coverage targets minor compaction and the current read path across mutable + immutable + SST layers. Major compaction, manifest/GC, and remote execution are validated in Phase 1+ after the minimal gate is stable.
- Required deterministic baseline: run
compaction_correctness_while skippingcompaction_correctness_model_randomized. This keeps the required gate reproducible and focused on deterministic scenario coverage. - Optional expanded randomized lane: run
compaction_correctness_model_randomizedwithTONBO_COMPACTION_MODEL_SST=1,TONBO_COMPACTION_REOPEN=1, andTONBO_COMPACTION_EAGER_FLUSH=1(with explicit seed list) to intentionally exercise SST/compaction/reopen model behavior.
- Add tombstone pruning safety tests and MVCC watermark checks.
- Extend coverage to major compaction, manifest updates, and GC integration.
- Promote expanded randomized coverage from optional to required once SST/page-index stability is complete.
- Add crash/reopen validation as part of the gate.
- Add range tombstone semantics when supported.