feat: add range tombstone (delete_range) support #242
temporaryfix wants to merge 15 commits into fjall-rs:main from
Conversation
Introduce the foundational data structures for range tombstone support:

- RangeTombstone: half-open [start, end) interval with seqno, Ord impl (start asc, seqno desc, end asc), contains_key, visible_at, should_suppress, intersect_opt, fully_covers, CoveringRt
- ActiveTombstoneSet: forward iteration tracker with seqno multiset and min-heap expiry (monotonic IDs for deterministic ordering)
- ActiveTombstoneSetReverse: reverse iteration tracker with max-heap expiry, strict > activation for half-open end bound

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
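The commit above names `RangeTombstone`'s invariants; a minimal sketch of how those pieces could fit together (field names follow the commit message, but the bodies and the byte-vector key type are illustrative, not the PR's actual code):

```rust
use std::cmp::Ordering;

/// Half-open [start, end) deletion interval, stamped with the sequence
/// number at which the delete was issued.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RangeTombstone {
    pub start: Vec<u8>,
    pub end: Vec<u8>, // exclusive
    pub seqno: u64,
}

impl RangeTombstone {
    /// Key lies inside the half-open interval.
    pub fn contains_key(&self, key: &[u8]) -> bool {
        self.start.as_slice() <= key && key < self.end.as_slice()
    }

    /// Tombstone is visible to a reader snapshotted at `read_seqno`.
    pub fn visible_at(&self, read_seqno: u64) -> bool {
        self.seqno < read_seqno
    }

    /// Suppress a KV only if the tombstone covers the key, is newer than
    /// the KV, and is visible to the reading snapshot.
    pub fn should_suppress(&self, key: &[u8], key_seqno: u64, read_seqno: u64) -> bool {
        self.contains_key(key) && key_seqno < self.seqno && self.visible_at(read_seqno)
    }

    /// Clip to another half-open interval; None if they do not overlap.
    pub fn intersect_opt(&self, start: &[u8], end: &[u8]) -> Option<RangeTombstone> {
        let s = self.start.as_slice().max(start);
        let e = self.end.as_slice().min(end);
        if s < e {
            Some(RangeTombstone { start: s.to_vec(), end: e.to_vec(), seqno: self.seqno })
        } else {
            None
        }
    }
}

/// Order: start asc, then seqno desc, then end asc.
impl Ord for RangeTombstone {
    fn cmp(&self, other: &Self) -> Ordering {
        self.start
            .cmp(&other.start)
            .then(other.seqno.cmp(&self.seqno))
            .then(self.end.cmp(&other.end))
    }
}

impl PartialOrd for RangeTombstone {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
```

The seqno-descending tiebreak means that for a given start key, the newest tombstone sorts first, which is convenient when scanning for the strongest suppressor.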
- IntervalTree: AVL-balanced BST keyed by start, augmented with subtree_max_end, subtree_max_seqno, subtree_min_seqno for pruning. Supports query_suppression, overlapping_tombstones, covering_rt queries.
- Memtable: dual-indexed range tombstone storage (IntervalTree for point/overlap queries, BTreeMap by end-desc for reverse iteration). Methods: insert_range_tombstone, is_suppressed_by_range_tombstone, overlapping_tombstones, range_tombstones_by_start/end_desc.
- Error::InvalidBlock variant for corrupt block detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
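A sketch of the `subtree_max_end` pruning idea behind the `IntervalTree`, using a plain unbalanced binary tree (the PR's tree is AVL-balanced and also tracks seqno aggregates; this only shows why the augmentation lets a point query skip whole subtrees):

```rust
/// Interval-tree node keyed by tombstone start, augmented with the
/// maximum (exclusive) end found anywhere in its subtree.
struct Node {
    start: Vec<u8>,
    end: Vec<u8>,
    seqno: u64,
    subtree_max_end: Vec<u8>,
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

/// Highest tombstone seqno covering `key`; subtrees whose max end is
/// <= key cannot contain the point, so they are pruned.
fn query_suppression(node: Option<&Node>, key: &[u8]) -> Option<u64> {
    let n = node?;
    // Nothing in this subtree extends past `key`: prune it entirely.
    if n.subtree_max_end.as_slice() <= key {
        return None;
    }
    let mut best = None;
    if n.start.as_slice() <= key && key < n.end.as_slice() {
        best = Some(n.seqno);
    }
    // Left subtree may still overlap (its starts are smaller).
    best = best.max(query_suppression(n.left.as_deref(), key));
    // Right subtree only matters if starts <= key can exist there.
    if n.start.as_slice() <= key {
        best = best.max(query_suppression(n.right.as_deref(), key));
    }
    best
}

/// Unbalanced insert (the real tree does AVL rotations and keeps
/// seqno aggregates too).
fn insert(node: &mut Option<Box<Node>>, start: &[u8], end: &[u8], seqno: u64) {
    match node {
        None => {
            *node = Some(Box::new(Node {
                start: start.to_vec(),
                end: end.to_vec(),
                seqno,
                subtree_max_end: end.to_vec(),
                left: None,
                right: None,
            }));
        }
        Some(n) => {
            if n.subtree_max_end.as_slice() < end {
                n.subtree_max_end = end.to_vec();
            }
            if start < n.start.as_slice() {
                insert(&mut n.left, start, end, seqno);
            } else {
                insert(&mut n.right, start, end, seqno);
            }
        }
    }
}
```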
- Encoder: encode_by_start (prefix-compresses START, per-window max_end) and encode_by_end_desc (prefix-compresses END). Both produce backward-parseable footer format with restart points.
- ByStart decoder: query_suppression, overlapping_tombstones, query_covering_rt_for_range with per-window max_end pruning. Hard error on start >= end corruption.
- ByEndDesc decoder: iter() in end-desc order for reverse iteration. Hard error on start >= end corruption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
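The "prefix-compresses START/END" part boils down to classic shared-prefix encoding over a sorted key run. A hedged sketch of that sub-technique only — this is not the PR's actual wire format (no restart points, footer, or per-window max_end here):

```rust
/// Prefix compression for a sorted run of keys: each entry stores the
/// length of the prefix shared with the previous key plus the new suffix.
fn encode_prefix_compressed(keys: &[Vec<u8>]) -> Vec<(usize, Vec<u8>)> {
    let mut out = Vec::with_capacity(keys.len());
    let mut prev: &[u8] = &[];
    for key in keys {
        let shared = prev.iter().zip(key.iter()).take_while(|(a, b)| a == b).count();
        out.push((shared, key[shared..].to_vec()));
        prev = key;
    }
    out
}

/// Reverse: rebuild each key from the previous one plus the stored suffix.
fn decode_prefix_compressed(entries: &[(usize, Vec<u8>)]) -> Vec<Vec<u8>> {
    let mut out: Vec<Vec<u8>> = Vec::with_capacity(entries.len());
    for (shared, suffix) in entries {
        let mut key = out.last().map_or(Vec::new(), |p| p[..*shared].to_vec());
        key.extend_from_slice(suffix);
        out.push(key);
    }
    out
}
```

Restart points in the real format exist precisely because decoding is sequential from the last full key; a restart stores an uncompressed key so a decoder can begin mid-block.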
Integrate range tombstone filtering into the read path:

- Point reads check memtable range tombstones before returning KVs
- Range/prefix iteration wraps MvccStream with RangeTombstoneFilter
- Bidirectional filter supports both forward and reverse iteration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
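A sketch of the forward half of such a bidirectional filter, assuming entries arrive sorted by key: tombstones activate once their start key is reached and expire off a min-heap at their half-open end. (The PR uses a seqno multiset for cheap suppression checks; the linear scan over active tombstones here is just for clarity, and all names are illustrative.)

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// One MVCC entry coming out of the merge stream: (user key, seqno).
type Entry = (Vec<u8>, u64);
/// A range tombstone as (start, exclusive end, seqno).
type Rt = (Vec<u8>, Vec<u8>, u64);

/// Forward sweep: activate tombstones whose start <= key, expire those
/// whose end <= key (half-open), and drop entries older than any
/// active tombstone covering them.
fn filter_forward(entries: &[Entry], mut rts: Vec<Rt>) -> Vec<Entry> {
    rts.sort_by(|a, b| a.0.cmp(&b.0)); // by start, ascending
    let mut next = 0;
    // Min-heap on end key so the earliest-expiring tombstone pops first.
    let mut expiry: BinaryHeap<Reverse<(Vec<u8>, u64)>> = BinaryHeap::new();
    let mut out = Vec::new();

    for (key, seqno) in entries {
        // Activate every tombstone that has started by this key.
        while next < rts.len() && rts[next].0.as_slice() <= key.as_slice() {
            expiry.push(Reverse((rts[next].1.clone(), rts[next].2)));
            next += 1;
        }
        // Expire tombstones whose half-open end we have passed.
        while matches!(expiry.peek(), Some(Reverse((end, _))) if end.as_slice() <= key.as_slice()) {
            expiry.pop();
        }
        // Suppress if any still-active tombstone is newer than the entry.
        let suppressed = expiry.iter().any(|Reverse((_, rt_seqno))| rt_seqno > seqno);
        if !suppressed {
            out.push((key.clone(), *seqno));
        }
    }
    out
}
```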
- Add RangeTombstoneStart/End block types
- Load tombstone blocks from SST regions during Table::recover
- Write dual tombstone blocks (ByStart + ByEndDesc) in Writer::finish
- Add Table query methods: suppression, covering_rt, overlapping, iterators
- Store range_tombstone_count in table metadata

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Point reads now check range tombstone suppression from SST tables in addition to memtables
- Iteration collects range tombstones from SST tables alongside memtable tombstones for the RangeTombstoneFilter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Collect range tombstones from sealed memtables during flush
- Pass tombstones through flush_to_tables to MultiWriter
- MultiWriter writes tombstones to every output table
- Both Tree and BlobTree flush paths supported

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Collect range tombstones from input tables during compaction and write them to output tables. At the last level, evict tombstones below the MVCC GC watermark since no data beneath them can be resurrected. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
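The last-level eviction rule above can be sketched as a simple retain over the collected tombstones (the tuple representation and function name are illustrative, not the PR's API):

```rust
/// At the bottommost level, a range tombstone whose seqno is below the
/// MVCC GC watermark can be dropped: nothing it could suppress can be
/// resurrected, since no live snapshot can still see older versions.
fn gc_tombstones(
    mut rts: Vec<(Vec<u8>, Vec<u8>, u64)>, // (start, exclusive end, seqno)
    gc_watermark: u64,
    is_last_level: bool,
) -> Vec<(Vec<u8>, Vec<u8>, u64)> {
    if is_last_level {
        // Keep only tombstones still visible to some possible reader.
        rts.retain(|&(_, _, seqno)| seqno >= gc_watermark);
    }
    rts
}
```

On non-last levels every tombstone is kept, because data it suppresses may still exist in deeper levels.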
Covers memtable point reads, range iteration, flush, compaction, end-exclusive semantics, reverse iteration, MVCC visibility, overlapping tombstones, cross-layer suppression, and GC threshold. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When MultiWriter rotates or finishes, range tombstones are now clipped to each output table's key range using intersect_opt(). This avoids writing tombstones that don't overlap with a table's data, and ensures each table only stores the relevant portion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
During range iteration setup, tables that are fully covered by a range tombstone (with higher seqno than the table's max) are now skipped entirely, avoiding unnecessary I/O. Tombstones are collected from all sources before building the merge iterator.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
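The table-skip condition can be sketched as a predicate (names hypothetical); note that the half-open end means the table's maximum key must be strictly below the tombstone's end:

```rust
/// A table can be skipped entirely when a single visible tombstone
/// fully covers its key range and is newer than anything in the table.
fn can_skip_table(
    table_min: &[u8],      // smallest key in the table (inclusive)
    table_max: &[u8],      // largest key in the table (inclusive)
    table_max_seqno: u64,  // newest seqno stored in the table
    rt_start: &[u8],       // tombstone start (inclusive)
    rt_end: &[u8],         // tombstone end (exclusive)
    rt_seqno: u64,
) -> bool {
    rt_start <= table_min && table_max < rt_end && rt_seqno > table_max_seqno
}
```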
Confirms that when a range tombstone is evicted at the bottom level (gc_watermark > tombstone seqno), the underlying data that survived compaction becomes visible again via point reads. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Apologies, forgot to lint, working on it now.
Fix deny-level clippy lints (map_or -> is_some_and/is_none_or, unnecessary Ok(x?), unwrap_used/indexing_slicing in tests, type complexity) and add missing #[must_use], #[expect], and doc sections across all range tombstone files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add AtomicBool fast path on Memtable to skip range tombstone RwLock acquisition when no tombstones exist
- Add has_sst_range_tombstones flag on SuperVersion to skip SST table iteration for point reads and range scans when no SSTs have tombstones
- Skip RangeTombstoneFilter wrapping entirely when no tombstones collected
- Deduplicate overlapping range tombstones during compaction using sort + dedup_by to remove tombstones fully covered by a retained one
- Read range_tombstone_count from SST metadata (optional, default 0 for backward compat with older v3 tables)
- Add tests for fast-path correctness and compaction dedup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
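A sketch of the AtomicBool fast path (struct and method names are illustrative, not the PR's API): the flag is set once on first insert, so the common no-tombstone point read never touches the RwLock.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::RwLock;

/// Hypothetical slice of the memtable: the atomic flag lets point reads
/// skip lock acquisition entirely in the common no-tombstones case.
struct MemtableRts {
    has_range_tombstones: AtomicBool,
    tombstones: RwLock<Vec<(Vec<u8>, Vec<u8>, u64)>>, // (start, excl. end, seqno)
}

impl MemtableRts {
    fn new() -> Self {
        Self {
            has_range_tombstones: AtomicBool::new(false),
            tombstones: RwLock::new(Vec::new()),
        }
    }

    fn insert(&self, start: Vec<u8>, end: Vec<u8>, seqno: u64) {
        self.tombstones.write().unwrap().push((start, end, seqno));
        // Publish only after the tombstone is in place.
        self.has_range_tombstones.store(true, Ordering::Release);
    }

    fn is_suppressed(&self, key: &[u8], key_seqno: u64) -> bool {
        // Fast path: no tombstone was ever inserted, skip the lock.
        if !self.has_range_tombstones.load(Ordering::Acquire) {
            return false;
        }
        self.tombstones.read().unwrap().iter().any(|(s, e, rt_seqno)| {
            s.as_slice() <= key && key < e.as_slice() && key_seqno < *rt_seqno
        })
    }
}
```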
Will do some benchmarks on this when I get the chance, to prove that it doesn't reduce performance.
Can't wait to use this feature!
@ariesdevil I would love another person with eyes on it, if you have time to do a review. I spent a couple of days working with multiple AIs on formulating a plan before I started implementing, but I'm still not convinced this is perfect. It is a big feature to implement in a codebase that is unfamiliar to me. It is also a lot to ask a code owner to review, with over 4000 lines added. I tried breaking it down into atomic commits to make it easier. The more eyes that are on this, the more comfortable @marvin-j97 will feel about merging it. Also, we absolutely must ensure this is properly benchmarked so we do not cause performance regressions.
The first thing that seems awkward to me is the duplication of range tombstones for start and end inside tables. Ideally you would just store a table's range tombstones using a DataBlock. That way we don't have another block implementation, as the data blocks are pretty tried and tested.
```rust
tree.insert("c", "val_c", 3);

// Insert range tombstone [a, d) at seqno 10
tree.active_memtable()
```
AbstractTree should then have a remove_range method, so we don't have to access the active memtable manually.
@marvin-j97 cheers for the initial review. I agree it’s a bit awkward to add bespoke block formats; reusing the existing DataBlock path would be the obvious “less code” move, and it’s well-tested. The reason I kept range tombstones separate is the access pattern mismatch:
Splitting into ByStart + ByEndDesc lets us:
This isn’t us inventing a new pattern: RocksDB does not store range tombstones in regular data blocks either. They have dedicated table metadata/blocks and specialised handling (incl fragmentation/caching):
Their shape is different (single start-sorted list + fragmentation) whereas we keep two orders (start-asc + end-desc) and avoid fragmentation to make reverse streaming + pruning straightforward. But the underlying precedent is the same: tombstones are special enough to deserve special storage/read paths. That said, I’m not religious about the exact encoding. If the repo strongly prefers DataBlock reuse for maintainability/reviewability, we can get most of the same properties with two DataBlocks:
We would need to keep per-window max_end as a small sidecar (or another tiny DataBlock) to preserve the pruning wins. Understood re: Let me know what you think; happy to prototype the DataBlock version or discuss further.
That's partially true. The blocks are designed for binary search with lower + upper bounds. But the upper bound does not really help with range tombstones, as explained in the next paragraph.
Indeed I think this is pretty unavoidable, because range tombstones are inherently a 2D-problem (range + temporality (seqno)). RocksDB also turns the stored range tombstones into a different in-memory representation when the table is first loaded. Unless you can figure out some kind of specialized data structure (new block type) that allows queries similar to a {KD/range/segment/interval} tree that does not need a deserialization step (similar to e.g. LOUDS encoded trie or zero-copy schemes).
I think for all intents and purposes the range tombstones in RocksDB are stored in data blocks without binary seek index. See the image in https://rocksdb.org/blog/2018/11/21/delete-range.html

Taking a look at Pebble may be worth it.
Expose remove_range on the AbstractTree trait so callers (including tests) no longer need to reach into active_memtable() directly to insert range tombstones. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This recent paper suggests I may be barking up the wrong tree anyway: https://arxiv.org/html/2511.06061v1 @marvin-j97
Generally I'm open to having alternative implementations. Having a separate data structure in the LSM-tree is generally possible by adding a new field into the Version files (that's how the value log works). However, such auxiliary indexes must be immutable/log-structured as they are only allowed to change when a new Version is created, which this paper seems to do by having an in-memory index and DR-tree disk files.
Apologies, I do intend to finish this, but life has gotten in the way.
- Add RangeTombstone, ActiveTombstoneSet, IntervalTree core types (ported from upstream PR fjall-rs#242 algorithms, own SST persistence)
- Add RangeTombstoneFilter for bidirectional iteration suppression
- Integrate into Memtable with interval tree for O(log n) queries
- Add remove_range(start, end, seqno) and remove_prefix(prefix, seqno) to AbstractTree trait, implemented for Tree and BlobTree
- Wire suppression into point reads (memtable + sealed + SST layers)
- Wire RangeTombstoneFilter into range/prefix iteration pipeline
- SST persistence: raw wire format in BlockType::RangeTombstone block
- Flush: collect range tombstones from sealed memtables, write to SSTs
- Compaction: propagate RTs from input to output tables with clipping
- GC: evict range tombstones below watermark at bottom level
- Table-skip: skip tables fully covered by a range tombstone
- MultiWriter: clip RTs to each output table's key range on rotation
- Handle RT-only tables (derive key range from tombstone bounds)
- 16 integration tests covering all paths

Closes #16
We independently implemented range tombstones in our fork (structured-world/lsm-tree#21) and wanted to share a few notes in case they're useful.

Storage format: We went with a single raw block (`BlockType::RangeTombstone`).

Key differences from this PR:
What we kept the same: IntervalTree (AVL-balanced), ActiveTombstoneSet (sweep-line), RangeTombstoneFilter (bidirectional wrapper), compaction GC at last level, flush with unclipped RTs.

The tradeoff is clear: our encoding is simpler but loses the pruning/prefix-compression wins for large RT counts. For our use case (graph bulk deletion, i.e. few large RTs rather than many small ones) this is fine. Happy to discuss or cherry-pick anything useful.
@polaz If you have a working implementation that could work here, cherry-picking would be great, thank you.
Summary
Implements native range tombstone support for the LSM-tree, enabling efficient deletion of contiguous key ranges without writing individual tombstones per key.
Closes #2
Motivation
Currently, deleting a range of keys requires iterating over the range and writing a point tombstone for each key. This is expensive for large ranges, both in write amplification and in the tombstones that must be compacted away later. Range tombstones (as described in the RocksDB DeleteRange design) solve this by recording a single `[start, end)` interval with a sequence number, suppressing all keys within the range that have a lower seqno.

What's included
Core types
- `RangeTombstone`: half-open `[start, end)` interval with seqno. Supports `contains_key`, `visible_at`, `should_suppress`, `intersect_opt`, and `fully_covers` queries. Ordered by `(start asc, seqno desc, end asc)`.
- `ActiveTombstoneSet` / `ActiveTombstoneSetReverse`: streaming sweep-line trackers for forward and reverse iteration, using a seqno multiset and min/max-heap expiry to efficiently determine suppression without rescanning all tombstones.
- `CoveringRt`: returned by table-skip queries to identify tombstones that fully cover a table's key range.

Memtable integration
- `IntervalTree`: AVL-balanced BST keyed by start, augmented with `subtree_max_end` / `subtree_max_seqno` / `subtree_min_seqno` for efficient pruning. Supports `query_suppression`, `overlapping_tombstones`, and `covering_rt` queries.
- Dual-indexed storage: `IntervalTree` for point/overlap queries + `BTreeMap` by end-desc for reverse iteration.
- Methods: `insert_range_tombstone`, `is_suppressed_by_range_tombstone`, `overlapping_tombstones`, `range_tombstones_by_start/end_desc`.

SST block format
- Encoders: `encode_by_start` (prefix-compressed START keys with per-window `max_end`) and `encode_by_end_desc` (prefix-compressed END keys). Both produce backward-parseable footer format with restart points.
- ByStart decoder: `query_suppression`, `overlapping_tombstones`, `query_covering_rt_for_range` with per-window `max_end` pruning for early exit.
- ByEndDesc decoder: `iter()` in end-descending order for reverse iteration.
- New `BlockType` variants: `RangeTombstoneStart` and `RangeTombstoneEnd`.

Read path
- `RangeTombstoneFilter` wraps `MvccStream`, collecting tombstones from all sources and using `ActiveTombstoneSet` (forward) or `ActiveTombstoneSetReverse` (reverse) to suppress entries during iteration.

Write path
- Flush and compaction write tombstones to output tables via `MultiWriter`, clipping each to the table's key range with `.intersect_opt()`. At the last compaction level, tombstones below the MVCC GC watermark are evicted since no data beneath them can be resurrected.

Note on GC behavior: After a range tombstone is evicted at the bottom level (when its seqno is below the GC watermark), the underlying data becomes visible again to new reads. Physical cleanup of suppressed keys happens lazily during future compactions when no snapshots hold them.
MVCC correctness
- `visible_at(read_seqno)` ensures only tombstones with `seqno < read_seqno` are considered.
- `should_suppress(key_seqno, read_seqno)` ensures a tombstone only suppresses keys with lower seqno than the tombstone itself.

Test plan
- Integration tests (`tests/range_tombstone.rs`) covering half-open `[start, end)` semantics
- Unit tests for `IntervalTree`, `ActiveTombstoneSet`, `RangeTombstone`, block encoders/decoders
- `cargo test` passes (312 unit + 23 doc tests)

Stats
🤖 Generated with Claude Code