
feat: add range tombstone (delete_range) support#242

Open
temporaryfix wants to merge 15 commits into fjall-rs:main from temporaryfix:feature/range-delete

Conversation


@temporaryfix temporaryfix commented Feb 6, 2026

Summary

Implements native range tombstone support for the LSM-tree, enabling efficient deletion of contiguous key ranges without writing individual tombstones per key.

Closes #2

Motivation

Currently, deleting a range of keys requires iterating over the range and writing a point tombstone for each key. This is expensive for large ranges — both in write amplification and in the tombstones that must be compacted away later. Range tombstones (as described in the RocksDB DeleteRange design) solve this by recording a single [start, end) interval with a sequence number, suppressing all keys within the range that have a lower seqno.

What's included

Core types

  • RangeTombstone: half-open [start, end) interval with seqno. Supports contains_key, visible_at, should_suppress, intersect_opt, and fully_covers queries. Ordered by (start asc, seqno desc, end asc).
  • ActiveTombstoneSet / ActiveTombstoneSetReverse: streaming sweep-line trackers for forward and reverse iteration, using a seqno multiset and min/max-heap expiry to efficiently determine suppression without rescanning all tombstones.
  • CoveringRt: returned by table-skip queries to identify tombstones that fully cover a table's key range.
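The core type described above can be sketched roughly as follows. This is a minimal illustration based on the bullet points, not the PR's actual code — field layout and exact signatures are assumptions:

```rust
use std::cmp::Reverse;

// Sketch of the core type; names follow the PR description, details assumed.
#[derive(Debug, Clone, PartialEq, Eq)]
struct RangeTombstone {
    start: Vec<u8>,
    end: Vec<u8>, // exclusive bound of the half-open [start, end) interval
    seqno: u64,
}

impl RangeTombstone {
    // Half-open containment: start <= key < end.
    fn contains_key(&self, key: &[u8]) -> bool {
        self.start.as_slice() <= key && key < self.end.as_slice()
    }

    // Sort key implementing the stated order: (start asc, seqno desc, end asc).
    fn sort_key(&self) -> (&[u8], Reverse<u64>, &[u8]) {
        (&self.start, Reverse(self.seqno), &self.end)
    }
}

fn main() {
    let rt = RangeTombstone { start: b"a".to_vec(), end: b"d".to_vec(), seqno: 10 };
    assert!(rt.contains_key(b"a")); // start is inclusive
    assert!(!rt.contains_key(b"d")); // end is exclusive

    // With equal starts, the newer (higher-seqno) tombstone sorts first.
    let mut v = vec![
        RangeTombstone { start: b"a".to_vec(), end: b"c".to_vec(), seqno: 5 },
        RangeTombstone { start: b"a".to_vec(), end: b"z".to_vec(), seqno: 9 },
    ];
    v.sort_by(|x, y| x.sort_key().cmp(&y.sort_key()));
    assert_eq!(v[0].seqno, 9);
}
```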

Memtable integration

  • IntervalTree: AVL-balanced BST keyed by start, augmented with subtree_max_end / subtree_max_seqno / subtree_min_seqno for efficient pruning. Supports query_suppression, overlapping_tombstones, and covering_rt queries.
  • Dual-indexed storage: IntervalTree for point/overlap queries + BTreeMap by end-desc for reverse iteration.
  • Methods: insert_range_tombstone, is_suppressed_by_range_tombstone, overlapping_tombstones, range_tombstones_by_start/end_desc.
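A much-simplified illustration of the subtree_max_end augmentation: each node caches the maximum end key in its subtree, so a stabbing query can prune any subtree whose cached max is at or below the probe key. The PR's IntervalTree is AVL-balanced and also tracks seqno aggregates; this unbalanced sketch (all names illustrative) shows only the pruning idea:

```rust
// Unbalanced BST keyed by start, augmented with the subtree's max end.
struct Node {
    start: Vec<u8>,
    end: Vec<u8>,
    max_end: Vec<u8>, // maximum `end` anywhere in this subtree
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

fn insert(node: &mut Option<Box<Node>>, start: &[u8], end: &[u8]) {
    match node {
        None => {
            *node = Some(Box::new(Node {
                start: start.to_vec(),
                end: end.to_vec(),
                max_end: end.to_vec(),
                left: None,
                right: None,
            }));
        }
        Some(n) => {
            if end > n.max_end.as_slice() {
                n.max_end = end.to_vec(); // maintain the augmentation on the way down
            }
            if start < n.start.as_slice() {
                insert(&mut n.left, start, end);
            } else {
                insert(&mut n.right, start, end);
            }
        }
    }
}

// Collect all intervals containing `key` (start <= key < end).
fn stab<'a>(node: Option<&'a Node>, key: &[u8], out: &mut Vec<(&'a [u8], &'a [u8])>) {
    let Some(n) = node else { return };
    // Prune: nothing in this subtree ends after `key`, so nothing covers it.
    if n.max_end.as_slice() <= key {
        return;
    }
    stab(n.left.as_deref(), key, out);
    if n.start.as_slice() <= key && key < n.end.as_slice() {
        out.push((n.start.as_slice(), n.end.as_slice()));
    }
    // Right subtree holds starts >= n.start; skip it if even n.start > key.
    if n.start.as_slice() <= key {
        stab(n.right.as_deref(), key, out);
    }
}

fn main() {
    let mut root: Option<Box<Node>> = None;
    insert(&mut root, b"a", b"d");
    insert(&mut root, b"f", b"h");

    let mut hits = Vec::new();
    stab(root.as_deref(), b"b", &mut hits);
    assert_eq!(hits, vec![(b"a".as_slice(), b"d".as_slice())]);

    hits.clear();
    stab(root.as_deref(), b"e", &mut hits); // gap between the two intervals
    assert!(hits.is_empty());
}
```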

SST block format

  • Encoder: encode_by_start (prefix-compressed START keys with per-window max_end) and encode_by_end_desc (prefix-compressed END keys). Both produce backward-parseable footer format with restart points.
  • ByStart decoder: query_suppression, overlapping_tombstones, query_covering_rt_for_range with per-window max_end pruning for early exit.
  • ByEndDesc decoder: iter() in end-descending order for reverse iteration.
  • Two new BlockType variants: RangeTombstoneStart and RangeTombstoneEnd.

Read path

  • Point reads: after resolving a key from memtables or SSTs, check range tombstone suppression across all layers (active memtable, sealed memtables, SST tables). A key is suppressed if any visible tombstone covers it with a higher seqno.
  • Range/prefix iteration: RangeTombstoneFilter wraps MvccStream, collecting tombstones from all sources and using ActiveTombstoneSet (forward) or ActiveTombstoneSetReverse (reverse) to suppress entries during iteration.
  • Table skipping: during iteration setup, tables fully covered by a range tombstone (where tombstone seqno > table's max seqno) are skipped entirely, avoiding unnecessary I/O.
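The table-skip condition in the last bullet reduces to a small predicate — skip a table only when a visible tombstone fully covers its key range and is newer than every entry inside it. Struct and function names here are hypothetical:

```rust
// Illustrative table metadata; the real type in the codebase differs.
struct TableMeta {
    min_key: Vec<u8>,
    max_key: Vec<u8>, // inclusive upper bound of the table's key range
    max_seqno: u64,   // newest entry in the table
}

// A table is skippable when the half-open tombstone [rt_start, rt_end)
// spans [min_key, max_key] AND the tombstone outranks every entry.
fn can_skip_table(t: &TableMeta, rt_start: &[u8], rt_end: &[u8], rt_seqno: u64) -> bool {
    let fully_covers =
        rt_start <= t.min_key.as_slice() && t.max_key.as_slice() < rt_end;
    fully_covers && rt_seqno > t.max_seqno
}

fn main() {
    let t = TableMeta { min_key: b"b".to_vec(), max_key: b"m".to_vec(), max_seqno: 40 };
    assert!(can_skip_table(&t, b"a", b"z", 50));  // covered and newer: skip
    assert!(!can_skip_table(&t, b"a", b"z", 30)); // some entries outrank the tombstone
    assert!(!can_skip_table(&t, b"c", b"z", 50)); // does not cover min_key
}
```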

Write path

  • Flush: range tombstones from sealed memtables are collected and written into output SSTs via MultiWriter.
  • Compaction: range tombstones from input tables are collected and written into output tables. Tombstones are clipped to each output table's key range via intersect_opt(). At the last compaction level, tombstones below the MVCC GC watermark are evicted since no data beneath them can be resurrected.

Note on GC behavior: After a range tombstone is evicted at the bottom level (when its seqno is below the GC watermark), the underlying data becomes visible again to new reads. Physical cleanup of suppressed keys happens lazily during future compactions when no snapshots hold them.
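The clipping step during compaction amounts to an interval intersection that returns nothing when the tombstone and the output table's key range don't overlap. A sketch in the spirit of intersect_opt() (signature assumed, not the PR's actual method):

```rust
// Intersect two half-open byte ranges; None means no overlap, i.e. the
// tombstone is dropped for this output table rather than written empty.
fn intersect_opt(
    (a_start, a_end): (&[u8], &[u8]),
    (b_start, b_end): (&[u8], &[u8]),
) -> Option<(Vec<u8>, Vec<u8>)> {
    let start = a_start.max(b_start);
    let end = a_end.min(b_end);
    if start < end {
        Some((start.to_vec(), end.to_vec()))
    } else {
        None // empty or inverted: no tombstone to write
    }
}

fn main() {
    // Tombstone [a, f) clipped to a table covering [c, z) keeps only [c, f).
    assert_eq!(
        intersect_opt((b"a".as_slice(), b"f".as_slice()), (b"c".as_slice(), b"z".as_slice())),
        Some((b"c".to_vec(), b"f".to_vec()))
    );
    // Disjoint ranges produce no output tombstone at all.
    assert_eq!(
        intersect_opt((b"a".as_slice(), b"b".as_slice()), (b"c".as_slice(), b"d".as_slice())),
        None
    );
}
```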

MVCC correctness

  • Tombstones respect snapshot visibility: visible_at(read_seqno) ensures only tombstones with seqno < read_seqno are considered.
  • should_suppress(key_seqno, read_seqno) ensures a tombstone only suppresses keys with lower seqno than the tombstone itself.
  • Tombstone eviction during compaction respects the GC watermark to preserve snapshot isolation.
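The first two rules above boil down to two small predicates. This is a sketch of the stated semantics, not the PR's actual methods (which live on RangeTombstone):

```rust
// A tombstone is visible to a snapshot only if it was written strictly
// before the snapshot's read seqno.
fn visible_at(tombstone_seqno: u64, read_seqno: u64) -> bool {
    tombstone_seqno < read_seqno
}

// A key version is suppressed only if the tombstone is visible to the
// reader AND the version is older than the tombstone itself.
fn should_suppress(tombstone_seqno: u64, key_seqno: u64, read_seqno: u64) -> bool {
    visible_at(tombstone_seqno, read_seqno) && key_seqno < tombstone_seqno
}

fn main() {
    // A snapshot at seqno 5 cannot see a tombstone written at seqno 10.
    assert!(!visible_at(10, 5));
    // A key version at seqno 3 is hidden from a reader at seqno 20.
    assert!(should_suppress(10, 3, 20));
    // Versions written after the tombstone always survive it.
    assert!(!should_suppress(10, 12, 20));
}
```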

Test plan

  • 13 end-to-end integration tests (tests/range_tombstone.rs) covering:
    • Memtable point reads and range iteration
    • Flush persistence and recovery
    • Compaction propagation and tombstone clipping
    • End-exclusive [start, end) semantics
    • Reverse iteration
    • MVCC snapshot visibility
    • Overlapping tombstones
    • Cross-layer suppression (memtable tombstone suppressing SST data)
    • GC threshold eviction (data becomes visible after tombstone eviction)
    • Table skip optimization during iteration
  • Unit tests for IntervalTree, ActiveTombstoneSet, RangeTombstone, block encoders/decoders
  • Full cargo test passes (312 unit + 23 doc tests)

Stats

  • 23 files changed, ~3,900 lines added
  • 12 commits, incrementally buildable

🤖 Generated with Claude Code

temporaryfix and others added 12 commits February 6, 2026 13:58
Introduce the foundational data structures for range tombstone support:

- RangeTombstone: half-open [start, end) interval with seqno, Ord impl
  (start asc, seqno desc, end asc), contains_key, visible_at,
  should_suppress, intersect_opt, fully_covers, CoveringRt
- ActiveTombstoneSet: forward iteration tracker with seqno multiset
  and min-heap expiry (monotonic IDs for deterministic ordering)
- ActiveTombstoneSetReverse: reverse iteration tracker with max-heap
  expiry, strict > activation for half-open end bound

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- IntervalTree: AVL-balanced BST keyed by start, augmented with
  subtree_max_end, subtree_max_seqno, subtree_min_seqno for pruning.
  Supports query_suppression, overlapping_tombstones, covering_rt queries.
- Memtable: dual-indexed range tombstone storage (IntervalTree for
  point/overlap queries, BTreeMap by end-desc for reverse iteration).
  Methods: insert_range_tombstone, is_suppressed_by_range_tombstone,
  overlapping_tombstones, range_tombstones_by_start/end_desc.
- Error::InvalidBlock variant for corrupt block detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Encoder: encode_by_start (prefix-compresses START, per-window max_end)
  and encode_by_end_desc (prefix-compresses END). Both produce
  backward-parseable footer format with restart points.
- ByStart decoder: query_suppression, overlapping_tombstones,
  query_covering_rt_for_range with per-window max_end pruning.
  Hard error on start >= end corruption.
- ByEndDesc decoder: iter() in end-desc order for reverse iteration.
  Hard error on start >= end corruption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Integrate range tombstone filtering into the read path:
- Point reads check memtable range tombstones before returning KVs
- Range/prefix iteration wraps MvccStream with RangeTombstoneFilter
- Bidirectional filter supports both forward and reverse iteration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add RangeTombstoneStart/End block types
- Load tombstone blocks from SST regions during Table::recover
- Write dual tombstone blocks (ByStart + ByEndDesc) in Writer::finish
- Add Table query methods: suppression, covering_rt, overlapping, iterators
- Store range_tombstone_count in table metadata

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Point reads now check range tombstone suppression from SST tables
  in addition to memtables
- Iteration collects range tombstones from SST tables alongside
  memtable tombstones for the RangeTombstoneFilter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Collect range tombstones from sealed memtables during flush
- Pass tombstones through flush_to_tables to MultiWriter
- MultiWriter writes tombstones to every output table
- Both Tree and BlobTree flush paths supported

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Collect range tombstones from input tables during compaction and write
them to output tables. At the last level, evict tombstones below the
MVCC GC watermark since no data beneath them can be resurrected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Covers memtable point reads, range iteration, flush, compaction,
end-exclusive semantics, reverse iteration, MVCC visibility,
overlapping tombstones, cross-layer suppression, and GC threshold.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When MultiWriter rotates or finishes, range tombstones are now
clipped to each output table's key range using intersect_opt().
This avoids writing tombstones that don't overlap with a table's
data, and ensures each table only stores the relevant portion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tones

During range iteration setup, tables that are fully covered by a
range tombstone (with higher seqno than the table's max) are now
skipped entirely, avoiding unnecessary I/O. Tombstones are collected
from all sources before building the merge iterator.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Confirms that when a range tombstone is evicted at the bottom level
(gc_watermark > tombstone seqno), the underlying data that survived
compaction becomes visible again via point reads.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@temporaryfix
Author

Apologies, I forgot to lint; working on it now.

…code

Fix deny-level clippy lints (map_or -> is_some_and/is_none_or,
unnecessary Ok(x?), unwrap_used/indexing_slicing in tests, type
complexity) and add missing #[must_use], #[expect], and doc sections
across all range tombstone files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codecov

codecov bot commented Feb 6, 2026

- Add AtomicBool fast path on Memtable to skip range tombstone RwLock
  acquisition when no tombstones exist
- Add has_sst_range_tombstones flag on SuperVersion to skip SST table
  iteration for point reads and range scans when no SSTs have tombstones
- Skip RangeTombstoneFilter wrapping entirely when no tombstones collected
- Deduplicate overlapping range tombstones during compaction using
  sort + dedup_by to remove tombstones fully covered by a retained one
- Read range_tombstone_count from SST metadata (optional, default 0 for
  backward compat with older v3 tables)
- Add tests for fast-path correctness and compaction dedup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@temporaryfix
Author

Will run some benchmarks on this when I get the chance, to verify that it doesn't reduce performance.

@ariesdevil

Can't wait to use this feature!

@temporaryfix
Author

temporaryfix commented Feb 10, 2026

@ariesdevil I'd love another pair of eyes on it if you have time to do a review. I spent a couple of days working with multiple AIs on formulating a plan before I started implementing, but I'm still not convinced this is perfect.

It is a big feature to implement in a code base that is unfamiliar to me. It is also a lot to ask a code owner to review with over 4000 lines added. I tried breaking it down into atomic commits to make it easier.

The more eyes that are on this, the more comfortable @marvin-j97 will feel about merging it. Also we absolutely must ensure this is properly benchmarked so we do not cause performance regressions.

@marvin-j97
Contributor

marvin-j97 commented Feb 10, 2026

The first thing that seems awkward to me is the duplication of range tombstones for start and end inside tables. Ideally you would just store a table's range tombstones using a DataBlock. That way we don't have another block implementation, as the data blocks are pretty tried and tested.

Comment thread on tests/range_tombstone.rs (outdated)
tree.insert("c", "val_c", 3);

// Insert range tombstone [a, d) at seqno 10
tree.active_memtable()
Contributor


AbstractTree should then have a remove_range method, so we don't have to access the active memtable manually.

@temporaryfix
Author

temporaryfix commented Feb 10, 2026

@marvin-j97 cheers for the initial review.

I agree it’s a bit awkward to add bespoke block formats, reusing the existing DataBlock path would be the obvious “less code” move and it’s well-tested.

The reason I kept range tombstones separate is the access pattern mismatch:

  • KV DataBlocks are optimised around “key -> value” lookups + forward iteration.
  • Range tombstones are mostly overlap/cover checks (start <= key < end) and fast pruning based on end coverage.
  • Reverse MVCC specifically needs streaming activation on (end > current_key). If tombstones are only start-sorted, reverse either turns into scanning/building an in-memory end index, or you pay a lot of runtime work per iterator.

Splitting into ByStart + ByEndDesc lets us:

  1. prefix-compress starts in one block and ends in the other
  2. bake in per-window max_end pruning (skip whole restart windows early)
  3. keep reverse activation streaming/bounded (no full scan / no “index on open”)

This isn’t us inventing a new pattern: RocksDB does not store range tombstones in regular data blocks either. They have dedicated table metadata/blocks and specialised handling (including fragmentation and caching).

Their shape is different (single start-sorted list + fragmentation) whereas we keep two orders (start-asc + end-desc) and avoid fragmentation to make reverse streaming + pruning straightforward. But the underlying precedent is the same: tombstones are special enough to deserve special storage/read paths.

That said, I’m not religious about the exact encoding. If the repo strongly prefers DataBlock reuse for maintainability/reviewability, we can get most of the same properties with two DataBlocks:

  1. key = start, value = end|seqno (forward overlap/cover)
  2. key = end (or order-preserving inversion), value = start|seqno (reverse activation)

We would need to keep per-window max_end as a small sidecar (or another tiny DataBlock) to preserve the pruning wins.
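One possible shape for the value encoding in variant 1 (key = start, value = end|seqno): append the seqno as a fixed-width little-endian suffix, so splitting the value back apart needs no length prefix. This is purely illustrative — names and layout are assumptions, not an agreed format:

```rust
// Pack the end key plus seqno into a single DataBlock value.
fn encode_value(end: &[u8], seqno: u64) -> Vec<u8> {
    let mut v = Vec::with_capacity(end.len() + 8);
    v.extend_from_slice(end);
    v.extend_from_slice(&seqno.to_le_bytes()); // fixed 8-byte suffix
    v
}

// Split the value back into (end key, seqno).
fn decode_value(v: &[u8]) -> (&[u8], u64) {
    let (end, seq) = v.split_at(v.len() - 8);
    (end, u64::from_le_bytes(seq.try_into().unwrap()))
}

fn main() {
    let v = encode_value(b"zzz", 42);
    let (end, seqno) = decode_value(&v);
    assert_eq!(end, b"zzz".as_slice());
    assert_eq!(seqno, 42);
}
```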

Understood re: AbstractTree::remove_range; I’ll add that and route the tests through the public API so they don’t poke active_memtable directly.

Let me know what you think, happy to prototype the DataBlock version or discuss further.

@marvin-j97
Contributor

marvin-j97 commented Feb 10, 2026

  • KV DataBlocks are optimised around “key -> value” lookups + forward iteration.

That's partially true. The blocks are designed for binary search with lower + upper bounds. But the upper bound does not really help with range tombstones, as explained in the next paragraph.

If tombstones are only start-sorted, reverse either turns into scanning/building an in-memory end index, or you pay a lot of runtime work per iterator.

Indeed I think this is pretty unavoidable, because range tombstones are inherently a 2D-problem (range + temporality (seqno)). RocksDB also turns the stored range tombstones into a different in-memory representation when the table is first loaded. Unless you can figure out some kind of specialized data structure (new block type) that allows queries similar to a {KD/range/segment/interval} tree that does not need a deserialization step (similar to e.g. LOUDS encoded trie or zero-copy schemes).

RocksDB does not store range tombstones in regular data blocks either

I think for all intents and purposes the range tombstones in RocksDB are stored in data blocks without binary seek index. See the image in https://rocksdb.org/blog/2018/11/21/delete-range.html
Though they don't fragment them I think, that was a todo on their part: "Another future optimization is to create a new format version that requires range tombstones to be stored in a fragmented form".

Taking a look at Pebble may be worth it.

Expose remove_range on the AbstractTree trait so callers (including
tests) no longer need to reach into active_memtable() directly to
insert range tombstones.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@temporaryfix
Author

This recent paper suggests I may be barking up the wrong tree anyway: https://arxiv.org/html/2511.06061v1 @marvin-j97

@marvin-j97
Contributor

marvin-j97 commented Feb 12, 2026

This recent paper suggests I may be barking up the wrong tree anyway: https://arxiv.org/html/2511.06061v1 @marvin-j97

Generally I'm open to having alternative implementations. Having a separate data structure in the LSM-tree is generally possible by adding a new field into the Version files (that's how the value log works). However, such auxiliary indexes must be immutable/log-structured as they are only allowed to change when a new Version is created, which this paper seems to do by having an in-memory index and DR-tree disk files.

Edit: The paper does not mention range/prefix reads, so you'd hope it still works for those. (see Exp. 7)

@temporaryfix
Author

temporaryfix commented Mar 15, 2026

Apologies, I do intend to finish this, but life has gotten in the way.

polaz added a commit to structured-world/coordinode-lsm-tree that referenced this pull request Mar 16, 2026
- Add RangeTombstone, ActiveTombstoneSet, IntervalTree core types
  (ported from upstream PR fjall-rs#242 algorithms, own SST persistence)
- Add RangeTombstoneFilter for bidirectional iteration suppression
- Integrate into Memtable with interval tree for O(log n) queries
- Add remove_range(start, end, seqno) and remove_prefix(prefix, seqno)
  to AbstractTree trait, implemented for Tree and BlobTree
- Wire suppression into point reads (memtable + sealed + SST layers)
- Wire RangeTombstoneFilter into range/prefix iteration pipeline
- SST persistence: raw wire format in BlockType::RangeTombstone block
- Flush: collect range tombstones from sealed memtables, write to SSTs
- Compaction: propagate RTs from input to output tables with clipping
- GC: evict range tombstones below watermark at bottom level
- Table-skip: skip tables fully covered by a range tombstone
- MultiWriter: clip RTs to each output table's key range on rotation
- Handle RT-only tables (derive key range from tombstone bounds)
- 16 integration tests covering all paths

Closes #16
@polaz

polaz commented Mar 18, 2026

We independently implemented range tombstones in our fork (structured-world/lsm-tree#21) and wanted to share a few notes in case they're useful.

Storage format: We went with a single raw block (BlockType::RangeTombstone) using a flat wire format: [start_len:u16_le][start][end_len:u16_le][end][seqno:u64_le] repeated. No prefix compression, no per-window pruning — the block is decoded into Vec<RangeTombstone> at table load time. This is closer to what @marvin-j97 suggested (reusing simple block storage, in-memory representation for queries) — one block type instead of two custom ones.
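A decoder written directly from the stated layout — [start_len:u16_le][start][end_len:u16_le][end][seqno:u64_le] repeated — including the decode-time start < end invariant check mentioned below. This is a sketch derived from the description, not the fork's actual code:

```rust
// Encode a list of (start, end, seqno) tombstones into the flat wire format.
fn encode_block(rts: &[(Vec<u8>, Vec<u8>, u64)]) -> Vec<u8> {
    let mut buf = Vec::new();
    for (start, end, seqno) in rts {
        buf.extend_from_slice(&(start.len() as u16).to_le_bytes());
        buf.extend_from_slice(start);
        buf.extend_from_slice(&(end.len() as u16).to_le_bytes());
        buf.extend_from_slice(end);
        buf.extend_from_slice(&seqno.to_le_bytes());
    }
    buf
}

// Decode the whole block; None signals truncation or a corrupt record.
fn decode_block(mut buf: &[u8]) -> Option<Vec<(Vec<u8>, Vec<u8>, u64)>> {
    fn take<'a>(b: &mut &'a [u8], n: usize) -> Option<&'a [u8]> {
        if b.len() < n {
            return None; // truncated input
        }
        let (head, tail) = b.split_at(n);
        *b = tail;
        Some(head)
    }
    let mut out = Vec::new();
    while !buf.is_empty() {
        let start_len = u16::from_le_bytes(take(&mut buf, 2)?.try_into().ok()?) as usize;
        let start = take(&mut buf, start_len)?.to_vec();
        let end_len = u16::from_le_bytes(take(&mut buf, 2)?.try_into().ok()?) as usize;
        let end = take(&mut buf, end_len)?.to_vec();
        let seqno = u64::from_le_bytes(take(&mut buf, 8)?.try_into().ok()?);
        if start >= end {
            return None; // decode-time invariant: start < end
        }
        out.push((start, end, seqno));
    }
    Some(out)
}

fn main() {
    let rts = vec![
        (b"a".to_vec(), b"d".to_vec(), 10u64),
        (b"f".to_vec(), b"h".to_vec(), 7u64),
    ];
    let buf = encode_block(&rts);
    assert_eq!(decode_block(&buf), Some(rts)); // lossless roundtrip
    assert_eq!(decode_block(&[0u8]), None);    // truncated block is rejected
}
```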

Key differences from this PR:

| Area | This PR (temporaryfix) | Our fork |
| --- | --- | --- |
| SST block format | 2 custom blocks (ByStart + ByEndDesc) with prefix compression and per-window max_end | 1 raw block, decoded to Vec at load |
| Reverse iteration | Streaming from ByEndDesc block | Sorted copy built in RangeTombstoneFilter::new |
| Table-skip | Not implemented | Implemented for range iteration (skip table when RT fully covers key range with higher seqno) |
| Compaction dedup | Not implemented | sort + dedup on input RTs to prevent accumulation from MultiWriter rotation |
| RT-only tables | Not addressed | Synthetic WeakTombstone sentinel at max_rt_seqno + 1 for index creation |
| Point-read suppression | Via SuperVersion | Via is_suppressed_by_range_tombstones across all layers |
| Key length validation | Not addressed | u16::try_from at insertion + decode-time start < end invariant check |

What we kept the same: IntervalTree (AVL-balanced), ActiveTombstoneSet (sweep-line), RangeTombstoneFilter (bidirectional wrapper), compaction GC at last level, flush with unclipped RTs.

The tradeoff is clear — our encoding is simpler but loses the pruning/prefix-compression wins for large RT counts. For our use case (graph bulk deletion — few large RTs rather than many small ones) this is fine.

Happy to discuss or cherry-pick anything useful.

@temporaryfix
Author

@polaz If you have a working implementation that could work here, cherry-picking would be great, thank you.



Development

Successfully merging this pull request may close these issues.

Range tombstones/delete_range/delete_prefix
