feat: add range tombstone (delete_range) support #242
temporaryfix wants to merge 15 commits into fjall-rs:main from
Conversation
Introduce the foundational data structures for range tombstone support:

- RangeTombstone: half-open [start, end) interval with seqno, Ord impl (start asc, seqno desc, end asc), contains_key, visible_at, should_suppress, intersect_opt, fully_covers, CoveringRt
- ActiveTombstoneSet: forward iteration tracker with seqno multiset and min-heap expiry (monotonic IDs for deterministic ordering)
- ActiveTombstoneSetReverse: reverse iteration tracker with max-heap expiry, strict > activation for half-open end bound

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
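The commit above names `RangeTombstone`'s invariants; a minimal sketch of how those pieces could fit together (field names follow the commit message, but the bodies and the byte-vector key type are illustrative, not the PR's actual code):

```rust
use std::cmp::Ordering;

/// Half-open [start, end) deletion interval, stamped with the sequence
/// number at which the delete was issued.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RangeTombstone {
    pub start: Vec<u8>,
    pub end: Vec<u8>, // exclusive
    pub seqno: u64,
}

impl RangeTombstone {
    /// Key lies inside the half-open interval.
    pub fn contains_key(&self, key: &[u8]) -> bool {
        self.start.as_slice() <= key && key < self.end.as_slice()
    }

    /// Tombstone is visible to a reader snapshotted at `read_seqno`.
    pub fn visible_at(&self, read_seqno: u64) -> bool {
        self.seqno < read_seqno
    }

    /// Suppress a KV only if the tombstone covers the key, is newer than
    /// the KV, and is visible to the reading snapshot.
    pub fn should_suppress(&self, key: &[u8], key_seqno: u64, read_seqno: u64) -> bool {
        self.contains_key(key) && key_seqno < self.seqno && self.visible_at(read_seqno)
    }

    /// Clip to another half-open interval; None if they do not overlap.
    pub fn intersect_opt(&self, start: &[u8], end: &[u8]) -> Option<RangeTombstone> {
        let s = self.start.as_slice().max(start);
        let e = self.end.as_slice().min(end);
        if s < e {
            Some(RangeTombstone { start: s.to_vec(), end: e.to_vec(), seqno: self.seqno })
        } else {
            None
        }
    }
}

/// Order: start asc, then seqno desc, then end asc.
impl Ord for RangeTombstone {
    fn cmp(&self, other: &Self) -> Ordering {
        self.start
            .cmp(&other.start)
            .then(other.seqno.cmp(&self.seqno))
            .then(self.end.cmp(&other.end))
    }
}

impl PartialOrd for RangeTombstone {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
```

The seqno-descending tiebreak means that for a given start key, the newest tombstone sorts first, which is convenient when scanning for the strongest suppressor.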
- IntervalTree: AVL-balanced BST keyed by start, augmented with subtree_max_end, subtree_max_seqno, subtree_min_seqno for pruning. Supports query_suppression, overlapping_tombstones, covering_rt queries.
- Memtable: dual-indexed range tombstone storage (IntervalTree for point/overlap queries, BTreeMap by end-desc for reverse iteration). Methods: insert_range_tombstone, is_suppressed_by_range_tombstone, overlapping_tombstones, range_tombstones_by_start/end_desc.
- Error::InvalidBlock variant for corrupt block detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
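A sketch of the `subtree_max_end` pruning idea behind the `IntervalTree`, using a plain unbalanced binary tree (the PR's tree is AVL-balanced and also tracks seqno aggregates; this only shows why the augmentation lets a point query skip whole subtrees):

```rust
/// Interval-tree node keyed by tombstone start, augmented with the
/// maximum (exclusive) end found anywhere in its subtree.
struct Node {
    start: Vec<u8>,
    end: Vec<u8>,
    seqno: u64,
    subtree_max_end: Vec<u8>,
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

/// Highest tombstone seqno covering `key`; subtrees whose max end is
/// <= key cannot contain the point, so they are pruned.
fn query_suppression(node: Option<&Node>, key: &[u8]) -> Option<u64> {
    let n = node?;
    // Nothing in this subtree extends past `key`: prune it entirely.
    if n.subtree_max_end.as_slice() <= key {
        return None;
    }
    let mut best = None;
    if n.start.as_slice() <= key && key < n.end.as_slice() {
        best = Some(n.seqno);
    }
    // Left subtree may still overlap (its starts are smaller).
    best = best.max(query_suppression(n.left.as_deref(), key));
    // Right subtree only matters if starts <= key can exist there.
    if n.start.as_slice() <= key {
        best = best.max(query_suppression(n.right.as_deref(), key));
    }
    best
}

/// Unbalanced insert (the real tree does AVL rotations and keeps
/// seqno aggregates too).
fn insert(node: &mut Option<Box<Node>>, start: &[u8], end: &[u8], seqno: u64) {
    match node {
        None => {
            *node = Some(Box::new(Node {
                start: start.to_vec(),
                end: end.to_vec(),
                seqno,
                subtree_max_end: end.to_vec(),
                left: None,
                right: None,
            }));
        }
        Some(n) => {
            if n.subtree_max_end.as_slice() < end {
                n.subtree_max_end = end.to_vec();
            }
            if start < n.start.as_slice() {
                insert(&mut n.left, start, end, seqno);
            } else {
                insert(&mut n.right, start, end, seqno);
            }
        }
    }
}
```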
- Encoder: encode_by_start (prefix-compresses START, per-window max_end) and encode_by_end_desc (prefix-compresses END). Both produce backward-parseable footer format with restart points.
- ByStart decoder: query_suppression, overlapping_tombstones, query_covering_rt_for_range with per-window max_end pruning. Hard error on start >= end corruption.
- ByEndDesc decoder: iter() in end-desc order for reverse iteration. Hard error on start >= end corruption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
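The "prefix-compresses START/END" part boils down to classic shared-prefix encoding over a sorted key run. A hedged sketch of that sub-technique only — this is not the PR's actual wire format (no restart points, footer, or per-window max_end here):

```rust
/// Prefix compression for a sorted run of keys: each entry stores the
/// length of the prefix shared with the previous key plus the new suffix.
fn encode_prefix_compressed(keys: &[Vec<u8>]) -> Vec<(usize, Vec<u8>)> {
    let mut out = Vec::with_capacity(keys.len());
    let mut prev: &[u8] = &[];
    for key in keys {
        let shared = prev.iter().zip(key.iter()).take_while(|(a, b)| a == b).count();
        out.push((shared, key[shared..].to_vec()));
        prev = key;
    }
    out
}

/// Reverse: rebuild each key from the previous one plus the stored suffix.
fn decode_prefix_compressed(entries: &[(usize, Vec<u8>)]) -> Vec<Vec<u8>> {
    let mut out: Vec<Vec<u8>> = Vec::with_capacity(entries.len());
    for (shared, suffix) in entries {
        let mut key = out.last().map_or(Vec::new(), |p| p[..*shared].to_vec());
        key.extend_from_slice(suffix);
        out.push(key);
    }
    out
}
```

Restart points in the real format exist precisely because decoding is sequential from the last full key; a restart stores an uncompressed key so a decoder can begin mid-block.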
Integrate range tombstone filtering into the read path:

- Point reads check memtable range tombstones before returning KVs
- Range/prefix iteration wraps MvccStream with RangeTombstoneFilter
- Bidirectional filter supports both forward and reverse iteration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
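A sketch of the forward half of such a bidirectional filter, assuming entries arrive sorted by key: tombstones activate once their start key is reached and expire off a min-heap at their half-open end. (The PR uses a seqno multiset for cheap suppression checks; the linear scan over active tombstones here is just for clarity, and all names are illustrative.)

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// One MVCC entry coming out of the merge stream: (user key, seqno).
type Entry = (Vec<u8>, u64);
/// A range tombstone as (start, exclusive end, seqno).
type Rt = (Vec<u8>, Vec<u8>, u64);

/// Forward sweep: activate tombstones whose start <= key, expire those
/// whose end <= key (half-open), and drop entries older than any
/// active tombstone covering them.
fn filter_forward(entries: &[Entry], mut rts: Vec<Rt>) -> Vec<Entry> {
    rts.sort_by(|a, b| a.0.cmp(&b.0)); // by start, ascending
    let mut next = 0;
    // Min-heap on end key so the earliest-expiring tombstone pops first.
    let mut expiry: BinaryHeap<Reverse<(Vec<u8>, u64)>> = BinaryHeap::new();
    let mut out = Vec::new();

    for (key, seqno) in entries {
        // Activate every tombstone that has started by this key.
        while next < rts.len() && rts[next].0.as_slice() <= key.as_slice() {
            expiry.push(Reverse((rts[next].1.clone(), rts[next].2)));
            next += 1;
        }
        // Expire tombstones whose half-open end we have passed.
        while matches!(expiry.peek(), Some(Reverse((end, _))) if end.as_slice() <= key.as_slice()) {
            expiry.pop();
        }
        // Suppress if any still-active tombstone is newer than the entry.
        let suppressed = expiry.iter().any(|Reverse((_, rt_seqno))| rt_seqno > seqno);
        if !suppressed {
            out.push((key.clone(), *seqno));
        }
    }
    out
}
```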
- Add RangeTombstoneStart/End block types
- Load tombstone blocks from SST regions during Table::recover
- Write dual tombstone blocks (ByStart + ByEndDesc) in Writer::finish
- Add Table query methods: suppression, covering_rt, overlapping, iterators
- Store range_tombstone_count in table metadata

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Point reads now check range tombstone suppression from SST tables in addition to memtables
- Iteration collects range tombstones from SST tables alongside memtable tombstones for the RangeTombstoneFilter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Collect range tombstones from sealed memtables during flush
- Pass tombstones through flush_to_tables to MultiWriter
- MultiWriter writes tombstones to every output table
- Both Tree and BlobTree flush paths supported

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Collect range tombstones from input tables during compaction and write them to output tables. At the last level, evict tombstones below the MVCC GC watermark since no data beneath them can be resurrected. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
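The last-level eviction rule above can be sketched as a simple retain over the collected tombstones (the tuple representation and function name are illustrative, not the PR's API):

```rust
/// At the bottommost level, a range tombstone whose seqno is below the
/// MVCC GC watermark can be dropped: nothing it could suppress can be
/// resurrected, since no live snapshot can still see older versions.
fn gc_tombstones(
    mut rts: Vec<(Vec<u8>, Vec<u8>, u64)>, // (start, exclusive end, seqno)
    gc_watermark: u64,
    is_last_level: bool,
) -> Vec<(Vec<u8>, Vec<u8>, u64)> {
    if is_last_level {
        // Keep only tombstones still visible to some possible reader.
        rts.retain(|&(_, _, seqno)| seqno >= gc_watermark);
    }
    rts
}
```

On non-last levels every tombstone is kept, because data it suppresses may still exist in deeper levels.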
Covers memtable point reads, range iteration, flush, compaction, end-exclusive semantics, reverse iteration, MVCC visibility, overlapping tombstones, cross-layer suppression, and GC threshold. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When MultiWriter rotates or finishes, range tombstones are now clipped to each output table's key range using intersect_opt(). This avoids writing tombstones that don't overlap with a table's data, and ensures each table only stores the relevant portion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
During range iteration setup, tables that are fully covered by a range tombstone (with higher seqno than the table's max) are now skipped entirely, avoiding unnecessary I/O. Tombstones are collected from all sources before building the merge iterator.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
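The table-skip condition can be sketched as a predicate (names hypothetical); note that the half-open end means the table's maximum key must be strictly below the tombstone's end:

```rust
/// A table can be skipped entirely when a single visible tombstone
/// fully covers its key range and is newer than anything in the table.
fn can_skip_table(
    table_min: &[u8],      // smallest key in the table (inclusive)
    table_max: &[u8],      // largest key in the table (inclusive)
    table_max_seqno: u64,  // newest seqno stored in the table
    rt_start: &[u8],       // tombstone start (inclusive)
    rt_end: &[u8],         // tombstone end (exclusive)
    rt_seqno: u64,
) -> bool {
    rt_start <= table_min && table_max < rt_end && rt_seqno > table_max_seqno
}
```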
Confirms that when a range tombstone is evicted at the bottom level (gc_watermark > tombstone seqno), the underlying data that survived compaction becomes visible again via point reads. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Apologies, forgot to lint, working on it now.
Fix deny-level clippy lints (map_or -> is_some_and/is_none_or, unnecessary Ok(x?), unwrap_used/indexing_slicing in tests, type complexity) and add missing #[must_use], #[expect], and doc sections across all range tombstone files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add AtomicBool fast path on Memtable to skip range tombstone RwLock acquisition when no tombstones exist
- Add has_sst_range_tombstones flag on SuperVersion to skip SST table iteration for point reads and range scans when no SSTs have tombstones
- Skip RangeTombstoneFilter wrapping entirely when no tombstones collected
- Deduplicate overlapping range tombstones during compaction using sort + dedup_by to remove tombstones fully covered by a retained one
- Read range_tombstone_count from SST metadata (optional, default 0 for backward compat with older v3 tables)
- Add tests for fast-path correctness and compaction dedup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
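A sketch of the AtomicBool fast path (struct and method names are illustrative, not the PR's API): the flag is set once on first insert, so the common no-tombstone point read never touches the RwLock.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::RwLock;

/// Hypothetical slice of the memtable: the atomic flag lets point reads
/// skip lock acquisition entirely in the common no-tombstones case.
struct MemtableRts {
    has_range_tombstones: AtomicBool,
    tombstones: RwLock<Vec<(Vec<u8>, Vec<u8>, u64)>>, // (start, excl. end, seqno)
}

impl MemtableRts {
    fn new() -> Self {
        Self {
            has_range_tombstones: AtomicBool::new(false),
            tombstones: RwLock::new(Vec::new()),
        }
    }

    fn insert(&self, start: Vec<u8>, end: Vec<u8>, seqno: u64) {
        self.tombstones.write().unwrap().push((start, end, seqno));
        // Publish only after the tombstone is in place.
        self.has_range_tombstones.store(true, Ordering::Release);
    }

    fn is_suppressed(&self, key: &[u8], key_seqno: u64) -> bool {
        // Fast path: no tombstone was ever inserted, skip the lock.
        if !self.has_range_tombstones.load(Ordering::Acquire) {
            return false;
        }
        self.tombstones.read().unwrap().iter().any(|(s, e, rt_seqno)| {
            s.as_slice() <= key && key < e.as_slice() && key_seqno < *rt_seqno
        })
    }
}
```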
Will do some benchmarks on this when I get the chance, to prove that it doesn't reduce performance.
Can't wait to use this feature!
@ariesdevil I would love another person with eyes on it, if you have time to do a review. I spent a couple of days working with multiple AIs on formulating a plan before I started implementing, but I'm still not convinced this is perfect. It is a big feature to implement in a codebase that is unfamiliar to me. It is also a lot to ask a code owner to review, with over 4000 lines added. I tried breaking it down into atomic commits to make it easier. The more eyes that are on this, the more comfortable @marvin-j97 will feel about merging it. Also, we absolutely must ensure this is properly benchmarked so we do not cause performance regressions.
The first thing that seems awkward to me is the duplication of range tombstones for start and end inside tables. Ideally you would just store a table's range tombstones using a DataBlock. That way we don't have another block implementation, as the data blocks are pretty tried and tested.
```rust
tree.insert("c", "val_c", 3);

// Insert range tombstone [a, d) at seqno 10
tree.active_memtable()
```
AbstractTree should then have a remove_range method, so we don't have to access the active memtable manually.
@marvin-j97 cheers for the initial review. I agree it’s a bit awkward to add bespoke block formats; reusing the existing DataBlock path would be the obvious “less code” move, and it’s well-tested. The reason I kept range tombstones separate is the access pattern mismatch:
Splitting into ByStart + ByEndDesc lets us:
This isn’t us inventing a new pattern: RocksDB does not store range tombstones in regular data blocks either. They have dedicated table metadata/blocks and specialised handling (incl fragmentation/caching):
Their shape is different (single start-sorted list + fragmentation) whereas we keep two orders (start-asc + end-desc) and avoid fragmentation to make reverse streaming + pruning straightforward. But the underlying precedent is the same: tombstones are special enough to deserve special storage/read paths. That said, I’m not religious about the exact encoding. If the repo strongly prefers DataBlock reuse for maintainability/reviewability, we can get most of the same properties with two DataBlocks:
We would need to keep per-window max_end as a small sidecar (or another tiny DataBlock) to preserve the pruning wins. Understood re: Let me know what you think; happy to prototype the DataBlock version or discuss further.
That's partially true. The blocks are designed for binary search with lower + upper bounds. But the upper bound does not really help with range tombstones, as explained in the next paragraph.
Indeed I think this is pretty unavoidable, because range tombstones are inherently a 2D-problem (range + temporality (seqno)). RocksDB also turns the stored range tombstones into a different in-memory representation when the table is first loaded. Unless you can figure out some kind of specialized data structure (new block type) that allows queries similar to a {KD/range/segment/interval} tree that does not need a deserialization step (similar to e.g. LOUDS encoded trie or zero-copy schemes).
I think for all intents and purposes the range tombstones in RocksDB are stored in data blocks without binary seek index. See the image in https://rocksdb.org/blog/2018/11/21/delete-range.html

Taking a look at Pebble may be worth it.
Expose remove_range on the AbstractTree trait so callers (including tests) no longer need to reach into active_memtable() directly to insert range tombstones. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This recent paper suggests I may be barking up the wrong tree anyway: https://arxiv.org/html/2511.06061v1 @marvin-j97
Generally I'm open to having alternative implementations. Having a separate data structure in the LSM-tree is generally possible by adding a new field into the Version files (that's how the value log works). However, such auxiliary indexes must be immutable/log-structured as they are only allowed to change when a new Version is created, which this paper seems to do by having an in-memory index and DR-tree disk files.
Apologies, I do intend to finish this, but life has gotten in the way.
- Add RangeTombstone, ActiveTombstoneSet, IntervalTree core types (ported from upstream PR fjall-rs#242 algorithms, own SST persistence)
- Add RangeTombstoneFilter for bidirectional iteration suppression
- Integrate into Memtable with interval tree for O(log n) queries
- Add remove_range(start, end, seqno) and remove_prefix(prefix, seqno) to AbstractTree trait, implemented for Tree and BlobTree
- Wire suppression into point reads (memtable + sealed + SST layers)
- Wire RangeTombstoneFilter into range/prefix iteration pipeline
- SST persistence: raw wire format in BlockType::RangeTombstone block
- Flush: collect range tombstones from sealed memtables, write to SSTs
- Compaction: propagate RTs from input to output tables with clipping
- GC: evict range tombstones below watermark at bottom level
- Table-skip: skip tables fully covered by a range tombstone
- MultiWriter: clip RTs to each output table's key range on rotation
- Handle RT-only tables (derive key range from tombstone bounds)
- 16 integration tests covering all paths

Closes #16
We independently implemented range tombstones in our fork (structured-world/lsm-tree#21) and wanted to share a few notes in case they're useful.

Storage format: We went with a single raw block (`BlockType::RangeTombstone`).

Key differences from this PR:
What we kept the same: IntervalTree (AVL-balanced), ActiveTombstoneSet (sweep-line), RangeTombstoneFilter (bidirectional wrapper), compaction GC at last level, flush with unclipped RTs.

The tradeoff is clear: our encoding is simpler but loses the pruning/prefix-compression wins for large RT counts. For our use case (graph bulk deletion, i.e. few large RTs rather than many small ones) this is fine. Happy to discuss or cherry-pick anything useful.
@polaz If you have a working implementation that could work here, cherry-picking would be great, thank you.
Summary
Implements native range tombstone support for the LSM-tree, enabling efficient deletion of contiguous key ranges without writing individual tombstones per key.
Closes #2
Motivation
Currently, deleting a range of keys requires iterating over the range and writing a point tombstone for each key. This is expensive for large ranges, both in write amplification and in the tombstones that must be compacted away later. Range tombstones (as described in the RocksDB DeleteRange design) solve this by recording a single `[start, end)` interval with a sequence number, suppressing all keys within the range that have a lower seqno.

What's included
Core types
- `RangeTombstone`: half-open `[start, end)` interval with seqno. Supports `contains_key`, `visible_at`, `should_suppress`, `intersect_opt`, and `fully_covers` queries. Ordered by `(start asc, seqno desc, end asc)`.
- `ActiveTombstoneSet` / `ActiveTombstoneSetReverse`: streaming sweep-line trackers for forward and reverse iteration, using a seqno multiset and min/max-heap expiry to efficiently determine suppression without rescanning all tombstones.
- `CoveringRt`: returned by table-skip queries to identify tombstones that fully cover a table's key range.

Memtable integration
- `IntervalTree`: AVL-balanced BST keyed by start, augmented with `subtree_max_end` / `subtree_max_seqno` / `subtree_min_seqno` for efficient pruning. Supports `query_suppression`, `overlapping_tombstones`, and `covering_rt` queries.
- Dual-indexed storage: `IntervalTree` for point/overlap queries + `BTreeMap` by end-desc for reverse iteration.
- Methods: `insert_range_tombstone`, `is_suppressed_by_range_tombstone`, `overlapping_tombstones`, `range_tombstones_by_start/end_desc`.

SST block format
- Encoders: `encode_by_start` (prefix-compressed START keys with per-window `max_end`) and `encode_by_end_desc` (prefix-compressed END keys). Both produce backward-parseable footer format with restart points.
- ByStart decoder: `query_suppression`, `overlapping_tombstones`, `query_covering_rt_for_range` with per-window `max_end` pruning for early exit.
- ByEndDesc decoder: `iter()` in end-descending order for reverse iteration.
- New `BlockType` variants: `RangeTombstoneStart` and `RangeTombstoneEnd`.

Read path
- `RangeTombstoneFilter` wraps `MvccStream`, collecting tombstones from all sources and using `ActiveTombstoneSet` (forward) or `ActiveTombstoneSetReverse` (reverse) to suppress entries during iteration.

Write path
- Flush and compaction write tombstones to output tables via `MultiWriter`, clipping each to the table's key range with `.intersect_opt()`. At the last compaction level, tombstones below the MVCC GC watermark are evicted since no data beneath them can be resurrected.

Note on GC behavior: After a range tombstone is evicted at the bottom level (when its seqno is below the GC watermark), the underlying data becomes visible again to new reads. Physical cleanup of suppressed keys happens lazily during future compactions when no snapshots hold them.
MVCC correctness
- `visible_at(read_seqno)` ensures only tombstones with `seqno < read_seqno` are considered.
- `should_suppress(key_seqno, read_seqno)` ensures a tombstone only suppresses keys with lower seqno than the tombstone itself.

Test plan
- Integration tests (`tests/range_tombstone.rs`) covering half-open `[start, end)` semantics
- Unit tests for `IntervalTree`, `ActiveTombstoneSet`, `RangeTombstone`, block encoders/decoders
- `cargo test` passes (312 unit + 23 doc tests)

Stats
🤖 Generated with Claude Code