
Add prefix filter support#186

Open
zaidoon1 wants to merge 43 commits into fjall-rs:main from zaidoon1:zaidoon/prefix-filter

Conversation

@zaidoon1
Contributor

@zaidoon1 zaidoon1 commented Nov 3, 2025

rebased #151 on top of main and adapted various things to the new api

Summary by CodeRabbit

  • New Features

    • Prefix extractors for prefix-aware filtering, persisted extractor metadata, and compatibility checks; reads can skip non-matching table ranges to reduce work; writers record/use extractors so filters index prefixes
  • Tests

    • Extensive unit, integration and recovery tests covering extractor behavior, compatibility, and range-skipping
  • Fuzz

    • New fuzz harness validating prefix-filter correctness; improved fuzz run setup and ignore rules
  • Benchmarks

    • Benchmarks measuring range-query performance with and without extractors
  • Chores

    • Robust directory creation in fuzz run instructions (mkdir -p)

@marvin-j97
Contributor

Unfortunately this has been hit by another wave of conflicts, but I just released 3.0.0, so there will be a bit of a freeze of activity from this point on.

@zaidoon1
Contributor Author

zaidoon1 commented Jan 3, 2026

no worries! let me know when you want me to restart working on this, or when you are ready to merge things in again. Also feel free to ping me on any tickets/features, etc. — happy to help with whatever (lsm-tree or fjall itself)

@marvin-j97
Contributor

At this point 3.0 has stabilized I think. I'm definitely keen on getting prefix extractors and compaction filters in as the next major features.

@zaidoon1 zaidoon1 force-pushed the zaidoon/prefix-filter branch from fc17be2 to 2df2ae4 on February 10, 2026 20:30
@codecov

codecov bot commented Feb 10, 2026

@zaidoon1 zaidoon1 force-pushed the zaidoon/prefix-filter branch from 2df2ae4 to 58ad234 on February 10, 2026 20:33
@marvin-j97
Contributor

I will soon do a more in-depth look into this PR, but in the meantime: the run_reader logic is mostly not covered by tests, so I think there are still edge cases missing from the tests. Other files are not as affected, or even improve in coverage, so that's good.

@zaidoon1
Contributor Author

sounds good, i'll add more tests to cover the run_reader logic

@zaidoon1 zaidoon1 force-pushed the zaidoon/prefix-filter branch 3 times, most recently from af5ac86 to c1ea699 on February 11, 2026 17:46
@zaidoon1
Contributor Author

note there are some false positives, like https://app.codecov.io/gh/fjall-rs/lsm-tree/pull/186#644ae531cb268487817af88f68673c70-R56, where the doc comments show up as "untested"

@zaidoon1 zaidoon1 force-pushed the zaidoon/prefix-filter branch from c1ea699 to eaec760 on February 11, 2026 18:33
@marvin-j97
Contributor

marvin-j97 commented Feb 12, 2026

note there are some false positives, like https://app.codecov.io/gh/fjall-rs/lsm-tree/pull/186#644ae531cb268487817af88f68673c70-R56, where the doc comments show up as "untested"

That makes sense because the extractors are never actually asserted to work correctly.

Adding something like

assert_eq!(..., SegmentedPrefixExtractor.name());

assert!(..., SegmentedPrefixExtractor.extract(...));

should fix it.
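A hedged sketch of what such assertions could look like — the trait shape and `FixedPrefixExtractor` below are simplified stand-ins based on this thread's description of the API (`name()` plus `extract_first` returning `Option<&[u8]>`), not the crate's actual code:

```rust
// Simplified stand-ins for illustration; the real trait lives in lsm-tree.
trait PrefixExtractor {
    fn name(&self) -> &'static str;
    fn extract_first<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]>;
}

/// Fixed-length prefix extractor (hypothetical shape).
struct FixedPrefixExtractor(usize);

impl PrefixExtractor for FixedPrefixExtractor {
    fn name(&self) -> &'static str {
        "fixed"
    }

    fn extract_first<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]> {
        // Short keys are out-of-domain (matches the later fix in this PR).
        key.get(..self.0)
    }
}

fn main() {
    let ex = FixedPrefixExtractor(2);
    // Asserting actual behavior makes the doc-commented items show as covered.
    assert_eq!(ex.name(), "fixed");
    assert_eq!(ex.extract_first(b"abcdef"), Some(&b"ab"[..]));
    assert_eq!(ex.extract_first(b"a"), None);
    println!("ok");
}
```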

@zaidoon1
Contributor Author

sounds good, i'll add that, but i'll wait for your other feedback on this PR and address it all in one go.

Comment thread src/config/mod.rs Outdated
Comment thread tests/prefix_filter.rs
Contributor


This file is probably at the point where it could be split into multiple smaller files. But I can do that later.

Comment thread src/table/writer/mod.rs Outdated
Comment thread src/config/mod.rs Outdated
Comment thread src/table/mod.rs Outdated
Comment thread src/table/mod.rs Outdated
Comment thread src/run_reader.rs Outdated
@marvin-j97
Contributor

marvin-j97 commented Feb 16, 2026

There are a couple of branches still not hit in run_reader (at least in the early optimization section in RunReader::new). Maybe they are ultra-rare edge cases, but it implies the existing tests are not covering all possible scenarios.
Maybe the fuzz tests technically cover those, but fuzz tests obviously don't count towards code coverage.

@zaidoon1 zaidoon1 force-pushed the zaidoon/prefix-filter branch from ee30eba to b099f10 on February 16, 2026 22:37
@zaidoon1
Contributor Author

i've just pushed a commit that adds coverage for the last valid case that i missed. The rest is "safe fallbacks" for code that should not be hit. What would you prefer? I've added debug_asserts and kept the safe fallbacks. I can also expect/panic altogether if you prefer?

marvin-j97 and others added 28 commits April 7, 2026 21:47
…ror handling

- get_without_filter now applies the same global_seqno normalization and
  early-exit check as Table::get, preventing potential MVCC visibility
  errors for tables with non-zero global_seqno (defensive fix)
- Replace silent .ok() with .expect() in 7 fuzz call sites so iterator
  errors surface immediately instead of being swallowed
- Add unit test verifying get_without_filter and get agree under
  global_seqno translation (fails without the fix)
- Fix contradictory assertion messages in prefix_filter_recovery tests
Flatten the nested if/else into early returns, eliminating the
intermediate `item` binding. Also fix the trailing comment to
accurately describe both code paths that reach get_without_filter
(prefix filter already consulted, or filter not trustworthy).
Upstream changed filter_queries to only increment on missing key lookups
(issue fjall-rs#246). The test was asserting the old behavior where filter_queries
incremented for every lookup. Use a nonexistent key within the table's
key range to properly exercise the filter counter.
- Remove the second optimization block in should_skip_range_by_prefix_filter
  that could incorrectly skip tables in multi-prefix ranges when the start
  prefix matched the table's min key prefix but the probe landed in a
  different filter partition
- Error on malformed UTF-8 in prefix_extractor metadata instead of silently
  falling back to None
- Document the extract_first() and name() correctness invariants on
  PrefixExtractor
- Return 0 from full filter finish() when no hashes were registered,
  matching partitioned filter behavior
Read through metadata.prefix_extractor_name instead of maintaining a
separate copy on Inner, eliminating the clone and the invariant that
both fields must stay in sync.
Use the exported FixedPrefixExtractor in the doc example instead of
reimplementing it inline. Change 'iff' to 'if' in extract_first docs.
Three correctness bugs fixed:

1. FixedPrefixExtractor false negatives for short keys: extract_first()
   returned Some(short_key) for keys shorter than the configured length,
   producing a hash that was never in the filter. Changed to return None
   (out-of-domain), matching RocksDB's NewFixedPrefixTransform behavior.
   This caused silent data loss when tree.prefix() was called with a
   prefix shorter than the extractor length.

2. tree.prefix() could not use the filter when the prefix length equaled
   the extractor length: prefix_to_range("h") produces range "h".."i",
   and extract_first("h") != extract_first("i"), so the filter layer
   treated it as a multi-prefix range and never consulted the filter.
   Fixed by threading the original prefix from tree.prefix() through to
   the filter layer as a prefix_hint, bypassing the range-based prefix
   extraction. A stability guard (extract_first(hint) == extract_first(
   hint + "\0")) prevents false negatives when the hint is not a valid
   extracted prefix (e.g. FullKeyExtractor, or hint shorter than extractor
   length).

3. RunReader lazy per-table skip missing prefix_filter_allowed check:
   the optimized validated_prefix_hint path called probe_prefix_filter
   directly without verifying extractor compatibility. Since optimize_runs
   can merge tables from different compaction epochs (with different
   extractor configs) into a single run, this could probe an old table's
   filter with the wrong extractor and get a false negative.
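The range construction behind bug 2 can be sketched as follows. This is a simplified stand-in for `prefix_to_range`, assuming the last byte is below 0xFF (a real implementation must also handle overflow), to show why the bounds of a prefix query extract to different prefixes:

```rust
// Simplified sketch of prefix_to_range (hypothetical; ignores 0xFF overflow).
fn prefix_to_range(prefix: &[u8]) -> (Vec<u8>, Vec<u8>) {
    let mut end = prefix.to_vec();
    let last = end.last_mut().expect("non-empty prefix");
    assert!(*last < 0xFF, "sketch does not handle 0xFF overflow");
    *last += 1; // "h" -> exclusive upper bound "i"
    (prefix.to_vec(), end)
}

fn main() {
    let (start, end) = prefix_to_range(b"h");
    // With a 1-byte fixed extractor, extract_first("h") != extract_first("i"),
    // so a purely range-based check sees a "multi-prefix" range and never
    // consults the filter — hence threading the original prefix as a hint.
    assert_eq!(start, b"h");
    assert_eq!(end, b"i");
    println!("ok");
}
```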

Performance optimizations in RunReader:

- Precompute the validated prefix hint once in RunReader::new instead of
  re-running the stability guard (with a Vec allocation) on every table
  during lazy iteration
- Replace common_prefix: Option<Vec<u8>> with can_prune_upfront: bool,
  eliminating up to 3 .to_vec() allocations per RunReader::new
- Use extractor.as_ref() instead of extractor.clone() in the upfront
  pruning loop, avoiding an unnecessary Arc refcount bump
- Call probe_prefix_filter directly in the lazy loops when a validated
  hint is available, skipping the full should_skip_range_by_prefix_filter
  guard re-check
- Expand ClusteredPrefix range from 0-3 to 0-5 bytes so prefix length
  can equal or exceed all extractor lengths (up to 4), covering the
  case where prefix_to_range produces bounds with different extracted
  prefixes
- Widen prefix byte alphabet from 0-7 to 0-9 so values 8-9 produce
  prefixes absent from all keys, exercising the filter-should-skip path
- Add AFL-controlled MajorCompact target size (128 vs 4096 bytes) to
  produce many small tables per run, making the RunReader lazy loop
  (3+ table overlap) far more likely to fire
- Add PrefixScanExistingKey operation that derives the scan prefix from
  a previously inserted key, guaranteeing the prefix overlaps real data
  and forcing the filter to make a meaningful decision
- Add FlushCompactReopenCompact composite operation that atomically
  creates the mixed-extractor multi-table run condition (flush, compact
  with small target, reopen with different extractor) without relying
  on AFL to randomly sequence four separate operations
When a prefix extractor is configured, multiple keys sharing the same
prefix produce duplicate hashes in the Bloom filter buffer. Without
dedup, the filter is sized for N total hashes instead of the true number
of unique prefixes, wasting memory proportional to the average number of
keys per prefix.

Sort + dedup the hash buffer at flush/spill time so the filter is sized
for the actual number of unique prefixes. This is gated behind an
enable_dedup() flag set by Writer::use_prefix_extractor, so the full-key
Bloom path (where each hash is already unique) pays no overhead.

For partitioned filters, dedup runs per-partition in spill_filter_partition
before building each partition's Bloom filter.

Also add a HierarchicalExtractor to the fuzz target that returns multiple
prefixes per key (2-byte + 4-byte), exercising the interleaved hash
dedup path and the multi-prefix contains_prefix probe logic.
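The sort + dedup sizing step can be sketched as below, assuming u64 filter hashes (the helper name is illustrative, not the crate's actual code):

```rust
// Hypothetical sketch: size the Bloom filter by unique prefix hashes,
// not by total key count.
fn unique_hash_count(mut hashes: Vec<u64>) -> usize {
    hashes.sort_unstable();
    hashes.dedup(); // removes adjacent duplicates only, hence the sort first
    hashes.len()
}

fn main() {
    // Three keys sharing one prefix contribute one unique hash, not three.
    let hashes = vec![42, 42, 42, 7];
    assert_eq!(unique_hash_count(hashes), 2);
    println!("ok");
}
```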
Remove metrics from probe_prefix_filter entirely and let each caller
track filter_queries and io_skipped_by_filter in a context-appropriate
way, matching the upstream Table::get() pattern from issue fjall-rs#246.

Point reads (point_read_from_table):
- Filter excludes: both counters incremented (saved I/O)
- Filter allows, key not found: filter_queries only (wasted I/O)
- Filter allows, key found: no increment (successful read)

Range scans (should_skip_range_by_prefix_filter, RunReader lazy loops):
- Filter excludes: both counters incremented (saved I/O)
- Filter allows: no increment (outcome unknown until iteration)

This ensures io_skipped_by_filter / filter_queries gives the true
positive rate of filter decisions, without inflating filter_queries
for successful reads.
…tering)

When a prefix extractor is configured, the filter previously only
contained prefix hashes, causing point reads to bypass the Bloom filter
entirely (via get_without_filter) after the prefix check passed. For
keys that don't exist but share a prefix with existing keys, this meant
reading data blocks only to find nothing — wasted I/O that a full-key
Bloom would have prevented.

Now the filter contains both prefix hashes AND full-key hashes (matching
RocksDB's whole_key_filtering + prefix_extractor approach). Point reads
use two-level filtering: the prefix filter for coarse table-level
pruning, then the full-key Bloom for precise per-key filtering. This
eliminates the 10% regression observed in workloads with many keys per
prefix.

The whole_key_filtering option (default true) can be set to false for
seek-only workloads that never perform point lookups, saving filter
space by omitting full-key hashes.
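The two-level point-read decision described above might be sketched like this (names and the `Option<bool>` "filter cannot answer" shape are illustrative assumptions):

```rust
/// Illustrative two-level point-read check: coarse prefix filter first,
/// then the precise full-key Bloom.
fn may_contain_key(
    prefix_filter_hit: Option<bool>, // None = filter cannot answer
    full_key_bloom_hit: bool,
) -> bool {
    match prefix_filter_hit {
        Some(false) => false,    // prefix absent: skip the table outright
        _ => full_key_bloom_hit, // otherwise the full-key Bloom decides
    }
}

fn main() {
    assert!(!may_contain_key(Some(false), true)); // pruned at prefix level
    assert!(!may_contain_key(Some(true), false)); // pruned at key level
    assert!(may_contain_key(None, true));         // no prefix answer: Bloom decides
    println!("ok");
}
```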
Increment filter_queries on every probe where the filter actually exists
and answers (Ok(Some(_))), not just on skips. This ensures
io_skipped_by_filter / filter_queries gives the true positive rate of
filter decisions. Previously filter_queries only incremented on definitive
exclusion, making the FPR appear artificially low.

Return Ok(None) from probe_prefix_filter and maybe_contains_prefix when
the filter cannot answer (no filter block, incompatible extractor, or
out-of-domain key). Callers only increment metrics for Ok(Some(_)),
preventing false counts for tables without filters or with mismatched
extractors.

All 5 prefix filter call sites (point_read_from_table, both paths in
should_skip_range_by_prefix_filter, and both RunReader lazy loops) now
follow the same pattern: capture the probe result, count if it answered,
count the skip if it excluded.
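The shared call-site pattern might look like this sketch — the `Option<bool>` probe result mirrors the `Ok(None)` / `Ok(Some(_))` description above, and the function name is illustrative:

```rust
/// Illustrative probe accounting: None = filter cannot answer,
/// Some(false) = definitive exclusion, Some(true) = may contain.
/// Returns true when the table can be skipped.
fn record_probe(result: Option<bool>, filter_queries: &mut u64, io_skipped: &mut u64) -> bool {
    match result {
        // No filter block / incompatible extractor: no metrics, no skip.
        None => false,
        Some(may_contain) => {
            *filter_queries += 1; // the filter actually answered
            if !may_contain {
                *io_skipped += 1; // definitive exclusion saved an I/O
            }
            !may_contain
        }
    }
}

fn main() {
    let (mut q, mut s) = (0u64, 0u64);
    assert!(!record_probe(None, &mut q, &mut s));       // cannot answer
    assert!(!record_probe(Some(true), &mut q, &mut s)); // allowed: count query only
    assert!(record_probe(Some(false), &mut q, &mut s)); // excluded: count both
    assert_eq!((q, s), (2, 1));
    println!("ok");
}
```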
The RunReader lazy loop previously called probe_prefix_filter on every
table, which internally called extractor.extract(key) (a Box<dyn Iterator>
heap allocation) and re-hashed the prefix bytes on each probe. For
workloads scanning hundreds of tables per query, this overhead was the
primary cause of a 10.6% throughput regression in the column store
benchmark.

Precompute the prefix hash once in RunReader::new (from the validated
prefix hint) and store it alongside the hint. The lazy loops now call
probe_prefix_filter_with_hash which uses the key only for TLI partition
selection and checks the precomputed hash directly against the Bloom
filter, eliminating:
- Box<dyn Iterator> allocation per table
- extract_first() call per table
- Hash computation per table

Column store benchmark results (5 min, 1000 databases, 48-byte prefix):
- Before optimization: -10.6% regression vs baseline
- After optimization: +16.1% improvement vs baseline
- Tail latency: 966us -> 27us (36x improvement)
…allocation

contains_prefix previously called extractor.extract(key) which returns
Box<dyn Iterator> — a heap allocation on every filter probe. Replace
with extract_first(key) which returns Option<&[u8]> (zero allocation).

Checking only the first prefix is sufficient for filter probing: during
writes, extract_first(key) is always registered in the filter for every
key. If it is absent, the key was never written and the table can be
safely skipped. Secondary prefixes from extract() can only match hashes
from other keys with different first prefixes, which is irrelevant.

Feed workload benchmark (point-read heavy, 5 min):
- Point read throughput: +18.1%
- Point read latency: -30.4%
- Range scan throughput: +12.8%
Add extract_last to PrefixExtractor trait, returning the highest-
cardinality prefix instead of the coarsest. For multi-prefix extractors
like [company#, company#user#], this allows RunReader to check
hash(company#user#) instead of hash(company#), skipping tables that
contain the company but not the specific user.

The approach naturally adapts to hint length: long hints get the most
specific prefix, short hints fall back to the coarsest. A stability
guard (extract_last(hint) == extract_last(hint + NUL)) ensures the
chosen prefix is stable across the query range. If the guard fails,
falls back to extract_first (previous behavior).

For single-prefix extractors, extract_last is overridden to delegate to
extract_first (zero allocation, identical behavior). The struct size
(Option<u64>) and per-table cost (single Bloom probe) are unchanged.
The only added cost is one extract_last call in RunReader::new (once per
query, not per table).
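The default-delegation idea can be sketched with a hypothetical trait shape (the real trait differs; the segmented extractor below is only an illustration of a `[company#, company#user#]`-style multi-prefix extractor):

```rust
// Hypothetical trait shape illustrating the extract_last default.
trait PrefixExtractor {
    /// Coarsest prefix, or None if the key is out-of-domain.
    fn extract_first<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]>;

    /// Most specific prefix; single-prefix extractors keep this default,
    /// so they pay zero extra cost.
    fn extract_last<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]> {
        self.extract_first(key)
    }
}

/// Two-level extractor over "company#user#..." style keys (illustrative only).
struct Segmented;

impl PrefixExtractor for Segmented {
    fn extract_first<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]> {
        // Up to and including the first '#'.
        let i = key.iter().position(|&b| b == b'#')?;
        Some(&key[..=i])
    }

    fn extract_last<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]> {
        // Up to and including the second '#', falling back to the first.
        let first = self.extract_first(key)?;
        let rest = &key[first.len()..];
        match rest.iter().position(|&b| b == b'#') {
            Some(j) => Some(&key[..first.len() + j + 1]),
            None => Some(first),
        }
    }
}

fn main() {
    let ex = Segmented;
    assert_eq!(ex.extract_first(b"acme#bob#post1"), Some(&b"acme#"[..]));
    assert_eq!(ex.extract_last(b"acme#bob#post1"), Some(&b"acme#bob#"[..]));
    println!("ok");
}
```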
…uning

The single-table skip path (should_skip_range_by_prefix_filter with a
prefix hint) now uses the same extract_last optimization as RunReader:
precompute the most specific stable prefix hash, then probe with
probe_prefix_filter_with_hash. Previously it used extract_first, giving
coarser pruning than the RunReader path for multi-prefix extractors.

Both prefix filter pruning paths (RunReader lazy loop and single-table
skip) now consistently use the most specific available prefix hash,
with extract_last stability guard and extract_first fallback.
…ing is enabled

When whole_key_filtering is true (the default), the filter contains
full-key hashes. For point reads, the full-key Bloom is strictly more
precise than the prefix pre-check — so skip the prefix probe entirely
and go straight to the Bloom. This eliminates a redundant filter probe
on every point read that was costing ~10% throughput on point-read-heavy
workloads.

The prefix filter now only activates for range/prefix scans, where it
provides table-level pruning that the full-key Bloom cannot.

When whole_key_filtering is false, point reads still use the prefix
filter as the sole pre-check since no full-key hashes are available.

Feed workload (point-read heavy):
- Before: -9.5% regression vs no prefix extractor
- After: +7.7% improvement vs no prefix extractor, -69.6% disk reads

Column store (scan-only): +4.7% improvement, consistent with previous
@zaidoon1 zaidoon1 force-pushed the zaidoon/prefix-filter branch from 8126d0b to e3622b8 on April 8, 2026 01:50
@zaidoon1
Contributor Author

zaidoon1 commented Apr 8, 2026

I haven't thought about it too much until now, but at this point only extract_first is really used, which makes me question how well multiple prefixes really work. If you extract multiple prefixes from a key, say <company_id>#<user_id>#... -> ["<company>#", "<company>#<user>#"], that naturally allows filtering for prefix queries over company and (company + user).

However, as it currently stands, for a (company + user) query, extract_first will only check for company#, and I'm unsure whether going through all prefixes warrants the heap allocation + filter lookup overhead. Obviously company + user has a higher cardinality, so it would be more exact for filtering.

ok i've pushed three commits to address all of this:

The extract() iterator is still used on the write path to register ALL prefix hashes in the filter, so multi-prefix extractors work correctly for both coarse and fine-grained queries.
For reads, we now use extract_last (the most specific prefix) instead of extract_first (the coarsest) when probing the filter. This is computed once per query in RunReader::new and should_skip_range_by_prefix_filter, not per table — so the one-time extract() Box allocation is amortized. The per-table probe uses a precomputed hash with zero allocation.
For a multi-prefix extractor like ["company#", "company#user#"]:

  • tree.prefix("company#user#hello") → probes with hash("company#user#") (most specific stable prefix). Tables with the company but not this specific user are skipped.
  • tree.prefix("company#") → probes with hash("company#") (only prefix available at this hint length). Tables without this company are skipped.

The approach adapts to the hint length automatically. A stability guard (extract_last(hint) == extract_last(hint + "\0")) ensures the chosen prefix is stable across the query range — if the most specific prefix isn't stable, it falls back to extract_first.
For single-prefix extractors, extract_last is overridden to delegate to extract_first (zero allocation, identical behavior). No overhead change.
For point reads with whole_key_filtering=true (the default), the prefix filter is skipped entirely — point reads go straight to the full-key Bloom which is strictly more precise. The prefix filter only activates for range/prefix scans where it provides table-level pruning.
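The stability guard described above can be sketched as follows, using a fixed-2-byte extractor as an illustrative stand-in (not the crate's actual code):

```rust
// Illustrative: a fixed-2-byte extractor, out-of-domain for short keys.
fn extract_first(key: &[u8]) -> Option<&[u8]> {
    key.get(..2)
}

/// Guard: the extracted prefix must be stable across the query range,
/// approximated by comparing the hint against hint + "\0".
fn hint_is_stable(hint: &[u8]) -> bool {
    let mut extended = hint.to_vec();
    extended.push(0);
    match (extract_first(hint), extract_first(&extended)) {
        (Some(a), Some(b)) => a == b,
        _ => false,
    }
}

fn main() {
    // A 2-byte hint extracts "he" both before and after appending NUL: stable.
    assert!(hint_is_stable(b"he"));
    // A 1-byte hint is out-of-domain, while hint + "\0" extracts "h\0":
    // unstable, so the filter must not be consulted (false-negative risk).
    assert!(!hint_is_stable(b"h"));
    println!("ok");
}
```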
