
Add prefix filter support#186

Open
zaidoon1 wants to merge 43 commits into fjall-rs:main from zaidoon1:zaidoon/prefix-filter

Conversation

@zaidoon1
Contributor

@zaidoon1 zaidoon1 commented Nov 3, 2025

rebased #151 on top of main and adapted various things to the new api

Summary by CodeRabbit

  • New Features

    • Prefix extractors for prefix-aware filtering, persisted extractor metadata, and compatibility checks; reads can skip non-matching table ranges to reduce work; writers record/use extractors so filters index prefixes
  • Tests

    • Extensive unit, integration and recovery tests covering extractor behavior, compatibility, and range-skipping
  • Fuzz

    • New fuzz harness validating prefix-filter correctness; improved fuzz run setup and ignore rules
  • Benchmarks

    • Benchmarks measuring range-query performance with and without extractors
  • Chores

    • Robust directory creation in fuzz run instructions (mkdir -p)

@marvin-j97
Contributor

Unfortunately this has been hit by another wave of conflicts, but I just released 3.0.0, so there will be a bit of a freeze of activity from this point on.

@zaidoon1
Contributor Author

zaidoon1 commented Jan 3, 2026

no worries! let me know when you want me to restart working on this, or when you are ready to merge things in again. Also feel free to ping me on any tickets/features, etc. — happy to help with whatever (lsm-tree or fjall itself)

@marvin-j97
Contributor

At this point 3.0 has stabilized I think. I'm definitely keen on getting prefix extractors and compaction filters in as the next major features.

@zaidoon1 zaidoon1 force-pushed the zaidoon/prefix-filter branch from fc17be2 to 2df2ae4 on February 10, 2026 20:30
@codecov

codecov bot commented Feb 10, 2026

@zaidoon1 zaidoon1 force-pushed the zaidoon/prefix-filter branch from 2df2ae4 to 58ad234 on February 10, 2026 20:33
@marvin-j97
Contributor

I will soon do a more in-depth look into this PR, but in the meantime: the run_reader logic is mostly not covered by tests, so I think there are still edge cases missing from the tests. Other files are not as affected, or even improve in coverage, so that's good.

@zaidoon1
Contributor Author

sounds good, i'll add more tests to cover the run_reader logic

@zaidoon1 zaidoon1 force-pushed the zaidoon/prefix-filter branch 3 times, most recently from af5ac86 to c1ea699 on February 11, 2026 17:46
@zaidoon1
Contributor Author

note there are some false positives, like https://app.codecov.io/gh/fjall-rs/lsm-tree/pull/186#644ae531cb268487817af88f68673c70-R56, where the doc comments show up as "untested"

@zaidoon1 zaidoon1 force-pushed the zaidoon/prefix-filter branch from c1ea699 to eaec760 on February 11, 2026 18:33
@marvin-j97
Contributor

marvin-j97 commented Feb 12, 2026

note there are some false positives, like https://app.codecov.io/gh/fjall-rs/lsm-tree/pull/186#644ae531cb268487817af88f68673c70-R56, where the doc comments show up as "untested"

That makes sense because the extractors are never actually asserted to work correctly.

Adding something like

assert_eq!(..., SegmentedPrefixExtractor.name());

assert!(..., SegmentedPrefixExtractor.extract(...));

should fix it.
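A hedged sketch of what such assertions could look like — the trait shape and `FixedPrefixExtractor` below are simplified stand-ins based on this thread's description of the API (`name()` plus `extract_first` returning `Option<&[u8]>`), not the crate's actual code:

```rust
// Simplified stand-ins for illustration; the real trait lives in lsm-tree.
trait PrefixExtractor {
    fn name(&self) -> &'static str;
    fn extract_first<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]>;
}

/// Fixed-length prefix extractor (hypothetical shape).
struct FixedPrefixExtractor(usize);

impl PrefixExtractor for FixedPrefixExtractor {
    fn name(&self) -> &'static str {
        "fixed"
    }

    fn extract_first<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]> {
        // Short keys are out-of-domain (matches the later fix in this PR).
        key.get(..self.0)
    }
}

fn main() {
    let ex = FixedPrefixExtractor(2);
    // Asserting actual behavior makes the doc-commented items show as covered.
    assert_eq!(ex.name(), "fixed");
    assert_eq!(ex.extract_first(b"abcdef"), Some(&b"ab"[..]));
    assert_eq!(ex.extract_first(b"a"), None);
    println!("ok");
}
```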

@zaidoon1
Contributor Author

sounds good, i'll add that, but i'll wait for your other feedback on this PR and address it all in one go.

Comment thread src/config/mod.rs Outdated
Comment thread tests/prefix_filter.rs
Contributor


This file is probably at the point where it could be split into multiple smaller files. But I can do that later.

Comment thread src/table/writer/mod.rs Outdated
Comment thread src/config/mod.rs Outdated
Comment thread src/table/mod.rs Outdated
Comment thread src/table/mod.rs Outdated
Comment thread src/run_reader.rs Outdated
@marvin-j97
Contributor

marvin-j97 commented Feb 16, 2026

There are a couple of branches still not hit in run_reader (at least in the early optimization section in RunReader::new). Maybe they are ultra-rare edge cases, but it implies the existing tests are not covering all possible scenarios.
Maybe the fuzz tests technically cover those, but fuzz tests obviously don't count towards code coverage.

@zaidoon1 zaidoon1 force-pushed the zaidoon/prefix-filter branch from ee30eba to b099f10 on February 16, 2026 22:37
@zaidoon1
Contributor Author

i've just pushed a commit that adds coverage for the last valid case that i missed. The rest is "safe fallbacks" for code that should not be hit. What would you prefer? I've added debug_asserts and kept the safe fallbacks. I can also expect/panic altogether if you prefer?

marvin-j97 and others added 28 commits April 7, 2026 21:47
…ror handling

- get_without_filter now applies the same global_seqno normalization and
  early-exit check as Table::get, preventing potential MVCC visibility
  errors for tables with non-zero global_seqno (defensive fix)
- Replace silent .ok() with .expect() in 7 fuzz call sites so iterator
  errors surface immediately instead of being swallowed
- Add unit test verifying get_without_filter and get agree under
  global_seqno translation (fails without the fix)
- Fix contradictory assertion messages in prefix_filter_recovery tests
Flatten the nested if/else into early returns, eliminating the
intermediate `item` binding. Also fix the trailing comment to
accurately describe both code paths that reach get_without_filter
(prefix filter already consulted, or filter not trustworthy).
Upstream changed filter_queries to only increment on missing key lookups
(issue fjall-rs#246). The test was asserting the old behavior where filter_queries
incremented for every lookup. Use a nonexistent key within the table's
key range to properly exercise the filter counter.
- Remove the second optimization block in should_skip_range_by_prefix_filter
  that could incorrectly skip tables in multi-prefix ranges when the start
  prefix matched the table's min key prefix but the probe landed in a
  different filter partition
- Error on malformed UTF-8 in prefix_extractor metadata instead of silently
  falling back to None
- Document the extract_first() and name() correctness invariants on
  PrefixExtractor
- Return 0 from full filter finish() when no hashes were registered,
  matching partitioned filter behavior
Read through metadata.prefix_extractor_name instead of maintaining a
separate copy on Inner, eliminating the clone and the invariant that
both fields must stay in sync.
Use the exported FixedPrefixExtractor in the doc example instead of
reimplementing it inline. Change 'iff' to 'if' in extract_first docs.
Three correctness bugs fixed:

1. FixedPrefixExtractor false negatives for short keys: extract_first()
   returned Some(short_key) for keys shorter than the configured length,
   producing a hash that was never in the filter. Changed to return None
   (out-of-domain), matching RocksDB's NewFixedPrefixTransform behavior.
   This caused silent data loss when tree.prefix() was called with a
   prefix shorter than the extractor length.

2. tree.prefix() could not use the filter when the prefix length equaled
   the extractor length: prefix_to_range("h") produces range "h".."i",
   and extract_first("h") != extract_first("i"), so the filter layer
   treated it as a multi-prefix range and never consulted the filter.
   Fixed by threading the original prefix from tree.prefix() through to
   the filter layer as a prefix_hint, bypassing the range-based prefix
   extraction. A stability guard (extract_first(hint) == extract_first(
   hint + "\0")) prevents false negatives when the hint is not a valid
   extracted prefix (e.g. FullKeyExtractor, or hint shorter than extractor
   length).

3. RunReader lazy per-table skip missing prefix_filter_allowed check:
   the optimized validated_prefix_hint path called probe_prefix_filter
   directly without verifying extractor compatibility. Since optimize_runs
   can merge tables from different compaction epochs (with different
   extractor configs) into a single run, this could probe an old table's
   filter with the wrong extractor and get a false negative.
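The range construction behind bug 2 can be sketched as follows. This is a simplified stand-in for `prefix_to_range`, assuming the last byte is below 0xFF (a real implementation must also handle overflow), to show why the bounds of a prefix query extract to different prefixes:

```rust
// Simplified sketch of prefix_to_range (hypothetical; ignores 0xFF overflow).
fn prefix_to_range(prefix: &[u8]) -> (Vec<u8>, Vec<u8>) {
    let mut end = prefix.to_vec();
    let last = end.last_mut().expect("non-empty prefix");
    assert!(*last < 0xFF, "sketch does not handle 0xFF overflow");
    *last += 1; // "h" -> exclusive upper bound "i"
    (prefix.to_vec(), end)
}

fn main() {
    let (start, end) = prefix_to_range(b"h");
    // With a 1-byte fixed extractor, extract_first("h") != extract_first("i"),
    // so a purely range-based check sees a "multi-prefix" range and never
    // consults the filter — hence threading the original prefix as a hint.
    assert_eq!(start, b"h");
    assert_eq!(end, b"i");
    println!("ok");
}
```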

Performance optimizations in RunReader:

- Precompute the validated prefix hint once in RunReader::new instead of
  re-running the stability guard (with a Vec allocation) on every table
  during lazy iteration
- Replace common_prefix: Option<Vec<u8>> with can_prune_upfront: bool,
  eliminating up to 3 .to_vec() allocations per RunReader::new
- Use extractor.as_ref() instead of extractor.clone() in the upfront
  pruning loop, avoiding an unnecessary Arc refcount bump
- Call probe_prefix_filter directly in the lazy loops when a validated
  hint is available, skipping the full should_skip_range_by_prefix_filter
  guard re-check
- Expand ClusteredPrefix range from 0-3 to 0-5 bytes so prefix length
  can equal or exceed all extractor lengths (up to 4), covering the
  case where prefix_to_range produces bounds with different extracted
  prefixes
- Widen prefix byte alphabet from 0-7 to 0-9 so values 8-9 produce
  prefixes absent from all keys, exercising the filter-should-skip path
- Add AFL-controlled MajorCompact target size (128 vs 4096 bytes) to
  produce many small tables per run, making the RunReader lazy loop
  (3+ table overlap) far more likely to fire
- Add PrefixScanExistingKey operation that derives the scan prefix from
  a previously inserted key, guaranteeing the prefix overlaps real data
  and forcing the filter to make a meaningful decision
- Add FlushCompactReopenCompact composite operation that atomically
  creates the mixed-extractor multi-table run condition (flush, compact
  with small target, reopen with different extractor) without relying
  on AFL to randomly sequence four separate operations
When a prefix extractor is configured, multiple keys sharing the same
prefix produce duplicate hashes in the Bloom filter buffer. Without
dedup, the filter is sized for N total hashes instead of the true number
of unique prefixes, wasting memory proportional to the average number of
keys per prefix.

Sort + dedup the hash buffer at flush/spill time so the filter is sized
for the actual number of unique prefixes. This is gated behind an
enable_dedup() flag set by Writer::use_prefix_extractor, so the full-key
Bloom path (where each hash is already unique) pays no overhead.

For partitioned filters, dedup runs per-partition in spill_filter_partition
before building each partition's Bloom filter.

Also add a HierarchicalExtractor to the fuzz target that returns multiple
prefixes per key (2-byte + 4-byte), exercising the interleaved hash
dedup path and the multi-prefix contains_prefix probe logic.
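The sort + dedup sizing step can be sketched as below, assuming u64 filter hashes (the helper name is illustrative, not the crate's actual code):

```rust
// Hypothetical sketch: size the Bloom filter by unique prefix hashes,
// not by total key count.
fn unique_hash_count(mut hashes: Vec<u64>) -> usize {
    hashes.sort_unstable();
    hashes.dedup(); // removes adjacent duplicates only, hence the sort first
    hashes.len()
}

fn main() {
    // Three keys sharing one prefix contribute one unique hash, not three.
    let hashes = vec![42, 42, 42, 7];
    assert_eq!(unique_hash_count(hashes), 2);
    println!("ok");
}
```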
Remove metrics from probe_prefix_filter entirely and let each caller
track filter_queries and io_skipped_by_filter in a context-appropriate
way, matching the upstream Table::get() pattern from issue fjall-rs#246.

Point reads (point_read_from_table):
- Filter excludes: both counters incremented (saved I/O)
- Filter allows, key not found: filter_queries only (wasted I/O)
- Filter allows, key found: no increment (successful read)

Range scans (should_skip_range_by_prefix_filter, RunReader lazy loops):
- Filter excludes: both counters incremented (saved I/O)
- Filter allows: no increment (outcome unknown until iteration)

This ensures io_skipped_by_filter / filter_queries gives the true
positive rate of filter decisions, without inflating filter_queries
for successful reads.
…tering)

When a prefix extractor is configured, the filter previously only
contained prefix hashes, causing point reads to bypass the Bloom filter
entirely (via get_without_filter) after the prefix check passed. For
keys that don't exist but share a prefix with existing keys, this meant
reading data blocks only to find nothing — wasted I/O that a full-key
Bloom would have prevented.

Now the filter contains both prefix hashes AND full-key hashes (matching
RocksDB's whole_key_filtering + prefix_extractor approach). Point reads
use two-level filtering: the prefix filter for coarse table-level
pruning, then the full-key Bloom for precise per-key filtering. This
eliminates the 10% regression observed in workloads with many keys per
prefix.

The whole_key_filtering option (default true) can be set to false for
seek-only workloads that never perform point lookups, saving filter
space by omitting full-key hashes.
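The two-level point-read decision described above might be sketched like this (names and the `Option<bool>` "filter cannot answer" shape are illustrative assumptions):

```rust
/// Illustrative two-level point-read check: coarse prefix filter first,
/// then the precise full-key Bloom.
fn may_contain_key(
    prefix_filter_hit: Option<bool>, // None = filter cannot answer
    full_key_bloom_hit: bool,
) -> bool {
    match prefix_filter_hit {
        Some(false) => false,    // prefix absent: skip the table outright
        _ => full_key_bloom_hit, // otherwise the full-key Bloom decides
    }
}

fn main() {
    assert!(!may_contain_key(Some(false), true)); // pruned at prefix level
    assert!(!may_contain_key(Some(true), false)); // pruned at key level
    assert!(may_contain_key(None, true));         // no prefix answer: Bloom decides
    println!("ok");
}
```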
Increment filter_queries on every probe where the filter actually exists
and answers (Ok(Some(_))), not just on skips. This ensures
io_skipped_by_filter / filter_queries gives the true positive rate of
filter decisions. Previously filter_queries only incremented on definitive
exclusion, making the FPR appear artificially low.

Return Ok(None) from probe_prefix_filter and maybe_contains_prefix when
the filter cannot answer (no filter block, incompatible extractor, or
out-of-domain key). Callers only increment metrics for Ok(Some(_)),
preventing false counts for tables without filters or with mismatched
extractors.

All 5 prefix filter call sites (point_read_from_table, both paths in
should_skip_range_by_prefix_filter, and both RunReader lazy loops) now
follow the same pattern: capture the probe result, count if it answered,
count the skip if it excluded.
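The shared call-site pattern might look like this sketch — the `Option<bool>` probe result mirrors the `Ok(None)` / `Ok(Some(_))` description above, and the function name is illustrative:

```rust
/// Illustrative probe accounting: None = filter cannot answer,
/// Some(false) = definitive exclusion, Some(true) = may contain.
/// Returns true when the table can be skipped.
fn record_probe(result: Option<bool>, filter_queries: &mut u64, io_skipped: &mut u64) -> bool {
    match result {
        // No filter block / incompatible extractor: no metrics, no skip.
        None => false,
        Some(may_contain) => {
            *filter_queries += 1; // the filter actually answered
            if !may_contain {
                *io_skipped += 1; // definitive exclusion saved an I/O
            }
            !may_contain
        }
    }
}

fn main() {
    let (mut q, mut s) = (0u64, 0u64);
    assert!(!record_probe(None, &mut q, &mut s));       // cannot answer
    assert!(!record_probe(Some(true), &mut q, &mut s)); // allowed: count query only
    assert!(record_probe(Some(false), &mut q, &mut s)); // excluded: count both
    assert_eq!((q, s), (2, 1));
    println!("ok");
}
```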
The RunReader lazy loop previously called probe_prefix_filter on every
table, which internally called extractor.extract(key) (a Box<dyn Iterator>
heap allocation) and re-hashed the prefix bytes on each probe. For
workloads scanning hundreds of tables per query, this overhead was the
primary cause of a 10.6% throughput regression in the column store
benchmark.

Precompute the prefix hash once in RunReader::new (from the validated
prefix hint) and store it alongside the hint. The lazy loops now call
probe_prefix_filter_with_hash which uses the key only for TLI partition
selection and checks the precomputed hash directly against the Bloom
filter, eliminating:
- Box<dyn Iterator> allocation per table
- extract_first() call per table
- Hash computation per table

Column store benchmark results (5 min, 1000 databases, 48-byte prefix):
- Before optimization: -10.6% regression vs baseline
- After optimization: +16.1% improvement vs baseline
- Tail latency: 966us -> 27us (36x improvement)
…allocation

contains_prefix previously called extractor.extract(key) which returns
Box<dyn Iterator> — a heap allocation on every filter probe. Replace
with extract_first(key) which returns Option<&[u8]> (zero allocation).

Checking only the first prefix is sufficient for filter probing: during
writes, extract_first(key) is always registered in the filter for every
key. If it is absent, the key was never written and the table can be
safely skipped. Secondary prefixes from extract() can only match hashes
from other keys with different first prefixes, which is irrelevant.

Feed workload benchmark (point-read heavy, 5 min):
- Point read throughput: +18.1%
- Point read latency: -30.4%
- Range scan throughput: +12.8%
Add extract_last to PrefixExtractor trait, returning the highest-
cardinality prefix instead of the coarsest. For multi-prefix extractors
like [company#, company#user#], this allows RunReader to check
hash(company#user#) instead of hash(company#), skipping tables that
contain the company but not the specific user.

The approach naturally adapts to hint length: long hints get the most
specific prefix, short hints fall back to the coarsest. A stability
guard (extract_last(hint) == extract_last(hint + NUL)) ensures the
chosen prefix is stable across the query range. If the guard fails,
falls back to extract_first (previous behavior).

For single-prefix extractors, extract_last is overridden to delegate to
extract_first (zero allocation, identical behavior). The struct size
(Option<u64>) and per-table cost (single Bloom probe) are unchanged.
The only added cost is one extract_last call in RunReader::new (once per
query, not per table).
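The default-delegation idea can be sketched with a hypothetical trait shape (the real trait differs; the segmented extractor below is only an illustration of a `[company#, company#user#]`-style multi-prefix extractor):

```rust
// Hypothetical trait shape illustrating the extract_last default.
trait PrefixExtractor {
    /// Coarsest prefix, or None if the key is out-of-domain.
    fn extract_first<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]>;

    /// Most specific prefix; single-prefix extractors keep this default,
    /// so they pay zero extra cost.
    fn extract_last<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]> {
        self.extract_first(key)
    }
}

/// Two-level extractor over "company#user#..." style keys (illustrative only).
struct Segmented;

impl PrefixExtractor for Segmented {
    fn extract_first<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]> {
        // Up to and including the first '#'.
        let i = key.iter().position(|&b| b == b'#')?;
        Some(&key[..=i])
    }

    fn extract_last<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]> {
        // Up to and including the second '#', falling back to the first.
        let first = self.extract_first(key)?;
        let rest = &key[first.len()..];
        match rest.iter().position(|&b| b == b'#') {
            Some(j) => Some(&key[..first.len() + j + 1]),
            None => Some(first),
        }
    }
}

fn main() {
    let ex = Segmented;
    assert_eq!(ex.extract_first(b"acme#bob#post1"), Some(&b"acme#"[..]));
    assert_eq!(ex.extract_last(b"acme#bob#post1"), Some(&b"acme#bob#"[..]));
    println!("ok");
}
```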
…uning

The single-table skip path (should_skip_range_by_prefix_filter with a
prefix hint) now uses the same extract_last optimization as RunReader:
precompute the most specific stable prefix hash, then probe with
probe_prefix_filter_with_hash. Previously it used extract_first, giving
coarser pruning than the RunReader path for multi-prefix extractors.

Both prefix filter pruning paths (RunReader lazy loop and single-table
skip) now consistently use the most specific available prefix hash,
with extract_last stability guard and extract_first fallback.
…ing is enabled

When whole_key_filtering is true (the default), the filter contains
full-key hashes. For point reads, the full-key Bloom is strictly more
precise than the prefix pre-check — so skip the prefix probe entirely
and go straight to the Bloom. This eliminates a redundant filter probe
on every point read that was costing ~10% throughput on point-read-heavy
workloads.

The prefix filter now only activates for range/prefix scans, where it
provides table-level pruning that the full-key Bloom cannot.

When whole_key_filtering is false, point reads still use the prefix
filter as the sole pre-check since no full-key hashes are available.

Feed workload (point-read heavy):
- Before: -9.5% regression vs no prefix extractor
- After: +7.7% improvement vs no prefix extractor, -69.6% disk reads

Column store (scan-only): +4.7% improvement, consistent with previous
@zaidoon1 zaidoon1 force-pushed the zaidoon/prefix-filter branch from 8126d0b to e3622b8 on April 8, 2026 01:50
@zaidoon1
Contributor Author

zaidoon1 commented Apr 8, 2026

I haven't thought about it too much until now, but at this point only extract_first is really used, which makes me question how well multiple prefixes really work. If you extract multiple prefixes from a key, say <company_id>#<user_id>#... -> ["<company>#", "<company>#<user>#"], that naturally allows filtering for prefix queries over company and (company + user).

However, as it currently stands, for a (company + user) query, extract_first will only check for company#, and I'm unsure whether going through all prefixes warrants the heap allocation + filter lookup overhead. Obviously company + user has a higher cardinality, so it would be more exact for filtering.

ok i've pushed three commits to address all of this:

The extract() iterator is still used on the write path to register ALL prefix hashes in the filter, so multi-prefix extractors work correctly for both coarse and fine-grained queries.
For reads, we now use extract_last (the most specific prefix) instead of extract_first (the coarsest) when probing the filter. This is computed once per query in RunReader::new and should_skip_range_by_prefix_filter, not per table — so the one-time extract() Box allocation is amortized. The per-table probe uses a precomputed hash with zero allocation.
For a multi-prefix extractor like ["company#", "company#user#"]:

  • tree.prefix("company#user#hello") → probes with hash("company#user#") (most specific stable prefix). Tables with the company but not this specific user are skipped.
  • tree.prefix("company#") → probes with hash("company#") (only prefix available at this hint length). Tables without this company are skipped.

The approach adapts to the hint length automatically. A stability guard (extract_last(hint) == extract_last(hint + "\0")) ensures the chosen prefix is stable across the query range — if the most specific prefix isn't stable, it falls back to extract_first.
For single-prefix extractors, extract_last is overridden to delegate to extract_first (zero allocation, identical behavior). No overhead change.
For point reads with whole_key_filtering=true (the default), the prefix filter is skipped entirely — point reads go straight to the full-key Bloom which is strictly more precise. The prefix filter only activates for range/prefix scans where it provides table-level pruning.
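The stability guard described above can be sketched as follows, using a fixed-2-byte extractor as an illustrative stand-in (not the crate's actual code):

```rust
// Illustrative: a fixed-2-byte extractor, out-of-domain for short keys.
fn extract_first(key: &[u8]) -> Option<&[u8]> {
    key.get(..2)
}

/// Guard: the extracted prefix must be stable across the query range,
/// approximated by comparing the hint against hint + "\0".
fn hint_is_stable(hint: &[u8]) -> bool {
    let mut extended = hint.to_vec();
    extended.push(0);
    match (extract_first(hint), extract_first(&extended)) {
        (Some(a), Some(b)) => a == b,
        _ => false,
    }
}

fn main() {
    // A 2-byte hint extracts "he" both before and after appending NUL: stable.
    assert!(hint_is_stable(b"he"));
    // A 1-byte hint is out-of-domain, while hint + "\0" extracts "h\0":
    // unstable, so the filter must not be consulted (false-negative risk).
    assert!(!hint_is_stable(b"h"));
    println!("ok");
}
```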
