Conversation
Unfortunately this has been hit by another wave of conflicts, but I just released 3.0.0, so there will be a bit of a freeze of activity from this point on.
No worries! Let me know when you want me to restart working on this, or when you are ready to merge things in again. Also feel free to ping me on any tickets/features, etc. Happy to help with whatever (lsm-tree or fjall itself).
At this point 3.0 has stabilized I think. I'm definitely keen on getting prefix extractors and compaction filters in as the next major features. |
(force-pushed fc17be2 to 2df2ae4)
Codecov Report ❌ Patch coverage is
(force-pushed 2df2ae4 to 58ad234)
I will soon do a more in-depth look into this PR, but in the meantime: the run_reader logic is mostly not covered by tests, so I think there are still edge cases missing from the tests. Other files are not as affected, or even improve in coverage, so that's good.
Sounds good, I'll add more tests to cover the run_reader logic.
(force-pushed af5ac86 to c1ea699)
Note there are some false positives, like https://app.codecov.io/gh/fjall-rs/lsm-tree/pull/186#644ae531cb268487817af88f68673c70-R56, where the doc comments are showing up as "untested".
(force-pushed c1ea699 to eaec760)
That makes sense, because the extractors are never actually asserted to work correctly. Adding something like `assert_eq!(..., SegmentedPrefixExtractor.name());` and `assert!(..., SegmentedPrefixExtractor.extract(...));` should fix it.
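A minimal sketch of the kind of assertions meant here, using a hypothetical stand-in for `SegmentedPrefixExtractor` (the `name`/`extract` signatures and the `'#'`-delimited behavior below are assumptions for illustration, not the PR's actual type):

```rust
// Hypothetical stand-in for the PR's SegmentedPrefixExtractor.
struct SegmentedPrefixExtractor;

impl SegmentedPrefixExtractor {
    fn name(&self) -> &'static str {
        "segmented"
    }

    // Assumed behavior: the prefix is the key up to and including the first '#'.
    fn extract<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]> {
        key.iter().position(|&b| b == b'#').map(|i| &key[..=i])
    }
}

fn main() {
    let ex = SegmentedPrefixExtractor;
    // Assert the extractor reports the expected name...
    assert_eq!(ex.name(), "segmented");
    // ...and actually extracts the expected prefix, so the doc-comment
    // lines no longer show up as untested.
    assert_eq!(ex.extract(b"user#42"), Some(&b"user#"[..]));
    assert_eq!(ex.extract(b"nodelim"), None);
}
```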
Sounds good, I'll add that, but I'll wait for your other feedback on this PR and address it all in one go.
This file is probably at the point where it could be split into multiple smaller files. But I can do that later.
(force-pushed 7f9fffc to ee30eba)
There are a couple of branches not hit in
(force-pushed ee30eba to b099f10)
I've just pushed a commit; it adds coverage for the last valid case that I missed. The rest are "safe fallbacks" for code that should not be hit. What would you prefer? I've added debug_asserts and kept the safe fallbacks, but I can also switch to expect/panic altogether if you prefer?
…ror handling

- get_without_filter now applies the same global_seqno normalization and early-exit check as Table::get, preventing potential MVCC visibility errors for tables with non-zero global_seqno (defensive fix)
- Replace silent .ok() with .expect() in 7 fuzz call sites so iterator errors surface immediately instead of being swallowed
- Add unit test verifying get_without_filter and get agree under global_seqno translation (fails without the fix)
- Fix contradictory assertion messages in prefix_filter_recovery tests
Flatten the nested if/else into early returns, eliminating the intermediate `item` binding. Also fix the trailing comment to accurately describe both code paths that reach get_without_filter (prefix filter already consulted, or filter not trustworthy).
Upstream changed filter_queries to only increment on missing key lookups (issue fjall-rs#246). The test was asserting the old behavior where filter_queries incremented for every lookup. Use a nonexistent key within the table's key range to properly exercise the filter counter.
- Remove the second optimization block in should_skip_range_by_prefix_filter that could incorrectly skip tables in multi-prefix ranges when the start prefix matched the table's min key prefix but the probe landed in a different filter partition
- Error on malformed UTF-8 in prefix_extractor metadata instead of silently falling back to None
- Document the extract_first() and name() correctness invariants on PrefixExtractor
- Return 0 from full filter finish() when no hashes were registered, matching partitioned filter behavior
Read through metadata.prefix_extractor_name instead of maintaining a separate copy on Inner, eliminating the clone and the invariant that both fields must stay in sync.
Use the exported FixedPrefixExtractor in the doc example instead of reimplementing it inline. Change 'iff' to 'if' in extract_first docs.
Three correctness bugs fixed:
1. FixedPrefixExtractor false negatives for short keys: extract_first()
returned Some(short_key) for keys shorter than the configured length,
producing a hash that was never in the filter. Changed to return None
(out-of-domain), matching RocksDB's NewFixedPrefixTransform behavior.
This caused silent data loss when tree.prefix() was called with a
prefix shorter than the extractor length.
2. tree.prefix() could not use the filter when the prefix length equaled
the extractor length: prefix_to_range("h") produces range "h".."i",
and extract_first("h") != extract_first("i"), so the filter layer
treated it as a multi-prefix range and never consulted the filter.
Fixed by threading the original prefix from tree.prefix() through to
the filter layer as a prefix_hint, bypassing the range-based prefix
extraction. A stability guard (extract_first(hint) == extract_first(
hint + "\0")) prevents false negatives when the hint is not a valid
extracted prefix (e.g. FullKeyExtractor, or hint shorter than extractor
length).
3. RunReader lazy per-table skip missing prefix_filter_allowed check:
the optimized validated_prefix_hint path called probe_prefix_filter
directly without verifying extractor compatibility. Since optimize_runs
can merge tables from different compaction epochs (with different
extractor configs) into a single run, this could probe an old table's
filter with the wrong extractor and get a false negative.
Performance optimizations in RunReader:
- Precompute the validated prefix hint once in RunReader::new instead of
re-running the stability guard (with a Vec allocation) on every table
during lazy iteration
- Replace common_prefix: Option<Vec<u8>> with can_prune_upfront: bool,
eliminating up to 3 .to_vec() allocations per RunReader::new
- Use extractor.as_ref() instead of extractor.clone() in the upfront
pruning loop, avoiding an unnecessary Arc refcount bump
- Call probe_prefix_filter directly in the lazy loops when a validated
hint is available, skipping the full should_skip_range_by_prefix_filter
guard re-check
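A minimal sketch of fix 1 under assumed names (`FixedPrefixExtractor` and the `extract_first` signature here are illustrative, not the crate's exact API): keys shorter than the configured length are out-of-domain and must yield `None`, rather than a prefix whose hash was never registered in the filter:

```rust
// Illustrative fixed-length prefix extractor (assumed shape, not the
// crate's actual FixedPrefixExtractor definition).
struct FixedPrefixExtractor {
    len: usize,
}

impl FixedPrefixExtractor {
    fn extract_first<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]> {
        // Old (buggy) behavior: returning Some(short_key) produced a hash
        // that was never in the filter, causing false negatives and
        // silent data loss for short tree.prefix() arguments.
        // Fixed behavior (matching RocksDB's NewFixedPrefixTransform):
        // short keys are out-of-domain.
        if key.len() >= self.len {
            Some(&key[..self.len])
        } else {
            None
        }
    }
}

fn main() {
    let ex = FixedPrefixExtractor { len: 4 };
    assert_eq!(ex.extract_first(b"abcdef"), Some(&b"abcd"[..]));
    // A short key must not produce a hashable prefix.
    assert_eq!(ex.extract_first(b"ab"), None);
}
```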
…efix_hint by reference
- Expand ClusteredPrefix range from 0-3 to 0-5 bytes so prefix length can equal or exceed all extractor lengths (up to 4), covering the case where prefix_to_range produces bounds with different extracted prefixes
- Widen prefix byte alphabet from 0-7 to 0-9 so values 8-9 produce prefixes absent from all keys, exercising the filter-should-skip path
- Add AFL-controlled MajorCompact target size (128 vs 4096 bytes) to produce many small tables per run, making the RunReader lazy loop (3+ table overlap) far more likely to fire
- Add PrefixScanExistingKey operation that derives the scan prefix from a previously inserted key, guaranteeing the prefix overlaps real data and forcing the filter to make a meaningful decision
- Add FlushCompactReopenCompact composite operation that atomically creates the mixed-extractor multi-table run condition (flush, compact with small target, reopen with different extractor) without relying on AFL to randomly sequence four separate operations
When a prefix extractor is configured, multiple keys sharing the same prefix produce duplicate hashes in the Bloom filter buffer. Without dedup, the filter is sized for N total hashes instead of the true number of unique prefixes, wasting memory proportional to the average number of keys per prefix.

Sort + dedup the hash buffer at flush/spill time so the filter is sized for the actual number of unique prefixes. This is gated behind an enable_dedup() flag set by Writer::use_prefix_extractor, so the full-key Bloom path (where each hash is already unique) pays no overhead. For partitioned filters, dedup runs per-partition in spill_filter_partition before building each partition's Bloom filter.

Also add a HierarchicalExtractor to the fuzz target that returns multiple prefixes per key (2-byte + 4-byte), exercising the interleaved hash dedup path and the multi-prefix contains_prefix probe logic.
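The sizing problem and the sort + dedup step can be sketched as follows (the `FilterBuffer` type, its `dedup` flag, and `finish_hashes` are assumed names for illustration, not the crate's actual types):

```rust
// Illustrative hash buffer for building a Bloom filter (assumed shape).
struct FilterBuffer {
    hashes: Vec<u64>,
    dedup: bool, // set when a prefix extractor is in use
}

impl FilterBuffer {
    // At flush/spill time, collapse duplicate prefix hashes so the
    // filter is sized for unique prefixes, not total keys.
    fn finish_hashes(&mut self) -> &[u64] {
        if self.dedup {
            // Many keys share a prefix, so the same hash repeats; without
            // this the filter would be sized for N keys instead of the
            // number of unique prefixes.
            self.hashes.sort_unstable();
            self.hashes.dedup();
        }
        &self.hashes
    }
}

fn main() {
    // Five hashes, but only three unique prefixes.
    let mut buf = FilterBuffer { hashes: vec![7, 3, 7, 3, 9], dedup: true };
    assert_eq!(buf.finish_hashes(), [3, 7, 9]);
}
```

The full-key path leaves `dedup` false and skips the sort entirely, so it pays no overhead.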
Remove metrics from probe_prefix_filter entirely and let each caller track filter_queries and io_skipped_by_filter in a context-appropriate way, matching the upstream Table::get() pattern from issue fjall-rs#246.

Point reads (point_read_from_table):
- Filter excludes: both counters incremented (saved I/O)
- Filter allows, key not found: filter_queries only (wasted I/O)
- Filter allows, key found: no increment (successful read)

Range scans (should_skip_range_by_prefix_filter, RunReader lazy loops):
- Filter excludes: both counters incremented (saved I/O)
- Filter allows: no increment (outcome unknown until iteration)

This ensures io_skipped_by_filter / filter_queries gives the true positive rate of filter decisions, without inflating filter_queries for successful reads.
…tering)

When a prefix extractor is configured, the filter previously only contained prefix hashes, causing point reads to bypass the Bloom filter entirely (via get_without_filter) after the prefix check passed. For keys that don't exist but share a prefix with existing keys, this meant reading data blocks only to find nothing — wasted I/O that a full-key Bloom would have prevented.

Now the filter contains both prefix hashes AND full-key hashes (matching RocksDB's whole_key_filtering + prefix_extractor approach). Point reads use two-level filtering: the prefix filter for coarse table-level pruning, then the full-key Bloom for precise per-key filtering. This eliminates the 10% regression observed in workloads with many keys per prefix.

The whole_key_filtering option (default true) can be set to false for seek-only workloads that never perform point lookups, saving filter space by omitting full-key hashes.
Increment filter_queries on every probe where the filter actually exists and answers (Ok(Some(_))), not just on skips. This ensures io_skipped_by_filter / filter_queries gives the true positive rate of filter decisions. Previously filter_queries only incremented on definitive exclusion, making the FPR appear artificially low.

Return Ok(None) from probe_prefix_filter and maybe_contains_prefix when the filter cannot answer (no filter block, incompatible extractor, or out-of-domain key). Callers only increment metrics for Ok(Some(_)), preventing false counts for tables without filters or with mismatched extractors.

All 5 prefix filter call sites (point_read_from_table, both paths in should_skip_range_by_prefix_filter, and both RunReader lazy loops) now follow the same pattern: capture the probe result, count if it answered, count the skip if it excluded.
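The caller-side pattern described above might look like this (the `Option<bool>` probe result, `Metrics` struct, and `record_probe` helper are assumptions for illustration; the real probe returns a `Result` and the call sites are spread across several files):

```rust
// Assumed metrics shape; the real counters live on the tree's metrics.
struct Metrics {
    filter_queries: u64,
    io_skipped_by_filter: u64,
}

// Probe result convention sketched here:
//   Some(false) = definitely absent, Some(true) = may be present,
//   None        = filter cannot answer (no filter block, incompatible
//                 extractor, or out-of-domain key).
// Returns whether the caller should proceed to I/O.
fn record_probe(metrics: &mut Metrics, probe: Option<bool>) -> bool {
    match probe {
        Some(may_contain) => {
            metrics.filter_queries += 1; // the filter answered
            if !may_contain {
                metrics.io_skipped_by_filter += 1; // definitive exclusion
            }
            may_contain
        }
        // No answer: do not inflate the counters, fall through to I/O.
        None => true,
    }
}

fn main() {
    let mut m = Metrics { filter_queries: 0, io_skipped_by_filter: 0 };
    assert!(!record_probe(&mut m, Some(false))); // skip the table
    assert!(record_probe(&mut m, Some(true)));   // proceed, counted
    assert!(record_probe(&mut m, None));         // proceed, uncounted
    assert_eq!((m.filter_queries, m.io_skipped_by_filter), (2, 1));
}
```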
The RunReader lazy loop previously called probe_prefix_filter on every table, which internally called extractor.extract(key) (a Box<dyn Iterator> heap allocation) and re-hashed the prefix bytes on each probe. For workloads scanning hundreds of tables per query, this overhead was the primary cause of a 10.6% throughput regression in the column store benchmark.

Precompute the prefix hash once in RunReader::new (from the validated prefix hint) and store it alongside the hint. The lazy loops now call probe_prefix_filter_with_hash, which uses the key only for TLI partition selection and checks the precomputed hash directly against the Bloom filter, eliminating:
- Box<dyn Iterator> allocation per table
- extract_first() call per table
- Hash computation per table

Column store benchmark results (5 min, 1000 databases, 48-byte prefix):
- Before optimization: -10.6% regression vs baseline
- After optimization: +16.1% improvement vs baseline
- Tail latency: 966us -> 27us (36x improvement)
…allocation

contains_prefix previously called extractor.extract(key), which returns Box<dyn Iterator> — a heap allocation on every filter probe. Replace with extract_first(key), which returns Option<&[u8]> (zero allocation).

Checking only the first prefix is sufficient for filter probing: during writes, extract_first(key) is always registered in the filter for every key. If it is absent, the key was never written and the table can be safely skipped. Secondary prefixes from extract() can only match hashes from other keys with different first prefixes, which is irrelevant.

Feed workload benchmark (point-read heavy, 5 min):
- Point read throughput: +18.1%
- Point read latency: -30.4%
- Range scan throughput: +12.8%
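The allocation difference can be sketched with an assumed trait shape (illustrative, not the crate's exact PrefixExtractor definition):

```rust
// Assumed trait shape for illustration only.
trait PrefixExtractor {
    // Multi-prefix path: boxes an iterator, one heap allocation per call.
    fn extract<'a>(&self, key: &'a [u8]) -> Box<dyn Iterator<Item = &'a [u8]> + 'a>;

    // Default goes through extract() and still allocates; extractors
    // override it to borrow straight from the key (zero allocation).
    fn extract_first<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]> {
        self.extract(key).next()
    }
}

// Toy two-byte extractor.
struct TwoByte;

impl PrefixExtractor for TwoByte {
    fn extract<'a>(&self, key: &'a [u8]) -> Box<dyn Iterator<Item = &'a [u8]> + 'a> {
        Box::new(key.get(..2).into_iter()) // heap allocation
    }

    fn extract_first<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]> {
        key.get(..2) // no boxing, borrows from the key
    }
}

fn main() {
    // Probing only the first prefix is sound: it is registered in the
    // filter for every written key, so its absence proves absence.
    assert_eq!(TwoByte.extract_first(b"abcd"), Some(&b"ab"[..]));
    assert_eq!(TwoByte.extract(b"abcd").next(), Some(&b"ab"[..]));
    assert_eq!(TwoByte.extract_first(b"a"), None); // out-of-domain
}
```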
Add extract_last to the PrefixExtractor trait, returning the highest-cardinality prefix instead of the coarsest. For multi-prefix extractors like [company#, company#user#], this allows RunReader to check hash(company#user#) instead of hash(company#), skipping tables that contain the company but not the specific user. The approach naturally adapts to hint length: long hints get the most specific prefix, short hints fall back to the coarsest.

A stability guard (extract_last(hint) == extract_last(hint + NUL)) ensures the chosen prefix is stable across the query range. If the guard fails, falls back to extract_first (previous behavior).

For single-prefix extractors, extract_last is overridden to delegate to extract_first (zero allocation, identical behavior). The struct size (Option<u64>) and per-table cost (single Bloom probe) are unchanged. The only added cost is one extract_last call in RunReader::new (once per query, not per table).
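A sketch of the extract_last idea under assumed signatures (the borrowed-slice returns and `'#'`-delimited `Segmented` extractor are simplifications for illustration):

```rust
// Assumed trait shape for illustration only.
trait PrefixExtractor {
    fn extract_first<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]>;

    // Default: single-prefix extractors have only one choice, so
    // extract_last delegates to extract_first (identical behavior).
    fn extract_last<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]> {
        self.extract_first(key)
    }
}

// Toy two-level "company#user#" style extractor.
struct Segmented;

impl PrefixExtractor for Segmented {
    fn extract_first<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]> {
        // Coarsest prefix: up to and including the first '#'.
        key.iter().position(|&b| b == b'#').map(|i| &key[..=i])
    }

    fn extract_last<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]> {
        // Most specific prefix: up to and including the last '#'.
        key.iter().rposition(|&b| b == b'#').map(|i| &key[..=i])
    }
}

fn main() {
    let k = b"acme#alice#2024";
    assert_eq!(Segmented.extract_first(k), Some(&b"acme#"[..]));
    // RunReader can now probe hash("acme#alice#") instead of
    // hash("acme#"), pruning tables that have the company but not the user.
    assert_eq!(Segmented.extract_last(k), Some(&b"acme#alice#"[..]));
}
```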
…uning

The single-table skip path (should_skip_range_by_prefix_filter with a prefix hint) now uses the same extract_last optimization as RunReader: precompute the most specific stable prefix hash, then probe with probe_prefix_filter_with_hash. Previously it used extract_first, giving coarser pruning than the RunReader path for multi-prefix extractors.

Both prefix filter pruning paths (RunReader lazy loop and single-table skip) now consistently use the most specific available prefix hash, with the extract_last stability guard and extract_first fallback.
…ing is enabled

When whole_key_filtering is true (the default), the filter contains full-key hashes. For point reads, the full-key Bloom is strictly more precise than the prefix pre-check — so skip the prefix probe entirely and go straight to the Bloom. This eliminates a redundant filter probe on every point read that was costing ~10% throughput on point-read-heavy workloads.

The prefix filter now only activates for range/prefix scans, where it provides table-level pruning that the full-key Bloom cannot. When whole_key_filtering is false, point reads still use the prefix filter as the sole pre-check since no full-key hashes are available.

Feed workload (point-read heavy):
- Before: -9.5% regression vs no prefix extractor
- After: +7.7% improvement vs no prefix extractor, -69.6% disk reads

Column store (scan-only): +4.7% improvement, consistent with previous
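The point-read decision reduces to a simple selection, sketched here with plain booleans standing in for the actual filter probes (purely illustrative; the real code paths involve the table's filter block and metrics):

```rust
// Illustrative decision only: which filter gates a point read.
fn point_read_may_contain(
    whole_key_filtering: bool,
    full_key_bloom_hit: bool,
    prefix_filter_hit: bool,
) -> bool {
    if whole_key_filtering {
        // Full-key hashes exist: one precise probe, the coarser
        // prefix check would be a strictly redundant second probe.
        full_key_bloom_hit
    } else {
        // No full-key hashes on disk: the prefix filter is the
        // only available pre-check before reading data blocks.
        prefix_filter_hit
    }
}

fn main() {
    // Prefix matches but the exact key was never written:
    // whole-key filtering avoids the wasted data-block read.
    assert!(!point_read_may_contain(true, false, true));
    // Without full-key hashes, the prefix hit forces the read.
    assert!(point_read_may_contain(false, false, true));
}
```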
(force-pushed 8126d0b to e3622b8)
OK, I've pushed three commits to address all of this: The
The approach adapts to the hint length automatically. A stability guard (
Rebased #151 on top of main and adapted various things to the new API.