Skip to content

Fix RRF tiebreak precision bug and assorted small cleanups#63

Merged
octogonz merged 4 commits into
microsoft:mainfrom
octogonz:main2
May 20, 2026
Merged

Fix RRF tiebreak precision bug and assorted small cleanups#63
octogonz merged 4 commits into
microsoft:mainfrom
octogonz:main2

Conversation

@octogonz
Copy link
Copy Markdown
Collaborator

Addresses a batch of small bugs and inconsistencies surfaced by a code review. The most substantive fix is in the RRF fusion ranking: an f32::EPSILON tolerance in the score comparison was coarse enough to swallow real score gaps between candidates within the fusion window, silently reversing the intended order. Replaced with exact-equality fall-through; the lower-priority tiebreaks (best contributing rank, best-rank method, row_id) are unchanged.

Other fixes:

  • Embedding input truncation past the model's max token length now emits a warning instead of silently dropping tokens.
  • A stats counter in FTS indexing now derives from the actual skipped-row list rather than a subtraction that could drift from it.
  • An FTS tokenizer identifier-splitter that was doing O(n^2) membership checks on a Vec now uses a parallel HashSet for dedup while preserving insertion order.
  • A single-method search hit helper was hardcoding rank: 1 for every entry; now threads the real 1-indexed rank through from the call site.
  • An unreachable early-return inside a recursive tree-descent helper is replaced with a debug_assert! of the invariant the callers already maintain.
  • Test-only accessors carrying stale #[allow(dead_code)] are now properly #[cfg(test)]-gated; one annotation on a function that's actually used in production is removed.
  • Added a comment documenting why the embedder intentionally skips L2 normalization (cosine distance is scale-invariant) and what would need to change if the consumer ever switches distance metrics or adds an ANN index.

octogonz added 4 commits May 19, 2026 19:10
- Replace f32::EPSILON tolerance with exact equality comparison in RRF
  sorting, fixing ordering for scores that differ by less than EPSILON
- Fix zero_token_skipped to use zero_token_row_ids.len() for consistency
- Add comment explaining why L2 normalization is omitted after mean pooling
… rank

- Emit warning when embedding input is truncated to 8192 tokens
- Gate test-only PackageIndex accessors with #[cfg(test)]
- Remove stale #[allow(dead_code)] from insert_package_name
- Pass real rank to make_single_method_fused_hit instead of hardcoded 1
- Replace unreachable early-return with debug_assert! in
  find_shallowest_split_scope (split_search.rs)
- Use HashSet for O(n) deduplication in split_identifier (tokenizer.rs)
The early-return in find_shallowest_split_scope handles a real edge case
where the node doesn't span the requested range (e.g. node 2-551 for
range 1-550). The task description was incorrect.
@octogonz octogonz merged commit 7b25544 into microsoft:main May 20, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant