Fix RRF tiebreak precision bug and assorted small cleanups by octogonz · Pull Request #63 · microsoft/monodex

octogonz · 2026-05-20T02:23:10Z

Addresses a batch of small bugs and inconsistencies surfaced by a code review. The most substantive fix is in the RRF fusion ranking: an f32::EPSILON tolerance in the score comparison was coarse enough to swallow real score gaps between candidates within the fusion window, silently reversing the intended order. Replaced with exact-equality fall-through; the lower-priority tiebreaks (best contributing rank, best-rank method, row_id) are unchanged.

Other fixes:

Embedding input truncation past the model's max token length now emits a warning instead of silently dropping tokens.
A stats counter in FTS indexing now derives from the actual skipped-row list rather than a subtraction that could drift from it.
An FTS tokenizer identifier-splitter that was doing O(n^2) membership checks on a Vec now uses a parallel HashSet for dedup while preserving insertion order.
A single-method search hit helper was hardcoding rank: 1 for every entry; now threads the real 1-indexed rank through from the call site.
An unreachable early-return inside a recursive tree-descent helper is replaced with a debug_assert! of the invariant the callers already maintain.
Test-only accessors carrying stale #[allow(dead_code)] are now properly #[cfg(test)]-gated; one annotation on a function that's actually used in production is removed.
Added a comment documenting why the embedder intentionally skips L2 normalization (cosine distance is scale-invariant) and what would need to change if the consumer ever switches distance metrics or adds an ANN index.

- Replace f32::EPSILON tolerance with exact equality comparison in RRF sorting, fixing ordering for scores that differ by less than EPSILON - Fix zero_token_skipped to use zero_token_row_ids.len() for consistency - Add comment explaining why L2 normalization is omitted after mean pooling

… rank - Emit warning when embedding input is truncated to 8192 tokens - Gate test-only PackageIndex accessors with #[cfg(test)] - Remove stale #[allow(dead_code)] from insert_package_name - Pass real rank to make_single_method_fused_hit instead of hardcoded 1

- Replace unreachable early-return with debug_assert! in find_shallowest_split_scope (split_search.rs) - Use HashSet for O(n) deduplication in split_identifier (tokenizer.rs)

The early-return in find_shallowest_split_scope handles a real edge case where the node doesn't span the requested range (e.g. node 2-551 for range 1-550). The task description was incorrect.

octogonz added 4 commits May 19, 2026 19:10

fix: unreachable code removal and O(n) dedup in tokenizer

416c5b9

- Replace unreachable early-return with debug_assert! in find_shallowest_split_scope (split_search.rs) - Use HashSet for O(n) deduplication in split_identifier (tokenizer.rs)

revert: Bug 11 was incorrect - early-return is reachable

5595746

The early-return in find_shallowest_split_scope handles a real edge case where the node doesn't span the requested range (e.g. node 2-551 for range 1-550). The task description was incorrect.

octogonz merged commit 7b25544 into microsoft:main May 20, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix RRF tiebreak precision bug and assorted small cleanups#63

Fix RRF tiebreak precision bug and assorted small cleanups#63
octogonz merged 4 commits into
microsoft:mainfrom
octogonz:main2

octogonz commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

octogonz commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant