Add log-odds conjunction fusion for BB25 hybrid search#1041
davidmezzetti merged 3 commits into neuml:master
Conversation
BB25 normalization outputs calibrated probabilities, but the existing hybrid fusion uses a convex combination, which discards the Bayesian probability semantics. This causes BB25 to regress on 4/5 BEIR datasets. This PR adds log-odds conjunction fusion (from "From Bayesian Inference to Neural Computation") that correctly combines probability signals in logit space, with per-query dynamic calibration for dense cosine scores.

- scoring/normalize.py: Extract Bayesian method check into isbayes()
- scoring/base.py: Add default isbayes() returning False
- scoring/tfidf.py: Add isbayes() delegating to normalizer
- search/base.py: Add logodds(), convex(), rrf() fusion methods; dispatch based on isbayes()

BEIR nDCG@10 results (BB25+LogOdds vs Default): arguana +2.23, fiqa +2.03, scidocs +0.62, scifact +1.33, nfcorpus -1.96
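The isbayes() dispatch described above might be sketched roughly like this. This is an illustrative outline only -- the class shapes and the `fuse()` helper are assumptions for the sketch, not txtai's exact code:

```python
class Scoring:
    """Base scoring class: scores are not Bayesian probabilities by default."""

    def isbayes(self):
        return False


class Normalize:
    """Score normalizer; only Bayesian methods output calibrated probabilities."""

    def __init__(self, method):
        self.method = method

    def isbayes(self):
        # Assumed check: bb25 is the Bayesian normalization method
        return self.method == "bb25"


class TFIDF(Scoring):
    """Term scoring that delegates the Bayesian check to its normalizer."""

    def __init__(self, normalizer=None):
        self.normalizer = normalizer

    def isbayes(self):
        return self.normalizer.isbayes() if self.normalizer else False


def fuse(scoring):
    # Search-side dispatch: pick the fusion strategy from the scoring config
    return "logodds" if scoring.isbayes() else "convex"
```

With this layout, enabling `"normalize": "bb25"` automatically routes hybrid queries to logit-space fusion, while all other scoring configurations keep the existing convex path.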
Thank you for this and the detailed explanation. If someone just enabled scoring-only indexing and enabled bayes, do the scores work? Or do they need the logic in logodds to work even in standalone? i.e.

```python
embeddings = Embeddings(
    keyword=True,
    scoring={"method": "bm25", "terms": True, "normalize": "bb25"}
)
```

I like the new methods to combine scores. I think it would be good to split that up into a separate subclass to containerize it (something like hybrid and similar to what we did with the scoring normalizer).
Thanks for the review!
Sparse-only BB25: Yes, it works standalone. When there's no ANN index, the hybrid flag is False and the fusion logic (logodds/convex/rrf) is never reached -- sparse results are returned directly with BB25 probabilities from the normalizer.
Subclass extraction: Good idea. I'll refactor the fusion methods into a separate Hybrid class (similar to Normalize) so Search delegates to it based on the scoring configuration.
Move logodds, convex, and rrf fusion methods from Search into a dedicated Hybrid class, following the same pattern as Normalize.
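A minimal sketch of what such a Hybrid container might look like. The class and method shapes below are illustrative assumptions (inputs as ranked lists of (uid, score) tuples), not txtai's actual implementation:

```python
import math


class Hybrid:
    """Illustrative fusion container: combines dense and sparse result lists."""

    def __call__(self, dense, sparse, bayes, weight=0.5):
        # Dispatch: Bayesian probabilities fuse in logit space
        method = self.logodds if bayes else self.convex
        return method(dense, sparse, weight)

    def convex(self, dense, sparse, weight):
        # Weighted linear combination of raw scores
        scores = {uid: weight * score for uid, score in dense}
        for uid, score in sparse:
            scores[uid] = scores.get(uid, 0.0) + (1.0 - weight) * score
        return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

    def logodds(self, dense, sparse, weight):
        # Weighted sum in logit space, mapped back to a probability
        scores = {uid: weight * self.logit(score) for uid, score in dense}
        for uid, score in sparse:
            scores[uid] = scores.get(uid, 0.0) + (1.0 - weight) * self.logit(score)
        fused = ((uid, 1.0 / (1.0 + math.exp(-value))) for uid, value in scores.items())
        return sorted(fused, key=lambda pair: pair[1], reverse=True)

    def rrf(self, dense, sparse, k=60):
        # Reciprocal rank fusion ignores scores, using only rank positions
        scores = {}
        for results in (dense, sparse):
            for rank, result in enumerate(results):
                scores[result[0]] = scores.get(result[0], 0.0) + 1.0 / (k + rank + 1)
        return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

    def logit(self, p, eps=1e-6):
        # Clamp away from 0/1 to keep the transform finite
        p = min(max(p, eps), 1.0 - eps)
        return math.log(p / (1.0 - p))
```

Keeping the three strategies behind one callable mirrors the Normalize pattern: Search only needs to hold a Hybrid instance and pass the isbayes() flag.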
Running the tests now. Once complete, I'll merge and re-run the benchmarks and report back. Thank you for this!
@jaepil Just ran the build. It's failing on the coding convention checks. If you wanted to fix this, you'd just need to install pre-commit to see the issues: https://github.com/neuml/.github/blob/master/CONTRIBUTING.md#set-up-a-development-environment If you don't have time for that, I can merge and fix after that.
@davidmezzetti I'm going to fix it now. |
- Fix black formatting: remove unnecessary parentheses, remove spaces around **
- Fix pylint too-many-branches: extract calibrate() method from logodds()
- Fix pylint unused-variable: rename score to _ in rrf()
@davidmezzetti Fixed the coding convention issues (black formatting + pylint warnings). The CI should pass now.
The other minor coding convention thing is that the repo doesn't use the "_" variable notation. But I can modify that after the merge too.
@jaepil Merged! I just ran the tests locally and they match. Thank you once again for adding this algorithm in!
@davidmezzetti Thank you for the thorough review and for merging! I'll keep the |
I just added tests for this code and added a couple code standardizations. Task complete! Thanks again. |
Summary

- Add logodds(), convex(), and rrf() fusion methods
- Add isbayes() on scoring classes to select the appropriate fusion strategy

Motivation
BB25 normalization outputs calibrated probabilities in [0, 1]. The existing convex combination fusion (w * dense + (1-w) * sparse) treats these as raw weights, discarding the Bayesian probability semantics. Log-odds conjunction fuses evidence in logit space, where independent probability signals combine additively -- the mathematically correct way to accumulate Bayesian evidence.

BEIR Benchmark Results (nDCG@10)
| Dataset | Δ nDCG@10 |
|----------|-----------|
| arguana | +2.23 |
| fiqa | +2.03 |
| scidocs | +0.62 |
| scifact | +1.33 |
| nfcorpus | -1.96 |

4/5 datasets improved. Average delta: +0.85 across all 5 datasets.
The nfcorpus regression is an inherent property of logit-space fusion on a corpus with very short queries (median 2 words), many relevant documents per query (38.2 avg), and graded relevance levels. The nonlinear logit transform reorders documents whose scores are very close, which slightly hurts fine-grained ordering among many high-scoring relevant documents.
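A small numeric sketch of that reordering effect (the scores below are made up for illustration, not benchmark values): two documents with near-identical sparse probabilities can be ranked differently by convex and logit-space fusion, because the logit transform magnifies differences near the extremes.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def convex(sparse, dense, weight=0.5):
    # Linear combination in probability space
    return weight * dense + (1 - weight) * sparse

def logodds(sparse, dense):
    # Additive evidence combination in logit space
    return sigmoid(logit(sparse) + logit(dense))

# Hypothetical (sparse, dense) probabilities for two close documents
doc1 = (0.98, 0.60)
doc2 = (0.97, 0.65)

# Convex prefers doc2 (0.79 vs 0.81); log-odds prefers doc1,
# because logit(0.98) - logit(0.97) outweighs the dense gap
print(convex(*doc1), convex(*doc2))
print(logodds(*doc1), logodds(*doc2))
```

When many relevant documents cluster this tightly, as in nfcorpus, such flips slightly perturb fine-grained ordering even though overall ranking quality improves on most corpora.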
For nfcorpus-like corpora, the reference BB25 implementation's parameter learning feature (BayesianProbabilityTransform.fit()) can recover the regression by fitting a stable global beta from relevance judgments, which prevents the per-query median from being thrown off by many near-identical BM25 scores from short queries. In our experiments, learned parameters brought nfcorpus from 32.83% back to 34.88%, surpassing the default baseline.
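The per-query median calibration discussed above could be sketched roughly as below. The function name and the fixed slope beta=5.0 are assumptions for illustration, not the PR's exact calibrate() implementation:

```python
import math
import statistics

def calibrate(scores, beta=5.0):
    """Map raw cosine similarities to (0, 1) probabilities by centering
    at the query's median score, then squashing through a sigmoid.
    beta (assumed value) controls how sharply scores separate."""
    med = statistics.median(scores)
    return [1 / (1 + math.exp(-beta * (score - med))) for score in scores]
```

A learned global beta, as with the reference implementation's fit(), would replace this fixed slope and per-query median with parameters estimated from relevance judgments, which is what stabilizes nfcorpus-style queries whose BM25 scores are nearly identical.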