Commit ba0d1ea
Add BB25 normalization for sparse encoders (#1046)
* Add log-odds conjunction fusion for BB25 hybrid search
BB25 normalization outputs calibrated probabilities, but the existing
hybrid fusion uses a convex combination, which discards the Bayesian
probability semantics. This causes BB25 to regress on 4/5 BEIR datasets.
Add log-odds conjunction fusion (from "From Bayesian Inference to Neural
Computation") that correctly combines probability signals in logit space
with per-query dynamic calibration for dense cosine scores.
- scoring/normalize.py: Extract Bayesian method check into isbayes()
- scoring/base.py: Add default isbayes() returning False
- scoring/tfidf.py: Add isbayes() delegating to normalizer
- search/base.py: Add logodds(), convex(), rrf() fusion methods;
  dispatch based on isbayes()
BEIR nDCG@10 results (BB25+LogOdds vs Default):
arguana +2.23, fiqa +2.03, scidocs +0.62, scifact +1.33, nfcorpus -1.96
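The log-odds fusion described above can be sketched roughly as follows. This is only an illustration of combining probabilities in logit space, not the code from this commit: the `calibrate()` step (z-score followed by a sigmoid), the equal-weight average of logits, and the `EPS` clamp are all assumptions made for the example.

```python
import math

EPS = 1e-6  # assumed clamp to keep logit() finite at 0 and 1


def logit(p):
    """Map a probability to log-odds."""
    p = min(max(p, EPS), 1 - EPS)
    return math.log(p / (1 - p))


def sigmoid(x):
    """Map log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))


def calibrate(scores):
    """Hypothetical per-query calibration for dense cosine scores:
    standardize within the query's result set, then squash to (0, 1)."""
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores)) or 1.0
    return [sigmoid((s - mean) / std) for s in scores]


def logodds(sparse, dense):
    """Fuse calibrated sparse probabilities with dense scores by
    averaging their log-odds (assumed equal weights)."""
    dense = calibrate(dense)
    return [sigmoid((logit(s) + logit(d)) / 2) for s, d in zip(sparse, dense)]


fused = logodds([0.9, 0.5, 0.1], [0.8, 0.5, 0.2])
```

Averaging in logit space lets a confident signal from either side move the result more than a linear blend would, which is the property a convex combination of raw scores does not preserve.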
* Extract Hybrid class for score fusion strategies
Move logodds, convex, and rrf fusion methods from Search into
a dedicated Hybrid class, following the same pattern as Normalize.
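For context, the two non-Bayesian strategies named above are standard techniques and can be sketched as below. The function signatures, the dict and list inputs, and the default `k=60` are assumptions for illustration; they are not taken from the Hybrid class itself.

```python
def convex(sparse, dense, weight=0.5):
    """Convex combination: weighted sum of two score dicts keyed by id.
    Ids missing from one side contribute 0 for that side."""
    ids = set(sparse) | set(dense)
    return {
        uid: weight * dense.get(uid, 0.0) + (1 - weight) * sparse.get(uid, 0.0)
        for uid in ids
    }


def rrf(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) over each ranked id list.
    Only ranks are used; the raw scores are ignored."""
    scores = {}
    for ranking in rankings:
        for rank, uid in enumerate(ranking, start=1):
            scores[uid] = scores.get(uid, 0.0) + 1.0 / (k + rank)
    return scores


blended = convex({"a": 1.0}, {"a": 0.0, "b": 1.0})
ranked = rrf([["a", "b", "c"], ["a", "c", "b"]])
```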
* Fix coding convention issues in Hybrid class for CI
- Fix black formatting: remove unnecessary parentheses, remove spaces around **
- Fix pylint too-many-branches: extract calibrate() method from logodds()
- Fix pylint unused-variable: rename score to _ in rrf()
* Add BB25 normalization for sparse encoders and fix IVFSparse topn bug
- Support `normalize: bb25` config for sparse encoder scoring, enabling
  Bayesian sigmoid calibration as an alternative to default linear
  normalization. Reuses existing Normalize.bayes() infrastructure.
- Fix dimension check in IVFSparse.topn(): use scores.shape[1] (number
  of data items) instead of scores.shape[0] (number of queries) for the
  argpartition kth bound check. The previous code caused ValueError when
  the number of centroids was less than nprobe.
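The topn fix can be illustrated with a small NumPy sketch. It is a hypothetical stand-in for `IVFSparse.topn()`, assuming `scores` has shape `(queries, items)`; the function body and the fallback branch are assumptions, not the library's code. The point is that the `kth` argument to `np.argpartition` must be bounded by the size of the axis being partitioned, which is `scores.shape[1]`.

```python
import numpy as np


def topn(scores, limit):
    """Return the indices of the top `limit` items per query, best first.

    scores: array of shape (queries, items).
    """
    items = scores.shape[1]  # bound check uses the item axis, not the query axis

    if limit < items:
        # Partition each row so the `limit` highest scores come first
        indices = np.argpartition(-scores, limit, axis=1)[:, :limit]
    else:
        # Fewer items than requested: keep every item
        indices = np.tile(np.arange(items), (scores.shape[0], 1))

    # Order the selected items by descending score
    order = np.argsort(-np.take_along_axis(scores, indices, axis=1), axis=1)
    return np.take_along_axis(indices, order, axis=1)


# 5 queries, 2 centroids, limit (nprobe) of 3
scores = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.0]])
result = topn(scores, 3)
```

With this input, a check against `scores.shape[0]` (5 queries) would pass for a limit of 3 and then call `np.argpartition` with `kth=3` on an axis of length 2, which raises `ValueError`.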
* Add tests for BB25 sparse normalization and IVFSparse topn fix
1 parent 8929992 commit ba0d1ea
4 files changed, +117 -5 lines changed
File tree
- src/python/txtai
  - ann/sparse
  - scoring
- test/python
  - testann
  - testscoring