## How It Works
Infidex uses a **lexicographic ranking model** where:

- **Precedence** is driven by structural and positional properties (coverage, phrase runs, anchor positions, etc.).
- **Semantic score** is refined using corpus-derived weights (inverse document frequency over character n‑grams), without any per-dataset manual tuning.

Concretely, each query term $q_i$ is assigned a weight

$$
I_i \approx \log_2\frac{N}{\mathrm{df}_i}
$$

where $N$ is the number of documents and $\mathrm{df}_i$ is the document frequency of the term's character n‑grams. Rarer terms get higher weights and therefore contribute more strongly to coverage and fusion decisions.
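As a rough illustration (not Infidex's actual code), the weighting above can be sketched in Python. The corpus, the per-n‑gram averaging, and the helper names are assumptions made for the sketch; the README does not pin down the exact aggregation over a term's n‑grams:

```python
import math

def char_ngrams(text, sizes=(2, 3)):
    # Character n-grams (2-grams + 3-grams), lowercased.
    t = text.lower()
    return {t[i:i + n] for n in sizes for i in range(len(t) - n + 1)}

def term_weight(term, corpus):
    # I_i ~ log2(N / df), averaged over the term's character n-grams.
    # (One possible aggregation, chosen for illustration only.)
    grams = char_ngrams(term)
    if not grams:
        return 0.0
    doc_grams = [char_ngrams(d) for d in corpus]
    n_docs = len(corpus)
    weights = []
    for g in grams:
        df = sum(1 for dg in doc_grams if g in dg)
        weights.append(math.log2(n_docs / df) if df else math.log2(n_docs))
    return sum(weights) / len(weights)

corpus = ["romantic comedy", "dark comedy special", "science documentary"]
```

With this corpus, a term whose n‑grams appear in fewer documents (e.g. `"dark"`) receives a higher weight than one whose n‑grams are widespread (e.g. `"comedy"`), matching the "rarer terms contribute more" behavior described above.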
### Three-Stage Search Pipeline
**Stage 1: BM25+ Candidate Generation**
- Tokenizes text into character n-grams (2-grams + 3-grams)
- Builds inverted index with document frequencies
- BM25+ scoring backbone with L2-normalized term weights ($k_1 = 1.2$, $b = 0.75$, $\delta = 1.0$)
where $n$ is the number of unique query terms. Intuitively, the suffix is informationally weaker than an average term, so we avoid over-committing to it.
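For reference, the textbook BM25+ term contribution with the parameters quoted above can be sketched as follows. This illustrates the standard formula (Lv & Zhai's lower-bounded BM25), not Infidex's implementation, and the IDF variant shown is an assumption:

```python
import math

def bm25_plus(tf, df, n_docs, doc_len, avg_len,
              k1=1.2, b=0.75, delta=1.0):
    # Single-term BM25+ contribution: the delta term guarantees a
    # lower bound so long documents are not unfairly penalized.
    # Summed only over terms that actually occur in the document.
    idf = math.log((n_docs + 1) / df)
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * (tf * (k1 + 1) / (norm + tf) + delta)
```

The score grows (with diminishing returns) in term frequency and shrinks as the document grows longer than average.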
- **Position-independent precedence boost**: when exactly one term is unmatched, we compare the **fraction of missing terms** ($1 - C_{\text{coord}}$) to the **fraction of missing information** (derived from $C_{\text{info}}$). If we have lost fewer bits of information than raw term coverage suggests, a precedence bit is set so that documents matching the rarer, more informative term outrank those matching only common terms.
- $L_{\text{2seg}}$: Two-segment alignment for concatenated queries
The final single-term semantic score is a convex combination of $C_{\text{avg}}$ and $L_{\text{lex}}$, chosen for practical behavior on real-world data.
**For multi-term queries:**
where:
- $C_{\text{avg}}$ = average per-term coverage
- $T_{\text{tfidf}}$ = normalized TF-IDF score from Stage 1
- $R_{\text{phrase}}$ = phrase run bonus (consecutive query terms in document order)
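The exact multi-term fusion formula is not shown in this excerpt, but a convex combination of the components above can be sketched as follows. The weights here are hypothetical placeholders, not Infidex's values:

```python
def multi_term_score(c_avg, t_tfidf, r_phrase,
                     w_cov=0.6, w_tfidf=0.3, w_phrase=0.1):
    # Illustrative convex combination (weights sum to 1) of:
    #   c_avg    - average per-term coverage
    #   t_tfidf  - normalized TF-IDF score from Stage 1
    #   r_phrase - phrase run bonus
    # Placeholder weights chosen for illustration only.
    return w_cov * c_avg + w_tfidf * t_tfidf + w_phrase * r_phrase
```

Because the weights sum to 1 and each component is normalized, the fused score stays in the same range as its inputs and is monotone in each component.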