Commit 5f41a74

committed
typo tolerance improvements
1 parent 36d608f commit 5f41a74

22 files changed

Lines changed: 1362 additions & 577 deletions

README.md

Lines changed: 52 additions & 18 deletions
@@ -249,21 +249,27 @@ var query = new Query("comedy", maxResults: 20)
249249

250250
## How It Works
251251

252-
Infidex uses a **data-agnostic lexicographic ranking model** that relies entirely on structural and positional properties of the match; no collection statistics (like IDF) influence the final ranking. This ensures explainable, intuitive results regardless of the corpus.
252+
Infidex uses a **lexicographic ranking model** where:
253+
- **Precedence** is driven by structural and positional properties (coverage, phrase runs, anchor positions, etc.).
254+
- **Semantic score** is refined using corpus-derived weights (inverse document frequency over character n‑grams), without any per-dataset manual tuning.
255+
256+
Concretely, each query term $q_i$ is assigned a weight
257+
258+
$$
259+
I_i \approx \log_2\frac{N}{\mathrm{df}_i}
260+
$$
261+
262+
where $N$ is the number of documents and $\mathrm{df}_i$ is the document frequency of the term’s character n‑grams. Rarer terms get higher weights and therefore contribute more strongly to coverage and fusion decisions.
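As a quick illustration of this weighting (a minimal Python sketch, not the library's API; the function name is hypothetical):

```python
import math

def information_weight(n_docs: int, doc_freq: int) -> float:
    """Illustrative IDF-style weight: I ~ log2(N / df).
    Rarer terms carry more bits and therefore weigh more."""
    return math.log2(n_docs / doc_freq)

# A term appearing in 10 of 40,000 documents carries far more
# information than one appearing in 20,000 of them.
rare = information_weight(40_000, 10)        # ~11.97 bits
common = information_weight(40_000, 20_000)  # 1.0 bit
```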
253263

254264
### Three-Stage Search Pipeline
255265

256266
**Stage 1: BM25+ Candidate Generation**
257267
- Tokenizes text into character n-grams (2-grams + 3-grams)
258268
- Builds inverted index with document frequencies
259-
- Uses **BM25+** as the information retrieval backbone with parameters $k_1 = 1.2$, $b = 0.75$, $\delta = 1.0$
260-
- BM25+ scoring with L2-normalized term weights:
269+
- BM25+ scoring backbone with L2-normalized term weights:
261270

262271
$$\text{BM25+}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \left( \frac{f(t,d) \cdot (k_1 + 1)}{f(t,d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})} + \delta \right)$$
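The per-term contribution can be sketched in Python as follows (an illustrative translation of the formula, not Infidex's internal code; the default parameter values $k_1 = 1.2$, $b = 0.75$, $\delta = 1.0$ are taken from the previous revision of this section):

```python
def bm25_plus(tf: float, doc_len: float, avgdl: float, idf: float,
              k1: float = 1.2, b: float = 0.75, delta: float = 1.0) -> float:
    """One term's BM25+ contribution to a document's score.
    The delta term guarantees a lower bound for any matching term,
    so long documents are not unfairly penalized."""
    norm = tf + k1 * (1 - b + b * doc_len / avgdl)
    return idf * (tf * (k1 + 1) / norm + delta)
```

Note that even with `tf = 0` in the saturation fraction, the `delta` floor keeps a matched term's contribution positive, which is the defining difference between BM25 and BM25+.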
263272

264-
- Ultra-fast with byte-quantized weights (4x memory savings)
265-
- Produces top-K candidates for Stage 2 refinement
266-
267273
Formally, let $V$ be the set of all indexed terms over alphabet $\Sigma$. Infidex builds a deterministic finite-state transducer
268274

269275
$$
@@ -287,7 +293,7 @@ with time complexity $O(|p| + |\mathrm{Pref}(p)|)$ and $O(|s| + |\mathrm{Suff}(s
287293
- Applied to top-K candidates from Stage 1
288294
- Tracks **per-term coverage** for each query word using 5 algorithms:
289295
- Exact whole-word matching
290-
- Fuzzy word matching (adaptive Damerau-Levenshtein)
296+
- Fuzzy word matching (Damerau–Levenshtein with an edit radius adapted from a binomial typo model)
291297
- Joined/split word detection
292298
- Prefix/suffix matching (prefixes weighted higher than suffixes)
293299
- LCS (Longest Common Subsequence) fallback when no word-level match exists
@@ -303,6 +309,34 @@ $$C_{\text{coord}} = \frac{1}{n} \sum_{i=1}^{n} c_i$$
303309

304310
- Extracts structural features: phrase runs, anchor token positions, lexical perfection
305311

312+
313+
On top of raw per-term coverage, Infidex tracks how much **information mass** from the query is actually matched:
314+
315+
- For each query term $q_i$, we compute a coverage score $c_i \in [0,1]$ and an information weight $I_i$ as above:
316+
317+
$$
318+
C_{\text{info}} = \frac{\sum_i c_i I_i}{\sum_i I_i}
319+
$$
320+
321+
This information view is used for two key behaviors:
322+
323+
- **Type-ahead detection**: the last query term is treated as “still being typed” when its information share is small:
324+
325+
$$
326+
\frac{I_{\text{last}}}{\sum_i I_i} \le \frac{1}{n+1}
327+
$$
328+
329+
where $n$ is the number of unique query terms. Intuitively, the suffix is informationally weaker than an average term, so we avoid over-committing to it.
330+
331+
- **Position-independent precedence boost**: when exactly one term is unmatched, we compare the **fraction of missing terms** ($1 - C_{\text{coord}}$) to the **fraction of missing information** (derived from $C_{\text{info}}$). If we have lost fewer bits of information than raw term coverage suggests, a precedence bit is set so that documents matching the rarer, more informative term outrank those matching only common terms.
332+
333+
$$
334+
\text{termGap} = 1 - C_{\text{coord}}, \qquad
335+
R_{\text{miss}} = \frac{\sum_i (1 - c_i) I_i}{\sum_i I_i}
336+
$$
337+
338+
If $R_{\text{miss}} < \text{termGap}$, the precedence bit is set.
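The three information-mass behaviors above can be exercised with a small Python sketch (the helper names are hypothetical; the real engine works over its own coverage structures):

```python
def info_coverage(c, w):
    """C_info = sum(c_i * I_i) / sum(I_i): information-weighted coverage."""
    return sum(ci * wi for ci, wi in zip(c, w)) / sum(w)

def last_term_is_typeahead(w):
    """Last term treated as still-being-typed when its
    information share is at most 1 / (n + 1)."""
    n = len(w)
    return w[-1] / sum(w) <= 1 / (n + 1)

def precedence_boost(c, w):
    """Set the boost when the fraction of missing information
    (R_miss) is smaller than the fraction of missing terms."""
    term_gap = 1 - sum(c) / len(c)  # 1 - C_coord
    r_miss = sum((1 - ci) * wi for ci, wi in zip(c, w)) / sum(w)
    return r_miss < term_gap

# Three terms with weights [1, 1, 10]: missing a *common* term
# (c = [0, 1, 1]) loses little information, so the boost fires;
# missing the *rare* term (c = [1, 1, 0]) does not.
```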
339+
306340
**Stage 3: Lexicographic Score Fusion**
307341

308342
The final **ordering** is a lexicographic triple $(\text{Precedence}, \text{Semantic}, \tau)$, where $\tau$ is an 8-bit tiebreaker.
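Because Python compares tuples element by element, the triple ordering can be illustrated directly (the values below are made up for illustration; the real precedence word is a bitfield computed in Stage 2):

```python
# Hypothetical candidates as (precedence, semantic, tiebreaker) triples.
candidates = [
    (0b0111, 0.95, 200),  # lower precedence, even with high semantic score
    (0b1000, 0.40, 10),   # higher precedence bit dominates everything after it
    (0b1000, 0.40, 90),   # same tier: the 8-bit tiebreaker decides
]

# Lexicographic descending sort: precedence first, then semantic, then tau.
ranked = sorted(candidates, reverse=True)
```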
@@ -332,18 +366,19 @@ Documents are ranked by a **strict precedence order**—higher bits always domin
332366

333367
The semantic component provides smooth differentiation within precedence tiers:
334368

335-
**For single-term queries:**
369+
**For single-term queries:**
336370

337-
$$S_{\text{single}} = \frac{1}{2} \cdot C_{\text{avg}} + \frac{1}{2} \cdot L_{\text{lex}}$$
371+
Infidex uses a **heuristic blend** of:
338372

339-
where $L_{\text{lex}}$ is the lexical similarity computed as:
373+
- Per-term coverage $C_{\text{avg}}$ (how completely the query token is matched), and
374+
- A lexical similarity score $L_{\text{lex}}$ that takes the maximum over several signals:
340375

341-
$$L_{\text{lex}} = \max\left( L_{\text{substr}}, L_{\text{prefix}}, L_{\text{fuzzy}}, L_{\text{2seg}} \right)$$
376+
- $L_{\text{substr}}$: substring containment (with a bias toward matches earlier in the query),
377+
- $L_{\text{prefix}}$: overlap between query suffix and token prefix,
378+
- $L_{\text{fuzzy}}$: Damerau–Levenshtein similarity (with transpositions),
379+
- $L_{\text{2seg}}$: a simple two-segment check for concatenated queries.
342380

343-
- $L_{\text{substr}}$: Substring containment score (position-weighted)
344-
- $L_{\text{prefix}}$: Longest prefix of doc token matching suffix of query
345-
- $L_{\text{fuzzy}}$: Damerau-Levenshtein similarity (transpositions allowed)
346-
- $L_{\text{2seg}}$: Two-segment alignment for concatenated queries
381+
The final single-term semantic score is a convex combination of $C_{\text{avg}}$ and $L_{\text{lex}}$, with weights chosen for practical behavior on real-world data.
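A toy Python version of the max-over-signals blend (using `difflib.SequenceMatcher` as a stand-in for the real Damerau–Levenshtein similarity, and omitting the two-segment check; the function name is hypothetical):

```python
from difflib import SequenceMatcher

def lexical_similarity(query: str, token: str) -> float:
    """Toy L_lex = max(L_substr, L_prefix, L_fuzzy).
    SequenceMatcher.ratio() approximates edit-distance similarity."""
    # Substring containment (no position weighting in this sketch).
    l_substr = 1.0 if query in token else 0.0
    # Overlap between a suffix of the query and a prefix of the token.
    overlap = 0
    for k in range(1, min(len(query), len(token)) + 1):
        if query[-k:] == token[:k]:
            overlap = k
    l_prefix = overlap / len(token)
    # Fuzzy similarity stand-in.
    l_fuzzy = SequenceMatcher(None, query, token).ratio()
    return max(l_substr, l_prefix, l_fuzzy)
```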
347382

348383
**For multi-term queries:**
349384

@@ -353,7 +388,7 @@ where:
353388
- $C_{\text{avg}}$ = average per-term coverage
354389
- $T_{\text{tfidf}}$ = normalized TF-IDF score from Stage 1
355390
- $R_{\text{phrase}}$ = phrase run bonus (consecutive query terms in document order)
356-
- $\alpha + \beta + \gamma = 1$ (currently $\alpha = 0.4$, $\beta = 0.4$, $\gamma = 0.2$)
391+
- $\alpha + \beta + \gamma = 1$ ($\alpha$, $\beta$, $\gamma$ are adjustable constants)
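A sketch of the blend in Python, using the weight values from the previous revision of this section ($\alpha = 0.4$, $\beta = 0.4$, $\gamma = 0.2$) as illustrative defaults:

```python
def semantic_multi(c_avg: float, t_tfidf: float, r_phrase: float,
                   alpha: float = 0.4, beta: float = 0.4,
                   gamma: float = 0.2) -> float:
    """Convex combination of coverage, TF-IDF, and phrase-run signals.
    The weights must sum to 1 so the score stays in [0, 1]."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    return alpha * c_avg + beta * t_tfidf + gamma * r_phrase
```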
357392

358393
#### Two-Segment Alignment (Single-Term Queries)
359394

@@ -496,7 +531,6 @@ var engine = new SearchEngine(
496531
textNormalizer: TextNormalizer.CreateDefault(),
497532
tokenizerSetup: TokenizerSetup.CreateDefault(),
498533
coverageSetup: CoverageSetup.CreateDefault(),
499-
stopTermLimit: 1_250_000, // Max terms before stop-word filtering
500534
wordMatcherSetup: new WordMatcherSetup
501535
{
502536
SupportLD1 = true, // Enable fuzzy matching
@@ -514,7 +548,7 @@ var engine = new SearchEngine(
514548
## Testing
515549

516550
Infidex ships with a comprehensive test suite of 400+ tests, including:
517-
- Query relevancy tests on morphologically diverse languages (currently English and Czech) without any dataset-specific tuning
551+
- Multilingual query relevancy tests (currently English and Czech) without dataset-specific tuning
518552
- Concurrency tests exercising parallel search, indexing, and save/load patterns
519553
- Persistence, performance, and core API behavior tests
520554

src/Infidex.Example/ExampleMode.cs

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
1+
namespace Infidex.Example;
2+
3+
/// <summary>
4+
/// Controls how the examples behave:
5+
/// - Index: build the engine and index documents, then exit.
6+
/// - Test: index + run predefined queries, then exit.
7+
/// - Repl: index + run predefined queries + interactive REPL (default).
8+
/// </summary>
9+
public enum ExampleMode
10+
{
11+
Index,
12+
Test,
13+
Repl
14+
}
15+
16+

src/Infidex.Example/MovieExample.cs

Lines changed: 61 additions & 18 deletions
@@ -1,5 +1,5 @@
1-
using System.Globalization;
21
using System.Diagnostics;
2+
using System.Globalization;
33
using CsvHelper;
44
using CsvHelper.Configuration;
55
using CsvHelper.Configuration.Attributes;
@@ -25,18 +25,30 @@ public class MovieRecord
2525

2626
public class MovieExample
2727
{
28-
public static void Run()
28+
public static void Run(bool useLargeDataset = false, ExampleMode mode = ExampleMode.Repl)
2929
{
30-
Console.WriteLine("Loading movies from CSV...");
30+
Console.WriteLine(useLargeDataset
31+
? "Loading 1M movies from CSV..."
32+
: "Loading movies from CSV...");
3133

3234
var records = new List<MovieRecord>();
3335
var config = new CsvConfiguration(CultureInfo.InvariantCulture)
3436
{
3537
PrepareHeaderForMatch = args => args.Header.ToLower(),
36-
MissingFieldFound = null // Ignore missing fields if any
38+
MissingFieldFound = null, // Ignore missing fields if any
39+
HeaderValidated = null // Allow schemas without all mapped headers (e.g. movies1M.csv)
3740
};
3841

39-
string filePath = Path.Combine(AppContext.BaseDirectory, "movies.csv");
42+
string filePath = useLargeDataset
43+
? GetLargeMoviesPath()
44+
: Path.Combine(AppContext.BaseDirectory, "movies.csv");
45+
46+
if (!File.Exists(filePath))
47+
{
48+
Console.WriteLine($"Error: movies CSV not found at {filePath}");
49+
return;
50+
}
51+
4052
Console.WriteLine($"Reading from: {filePath}");
4153
using (var reader = new StreamReader(filePath))
4254
using (var csv = new CsvReader(reader, config))
@@ -62,25 +74,41 @@ public static void Run()
6274
// though the current implementation does it at the end of IndexDocuments)
6375
// engine.CalculateWeights();
6476

65-
Console.WriteLine($"Indexing complete in {sw.ElapsedMilliseconds}ms.");
77+
long elapsedMs = sw.ElapsedMilliseconds;
78+
double elapsedSec = sw.Elapsed.TotalSeconds;
79+
double docsPerSec = elapsedSec > 0 ? records.Count / elapsedSec : 0;
6680

67-
// Perform Queries
68-
SearchAndPrint(engine, new Query("redemption shank"));
69-
SearchAndPrint(engine, new Query("Shaaawshank"));
70-
SearchAndPrint(engine, new Query("Shaa awashank"));
71-
SearchAndPrint(engine, new Query("Shaa awa shank"));
81+
Console.WriteLine(
82+
$"Indexing complete: {records.Count:N0} movies in {elapsedMs:N0} ms (~{docsPerSec:N0} docs/s).");
83+
84+
// Index-only mode: stop after building the index.
85+
if (mode == ExampleMode.Index)
86+
return;
7287

73-
while (true)
88+
// Perform predefined queries in Test/Repl modes.
89+
if (mode is ExampleMode.Test or ExampleMode.Repl)
7490
{
75-
Console.Write($"> ");
76-
string input = Console.ReadLine();
91+
SearchAndPrint(engine, new Query("redemption shank"));
92+
SearchAndPrint(engine, new Query("Shaaawshank"));
93+
SearchAndPrint(engine, new Query("Shaa awashank"));
94+
SearchAndPrint(engine, new Query("Shaa awa shank"));
95+
}
7796

78-
if (input is "q" or "!q" or "quit" or "exit")
97+
// Interactive REPL only in Repl mode.
98+
if (mode == ExampleMode.Repl)
99+
{
100+
while (true)
79101
{
80-
break;
102+
Console.Write($"> ");
103+
string input = Console.ReadLine();
104+
105+
if (input is "q" or "!q" or "quit" or "exit")
106+
{
107+
break;
108+
}
109+
110+
SearchAndPrint(engine, new Query(input));
81111
}
82-
83-
SearchAndPrint(engine, new Query(input));
84112
}
85113
}
86114

@@ -115,4 +143,19 @@ private static Document CreateMovieDocument(long id, MovieRecord movie)
115143
{
116144
return new Document(id, movie.Title);
117145
}
146+
147+
private static string GetLargeMoviesPath()
148+
{
149+
// When running from Infidex.Example, the base directory is typically:
150+
// src/Infidex.Example/bin/{Configuration}/{TargetFramework}/
151+
// We want to reach:
152+
// src/Infidex.Tests/movies1M.csv
153+
//
154+
// So we go up four levels to src/, then into Infidex.Tests.
155+
string baseDir = AppContext.BaseDirectory;
156+
string srcDir = Path.GetFullPath(
157+
Path.Combine(baseDir, "..", "..", "..", ".."));
158+
159+
return Path.Combine(srcDir, "Infidex.Tests", "movies1M.csv");
160+
}
118161
}

src/Infidex.Example/Program.cs

Lines changed: 96 additions & 18 deletions
@@ -1,27 +1,105 @@
1-
namespace Infidex.Example;
1+
using System;
2+
3+
namespace Infidex.Example;
24

35
class Program
46
{
57
static void Main(string[] args)
68
{
7-
Console.WriteLine("Select dataset:");
8-
Console.WriteLine("1 - Movies");
9-
Console.WriteLine("2 - Schools");
10-
Console.Write("> ");
11-
12-
string choice = Console.ReadLine()?.Trim() ?? "";
13-
14-
switch (choice)
9+
ExampleMode mode = GetModeFromArgs(args);
10+
int? dataset = GetDatasetFromArgs(args);
11+
if (dataset.HasValue)
12+
{
13+
switch (dataset.Value)
14+
{
15+
case 1:
16+
MovieExample.Run(useLargeDataset: false, mode: mode);
17+
return;
18+
case 2:
19+
SchoolExample.Run(mode: mode);
20+
return;
21+
case 3:
22+
MovieExample.Run(useLargeDataset: true, mode: mode);
23+
return;
24+
}
25+
}
26+
27+
while (true)
28+
{
29+
Console.WriteLine("Select dataset:");
30+
Console.WriteLine("1 - Movies 40K (en)");
31+
Console.WriteLine("2 - Schools 10K (cs)");
32+
Console.WriteLine("3 - Movies 1M (en)");
33+
Console.Write("> ");
34+
35+
string choice = Console.ReadLine()?.Trim() ?? "";
36+
37+
switch (choice)
38+
{
39+
case "1":
40+
MovieExample.Run(useLargeDataset: false, mode: mode);
41+
return;
42+
case "2":
43+
SchoolExample.Run(mode: mode);
44+
return;
45+
case "3":
46+
MovieExample.Run(useLargeDataset: true, mode: mode);
47+
return;
48+
default:
49+
Console.WriteLine("Invalid choice. Please enter 1, 2, or 3.");
50+
Console.WriteLine();
51+
break;
52+
}
53+
}
54+
}
55+
56+
private static int? GetDatasetFromArgs(string[] args)
57+
{
58+
if (args == null || args.Length == 0)
59+
return null;
60+
61+
foreach (string arg in args)
1562
{
16-
case "1":
17-
MovieExample.Run();
18-
break;
19-
case "2":
20-
SchoolExample.Run();
21-
break;
22-
default:
23-
Console.WriteLine("Invalid choice. Exiting.");
24-
break;
63+
if (arg.StartsWith("--dataset=", StringComparison.OrdinalIgnoreCase) ||
64+
arg.StartsWith("-d=", StringComparison.OrdinalIgnoreCase))
65+
{
66+
int eqIndex = arg.IndexOf('=');
67+
if (eqIndex < 0 || eqIndex == arg.Length - 1)
68+
continue;
69+
70+
string value = arg[(eqIndex + 1)..];
71+
if (int.TryParse(value, out int dataset) && dataset is >= 1 and <= 3)
72+
return dataset;
73+
}
2574
}
75+
76+
return null;
77+
}
78+
79+
private static ExampleMode GetModeFromArgs(string[] args)
80+
{
81+
if (args == null || args.Length == 0)
82+
return ExampleMode.Repl;
83+
84+
foreach (string arg in args)
85+
{
86+
if (arg.StartsWith("--mode=", StringComparison.OrdinalIgnoreCase) ||
87+
arg.StartsWith("-m=", StringComparison.OrdinalIgnoreCase))
88+
{
89+
int eqIndex = arg.IndexOf('=');
90+
if (eqIndex < 0 || eqIndex == arg.Length - 1)
91+
continue;
92+
93+
string value = arg[(eqIndex + 1)..];
94+
if (value.Equals("index", StringComparison.OrdinalIgnoreCase))
95+
return ExampleMode.Index;
96+
if (value.Equals("test", StringComparison.OrdinalIgnoreCase))
97+
return ExampleMode.Test;
98+
if (value.Equals("repl", StringComparison.OrdinalIgnoreCase))
99+
return ExampleMode.Repl;
100+
}
101+
}
102+
103+
return ExampleMode.Repl;
26104
}
27105
}
