@@ -254,6 +254,14 @@ take some time to complete a full indexing pass over a large directory tree.
 When indexing completes, ugrep-indexer displays the results of indexing. The
 total size of the indexes added and average indexing noise is also reported.

+Scanning a file to index results in a 64KB index hashes table. Then,
+ugrep-indexer repeatedly halves the table with bit compression using
+bitwise-and as long as the target accuracy is not exceeded. Halving is made
+possible by the fact that the table encodes hashes for 8 windows at offsets
+from the start of the pattern, corresponding to the 8 bits per index hashing
+table cell. Combining the two halves of the table may flip some bits from one
+to zero, which may cause a false positive match. Because bitwise-and can only
+clear bits and never sets them, each halving can only add false positives.
+This proves the monotonicity of the indexer.
+
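The halving step described in the added paragraph can be sketched roughly as follows. This is only an illustration, not ugrep-indexer's actual code: the helper names `halve` and `may_match` are hypothetical, and the sketch assumes that a zero bit means "this hash may occur", that the table size is a power of two, and that the upper half is folded onto the lower half.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (assumption, not the indexer's real function):
// fold the upper half of the table onto the lower half with bitwise-and.
// And-ing can only clear bits, so a cleared bit in either half stays cleared.
std::vector<std::uint8_t> halve(const std::vector<std::uint8_t>& table)
{
  assert(table.size() >= 2 && table.size() % 2 == 0);
  const std::size_t half = table.size() / 2;
  std::vector<std::uint8_t> folded(half);
  for (std::size_t i = 0; i < half; ++i)
    folded[i] = static_cast<std::uint8_t>(table[i] & table[i + half]);
  return folded;
}

// Hypothetical lookup: bit `window` (0..7) of the cell selected by hash `h`.
// Assumed convention: a zero bit means a possible match.
// The table size must be a power of two so that `h` can be masked to fit.
bool may_match(const std::vector<std::uint8_t>& table, std::uint32_t h, unsigned window)
{
  assert(window < 8);
  const std::size_t cell = h & (table.size() - 1);
  return (table[cell] & (1u << window)) == 0;
}
```

Under these assumptions, a hash that matched before halving still matches afterwards, because its cell is and-ed into the folded cell that the masked hash now selects. A hash that did not match before may start to match, which is the false positive the paragraph mentions.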
 The ugrep-indexer understands "binary files", which can be skipped and not
 indexed with ugrep-indexer option `-I` (`--ignore-binary`). This is useful
 when searching with ugrep option `-I` (`--ignore-binary`) to ignore binary
@@ -334,6 +342,14 @@ string:
 return true;
 }

+The prime 61 hash was chosen among many other possible hashing functions using
+a realistic experimental setup. A candidate hashing function is tested by
+repeatedly searching for a word that is randomly drawn from a 100MB Wikipedia
+file and then mutated in one, two or three characters. The mutation ensures
+that the word does not correspond to an actual valid word in the Wikipedia
+file. Each time a mutated word still matches the file, a false positive is
+recorded, which gives the false positive rate of the candidate. A hash
+function with a minimal false positive rate should be a good candidate overall.
+
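A minimal sketch of such an experiment is shown below, under loudly stated assumptions. The names `Matcher`, `mutate` and `false_positive_rate` are hypothetical and do not come from ugrep-indexer. The candidate hash function is hidden behind a caller-supplied `Matcher` callback. "Not a valid word" is simplified to "not in the given word list", and mutations use lowercase ASCII letters only.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <set>
#include <string>
#include <vector>

// Assumption: the index built with a candidate hash function is wrapped in a
// callback that reports whether a word matches the index.
using Matcher = std::function<bool(const std::string&)>;

// Overwrite `count` randomly chosen positions of `word` with random lowercase
// letters. A position may be picked twice, so fewer characters may change.
std::string mutate(std::string word, unsigned count, std::mt19937& rng)
{
  assert(count >= 1 && count <= 3);
  if (word.empty())
    return word;
  std::uniform_int_distribution<std::size_t> pos(0, word.size() - 1);
  std::uniform_int_distribution<int> letter('a', 'z');
  for (unsigned i = 0; i < count; ++i)
    word[pos(rng)] = static_cast<char>(letter(rng));
  return word;
}

// Fraction of mutated, non-valid words that the index still reports as a match.
double false_positive_rate(const std::vector<std::string>& words,
                           const Matcher& index_matches,
                           unsigned trials, unsigned seed)
{
  if (words.empty())
    return 0.0;
  const std::set<std::string> valid(words.begin(), words.end());
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> pick(0, words.size() - 1);
  std::uniform_int_distribution<unsigned> edits(1, 3);
  unsigned tested = 0, hits = 0;
  for (unsigned i = 0; i < trials; ++i)
  {
    const std::string probe = mutate(words[pick(rng)], edits(rng), rng);
    if (valid.count(probe) != 0)
      continue; // the mutation produced a real word, so it is not a valid probe
    ++tested;
    if (index_matches(probe))
      ++hits; // the word is not in the list, so any match is a false positive
  }
  return tested == 0 ? 0.0 : static_cast<double>(hits) / tested;
}
```

To compare candidates in this setup, one would build an index per hash function, wrap its lookup in a `Matcher`, and prefer the candidate with the lowest returned rate for the same word list and seed.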
 ### Q: What is indexing accuracy?

 Indexing is a form of lossy compression. The higher the indexing accuracy, the