Commit ec4fee9

updated README
explain index table compression by repeated halving
1 parent 44a0b1f commit ec4fee9

1 file changed

+16
-0
lines changed

README.md

Lines changed: 16 additions & 0 deletions
@@ -254,6 +254,14 @@ take some time to complete a full indexing pass over a large directory tree.
When indexing completes, ugrep-indexer displays the results of indexing. The
total size of the indexes added and the average indexing noise are also reported.

Scanning a file to index results in a 64KB index hash table. Then,
ugrep-indexer repeatedly halves the table with bit compression, combining the
two halves with bitwise-and, as long as the resulting indexing noise stays
within the target accuracy. Halving is made possible by the fact that the table
encodes hashes for 8 windows at offsets from the start of the pattern,
corresponding to the 8 bits per index hash table cell. Combining the two halves
of the table may flip some bits from one to zero, which may cause a false
positive match. Because bitwise-and never flips a bit from zero to one, halving
cannot cause a false negative, which preserves the monotonicity of the indexer.
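
The halving step can be sketched as follows. This is a minimal illustration under stated assumptions, not the ugrep-indexer source: the names `halve`, `compress`, `zero_bit_fraction`, `max_noise` and `min_size` are hypothetical, and the fraction of zero bits is used only as a stand-in for the indexer's actual noise measure.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical noise proxy: the fraction of bits in the table that are zero.
// Assumption: a zero bit means "this hash may occur", so more zero bits give
// more chances of a false positive. The real noise measure may differ.
double zero_bit_fraction(const std::vector<std::uint8_t>& table)
{
  std::size_t zeros = 0;
  for (std::uint8_t cell : table)
    zeros += 8 - std::bitset<8>(cell).count();
  return static_cast<double>(zeros) / (8.0 * table.size());
}

// Fold the upper half onto the lower half with bitwise-and. A bit stays one
// only if it is one in both halves, so bits can only change from one to zero.
std::vector<std::uint8_t> halve(const std::vector<std::uint8_t>& table)
{
  const std::size_t half = table.size() / 2;
  std::vector<std::uint8_t> folded(half);
  for (std::size_t i = 0; i < half; ++i)
    folded[i] = static_cast<std::uint8_t>(table[i] & table[i + half]);
  return folded;
}

// Halve repeatedly while the halved table stays within max_noise (hypothetical
// threshold) and does not shrink below min_size cells (hypothetical bound).
std::vector<std::uint8_t> compress(std::vector<std::uint8_t> table,
                                   double max_noise, std::size_t min_size)
{
  while (table.size() / 2 >= min_size)
  {
    std::vector<std::uint8_t> folded = halve(table);
    if (zero_bit_fraction(folded) > max_noise)
      break; // one more halving would exceed the noise target
    table = std::move(folded);
  }
  return table;
}
```

With a power-of-two table size, a lookup in the halved table would mask the hash with `table.size() - 1`, so that the cells at `i` and `i + half` of the larger table share one cell in the smaller one.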
The ugrep-indexer understands "binary files", which can be skipped and not
indexed with ugrep-indexer option `-I` (`--ignore-binary`). This is useful
when searching with ugrep option `-I` (`--ignore-binary`) to ignore binary
@@ -334,6 +342,14 @@ string:
return true;
}

The prime 61 hash was chosen among many other possible hashing functions using
a realistic experimental setup. A candidate hashing function is tested by
repeatedly searching for a word that is randomly drawn from a 100MB Wikipedia
file and then mutated in one, two or three characters. The mutation is made to
ensure that the mutated word does not correspond to an actual valid word in the
Wikipedia file. A false positive is recorded whenever a mutated word matches
the index of the file, which gives the false positive rate of the candidate. A
hash function with a minimal false positive rate should be a good candidate
overall.
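
The experiment above can be approximated with a small harness like the one below. It is a simplified sketch under assumptions, not the setup actually used: it indexes whole words instead of pattern windows, `mul61_hash` is a stand-in multiplicative hash with multiplier 61 that is not necessarily the function ugrep-indexer uses, and the names `Hash`, `mutate` and `false_positive_rate` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <random>
#include <string>
#include <unordered_set>
#include <vector>

using Hash = std::function<std::uint16_t(const std::string&)>;

// Stand-in candidate: multiply by 61, add the next byte, keep 16 bits.
std::uint16_t mul61_hash(const std::string& word)
{
  std::uint32_t h = 0;
  for (unsigned char c : word)
    h = (h * 61 + c) & 0xffff;
  return static_cast<std::uint16_t>(h);
}

// Replace `count` randomly chosen characters of a non-empty word with a
// different lowercase letter each time.
std::string mutate(std::string word, int count, std::mt19937& rng)
{
  std::uniform_int_distribution<std::size_t> pos(0, word.size() - 1);
  std::uniform_int_distribution<int> letter('a', 'z');
  for (int i = 0; i < count; ++i)
  {
    const std::size_t p = pos(rng);
    char c;
    do
    {
      c = static_cast<char>(letter(rng));
    } while (c == word[p]);
    word[p] = c;
  }
  return word;
}

// Estimate how often a mutated word that is not in `words` still hits a
// hash value that some real word produced. `words` must be non-empty.
double false_positive_rate(const std::vector<std::string>& words,
                           const Hash& hash, int trials, std::mt19937& rng)
{
  std::unordered_set<std::string> corpus(words.begin(), words.end());
  std::vector<bool> seen(65536, false);
  for (const std::string& w : words)
    seen[hash(w)] = true;
  std::uniform_int_distribution<std::size_t> pick(0, words.size() - 1);
  std::uniform_int_distribution<int> mutations(1, 3);
  int tested = 0, false_positives = 0;
  for (int i = 0; i < trials; ++i)
  {
    const std::string m = mutate(words[pick(rng)], mutations(rng), rng);
    if (corpus.count(m) != 0)
      continue; // the mutation produced a real word, so skip it
    ++tested;
    if (seen[hash(m)])
      ++false_positives;
  }
  return tested == 0 ? 0.0 : static_cast<double>(false_positives) / tested;
}
```

Calling `false_positive_rate` with the same word list and several candidate functions gives comparable rates, and the candidate with the lowest rate would be preferred under this simplified model.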
### Q: What is indexing accuracy?

Indexing is a form of lossy compression. The higher the indexing accuracy, the
