README.md: 23 additions & 48 deletions
@@ -3,7 +3,7 @@
Suppose that you have a database made of half a billion passwords. You want to check quickly whether a given password is in this database. We allow a small probability of false positives (that can be later checked against the full database) but we otherwise do not want any false negatives (if the password is in the set, we absolutely want to know about it).
-The typical approach to this problem is to apply a Bloom filter. We test here an Xor Filter. The goal is for the filter to use very little memory.
+The typical approach to this problem is to apply a Bloom filter. We test here a binary fuse filter. The goal is for the filter to use very little memory.
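The lookup pattern described above (a compact filter first, the full database only on a positive answer) can be sketched as follows. This is a minimal illustration and not the repository's code: it uses a toy Bloom filter as a stand-in for the binary fuse filter, and the names `TinyBloom` and `is_leaked` are hypothetical.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter (illustrative stand-in, not the project's filter)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive all bit positions from one digest via double hashing.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(key))


def is_leaked(password: str, bloom: TinyBloom, full_database: set) -> bool:
    """Two-stage check: cheap filter first, exact lookup only on a positive."""
    if not bloom.might_contain(password):
        return False  # the filter has no false negatives
    return password in full_database  # rules out a false positive


leaked = {"123456", "password", "qwerty", "letmein"}
bloom = TinyBloom(num_bits=12 * len(leaked), num_hashes=8)
for pw in leaked:
    bloom.add(pw)

print(is_leaked("password", bloom, leaked))
print(is_leaked("correct horse battery staple", bloom, leaked))
```

Every key that was added sets all of its bits, so a stored password can never be reported as absent. Only the rare false positives pay the cost of the exact lookup.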
## Requirement
@@ -15,11 +15,11 @@ Though the constructed filter may use only about a byte per set entry, the const
## Limitations
-The expectation is that the filter is built once. To build the filter over the full 550 million passwords, you currently need a machine with a sizeable amount of free RAM. It will almost surely fail on a laptop with 4GB or 8GB of RAM; 64 GB of RAM or more is recommended. We could further partition the problem (by dividing up the set) for lower memory usage or better parallelization.
+The expectation is that the filter is built once. To build the filter over the full 550 million passwords, you currently need a machine with a sizeable amount of free RAM. It will almost surely fail on a laptop with 4GB of RAM; 64 GB of RAM or more is recommended. We could further partition the problem (by dividing up the set) for lower memory usage or better parallelization.
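The partitioning idea mentioned above is not implemented in the project as far as this diff shows. One way it might look is sketched below: keys are assigned to shards by hash, one small filter is built per shard, and a query consults only the shard its key maps to. The function names are hypothetical, and a set of 16-bit fingerprints stands in for a real filter.

```python
import hashlib


def _digest(key: str) -> bytes:
    return hashlib.sha256(key.encode("utf-8")).digest()


def shard_of(key: str, num_shards: int) -> int:
    """Route a key to a shard using the first 8 digest bytes."""
    return int.from_bytes(_digest(key)[:8], "little") % num_shards


def fingerprint(key: str) -> int:
    """16-bit fingerprint taken from digest bytes not used for routing."""
    return int.from_bytes(_digest(key)[8:10], "little")


def build_shard(keys, shard: int, num_shards: int) -> frozenset:
    # Only this shard's fingerprints are held in memory during construction.
    return frozenset(
        fingerprint(k) for k in keys if shard_of(k, num_shards) == shard
    )


def build_partitioned(keys, num_shards: int) -> list:
    # `keys` must be re-iterable (e.g. a list, or a file re-read per pass).
    # Shards are independent, so they could also be built in parallel.
    return [build_shard(keys, s, num_shards) for s in range(num_shards)]


def might_contain(filters: list, key: str) -> bool:
    return fingerprint(key) in filters[shard_of(key, len(filters))]


keys = [f"password{i}" for i in range(1000)]
filters = build_partitioned(keys, num_shards=4)
print(all(might_contain(filters, k) for k in keys))
```

The trade-off in this sketch is one pass over the input per shard in exchange for a smaller peak memory footprint. Writing each shard to its own file in a single pass would be an alternative.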
Queries are very fast, however.
-We support up to 4 billion entries, if you have the available memory.
+We support hundreds of millions of entries and more, if you have the available memory.
## Preparing the data file
@@ -92,79 +92,54 @@ Expected number of queries per second: 17241.379
## Performance comparisons
For a comparable false positive probability (about 0.3%), the Bloom filter uses more space
-and is slower. The main downside of the xor filter is a more expensive construction.
+and is slower. The binary fuse 8 uses less space, is constructed faster, and has faster queries.
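The space claim can be checked with a rough estimate. The sketch below uses the standard formula for a Bloom filter with an optimal number of hash functions, $\log_2(1/\varepsilon)/\ln 2$ bits per entry. For the 8-bit binary fuse filter it assumes an overhead factor of about 1.125 over the fingerprint width; that factor is an assumption for illustration, not something measured here.

```python
import math


def bloom_bits_per_entry(fpp: float) -> float:
    """Bits per entry for a Bloom filter with the optimal number of hashes."""
    return math.log2(1.0 / fpp) / math.log(2.0)


def fuse_bits_per_entry(fingerprint_bits: int, overhead: float = 1.125) -> float:
    """Fingerprint width times an ASSUMED array-size overhead factor."""
    return fingerprint_bits * overhead


fpp = 2.0 ** -8  # false-positive rate of an 8-bit fingerprint, about 0.39%
bloom_bits = bloom_bits_per_entry(fpp)
fuse_bits = fuse_bits_per_entry(8)

entries = 550_000_000  # size of the password set quoted in the README
for name, bits in (("Bloom", bloom_bits), ("binary fuse 8", fuse_bits)):
    megabytes = entries * bits / 8 / 1e6
    print(f"{name}: {bits:.2f} bits/entry, about {megabytes:.0f} MB")
```

Under these assumptions the Bloom filter needs roughly 11.5 bits per entry against about 9 for the fuse filter. A practical Bloom filter rounds the number of hash functions to an integer, so its real cost is slightly higher than this estimate.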
* Thomas Mueller Graf, Daniel Lemire, Binary Fuse Filters: Fast and Smaller Than Xor Filters
* Thomas Mueller Graf, Daniel Lemire, [Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters](https://arxiv.org/abs/1912.08258), Journal of Experimental Algorithmics 25 (1), 2020. DOI: 10.1145/3376122