Commit b0016aa

Merge branch 'master' of github.com:FastFilter/FilterPassword_private
2 parents e57d3e3 + 35f0454 commit b0016aa

1 file changed, +23 -48 lines changed

README.md

Lines changed: 23 additions & 48 deletions
@@ -3,7 +3,7 @@
 
 Suppose that you have a database made of half a billion passwords. You want to check quickly whether a given password is in this database. We allow a small probability of false positives (that can be later checked against the full database) but we otherwise do not want any false negatives (if the password is in the set, we absolutely want to know about it).
 
-The typical approach to this problem is to apply a Bloom filters. We test here an Xor Filter. The goal is for the filter to use very little memory.
+The typical approach to this problem is to apply a Bloom filters. We test here a binary fuse filter. The goal is for the filter to use very little memory.
 
 
 ## Requirement
@@ -15,11 +15,11 @@ Though the constructed filter may use only about a byte per set entry, the const
 
 ## Limitations
 
-The expectation is that the filter is built once. To build the filter over the full 550 million passwords, you currently need a machine with a sizeable amount of free RAM. It will almost surely fail on a laptop with 4GB or 8GB of RAM; 64 GB of RAM or more is recommended. We could further partition the problem (by dividing up the set) for lower memory usage or better parallelization.
+The expectation is that the filter is built once. To build the filter over the full 550 million passwords, you currently need a machine with a sizeable amount of free RAM. It will almost surely fail on a laptop with 4GB of RAM; 64 GB of RAM or more is recommended. We could further partition the problem (by dividing up the set) for lower memory usage or better parallelization.
 
 Queries are very fast, however.
 
-We support up to 4 billion entries, if you have the available memory.
+We support hundreds of millions of entries and more, if you have the available memory.
 
 
 ## Preparing the data file
@@ -92,79 +92,54 @@ Expected number of queries per second: 17241.379
 ## Performance comparisons
 
 For a comparable false positive probability (about 0.3%), the Bloom filter uses more space
-and is slower. The main downside of the xor filter is a more expensive construction.
+and is slower. The binary fuse 8 uses less space, is constructed faster, and has faster queries.
 
 
 ```
-$ ./build_filter -m 10000000 -o xor8.bin -f xor8 pwned-passwords-sha1-ordered-by-count-v4.txt
+$ ./build_filter -m 10000000 -o binaryfuse8.bin -f binaryfuse8 pwned-passwords-sha1-ordered-by-count-v4.txt
 setting the max. number of entries to 10000000
 read 10000000 hashes.Reached the maximum number of lines 10000000.
-I read 10000000 hashes in total (0.902 seconds).
+I read 10000000 hashes in total (12.644 seconds).
 Bytes read = 452288199.
 Constructing the filter...
-Done in 1.303 seconds.
-filter data saved to xor8.bin. Total bytes = 12300054.
+Done in 0.420 seconds.
+filter data saved to binaryfuse8.bin. Total bytes = 11272228.
 
 
 $ ./build_filter -m 10000000 -o bloom12.bin -f bloom12 pwned-passwords-sha1-ordered-by-count-v4.txt
 setting the max. number of entries to 10000000
 read 10000000 hashes.Reached the maximum number of lines 10000000.
-I read 10000000 hashes in total (0.914 seconds).
+I read 10000000 hashes in total (12.696 seconds).
 Bytes read = 452288199.
 Constructing the filter...
-Done in 0.448 seconds.
+Done in 0.613 seconds.
 filter data saved to bloom12.bin. Total bytes = 15000024.
 
 
 
-$ for i in {1..3} ; do ./query_filter xor8.bin 7C4A8D09CA3762AF6 ; done
+$./query_filter binaryfuse8.bin 7C4A8D09CA3762AF6
+using database: binaryfuse8.bin
 hexval = 0x7c4a8d09ca3762af
-Xor filter detected.
-I expect the file to span 12300054 bytes.
-memory mapping is a success.
-Probably in the set.
-Processing time 88.000 microseconds.
-Expected number of queries per second: 11363.637
-hexval = 0x7c4a8d09ca3762af
-Xor filter detected.
-I expect the file to span 12300054 bytes.
-memory mapping is a success.
-Probably in the set.
-Processing time 59.000 microseconds.
-Expected number of queries per second: 16949.152
-hexval = 0x7c4a8d09ca3762af
-Xor filter detected.
-I expect the file to span 12300054 bytes.
-memory mapping is a success.
-Probably in the set.
-Processing time 68.000 microseconds.
-Expected number of queries per second: 14705.883
-
-$ for i in {1..3} ; do ./query_filter bloom12.bin 7C4A8D09CA3762AF6 ; done
-hexval = 0x7c4a8d09ca3762af
-Bloom filter detected.
-I expect the file to span 15000024 bytes.
-memory mapping is a success.
-Surely not in the set.
-Processing time 99.000 microseconds.
-Expected number of queries per second: 10101.010
-hexval = 0x7c4a8d09ca3762af
-Bloom filter detected.
-I expect the file to span 15000024 bytes.
+Binary fuse filter detected.
+I expect the file to span 11206692 bytes.
 memory mapping is a success.
 Surely not in the set.
-Processing time 88.000 microseconds.
-Expected number of queries per second: 11363.637
+Processing time 241.000 microseconds.
+Expected number of queries per second: 4149.377
+
+$ ./query_filter bloom12.bin 7C4A8D09CA3762AF6 ; done
+using database: bloom12.bin
 hexval = 0x7c4a8d09ca3762af
 Bloom filter detected.
 I expect the file to span 15000024 bytes.
 memory mapping is a success.
-Surely not in the set.
-Processing time 86.000 microseconds.
-Expected number of queries per second: 11627.907
+Probably in the set.
+Processing time 344.000 microseconds.
+Expected number of queries per second: 2906.977
 ```
 
 ## Reference
 
+* Thomas Mueller Graf, Daniel Lemire, Binary Fuse Filters: Fast and Smaller Than Xor Filters
 * Thomas Mueller Graf, Daniel Lemire, [Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters](https://arxiv.org/abs/1912.08258), Journal of Experimental Algorithmics 25 (1), 2020. DOI: 10.1145/3376122
 
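The space claim in the changed README text can be checked against the file sizes that `build_filter` reports in the logs above. The following is a minimal sketch of that arithmetic, not part of the repository: the helper `bits_per_entry` and the dictionary `FILTER_BYTES` are hypothetical names introduced here, and the byte counts are whole-file totals copied from the logs, so any file header is included in the figure.

```python
# Hypothetical helper for illustration; not part of FilterPassword.
# Entry count and byte totals are copied from the build_filter logs in the diff.
ENTRIES = 10_000_000
FILTER_BYTES = {
    "xor8": 12_300_054,         # from the removed (old) log lines
    "binaryfuse8": 11_272_228,  # from the added (new) log lines
    "bloom12": 15_000_024,      # unchanged between the two versions
}


def bits_per_entry(total_bytes: int, entries: int) -> float:
    """Return the on-disk cost of a filter in bits per stored entry."""
    return 8 * total_bytes / entries


if __name__ == "__main__":
    for name, size in FILTER_BYTES.items():
        print(f"{name}: {bits_per_entry(size, ENTRIES):.2f} bits per entry")
```

With these totals the sketch reports roughly 9.84 bits per entry for xor8, 9.02 for binaryfuse8, and 12.00 for bloom12, which is consistent with the README statement that the binary fuse filter uses less space than the Bloom filter.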