
Commit 6410b77

Merge branch 'master' of github.com:FastFilter/FilterPassword_private
2 parents 68380b3 + b0016aa commit 6410b77


3 files changed: +116 -123 lines changed

README.md

Lines changed: 23 additions & 48 deletions
@@ -3,7 +3,7 @@
 
 Suppose that you have a database made of half a billion passwords. You want to check quickly whether a given password is in this database. We allow a small probability of false positives (that can be later checked against the full database) but we otherwise do not want any false negatives (if the password is in the set, we absolutely want to know about it).
 
-The typical approach to this problem is to apply a Bloom filters. We test here an Xor Filter. The goal is for the filter to use very little memory.
+The typical approach to this problem is to apply a Bloom filters. We test here a binary fuse filter. The goal is for the filter to use very little memory.
 
 
 ## Requirement
@@ -15,11 +15,11 @@ Though the constructed filter may use only about a byte per set entry, the const
 
 ## Limitations
 
-The expectation is that the filter is built once. To build the filter over the full 550 million passwords, you currently need a machine with a sizeable amount of free RAM. It will almost surely fail on a laptop with 4GB or 8GB of RAM; 64 GB of RAM or more is recommended. We could further partition the problem (by dividing up the set) for lower memory usage or better parallelization.
+The expectation is that the filter is built once. To build the filter over the full 550 million passwords, you currently need a machine with a sizeable amount of free RAM. It will almost surely fail on a laptop with 4GB of RAM; 64 GB of RAM or more is recommended. We could further partition the problem (by dividing up the set) for lower memory usage or better parallelization.
 
 Queries are very fast, however.
 
-We support up to 4 billion entries, if you have the available memory.
+We support hundreds of millions of entries and more, if you have the available memory.
 
 
 ## Preparing the data file
@@ -92,79 +92,54 @@ Expected number of queries per second: 17241.379
 ## Performance comparisons
 
 For a comparable false positive probability (about 0.3%), the Bloom filter uses more space
-and is slower. The main downside of the xor filter is a more expensive construction.
+and is slower. The binary fuse 8 uses less space, is constructed faster, and has faster queries.
 
 
 ```
-$ ./build_filter -m 10000000 -o xor8.bin -f xor8 pwned-passwords-sha1-ordered-by-count-v4.txt
+$ ./build_filter -m 10000000 -o binaryfuse8.bin -f binaryfuse8 pwned-passwords-sha1-ordered-by-count-v4.txt
 setting the max. number of entries to 10000000
 read 10000000 hashes.Reached the maximum number of lines 10000000.
-I read 10000000 hashes in total (0.902 seconds).
+I read 10000000 hashes in total (12.644 seconds).
 Bytes read = 452288199.
 Constructing the filter...
-Done in 1.303 seconds.
-filter data saved to xor8.bin. Total bytes = 12300054.
+Done in 0.420 seconds.
+filter data saved to binaryfuse8.bin. Total bytes = 11272228.
 
 
 $ ./build_filter -m 10000000 -o bloom12.bin -f bloom12 pwned-passwords-sha1-ordered-by-count-v4.txt
 setting the max. number of entries to 10000000
 read 10000000 hashes.Reached the maximum number of lines 10000000.
-I read 10000000 hashes in total (0.914 seconds).
+I read 10000000 hashes in total (12.696 seconds).
 Bytes read = 452288199.
 Constructing the filter...
-Done in 0.448 seconds.
+Done in 0.613 seconds.
 filter data saved to bloom12.bin. Total bytes = 15000024.
 
 
 
-$ for i in {1..3} ; do ./query_filter xor8.bin 7C4A8D09CA3762AF6 ; done
+$./query_filter binaryfuse8.bin 7C4A8D09CA3762AF6
+using database: binaryfuse8.bin
 hexval = 0x7c4a8d09ca3762af
-Xor filter detected.
-I expect the file to span 12300054 bytes.
-memory mapping is a success.
-Probably in the set.
-Processing time 88.000 microseconds.
-Expected number of queries per second: 11363.637
-hexval = 0x7c4a8d09ca3762af
-Xor filter detected.
-I expect the file to span 12300054 bytes.
-memory mapping is a success.
-Probably in the set.
-Processing time 59.000 microseconds.
-Expected number of queries per second: 16949.152
-hexval = 0x7c4a8d09ca3762af
-Xor filter detected.
-I expect the file to span 12300054 bytes.
-memory mapping is a success.
-Probably in the set.
-Processing time 68.000 microseconds.
-Expected number of queries per second: 14705.883
-
-$ for i in {1..3} ; do ./query_filter bloom12.bin 7C4A8D09CA3762AF6 ; done
-hexval = 0x7c4a8d09ca3762af
-Bloom filter detected.
-I expect the file to span 15000024 bytes.
-memory mapping is a success.
-Surely not in the set.
-Processing time 99.000 microseconds.
-Expected number of queries per second: 10101.010
-hexval = 0x7c4a8d09ca3762af
-Bloom filter detected.
-I expect the file to span 15000024 bytes.
+Binary fuse filter detected.
+I expect the file to span 11206692 bytes.
 memory mapping is a success.
 Surely not in the set.
-Processing time 88.000 microseconds.
-Expected number of queries per second: 11363.637
+Processing time 241.000 microseconds.
+Expected number of queries per second: 4149.377
+
+$ ./query_filter bloom12.bin 7C4A8D09CA3762AF6 ; done
+using database: bloom12.bin
 hexval = 0x7c4a8d09ca3762af
 Bloom filter detected.
 I expect the file to span 15000024 bytes.
 memory mapping is a success.
-Surely not in the set.
-Processing time 86.000 microseconds.
-Expected number of queries per second: 11627.907
+Probably in the set.
+Processing time 344.000 microseconds.
+Expected number of queries per second: 2906.977
 ```
 
 ## Reference
 
+* Thomas Mueller Graf, Daniel Lemire, Binary Fuse Filters: Fast and Smaller Than Xor Filters
 * Thomas Mueller Graf, Daniel Lemire, [Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters](https://arxiv.org/abs/1912.08258), Journal of Experimental Algorithmics 25 (1), 2020. DOI: 10.1145/3376122
 
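The byte counts reported in the README's performance comparison can be converted to bits per entry, which makes the space difference easier to read. The following is a minimal sketch and not part of the commit: `bits_per_entry` and `report_space` are hypothetical helper names, and the sizes are copied from the logged output for 10,000,000 entries (the xor8 figure comes from the removed lines).

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical helper: average number of bits stored per set entry.
double bits_per_entry(uint64_t total_bytes, uint64_t entries) {
  assert(entries > 0);
  return total_bytes * 8.0 / entries;
}

// Prints the space usage for the file sizes quoted in the README logs.
void report_space() {
  const uint64_t entries = 10000000; // value passed with -m in the logs
  printf("binaryfuse8: %.2f bits/entry\n", bits_per_entry(11272228, entries));
  printf("xor8:        %.2f bits/entry\n", bits_per_entry(12300054, entries));
  printf("bloom12:     %.2f bits/entry\n", bits_per_entry(15000024, entries));
}
```

With these figures the binary fuse filter comes out at about 9.02 bits per entry, the xor filter at about 9.84, and the 12-bit Bloom filter at about 12.00. The file sizes include a small serialization header, so the per-entry cost of the fingerprints alone is marginally lower.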

src/build_filter.cpp

Lines changed: 57 additions & 50 deletions
@@ -11,13 +11,12 @@
 
 #include "bloom/bloom.h"
 #include "hexutil.h"
-#include "xorfilter/xorfilter.h"
-#include "xor_singleheader/include/xorfilter.h"
 #include "xor_singleheader/include/binaryfusefilter.h"
+#include "xor_singleheader/include/xorfilter.h"
+#include "xorfilter/xorfilter.h"
 
 #include "mappeablebloomfilter.h"
 
-
 static void printusage(char *command) {
   printf(" Try %s -f xor8 -o filter.bin mydatabase \n", command);
   ;
@@ -26,9 +25,8 @@ static void printusage(char *command) {
   printf("The -V flag verifies the resulting filter.\n");
 }
 
-
-
-uint64_t * read_data(const char *filename, size_t & array_size, size_t maxline, bool printall) {
+uint64_t *read_data(const char *filename, size_t &array_size, size_t maxline,
+                    bool printall) {
   char *line = NULL;
   size_t line_capacity = 0;
   int read;
@@ -140,11 +138,11 @@ int main(int argc, char **argv) {
   }
 
   size_t array_size;
-  uint64_t * array;
-  if(synthetic) {
+  uint64_t *array;
+  if (synthetic) {
     array_size = synthetic_size;
     array = (uint64_t *)malloc(array_size * sizeof(uint64_t));
-    for(size_t i = 0; i < array_size; i++) {
+    for (size_t i = 0; i < array_size; i++) {
       array[i] = i;
     }
   } else {
@@ -154,13 +152,12 @@ int main(int argc, char **argv) {
     }
     const char *filename = argv[optind];
     array = read_data(filename, array_size, maxline, printall);
-    if(array == nullptr) {
+    if (array == nullptr) {
       return EXIT_FAILURE;
     }
   }
   clock_t start, end;
 
-
   printf("Constructing the filter...\n");
   fflush(NULL);
   if (strcmp("binaryfuse8", filtername) == 0) {
@@ -170,23 +167,25 @@ int main(int argc, char **argv) {
     binary_fuse8_populate(array, array_size, &filter);
     end = clock();
     printf("Done in %.3f seconds.\n", (float)(end - start) / CLOCKS_PER_SEC);
-    if(verify) {
+    if (verify) {
       printf("Checking for false negatives\n");
-      for(size_t i = 0; i < array_size; i++) {
-        if(!binary_fuse8_contain(array[i],&filter)) {
-          printf("Detected a false negative. You probably have a bug. Aborting.\n");
+      for (size_t i = 0; i < array_size; i++) {
+        if (!binary_fuse8_contain(array[i], &filter)) {
+          printf("Detected a false negative. You probably have a bug. "
+                 "Aborting.\n");
           return EXIT_FAILURE;
         }
       }
       printf("Verified with success: no false negatives\n");
       size_t matches = 0;
       size_t volume = 100000;
-      for(size_t t = 0; t < volume; t++) {
-        if(binary_fuse8_contain( t * 10001 + 13 + array_size,&filter)) {
+      for (size_t t = 0; t < volume; t++) {
+        if (binary_fuse8_contain(t * 10001 + 13 + array_size, &filter)) {
           matches++;
         }
       }
-      printf("estimated false positive rate: %.3f percent\n", matches * 100.0 / volume);
+      printf("estimated false positive rate: %.3f percent\n",
+             matches * 100.0 / volume);
     }
     free(array);
 
@@ -198,21 +197,24 @@ int main(int argc, char **argv) {
     }
     uint64_t cookie = 1234569;
     bool isok = true;
-    size_t total_bytes = sizeof(cookie) + sizeof(filter.Seed) + sizeof(filter.SegmentLength)
-      + sizeof(filter.SegmentLengthMask) + sizeof(filter.SegmentCount)
-      + sizeof(filter.SegmentCountLength) + sizeof(filter.ArrayLength)
-      + sizeof(uint8_t) * filter.ArrayLength;
-
+    size_t total_bytes =
+        sizeof(cookie) + sizeof(filter.Seed) + sizeof(filter.SegmentLength) +
+        sizeof(filter.SegmentLengthMask) + sizeof(filter.SegmentCount) +
+        sizeof(filter.SegmentCountLength) + sizeof(filter.ArrayLength) +
+        sizeof(uint8_t) * filter.ArrayLength;
 
     isok &= fwrite(&cookie, sizeof(cookie), 1, write_ptr);
     isok &= fwrite(&filter.Seed, sizeof(filter.Seed), 1, write_ptr);
-    isok &= fwrite(&filter.SegmentLength, sizeof(filter.SegmentLength), 1, write_ptr);
-    isok &= fwrite(&filter.SegmentLengthMask, sizeof(filter.SegmentLengthMask), 1, write_ptr);
-    isok &= fwrite(&filter.SegmentCount, sizeof(filter.SegmentCount), 1, write_ptr);
-    isok &= fwrite(&filter.SegmentCountLength, sizeof(filter.SegmentCountLength), 1, write_ptr);
-    isok &= fwrite(&filter.ArrayLength, sizeof(filter.ArrayLength), 1, write_ptr);
-
-
+    isok &= fwrite(&filter.SegmentLength, sizeof(filter.SegmentLength), 1,
+                   write_ptr);
+    isok &= fwrite(&filter.SegmentLengthMask, sizeof(filter.SegmentLengthMask),
+                   1, write_ptr);
+    isok &=
+        fwrite(&filter.SegmentCount, sizeof(filter.SegmentCount), 1, write_ptr);
+    isok &= fwrite(&filter.SegmentCountLength,
+                   sizeof(filter.SegmentCountLength), 1, write_ptr);
+    isok &=
+        fwrite(&filter.ArrayLength, sizeof(filter.ArrayLength), 1, write_ptr);
     isok &= fwrite(filter.Fingerprints, sizeof(uint8_t) * filter.ArrayLength, 1,
                    write_ptr);
     isok &= (fclose(write_ptr) == 0);
@@ -231,23 +233,25 @@ int main(int argc, char **argv) {
     xor8_buffered_populate(array, array_size, &filter);
     end = clock();
     printf("Done in %.3f seconds.\n", (float)(end - start) / CLOCKS_PER_SEC);
-    if(verify) {
+    if (verify) {
       printf("Checking for false negatives\n");
-      for(size_t i = 0; i < array_size; i++) {
-        if(!xor8_contain(array[i],&filter)) {
-          printf("Detected a false negative. You probably have a bug. Aborting.\n");
+      for (size_t i = 0; i < array_size; i++) {
+        if (!xor8_contain(array[i], &filter)) {
+          printf("Detected a false negative. You probably have a bug. "
+                 "Aborting.\n");
           return EXIT_FAILURE;
         }
       }
       printf("Verified with success: no false negatives\n");
       size_t matches = 0;
       size_t volume = 100000;
-      for(size_t t = 0; t < volume; t++) {
-        if(xor8_contain( t * 10001 + 13 + array_size,&filter)) {
+      for (size_t t = 0; t < volume; t++) {
+        if (xor8_contain(t * 10001 + 13 + array_size, &filter)) {
           matches++;
         }
       }
-      printf("estimated false positive rate: %.3f percent\n", matches * 100.0 / volume);
+      printf("estimated false positive rate: %.3f percent\n",
+             matches * 100.0 / volume);
     }
     free(array);
 
@@ -281,36 +285,39 @@ int main(int argc, char **argv) {
     start = clock();
     using Table = bloomfilter::BloomFilter<uint64_t, 12, false, SimpleMixSplit>;
     Table table(array_size);
-    for(size_t i = 0; i < array_size; i++) {
+    for (size_t i = 0; i < array_size; i++) {
       table.Add(array[i]);
     }
     end = clock();
     printf("Done in %.3f seconds.\n", (float)(end - start) / CLOCKS_PER_SEC);
-    if(verify) {
+    if (verify) {
       printf("Checking for false negatives\n");
-      for(size_t i = 0; i < array_size; i++) {
-        if(table.Contain(array[i]) != bloomfilter::Ok) {
-          printf("Detected a false negative. You probably have a bug. Aborting.\n");
+      for (size_t i = 0; i < array_size; i++) {
+        if (table.Contain(array[i]) != bloomfilter::Ok) {
+          printf("Detected a false negative. You probably have a bug. "
+                 "Aborting.\n");
           return EXIT_FAILURE;
         }
       }
-      MappeableBloomFilter<12> filter(
-        table.SizeInBytes() / 8, table.hasher.seed, table.data);
-      for(size_t i = 0; i < array_size; i++) {
-        if(!filter.Contain(array[i])) {
-          printf("Detected a false negative. You probably have a bug. Aborting.\n");
+      MappeableBloomFilter<12> filter(table.SizeInBytes() / 8,
+                                      table.hasher.seed, table.data);
+      for (size_t i = 0; i < array_size; i++) {
+        if (!filter.Contain(array[i])) {
+          printf("Detected a false negative. You probably have a bug. "
+                 "Aborting.\n");
           return EXIT_FAILURE;
         }
       }
       printf("Verified with success: no false negatives\n");
       size_t matches = 0;
       size_t volume = 100000;
-      for(size_t t = 0; t < volume; t++) {
-        if(filter.Contain( t * 10001 + 13 + array_size)) {
+      for (size_t t = 0; t < volume; t++) {
+        if (filter.Contain(t * 10001 + 13 + array_size)) {
           matches++;
         }
       }
-      printf("estimated false positive rate: %.3f percent\n", matches * 100.0 / volume);
+      printf("estimated false positive rate: %.3f percent\n",
+             matches * 100.0 / volume);
     }
     free(array);
     FILE *write_ptr;
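
The `verify` branches in this diff all follow the same two-step pattern: check every inserted key for false negatives, then probe keys of the form `t * 10001 + 13 + array_size` to estimate the false positive rate. Below is a minimal self-contained sketch of that pattern. `ToyBloom` and the two helper templates are hypothetical stand-ins written for illustration, not the filter types used in `build_filter.cpp`, and the probe keys are assumed to be absent from the set (which holds for synthetic keys `0..n-1`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy Bloom filter for illustration only (hypothetical, not the repo's type).
class ToyBloom {
public:
  ToyBloom(size_t entries, size_t bits_per_entry, int hashes)
      : bits_(entries * bits_per_entry, false), hashes_(hashes) {
    assert(!bits_.empty());
  }

  void add(uint64_t key) {
    uint64_t h = mix(key);
    uint64_t a = h & 0xffffffffu, b = (h >> 32) | 1; // double hashing
    for (int i = 0; i < hashes_; i++) {
      bits_[(a + i * b) % bits_.size()] = true;
    }
  }

  bool contain(uint64_t key) const {
    uint64_t h = mix(key);
    uint64_t a = h & 0xffffffffu, b = (h >> 32) | 1;
    for (int i = 0; i < hashes_; i++) {
      if (!bits_[(a + i * b) % bits_.size()]) {
        return false; // one unset bit proves the key was never added
      }
    }
    return true; // "probably in the set"
  }

private:
  // 64-bit finalizer in the style of MurmurHash3.
  static uint64_t mix(uint64_t h) {
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33;
    return h;
  }

  std::vector<bool> bits_;
  int hashes_;
};

// Step 1: every inserted key must be reported as present.
template <typename Filter>
bool has_no_false_negatives(const Filter &f,
                            const std::vector<uint64_t> &keys) {
  for (uint64_t k : keys) {
    if (!f.contain(k)) {
      return false;
    }
  }
  return true;
}

// Step 2: probe keys assumed to be outside the set and count the hits.
template <typename Filter>
double estimate_false_positive_percent(const Filter &f, size_t array_size,
                                       size_t volume = 100000) {
  size_t matches = 0;
  for (size_t t = 0; t < volume; t++) {
    if (f.contain(t * 10001 + 13 + array_size)) {
      matches++;
    }
  }
  return matches * 100.0 / volume;
}
```

Writing the two checks as templates over any type with a `contain` method is one way to avoid repeating the loops per filter; the commit itself keeps a separate copy in each branch. When the keys are real hashes rather than synthetic integers, a probe could coincide with a stored key, so the estimate is only approximate.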
