README.md: 69 additions & 18 deletions
@@ -3,21 +3,26 @@
Python bindings for [C](https://github.com/FastFilter/xor_singleheader) implementation of [Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters](https://arxiv.org/abs/1912.08258)
and of [Binary Fuse Filters: Fast and Smaller Than Xor Filters](https://arxiv.org/abs/2201.01174).

+If you have sets using much memory (e.g., thousands or millions of URLs) and you want to
+quickly filter out elements that are not in the set, these filters offer both great
The filters Xor8 and Fuse8 use slightly over a byte of memory per entry, with a false positive rate of about 0.39%.
-The filters Xor16 and Fuse16 use slightly over two bytes of memory per entry, with a false positive rate of about 0.0015%.
+The filters Xor16 and Fuse16 use slightly over two bytes of memory per entry, with a false positive rate of about 0.0015%. For large sets, Fuse8 and Fuse16 filters use slightly less memory and they can be built
+faster.
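The two false positive rates quoted above follow from the fingerprint width. As a rough sketch, assuming the 8-bit filters store an 8-bit fingerprint per entry and the 16-bit filters a 16-bit one, a key outside the set matches by chance with probability about 2^-k. The helper `false_positive_rate` is a hypothetical name used for illustration only and is not part of this library:

```py
# Assumption: a filter storing k-bit fingerprints reports a false positive
# when a random fingerprint matches by chance, i.e. with probability ~2**-k.
def false_positive_rate(fingerprint_bits: int) -> float:
    """Approximate false positive rate for a k-bit fingerprint filter."""
    return 2.0 ** -fingerprint_bits

rate8 = false_positive_rate(8)    # Xor8 / Fuse8
rate16 = false_positive_rate(16)  # Xor16 / Fuse16

print(f"8-bit fingerprints:  {rate8:.2%}")   # 0.39%
print(f"16-bit fingerprints: {rate16:.4%}")  # 0.0015%
```

Doubling the fingerprint width roughly doubles the memory per entry while dividing the false positive rate by 256.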
@@ -26,8 +31,7 @@ The filters Xor16 and Fuse16 use slightly over two bytes of memory per entry, wi
>>>
>>>#Supports unicode strings and heterogeneous types
>>> test_str = ["あ","अ", 51, 0.0, 12.3]
->>>filter= Xor8(len(test_str)) #or Xor16(size)
->>>filter.populate(test_str)
+>>>filter= Xor8(test_str)
True
>>>filter.contains("अ")
True
@@ -41,6 +45,10 @@ False
60
```
+
+The `size_in_bytes()` function gives the memory usage of the filter itself. It does not count
+the Python overhead which adds a few bytes to the actual memory usage.
+
You can serialize a filter with the `serialize()` method which returns a buffer, and you can recover the filter with the `deserialize(buffer)` method, which returns a filter:

```py
@@ -50,20 +58,25 @@ You can serialize a filter with the `serialize()` method which returns a buffer,
The serialization format is as concise as possible and will typically use a few bytes
+less than `size_in_bytes()`.
+
## Measuring data usage

-The `size_in_bytes()` function gives the memory usage of the filter itself. The actual memory usage is slightly higher (there is a small constant overhead) due to
+The actual memory usage is slightly higher (there is a small constant overhead) due to