If you work with large sets—such as thousands or millions of URLs, user IDs, or other keys—and need to quickly check whether an element is present, these filters provide exceptional speed and minimal memory usage. They are ideal for applications where you want to efficiently exclude elements not in your set, with much lower overhead than traditional approaches like sets or even Bloom filters.
pip install pyfusefilter
The filters Xor8 and Fuse8 use slightly over a byte of memory per entry, with a false positive rate of about 0.39%. The filters Xor16 and Fuse16 use slightly over two bytes of memory per entry, with a false positive rate of about 0.0015%. For large sets, Fuse8 and Fuse16 filters use slightly more memory and they can be built faster.
import pyfusefilter
#Supports unicode strings and heterogeneous types
filter = pyfusefilter.Xor8(["あ","अ", 51, 0.0, 12.3])
filter.contains("अ") # returns true
# next returns true
filter[51] #You can use __getitem__ instead of contains
filter["か"] # returns false
12.3 in filter # returns true
The size_in_bytes()
function gives the memory usage of the filter itself. It does not count
the Python overhead which adds a few bytes to the actual memory usage:
filter.size_in_bytes()
You can serialize a filter with the serialize()
method which returns a buffer, and you can recover the filter with the deserialize(buffer)
method, which returns a filter:
import pyfusefilter
import tempfile
filter = pyfusefilter.Xor8(["あ","अ", 51, 0.0, 12.3])
with tempfile.NamedTemporaryFile(delete=False) as tmp:
tmp.write(filter.serialize())
tmp_path = tmp.name
with open(tmp_path, 'rb') as f:
recoverfilter = pyfusefilter.Xor8.deserialize(f.read())
recoverfilter[51] # returns True
The serialization format is as concise as possible and will typically use a few bytes
less than size_in_bytes()
.
The actual memory usage is slightly higher (there is a small constant overhead) due to Python metadata.
from pyfusefilter import Xor8, Fuse8
N = 100
while (N < 10000000):
# filters can be initialized with an integer, the memory is allocated, but unused.
# call 'populate' to fill them with data.
filter = Xor8(len(data))
fusefilter = Fuse8(len(data))
print(N, filter.size_in_bytes()/N, fusefilter.size_in_bytes()/N)
N *= 10
For more accuracy(less false positives) use larger but more accurate Xor16 for Fuse16.
For large sets (contain millions of keys), Fuse8/Fuse16 filters are faster and smaller than Xor8/Xor16.
>>> filter = Xor8(1000000)
>>> filter.size_in_bytes()
1230054
>>> filter = Fuse8(1000000)
>>> filter.size_in_bytes()
1130536
Assuming that your Python interpreter is called python
.
# Clone the repository with submodules
git clone --recurse-submodules https://github.com/glitzflitz/pyfusefilter
cd pyfusefilter
# If you forgot --recurse-submodules, initialize submodules now
git submodule update --init --recursive
# Create and activate virtual environment
python -m venv pyfuseenv
source pyfuseenv/bin/activate # On Windows: pyfuseenv\Scripts\activate
# Install build dependencies
python -m pip install setuptools wheel cffi xxhash
# Build the CFFI extension
python setup.py build_ext
# Install the package
python setup.py install
# Optional: Run tests to verify installation
python -m pip install pytest
python -m pytest tests/ -v
# Generate documentation
python -m pip install pdoc
python -m pdoc pyfusefilter --output-dir docs
Notes:
- The build process compiles C code using your system's C compiler
- On macOS, you may need to install Xcode command line tools:
xcode-select --install
- On Linux, install development headers:
apt-get install python3-dev
(Ubuntu/Debian) oryum install python3-devel
(CentOS/RHEL)
To generate documentation locally:
# Install pdoc
pip install pdoc
# Generate documentation
pdoc pyfusefilter --output-dir docs
# View locally
python -m http.server 8000
# Open http://localhost:8000/docs/
- Binary Fuse Filters: Fast and Smaller Than Xor Filters, Journal of Experimental Algorithmics 27, 2022.
- Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters, Journal of Experimental Algorithmics 25 (1), 2020
- C Implementation
- Go Implementation
- Erlang bindings
- Rust Implementation: 1 and 2
- C++ Implementation
- Java Implementation