-
Notifications
You must be signed in to change notification settings - Fork 9
Split bloomfilter-blocked into its own package
#770
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
4bb71e4
db97e35
453cac6
6641ff1
69db9f8
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,29 +1,75 @@ | ||
| # A fast, space efficient Bloom filter implementation | ||
| # bloomfilter-blocked | ||
|
|
||
| Copyright 2008, 2009, 2010, 2011 Bryan O'Sullivan <[email protected]>. | ||
| `bloomfilter-blocked` is a Haskell library providing multiple fast and efficient | ||
| implementations of [bloom filters][bloom-filter:wiki]. It is a full rewrite of | ||
| the [`bloomfilter`][bloomfilter:hackage] package, originally authored by Bryan | ||
| O'Sullivan <[email protected]>. | ||
|
|
||
| This package provides both mutable and immutable Bloom filter data | ||
| types, along with a family of hash function and an easy-to-use | ||
| interface. | ||
| A bloom filter is a space-efficient data structure representing a set that can | ||
| be probablistically queried for set membership. The set membership query returns | ||
| no false negatives, but it might return false positives. That is, if an element | ||
| was added to a bloom filter, then a subsequent query definitely returns `True`. | ||
| If an element was *not* added to a filter, then a subsequent query may still | ||
| return `True` if `False` would be the correct answer. The probabiliy of false | ||
| positives -- the false positive rate (FPR) -- is configurable, as we will | ||
| describe later. | ||
|
|
||
| To build: | ||
| The library includes two implementations of bloom filters: classic, and blocked. | ||
|
|
||
| cabal install bloomfilter | ||
| * **Classic** bloom filters, found in the `Data.BloomFilter.Classic` module: a | ||
| default implementation that is faithful to the canonical description of a | ||
| bloom filter data structure. | ||
|
|
||
| For examples of usage, see the Haddock documentation and the files in | ||
| the examples directory. | ||
| * **Blocked** floom filters, found in the `Data.BloomFilter.Blocked` module: an | ||
| implementation that optimises the memory layout of a classic bloom filter for | ||
| speed (cheaper CPU cache reads), at the cost of a slightly higher FPR for the | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. cheaper/fewer |
||
| same amount of assigned memory. | ||
|
|
||
| The FPR scales inversely with how much memory is assigned to the filter. It also | ||
| scales inversely with how many elements are added to the set. The user can | ||
| configure how much memory is asisgned to a filter, and the user also controls | ||
| how many elements are added to a set. Each implementation comes with helper | ||
| functions, like `sizeForFPR` and `sizeForBits`, that the user can leverage to | ||
| configure filters. | ||
|
|
||
| # Get involved! | ||
| Both immutable (`Bloom`) and mutable (`MBloom`) bloom filters, including | ||
| functions to convert between the two, are provided for each implementation. Note | ||
| however that a (mutable) bloom filter can not be resized once created, and that | ||
| elements can not be deleted once inserted. | ||
|
|
||
| Please report bugs via the | ||
| [github issue tracker](https://github.com/haskell-pkg-janitors/bloomfilter). | ||
| For more information about the library and examples of how to use it, see the | ||
| Haddock documentation of the different modules. | ||
|
|
||
| Master [git repository](https://github.com/haskell-pkg-janitors/bloomfilter): | ||
| # Usage notes | ||
|
|
||
| * `git clone git://github.com/haskell-pkg-janitors/bloomfilter.git` | ||
| User should take into account the following: | ||
|
|
||
| * This package is not supported on 32bit systems. | ||
|
|
||
| # Authors | ||
| # Differences from the `bloomfilter` package | ||
|
|
||
| This library is written by Bryan O'Sullivan, <[email protected]>. | ||
| The library is a full rewrite of the [`bloomfilter`][bloomfilter:hackage] | ||
| package, originally authored by Bryan O'Sullivan <[email protected]>. The main | ||
| differences are: | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. TODO: benchmarks to show it's faster! |
||
|
|
||
| * `bloomfilter-blocked` supports both classic and blocked bloom filters, whereas | ||
| `bloomfilter` only supports the former. | ||
| * `bloomfilter-blocked` supports bloom filters of arbitrary sizes, whereas | ||
| `bloomfilter` limits the sizes to powers of two. | ||
| * `bloomfilter-blocked` supports sizes up to `2^48` for classic bloom filters | ||
| and up to `2^41` for blocked bloom filters, instead of `2^32`. | ||
| * In `bloomfilter-blocked`, the `Bloom` and `MBloom` types are parameterised | ||
| over a `Hashable` type class, instead of having a `a -> [Hash]` typed field. | ||
| This separation in `bloomfilter-blocked` allows clean (de-)serialisation of | ||
| filters as the hashing scheme is static. | ||
| * `bloomfilter-blocked` uses [`XXH3`][xxh3] for hashing instead of [Jenkins' | ||
| `lookup3`][lookup3:wiki], which `bloomfilter` uses. | ||
| * The user can configure hash salts for improved security in | ||
| `bloomfilter-blocked`, whereas this is not supported in `bloomfilter`. | ||
|
|
||
| <!-- Sources --> | ||
|
|
||
| [bloom-filter:wiki]: https://en.wikipedia.org/wiki/Bloom_filter | ||
| [bloomfilter:hackage]: https://hackage.haskell.org/package/bloomfilter | ||
| [xxh3]: https://xxhash.com/ | ||
| [lookup3:wiki]: https://en.wikipedia.org/wiki/Jenkins_hash_function#lookup3 | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,5 +1,63 @@ | ||
| -- | By default, this module re-exports the classic bloom filter implementation | ||
| -- from "Data.BloomFilter.Classic". If you want to use the blocked bloom filter | ||
| -- implementation, import "Data.BloomFilter.Blocked". | ||
| module Data.BloomFilter ( | ||
| module Data.BloomFilter.Classic | ||
| -- * Example: a spelling checker | ||
| -- $example | ||
|
|
||
| -- * Differences with the @bloomfilter@ package | ||
| -- $differences | ||
| ) where | ||
|
|
||
| import Data.BloomFilter.Classic | ||
|
|
||
| -- $example | ||
| -- | ||
| -- This example reads a dictionary file containing one word per line, | ||
| -- constructs a Bloom filter with a 1% false positive rate, and | ||
| -- spellchecks its standard input. Like the Unix @spell@ command, it | ||
| -- prints each word that it does not recognize. | ||
| -- | ||
| -- >>> import Control.Monad (forM_) | ||
| -- >>> import System.Environment (getArgs) | ||
| -- >>> import qualified Data.BloomFilter as B | ||
| -- | ||
| -- >>> :{ | ||
| -- main :: IO () | ||
| -- main = do | ||
| -- files <- getArgs | ||
| -- dictionary <- readFile "/usr/share/dict/words" | ||
| -- let !bloom = B.fromList (B.policyForFPR 0.01) 4 (words dictionary) | ||
| -- forM_ files $ \file -> | ||
| -- putStrLn . unlines . filter (`B.notElem` bloom) . words | ||
| -- =<< readFile file | ||
| -- :} | ||
|
|
||
| -- $differences | ||
| -- | ||
| -- This package is an entirely rewritten fork of the | ||
| -- [bloomfilter](https://hackage.haskell.org/package/bloomfilter) package. | ||
| -- | ||
| -- The main differences are | ||
| -- | ||
| -- * Support for both classic and \"blocked\" Bloom filters. Blocked-structured | ||
| -- Bloom filters arrange all the bits for each insert or lookup into a single | ||
| -- cache line, which greatly reduces the number of slow uncached memory reads. | ||
| -- The trade-off for this performance optimisation is a slightly worse | ||
| -- trade-off between bits per element and the FPR. In practice for typical | ||
| -- FPRs of @1-e3@ up to @1e-4@, this requires a couple extra bits per element. | ||
| -- | ||
| -- * This package support Bloom filters of arbitrary sizes (not limited to powers | ||
| -- of two). | ||
| -- | ||
| -- * Sizes over @2^32@ are supported up to @2^48@ for classic Bloom filters and | ||
| -- @2^41@ for blocked Bloom filters. | ||
| -- | ||
| -- * The 'Bloom' and 'MBloom' types are parametrised over a 'Hashable' type | ||
| -- class, instead of having a @a -> ['Hash']@ typed field. | ||
| -- This separation allows clean (de-)serialisation of Bloom filters in this | ||
| -- package, as the hashing scheme is static. | ||
| -- | ||
| -- * [@XXH3@ hash](https://xxhash.com/) is used instead of [Jenkins' | ||
| -- @lookup3@](https://en.wikipedia.org/wiki/Jenkins_hash_function#lookup3). |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -13,8 +13,7 @@ module Data.BloomFilter.Blocked.Calc ( | |
| policyForBits, | ||
| ) where | ||
|
|
||
| import Data.BloomFilter.Classic.Calc (BitsPerEntry, BloomPolicy (..), | ||
| BloomSize (..), FPR, NumEntries) | ||
| import Data.BloomFilter.Classic.Calc (BitsPerEntry, FPR, NumEntries) | ||
|
|
||
| {- | ||
| Calculating the relationship between bits and FPR for the blocked | ||
|
|
@@ -49,6 +48,32 @@ Fit { | |
|
|
||
| -} | ||
|
|
||
| -- | A policy on intended bloom filter size -- independent of the number of | ||
| -- elements. | ||
| -- | ||
| -- We can decide a policy based on: | ||
| -- | ||
| -- 1. a target false positive rate (FPR) using 'policyForFPR' | ||
| -- 2. a number of bits per entry using 'policyForBits' | ||
| -- | ||
| -- A policy can be turned into a 'BloomSize' given a target 'NumEntries' using | ||
| -- 'sizeForPolicy'. | ||
| -- | ||
| -- Either way we define the policy, we can inspect the result to see: | ||
| -- | ||
| -- 1. The bits per entry 'policyBits'. This will determine the | ||
| -- size of the bloom filter in bits. In general the bits per entry can be | ||
| -- fractional. The final bloom filter size in will be rounded to a whole | ||
| -- number of bits. | ||
| -- 2. The number of hashes 'policyHashes'. | ||
| -- 3. The expected FPR for the policy using 'policyFPR'. | ||
| -- | ||
| data BloomPolicy = BloomPolicy { | ||
| policyBits :: !Double, | ||
| policyHashes :: !Int | ||
| } | ||
| deriving stock Show | ||
|
Comment on lines
+71
to
+75
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmmm. If the issue is just the docs, perhaps we should pull these types out into a common module, which could be marked |
||
|
|
||
| policyForFPR :: FPR -> BloomPolicy | ||
| policyForFPR fpr | fpr <= 0 || fpr >= 1 = | ||
| error "bloomPolicyForFPR: fpr out of range (0,1)" | ||
|
|
@@ -103,6 +128,19 @@ policyFPR BloomPolicy { | |
| f1 = 0.5251544487138062 | ||
| f0 = -0.10110451821280719 | ||
|
|
||
| -- | Parameters for constructing a Bloom filter. | ||
| -- | ||
| data BloomSize = BloomSize { | ||
| -- | The requested number of bits in the filter. | ||
| -- | ||
| -- The actual size will be rounded up to the nearest 512. | ||
| sizeBits :: !Int, | ||
|
|
||
| -- | The number of hash functions to use. | ||
| sizeHashes :: !Int | ||
| } | ||
| deriving stock Show | ||
|
|
||
| sizeForFPR :: FPR -> NumEntries -> BloomSize | ||
| sizeForFPR = sizeForPolicy . policyForFPR | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
multiple/two
what does efficient refer to as distinct from fast? Perhaps space-efficient?