IntersectMBO · dcoutts · Jul 7, 2025 · Jul 1, 2025 · Jun 30, 2025 · Jun 30, 2025
@@ -1,29 +1,75 @@
-# A fast, space efficient Bloom filter implementation
+# bloomfilter-blocked
 
-Copyright 2008, 2009, 2010, 2011 Bryan O'Sullivan <[email protected]>.
+`bloomfilter-blocked` is a Haskell library providing multiple fast and efficient
+implementations of [bloom filters][bloom-filter:wiki]. It is a full rewrite of
+the [`bloomfilter`][bloomfilter:hackage] package, originally authored by Bryan
+O'Sullivan <[email protected]>.
 
-This package provides both mutable and immutable Bloom filter data
-types, along with a family of hash function and an easy-to-use
-interface.
+A bloom filter is a space-efficient data structure representing a set that can
+be probablistically queried for set membership. The set membership query returns
+no false negatives, but it might return false positives. That is, if an element
+was added to a bloom filter, then a subsequent query definitely returns `True`.
+If an element was *not* added to a filter, then a subsequent query may still
+return `True` if `False` would be the correct answer. The probabiliy of false
+positives -- the false positive rate (FPR) -- is configurable, as we will
+describe later.
 
-To build:
+The library includes two implementations of bloom filters: classic, and blocked.
 
-    cabal install bloomfilter
+* **Classic** bloom filters, found in the `Data.BloomFilter.Classic` module: a
+  default implementation that is faithful to the canonical description of a
+  bloom filter data structure.
 
-For examples of usage, see the Haddock documentation and the files in
-the examples directory.
+* **Blocked** floom filters, found in the `Data.BloomFilter.Blocked` module: an
+  implementation that optimises the memory layout of a classic bloom filter for
+  speed (cheaper CPU cache reads), at the cost of a slightly higher FPR for the
+  same amount of assigned memory.
 
+The FPR scales inversely with how much memory is assigned to the filter. It also
+scales inversely with how many elements are added to the set. The user can
+configure how much memory is asisgned to a filter, and the user also controls
+how many elements are added to a set. Each implementation comes with helper
+functions, like `sizeForFPR` and `sizeForBits`, that the user can leverage to
+configure filters.
 
-# Get involved!
+Both immutable (`Bloom`) and mutable (`MBloom`) bloom filters, including
+functions to convert between the two, are provided for each implementation. Note
+however that a (mutable) bloom filter can not be resized once created, and that
+elements can not be deleted once inserted.
 
-Please report bugs via the
-[github issue tracker](https://github.com/haskell-pkg-janitors/bloomfilter).
+For more information about the library and examples of how to use it, see the
+Haddock documentation of the different modules.
 
-Master [git repository](https://github.com/haskell-pkg-janitors/bloomfilter):
+# Usage notes
 
-* `git clone git://github.com/haskell-pkg-janitors/bloomfilter.git`
+User should take into account the following:
 
+* This package is not supported on 32bit systems.
 
-# Authors
+# Differences from the `bloomfilter` package
 
-This library is written by Bryan O'Sullivan, <[email protected]>.
+The library is a full rewrite of the [`bloomfilter`][bloomfilter:hackage]
+package, originally authored by Bryan O'Sullivan <[email protected]>. The main
+differences are:
+
+* `bloomfilter-blocked` supports both classic and blocked bloom filters, whereas
+  `bloomfilter` only supports the former.
+* `bloomfilter-blocked` supports bloom filters of arbitrary sizes, whereas
+  `bloomfilter` limits the sizes to powers of two.
+* `bloomfilter-blocked` supports sizes up to `2^48` for classic bloom filters
+  and up to `2^41` for blocked bloom filters, instead of `2^32`.
+* In `bloomfilter-blocked`, the `Bloom` and `MBloom` types are parameterised
+  over a `Hashable` type class, instead of having a `a -> [Hash]` typed field.
+  This separation in `bloomfilter-blocked` allows clean (de-)serialisation of
+  filters as the hashing scheme is static.
+* `bloomfilter-blocked` uses [`XXH3`][xxh3] for hashing instead of [Jenkins'
+  `lookup3`][lookup3:wiki], which `bloomfilter` uses.
+* The user can configure hash salts for improved security in
+  `bloomfilter-blocked`, whereas this is not supported in `bloomfilter`.
+
+<!-- Sources -->
+
+[bloom-filter:wiki]: https://en.wikipedia.org/wiki/Bloom_filter
+[bloomfilter:hackage]: https://hackage.haskell.org/package/bloomfilter
+[xxh3]: https://xxhash.com/
+[lookup3:wiki]: https://en.wikipedia.org/wiki/Jenkins_hash_function#lookup3
@@ -1,5 +1,63 @@
+-- | By default, this module re-exports the classic bloom filter implementation
+-- from "Data.BloomFilter.Classic". If you want to use the blocked bloom filter
+-- implementation, import "Data.BloomFilter.Blocked".
 module Data.BloomFilter (
     module Data.BloomFilter.Classic
+    -- * Example: a spelling checker
+    -- $example
+
+    -- * Differences with the @bloomfilter@ package
+    -- $differences
   ) where
 
 import           Data.BloomFilter.Classic
+
+-- $example
+--
+-- This example reads a dictionary file containing one word per line,
+-- constructs a Bloom filter with a 1% false positive rate, and
+-- spellchecks its standard input.  Like the Unix @spell@ command, it
+-- prints each word that it does not recognize.
+--
+-- >>> import           Control.Monad (forM_)
+-- >>> import           System.Environment (getArgs)
+-- >>> import qualified Data.BloomFilter as B
+--
+-- >>> :{
+-- main :: IO ()
+-- main = do
+--     files <- getArgs
+--     dictionary <- readFile "/usr/share/dict/words"
+--     let !bloom = B.fromList (B.policyForFPR 0.01) 4 (words dictionary)
+--     forM_ files $ \file ->
+--           putStrLn . unlines . filter (`B.notElem` bloom) . words
+--       =<< readFile file
+-- :}
+
+-- $differences
+--
+-- This package is an entirely rewritten fork of the
+-- [bloomfilter](https://hackage.haskell.org/package/bloomfilter) package.
+--
+-- The main differences are
+--
+-- * Support for both classic and \"blocked\" Bloom filters. Blocked-structured
+--   Bloom filters arrange all the bits for each insert or lookup into a single
+--   cache line, which greatly reduces the number of slow uncached memory reads.
+--   The trade-off for this performance optimisation is a slightly worse
+--   trade-off between bits per element and the FPR. In practice for typical
+--   FPRs of @1-e3@ up to @1e-4@, this requires a couple extra bits per element.
+--
+-- * This package support Bloom filters of arbitrary sizes (not limited to powers
+--   of two).
+--
+-- * Sizes over @2^32@ are supported up to @2^48@ for classic Bloom filters and
+--   @2^41@ for blocked Bloom filters.
+--
+-- * The 'Bloom' and 'MBloom' types are parametrised over a 'Hashable' type
+--   class, instead of having a @a -> ['Hash']@ typed field.
+--   This separation allows clean (de-)serialisation of Bloom filters in this
+--   package, as the hashing scheme is static.
+--
+-- * [@XXH3@ hash](https://xxhash.com/) is used instead of [Jenkins'
+--   @lookup3@](https://en.wikipedia.org/wiki/Jenkins_hash_function#lookup3).
@@ -1,19 +1,17 @@
--- |
+-- | A fast, space efficient Bloom filter implementation.  A Bloom filter is a
+-- set-like data structure that provides a probabilistic membership test.
 --
--- A fast, space efficient Bloom filter implementation.  A Bloom
--- filter is a set-like data structure that provides a probabilistic
--- membership test.
+-- * Queries do not give false negatives. When an element is added to a filter,
+--   a subsequent membership test will definitely return 'True'.
 --
--- * Queries do not give false negatives.  When an element is added to
---   a filter, a subsequent membership test will definitely return
---   'True'.
+-- * False positives /are/ possible. If an element has not been added to a
+--   filter, a membership test /may/ nevertheless indicate that the element is
+--   present.
 --
--- * False positives /are/ possible.  If an element has not been added
---   to a filter, a membership test /may/ nevertheless indicate that
---   the element is present.
---
-
 module Data.BloomFilter.Blocked (
+    -- * Overview
+    -- $overview
+
     -- * Types
     Hash,
     Salt,
@@ -57,6 +55,7 @@ module Data.BloomFilter.Blocked (
     maxSizeBits,
     insert,
     insertMany,
+    read,
 
     -- ** Conversion
     freeze,
@@ -68,6 +67,7 @@ module Data.BloomFilter.Blocked (
     hashesWithSalt,
     insertHashes,
     elemHashes,
+    readHashes,
     -- ** Prefetching
     prefetchInsert,
     prefetchElem,
@@ -80,23 +80,60 @@ import           Data.Bits ((.&.))
 import           Data.Primitive.ByteArray (MutableByteArray)
 import qualified Data.Primitive.PrimArray as P
 
-import           Data.BloomFilter.Blocked.Calc
+import           Data.BloomFilter.Blocked.Calc (BitsPerEntry, BloomPolicy (..),
+                     BloomSize (..), FPR, NumEntries, policyFPR, policyForBits,
+                     policyForFPR, sizeForBits, sizeForFPR, sizeForPolicy)
 import           Data.BloomFilter.Blocked.Internal hiding (deserialise)
 import qualified Data.BloomFilter.Blocked.Internal as Internal
 import           Data.BloomFilter.Hash
 
-import           Prelude hiding (elem, notElem)
+import           Prelude hiding (elem, notElem, read)
+
+-- $setup
+--
+-- >>> import Text.Printf
+
+-- $overview
+--
+-- Each of the functions for creating Bloom filters accepts a 'BloomSize'. The
+-- size determines the number of bits that should be used for the filter. Note
+-- that a filter is fixed in size; it cannot be resized after creation.
+--
+-- The size can be specified by asking for a target false positive rate (FPR)
+-- or a number of bits per element, and the number of elements in the filter.
+-- For example:
+--
+-- * @'sizeForFPR' 1e-3 10_000@ for a Bloom filter sized for 10,000 elements
+--   with a false positive rate of 1 in 1000
+--
+-- * @'sizeForBits' 10 10_000@ for a Bloom filter sized for 10,000 elements
+--   with 10 bits per element
+--
+-- Depending on the application it may be more important to target a fixed
+-- amount of memory to use, or target a specific FPR.
+--
+-- As a very rough guide for filter sizes, here are a range of FPRs and bits
+-- per element:
+--
+-- * FPR of 1e-1 requires approximately 4.8 bits per element
+-- * FPR of 1e-2 requires approximately 9.8 bits per element
+-- * FPR of 1e-3 requires approximately 15.8 bits per element
+-- * FPR of 1e-4 requires approximately 22.6 bits per element
+-- * FPR of 1e-5 requires approximately 30.2 bits per element
+--
+-- >>> fmap (printf "%0.1f" . policyBits . policyForFPR) [1e-1, 1e-2, 1e-3, 1e-4, 1e-5] :: [String]
+-- ["4.8","9.8","15.8","22.6","30.2"]
 
 -- | Create an immutable Bloom filter, using the given setup function
 -- which executes in the 'ST' monad.
 --
 -- Example:
 --
--- @
+-- >>> :{
 -- filter = create (sizeForBits 16 2) 4 $ \mf -> do
---           insert mf \"foo\"
---           insert mf \"bar\"
--- @
+--  insert mf "foo"
+--  insert mf "bar"
+-- :}
 --
 -- Note that the result of the setup function is not used.
 create :: BloomSize
@@ -141,6 +178,12 @@ elem = \ !x !b -> elemHashes b (hashesWithSalt (hashSalt b) x)
 notElem :: Hashable a => a -> Bloom a -> Bool
 notElem = \x b -> not (x `elem` b)
 
+-- | Query a mutable Bloom filter for membership.  If the value is
+-- present, return @True@.  If the value is not present, there is
+-- /still/ some possibility that @True@ will be returned.
+read :: Hashable a => MBloom s a -> a -> ST s Bool
+read !mb !x = readHashes mb (hashesWithSalt (mbHashSalt mb) x)
+
 -- | Build an immutable Bloom filter from a seed value.  The seeding
 -- function populates the filter as follows.
 --
@@ -168,6 +211,7 @@ unfold bloomsize bloomsalt f k =
                     Nothing      -> pure ()
                     Just (a, j') -> insert mb a >> loop j'
 
+{-# INLINEABLE fromList #-}
 -- | Create a Bloom filter, populating it from a sequence of values.
 --
 -- For example
@@ -185,10 +229,11 @@ fromList policy bloomsalt xs =
   where
     bsize = sizeForPolicy policy (length xs)
 
-{-# SPECIALISE deserialise :: BloomSize
-                           -> Salt
-                           -> (MutableByteArray RealWorld -> Int -> Int -> IO ())
-                           -> IO (Bloom a) #-}
+{-# SPECIALISE deserialise ::
+     BloomSize
+  -> Salt
+  -> (MutableByteArray RealWorld -> Int -> Int -> IO ())
+  -> IO (Bloom a) #-}
 deserialise :: PrimMonad m
             => BloomSize
             -> Salt

@@ -18,6 +18,7 @@ module Data.BloomFilter.Blocked.BitArray (
     new,
     unsafeSet,
     prefetchSet,
+    unsafeRead,
     freeze,
     unsafeFreeze,
     thaw,
@@ -155,6 +156,17 @@ prefetchSet (MBitArray (MutablePrimArray mba#)) (BlockIx blockIx) = do
     ST (\s -> case prefetchMutableByteArray0# mba# i# s of
                 s' -> (# s', () #))
 
+unsafeRead :: MBitArray s -> BlockIx -> BitIx -> ST s Bool
+unsafeRead (MBitArray arr) blockIx blockBitIx = do
+#ifdef NO_IGNORE_ASSERTS
+    sz <- getSizeofMutablePrimArray arr
+    assert (wordIx >= 0 && wordIx < sz) $ pure ()
+#endif
+    w <- readPrimArray arr wordIx
+    pure $ unsafeTestBit w wordBitIx
+  where
+    (wordIx, wordBitIx) = wordAndBitIndex blockIx blockBitIx
+
 freeze :: MBitArray s -> ST s BitArray
 freeze (MBitArray arr) = do
     len <- getSizeofMutablePrimArray arr

@@ -13,8 +13,7 @@ module Data.BloomFilter.Blocked.Calc (
     policyForBits,
 ) where
 
-import           Data.BloomFilter.Classic.Calc (BitsPerEntry, BloomPolicy (..),
-                     BloomSize (..), FPR, NumEntries)
+import           Data.BloomFilter.Classic.Calc (BitsPerEntry, FPR, NumEntries)
 
 {-
 Calculating the relationship between bits and FPR for the blocked
@@ -49,6 +48,32 @@ Fit {
 
 -}
 
+-- | A policy on intended bloom filter size -- independent of the number of
+-- elements.
+--
+-- We can decide a policy based on:
+--
+-- 1. a target false positive rate (FPR) using 'policyForFPR'
+-- 2. a number of bits per entry using 'policyForBits'
+--
+-- A policy can be turned into a 'BloomSize' given a target 'NumEntries' using
+-- 'sizeForPolicy'.
+--
+-- Either way we define the policy, we can inspect the result to see:
+--
+-- 1. The bits per entry 'policyBits'. This will determine the
+--    size of the bloom filter in bits. In general the bits per entry can be
+--    fractional. The final bloom filter size in will be rounded to a whole
+--    number of bits.
+-- 2. The number of hashes 'policyHashes'.
+-- 3. The expected FPR for the policy using 'policyFPR'.
+--
+data BloomPolicy = BloomPolicy {
+       policyBits   :: !Double,
+       policyHashes :: !Int
+     }
+  deriving stock Show
+
 policyForFPR :: FPR -> BloomPolicy
 policyForFPR fpr | fpr <= 0 || fpr >= 1 =
     error "bloomPolicyForFPR: fpr out of range (0,1)"
@@ -103,6 +128,19 @@ policyFPR BloomPolicy {
     f1 =  0.5251544487138062
     f0 = -0.10110451821280719
 
+-- | Parameters for constructing a Bloom filter.
+--
+data BloomSize = BloomSize {
+                   -- | The requested number of bits in the filter.
+                   --
+                   -- The actual size will be rounded up to the nearest 512.
+                   sizeBits   :: !Int,
+
+                   -- | The number of hash functions to use.
+                   sizeHashes :: !Int
+                 }
+  deriving stock Show
+
 sizeForFPR :: FPR -> NumEntries -> BloomSize
 sizeForFPR = sizeForPolicy . policyForFPR