Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
78 changes: 62 additions & 16 deletions bloomfilter-blocked/README.md
Original file line number Diff line number Diff line change
@@ -1,29 +1,75 @@
# A fast, space efficient Bloom filter implementation
# bloomfilter-blocked

Copyright 2008, 2009, 2010, 2011 Bryan O'Sullivan <[email protected]>.
`bloomfilter-blocked` is a Haskell library providing multiple fast and efficient
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multiple/two

what does efficient refer to as distinct from fast? Perhaps space-efficient?

implementations of [bloom filters][bloom-filter:wiki]. It is a full rewrite of
the [`bloomfilter`][bloomfilter:hackage] package, originally authored by Bryan
O'Sullivan <[email protected]>.

This package provides both mutable and immutable Bloom filter data
types, along with a family of hash function and an easy-to-use
interface.
A bloom filter is a space-efficient data structure representing a set that can
be probablistically queried for set membership. The set membership query returns
no false negatives, but it might return false positives. That is, if an element
was added to a bloom filter, then a subsequent query definitely returns `True`.
If an element was *not* added to a filter, then a subsequent query may still
return `True` if `False` would be the correct answer. The probabiliy of false
positives -- the false positive rate (FPR) -- is configurable, as we will
describe later.

To build:
The library includes two implementations of bloom filters: classic, and blocked.

cabal install bloomfilter
* **Classic** bloom filters, found in the `Data.BloomFilter.Classic` module: a
default implementation that is faithful to the canonical description of a
bloom filter data structure.

For examples of usage, see the Haddock documentation and the files in
the examples directory.
* **Blocked** floom filters, found in the `Data.BloomFilter.Blocked` module: an
implementation that optimises the memory layout of a classic bloom filter for
speed (cheaper CPU cache reads), at the cost of a slightly higher FPR for the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cheaper/fewer

same amount of assigned memory.

The FPR scales inversely with how much memory is assigned to the filter. It also
scales inversely with how many elements are added to the set. The user can
configure how much memory is asisgned to a filter, and the user also controls
how many elements are added to a set. Each implementation comes with helper
functions, like `sizeForFPR` and `sizeForBits`, that the user can leverage to
configure filters.

# Get involved!
Both immutable (`Bloom`) and mutable (`MBloom`) bloom filters, including
functions to convert between the two, are provided for each implementation. Note
however that a (mutable) bloom filter can not be resized once created, and that
elements can not be deleted once inserted.

Please report bugs via the
[github issue tracker](https://github.com/haskell-pkg-janitors/bloomfilter).
For more information about the library and examples of how to use it, see the
Haddock documentation of the different modules.

Master [git repository](https://github.com/haskell-pkg-janitors/bloomfilter):
# Usage notes

* `git clone git://github.com/haskell-pkg-janitors/bloomfilter.git`
User should take into account the following:

* This package is not supported on 32bit systems.

# Authors
# Differences from the `bloomfilter` package

This library is written by Bryan O'Sullivan, <[email protected]>.
The library is a full rewrite of the [`bloomfilter`][bloomfilter:hackage]
package, originally authored by Bryan O'Sullivan <[email protected]>. The main
differences are:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: benchmarks to show it's faster!


* `bloomfilter-blocked` supports both classic and blocked bloom filters, whereas
`bloomfilter` only supports the former.
* `bloomfilter-blocked` supports bloom filters of arbitrary sizes, whereas
`bloomfilter` limits the sizes to powers of two.
* `bloomfilter-blocked` supports sizes up to `2^48` for classic bloom filters
and up to `2^41` for blocked bloom filters, instead of `2^32`.
* In `bloomfilter-blocked`, the `Bloom` and `MBloom` types are parameterised
over a `Hashable` type class, instead of having a `a -> [Hash]` typed field.
This separation in `bloomfilter-blocked` allows clean (de-)serialisation of
filters as the hashing scheme is static.
* `bloomfilter-blocked` uses [`XXH3`][xxh3] for hashing instead of [Jenkins'
`lookup3`][lookup3:wiki], which `bloomfilter` uses.
* The user can configure hash salts for improved security in
`bloomfilter-blocked`, whereas this is not supported in `bloomfilter`.

<!-- Sources -->

[bloom-filter:wiki]: https://en.wikipedia.org/wiki/Bloom_filter
[bloomfilter:hackage]: https://hackage.haskell.org/package/bloomfilter
[xxh3]: https://xxhash.com/
[lookup3:wiki]: https://en.wikipedia.org/wiki/Jenkins_hash_function#lookup3
58 changes: 58 additions & 0 deletions bloomfilter-blocked/src/Data/BloomFilter.hs
Original file line number Diff line number Diff line change
@@ -1,5 +1,63 @@
-- | By default, this module re-exports the classic bloom filter implementation
-- from "Data.BloomFilter.Classic". If you want to use the blocked bloom filter
-- implementation, import "Data.BloomFilter.Blocked".
module Data.BloomFilter (
module Data.BloomFilter.Classic
-- * Example: a spelling checker
-- $example

-- * Differences with the @bloomfilter@ package
-- $differences
) where

import Data.BloomFilter.Classic

-- $example
--
-- This example reads a dictionary file containing one word per line,
-- constructs a Bloom filter with a 1% false positive rate, and
-- spellchecks its standard input. Like the Unix @spell@ command, it
-- prints each word that it does not recognize.
--
-- >>> import Control.Monad (forM_)
-- >>> import System.Environment (getArgs)
-- >>> import qualified Data.BloomFilter as B
--
-- >>> :{
-- main :: IO ()
-- main = do
-- files <- getArgs
-- dictionary <- readFile "/usr/share/dict/words"
-- let !bloom = B.fromList (B.policyForFPR 0.01) 4 (words dictionary)
-- forM_ files $ \file ->
-- putStrLn . unlines . filter (`B.notElem` bloom) . words
-- =<< readFile file
-- :}

-- $differences
--
-- This package is an entirely rewritten fork of the
-- [bloomfilter](https://hackage.haskell.org/package/bloomfilter) package.
--
-- The main differences are
--
-- * Support for both classic and \"blocked\" Bloom filters. Blocked-structured
-- Bloom filters arrange all the bits for each insert or lookup into a single
-- cache line, which greatly reduces the number of slow uncached memory reads.
-- The trade-off for this performance optimisation is a slightly worse
-- trade-off between bits per element and the FPR. In practice for typical
-- FPRs of @1-e3@ up to @1e-4@, this requires a couple extra bits per element.
--
-- * This package support Bloom filters of arbitrary sizes (not limited to powers
-- of two).
--
-- * Sizes over @2^32@ are supported up to @2^48@ for classic Bloom filters and
-- @2^41@ for blocked Bloom filters.
--
-- * The 'Bloom' and 'MBloom' types are parametrised over a 'Hashable' type
-- class, instead of having a @a -> ['Hash']@ typed field.
-- This separation allows clean (de-)serialisation of Bloom filters in this
-- package, as the hashing scheme is static.
--
-- * [@XXH3@ hash](https://xxhash.com/) is used instead of [Jenkins'
-- @lookup3@](https://en.wikipedia.org/wiki/Jenkins_hash_function#lookup3).
89 changes: 67 additions & 22 deletions bloomfilter-blocked/src/Data/BloomFilter/Blocked.hs
Original file line number Diff line number Diff line change
@@ -1,19 +1,17 @@
-- |
-- | A fast, space efficient Bloom filter implementation. A Bloom filter is a
-- set-like data structure that provides a probabilistic membership test.
--
-- A fast, space efficient Bloom filter implementation. A Bloom
-- filter is a set-like data structure that provides a probabilistic
-- membership test.
-- * Queries do not give false negatives. When an element is added to a filter,
-- a subsequent membership test will definitely return 'True'.
--
-- * Queries do not give false negatives. When an element is added to
-- a filter, a subsequent membership test will definitely return
-- 'True'.
-- * False positives /are/ possible. If an element has not been added to a
-- filter, a membership test /may/ nevertheless indicate that the element is
-- present.
--
-- * False positives /are/ possible. If an element has not been added
-- to a filter, a membership test /may/ nevertheless indicate that
-- the element is present.
--

module Data.BloomFilter.Blocked (
-- * Overview
-- $overview

-- * Types
Hash,
Salt,
Expand Down Expand Up @@ -57,6 +55,7 @@ module Data.BloomFilter.Blocked (
maxSizeBits,
insert,
insertMany,
read,

-- ** Conversion
freeze,
Expand All @@ -68,6 +67,7 @@ module Data.BloomFilter.Blocked (
hashesWithSalt,
insertHashes,
elemHashes,
readHashes,
-- ** Prefetching
prefetchInsert,
prefetchElem,
Expand All @@ -80,23 +80,60 @@ import Data.Bits ((.&.))
import Data.Primitive.ByteArray (MutableByteArray)
import qualified Data.Primitive.PrimArray as P

import Data.BloomFilter.Blocked.Calc
import Data.BloomFilter.Blocked.Calc (BitsPerEntry, BloomPolicy (..),
BloomSize (..), FPR, NumEntries, policyFPR, policyForBits,
policyForFPR, sizeForBits, sizeForFPR, sizeForPolicy)
import Data.BloomFilter.Blocked.Internal hiding (deserialise)
import qualified Data.BloomFilter.Blocked.Internal as Internal
import Data.BloomFilter.Hash

import Prelude hiding (elem, notElem)
import Prelude hiding (elem, notElem, read)

-- $setup
--
-- >>> import Text.Printf

-- $overview
--
-- Each of the functions for creating Bloom filters accepts a 'BloomSize'. The
-- size determines the number of bits that should be used for the filter. Note
-- that a filter is fixed in size; it cannot be resized after creation.
--
-- The size can be specified by asking for a target false positive rate (FPR)
-- or a number of bits per element, and the number of elements in the filter.
-- For example:
--
-- * @'sizeForFPR' 1e-3 10_000@ for a Bloom filter sized for 10,000 elements
-- with a false positive rate of 1 in 1000
--
-- * @'sizeForBits' 10 10_000@ for a Bloom filter sized for 10,000 elements
-- with 10 bits per element
--
-- Depending on the application it may be more important to target a fixed
-- amount of memory to use, or target a specific FPR.
--
-- As a very rough guide for filter sizes, here are a range of FPRs and bits
-- per element:
--
-- * FPR of 1e-1 requires approximately 4.8 bits per element
-- * FPR of 1e-2 requires approximately 9.8 bits per element
-- * FPR of 1e-3 requires approximately 15.8 bits per element
-- * FPR of 1e-4 requires approximately 22.6 bits per element
-- * FPR of 1e-5 requires approximately 30.2 bits per element
--
-- >>> fmap (printf "%0.1f" . policyBits . policyForFPR) [1e-1, 1e-2, 1e-3, 1e-4, 1e-5] :: [String]
-- ["4.8","9.8","15.8","22.6","30.2"]

-- | Create an immutable Bloom filter, using the given setup function
-- which executes in the 'ST' monad.
--
-- Example:
--
-- @
-- >>> :{
-- filter = create (sizeForBits 16 2) 4 $ \mf -> do
-- insert mf \"foo\"
-- insert mf \"bar\"
-- @
-- insert mf "foo"
-- insert mf "bar"
-- :}
--
-- Note that the result of the setup function is not used.
create :: BloomSize
Expand Down Expand Up @@ -141,6 +178,12 @@ elem = \ !x !b -> elemHashes b (hashesWithSalt (hashSalt b) x)
notElem :: Hashable a => a -> Bloom a -> Bool
notElem = \x b -> not (x `elem` b)

-- | Query a mutable Bloom filter for membership. If the value is
-- present, return @True@. If the value is not present, there is
-- /still/ some possibility that @True@ will be returned.
read :: Hashable a => MBloom s a -> a -> ST s Bool
read !mb !x = readHashes mb (hashesWithSalt (mbHashSalt mb) x)

-- | Build an immutable Bloom filter from a seed value. The seeding
-- function populates the filter as follows.
--
Expand Down Expand Up @@ -168,6 +211,7 @@ unfold bloomsize bloomsalt f k =
Nothing -> pure ()
Just (a, j') -> insert mb a >> loop j'

{-# INLINEABLE fromList #-}
-- | Create a Bloom filter, populating it from a sequence of values.
--
-- For example
Expand All @@ -185,10 +229,11 @@ fromList policy bloomsalt xs =
where
bsize = sizeForPolicy policy (length xs)

{-# SPECIALISE deserialise :: BloomSize
-> Salt
-> (MutableByteArray RealWorld -> Int -> Int -> IO ())
-> IO (Bloom a) #-}
{-# SPECIALISE deserialise ::
BloomSize
-> Salt
-> (MutableByteArray RealWorld -> Int -> Int -> IO ())
-> IO (Bloom a) #-}
deserialise :: PrimMonad m
=> BloomSize
-> Salt
Expand Down
12 changes: 12 additions & 0 deletions bloomfilter-blocked/src/Data/BloomFilter/Blocked/BitArray.hs
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ module Data.BloomFilter.Blocked.BitArray (
new,
unsafeSet,
prefetchSet,
unsafeRead,
freeze,
unsafeFreeze,
thaw,
Expand Down Expand Up @@ -155,6 +156,17 @@ prefetchSet (MBitArray (MutablePrimArray mba#)) (BlockIx blockIx) = do
ST (\s -> case prefetchMutableByteArray0# mba# i# s of
s' -> (# s', () #))

unsafeRead :: MBitArray s -> BlockIx -> BitIx -> ST s Bool
unsafeRead (MBitArray arr) blockIx blockBitIx = do
#ifdef NO_IGNORE_ASSERTS
sz <- getSizeofMutablePrimArray arr
assert (wordIx >= 0 && wordIx < sz) $ pure ()
#endif
w <- readPrimArray arr wordIx
pure $ unsafeTestBit w wordBitIx
where
(wordIx, wordBitIx) = wordAndBitIndex blockIx blockBitIx

freeze :: MBitArray s -> ST s BitArray
freeze (MBitArray arr) = do
len <- getSizeofMutablePrimArray arr
Expand Down
42 changes: 40 additions & 2 deletions bloomfilter-blocked/src/Data/BloomFilter/Blocked/Calc.hs
Original file line number Diff line number Diff line change
Expand Up @@ -13,8 +13,7 @@ module Data.BloomFilter.Blocked.Calc (
policyForBits,
) where

import Data.BloomFilter.Classic.Calc (BitsPerEntry, BloomPolicy (..),
BloomSize (..), FPR, NumEntries)
import Data.BloomFilter.Classic.Calc (BitsPerEntry, FPR, NumEntries)

{-
Calculating the relationship between bits and FPR for the blocked
Expand Down Expand Up @@ -49,6 +48,32 @@ Fit {

-}

-- | A policy on intended bloom filter size -- independent of the number of
-- elements.
--
-- We can decide a policy based on:
--
-- 1. a target false positive rate (FPR) using 'policyForFPR'
-- 2. a number of bits per entry using 'policyForBits'
--
-- A policy can be turned into a 'BloomSize' given a target 'NumEntries' using
-- 'sizeForPolicy'.
--
-- Either way we define the policy, we can inspect the result to see:
--
-- 1. The bits per entry 'policyBits'. This will determine the
-- size of the bloom filter in bits. In general the bits per entry can be
-- fractional. The final bloom filter size in will be rounded to a whole
-- number of bits.
-- 2. The number of hashes 'policyHashes'.
-- 3. The expected FPR for the policy using 'policyFPR'.
--
data BloomPolicy = BloomPolicy {
policyBits :: !Double,
policyHashes :: !Int
}
deriving stock Show
Comment on lines +71 to +75
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* Add `BloomPolicy` and `BloomSize` types for the blocked bloom filter instead
  of reusing the types of the same name from the classic bloom filter. This adds
  a bit of boilerplate, but it makes the documentation clearer because the
  hyperlinks were pointing from the blocked modules to the classic modules
  before.

Hmmm.

If the issue is just the docs, perhaps we should pull these types out into a common module, which could be marked not-home so they get documented as if they live in each module.


policyForFPR :: FPR -> BloomPolicy
policyForFPR fpr | fpr <= 0 || fpr >= 1 =
error "bloomPolicyForFPR: fpr out of range (0,1)"
Expand Down Expand Up @@ -103,6 +128,19 @@ policyFPR BloomPolicy {
f1 = 0.5251544487138062
f0 = -0.10110451821280719

-- | Parameters for constructing a Bloom filter.
--
data BloomSize = BloomSize {
-- | The requested number of bits in the filter.
--
-- The actual size will be rounded up to the nearest 512.
sizeBits :: !Int,

-- | The number of hash functions to use.
sizeHashes :: !Int
}
deriving stock Show

sizeForFPR :: FPR -> NumEntries -> BloomSize
sizeForFPR = sizeForPolicy . policyForFPR

Expand Down
Loading