Commit cc23a91: Update bloom-filter.md
1 parent c7bc0d1

1 file changed: content/develop/data-types/probabilistic/bloom-filter.md (11 additions, 15 deletions)

@@ -111,11 +111,11 @@ BF.RESERVE {key} {error_rate} {capacity} [EXPANSION expansion] [NONSCALING]
 The rate is a decimal value between 0 and 1. For example, for a desired false positive rate of 0.1% (1 in 1000), error_rate should be set to 0.001.

 #### 2. Expected capacity (`capacity`)
-This is the number of elements you expect having in your filter in total and is trivial when you have a static set but it becomes more challenging when your set grows over time. It's important to get the number right because if you **oversize** - you'll end up wasting memory. If you **undersize**, the filter will fill up and a new one will have to be stacked on top of it (sub-filter stacking). In the cases when a filter consists of multiple sub-filters stacked on top of each other latency for adds stays the same, but the latency for presence checks increases. The reason for this is the way the checks work: a regular check would first be performed on the top (latest) filter and if a negative answer is returned the next one is checked and so on. That's where the added latency comes from.
+This is the total number of items you expect to have in your filter. It is easy to estimate for a static set, but becomes more challenging when your set grows over time. It's important to get this number right: if you **oversize**, you'll end up wasting memory; if you **undersize**, the filter will fill up and a new one will have to be stacked on top of it (sub-filter stacking). When a filter consists of multiple stacked sub-filters, the latency of adds stays the same, but the latency of presence checks increases. The reason is the way checks work: a check is first performed on the top (latest) sub-filter, and if it returns a negative answer the next one is checked, and so on. That's where the added latency comes from.

 #### 3. Scaling (`EXPANSION`)
-Adding an element to a Bloom filter never fails due to the data structure "filling up". Instead the error rate starts to grow. To keep the error close to the one set on filter initialisation - the Bloom filter will auto-scale, meaning when capacity is reached an additional sub-filter will be created.
-The size of the new sub-filter is the size of the last sub-filter multiplied by `EXPANSION`. If the number of elements to be stored in the filter is unknown, we recommend that you use an expansion of 2 or more to reduce the number of sub-filters. Otherwise, we recommend that you use an expansion of 1 to reduce memory consumption. The default expansion value is 2.
+Adding an item to a Bloom filter never fails due to the data structure "filling up". Instead, the error rate starts to grow. To keep the error rate close to the one set at filter initialization, the Bloom filter will auto-scale, meaning that when capacity is reached, an additional sub-filter will be created.
+The size of the new sub-filter is the size of the last sub-filter multiplied by `EXPANSION`. If the number of items to be stored in the filter is unknown, we recommend that you use an expansion of 2 or more to reduce the number of sub-filters. Otherwise, we recommend that you use an expansion of 1 to reduce memory consumption. The default expansion value is 2.

 The filter will keep adding more hash functions for every new sub-filter in order to keep your desired error rate.

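To make the three settings in the hunk above concrete, here is a minimal sketch using redis-py's generic `execute_command` helper. The key names, capacity, and error rate are hypothetical, and it assumes a server with the Bloom filter commands available (for example, Redis Stack):

```
import redis

# Hypothetical connection details; adjust for your environment.
r = redis.Redis(host="localhost", port=6379)

# Reserve a filter with a 0.1% false-positive rate and room for 10,000 items.
# EXPANSION 2 doubles the size of each new sub-filter if the filter scales out.
r.execute_command("BF.RESERVE", "bf:crawled-urls", 0.001, 10000, "EXPANSION", 2)

# Add an item and check for its presence.
r.execute_command("BF.ADD", "bf:crawled-urls", "https://example.com/")
print(r.execute_command("BF.EXISTS", "bf:crawled-urls", "https://example.com/"))  # prints 1

# If the total number of items is known up front, NONSCALING avoids
# sub-filter stacking (and the extra check latency described above):
# r.execute_command("BF.RESERVE", "bf:static-set", 0.001, 10000, "NONSCALING")
```
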
@@ -127,26 +127,22 @@ If you know you're not going to scale use the `NONSCALING` flag because that way

 ### Total size of a Bloom filter
 The actual memory used by a Bloom filter is a function of the chosen error rate:
-```
-bits_per_item = -log(error)/ln(2)
-memory = capacity * bits_per_item
-
-memory = capacity * (-log(error)/ln(2))
-```

-- 1% error rate requires 10.08 bits per item
-- 0.1% error rate requires 14.4 bits per item
-- 0.01% error rate requires 20.16 bits per item
+The optimal number of hash functions is `ceil(-ln(error_rate) / ln(2))`.
+
+The required number of bits per item, given the desired `error_rate` and the optimal number of hash functions, is `-ln(error_rate) / ln(2)^2`. Hence, the required number of bits in the filter is `capacity * -ln(error_rate) / ln(2)^2`.
+
+* **1%** error rate requires 7 hash functions and 9.585 bits per item.
+* **0.1%** error rate requires 10 hash functions and 14.378 bits per item.
+* **0.01%** error rate requires 14 hash functions and 19.170 bits per item.

 Just as a comparison, when using a Redis set for membership testing the memory needed is:

 ```
 memory_with_sets = capacity*(192b + value)
 ```

-For a set of IP addresses, for example, we would have around 40 bytes (320 bits) per element, which is considerably higher than the 20 bits per element we need for a Bloom filter with 0.01% precision.
-
-
+For a set of IP addresses, for example, we would have around 40 bytes (320 bits) per element, which is considerably higher than the 19.170 bits per item we need for a Bloom filter with 0.01% precision.


 ## Bloom vs. Cuckoo filters

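As a quick check on the numbers above, the following short Python sketch (an editor's illustration, not part of the original page) evaluates the formulas from the new text for a hypothetical capacity of 1,000,000 items and compares the result with the Redis set estimate of roughly 320 bits per IP address:

```
import math

def bloom_parameters(error_rate, capacity):
    """Optimal hash-function count, bits per item, and total bits for one sub-filter."""
    hash_functions = math.ceil(-math.log(error_rate) / math.log(2))
    bits_per_item = -math.log(error_rate) / math.log(2) ** 2
    return hash_functions, bits_per_item, capacity * bits_per_item

capacity = 1_000_000  # hypothetical number of items
for error_rate in (0.01, 0.001, 0.0001):
    k, bits, total_bits = bloom_parameters(error_rate, capacity)
    print(f"{error_rate:.2%} error: {k} hash functions, {bits:.3f} bits/item, "
          f"{total_bits / 8 / 2**20:.2f} MiB total")

# memory_with_sets = capacity * (192 bits + value); ~320 bits in total per IP address.
set_bits = capacity * 320
print(f"Redis set of {capacity:,} IP addresses: ~{set_bits / 8 / 2**20:.1f} MiB")
```

The per-item figures printed by the loop match the 9.585, 14.378, and 19.170 bits per item listed in the new text.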