The rate is a decimal value between 0 and 1. For example, for a desired false positive rate of 0.1% (1 in 1000), `error_rate` should be set to 0.001.
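
For example, a minimal sketch of reserving a filter with this error rate from Python (assuming the RedisBloom `BF.RESERVE` command is available and the `redis-py` client is installed; the key name and capacity are illustrative):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Reserve a Bloom filter with a 0.1% false positive rate (error_rate = 0.001)
# and an expected capacity of 1,000,000 items.
# Syntax: BF.RESERVE key error_rate capacity
r.execute_command("BF.RESERVE", "ip_filter", 0.001, 1_000_000)
```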
#### 2. Expected capacity (`capacity`)
This is the number of items you expect to have in your filter in total. This is trivial to determine when you have a static set, but it becomes more challenging when your set grows over time. It's important to get the number right: if you **oversize**, you'll end up wasting memory; if you **undersize**, the filter will fill up and a new one will have to be stacked on top of it (sub-filter stacking). When a filter consists of multiple sub-filters stacked on top of each other, the latency for adds stays the same, but the latency for presence checks increases. The reason is the way the checks work: a regular check is first performed on the top (latest) sub-filter and, if a negative answer is returned, the next one is checked, and so on. That's where the added latency comes from.
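
To see sub-filter stacking in action, the sketch below (assuming RedisBloom and the `redis-py` client; the key name and numbers are illustrative) deliberately undersizes a filter and then inspects it with `BF.INFO`, which reports, among other things, how many sub-filters the filter currently consists of:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Deliberately undersized: room for 1,000 items at a 1% error rate.
r.execute_command("BF.RESERVE", "undersized_filter", 0.01, 1000)

# Add far more items than the reserved capacity, forcing new sub-filters
# to be stacked on top of the original one.
for i in range(5000):
    r.execute_command("BF.ADD", "undersized_filter", f"item:{i}")

# The exact reply fields may vary between versions, but the number of
# (sub-)filters reported here will now be greater than 1.
print(r.execute_command("BF.INFO", "undersized_filter"))
```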
#### 3. Scaling (`EXPANSION`)
Adding an item to a Bloom filter never fails due to the data structure "filling up". Instead, the error rate starts to grow. To keep the error rate close to the one set at filter initialization, the Bloom filter will auto-scale: when capacity is reached, an additional sub-filter is created.

The size of the new sub-filter is the size of the last sub-filter multiplied by `EXPANSION`. If the number of items to be stored in the filter is unknown, we recommend that you use an expansion of 2 or more to reduce the number of sub-filters. Otherwise, we recommend that you use an expansion of 1 to reduce memory consumption. The default expansion value is 2.

The filter will keep adding more hash functions for every new sub-filter in order to maintain your desired error rate.
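
As a rough sketch of how the expansion factor affects growth (plain arithmetic, nothing Redis-specific; the numbers are illustrative), a larger `EXPANSION` reaches a given total capacity with fewer sub-filters:

```python
# Total capacity after a given number of sub-filters, where each new
# sub-filter is `expansion` times larger than the previous one.
def total_capacity(initial: int, expansion: int, sub_filters: int) -> int:
    return sum(initial * expansion**i for i in range(sub_filters))

# Expansion 1: five equally sized sub-filters hold 5x the initial capacity.
print(total_capacity(1000, 1, 5))  # 5000

# Default expansion 2: the same five sub-filters already hold 31x the
# initial capacity (1000 + 2000 + 4000 + 8000 + 16000).
print(total_capacity(1000, 2, 5))  # 31000
```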
### Total size of a Bloom filter
The actual memory used by a Bloom filter is a function of the chosen error rate:

The optimal number of hash functions is `ceil(-ln(error_rate) / ln(2))`.

The required number of bits per item, given the desired `error_rate` and the optimal number of hash functions, is `-ln(error_rate) / ln(2)^2`. Hence, the required number of bits in the filter is `capacity * -ln(error_rate) / ln(2)^2`.

* **1%** error rate requires 7 hash functions and 9.585 bits per item.
* **0.1%** error rate requires 10 hash functions and 14.378 bits per item.
* **0.01%** error rate requires 14 hash functions and 19.170 bits per item.
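
These figures can be reproduced directly from the formulas above with a few lines of Python (a minimal sketch; nothing Redis-specific is assumed):

```python
import math

def bloom_parameters(error_rate: float) -> tuple[int, float]:
    """Optimal number of hash functions and bits per item for a given error rate."""
    hash_functions = math.ceil(-math.log(error_rate) / math.log(2))
    bits_per_item = -math.log(error_rate) / math.log(2) ** 2
    return hash_functions, bits_per_item

for rate in (0.01, 0.001, 0.0001):
    k, bits = bloom_parameters(rate)
    print(f"{rate:.2%}: {k} hash functions, {bits:.3f} bits per item")
```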
Just as a comparison, when using a Redis set for membership testing, the memory needed is:
```
memory_with_sets = capacity*(192b + value)
```
For a set of IP addresses, for example, we would have around 40 bytes (320 bits) per element, which is considerably higher than the 19.170 bits per item we need for a Bloom filter with a 0.01% false positive rate.
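
As a back-of-the-envelope comparison (a sketch based on the figures above; the 320 bits per set element is the rough estimate from the text, and actual memory usage depends on encoding and Redis version), for one million stored IP addresses:

```python
capacity = 1_000_000

# Redis set: roughly 320 bits (about 40 bytes) per element.
set_bits = capacity * 320

# Bloom filter with a 0.01% false positive rate: about 19.170 bits per item.
bloom_bits = capacity * 19.170

print(f"set:   ~{set_bits / 8 / 1024**2:.1f} MiB")    # ~38.1 MiB
print(f"bloom: ~{bloom_bits / 8 / 1024**2:.1f} MiB")  # ~2.3 MiB
```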