Commit 1c2ed6d

BIP-0158: allow filters to define values for P and M, reparameterize default filter
1 parent 4a85759 commit 1c2ed6d

bip-0158.mediawiki

Lines changed: 39 additions & 26 deletions
@@ -65,11 +65,14 @@ For each block, compact filters are derived containing sets of items associated
 with the block (eg. addresses sent to, outpoints spent, etc.). A set of such
 data objects is compressed into a probabilistic structure called a
 ''Golomb-coded set'' (GCS), which matches all items in the set with probability
-1, and matches other items with probability <code>2^(-P)</code> for some integer
-parameter <code>P</code>.
+1, and matches other items with probability <code>2^(-P)</code> for some
+integer parameter <code>P</code>. We also introduce parameter <code>M</code>,
+which allows each filter to tune the range that items are hashed onto
+before compressing. Each defined filter selects its own values for
+<code>P</code> and <code>M</code>.
 
 At a high level, a GCS is constructed from a set of <code>N</code> items by:
-# hashing all items to 64-bit integers in the range <code>[0, N * 2^P)</code>
+# hashing all items to 64-bit integers in the range <code>[0, N * M)</code>
 # sorting the hashed values in ascending order
 # computing the differences between each value and the previous one
 # writing the differences sequentially, compressed with Golomb-Rice coding
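
The last three steps of the list above (sort, take differences, Golomb-Rice code) can be sketched as follows. This is a minimal illustration, not the BIP's reference code: it assumes the items have already been hashed into the target range, and it returns the code as a string of '0'/'1' characters rather than a packed bit stream. The function names are hypothetical.

```python
from typing import Iterable


def golomb_rice_encode(x: int, p: int) -> str:
    """Golomb-Rice code of x with parameter p, as a '0'/'1' string (illustrative)."""
    q = x >> p                 # quotient, written in unary
    r = x & ((1 << p) - 1)     # remainder, written in exactly p bits
    remainder_bits = format(r, "0{}b".format(p)) if p else ""
    # q ones, a terminating zero, then the p-bit big-endian remainder.
    return "1" * q + "0" + remainder_bits


def encode_sorted_deltas(hashed: Iterable[int], p: int) -> str:
    """Sort the hashed values, then encode each difference from its predecessor."""
    bits = []
    last = 0
    for value in sorted(hashed):
        bits.append(golomb_rice_encode(value - last, p))
        last = value
    return "".join(bits)
```

For example, with `p = 2` the values 9, 2, 5 are sorted to 2, 5, 9, giving differences 2, 3, 4, which encode to `010`, `011`, and `1000`.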
@@ -80,9 +83,13 @@ The following sections describe each step in greater detail.
 
 The first step in the filter construction is hashing the variable-sized raw
 items in the set to the range <code>[0, F)</code>, where <code>F = N *
-2^P</code>. Set membership queries against the hash outputs will have a false
-positive rate of <code>2^(-P)</code>. To avoid integer overflow, the number of
-items <code>N</code> MUST be <2^32 and <code>P</code> MUST be <=32.
+M</code>. Customarily, <code>M</code> is set to <code>2^P</code>. However, if
+one is able to select both parameters independently, then more optimal values
+can be
+selected<ref>https://gist.github.com/sipa/576d5f09c3b86c3b1b75598d799fc845</ref>.
+Set membership queries against the hash outputs will have a false positive rate
+of approximately <code>1/M</code> (<code>2^(-P)</code> when <code>M = 2^P</code>). To avoid integer overflow, the
+number of items <code>N</code> MUST be <2^32 and <code>M</code> MUST be <2^32.
 
 The items are first passed through the pseudorandom function ''SipHash'', which
 takes a 128-bit key <code>k</code> and a variable-sized byte vector and produces
@@ -104,9 +111,9 @@ result.
 hash_to_range(item: []byte, F: uint64, k: [16]byte) -> uint64:
     return (siphash(k, item) * F) >> 64
 
-hashed_set_construct(raw_items: [][]byte, P: uint, k: [16]byte) -> []uint64:
+hashed_set_construct(raw_items: [][]byte, k: [16]byte, M: uint) -> []uint64:
     let N = len(raw_items)
-    let F = N << P
+    let F = N * M
 
     let set_items = []
 
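
A runnable sketch of the reparameterized `hashed_set_construct` may help show how `M` replaces the shift by `P`. Python's standard library has no SipHash, so this sketch substitutes keyed BLAKE2b truncated to 64 bits; that substitution is an assumption made only for illustration, and the outputs will not match real BIP-0158 filters. The helper name `keyed_hash64` is hypothetical.

```python
import hashlib


def keyed_hash64(k: bytes, item: bytes) -> int:
    # Stand-in for SipHash (assumption): keyed BLAKE2b with an 8-byte digest.
    digest = hashlib.blake2b(item, digest_size=8, key=k).digest()
    return int.from_bytes(digest, "little")


def hash_to_range(item: bytes, f: int, k: bytes) -> int:
    # Multiply-and-shift maps a uniform 64-bit value onto [0, f) without a modulo.
    return (keyed_hash64(k, item) * f) >> 64


def hashed_set_construct(raw_items, k: bytes, m: int):
    n = len(raw_items)
    # Both bounds keep F = N * M below 2^64, as the text requires.
    if n >= 2**32 or m >= 2**32:
        raise ValueError("N and M must each be below 2^32")
    f = n * m  # previously N << P; now the range is tuned by M directly
    return [hash_to_range(item, f, k) for item in raw_items]
```

Because the 64-bit hash is strictly less than 2^64, the product shifted right by 64 is always strictly less than `F`, so every output lands in `[0, N * M)`.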
@@ -197,8 +204,8 @@ with Golomb-Rice coding. Finally, the bit stream is padded with 0's to the
 nearest byte boundary and serialized to the output byte vector.
 
 <pre>
-construct_gcs(L: [][]byte, P: uint, k: [16]byte) -> []byte:
-    let set_items = hashed_set_construct(L, P, k)
+construct_gcs(L: [][]byte, P: uint, k: [16]byte, M: uint) -> []byte:
+    let set_items = hashed_set_construct(L, k, M)
 
     set_items.sort()
 
@@ -224,8 +231,8 @@ against the reconstructed values. Note that querying does not require the entire
 decompressed set be held in memory at once.
 
 <pre>
-gcs_match(key: [16]byte, compressed_set: []byte, target: []byte, P: uint, N: uint) -> bool:
-    let F = N << P
+gcs_match(key: [16]byte, compressed_set: []byte, target: []byte, P: uint, N: uint, M: uint) -> bool:
+    let F = N * M
     let target_hash = hash_to_range(target, F, k)
 
     stream = new_bit_stream(compressed_set)
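
The hunk cuts off before the decoding loop, so here is a hedged sketch of how a streaming match might proceed once `target_hash` has been computed with `F = N * M`. It is an illustration only: the compressed set is modeled as a '0'/'1' string instead of a byte vector, and the function name `gcs_match_bits` is hypothetical.

```python
def gcs_match_bits(bits: str, target_hash: int, p: int, n: int) -> bool:
    """Decode up to n Golomb-Rice deltas and test for target_hash (illustrative)."""
    pos = 0
    last = 0
    for _ in range(n):
        # Unary quotient: count ones up to the terminating zero.
        q = 0
        while bits[pos] == "1":
            q += 1
            pos += 1
        pos += 1  # skip the terminating zero
        # Remainder: the next p bits, big-endian.
        r = int(bits[pos:pos + p], 2) if p else 0
        pos += p
        value = last + ((q << p) | r)
        if value == target_hash:
            return True
        if value > target_hash:
            return False  # values ascend, so the target cannot appear later
        last = value
    return False
```

Only the running value `last` is kept between iterations, which is why querying does not need the whole decompressed set in memory.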
@@ -260,6 +267,8 @@ against the decompressed GCS contents. See
 
 This BIP defines one initial filter type:
 * Basic (<code>0x00</code>)
+* <code>M = 784931</code>
+* <code>P = 19</code>
 
 ==== Contents ====
 
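
As a quick sanity check on the Basic filter parameters added in this hunk, the arithmetic below relates `M = 784931` to `P = 19`. The factor 1.497137 is the one quoted elsewhere in this commit; the `1/M` false-positive estimate assumes hashes are uniformly distributed over `[0, N * M)`, which is an assumption of this sketch rather than a statement from the diff. The helper name `rounded_optimal_m` is hypothetical.

```python
from fractions import Fraction

P = 19
M = 784931

# How far M sits from the customary choice M = 2^P.
ratio = M / 2**P

# Estimated chance that a non-member matches, assuming uniform hashing:
# N values spread over a range of size N * M gives roughly 1 in M.
false_positive = Fraction(1, M)


def rounded_optimal_m(p: int, factor: float = 1.497137) -> int:
    """Round factor * 2^p to the nearest integer (illustrative helper)."""
    return round(factor * 2**p)
```

Under these assumptions `rounded_optimal_m(19)` reproduces 784931, and `M` also satisfies the `M < 2^32` bound needed to avoid overflow.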
@@ -271,24 +280,27 @@ items for each transaction in a block:
 
 ==== Construction ====
 
-Both the basic and extended filter types are constructed as Golomb-coded sets
-with the following parameters.
+The basic filter type is constructed as a Golomb-coded set with the following
+parameters.
 
-The parameter <code>P</code> MUST be set to <code>20</code>. This value was
-chosen as simulations show that it minimizes the bandwidth utilized, considering
-both the expected number of blocks downloaded due to false positives and the
-size of the filters themselves. The code along with a demo used for the
-parameter tuning can be found
-[https://github.com/Roasbeef/bips/blob/83b83c78e189be898573e0bfe936dd0c9b99ecb9/gcs_light_client/gentestvectors.go here].
+The parameter <code>P</code> MUST be set to <code>19</code>, and the parameter
+<code>M</code> MUST be set to <code>784931</code>. Analysis has shown that if
+one is able to select <code>P</code> and <code>M</code> independently, then
+setting <code>M=1.497137 * 2^P</code> is close to optimal
+<ref>https://gist.github.com/sipa/576d5f09c3b86c3b1b75598d799fc845</ref>.
+
+Empirical analysis also shows that these parameters minimize the
+bandwidth utilized, considering both the expected number of blocks downloaded
+due to false positives and the size of the filters themselves.
 
 The parameter <code>k</code> MUST be set to the first 16 bytes of the hash of
 the block for which the filter is constructed. This ensures the key is
 deterministic while still varying from block to block.
 
 Since the value <code>N</code> is required to decode a GCS, a serialized GCS
-includes it as a prefix, written as a CompactSize. Thus, the complete
-serialization of a filter is:
-* <code>N</code>, encoded as a CompactSize
+includes it as a prefix, written as a <code>CompactSize</code>. Thus, the
+complete serialization of a filter is:
+* <code>N</code>, encoded as a <code>CompactSize</code>
 * The bytes of the compressed filter itself
 
 ==== Signaling ====
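
The serialization described in this hunk (`N` as a `CompactSize`, followed by the compressed bytes) can be sketched as below. The `CompactSize` layout shown is the standard Bitcoin variable-length integer encoding; the function names are hypothetical and the sketch does no validation beyond choosing a width.

```python
import struct


def compact_size(n: int) -> bytes:
    """Encode n as a Bitcoin CompactSize (little-endian payloads)."""
    if n < 0xFD:
        return struct.pack("<B", n)               # single byte
    if n <= 0xFFFF:
        return b"\xfd" + struct.pack("<H", n)     # marker + 2 bytes
    if n <= 0xFFFFFFFF:
        return b"\xfe" + struct.pack("<I", n)     # marker + 4 bytes
    return b"\xff" + struct.pack("<Q", n)         # marker + 8 bytes


def serialize_filter(n: int, compressed: bytes) -> bytes:
    """Prefix the compressed GCS bytes with the item count N."""
    return compact_size(n) + compressed
```

Since `N` must be below 2^32, a filter's prefix never needs the 8-byte form in practice.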
@@ -311,7 +323,8 @@ though it requires implementation of the new filters.
 
 We would like to thank bfd (from the bitcoin-dev mailing list) for bringing the
 basis of this BIP to our attention, Greg Maxwell for pointing us in the
-direction of Golomb-Rice coding and fast range optimization, and Pedro
+direction of Golomb-Rice coding and fast range optimization, Pieter Wuille for
+his analysis of optimal GCS parameters, and Pedro
 Martelletto for writing the initial indexing code for <code>btcd</code>.
 
 We would also like to thank Dave Collins, JJ Jeffrey, and Eric Lombrozo for
@@ -363,8 +376,8 @@ easier to understand.
 === Golomb-Coded Set Multi-Match ===
 
 <pre>
-gcs_match_any(key: [16]byte, compressed_set: []byte, targets: [][]byte, P: uint, N: uint) -> bool:
-    let F = N << P
+gcs_match_any(key: [16]byte, compressed_set: []byte, targets: [][]byte, P: uint, N: uint, M: uint) -> bool:
+    let F = N * M
 
     // Map targets to the same range as the set hashes.
     let target_hashes = []