@@ -65,11 +65,14 @@ For each block, compact filters are derived containing sets of items associated
  with the block (e.g. addresses sent to, outpoints spent, etc.). A set of such
  data objects is compressed into a probabilistic structure called a
  ''Golomb-coded set'' (GCS), which matches all items in the set with probability
- 1, and matches other items with probability <code>2^(-P)</code> for some integer
- parameter <code>P</code>.
+ 1, and matches other items with probability <code>2^(-P)</code> for some
+ integer parameter <code>P</code>. We also introduce a parameter <code>M</code>
+ which allows each filter type to tune the range that items are hashed onto
+ before compressing. Each defined filter type selects its own values for P
+ and M.

  At a high level, a GCS is constructed from a set of <code>N</code> items by:
- # hashing all items to 64-bit integers in the range <code>[0, N * 2^P)</code>
+ # hashing all items to 64-bit integers in the range <code>[0, N * M)</code>
  # sorting the hashed values in ascending order
  # computing the differences between each value and the previous one
  # writing the differences sequentially, compressed with Golomb-Rice coding
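
The four construction steps listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the reference implementation: keyed BLAKE2b from the standard library stands in for SipHash, the output is collected as a string of bits rather than a packed bit stream, and the names <code>build_gcs_bits</code> and <code>golomb_rice_bits</code> are hypothetical.

```python
import hashlib


def hash_to_range(item: bytes, f: int, key: bytes) -> int:
    # ASSUMPTION: keyed BLAKE2b truncated to 64 bits is a stand-in PRF.
    # The BIP itself specifies SipHash with a 128-bit key.
    digest = hashlib.blake2b(item, digest_size=8, key=key).digest()
    h = int.from_bytes(digest, "little")
    # Multiply-and-shift maps the 64-bit value onto [0, f).
    return (h * f) >> 64


def golomb_rice_bits(value: int, p: int) -> str:
    # Quotient in unary (q ones, then a zero), followed by the
    # p-bit remainder, most significant bit first. Assumes p >= 1.
    q = value >> p
    r = value & ((1 << p) - 1)
    return "1" * q + "0" + format(r, f"0{p}b")


def build_gcs_bits(items, key: bytes, p: int, m: int):
    n = len(items)
    f = n * m                                   # step 1: range is [0, N * M)
    hashed = sorted(hash_to_range(i, f, key) for i in items)  # steps 1 and 2
    bits = []
    previous = 0
    for value in hashed:
        delta = value - previous                # step 3: difference to previous
        bits.append(golomb_rice_bits(delta, p))  # step 4: Golomb-Rice code
        previous = value
    return hashed, "".join(bits)
```

Calling <code>build_gcs_bits([b"a", b"b", b"c"], bytes(16), 19, 784931)</code> returns the sorted hashed values together with the concatenated code words. Padding the bit string to a byte boundary is left out of the sketch.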
@@ -80,9 +83,13 @@ The following sections describe each step in greater detail.

  The first step in the filter construction is hashing the variable-sized raw
  items in the set to the range <code>[0, F)</code>, where <code>F = N *
- 2^P</code>. Set membership queries against the hash outputs will have a false
- positive rate of <code>2^(-P)</code>. To avoid integer overflow, the number of
- items <code>N</code> MUST be <2^32 and <code>P</code> MUST be <=32.
+ M</code>. Customarily, <code>M</code> is set to <code>2^P</code>. However, if
+ one is able to select both parameters independently, then better values
+ can be selected<ref>https://gist.github.com/sipa/576d5f09c3b86c3b1b75598d799fc845</ref>.
+ Set membership queries against the hash outputs will have a false positive rate
+ of approximately <code>1/M</code>, which equals <code>2^(-P)</code> when
+ <code>M = 2^P</code>. To avoid integer overflow, the number of items
+ <code>N</code> MUST be <2^32 and <code>M</code> MUST be <2^32.
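
To illustrate why both limits are stated, the following Python sketch shows the multiply-and-shift range reduction that <code>hash_to_range</code> applies to a 64-bit hash, together with a check of the two bounds. The helper names <code>range_size</code> and <code>fast_range</code> are hypothetical. Python integers do not overflow, so the bounds are enforced explicitly here; in a fixed-width language they keep <code>F</code> within 64 bits and the product within 128 bits.

```python
MASK64 = (1 << 64) - 1


def range_size(n: int, m: int) -> int:
    # With N < 2^32 and M < 2^32 the product F = N * M stays below 2^64.
    if not 0 <= n < (1 << 32):
        raise ValueError("N must be < 2^32")
    if not 0 <= m < (1 << 32):
        raise ValueError("M must be < 2^32")
    return n * m


def fast_range(hash64: int, f: int) -> int:
    # Treat hash64 / 2^64 as a fraction in [0, 1) and scale it by f.
    # The intermediate product needs at most 128 bits.
    if not 0 <= hash64 <= MASK64:
        raise ValueError("hash must fit in 64 bits")
    return (hash64 * f) >> 64
```

For example, <code>fast_range(1 << 63, 100)</code> yields 50, since the input sits exactly halfway through the 64-bit range.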

  The items are first passed through the pseudorandom function ''SipHash'', which
  takes a 128-bit key <code>k</code> and a variable-sized byte vector and produces
@@ -104,9 +111,9 @@ result.
  hash_to_range(item: []byte, F: uint64, k: [16]byte) -> uint64:
      return (siphash(k, item) * F) >> 64

- hashed_set_construct(raw_items: [][]byte, P: uint, k: [16]byte) -> []uint64:
+ hashed_set_construct(raw_items: [][]byte, k: [16]byte, M: uint) -> []uint64:
      let N = len(raw_items)
-     let F = N << P
+     let F = N * M

      let set_items = []

@@ -197,8 +204,8 @@ with Golomb-Rice coding. Finally, the bit stream is padded with 0's to the
  nearest byte boundary and serialized to the output byte vector.

  <pre>
- construct_gcs(L: [][]byte, P: uint, k: [16]byte) -> []byte:
-     let set_items = hashed_set_construct(L, P, k)
+ construct_gcs(L: [][]byte, P: uint, k: [16]byte, M: uint) -> []byte:
+     let set_items = hashed_set_construct(L, k, M)

      set_items.sort()

@@ -224,8 +231,8 @@ against the reconstructed values. Note that querying does not require the entire
  decompressed set be held in memory at once.

  <pre>
- gcs_match(key: [16]byte, compressed_set: []byte, target: []byte, P: uint, N: uint) -> bool:
-     let F = N << P
+ gcs_match(key: [16]byte, compressed_set: []byte, target: []byte, P: uint, N: uint, M: uint) -> bool:
+     let F = N * M
      let target_hash = hash_to_range(target, F, key)

      stream = new_bit_stream(compressed_set)
@@ -260,6 +267,8 @@ against the decompressed GCS contents. See

  This BIP defines one initial filter type:
  * Basic (<code>0x00</code>)
+ * <code>M = 784931</code>
+ * <code>P = 19</code>

  ==== Contents ====

@@ -271,24 +280,27 @@ items for each transaction in a block:

  ==== Construction ====

- Both the basic and extended filter types are constructed as Golomb-coded sets
- with the following parameters.
+ The basic filter type is constructed as a Golomb-coded set with the following
+ parameters.

- The parameter <code>P</code> MUST be set to <code>20</code>. This value was
- chosen as simulations show that it minimizes the bandwidth utilized, considering
- both the expected number of blocks downloaded due to false positives and the
- size of the filters themselves. The code along with a demo used for the
- parameter tuning can be found
- [https://github.com/Roasbeef/bips/blob/83b83c78e189be898573e0bfe936dd0c9b99ecb9/gcs_light_client/gentestvectors.go here].
+ The parameter <code>P</code> MUST be set to <code>19</code>, and the parameter
+ <code>M</code> MUST be set to <code>784931</code>. Analysis has shown that if
+ one is able to select <code>P</code> and <code>M</code> independently, then
+ setting <code>M = 1.497137 * 2^P</code> is close to optimal
+ <ref>https://gist.github.com/sipa/576d5f09c3b86c3b1b75598d799fc845</ref>.
+
+ Empirical analysis also shows that these parameters minimize the bandwidth
+ utilized, considering both the expected number of blocks downloaded due to
+ false positives and the size of the filters themselves.
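
As a quick arithmetic check of the values above, the following Python sketch derives <code>M</code> from <code>P</code> using the quoted ratio. The function names are hypothetical, and the <code>1/M</code> false positive estimate is an assumption that follows from hashing uniformly onto <code>[0, N * M)</code>; it is not a figure taken from the cited analysis.

```python
P = 19
M = 784931
RATIO = 1.497137  # ratio M / 2^P quoted in the text


def near_optimal_m(p: int, ratio: float = RATIO) -> int:
    # Scale 2^P by the quoted ratio and round to the nearest integer.
    return round(ratio * (1 << p))


def false_positive_rate(m: int) -> float:
    # ASSUMPTION: a non-member hashes uniformly onto [0, N * M), so it
    # collides with one of the N stored values with probability about 1 / M.
    return 1.0 / m
```

With <code>P = 19</code>, <code>near_optimal_m(P)</code> evaluates to 784931, which matches the mandated <code>M</code>. The corresponding false positive estimate is slightly lower than the <code>2^(-19)</code> that <code>M = 2^P</code> would give.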

  The parameter <code>k</code> MUST be set to the first 16 bytes of the hash of
  the block for which the filter is constructed. This ensures the key is
  deterministic while still varying from block to block.

  Since the value <code>N</code> is required to decode a GCS, a serialized GCS
- includes it as a prefix, written as a CompactSize. Thus, the complete
- serialization of a filter is:
- * <code>N</code>, encoded as a CompactSize
+ includes it as a prefix, written as a <code>CompactSize</code>. Thus, the
+ complete serialization of a filter is:
+ * <code>N</code>, encoded as a <code>CompactSize</code>
  * The bytes of the compressed filter itself
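
The serialization described above can be sketched in Python as follows. The function names <code>compact_size</code> and <code>serialize_filter</code> are hypothetical; the encoding follows the usual Bitcoin <code>CompactSize</code> rules (a single byte below <code>0xfd</code>, otherwise a marker byte followed by a little-endian integer of 2, 4, or 8 bytes).

```python
import struct


def compact_size(n: int) -> bytes:
    # Encode a non-negative integer as a Bitcoin CompactSize.
    if n < 0:
        raise ValueError("CompactSize cannot encode negative values")
    if n < 0xFD:
        return bytes([n])
    if n <= 0xFFFF:
        return b"\xfd" + struct.pack("<H", n)
    if n <= 0xFFFFFFFF:
        return b"\xfe" + struct.pack("<I", n)
    if n <= 0xFFFFFFFFFFFFFFFF:
        return b"\xff" + struct.pack("<Q", n)
    raise ValueError("value does not fit in 64 bits")


def serialize_filter(n: int, compressed: bytes) -> bytes:
    # N as a CompactSize prefix, followed by the compressed filter bytes.
    return compact_size(n) + compressed
```

For instance, a filter holding two items whose compressed bytes are <code>ab cd</code> serializes to <code>02 ab cd</code>.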

  ==== Signaling ====
@@ -311,7 +323,8 @@ though it requires implementation of the new filters.

  We would like to thank bfd (from the bitcoin-dev mailing list) for bringing the
  basis of this BIP to our attention, Greg Maxwell for pointing us in the
- direction of Golomb-Rice coding and fast range optimization, and Pedro
+ direction of Golomb-Rice coding and fast range optimization, Pieter Wuille for
+ his analysis of optimal GCS parameters, and Pedro
  Martelletto for writing the initial indexing code for <code>btcd</code>.

  We would also like to thank Dave Collins, JJ Jeffrey, and Eric Lombrozo for
@@ -363,8 +376,8 @@ easier to understand.
  === Golomb-Coded Set Multi-Match ===

  <pre>
- gcs_match_any(key: [16]byte, compressed_set: []byte, targets: [][]byte, P: uint, N: uint) -> bool:
-     let F = N << P
+ gcs_match_any(key: [16]byte, compressed_set: []byte, targets: [][]byte, P: uint, N: uint, M: uint) -> bool:
+     let F = N * M

      // Map targets to the same range as the set hashes.
      let target_hashes = []