BIP-0158: allow filters to define values for P and M, reparameterize default filter

Roasbeef · Roasbeef · commit 1c2ed6dce331 · 2018-07-04T15:41:05.000-05:00
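
As a reviewer's aid, the parameter arithmetic in this change can be sketched as below. This is a minimal illustration, not the BIP's reference code: the helper names `derive_m`, `range_limit`, and `fast_range` are hypothetical, and the 64-bit SipHash output is assumed to be computed elsewhere and passed in as a plain integer.

```python
# Minimal sketch of the reparameterized range computation (hypothetical helpers).

P = 19          # Golomb-Rice parameter for the basic filter, per this change
M = 784931      # range multiplier for the basic filter, per this change


def derive_m(p: int, ratio: float = 1.497137) -> int:
    """Round ratio * 2^p to the nearest integer (the near-optimal M for a given P)."""
    return round(ratio * (1 << p))


def range_limit(n: int, m: int) -> int:
    """Compute F = N * M, enforcing the N < 2^32 and M < 2^32 bounds."""
    if not (0 <= n < 2**32 and 0 < m < 2**32):
        raise ValueError("N and M must both be below 2^32")
    return n * m


def fast_range(h: int, f: int) -> int:
    """Map a 64-bit hash h onto [0, f) as (h * f) >> 64, like hash_to_range."""
    if not 0 <= h < 2**64:
        raise ValueError("h must be a 64-bit unsigned value")
    return (h * f) >> 64


if __name__ == "__main__":
    f = range_limit(3, M)
    print(derive_m(P), f, fast_range(2**64 - 1, f))
```

With `N` and `M` both below 2^32, `F` stays below 2^64, so the 64-by-64-bit multiply in `fast_range` fits in 128 bits.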
diff --git a/bip-0158.mediawiki b/bip-0158.mediawiki
@@ -65,11 +65,14 @@ For each block, compact filters are derived containing sets of items associated
 with the block (eg. addresses sent to, outpoints spent, etc.). A set of such
 data objects is compressed into a probabilistic structure called a
 ''Golomb-coded set'' (GCS), which matches all items in the set with probability
-1, and matches other items with probability <code>2^(-P)</code> for some integer
-parameter <code>P</code>.
+1, and matches other items with probability <code>1/M</code> for some integer
+parameter <code>M</code>. A second integer parameter <code>P</code> sets the
+bit width of the Golomb-Rice remainder used when compressing. Customarily
+<code>M = 2^P</code>, but each defined filter type selects its own values for
+<code>P</code> and <code>M</code>.
 
 At a high level, a GCS is constructed from a set of <code>N</code> items by:
-# hashing all items to 64-bit integers in the range <code>[0, N * 2^P)</code>
+# hashing all items to 64-bit integers in the range <code>[0, N * M)</code>
 # sorting the hashed values in ascending order
 # computing the differences between each value and the previous one
 # writing the differences sequentially, compressed with Golomb-Rice coding
@@ -80,9 +83,13 @@ The following sections describe each step in greater detail.
 
 The first step in the filter construction is hashing the variable-sized raw
 items in the set to the range <code>[0, F)</code>, where <code>F = N *
-2^P</code>. Set membership queries against the hash outputs will have a false
-positive rate of <code>2^(-P)</code>. To avoid integer overflow, the number of
-items <code>N</code> MUST be <2^32 and <code>P</code> MUST be <=32.
+M</code>. Customarily, <code>M</code> is set to <code>2^P</code>. However, if
+both parameters can be selected independently, then values closer to optimal
+can be
+chosen<ref>https://gist.github.com/sipa/576d5f09c3b86c3b1b75598d799fc845</ref>.
+Set membership queries against the hash outputs will have a false positive rate
+of approximately <code>1/M</code>. To avoid integer overflow, the
+number of items <code>N</code> MUST be <2^32 and <code>M</code> MUST be <2^32.
 
 The items are first passed through the pseudorandom function ''SipHash'', which
 takes a 128-bit key <code>k</code> and a variable-sized byte vector and produces
@@ -104,9 +111,9 @@ result.
 hash_to_range(item: []byte, F: uint64, k: [16]byte) -> uint64:
     return (siphash(k, item) * F) >> 64
 
-hashed_set_construct(raw_items: [][]byte, P: uint, k: [16]byte) -> []uint64:
+hashed_set_construct(raw_items: [][]byte, k: [16]byte, M: uint) -> []uint64:
     let N = len(raw_items)
-    let F = N << P
+    let F = N * M
 
     let set_items = []
 
@@ -197,8 +204,8 @@ with Golomb-Rice coding. Finally, the bit stream is padded with 0's to the
 nearest byte boundary and serialized to the output byte vector.
 
 <pre>
-construct_gcs(L: [][]byte, P: uint, k: [16]byte) -> []byte:
-    let set_items = hashed_set_construct(L, P, k)
+construct_gcs(L: [][]byte, P: uint, k: [16]byte, M: uint) -> []byte:
+    let set_items = hashed_set_construct(L, k, M)
 
     set_items.sort()
 
@@ -224,8 +231,8 @@ against the reconstructed values. Note that querying does not require the entire
 decompressed set be held in memory at once.
 
 <pre>
-gcs_match(key: [16]byte, compressed_set: []byte, target: []byte, P: uint, N: uint) -> bool:
-    let F = N << P
+gcs_match(key: [16]byte, compressed_set: []byte, target: []byte, P: uint, N: uint, M: uint) -> bool:
+    let F = N * M
     let target_hash = hash_to_range(target, F, k)
 
     stream = new_bit_stream(compressed_set)
@@ -260,6 +267,8 @@ against the decompressed GCS contents. See
 
 This BIP defines one initial filter type:
 * Basic (<code>0x00</code>)
+** <code>M = 784931</code>
+** <code>P = 19</code>
 
 ==== Contents ====
 
@@ -271,24 +280,27 @@ items for each transaction in a block:
 
 ==== Construction ====
 
-Both the basic and extended filter types are constructed as Golomb-coded sets
-with the following parameters.
+The basic filter type is constructed as a Golomb-coded set with the following
+parameters.
 
-The parameter <code>P</code> MUST be set to <code>20</code>. This value was
-chosen as simulations show that it minimizes the bandwidth utilized, considering
-both the expected number of blocks downloaded due to false positives and the
-size of the filters themselves. The code along with a demo used for the
-parameter tuning can be found
-[https://github.com/Roasbeef/bips/blob/83b83c78e189be898573e0bfe936dd0c9b99ecb9/gcs_light_client/gentestvectors.go here].
+The parameter <code>P</code> MUST be set to <code>19</code>, and the parameter
+<code>M</code> MUST be set to <code>784931</code>. Analysis has shown that if
+one is able to select <code>P</code> and <code>M</code> independently, then
+setting <code>M=1.497137 * 2^P</code> is close to optimal
+<ref>https://gist.github.com/sipa/576d5f09c3b86c3b1b75598d799fc845</ref>.
+
+Empirical analysis also shows that these parameters minimize the bandwidth
+utilized, considering both the expected number of blocks downloaded due to
+false positives and the size of the filters themselves.
 
 The parameter <code>k</code> MUST be set to the first 16 bytes of the hash of
 the block for which the filter is constructed. This ensures the key is
 deterministic while still varying from block to block.
 
 Since the value <code>N</code> is required to decode a GCS, a serialized GCS
-includes it as a prefix, written as a CompactSize. Thus, the complete
-serialization of a filter is:
-* <code>N</code>, encoded as a CompactSize
+includes it as a prefix, written as a <code>CompactSize</code>. Thus, the
+complete serialization of a filter is:
+* <code>N</code>, encoded as a <code>CompactSize</code>
 * The bytes of the compressed filter itself
 
 ==== Signaling ====
@@ -311,7 +323,8 @@ though it requires implementation of the new filters.
 
 We would like to thank bfd (from the bitcoin-dev mailing list) for bringing the
 basis of this BIP to our attention, Greg Maxwell for pointing us in the
-direction of Golomb-Rice coding and fast range optimization, and Pedro
+direction of Golomb-Rice coding and fast range optimization, Pieter Wuille for
+his analysis of optimal GCS parameters, and Pedro
 Martelletto for writing the initial indexing code for <code>btcd</code>.
 
 We would also like to thank Dave Collins, JJ Jeffrey, and Eric Lombrozo for
@@ -363,8 +376,8 @@ easier to understand.
 === Golomb-Coded Set Multi-Match ===
 
 <pre>
-gcs_match_any(key: [16]byte, compressed_set: []byte, targets: [][]byte, P: uint, N: uint) -> bool:
-    let F = N << P
+gcs_match_any(key: [16]byte, compressed_set: []byte, targets: [][]byte, P: uint, N: uint, M: uint) -> bool:
+    let F = N * M
 
     // Map targets to the same range as the set hashes.
     let target_hashes = []