
Commit dd869c6

Add an explanation of quickly hashing onto a non-power-of-two range.
In Olaoluwa Osuntokun's recent protocol proposal they were using a mod in an inner loop. I wanted to suggest a normative protocol change to use the trick we use here, but to find an explanation of it I had to dig up the PR on GitHub. After I posted about it, several other developers commented that it was very interesting and that they were unaware of it. Ideally the code should be self-documenting and help educate other contributors about non-obvious techniques that we use, so I've written a description of the technique with citations for future reference.
1 parent 2c2d988

File tree

1 file changed: +31 −0


src/cuckoocache.h

Lines changed: 31 additions & 0 deletions
@@ -206,6 +206,37 @@ class cache
     /** compute_hashes is convenience for not having to write out this
      * expression everywhere we use the hash values of an Element.
      *
+     * We need to map the 32-bit input hash onto a hash bucket in a range [0, size) in a
+     * manner which preserves as much of the hash's uniformity as possible. Ideally
+     * this would be done by bitmasking, but the size is usually not a power of two.
+     *
+     * The naive approach would be to use a mod -- which isn't perfectly uniform, but so
+     * long as the hash is much larger than the size it is not that bad. Unfortunately,
+     * mod/division is fairly slow on ordinary microprocessors (e.g. 90-ish cycles on
+     * Haswell; ARM doesn't even have an instruction for it). When the divisor is a
+     * compile-time constant the compiler will do clever tricks to turn it into a
+     * multiply+add+shift, but size is a run-time value so the compiler can't do that here.
+     *
+     * One option would be to implement the same trick the compiler uses and compute the
+     * constants for exact division based on the size, as described in "N-Bit Unsigned
+     * Division via N-Bit Multiply-Add" by Arch D. Robison in 2005. But that code is
+     * somewhat complicated and the result is still slower than other options.
+     *
+     * Instead we treat the 32-bit random number as a Q32 fixed-point number in the range
+     * [0, 1) and simply multiply it by the size. Then we just shift the result down by
+     * 32 bits to get our bucket number. The result has the same non-uniformity as a
+     * mod, but it is much faster to compute. More about this technique can be found at
+     * http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/
+     *
+     * The resulting non-uniformity is also more equally distributed, which would be
+     * advantageous for something like linear probing, though it shouldn't matter
+     * one way or the other for a cuckoo table.
+     *
+     * The primary disadvantage of this approach is that increased intermediate precision
+     * is required, but for a 32-bit random number we only need the high 32 bits of a
+     * 32*32->64-bit multiply, which means the operation is reasonably fast even on a
+     * typical 32-bit processor.
+     *
      * @param e the element whose hashes will be returned
      * @returns std::array<uint32_t, 8> of deterministic hashes derived from e
      */
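For illustration, a minimal standalone sketch of the reduction described in the new comment; the function name map_into_range is hypothetical and not part of cuckoocache.h:

    #include <cstdint>

    // Map a uniformly distributed 32-bit hash onto [0, size) without mod/division:
    // treat the hash as a Q32 fixed-point fraction in [0, 1), multiply by size, and
    // keep the high 32 bits of the 32*32->64-bit product.
    inline uint32_t map_into_range(uint32_t hash, uint32_t size)
    {
        return static_cast<uint32_t>((static_cast<uint64_t>(hash) * size) >> 32);
    }

For example, with size = 100 a hash near 0 lands in bucket 0 and a hash near 2^32 lands in bucket 99, giving the same proportional spread as hash % 100 at the cost of one multiply and one shift.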
