@@ -206,6 +206,37 @@ class cache
/** compute_hashes is a convenience for not having to write out this
* expression everywhere we use the hash values of an Element.
*
+ * We need to map the 32-bit input hash onto a hash bucket in the range [0, size) in a
+ * manner that preserves as much of the hash's uniformity as possible. Ideally
+ * this would be done by bitmasking, but the size is usually not a power of two.
+ *
+ * The naive approach would be to use a mod, which isn't perfectly uniform, but so
+ * long as the hash range is much larger than size it is not that bad. Unfortunately,
+ * mod/division is fairly slow on ordinary microprocessors (e.g. 90-ish cycles on
+ * Haswell, and some ARM cores don't even have an instruction for it). When the divisor
+ * is a constant the compiler will do clever tricks to turn it into a multiply+add+shift,
+ * but size is a run-time value so the compiler can't do that here.
+ *
+ * One option would be to implement the same trick the compiler uses and compute the
+ * constants for exact division based on the size, as described in "{N}-bit Unsigned
+ * Division via {N}-bit Multiply-Add" by Arch D. Robison in 2005. But that code is
+ * somewhat complicated and the result is still slower than other options:
+ *
+ * Instead we treat the 32-bit random number as a Q32 fixed-point number in the range
+ * [0,1) and simply multiply it by the size. Then we just shift the result down by
+ * 32 bits to get our bucket number. The result has the same non-uniformity as a
+ * mod, but it is much faster to compute. More about this technique can be found at
+ * http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/
+ *
+ * The resulting non-uniformity is also more evenly distributed, which would be
+ * advantageous for something like linear probing, though it shouldn't matter
+ * one way or the other for a cuckoo table.
+ *
+ * The primary disadvantage of this approach is that it requires increased
+ * intermediate precision, but for a 32-bit random number we only need the high 32 bits
+ * of a 32*32->64 multiply, which means the operation is reasonably fast even on a
+ * typical 32-bit processor.
+ *
* @param e the element whose hashes will be returned
* @returns std::array<uint32_t, 8> of deterministic hashes derived from e
*/
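The reduction described in the comment can be sketched in a few lines. This is only an illustration under stated assumptions, not the code from this patch: the names `fast_range32` and `mod_range32` are hypothetical, and `size` is assumed to fit in 32 bits so that the product fits in 64 bits.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the patch): map a 32-bit hash onto
// [0, size) without a division.
constexpr std::uint32_t fast_range32(std::uint32_t hash, std::uint32_t size)
{
    // Read hash as the Q32 fraction hash / 2^32 in [0, 1). Widening to
    // 64 bits before multiplying keeps the full 32*32->64 product, and the
    // shift keeps its high 32 bits: floor(hash * size / 2^32), which is
    // always strictly less than size.
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(hash) * size) >> 32);
}

// The naive mod-based mapping, shown only for comparison.
constexpr std::uint32_t mod_range32(std::uint32_t hash, std::uint32_t size)
{
    return hash % size;
}

// Compile-time spot checks of the boundary behaviour.
static_assert(fast_range32(0u, 10u) == 0u, "smallest hash maps to bucket 0");
static_assert(fast_range32(0xFFFFFFFFu, 10u) == 9u, "largest hash maps to the last bucket");
static_assert(fast_range32(0x80000000u, 10u) == 5u, "midpoint hash maps to the middle bucket");
```

Note that the two mappings assign hashes to buckets differently: the mod keeps the low-order information of the hash, while the multiply-shift is driven by the high-order bits, so consecutive hash values tend to land in the same bucket rather than in consecutive ones.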