After extensively reading the code, my best guess was:
I think the algorithm (compressed_map) is using "arithmetic coding" (or some homegrown alternative). But I can explain it better using Huffman encoding.
The size of a probabilistic map is (some constant C) * (the number of keys we want to guarantee we get the correct answer for) * (the size of each value), if we ignore taking advantage of "accidentally getting the correct answer". At first blush it seems like we're stuck. The size of each value is basically fixed, as is the number of keys we care about getting the correct answer for. We could deal with different bits of the value in different structures, but our formula is linear, so that doesn't buy us anything: it takes the same amount of space to have N structures that each encode 1 bit as it does to have 1 structure that encodes N bits.
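To make that concrete, here is a toy calculation (nothing from the actual compressed_map API; the overhead factor is made up) showing that under a purely linear cost model, bit-slicing the value buys nothing by itself:

```rust
// Illustrative only: size ≈ C * num_keys * value_bits, with C an assumed
// overhead constant. Under that model, one 8-bit-wide structure and eight
// 1-bit-wide structures cost exactly the same.

/// Approximate size in bits of a structure that must answer correctly for
/// `num_keys` keys, each carrying a `value_bits`-wide value.
/// The 23/20 factor is a stand-in for the constant C (about 1.15).
fn structure_bits(num_keys: u64, value_bits: u64) -> u64 {
    num_keys * value_bits * 23 / 20
}

fn main() {
    let keys = 1_000_000;
    let one_wide = structure_bits(keys, 8); // one structure, 8-bit values
    let bit_sliced: u64 = (0..8).map(|_| structure_bits(keys, 1)).sum(); // eight 1-bit structures
    assert_eq!(one_wide, bit_sliced); // the linear model can't tell them apart
    println!("{one_wide} bits either way: splitting alone gains nothing");
}
```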
That's where Huffman encoding comes in. We encode the values according to the Huffman algorithm, then process things one bit at a time. The first data structure encodes the first bit and needs to be correct for every key. So it takes quite a bit of space, but it avoids being a disastrous amount because it's only one bit wide. We can do this again for the next bit, but this time we can skip any keys whose Huffman encoding has already reached a leaf node. We don't care what value is returned by the structure for that bit of that key, because a reader processing that key will already have found a complete value. With every additional bit we have fewer keys to process and therefore a smaller data structure.
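Here is a rough sketch of that bit-by-bit layering, again purely illustrative and not the real compressed_map construction: given each key's Huffman codeword length, level `d` only has to cover the keys whose codewords extend past bit `d`, so each successive 1-bit structure covers fewer keys.

```rust
// Hedged sketch of the layering idea, not the actual construction.

/// Estimated total size (in "C units", i.e. before multiplying by the
/// constant C) of a stack of 1-bit structures, given each key's
/// Huffman codeword length in bits.
fn layered_cost(codeword_lens: &[usize]) -> usize {
    let max_len = codeword_lens.iter().copied().max().unwrap_or(0);
    (0..max_len)
        .map(|depth| {
            // Keys whose codeword still has a bit at this depth; the
            // others have already reached a leaf and can be skipped.
            codeword_lens.iter().filter(|&&len| len > depth).count()
        })
        .sum()
}

fn main() {
    // Hypothetical example: 1000 keys, where a common value gets the
    // 1-bit codeword "0" and two rarer values get 2-bit codewords.
    let mut lens = vec![1usize; 800]; // 800 keys map to the common value
    lens.extend(vec![2usize; 200]);   // 200 keys map to the rarer values
    // Level 0 must cover all 1000 keys; level 1 only the 200 stragglers.
    assert_eq!(layered_cost(&lens), 1200);
    // A flat 2-bit-wide structure would cost 2 * 1000 = 2000 in the same units.
    println!("layered: {} vs flat: {}", layered_cost(&lens), 2 * lens.len());
}
```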
and you responded with:
My compressed_map crate can be thought of as more or less the same idea as a Huffman code plus a BinaryFuse-like matrix map, much as @Eh2406 suggests. It's not quite a Huffman code because the possibility of getting the correct answer "by accident" changes the cost model, but it's the same idea.
I would love to hear more about how you analyze the cost model, and how you took advantage of getting the correct answer by accident. But I don't want to sidetrack that issue any more than I already have, so I thought I would open a separate issue for you to expound on your insights, if you're willing to.