After extensively reading the code, my best guess was:
I think the algorithm (compressed_map) is using "arithmetic coding" (or some homegrown alternative). But I can explain it better using Huffman encoding.
The size of a probabilistic map is (some constant C) * (the number of keys we want to guarantee we get the correct answer for) * (the size of each value), if we ignore taking advantage of "accidentally getting the correct answer". At first blush it seems like we're stuck. The size of each value is basically fixed, as is the number of keys we care about getting the correct answer for. We could deal with different bits of the value in different structures, but our formula is linear, so that doesn't buy us anything: it takes the same amount of space to have N structures that each encode 1 bit as it does to have 1 structure that encodes N bits.
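To make that concrete, here is a toy calculation (nothing from the actual compressed_map API; the overhead factor is made up) showing that under a purely linear cost model, bit-slicing the value buys nothing by itself:

```rust
// Illustrative only: size ≈ C * num_keys * value_bits, with C an assumed
// overhead constant. Under that model, one 8-bit-wide structure and eight
// 1-bit-wide structures cost exactly the same.

/// Approximate size in bits of a structure that must answer correctly for
/// `num_keys` keys, each carrying a `value_bits`-wide value.
/// The 23/20 factor is a stand-in for the constant C (about 1.15).
fn structure_bits(num_keys: u64, value_bits: u64) -> u64 {
    num_keys * value_bits * 23 / 20
}

fn main() {
    let keys = 1_000_000;
    let one_wide = structure_bits(keys, 8); // one structure, 8-bit values
    let bit_sliced: u64 = (0..8).map(|_| structure_bits(keys, 1)).sum(); // eight 1-bit structures
    assert_eq!(one_wide, bit_sliced); // the linear model can't tell them apart
    println!("{one_wide} bits either way: splitting alone gains nothing");
}
```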
That's where Huffman encoding comes in. We encode the values according to the Huffman algorithm, then process things one bit at a time. The first data structure encodes the first bit and needs to be correct for every key. So it takes quite a bit of space, but it avoids being a disastrous amount because it's only one bit wide. We can do this again for the next bit, but this time we can skip any keys whose Huffman encoding has already reached a leaf node. We don't care what value is returned by the structure for that bit of that key, because a reader processing that key will already have found a complete value. With every additional bit we have fewer keys to process and therefore a smaller data structure.
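Here is a rough sketch of that bit-by-bit layering, again purely illustrative and not the real compressed_map construction: given each key's Huffman codeword length, level `d` only has to cover the keys whose codewords extend past bit `d`, so each successive 1-bit structure covers fewer keys.

```rust
// Hedged sketch of the layering idea, not the actual construction.

/// Estimated total size (in "C units", i.e. before multiplying by the
/// constant C) of a stack of 1-bit structures, given each key's
/// Huffman codeword length in bits.
fn layered_cost(codeword_lens: &[usize]) -> usize {
    let max_len = codeword_lens.iter().copied().max().unwrap_or(0);
    (0..max_len)
        .map(|depth| {
            // Keys whose codeword still has a bit at this depth; the
            // others have already reached a leaf and can be skipped.
            codeword_lens.iter().filter(|&&len| len > depth).count()
        })
        .sum()
}

fn main() {
    // Hypothetical example: 1000 keys, where a common value gets the
    // 1-bit codeword "0" and two rarer values get 2-bit codewords.
    let mut lens = vec![1usize; 800]; // 800 keys map to the common value
    lens.extend(vec![2usize; 200]);   // 200 keys map to the rarer values
    // Level 0 must cover all 1000 keys; level 1 only the 200 stragglers.
    assert_eq!(layered_cost(&lens), 1200);
    // A flat 2-bit-wide structure would cost 2 * 1000 = 2000 in the same units.
    println!("layered: {} vs flat: {}", layered_cost(&lens), 2 * lens.len());
}
```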
and you responded with:
My compressed_map crate can be thought of as more or less the same idea as a Huffman code plus a BinaryFuse-like matrix map, much as @Eh2406 suggests. It's not quite a Huffman code because the possibility of getting the correct answer "by accident" changes the cost model, but it's the same idea.
I would love to hear more about how you analyze the cost model, and how you took advantage of getting the correct answer by accident. But I don't want to sidetrack that issue any more than I already have, so I thought I would open a separate issue for you to expound on your insights, if you're willing to.