
Could do with some intuition into how get_residual_entropy method works :) #3

@hughperkins

Description

Hi, I've been trying to understand the get_residual_entropy method, and can't quite figure it out :) Awesome paper by the way :)

Really, my underlying need is to be able to use it on input where the language L has a vocab size larger than 2. I'm unclear whether that's possible in the existing implementation? I'm sort of guessing that it's using a greedy approach to minimize over all partitions, and that the greedy approach only works for a vocab size of 2 in L?

One question I'm wondering about is: is the current implementation actually normalizing, as per the formula in section 5.1 of the paper? My entropy skills are not that great, so I may be misunderstanding. Do you mind if I walk through a simple case study, and you can tell me where I'm going wrong?

Let's suppose that we have num categories = 2, and each category C[i] can take 3 possible values. The utterances will have a vocab size of 2 (as per the paper), and a total length of 4 bits.

Then, L* will comprise the following:

tensor([[0, 0],
        [0, 1],
        [0, 2],
        [1, 0],
        [1, 1],
        [1, 2],
        [2, 0],
        [2, 1],
        [2, 2]])

Actually, you are offsetting each category, so per your convention L* will look like:

meanings_for_resnick tensor([[0, 3],
        [0, 4],
        [0, 5],
        [1, 3],
        [1, 4],
        [1, 5],
        [2, 3],
        [2, 4],
        [2, 5]])
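
(For reference, here's roughly how I'm constructing these meanings on my side. This is just my own sketch, not code from this repo, and the variable names are my own:)

import itertools
import torch

num_categories = 2
num_att_vals = 3

# every combination of attribute values, one column per category
meanings = torch.tensor(list(itertools.product(range(num_att_vals), repeat=num_categories)))

# offset each category so values from different categories never collide,
# i.e. category i uses the value range [i * num_att_vals, (i + 1) * num_att_vals)
offsets = torch.arange(num_categories) * num_att_vals
meanings_for_resnick = meanings + offsets
print('meanings_for_resnick', meanings_for_resnick)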

The entropy of each C[i] will be (and I'm using base 2 entropy here, since I like thinking of it as the number of bits required):

import math

print('H_Ci', -3 * (1 / 3) * math.log(1 / 3) / math.log(2))

which gives

H_Ci 1.5849625007211563
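
(Equivalently, computing it empirically from the meanings above, using the meanings_for_resnick tensor from my sketch:)

import numpy as np

# empirical entropy of the first category, in bits
values, counts = np.unique(meanings_for_resnick[:, 0].numpy(), return_counts=True)
p = counts / counts.sum()
print('H_Ci', -(p * np.log2(p)).sum())  # ~1.585 bits, i.e. log2(3)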

We generate a random holistic language, where every utterance is different, e.g.:

tensor([[1, 1, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
        [1, 0, 1, 0],
        [1, 0, 0, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 1],
        [1, 1, 0, 0]])
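
(Again just for reference, the way I generate this on my side is a sketch along these lines; the seed is arbitrary:)

import torch

num_meanings = 9
utt_len = 4
vocab_size = 2

# keep sampling random utterances until every row is distinct (a holistic language)
torch.manual_seed(123)
while True:
    _utts = torch.randint(vocab_size, (num_meanings, utt_len))
    if torch.unique(_utts, dim=0).shape[0] == num_meanings:
        break
print(_utts)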

Now we pass L* and L through the get_residual_entropy method from this repo. I've modified it in three ways:

  • it prints indx, so we can see the partition
  • it uses base 2 entropy (since, as stated above, I find it easier to reason about)
  • it takes num_att_vals as a parameter, replacing the hard-coded 10
resent = get_residual_entropy(target=meanings_for_resnick.numpy(), lang=_utts.numpy(), num_att_vals=3)
print('resent %.4f' % resent)

The result is:

indx [[0, 1, 2, 3], []]
resent 0.7925

So, the partition comprises one sub-sequence containing the entire z, and one empty sub-sequence.

Let's suppose that indx is the partition which results from minimizing over all partitions. I don't quite understand how that minimization happens (maybe by using some kind of greedy approach?), but let's take it as given for now.
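
(For comparison, this is what I would naively do to minimize over all partitions, by brute force. This is my own sketch, not the repo's code, and I'm assuming the score for a partition is the mean over categories of H(C[i] | z restricted to that category's sub-sequence), without the H(C[i]) denominator:)

import itertools
import numpy as np

def cond_entropy_bits(categories, utts_subset):
    # H(C | z_p) in bits, estimated from empirical counts
    n = len(categories)
    groups = {}
    for c, z in zip(categories, map(tuple, utts_subset)):
        groups.setdefault(z, []).append(c)
    h = 0.0
    for cs in groups.values():
        p_z = len(cs) / n
        _, counts = np.unique(cs, return_counts=True)
        p_c = counts / counts.sum()
        h += p_z * -(p_c * np.log2(p_c)).sum()
    return h

def brute_force_residual_entropy(target, lang):
    num_cats = target.shape[1]
    num_pos = lang.shape[1]
    best_score, best_partition = None, None
    # assign every utterance position to exactly one category's sub-sequence
    for assignment in itertools.product(range(num_cats), repeat=num_pos):
        partition = [[p for p in range(num_pos) if assignment[p] == i] for i in range(num_cats)]
        score = sum(cond_entropy_bits(target[:, i], lang[:, partition[i]])
                    for i in range(num_cats)) / num_cats
        if best_score is None or score < best_score:
            best_score, best_partition = score, partition
    return best_score, best_partition

On the example above I'd call this as brute_force_residual_entropy(meanings_for_resnick.numpy(), _utts.numpy()), and I'd expect it to land on something like the 0.7925 above, though again without the normalization I ask about further down.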

Looking at the formula in 5.1, we need to calculate the conditional entropy H(C[i] | z_p) twice: once for the empty sub-sequence, and once for the sub-sequence containing all positions.

  • the conditional entropy of C[i] given the entire utterance will simply be 0, since all meanings in L* are unique, and all utterances in L are unique
    • so we can ignore this one
  • the conditional entropy of C[i] given no utterance positions at all will, I guess, simply be the entropy of C[i], which we calculated above to be 1.58496
  • then, if we ignore the denominator H(C[i]), we have:

1 / 2 * (1.58496 + 0)
= 0.7925

... which is what the get_residual_entropy method returns.
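
(As a tiny sanity check of my own arithmetic:)

import math

H_Ci = math.log(3, 2)                 # entropy of one category, ~1.58496 bits
print('%.4f' % (0.5 * (H_Ci + 0.0)))  # 0.7925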

However, this is without the denominator H(C[i]). To be honest, I'm not sure what the difference is between H(C[i]) with respect to L (in the numerator) and H(C[i]) with respect to L* (in the denominator). I'm assuming they are the same, since L and L* have the same length, and every row of both L and L* is unique. If we add the denominator back in under that assumption, this gives:

1 / 2 * (1.58496 / 1.58496 + 0)
= 1/2

... which is different from what the get_residual_entropy method returns?
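
(Or, numerically, under my assumption that the numerator and denominator entropies are equal:)

import math

H_Ci_wrt_L = math.log(3, 2)
H_Ci_wrt_Lstar = math.log(3, 2)  # my assumption: same as the numerator
print(0.5 * (H_Ci_wrt_L / H_Ci_wrt_Lstar + 0.0))  # 0.5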

Don't suppose... do you have a moment to go over with me what the difference is between H(C[i]) with respect to L and H(C[i]) with respect to L*? It looks like H(C[i]) with respect to L* must somehow evaluate to 1?

Also, if you have a moment, I'm really interested in knowing how the code manages to minimize over all possible partitions?

And lastly, what do I need to modify in the code in order to measure residual entropy for languages L with a vocab size greater than 2? (This is my actual goal, in fact :) so if there is a way of short-circuiting my having to understand the code, and just running this, that would be great too :) )
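
(For what it's worth, my fallback plan, if the existing implementation really is specific to a binary vocab, would be to reuse my brute-force sketch above, since nothing in it assumes a particular vocab size, e.g.:)

import torch

# a hypothetical language with vocab size 3 and utterance length 3, for the same 9 meanings
torch.manual_seed(0)
while True:
    _utts3 = torch.randint(3, (9, 3))
    if torch.unique(_utts3, dim=0).shape[0] == 9:
        break

score, partition = brute_force_residual_entropy(meanings_for_resnick.numpy(), _utts3.numpy())
print(score, partition)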
