
Could do with some intuition into how get_residual_entropy method works :) #3

@hughperkins

Description

Hi, I've been trying to understand the get_residual_entropy method, and can't quite figure it out :) Awesome paper by the way :)

Really, my underlying need is to be able to use it on input where the language L has a vocab size larger than 2. I'm unclear whether that's possible in the existing implementation? I'm sort of guessing that it's using a greedy approach to minimize over all partitions, and that the greedy approach only works for a vocab size of 2 in L?

One question I'm wondering about is: is the current implementation actually normalizing, as per the formula in section 5.1 of the paper? My entropy skills are not that great, so I may be misunderstanding. Do you mind if I walk through a simple case study, and you can tell me where I'm going wrong?

Let's suppose that we have num categories = 2, and each category C[i] can take 3 possible values. The utterances will have a vocab size of 2 (as per the paper), and a total length of 4 bits.

Then, L* will comprise the following:

tensor([[0, 0],
        [0, 1],
        [0, 2],
        [1, 0],
        [1, 1],
        [1, 2],
        [2, 0],
        [2, 1],
        [2, 2]])

Actually, you are offsetting each category, so per your convention L* will look like:

meanings_for_resnick tensor([[0, 3],
        [0, 4],
        [0, 5],
        [1, 3],
        [1, 4],
        [1, 5],
        [2, 3],
        [2, 4],
        [2, 5]])
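
(For reference, here's roughly how I'm constructing these meanings on my side. This is just my own sketch, not code from this repo, and the variable names are my own:)

import itertools
import torch

num_categories = 2
num_att_vals = 3

# every combination of attribute values, one column per category
meanings = torch.tensor(list(itertools.product(range(num_att_vals), repeat=num_categories)))

# offset each category so values from different categories never collide,
# i.e. category i uses the value range [i * num_att_vals, (i + 1) * num_att_vals)
offsets = torch.arange(num_categories) * num_att_vals
meanings_for_resnick = meanings + offsets
print('meanings_for_resnick', meanings_for_resnick)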

The entropy of each C[i] will be (and I'm using base 2 entropy here, since I like thinking of it as the number of bits required):

import math

print('H_Ci', -3 * (1 / 3) * math.log(1 / 3) / math.log(2))

which gives

H_Ci 1.5849625007211563
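
(Equivalently, computing it empirically from the meanings above, using the meanings_for_resnick tensor from my sketch:)

import numpy as np

# empirical entropy of the first category, in bits
values, counts = np.unique(meanings_for_resnick[:, 0].numpy(), return_counts=True)
p = counts / counts.sum()
print('H_Ci', -(p * np.log2(p)).sum())  # ~1.585 bits, i.e. log2(3)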

We generate a random holistic language, where every utterance is different, e.g.:

tensor([[1, 1, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
        [1, 0, 1, 0],
        [1, 0, 0, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 1],
        [1, 1, 0, 0]])
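
(Again just for reference, the way I generate this on my side is a sketch along these lines; the seed is arbitrary:)

import torch

num_meanings = 9
utt_len = 4
vocab_size = 2

# keep sampling random utterances until every row is distinct (a holistic language)
torch.manual_seed(123)
while True:
    _utts = torch.randint(vocab_size, (num_meanings, utt_len))
    if torch.unique(_utts, dim=0).shape[0] == num_meanings:
        break
print(_utts)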

Now we pass L* and L through the get_residual_entropy method from this repo. I've modified it in three ways:

  • it prints indx, so we can see the partition
  • it uses base 2 entropy (since, as stated above, I find it easier to reason about)
  • it takes num_att_vals as a parameter, replacing the hard-coded 10
resent = get_residual_entropy(target=meanings_for_resnick.numpy(), lang=_utts.numpy(), num_att_vals=3)
print('resent %.4f' % resent)

The result is:

indx [[0, 1, 2, 3], []]
resent 0.7925

So, the partition comprises one sub-sequence containing the entire z, and one empty sub-sequence.

Let's suppose that indx is the partition which results from minimizing over all partitions. I don't quite understand how that minimization happens (maybe by using some kind of greedy approach?), but let's take it as given for now.
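
(For comparison, this is what I would naively do to minimize over all partitions, by brute force. This is my own sketch, not the repo's code, and I'm assuming the score for a partition is the mean over categories of H(C[i] | z restricted to that category's sub-sequence), without the H(C[i]) denominator:)

import itertools
import numpy as np

def cond_entropy_bits(categories, utts_subset):
    # H(C | z_p) in bits, estimated from empirical counts
    n = len(categories)
    groups = {}
    for c, z in zip(categories, map(tuple, utts_subset)):
        groups.setdefault(z, []).append(c)
    h = 0.0
    for cs in groups.values():
        p_z = len(cs) / n
        _, counts = np.unique(cs, return_counts=True)
        p_c = counts / counts.sum()
        h += p_z * -(p_c * np.log2(p_c)).sum()
    return h

def brute_force_residual_entropy(target, lang):
    num_cats = target.shape[1]
    num_pos = lang.shape[1]
    best_score, best_partition = None, None
    # assign every utterance position to exactly one category's sub-sequence
    for assignment in itertools.product(range(num_cats), repeat=num_pos):
        partition = [[p for p in range(num_pos) if assignment[p] == i] for i in range(num_cats)]
        score = sum(cond_entropy_bits(target[:, i], lang[:, partition[i]])
                    for i in range(num_cats)) / num_cats
        if best_score is None or score < best_score:
            best_score, best_partition = score, partition
    return best_score, best_partition

On the example above I'd call this as brute_force_residual_entropy(meanings_for_resnick.numpy(), _utts.numpy()), and I'd expect it to land on something like the 0.7925 above, though again without the normalization I ask about further down.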

Looking at the formula in 5.1, we need to calculate the conditional entropy H(C[i] | z_p) twice: once for the empty sub-sequence, and once for the sub-sequence containing all positions.

  • the conditional entropy of C[i] given the entire utterance will simply be 0, since all meanings in L* are unique, and all utterances in L are unique
    • so we can ignore this one
  • the conditional entropy of C[i] given no utterance positions at all will, I guess, simply be the entropy of C[i], which we calculated above to be 1.58496
  • then, if we ignore the denominator H(C[i]), we have:

1 / 2 * (1.58496 + 0)
= 0.7925

... which is what the get_residual_entropy method returns.
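
(As a tiny sanity check of my own arithmetic:)

import math

H_Ci = math.log(3, 2)                 # entropy of one category, ~1.58496 bits
print('%.4f' % (0.5 * (H_Ci + 0.0)))  # 0.7925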

However, this is without the denominator H(C[i]). To be honest, I'm not sure what the difference is between H(C[i]) with respect to L (in the numerator) and H(C[i]) with respect to L* (in the denominator). I'm assuming they are the same, since L and L* have the same length, and every row of both L and L* is unique. If we add the denominator back in under that assumption, this gives:

1 / 2 * (1.58496 / 1.58496 + 0)
= 1/2

... which is different from what the get_residual_entropy method returns?
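
(Or, numerically, under my assumption that the numerator and denominator entropies are equal:)

import math

H_Ci_wrt_L = math.log(3, 2)
H_Ci_wrt_Lstar = math.log(3, 2)  # my assumption: same as the numerator
print(0.5 * (H_Ci_wrt_L / H_Ci_wrt_Lstar + 0.0))  # 0.5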

Don't suppose... do you have a moment to go over with me what the difference is between H(C[i]) with respect to L and H(C[i]) with respect to L*? It looks like H(C[i]) with respect to L* must somehow evaluate to 1?

Also, if you have a moment, I'm really interested in knowing how the code manages to minimize over all possible partitions?

And lastly, what do I need to modify in the code in order to measure residual entropy for languages L with a vocab size greater than 2? (This is my actual goal, in fact :) so if there is a way of short-circuiting my having to understand the code, and just running this, that would be great too :) )
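
(For what it's worth, my fallback plan, if the existing implementation really is specific to a binary vocab, would be to reuse my brute-force sketch above, since nothing in it assumes a particular vocab size, e.g.:)

import torch

# a hypothetical language with vocab size 3 and utterance length 3, for the same 9 meanings
torch.manual_seed(0)
while True:
    _utts3 = torch.randint(3, (9, 3))
    if torch.unique(_utts3, dim=0).shape[0] == 9:
        break

score, partition = brute_force_residual_entropy(meanings_for_resnick.numpy(), _utts3.numpy())
print(score, partition)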
