Hello @YodaEmbedding,
I was reading your answer in this previously posted issue:
> For a single fixed encoding distribution $p$, the average rate cost for encoding a single symbol that is drawn from the same distribution $p$ is:
>
> $$R = \sum_t - p(t) \, \log p(t)$$
>
> But this is not what we're doing. What we're actually interested in is the cross-entropy, that is, the average rate cost for encoding a single symbol drawn from the true distribution $\hat{p}$:
>
> $$R = \sum_t - \hat{p}(t) \, \log p(t)$$
>
> To be consistent with our notation above, we should also sprinkle in some $i$s:
>
> $$R_i = \sum_t - \hat{p}_i(t) \, \log p_i(t)$$
>
> In our case, we know exactly what $\hat{p}$ is...
>
> $$\hat{p}_i(t) = \delta[t - \hat{y}_i] = \begin{cases} 1 & \text{if } t = \hat{y}_i \\ 0 & \text{otherwise} \end{cases}$$
>
> If we plug this into the earlier equation, the rate cost for encoding the $i$-th element becomes:
>
> $$R_i = -\log p_i(\hat{y}_i)$$

_Originally posted by @YodaEmbedding in #314_
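To make sure I follow the final step of this derivation, I checked it numerically. Below is a minimal sketch in plain NumPy; the encoding distribution `p` and the observed symbol index `y_hat` are made-up values for illustration:

```python
import numpy as np

# Made-up encoding distribution p_i over 5 possible symbol values.
p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

# Suppose the symbol actually observed is t = 2 (i.e. y_hat_i = 2).
y_hat = 2

# "True" distribution: a delta that puts all its mass on the observed symbol.
p_true = np.zeros_like(p)
p_true[y_hat] = 1.0

# Cross-entropy: R_i = -sum_t p_true(t) * log p_i(t).
# Terms where p_true(t) == 0 contribute nothing, so mask them out.
cross_entropy = -np.sum(np.where(p_true > 0, p_true * np.log(p), 0.0))

# Direct form from the derivation: R_i = -log p_i(y_hat_i).
direct = -np.log(p[y_hat])

print(cross_entropy, direct)  # both ≈ 0.916 nats
assert np.isclose(cross_entropy, direct)
```

The two numbers agree, so the algebra itself makes sense to me.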
I’m trying to better understand the reasoning behind this formulation, and I have two main questions:
- Why is the true distribution $\hat{p}_i$ considered different from the encoding distribution $p_i$? What does it mean to refer to a “true” distribution in this context?
- Why is $\hat{p}_i$ modeled as a delta function? This seems to imply that there is no uncertainty at all: only one possible symbol, $\hat{y}_i$, with probability 1. If that’s the case, what motivates using a probabilistic framework at all?
Thanks in advance for any clarification!