---
Hi Carlo, I just had another look at it. Nice and clear example you came up with.

In the T-maze PR I built a function that checks whether the entropy of the tensor depends on the variational beliefs. If it does not, the whole setup simplifies drastically, and we can just use the tensor to calculate the prior. Using these implementations yields a posterior over actions [open_lid, close_lid]: p = [0.7499999887495005, 0.25000001125049953]

```julia
using RxInfer, Tullio, LinearAlgebra, LogExpFunctions, PrettyPrinting
import Pipe: @pipe as @p

@enum Content C1=1 C2 C3
@enum Lid Open=1 Closed
@enum Observation O1=1 O2 O3 # Impaired
@enum Action OpenLid=1 Wait

uniform(data::Type{T}) where {T <: Enum} = @p data |> instances |> length |> ones |> normalize(_, 1)
onehot(elem::T) where {T <: Enum} = @p T |> instances |> length |> zeros |> setindex!(_, 1, Int(elem))

# Observation_t | Content, Lid_t
# Four-observation variant with a dedicated "impaired" outcome:
# observation_tensor = [1 0 0; 0 1 0; 0 0 1; 0 0 0 ;;;
#                       0 0 0; 0 0 0; 0 0 0; 1 1 1]
# Identity matrix for slice 1 (lid open), uniform for slice 2 (lid closed)
observation_tensor = [1 0 0; 0 1 0; 0 0 1 ;;;
                      1/3 1/3 1/3; 1/3 1/3 1/3; 1/3 1/3 1/3]

# Lid_t | Lid_{t-1}, Action
transition_tensor = [1 1; 0 0 ;;;
                     1 0; 0 1]

struct LogMeta end

@model function model(observation)
    content ~ Categorical(uniform(Content))
    lid_0 ~ Categorical(uniform(Lid))
    observation ~ DiscreteTransition(content, observation_tensor, lid_0)
    action ~ Categorical(uniform(Action))
    lid_1 ~ DiscreteTransition(lid_0, transition_tensor, action) # where {meta = LogMeta()}
    lid_1 ~ Categorical(calc_epis_prior_vec(observation_tensor))
    # you can include or exclude these nodes
    # observation_1 ~ DiscreteTransition(content, observation_tensor, lid_1) where {meta = LogMeta()}
    # observation_1 ~ Categorical(uniform(Observation))
end

@initialization function init()
    μ(content) = Categorical(uniform(Content))
end

@marginalrule DiscreteTransition(:out_in_T1) (m_out::Categorical,
                                              m_in::Categorical,
                                              m_T1::Categorical,
                                              q_a::PointMass{<:AbstractArray{T,3}},
                                              meta::LogMeta) where {T} = begin
    @tullio result[a, b, c] := q_a.point[a, b, c] * probvec(m_out)[a] * probvec(m_in)[b] * probvec(m_T1)[c]
    normalize!(result, 1)
    @show result
    marginal = Contingency(result, Val(false))
    return marginal
end

result = infer(
    model = model(),
    data = (observation = [1/3, 1/3, 1/3],),
    initialization = init(),
    options = (force_marginal_computation = true,),
    iterations = 1
)

@p result.posteriors |> Dict(k => last(v) for (k, v) in _) |> pprintln
```

Here I recycled the code from the T-maze PR to calculate the epistemic prior (you could also just calculate it once and insert it in the Categorical if you want the model to be lighter):

```julia
function compute_invariant_entropy_prior(
    A::AbstractArray{T, 3},
    dim_keep::Int,
    dim_reduce::Int;
    atol = 1e-8,
    err_msg = "Entropy of tensor depends on hidden state"
) where T <: Real
    # 1. Calculate the entropy map.
    # We collapse dim 1 (the probability-distribution axis);
    # the resulting H has the remaining 2 dimensions.
    H = map(eachslice(A, dims = (2, 3))) do slice
        # Σ p log p (i.e. minus the entropy), handling zeros
        sum(p -> p > 0 ? p * log(p) : zero(T), slice)
    end
    # 2. Check invariance and extract.
    # We want to ensure H is uniform along `dim_reduce`.
    # Since H is 2D, we need to map the original dims (2, 3) to (1, 2).
    h_axis_to_check = (dim_reduce == 2) ? 1 : 2
    # Check that all slices along the invariant dimension are approximately equal
    first_slice = selectdim(H, h_axis_to_check, 1)
    for i in 2:size(H, h_axis_to_check)
        if !isapprox(first_slice, selectdim(H, h_axis_to_check, i); atol = atol)
            throw(ArgumentError(err_msg))
        end
    end
    # 3. Return the softmax of the invariant negative-entropy vector
    return softmax(first_slice)
end

function calc_epis_prior_vec(tensor::AbstractArray{T, 3}; atol = 1e-8) where T <: Real
    return compute_invariant_entropy_prior(
        tensor,
        3, # keep dimension (Lid / k)
        2; # marginalize dimension (Content / j)
        atol = atol
    )
end
```
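As a sanity check, the reported action posterior can be reproduced by hand: the softmax of the per-slice Σ p log p values is exactly [0.75, 0.25]. A minimal sketch (my own, using the two slices of `observation_tensor` above):

```julia
using LinearAlgebra

# Σ p log p of one column (minus the entropy); this is what the
# invariant-entropy prior feeds into the softmax
neg_entropy(p) = sum(x -> x > 0 ? x * log(x) : 0.0, p)

open_slice   = Matrix{Float64}(I, 3, 3)  # lid open: observation reveals content
closed_slice = fill(1/3, 3, 3)           # lid closed: observation is uniform

h = [neg_entropy(open_slice[:, 1]), neg_entropy(closed_slice[:, 1])]
prior = exp.(h) ./ sum(exp.(h))          # softmax over [Open, Closed]
# prior ≈ [0.75, 0.25], matching the inferred action posterior
```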
---
Hi @skoghoern, thank you for your response, it gave me a lot to think about. For starters, you are right on your first point: if the possible observations were just `@enum Observation O1=1 O2 O3`, then the calculation goes through. But that feels somewhat like a Pyrrhic victory to me: it works because I crafted an observation that is isomorphic to the state I am interested in, and then I ask to minimize the entropy of this observation. So I see two possible alternatives:

So a more direct way of posing the question is: could the priors compute the mutual information instead of the conditional entropy as written in the paper? Would this have any drawback? The example was crafted to illustrate this tension.

[0] The joint marginal we collected:

[1] "Epistemic value is the expected information gain (i.e., mutual information) afforded to hidden states by future outcomes and vice-versa", from Active inference: a process theory
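To make the tension concrete, here is a small sketch (mine, not from the paper) comparing the two candidate epistemic terms on the four-observation likelihood with a dedicated "impaired" outcome (the commented-out tensor variant above), assuming a uniform prior over content: the conditional entropy H(O|Content) is zero for both lid states, while the mutual information I(O; Content) singles out the open lid.

```julia
# lik[o, c] = p(o | content = c) for a fixed lid state
function cond_entropy_and_mi(lik::AbstractMatrix)
    nc = size(lik, 2)
    pc = fill(1 / nc, nc)                 # uniform prior over content
    po = vec(sum(lik .* pc', dims = 2))   # marginal p(o)
    H(p) = -sum(x -> x > 0 ? x * log(x) : 0.0, p)
    Hcond = sum(pc[c] * H(lik[:, c]) for c in 1:nc)  # H(O | Content)
    return Hcond, H(po) - Hcond                      # H(O|C), I(O; C)
end

# Four observations: O1..O3 mirror the content, O4 = "impaired"
open_lik   = [1.0 0 0; 0 1 0; 0 0 1; 0 0 0]  # lid open
closed_lik = [0.0 0 0; 0 0 0; 0 0 0; 1 1 1]  # lid closed

open_H,   open_I   = cond_entropy_and_mi(open_lik)    # 0, log(3)
closed_H, closed_I = cond_entropy_and_mi(closed_lik)  # 0, 0
```

So a prior built from H(O|Content) alone cannot separate the two lid states here, while one built from I(O; Content) can.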
---
Here is my derivation (simply adapted from the great paper):
factorizing the joint
The proof we want to make is then just as in the paper (I'll keep their numbering to make the comparison easier):
Now in (17) comes the time to infuse our different version of the EFE with preferences over observations.
Following the paper: "Here we can replace the general
Here we recognize:
and
Which, when substituted into (18b), together with the definitions of
So we get the same epistemic prior over actions, but a different epistemic prior over observations.
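A textbook identity that may help when comparing the conditional-entropy prior with the mutual-information variant discussed above (my addition, not taken from the paper's numbered equations):

```latex
% Mutual information as marginal minus conditional entropy, in both
% directions; the symmetry is what makes "information gain afforded to
% hidden states by outcomes" readable as its converse as well:
I(s; o) \;=\; H(o) - H(o \mid s) \;=\; H(s) - H(s \mid o)
```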
---
Hi all, I'm still thinking about the EFEasVFE paper, and another question has come up. After having done the stochastic maze, I started looking at how the empirical priors are constructed in general, using the minigrid example as a reference.

I have created a simpler scenario, but I cannot reproduce the curious, information-seeking behavior I would expect.

In front of the agent there is a box, which might contain one of three items. The box has a lid that might be open or closed. When closed, the lid obstructs the perception of the content. The agent observes the closed lid and then has to decide what to do for just one action.

Here is a complete model to make this exchange more precise; since my question concerns the joint distributions, I added a `LogMeta` and a corresponding `@marginalrule` just to intercept the joint marginals. Running the model yields the following joint observation marginals:
And the joint transition marginals are:
If I understand the argument used in the paper correctly, the structure of the model should generate three empirical priors: one attached to the action, and one for each state variable that is included in the observation. So, using shorthand for brevity (the conditional entropies are calculated per slice):
Now, here are the calculations:
All of the conditional entropies are zero, so the priors do not guide the agent toward opening the box.
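The vanishing entropies can be checked in a few lines. This is my own sketch, assuming the four-observation likelihood with a dedicated "impaired" outcome together with the lid transition tensor: every column of every slice is deterministic, so each per-slice conditional entropy is zero.

```julia
H(p) = -sum(x -> x > 0 ? x * log(x) : 0.0, p)  # entropy of one column

# Observation_t | Content (columns), one slice per lid state
obs_open   = [1.0 0 0; 0 1 0; 0 0 1; 0 0 0]  # lid open: O mirrors content
obs_closed = [0.0 0 0; 0 0 0; 0 0 0; 1 1 1]  # lid closed: always "impaired"
# Lid_t | Lid_{t-1} (columns), one slice per action
trans_open = [1.0 1; 0 0]                     # action: open the lid
trans_wait = [1.0 0; 0 1]                     # action: wait

slices = [obs_open, obs_closed, trans_open, trans_wait]
max_H = maximum(H(col) for S in slices for col in eachcol(S))
# every column is deterministic, so all conditional entropies vanish
```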
There are other possible priors that could distinguish the two actions, for example:
but this choice seems unprincipled. This brings me to my question: is there something I am overlooking? Can I write a generic prior that follows the structure of the paper without directly saying that I want to minimize the entropy of the `content`?