Let $\mathcal{V}$ be a finite token vocabulary and $\mathcal{S}$ the space of finite token sequences over $\mathcal{V}$. Let $\theta$ denote the parameters of the LLM and $\phi$ the parameters of the GFlowNet.
We define four fundamental probability measures on $\mathcal{S}$:
- Environmental Prior: $p_{\text{env}}: \mathcal{S} \to [0,1]$ represents observational frequency in the training/environment data.
  - Empirically estimated from the corpus: $p_{\text{env}}(s) = \frac{\text{count}(s)}{\sum_{s' \in \mathcal{S}} \text{count}(s')}$
- Internal Prior: $p_{\text{internal}}: \mathcal{S} \to [0,1]$ represents structural necessity within the belief system.
  - High for logically/causally necessary sequences
  - Low for contingent/arbitrary sequences
- Unified Prior: $p_{\text{prior}}: \mathcal{S} \to [0,1]$,
  $$p_{\text{prior}}(s) = \frac{p_{\text{env}}(s) \cdot p_{\text{internal}}(s)}{Z}$$
  where $Z = \sum_{s \in \mathcal{S}} p_{\text{env}}(s) \cdot p_{\text{internal}}(s)$ is the normalization constant.
- Adversarial Distribution: $p_{\text{adv}}: \mathcal{S} \to [0,1]$, learned by the GFlowNet to maximize divergence from the LLM.
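The factorized prior above can be sketched numerically. This is a minimal illustration: the corpus and the internal-necessity scores below are invented toy values, not values from the paper.

```python
from collections import Counter

def unified_prior(corpus, p_internal):
    """p_prior(s) ∝ p_env(s) · p_internal(s), normalized by Z."""
    counts = Counter(corpus)
    total = sum(counts.values())
    p_env = {s: c / total for s, c in counts.items()}      # observational frequency
    unnorm = {s: p_env[s] * p_internal.get(s, 0.0) for s in p_env}
    Z = sum(unnorm.values())                               # normalization constant
    return {s: w / Z for s, w in unnorm.items()}

corpus = ["a b", "a b", "c d"]              # toy corpus: "a b" twice, "c d" once
p_internal = {"a b": 0.9, "c d": 0.3}       # assumed structural-necessity scores
prior = unified_prior(corpus, p_internal)
```

Note how the unified prior sharpens the environmental frequencies: a sequence that is both frequent and structurally necessary dominates the distribution.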
The LLM defines a conditional distribution $p_{\text{LLM}}(y \mid x; \theta)$, where $x$ is an input prefix and $y$ an output sequence, with the autoregressive decomposition:
$$p_{\text{LLM}}(y \mid x; \theta) = \prod_{t=1}^{|y|} p_{\text{LLM}}(y_t \mid x, y_{<t}; \theta)$$
The GFlowNet learns a policy $\pi_\phi(a \mid s)$ over next-token actions $a$ from partial-sequence states $s$, trained with the trajectory-balance (flow matching) objective: $$\mathcal{L}_{\text{TB}}(\phi) = \mathbb{E}_{\tau \sim \pi_\phi}\left[\left(\log Z_\phi + \log P_F(\tau) - \log R(\tau) - \log P_B(\tau)\right)^2\right]$$
where:
- $P_F(\tau) = \prod_{t} \pi_\phi(a_t|s_t)$ is the forward probability
- $P_B(\tau)$ is the backward probability (uniform in our case)
- $R(\tau)$ is the reward function
- $Z_\phi$ is the learned partition function
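A minimal sketch of the trajectory-balance residual for a single trajectory. The probabilities below are toy numbers, not outputs of a trained policy:

```python
import math

def tb_loss(log_Z, log_pf, reward, log_pb=0.0):
    """Squared trajectory-balance residual for one trajectory:
    (log Z_φ + log P_F(τ) - log R(τ) - log P_B(τ))²."""
    return (log_Z + log_pf - math.log(reward) - log_pb) ** 2

# Forward probabilities π_φ(a_t|s_t) along one toy trajectory:
log_pf = sum(math.log(p) for p in [0.5, 0.4])   # P_F(τ) = 0.2
# With Z_φ = 1 (log Z = 0), a uniform backward policy (log P_B = 0),
# and R(τ) = P_F(τ), the trajectory is perfectly balanced:
loss = tb_loss(log_Z=0.0, log_pf=log_pf, reward=0.2)
```

When the reward deviates from the forward flow, the residual grows, pushing $\phi$ and $Z_\phi$ toward a sampler whose trajectory probabilities are proportional to reward.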
For any prefix $x \in \mathcal{S}$ and candidate next token $v$, Bayes' rule gives the posterior
$$p(x|v) = \frac{p(v|x)\, p(x)}{p(v)}$$
where:
- $p(v|x) = p_{\text{LLM}}(v|x; \theta)$ (next-token probability)
- $p(x) = p_{\text{prior}}(x)$ (prior over prefixes)
- $p(v) = \sum_{x' \in \mathcal{S}} p(v|x') \cdot p(x')$ (marginal)
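The posterior computation can be checked on a toy two-prefix example; all probabilities below are assumed values for illustration:

```python
def bayes_posterior(p_v_given_x, p_x, x, v):
    """p(x|v) = p(v|x) p(x) / p(v) with the marginal summed over prefixes."""
    # marginal: p(v) = Σ_{x'} p(v|x') p(x')
    p_v = sum(p_v_given_x[(v, xp)] * p_x[xp] for xp in p_x)
    return p_v_given_x[(v, x)] * p_x[x] / p_v

p_x = {"x1": 0.7, "x2": 0.3}                         # assumed prior over prefixes
p_v_given_x = {("v", "x1"): 0.2, ("v", "x2"): 0.6}   # assumed next-token probs
post = bayes_posterior(p_v_given_x, p_x, "x1", "v")  # = 0.14 / 0.32
```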
The Bayesian divergence $D_{\text{Bayes}}(v, x)$ for a token $v$ given prefix $x$ quantifies the discrepancy between the LLM's prediction $p_{\text{LLM}}(v \mid x; \theta)$ and the Bayes-consistent posterior above; the sequence-level divergence $D_{\text{Bayes}}(s)$ aggregates the per-token terms along $s$.
For a sequence $s$ decomposed into a question (Action) and an answer (Info), the joint probability factorizes as
$$P(s) = P(\text{Action}) \cdot P(\text{Info}|\text{Action})$$
where:
- $P(\text{Action})$ = probability of generating the prefix/question
- $P(\text{Info}|\text{Action})$ = probability of generating the answer given the question
- $P(s)$ = joint probability of the entire sequence
The reward for a trajectory terminating in sequence $s$ is
$$R(s) = R_{\text{base}}(s) + \lambda_1 \cdot D_{\text{Bayes}}(s) + \lambda_2 \cdot \mathbb{1}[\text{causal-order}(s)]$$
where:
- $R_{\text{base}}(s)$ = base environment reward (e.g., correctness for math)
- $D_{\text{Bayes}}(s)$ = Bayesian divergence from the LLM
- $\mathbb{1}[\text{causal-order}(s)]$ = indicator for preserving causal structure
- $\lambda_1, \lambda_2 > 0$ are hyperparameters
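The reward composition is a direct linear combination and can be sketched in a few lines; the hyperparameter values below are placeholders, not tuned settings:

```python
def trajectory_reward(r_base, d_bayes, causal_ok, lam1=0.5, lam2=0.1):
    """R(s) = R_base(s) + λ1·D_Bayes(s) + λ2·1[causal-order(s)]."""
    return r_base + lam1 * d_bayes + lam2 * (1.0 if causal_ok else 0.0)

# e.g. correct answer (r_base=1.0), divergence 0.4, causal order preserved:
r = trajectory_reward(1.0, 0.4, causal_ok=True)   # 1.0 + 0.2 + 0.1
```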
The LLM minimizes the combined objective
$$\mathcal{L}_{\text{LLM}}(\theta) = \mathcal{L}_{\text{CE}} + \alpha \cdot \mathcal{L}_{\text{div}} + \beta \cdot \mathcal{L}_{\text{causal}}$$
where:
- $\mathcal{L}_{\text{CE}}$ = cross-entropy loss on training data
- $\mathcal{L}_{\text{div}}$ = penalty for Bayesian divergence
- $\mathcal{L}_{\text{causal}}$ = penalty for violating causal coherence
The system seeks a Nash equilibrium $(\theta^*, \phi^*)$ where:
- $\theta^* \in \arg\min_\theta \mathcal{L}_{\text{LLM}}(\theta, \phi^*)$
- $\phi^* \in \arg\max_\phi \mathbb{E}_{\tau \sim \pi_\phi}[R(\tau; \theta^*)]$
The training alternates between:
- Phase 1 (GFlowNet update): sample trajectories $\tau \sim \pi_\phi$, score them with $R(\tau)$, and update $\phi$ on the trajectory-balance loss.
- Phase 2 (LLM update): train $\theta$ on a mixture of generated and real data under the combined LLM objective.
Convergence is measured by:
- Divergence stability: $\text{Var}[D_{\text{Bayes}}] < \epsilon_1$
- Causal consistency: $\mathbb{E}[C_{\text{causal}}] < \epsilon_2$
- Performance plateau: $|\mathcal{L}_{t} - \mathcal{L}_{t-k}| < \epsilon_3$
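The three stopping criteria can be combined into a single check. The thresholds and metric histories below are illustrative assumptions:

```python
from statistics import mean, pvariance

def converged(d_bayes, c_causal, losses, k=5, eps1=1e-3, eps2=1e-2, eps3=1e-3):
    """True when all three convergence criteria hold."""
    return (pvariance(d_bayes) < eps1                      # divergence stability
            and mean(c_causal) < eps2                      # causal consistency
            and abs(losses[-1] - losses[-1 - k]) < eps3)   # performance plateau
```

A flat loss curve alone is not enough: an unstable divergence or high causal-violation rate keeps training going.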
Let $\mathcal{M} \subset \mathbb{R}^d$ be a manifold of continuous sequence representations, with:
- Points: embedded sequence representations $\varphi: \mathcal{S} \to \mathcal{M} \subset \mathbb{R}^d$
- Metric tensor: $g_{ij}(p) = \left\langle \frac{\partial \varphi}{\partial s_i}, \frac{\partial \varphi}{\partial s_j} \right\rangle$
The probability current on $\mathcal{M}$ is
$$J(p, t) = \rho(p, t)\, v(p, t)$$
where:
- $\rho(p, t)$ = probability density at point $p$ at time $t$
- $v(p, t)$ = velocity field induced by token transitions

It satisfies the continuity equation:
$$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho v) = 0$$
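The continuity equation can be illustrated with a discrete upwind step on a 1-D periodic grid; the constant velocity and toy density are simplifying assumptions. Total probability mass is conserved because the fluxes telescope around the ring:

```python
def advect_step(rho, v, dt, dx):
    """One upwind step of ∂ρ/∂t + ∂(ρv)/∂x = 0 on a periodic grid (v > 0)."""
    n = len(rho)
    flux = [r * v for r in rho]                       # J = ρ·v at each cell
    # ρ_i ← ρ_i - (dt/dx)·(J_i - J_{i-1}); index -1 wraps (periodic boundary)
    return [rho[i] - (dt / dx) * (flux[i] - flux[i - 1]) for i in range(n)]

rho = [0.1, 0.5, 0.3, 0.1]                            # initial density, mass 1
rho_next = advect_step(rho, v=1.0, dt=0.1, dx=1.0)
```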
The semantic distance between sequences $s_1, s_2$ is the geodesic distance induced by $g$:
$$d(s_1, s_2) = \inf_{\gamma} \int_0^1 \sqrt{g_{\gamma(t)}(\dot{\gamma}(t), \dot{\gamma}(t))}\, dt$$
where $\gamma$ ranges over smooth paths with $\gamma(0) = \varphi(s_1)$ and $\gamma(1) = \varphi(s_2)$.
The discrete token space maps to continuous representations via the embedding map $\varphi$, with learned parameters (the embedding matrix and positional encodings).
LLM (Seq2SeqTransformer):
- Encoder: $h^{\text{enc}} = \text{TransformerEncoder}(\text{Embed}(x) + \text{Pos}(x))$
- Decoder: $h^{\text{dec}} = \text{TransformerDecoder}(\text{Embed}(y) + \text{Pos}(y), h^{\text{enc}})$
- Output: $p(y_t|x, y_{<t}) = \text{Softmax}(W_{\text{out}} h^{\text{dec}}_t)$
GFlowNet (FlowNet):
- State encoding: $h = \text{TransformerEncoder}(\text{Embed}(s) + \text{Pos}(s))$
- Policy: $\pi(a|s) = \text{LogSoftmax}(W_{\text{out}} h_{-1})$ (last hidden state)
- Prior Estimation:
  - $p_{\text{env}}$: estimated from training corpus frequencies
  - $p_{\text{internal}}$: approximated by rule-based heuristics or a learned network
- Marginal Computation:
  - Full marginalization over $\mathcal{S}$ is intractable
  - Use a Monte Carlo approximation with prefixes sampled from the prior, $x_i \sim p(x)$: $p(v) \approx \frac{1}{N}\sum_{i=1}^N p(v|x_i)$
- Divergence Calculation:
  - Compute on mini-batches
  - Use moving averages for stability
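The Monte Carlo marginal can be sketched as follows. The two-prefix prior and next-token table are assumed toy values; sampling $x_i \sim p(x)$ and averaging $p(v|x_i)$ gives an unbiased estimate of the marginal:

```python
import random

def mc_marginal(p_v_given, sample_x, n=20000, seed=0):
    """p(v) ≈ (1/N) Σ_i p(v|x_i), with prefixes x_i drawn from p(x)."""
    rng = random.Random(seed)
    return sum(p_v_given[sample_x(rng)] for _ in range(n)) / n

p_v_given = {"x1": 0.2, "x2": 0.6}                           # assumed p(v|x)
sample_x = lambda rng: "x1" if rng.random() < 0.7 else "x2"  # x ~ p(x)
est = mc_marginal(p_v_given, sample_x)
# exact marginal: 0.7·0.2 + 0.3·0.6 = 0.32; est approaches it as n grows
```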
```
θ₀ ← PretrainedLLM or RandomInit
φ₀ ← RandomInit
Z₀ ← 0 (log partition function)

for epoch = 1 to T:
    # Phase 1: GFlowNet maximizes divergence
    for batch in GFlowBatches:
        τ ~ π_φ (sample trajectories)
        R(τ) = R_base(τ) + λ₁·D_Bayes(τ, θ) + λ₂·C_causal(τ)
        L_TB = TB_Loss(τ, R, φ)
        φ ← φ - η_φ·∇_φ L_TB

    # Phase 2: LLM minimizes divergence
    for batch in MixedBatches:
        x_gen, y_gen ~ π_φ (generated data)
        x_real, y_real ~ Data (real data)
        x_mix = concat(x_gen, x_real) with ratio ρ
        L_CE = CrossEntropy(p_LLM(y|x; θ), y)
        L_div = D_Bayes(x, y, θ, φ)
        L_causal = C_causal(x, y, θ)
        L = L_CE + α·L_div + β·L_causal
        θ ← θ - η_θ·∇_θ L
```
Track metrics:
- $\text{Div}_t = \mathbb{E}[D_{\text{Bayes}}]$ (should stabilize)
- $\text{Acc}_t$ = task accuracy (should improve)
- $\text{Rob}_t$ = robustness to noise (should increase)
Theorem 1 (Existence): Under mild conditions (compact parameter spaces, continuous losses), there exists at least one Nash equilibrium $(\theta^*, \phi^*)$.
Proof sketch: Apply Brouwer's fixed-point theorem to the best-response mapping.
Theorem 2 (Consistency): At equilibrium, the system satisfies approximate Bayesian coherence: the LLM's next-token distribution agrees with the Bayes posterior implied by $p_{\text{prior}}$ up to a bounded divergence, $\mathbb{E}[D_{\text{Bayes}}] < \epsilon$.
Proof sketch: By construction of the loss functions and adversarial training.
Proposition 1: The learned model exhibits improved causal invariance: $$\text{Var}_{\text{noise}}[p(y|x + \epsilon)] < \text{Var}_{\text{baseline}}[p(y|x + \epsilon)]$$
where $\epsilon$ denotes input noise and the variance is taken over noise draws.
The system can be viewed as minimizing a free energy
$$F = U - T S$$
where:
- $U = -\mathbb{E}[\log p(y|x)]$ (internal energy / prediction loss)
- $S = -\sum p \log p$ (entropy)
- $T = \frac{1}{\beta}$ (temperature parameter controlling exploration)
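A small numeric sketch of $F = U - TS$, using toy per-token prediction losses and a uniform two-point distribution (both assumed for illustration):

```python
import math

def free_energy(nll_per_token, dist, T=1.0):
    """F = U - T·S with U = mean(-log p(y|x)) and S = -Σ p log p."""
    U = sum(nll_per_token) / len(nll_per_token)          # internal energy
    S = -sum(p * math.log(p) for p in dist if p > 0)     # Shannon entropy
    return U - T * S

F = free_energy([0.5, 1.5], [0.5, 0.5], T=1.0)   # U = 1.0, S = ln 2
```

Raising $T$ weights entropy more heavily, favoring exploratory (higher-entropy) predictive distributions over pure loss minimization.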
The Fisher information metric on parameter space:
$$g_{ij}(\theta) = \mathbb{E}_{x,y}\left[\frac{\partial \log p(y|x;\theta)}{\partial \theta_i}\, \frac{\partial \log p(y|x;\theta)}{\partial \theta_j}\right]$$
Natural gradient updates follow geodesics in this space.
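As a one-parameter illustration (not part of the original formulation), the Fisher metric of a Bernoulli model is $I(p) = 1/(p(1-p))$, and the natural gradient rescales the Euclidean gradient by $I(p)^{-1}$:

```python
def natural_grad_bernoulli(p, grad):
    """Natural gradient F⁻¹∇L for a Bernoulli(p) model."""
    fisher = 1.0 / (p * (1.0 - p))   # Fisher information I(p) = 1/(p(1-p))
    return grad / fisher

# At p = 0.5, I(p) = 4, so a unit Euclidean gradient shrinks to 0.25:
step = natural_grad_bernoulli(0.5, 1.0)
```

Near the boundary ($p \to 0$ or $1$) the Fisher information diverges, so natural-gradient steps shrink where the model is most sensitive.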
The system evolution can be described by: $$\begin{cases} \frac{d\theta}{dt} = -\frac{\partial H}{\partial \phi} \\ \frac{d\phi}{dt} = \frac{\partial H}{\partial \theta} \end{cases}$$
where $H(\theta, \phi)$ is the Hamiltonian of the adversarial game (the coupled objective of the two players).
- Factorized Prior: $p_{\text{prior}} = p_{\text{env}} \times p_{\text{internal}}$ separates frequency from necessity
- Adversarial Causal Learning: the GFlowNet generates challenging cases for causal reasoning
- Bayesian Divergence Reward: enforces probabilistic coherence
- Unsupervised Adaptation: can learn from generated sequences without labels
- Scalability: How to efficiently compute marginals for large vocabularies?
- Prior Learning: How to learn $p_{\text{internal}}$ from data?
- Convergence Rate: What determines the speed of convergence to equilibrium?
- Generalization: Does the framework extend to multi-modal inputs?
- Interpretability: Can we extract explicit causal graphs from the learned model?
- Bengio et al. (2021): "Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation"
- Pearl (2009): "Causality: Models, Reasoning, and Inference"
- MacKay (2003): "Information Theory, Inference, and Learning Algorithms"
- Integrated World Modeling Theory (IWMT): Consciousness as Bayesian inference