We prove that the predictive posterior distribution maximizes the log-likelihood of future observations averaged over the data-generating distribution:
The essence of the proof is to show that the predictive posterior distribution attains an average log-likelihood at least as high as that of any other reference distribution:
or equivalently that:
Proving this claim is straightforward [1]:
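One way to write out the argument is sketched below. The notation is an assumption introduced here for illustration and is not taken from the source: $\theta$ denotes the parameters with prior $p(\theta)$, $x$ the observed data, $y$ a future observation assumed independent of $x$ given $\theta$, $p(y \mid x) = \sum_{\theta} p(y \mid \theta)\, p(\theta \mid x)$ the predictive posterior, and $q(y \mid x)$ an arbitrary reference distribution.

```latex
\begin{align}
&\sum_{\theta} p(\theta) \sum_{x} p(x \mid \theta) \sum_{y} p(y \mid \theta)
  \left[ \log p(y \mid x) - \log q(y \mid x) \right] \\
&\quad= \sum_{x} \sum_{y} \left[ \sum_{\theta} p(\theta)\, p(x \mid \theta)\, p(y \mid \theta) \right]
  \log \frac{p(y \mid x)}{q(y \mid x)} \\
&\quad= \sum_{x} p(x) \sum_{y} \left[ \sum_{\theta} p(\theta \mid x)\, p(y \mid \theta) \right]
  \log \frac{p(y \mid x)}{q(y \mid x)} \\
&\quad= \sum_{x} p(x) \sum_{y} p(y \mid x) \log \frac{p(y \mid x)}{q(y \mid x)} \\
&\quad= \sum_{x} p(x)\, \mathrm{KL}\!\left( p(\cdot \mid x) \,\|\, q(\cdot \mid x) \right) \;\geq\; 0 .
\end{align}
```

The second step uses Bayes' rule, $p(\theta)\, p(x \mid \theta) = p(x)\, p(\theta \mid x)$. The final expression is a $p(x)$-weighted average of Kullback-Leibler divergences, each of which is non-negative, with equality only when $q(y \mid x) = p(y \mid x)$ for every $x$ with $p(x) > 0$.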
Note that while we used sums in our proof, which assume that the relevant quantities take discrete values, the same ideas apply to continuous-valued quantities by replacing sums with integrals.
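The claim can also be checked numerically in the discrete setting. The following is a minimal sketch, not the authors' code: the model sizes, the prior, and the randomly generated likelihood tables are all hypothetical, and the future observation is assumed to be independent of the observed data given the parameters. It compares the average log-likelihood of the predictive posterior with that of many randomly drawn reference distributions.

```python
import math
import random

def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

random.seed(0)

# Hypothetical discrete model: 3 parameter values, 4 data values, 2 future outcomes.
n_theta, n_x, n_y = 3, 4, 2
prior = [0.5, 0.3, 0.2]  # p(theta)
# p(x | theta) and p(y | theta); the small offset keeps every probability positive.
lik_x = [normalize([random.random() + 0.05 for _ in range(n_x)]) for _ in range(n_theta)]
lik_y = [normalize([random.random() + 0.05 for _ in range(n_y)]) for _ in range(n_theta)]

# Joint p(x, y) = sum_theta p(theta) p(x | theta) p(y | theta).
joint_xy = [
    [sum(prior[t] * lik_x[t][x] * lik_y[t][y] for t in range(n_theta)) for y in range(n_y)]
    for x in range(n_x)
]

# Predictive posterior p(y | x) = p(x, y) / p(x), one row per value of x.
predictive = [normalize(row) for row in joint_xy]

def expected_log_lik(q):
    """Average of log q(y | x) under the joint p(x, y)."""
    return sum(
        joint_xy[x][y] * math.log(q[x][y]) for x in range(n_x) for y in range(n_y)
    )

best = expected_log_lik(predictive)

# Gap between the predictive posterior and random reference distributions q(y | x).
gaps = []
for _ in range(1000):
    q = [normalize([random.random() + 1e-6 for _ in range(n_y)]) for _ in range(n_x)]
    gaps.append(best - expected_log_lik(q))

print("smallest gap:", min(gaps))
```

Each gap equals a weighted sum of Kullback-Leibler divergences, so the smallest gap should be non-negative up to floating-point error. A random search like this illustrates the inequality but does not replace the proof.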
[1] Aitchison, J. (1975). Goodness of prediction fit. Biometrika, 62(3), 547-554.