src/content/post/chentsov-theorem.mdx
@@ -63,7 +63,7 @@ be the set of the parametric densities $p_\theta(x)$. We can treat $M$ as a smoo
Let us assume that $\I$ is positive-definite everywhere, and each $\I_{ij}$ is smooth. Then we can use it as (the coordinate representation of) a Riemannian metric on $M$. This is because $\I$ is a covariant 2-tensor. (Recall the definition of a Riemannian metric.)
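Concretely, if $u, v \in T_\theta M$ are tangent vectors with components $u^i, v^i$ in the coordinates $\theta$ (notation introduced here just for illustration), the inner product induced by $\I$ is

$$
\inner{u, v}_g := \sum_{i,j} \I_{ij}(\theta) \, u^i v^j ,
$$

which is bilinear, symmetric, and, under the assumptions above, smooth in $\theta$ and positive-definite, i.e. exactly the data a Riemannian metric requires.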
**Proposition 2.** _The component functions $\I_{ij}$ of $\I$ follow the covariant transformation rule._
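Explicitly, writing $\tilde{\I}_{kl}$ for the components of the Fisher information in the new coordinates $\varphi$ (a symbol used here only for readability), the covariant transformation rule reads

$$
\I_{ij}(\theta) = \sum_{k, l} \frac{\partial \varphi^k}{\partial \theta^i} \frac{\partial \varphi^l}{\partial \theta^j} \, \tilde{\I}_{kl}(\varphi(\theta)) .
$$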
_Proof._ Let $\theta \mapsto \varphi$ be a change of coordinates and let $\ell(\varphi) := \log p_\varphi(x)$. The component function $\I_{ij}(\theta)$ in the "old" coordinates is expressed in terms of the "new" ones, as follows:
@@ -113,13 +113,13 @@ We call this map a **_Markov embedding_**. The name suggests that $f$ embeds $\R
The result of Campbell (1986) characterizes the form of the Riemannian metrics on $\R^n_{>0}$ that are invariant under any Markov embedding.
**Lemma 3 (Campbell, 1986).** _Let $g$ be a Riemannian metric on $\R^n_{>0}$ where $n \geq 2$. Suppose that every Markov embedding on $(\R^n_{>0}, g)$ is an isometry. Then_
_where $\abs{x} = \sum_{i=1}^n x^i$, $\delta_{ij}$ is the Kronecker delta, and $A, B \in C^\infty(\R_{>0})$ satisfy $B > 0$ and $A + B > 0$._
_Proof._ See Campbell (1986) and Amari (2016, Sec. 3.5).
@@ -133,7 +133,7 @@ The fact that the Fisher information is the unique invariant metric under suffic
Let us, therefore, connect the result in Lemma 3 with the Fisher information on $\Delta^{n-1}$. We give the latter in the following lemma.
**Lemma 4.** _The Fisher information of a Categorical distribution $p_\theta(z)$, where $z$ takes values in $\Omega = \\{ 1, \dots, n \\}$ and $\theta = (\theta^1, \dots, \theta^n) \in \Delta^{n-1}$, is given by_
for any $x \in \R^n_{>0}$. Therefore, this is the form of the invariant metric under sufficient statistics in $\Delta^{n-1} \subset \R^n_{>0}$, i.e. when $n=m$ in the Markov embedding.
Let us therefore restrict $g$ to $\Delta^{n-1}$. For each $\theta \in \Delta^{n-1}$, the tangent space $T_\theta \Delta^{n-1}$ is orthogonal to the line $x^1 = x^2 = \dots = x^n$, whose direction is given by the vector $\mathbf{1} = (1, \dots, 1) \in \R^n_{>0}$. This is a vector normal to $\Delta^{n-1}$, implying that any $v \in T_\theta \Delta^{n-1}$ satisfies $\inner{\mathbf{1}, v}_g = 0$, i.e. $\sum_{i=1}^n v^i = 0$.
Moreover, if $\theta \in \Delta^{n-1}$, then $\abs{\theta} = \sum_{i=1}^n \theta^i = 1$ by definition. Thus, $A(1)$ and $B(1)$ are constants. So, if $v, w \in T_\theta \Delta^{n-1}$, we have:
Let $f: X \times \Theta \to Y$, defined by $(x, \theta) \mapsto f_\theta(x)$, be a neural network, where $X \subseteq \R^n$, $\Theta \subseteq \R^d$, and $Y \subseteq \R^c$ are the input, parameter, and output spaces, respectively.
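For concreteness, a minimal PyTorch stand-in for such an $f$ could look as follows (purely illustrative; the post does not fix an architecture, and the dimensions and layer sizes below are arbitrary):

```python
import torch.nn as nn

n, c = 10, 1  # input and output dimensions (arbitrary choices)
# f_theta: R^n -> R^c, with theta collecting all weights and biases of the network
model = nn.Sequential(nn.Linear(n, 50), nn.Tanh(), nn.Linear(50, c))
```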
Given a dataset $\D := \\{ (x_i, y_i) : x_i \in X, y_i \in Y \\}_{i=1}^m$, we define the likelihood $p(\D \mid \theta) := \prod_{i=1}^m p(y_i \mid f_\theta(x_i))$.
Then, given a prior $p(\theta)$, we can obtain the posterior via an application of Bayes' rule: $p(\theta \mid \D) = 1/Z \,\, p(\D \mid \theta) p(\theta)$.
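To make these objects concrete, here is a sketch of the unnormalized log-posterior $\log p(\D \mid \theta) + \log p(\theta)$ for the regression case, assuming (purely for illustration) a Gaussian likelihood with unit observation noise and an isotropic Gaussian prior with precision `prior_prec`:

```python
import torch

def log_unnorm_posterior(model, X, Y, prior_prec=1.0):
    """log p(D | theta) + log p(theta), up to additive constants.

    X has shape (m, n); Y has shape (m, c), matching the model's output.
    """
    # Gaussian likelihood with unit noise: log p(y_i | f_theta(x_i)) = -0.5 ||y_i - f_theta(x_i)||^2 + const
    log_lik = -0.5 * ((model(X) - Y) ** 2).sum()
    # Isotropic Gaussian prior: log p(theta) = -0.5 * prior_prec * ||theta||^2 + const
    log_prior = -0.5 * prior_prec * sum((p ** 2).sum() for p in model.parameters())
    return log_lik + log_prior
```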
But the exact computation of $p(\theta \mid \D)$ is intractable in general due to the need to compute the normalization constant
@@ -49,7 +50,7 @@ $$
\end{align*}
$$
For simplicity, let $\varSigma := -\left(\nabla^2_\theta \L\vert_{\theta_\map}\right)^{-1}$. Then, using this approximation, we can also obtain an approximation of $Z$:
$$
\begin{align*}
@@ -91,7 +92,7 @@ which in general is less overconfident compared to the MAP-estimate-induced pred
What we have seen is the most general framework of the LA.
One can make specific design decisions, such as imposing a special structure on the Hessian $\nabla^2_\theta \L$, and thus on the covariance $\varSigma$.
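As a rough illustration of this general recipe (a toy sketch with a linear "network" and a full Hessian; this is not how `laplace-torch` implements it), one can compute the MAP estimate, the covariance $\varSigma$, and the resulting Laplace estimate of $\log Z$ as follows:

```python
import math

import torch
from torch.autograd.functional import hessian

# Toy data and a "network" that is plain linear regression, f_theta(x) = x @ theta
X, Y = torch.randn(50, 3), torch.randn(50)
prior_prec = 1.0

def log_joint(theta):
    # log p(D | theta) + log p(theta), up to additive constants (Gaussian likelihood and prior)
    log_lik = -0.5 * ((X @ theta - Y) ** 2).sum()
    log_prior = -0.5 * prior_prec * (theta ** 2).sum()
    return log_lik + log_prior

# MAP estimate (closed form for this toy model; an optimizer in general)
theta_map = torch.linalg.solve(X.T @ X + prior_prec * torch.eye(3), X.T @ Y)

# Laplace covariance: Sigma = -(Hessian of the log-joint at the MAP)^{-1}
Sigma = torch.linalg.inv(-hessian(log_joint, theta_map))

# Laplace estimate of the evidence: log Z ~ log-joint at MAP + d/2 log(2 pi) + 1/2 log det(Sigma)
d = theta_map.numel()
log_Z = log_joint(theta_map) + 0.5 * d * math.log(2 * math.pi) + 0.5 * torch.logdet(Sigma)
```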
## The laplace-torch library
The simplicity of the LA is not without a drawback.
Recall that the parameter $\theta$ is in $\Theta \subseteq \R^d$.
@@ -101,44 +102,40 @@ Together with the fact that the LA is an old method (and thus not "trendy" in th
Motivated by this observation, in our NeurIPS 2021 paper titled ["Laplace Redux -- Effortless Bayesian Deep Learning"](https://arxiv.org/abs/2106.14806), we show that (i) the Hessian can be obtained cheaply, thanks to recent advances in second-order optimization, and (ii) even the simplest LA can be competitive with more sophisticated VB and MCMC methods, while being much cheaper than them.
Of course, numbers alone are not sufficient to make the case for the LA.
So, in that paper, we also propose an extensible, easy-to-use software library for PyTorch called `laplace-torch`, which is available at [this GitHub repo](https://github.com/AlexImmer/Laplace).
`laplace-torch` is a simple library for, essentially, "turning standard NNs into BNNs".
The main class of this library is `Laplace`, which can be used to transform a standard PyTorch model into a Laplace-approximated BNN.
Here is an example.
```python title="try_laplace.py"
from laplace import Laplace

model = load_pretrained_model()

la = Laplace(model, 'regression')

# Compute the Hessian
la.fit(train_loader)

# Hyperparameter tuning
la.optimize_prior_precision()

# Make prediction
pred_mean, pred_var = la(x_test)
```
The resulting object, `la`, is a fully functioning BNN, yielding the following prediction.
(Notice the identical regression curves---the LA essentially imbues MAP predictions with uncertainty estimates.)
<BlogImage imagePath="/img/laplace/regression_example.png" altText="Laplace for regression." />
Of course, `laplace-torch` is flexible: the `Laplace` class has almost all state-of-the-art features in Laplace approximations.
Those features, along with the corresponding options in `laplace-torch`, are summarized in the following flowchart.
(The options `'subnetwork'` for `subset_of_weights` and `'lowrank'` for `hessian_structure` are in the works at the time this post is first published.)
<BlogImage imagePath="/img/laplace/flowchart.png" altText="Modern arts of Laplace approximations." fullWidth />
The `laplace-torch` library uses a very cheap yet highly performant flavor of LA by default, based on [4]:
<BlogImage imagePath="/img/laplace/classification.png" altText="Laplace for classification." fullWidth />
Here we can see that `Laplace`, with default options, improves the calibration (in terms of expected calibration error (ECE)) of the MAP model.
Moreover, it is guaranteed to preserve the accuracy of the MAP model---something that cannot be said for other baselines.
Ultimately, this improvement is cheap: `laplace-torch` incurs only a small overhead relative to the MAP model---far less than that of other Bayesian baselines.
## Hyperparameter Tuning
Hyperparameter tuning, especially for the prior variance/precision, is crucial in modern Laplace approximations for BNNs.
`laplace-torch` provides two options: (i) cross-validation and (ii) marginal-likelihood maximization (MLM, also known as empirical Bayes or type-II maximum likelihood).
Cross-validation is simple but needs a validation dataset.
In `laplace-torch`, this can be done via the following.
@@ -170,7 +167,7 @@ Recall that by taking the second-order Taylor expansion over the log-posterior,
This object is called the marginal likelihood: it is a probability over the dataset $\D$ and, crucially, it is a function of the hyperparameters since the parameter $\theta$ is marginalized out.
Thus, we can find the best values for our hyperparameters by maximizing this function.
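In symbols, writing $\gamma$ for the hyperparameters (a symbol introduced just for this equation, standing e.g. for the prior precision), MLM amounts to solving

$$
\gamma_* = \arg\max_\gamma \, \log p(\D \mid \gamma) = \arg\max_\gamma \, \log \int p(\D \mid \theta) \, p(\theta \mid \gamma) \, d\theta ,
$$

where the intractable integral is replaced by its Laplace approximation.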
In `laplace-torch`, the marginal likelihood can be accessed via
```python
ml = la.log_marginal_likelihood(prior_precision)
```
@@ -182,16 +179,16 @@ This function is compatible with PyTorch's autograd, so we can backpropagate thr
```python
ml.backward() # Works!
```
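For instance, one could tune the prior precision by gradient-based MLM along these lines (a sketch, not code from the paper; the variable names and optimizer settings are arbitrary choices):

```python
import torch

log_prior_prec = torch.zeros(1, requires_grad=True)  # optimize in log-space to keep the precision positive
opt = torch.optim.Adam([log_prior_prec], lr=1e-1)

for _ in range(100):
    opt.zero_grad()
    neg_ml = -la.log_marginal_likelihood(log_prior_prec.exp())
    neg_ml.backward()  # gradients flow through the Laplace-approximated marginal likelihood
    opt.step()
```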
Thus, MLM can easily be done in `laplace-torch`.
By extension, recent methods such as online MLM [5] can also easily be applied using `laplace-torch`.
## Outlook
The `laplace-torch` library is under continuous development.
Support for more likelihood functions and priors, subnetwork Laplace, etc., is on the way.
In any case, we hope to see the revival of the LA in the Bayesian deep learning community.
So, please try out our library at [https://github.com/AlexImmer/Laplace](https://github.com/AlexImmer/Laplace)!