Commit d18dd88

Fix documentation display (#16)
1 parent 112c807 commit d18dd88

File tree

1 file changed: +22 -56 lines changed

docs/src/background.md

Lines changed: 22 additions & 56 deletions
@@ -5,23 +5,17 @@ Most of the math below is taken from [mohamedMonteCarloGradient2020](@citet).
Consider a function $f: \mathbb{R}^n \to \mathbb{R}^m$, a parameter $\theta \in \mathbb{R}^d$ and a parametric probability distribution $p(\theta)$ on the input space.
Given a random variable $X \sim p(\theta)$, we want to differentiate the expectation of $Y = f(X)$ with respect to $\theta$:

$$E(\theta) = \mathbb{E}[f(X)] = \int f(x) ~ p(x | \theta) ~\mathrm{d} x$$

Usually this is approximated with Monte-Carlo sampling: let $x_1, \dots, x_S \sim p(\theta)$ be i.i.d. samples; then we have the estimator

$$E(\theta) \simeq \frac{1}{S} \sum_{s=1}^S f(x_s)$$
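
As a concrete illustration, here is a minimal Julia sketch of this estimator for a hypothetical toy model where $f(x) = x^2$ and $p(\theta) = \mathcal{N}(\theta, 1)$ (the names below are illustrative, not part of this package's API):

```julia
using Distributions, Statistics

# Toy model: f(x) = x² and X ~ Normal(θ, 1), for which E(θ) = θ² + 1 exactly.
f(x) = abs2(x)
θ, S = 0.5, 10_000
xs = rand(Normal(θ, 1.0), S)   # i.i.d. samples x₁, …, x_S ~ p(θ)
E_est = mean(f, xs)            # (1/S) Σₛ f(xₛ) ≈ 1.25
```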

## Autodiff

Since $E$ is a vector-to-vector function, the key quantity we want to compute is its Jacobian matrix $\partial E(\theta) \in \mathbb{R}^{m \times d}$, where $q(y | \theta)$ denotes the density of the output $Y = f(X)$:

$$\partial E(\theta) = \int y ~ \nabla_\theta q(y | \theta)^\top ~ \mathrm{d} y = \int f(x) ~ \nabla_\theta p(x | \theta)^\top ~\mathrm{d} x$$

However, to implement automatic differentiation, we only need the vector-Jacobian product (VJP) $\partial E(\theta)^\top \bar{y}$ with an output cotangent $\bar{y} \in \mathbb{R}^m$.
See the book by [blondelElementsDifferentiableProgramming2024](@citet) to learn more.
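
For example, with an autodiff backend such as Zygote (an assumed choice for illustration, not mandated by this package), the VJP is exactly what the pullback returns:

```julia
using Zygote

# A toy vector-to-vector function of the parameter θ.
F(θ) = [sum(abs2, θ), sum(θ)]
θ, ȳ = [1.0, 2.0], [1.0, 0.0]
y, back = Zygote.pullback(F, θ)   # forward pass + pullback closure
vjp, = back(ȳ)                    # ∂F(θ)ᵀ ȳ, without materializing the Jacobian
```
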
@@ -36,40 +30,32 @@ Implemented by [`Reinforce`](@ref).

The REINFORCE estimator is derived with the help of the identity $\nabla \log u = \nabla u / u$:

$$\begin{aligned}
\partial E(\theta)
& = \int f(x) ~ \nabla_\theta p(x | \theta)^\top ~ \mathrm{d}x \\
& = \int f(x) ~ p(x | \theta) \nabla_\theta \log p(x | \theta)^\top ~ \mathrm{d}x \\
& = \mathbb{E} \left[f(X) \nabla_\theta \log p(X | \theta)^\top\right]
\end{aligned}$$

And the VJP:

$$\partial E(\theta)^\top \bar{y} = \mathbb{E} \left[f(X)^\top \bar{y} ~\nabla_\theta \log p(X | \theta)\right]$$

Our Monte-Carlo approximation will therefore be:

$$\partial E(\theta)^\top \bar{y} \simeq \frac{1}{S} \sum_{s=1}^S f(x_s)^\top \bar{y} ~ \nabla_\theta \log p(x_s | \theta)$$
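
As a hedged illustration (not the [`Reinforce`](@ref) implementation itself), here is this estimator for the same toy model as before, $f(x) = x^2$ and $X \sim \mathcal{N}(\theta, 1)$, for which $\nabla_\theta \log p(x | \theta) = x - \theta$:

```julia
using Distributions

# Toy REINFORCE VJP sketch: scalar output, so ȳ is a scalar.
f(x) = abs2(x)
θ, S, ȳ = 0.5, 10_000, 1.0
xs = rand(Normal(θ, 1.0), S)                  # i.i.d. samples xₛ ~ p(θ)
vjp = sum(f(x) * ȳ * (x - θ) for x in xs) / S
# Exact value: E(θ) = θ² + 1, so ∂E(θ)ᵀ ȳ = 2θ ȳ = 1.0, up to sampling noise.
```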

### Variance reduction

The REINFORCE estimator has high variance, which can be reduced by subtracting a so-called baseline $b = \frac{1}{S} \sum_{s=1}^S f(x_s)$ [koolBuyREINFORCESamples2022](@citep).

For $S > 1$ Monte-Carlo samples, we have

$$\begin{aligned}
\partial E(\theta)^\top \bar{y}
& \simeq \frac{1}{S} \sum_{s=1}^S \left(f(x_s) - \frac{1}{S - 1}\sum_{j \neq s} f(x_j) \right)^\top \bar{y} ~ \nabla_\theta\log p(x_s | \theta)\\
& = \frac{1}{S - 1}\sum_{s=1}^S (f(x_s) - b)^\top \bar{y} ~ \nabla_\theta\log p(x_s | \theta)
\end{aligned}$$

The second equality holds because $\sum_{j \neq s} f(x_j) = S b - f(x_s)$, so each leave-one-out term equals $\frac{S}{S - 1} \left(f(x_s) - b\right)$.
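
In the same toy setup as above, the baseline correction only changes two lines (again a sketch under the stated assumptions):

```julia
# Baseline b = (1/S) Σₛ f(xₛ), then rescale by 1/(S - 1) instead of 1/S.
b = sum(f, xs) / S
vjp_vr = sum((f(x) - b) * ȳ * (x - θ) for x in xs) / (S - 1)
```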

## Reparametrization

@@ -81,27 +67,19 @@ The reparametrization trick assumes that we can rewrite the random variable $X \

The expectation is rewritten with $h_\theta = f \circ g_\theta$:

$$E(\theta) = \mathbb{E}\left[ f(g_\theta(Z)) \right] = \mathbb{E}\left[ h_\theta(Z) \right]$$

And we can directly differentiate through the expectation:

$$\partial E(\theta) = \mathbb{E} \left[ \partial_\theta h_\theta(Z) \right]$$

This yields the VJP:

$$\partial E(\theta)^\top \bar{y} = \mathbb{E} \left[ \partial_\theta h_\theta(Z)^\top \bar{y} \right]$$

We can use a Monte-Carlo approximation with i.i.d. samples $z_1, \dots, z_S \sim r$:

$$\partial E(\theta)^\top \bar{y} \simeq \frac{1}{S} \sum_{s=1}^S \partial_\theta h_\theta(z_s)^\top \bar{y}$$
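
A matching sketch for the same toy model, reparametrized as $X = g_\theta(Z) = \theta + Z$ with $Z \sim \mathcal{N}(0, 1)$; taking the inner derivative with ForwardDiff is an assumption about the available autodiff backend:

```julia
using Distributions, ForwardDiff

# Toy reparametrization: h_θ(z) = f(θ + z) with f(x) = x².
f(x) = abs2(x)
θ, S, ȳ = 0.5, 10_000, 1.0
zs = rand(Normal(0.0, 1.0), S)   # i.i.d. samples z₁, …, z_S ~ r
vjp = sum(ForwardDiff.derivative(t -> f(t + z), θ) * ȳ for z in zs) / S
# Same target as REINFORCE, ∂E(θ)ᵀ ȳ = 2θ ȳ, typically with much lower variance.
```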

### Catalogue

@@ -118,49 +96,37 @@ In addition to the expectation, we may also want gradients for individual output

The REINFORCE technique can be applied in a similar way:

$$q(y | \theta) = \mathbb{E}[\mathbf{1}\{f(X) = y\}] = \int \mathbf{1} \{f(x) = y\} ~ p(x | \theta) ~ \mathrm{d}x$$

Differentiating through the integral,

$$\begin{aligned}
\nabla_\theta q(y | \theta)
& = \int \mathbf{1} \{f(x) = y\} ~ \nabla_\theta p(x | \theta) ~ \mathrm{d}x \\
& = \mathbb{E} [\mathbf{1} \{f(X) = y\} ~ \nabla_\theta \log p(X | \theta)]
\end{aligned}$$

The Monte-Carlo approximation for this is

$$\nabla_\theta q(y | \theta) \simeq \frac{1}{S} \sum_{s=1}^S \mathbf{1} \{f(x_s) = y\} ~ \nabla_\theta \log p(x_s | \theta)$$
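
As a hedged example (a toy model, not this package's API), take the discrete output $f(x) = \operatorname{sign}(x)$ with $X \sim \mathcal{N}(\theta, 1)$, so that $q(1 | \theta) = \Phi(\theta)$ and $\nabla_\theta q(1 | \theta) = \varphi(\theta)$:

```julia
using Distributions

# Toy probability gradient: ∇_θ P(f(X) = 1) for f = sign, X ~ Normal(θ, 1).
θ, S, y = 0.5, 100_000, 1.0
xs = rand(Normal(θ, 1.0), S)
grad_q = sum((sign(x) == y) * (x - θ) for x in xs) / S
# Closed form for comparison: φ(θ) = pdf(Normal(), θ) ≈ 0.352.
```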

### Reparametrization probability gradients

To leverage reparametrization, we perform a change of variables:

$$q(y | \theta) = \mathbb{E}[\mathbf{1}\{h_\theta(Z) = y\}] = \int \mathbf{1} \{h_\theta(z) = y\} ~ r(z) ~ \mathrm{d}z$$

Assuming that $h_\theta$ is invertible, we take $z = h_\theta^{-1}(u)$ and

$$\mathrm{d}z = |\partial h_{\theta}^{-1}(u)| ~ \mathrm{d}u$$

so that

$$q(y | \theta) = \int \mathbf{1} \{u = y\} ~ r(h_\theta^{-1}(u)) ~ |\partial h_{\theta}^{-1}(u)| ~ \mathrm{d}u$$

We can now differentiate, but it gets tedious.
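
For intuition, consider a hypothetical one-dimensional location family $h_\theta(z) = z + \theta$: then $h_\theta^{-1}(u) = u - \theta$ and $|\partial h_\theta^{-1}(u)| = 1$, so

$$q(y | \theta) = r(y - \theta) \quad \text{and} \quad \nabla_\theta q(y | \theta) = -r'(y - \theta)$$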

## Bibliography

```@bibliography
```
