Most of the math below is taken from [mohamedMonteCarloGradient2020](@citet).

Consider a function $f: \mathbb{R}^n \to \mathbb{R}^m$, a parameter $\theta \in \mathbb{R}^d$, and a parametric probability distribution $p(\theta)$ on the input space.
Given a random variable $X \sim p(\theta)$, we want to differentiate the expectation of $Y = f(X)$ with respect to $\theta$:

$$ E(\theta) = \mathbb{E}[f(X)] = \int f(x) ~ p(x | \theta) ~\mathrm{d} x $$

Usually this is approximated with Monte-Carlo sampling: given i.i.d. samples $x_1, \dots, x_S \sim p(\theta)$, we have the estimator

$$ E(\theta) \simeq \frac{1}{S} \sum_{s=1}^S f(x_s) $$

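To make this concrete, here is a minimal Julia sketch of the estimator, under purely illustrative assumptions that do not come from the text above: a toy function $f(x) = x^2$ applied componentwise, and a Gaussian sampling distribution $p(\theta) = \mathcal{N}(\theta, I)$.

```julia
using Statistics

f(x) = x .^ 2                        # toy test function (componentwise square)
sample_x(θ) = θ .+ randn(length(θ))  # one sample from p(θ) = N(θ, I)

# Monte-Carlo estimate of E(θ) = 𝔼[f(X)] with S i.i.d. samples
montecarlo_expectation(θ; S=10_000) = mean(f(sample_x(θ)) for _ in 1:S)

θ = [0.5, -1.0]
montecarlo_expectation(θ)  # ≈ θ .^ 2 .+ 1 for this Gaussian toy example
```
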
## Autodiff

Since $E$ is a vector-to-vector function, the key quantity we want to compute is its Jacobian matrix $\partial E(\theta) \in \mathbb{R}^{m \times d}$.
Writing $q(y | \theta)$ for the density of the output variable $Y = f(X)$, we have:

$$ \partial E(\theta) = \int y ~ \nabla_\theta q(y | \theta)^\top ~ \mathrm{d} y = \int f(x) ~ \nabla_\theta p(x | \theta)^\top ~\mathrm{d} x $$

However, to implement automatic differentiation, we only need the vector-Jacobian product (VJP) $\partial E(\theta)^\top \bar{y}$ with an output cotangent $\bar{y} \in \mathbb{R}^m$.
See the book by [blondelElementsDifferentiableProgramming2024](@citet) to learn more.
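
In Julia, one common way to expose such a Monte-Carlo VJP to reverse-mode autodiff is a custom `ChainRulesCore.rrule`. The snippet below is only a sketch of this general pattern, not the package's actual implementation: it reuses the toy `f` and `sample_x` from the sketch above, and `estimate_vjp` is a hypothetical placeholder for one of the estimators derived in the next sections.

```julia
using ChainRulesCore, Statistics

# Forward pass: plain Monte-Carlo estimate of E(θ).
montecarlo_E(θ; S=1_000) = mean(f(sample_x(θ)) for _ in 1:S)

# Custom reverse rule: autodiff never differentiates through the sampling,
# it only calls the pullback, which applies a Monte-Carlo VJP estimator.
function ChainRulesCore.rrule(::typeof(montecarlo_E), θ; S=1_000)
    y = montecarlo_E(θ; S)
    # estimate_vjp is a hypothetical placeholder (e.g. the REINFORCE estimator below)
    montecarlo_E_pullback(ȳ) = (NoTangent(), estimate_vjp(θ, ȳ; S))
    return y, montecarlo_E_pullback
end
```
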

## REINFORCE

Implemented by [`Reinforce`](@ref).

The REINFORCE estimator is derived with the help of the identity $\nabla \log u = \nabla u / u$:

$$ \begin{aligned}
\partial E(\theta)
& = \int f(x) ~ \nabla_\theta p(x | \theta)^\top ~ \mathrm{d}x \\
& = \int f(x) ~ p(x | \theta) \nabla_\theta \log p(x | \theta)^\top ~ \mathrm{d}x \\
& = \mathbb{E} \left[f(X) \nabla_\theta \log p(X | \theta)^\top\right]
\end{aligned} $$

And the VJP:

$$ \partial E(\theta)^\top \bar{y} = \mathbb{E} \left[f(X)^\top \bar{y} ~\nabla_\theta \log p(X | \theta)\right] $$

Our Monte-Carlo approximation will therefore be:

$$ \partial E(\theta)^\top \bar{y} \simeq \frac{1}{S} \sum_{s=1}^S f(x_s)^\top \bar{y} ~ \nabla_\theta \log p(x_s | \theta) $$

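Continuing the Gaussian toy example from above, where $\nabla_\theta \log p(x | \theta) = x - \theta$, a minimal sketch of this estimator could look as follows (illustration only, not the package implementation):

```julia
using LinearAlgebra, Statistics

f(x) = x .^ 2                        # same toy function as before
sample_x(θ) = θ .+ randn(length(θ))  # X ~ N(θ, I), so ∇_θ log p(x | θ) = x - θ

# REINFORCE Monte-Carlo estimate of ∂E(θ)ᵀ ȳ
function reinforce_vjp(θ, ȳ; S=10_000)
    return mean(dot(f(x), ȳ) .* (x .- θ) for x in (sample_x(θ) for _ in 1:S))
end

θ, ȳ = [0.5, -1.0], [1.0, 1.0]
reinforce_vjp(θ, ȳ)  # ≈ 2θ .* ȳ, the exact VJP for this toy example
```
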
### Variance reduction

The REINFORCE estimator has high variance, which can be reduced by subtracting a so-called baseline $b = \frac{1}{S} \sum_{s=1}^S f(x_s)$ [koolBuyREINFORCESamples2022](@citep).

For $S > 1$ Monte-Carlo samples, we have

$$ \begin{aligned}
\partial E(\theta)^\top \bar{y}
& \simeq \frac{1}{S} \sum_{s=1}^S \left(f(x_s) - \frac{1}{S - 1}\sum_{j \neq s} f(x_j) \right)^\top \bar{y} ~ \nabla_\theta \log p(x_s | \theta) \\
& = \frac{1}{S - 1}\sum_{s=1}^S (f(x_s) - b)^\top \bar{y} ~ \nabla_\theta \log p(x_s | \theta)
\end{aligned} $$

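In the same toy setting as before, the baseline-corrected estimator can be sketched directly from the last line above (again a purely illustrative example):

```julia
using LinearAlgebra, Statistics

f(x) = x .^ 2                        # same toy setting as before
sample_x(θ) = θ .+ randn(length(θ))  # X ~ N(θ, I), so ∇_θ log p(x | θ) = x - θ

# REINFORCE VJP with the averaged baseline b = (1/S) Σₛ f(xₛ)
function reinforce_vjp_baseline(θ, ȳ; S=100)
    xs = [sample_x(θ) for _ in 1:S]
    b = mean(f.(xs))
    return sum(dot(f(x) .- b, ȳ) .* (x .- θ) for x in xs) ./ (S - 1)
end
```
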
## Reparametrization

The reparametrization trick assumes that we can rewrite the random variable $X \sim p(\theta)$ as $X = g_\theta(Z)$, where $Z \sim r$ follows a fixed distribution that does not depend on $\theta$.

The expectation is rewritten with $h_\theta = f \circ g_\theta$:

$$ E(\theta) = \mathbb{E}\left[ f(g_\theta(Z)) \right] = \mathbb{E}\left[ h_\theta(Z) \right] $$

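As a concrete (and purely illustrative) example, a scalar Gaussian $\mathcal{N}(\mu, \sigma^2)$ with $\theta = (\mu, \sigma)$ admits the location-scale reparametrization $g_\theta(z) = \mu + \sigma z$ with $Z \sim \mathcal{N}(0, 1)$:

```julia
f(x) = x^2                   # scalar toy function
g(θ, z) = θ[1] + θ[2] * z    # g_θ(z) = μ + σ z with θ = (μ, σ)
h(θ, z) = f(g(θ, z))         # h_θ = f ∘ g_θ
```
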
And we can directly differentiate through the expectation:

$$ \partial E(\theta) = \mathbb{E} \left[ \partial_\theta h_\theta(Z) \right] $$

This yields the VJP:

$$ \partial E(\theta)^\top \bar{y} = \mathbb{E} \left[ \partial_\theta h_\theta(Z)^\top \bar{y} \right] $$

We can use a Monte-Carlo approximation with i.i.d. samples $z_1, \dots, z_S \sim r$:

$$ \partial E(\theta)^\top \bar{y} \simeq \frac{1}{S} \sum_{s=1}^S \partial_\theta h_\theta(z_s)^\top \bar{y} $$

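Using the scalar Gaussian toy example introduced above, the transpose-Jacobian product $\partial_\theta h_\theta(z)^\top \bar{y}$ can be written out by hand, which gives the following sketch (in practice an autodiff backend would provide this product):

```julia
using Statistics

# Toy setting: f(x) = x², g_θ(z) = μ + σ z with θ = (μ, σ), h_θ = f ∘ g_θ.
# Hand-derived ∂_θ h_θ(z)ᵀ ȳ = ȳ * (2x, 2xz) with x = μ + σ z.
function hθ_vjp(θ, z, ȳ)
    x = θ[1] + θ[2] * z
    return ȳ .* [2x, 2x * z]
end

# Reparametrization Monte-Carlo estimate of ∂E(θ)ᵀ ȳ
reparametrization_vjp(θ, ȳ; S=10_000) = mean(hθ_vjp(θ, randn(), ȳ) for _ in 1:S)

θ, ȳ = [0.5, 2.0], 1.0
reparametrization_vjp(θ, ȳ)  # ≈ [2θ[1], 2θ[2]], since E(θ) = μ² + σ² for this toy
```
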
### Catalogue

## Probability gradients

In addition to the expectation, we may also want gradients for individual output probabilities $q(y | \theta)$.

### REINFORCE probability gradients

The REINFORCE technique can be applied in a similar way:

$$ q(y | \theta) = \mathbb{E}[\mathbf{1}\{f(X) = y\}] = \int \mathbf{1} \{f(x) = y\} ~ p(x | \theta) ~ \mathrm{d}x $$

Differentiating through the integral,

$$ \begin{aligned}
\nabla_\theta q(y | \theta)
& = \int \mathbf{1} \{f(x) = y\} ~ \nabla_\theta p(x | \theta) ~ \mathrm{d}x \\
& = \mathbb{E} [\mathbf{1} \{f(X) = y\} ~ \nabla_\theta \log p(X | \theta)]
\end{aligned} $$

The Monte-Carlo approximation for this is

$$ \nabla_\theta q(y | \theta) \simeq \frac{1}{S} \sum_{s=1}^S \mathbf{1} \{f(x_s) = y\} ~ \nabla_\theta \log p(x_s | \theta) $$

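For this estimator to be meaningful, $f$ has to take values in a discrete set. Here is a small sketch with a hypothetical binary-valued map `f_disc` and the same Gaussian sampling distribution $\mathcal{N}(\theta, I)$ as in the earlier toy examples:

```julia
using Statistics

f_disc(x) = x[1] > 0 ? 1 : 0         # hypothetical discrete-valued map ℝⁿ → {0, 1}
sample_x(θ) = θ .+ randn(length(θ))  # X ~ N(θ, I), so ∇_θ log p(x | θ) = x - θ

# REINFORCE Monte-Carlo estimate of ∇_θ q(y | θ)
function reinforce_probability_gradient(θ, y; S=100_000)
    return mean((f_disc(x) == y) .* (x .- θ) for x in (sample_x(θ) for _ in 1:S))
end

θ = [0.5, -1.0]
reinforce_probability_gradient(θ, 1)  # first entry ≈ standard normal pdf at θ[1], second ≈ 0
```
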
### Reparametrization probability gradients

To leverage reparametrization, we perform a change of variables:

$$ q(y | \theta) = \mathbb{E}[\mathbf{1}\{h_\theta(Z) = y\}] = \int \mathbf{1} \{h_\theta(z) = y\} ~ r(z) ~ \mathrm{d}z $$

Assuming that $h_\theta$ is invertible, we take $z = h_\theta^{-1}(u)$ and

$$ \mathrm{d}z = |\partial h_{\theta}^{-1}(u)| ~ \mathrm{d}u $$

so that

$$ q(y | \theta) = \int \mathbf{1} \{u = y\} ~ r(h_\theta^{-1}(u)) ~ |\partial h_{\theta}^{-1}(u)| ~ \mathrm{d}u $$

We can now differentiate, but it gets tedious.

## Bibliography

```@bibliography
```