# Background

Most of the math below is taken from [mohamedMonteCarloGradient2020](@citet).

Consider a function $f: \mathbb{R}^n \to \mathbb{R}^m$, a parameter $\theta \in \mathbb{R}^d$ and a parametric probability distribution $p(\theta)$ on the input space.
Given a random variable $X \sim p(\theta)$, we want to differentiate the expectation of $Y = f(X)$ with respect to $\theta$:

$$
E(\theta) = \mathbb{E}[f(X)] = \int f(x) ~ p(x | \theta) ~ \mathrm{d} x
$$

Usually this expectation is approximated with Monte-Carlo sampling: given i.i.d. samples $x_1, \dots, x_S \sim p(\theta)$, we have the estimator

$$
E(\theta) \simeq \frac{1}{S} \sum_{s=1}^S f(x_s)
$$
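
As an illustration, here is a rough sketch of this estimator in plain Julia (the Gaussian model, the function $f$ and the parameter value are arbitrary choices for the example, not the package API):

```julia
# Monte-Carlo estimate of E(θ) = 𝔼[f(X)] for the example X ~ Normal(θ, 1), f(x) = x².
using Distributions, Statistics

f(x) = x^2
θ = 1.5
S = 10_000
xs = rand(Normal(θ, 1.0), S)  # i.i.d. samples x₁, …, x_S ~ p(θ)
E_hat = mean(f, xs)           # (1/S) Σₛ f(xₛ), close to θ^2 + 1 here
```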

## Autodiff

Since $E$ is a vector-to-vector function, the key quantity we want to compute is its Jacobian matrix $\partial E(\theta) \in \mathbb{R}^{m \times d}$:

$$
\partial E(\theta) = \int y ~ \nabla_\theta q(y | \theta)^\top ~ \mathrm{d} y = \int f(x) ~ \nabla_\theta p(x | \theta)^\top ~ \mathrm{d} x
$$

where $q(\cdot | \theta)$ denotes the distribution of the output $Y = f(X)$.
However, to implement automatic differentiation, we only need the vector-Jacobian product (VJP) $\partial E(\theta)^\top \bar{y}$ with an output cotangent $\bar{y} \in \mathbb{R}^m$.
See the book by [blondelElementsDifferentiableProgramming2024](@citet) to know more.

Our goal is to rephrase this VJP as an expectation, so that we may approximate it with Monte-Carlo sampling as well.
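
As a rough sketch (assuming ChainRulesCore as the autodiff interface; `estimate_E` and `estimate_vjp` are hypothetical stand-ins for Monte-Carlo estimators of $E(\theta)$ and $\partial E(\theta)^\top \bar{y}$, such as the ones derived below), a custom reverse rule only has to supply this VJP:

```julia
using ChainRulesCore

estimate_E(θ) = zeros(2)       # placeholder for (1/S) Σₛ f(xₛ)
estimate_vjp(θ, ȳ) = zero(θ)   # placeholder for an estimate of ∂E(θ)ᵀ ȳ

function ChainRulesCore.rrule(::typeof(estimate_E), θ)
    y = estimate_E(θ)
    pullback(ȳ) = (NoTangent(), estimate_vjp(θ, ȳ))  # cotangents for (estimate_E, θ)
    return y, pullback
end
```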

## REINFORCE

Implemented by [`Reinforce`](@ref).

### Score function

The REINFORCE estimator is derived with the help of the identity $\nabla \log u = \nabla u / u$:

$$
\begin{aligned}
\partial E(\theta)
& = \int f(x) ~ \nabla_\theta p(x | \theta)^\top ~ \mathrm{d}x \\
& = \int f(x) ~ p(x | \theta) \nabla_\theta \log p(x | \theta)^\top ~ \mathrm{d}x \\
& = \mathbb{E} \left[f(X) \nabla_\theta \log p(X | \theta)^\top\right]
\end{aligned}
$$

And the VJP:

$$
\partial E(\theta)^\top \bar{y} = \mathbb{E} \left[f(X)^\top \bar{y} ~ \nabla_\theta \log p(X | \theta)\right]
$$

Our Monte-Carlo approximation will therefore be:

$$
\partial E(\theta)^\top \bar{y} \simeq \frac{1}{S} \sum_{s=1}^S f(x_s)^\top \bar{y} ~ \nabla_\theta \log p(x_s | \theta)
$$
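
As an illustration, here is a rough Julia sketch of this estimator (not the package API); the Gaussian model, the function $f$ and the use of ForwardDiff for the score $\nabla_\theta \log p(x | \theta)$ are arbitrary choices for the example:

```julia
using Distributions, ForwardDiff, LinearAlgebra, Statistics

f(x) = [x, x^2]                         # example function with m = 2 outputs
logp(x, θ) = logpdf(Normal(θ, 1.0), x)  # log-density of the example model p(θ)

function reinforce_vjp(θ, ȳ; S=10_000)
    xs = rand(Normal(θ, 1.0), S)
    # (1/S) Σₛ (f(xₛ)ᵀ ȳ) ∇_θ log p(xₛ | θ)
    return mean(dot(f(x), ȳ) * ForwardDiff.derivative(t -> logp(x, t), θ) for x in xs)
end

reinforce_vjp(1.5, [1.0, 0.0])          # ≈ d/dθ 𝔼[X] = 1 for this example
```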

### Variance reduction

The REINFORCE estimator has high variance, which can be reduced by subtracting a so-called baseline $b = \frac{1}{S} \sum_{s=1}^S f(x_s)$ [koolBuyREINFORCESamples2022](@citep).

For $S > 1$ Monte-Carlo samples, we have

$$
\begin{aligned}
\partial E(\theta)^\top \bar{y}
& \simeq \frac{1}{S} \sum_{s=1}^S \left(f(x_s) - \frac{1}{S - 1}\sum_{j \neq s} f(x_j) \right)^\top \bar{y} ~ \nabla_\theta\log p(x_s | \theta)\\
& = \frac{1}{S - 1}\sum_{s=1}^S (f(x_s) - b)^\top \bar{y} ~ \nabla_\theta\log p(x_s | \theta)
\end{aligned}
$$
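
Continuing the previous sketch (same hypothetical `f`, `logp` and Gaussian model, still not the package API), the baseline-corrected estimator looks like this:

```julia
function reinforce_vjp_baseline(θ, ȳ; S=10_000)
    xs = rand(Normal(θ, 1.0), S)
    ys = f.(xs)
    b = mean(ys)                   # baseline b = (1/S) Σₛ f(xₛ)
    # (1/(S-1)) Σₛ ((f(xₛ) - b)ᵀ ȳ) ∇_θ log p(xₛ | θ)
    return sum(dot(y - b, ȳ) * ForwardDiff.derivative(t -> logp(x, t), θ)
               for (x, y) in zip(xs, ys)) / (S - 1)
end

reinforce_vjp_baseline(1.5, [1.0, 0.0])   # same target as before, lower variance
```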

## Reparametrization

Implemented by [`Reparametrization`](@ref).

### Trick

The reparametrization trick assumes that we can rewrite the random variable $X \sim p(\theta)$ as $X = g_\theta(Z)$, where $Z \sim r$ is another random variable whose distribution $r$ does not depend on $\theta$.

The expectation is rewritten with $h_\theta = f \circ g_\theta$:

$$
E(\theta) = \mathbb{E}\left[ f(g_\theta(Z)) \right] = \mathbb{E}\left[ h_\theta(Z) \right]
$$

And we can directly differentiate through the expectation:

$$
\partial E(\theta) = \mathbb{E} \left[ \partial_\theta h_\theta(Z) \right]
$$

This yields the VJP:

$$
\partial E(\theta)^\top \bar{y} = \mathbb{E} \left[ \partial_\theta h_\theta(Z)^\top \bar{y} \right]
$$

We can use a Monte-Carlo approximation with i.i.d. samples $z_1, \dots, z_S \sim r$:

$$
\partial E(\theta)^\top \bar{y} \simeq \frac{1}{S} \sum_{s=1}^S \partial_\theta h_\theta(z_s)^\top \bar{y}
$$
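
As an illustration, here is a rough Julia sketch of this estimator (not the package API); the reparametrization $g_\theta(z) = \theta_1 + \theta_2 z$ of a Gaussian and the function $f$ are arbitrary choices for the example:

```julia
using ForwardDiff, Statistics

f(x) = [x, x^2]                 # example function with m = 2 outputs
g(θ, z) = θ[1] + θ[2] * z       # reparametrization X = g_θ(Z) with Z ~ N(0, 1)
h(θ, z) = f(g(θ, z))            # h_θ = f ∘ g_θ

function reparametrization_vjp(θ, ȳ; S=10_000)
    zs = randn(S)               # i.i.d. samples z₁, …, z_S ~ N(0, 1)
    # (1/S) Σₛ ∂_θ h_θ(zₛ)ᵀ ȳ
    return mean(ForwardDiff.jacobian(t -> h(t, z), θ)' * ȳ for z in zs)
end

reparametrization_vjp([1.5, 1.0], [0.0, 1.0])   # ≈ ∇_θ 𝔼[X²] = [2μ, 2σ] = [3, 2]
```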

### Catalogue

The following reparametrizations are implemented:

- Univariate Normal: $X \sim \mathcal{N}(\mu, \sigma^2)$ is equivalent to $X = \mu + \sigma Z$ with $Z \sim \mathcal{N}(0, 1)$.
- Multivariate Normal: $X \sim \mathcal{N}(\mu, \Sigma)$ is equivalent to $X = \mu + L Z$ with $Z \sim \mathcal{N}(0, I)$ and $L L^\top = \Sigma$. The matrix $L$ can be obtained by Cholesky decomposition of $\Sigma$, as sketched below.
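
A minimal sketch of the multivariate case (assuming $\Sigma$ is positive definite; not the package API):

```julia
using LinearAlgebra

μ = [1.0, -1.0]
Σ = [2.0 0.5; 0.5 1.0]
L = cholesky(Σ).L       # lower-triangular L with L * L' == Σ
z = randn(2)            # Z ~ N(0, I)
x = μ + L * z           # X = μ + L Z ~ N(μ, Σ)
```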

## Probability gradients

In addition to the expectation, we may also want gradients for the probabilities of individual outputs $q(y | \theta) = \mathbb{P}(f(X) = y)$, which are meaningful when the output $Y = f(X)$ is discrete.

### REINFORCE probability gradients

The REINFORCE technique can be applied in a similar way:

$$
q(y | \theta) = \mathbb{E}[\mathbf{1}\{f(X) = y\}] = \int \mathbf{1} \{f(x) = y\} ~ p(x | \theta) ~ \mathrm{d}x
$$

Differentiating through the integral,

$$
\begin{aligned}
\nabla_\theta q(y | \theta)
& = \int \mathbf{1} \{f(x) = y\} ~ \nabla_\theta p(x | \theta) ~ \mathrm{d}x \\
& = \mathbb{E} [\mathbf{1} \{f(X) = y\} ~ \nabla_\theta \log p(X | \theta)]
\end{aligned}
$$

The Monte-Carlo approximation for this is

$$
\nabla_\theta q(y | \theta) \simeq \frac{1}{S} \sum_{s=1}^S \mathbf{1} \{f(x_s) = y\} ~ \nabla_\theta \log p(x_s | \theta)
$$
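
As an illustration, here is a rough Julia sketch (not the package API) for the hypothetical discrete case $X \sim \mathrm{Poisson}(\theta)$ with $f(x) = x$, so that $q(y | \theta) = \mathbb{P}(X = y)$ and the score is $\nabla_\theta \log p(x | \theta) = x / \theta - 1$:

```julia
using Distributions, Statistics

score(x, θ) = x / θ - 1        # ∇_θ log p(x | θ) for a Poisson(θ) likelihood

function reinforce_probability_grad(y, θ; S=100_000)
    xs = rand(Poisson(θ), S)
    # (1/S) Σₛ 1{f(xₛ) = y} ∇_θ log p(xₛ | θ)
    return mean((x == y) * score(x, θ) for x in xs)
end

reinforce_probability_grad(2, 3.0)   # ≈ d/dθ P(X = 2) for X ~ Poisson(θ) at θ = 3
```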

### Reparametrization probability gradients

To leverage reparametrization, we perform a change of variables:

$$
q(y | \theta) = \mathbb{E}[\mathbf{1}\{h_\theta(Z) = y\}] = \int \mathbf{1} \{h_\theta(z) = y\} ~ r(z) ~ \mathrm{d}z
$$

Assuming that $h_\theta$ is invertible, we take $z = h_\theta^{-1}(u)$ and

$$
\mathrm{d}z = |\det \partial h_{\theta}^{-1}(u)| ~ \mathrm{d}u
$$

so that

$$
q(y | \theta) = \int \mathbf{1} \{u = y\} ~ r(h_\theta^{-1}(u)) ~ |\det \partial h_{\theta}^{-1}(u)| ~ \mathrm{d}u
$$

We can now differentiate, but it gets tedious in general (a simple univariate example is sketched below).
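
For intuition, here is a rough sketch in the simplest univariate case $h_\theta(z) = \theta_1 + \theta_2 z$ with $Z \sim \mathcal{N}(0, 1)$, $f$ the identity and $\theta_2 > 0$ (an arbitrary example, not the package API), where $h_\theta^{-1}(u) = (u - \theta_1) / \theta_2$ and $|\det \partial h_\theta^{-1}(u)| = 1 / \theta_2$:

```julia
using ForwardDiff

r(z) = exp(-z^2 / 2) / sqrt(2π)        # density of Z ~ N(0, 1)
q(y, θ) = r((y - θ[1]) / θ[2]) / θ[2]  # q(y | θ) after the change of variables

# Gradient of the output density with respect to θ = (μ, σ)
grad_q(y, θ) = ForwardDiff.gradient(t -> q(y, t), θ)

grad_q(0.5, [0.0, 1.0])
```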

## Bibliography

```@bibliography
```