diff --git a/docs/autoregressive/index.html b/docs/autoregressive/index.html index 80afa71..56af57c 100644 --- a/docs/autoregressive/index.html +++ b/docs/autoregressive/index.html @@ -77,19 +77,18 @@

Autoregressive models

-

We begin our study into generative modeling with autoregressive models. As before, we assume we are given access to a dataset of -dimensional datapoints . For simplicity, we assume the datapoints are binary, i.e., .

+

We begin our study into generative modeling with autoregressive models. As before, we assume we are given access to a dataset \(\mathcal{D}\) of \(n\)-dimensional datapoints \(\mathbf{x}\). For simplicity, we assume the datapoints are binary, i.e., \(\mathbf{x} \in \{0,1\}^n\).

Representation

-

By the chain rule of probability, we can factorize the joint distribution over the -dimensions as

+

By the chain rule of probability, we can factorize the joint distribution over the \(n\)-dimensions as

-

where denotes the vector of random variables with index less than .

+

where \(\mathbf{x}_{< i}=[x_1, x_2, \ldots, x_{i-1}]\) denotes the vector of random variables with index less than \(i\).

The chain rule factorization can be expressed graphically as a Bayesian network.

@@ -101,24 +100,21 @@

Representation

Such a Bayesian network that makes no conditional independence assumptions is said to obey the autoregressive property. -The term autoregressive originates from the literature on time-series models where observations from the previous time-steps are used to predict the value at the current time step. Here, we fix an ordering of the variables and the distribution for the -th random variable depends on the values of all the preceeding random variables in the chosen ordering .

+The term autoregressive originates from the literature on time-series models where observations from the previous time-steps are used to predict the value at the current time step. Here, we fix an ordering of the variables \(x_1, x_2, \ldots, x_n\) and the distribution for the \(i\)-th random variable depends on the values of all the preceding random variables in the chosen ordering \(x_1, x_2, \ldots, x_{i-1}\).

-

If we allow for every conditional to be specified in a tabular form, then such a representation is fully general and can represent any possible distribution over random variables. However, the space complexity for such a representation grows exponentially with .

+

If we allow for every conditional \(p(x_i \vert \mathbf{x}_{< i})\) to be specified in a tabular form, then such a representation is fully general and can represent any possible distribution over \(n\) random variables. However, the space complexity for such a representation grows exponentially with \(n\).

-

To see why, let us consider the conditional for the last dimension, given by . In order to fully specify this conditional, we need to specify a probability for configurations of the variables . Since the probabilities should sum to 1, the total number of parameters for specifying this conditional is given by . Hence, a tabular representation for the conditionals is impractical for learning the joint distribution factorized via chain rule.

+

To see why, let us consider the conditional for the last dimension, given by \(p(x_n \vert \mathbf{x}_{< n})\). In order to fully specify this conditional, we need to specify a probability distribution for each of the \(2^{n-1}\) configurations of the variables \(x_1, x_2, \ldots, x_{n-1}\). For any one of the \(2^{n-1}\) possible configurations of the variables, the probabilities should sum to one. Therefore, we need only one parameter for each configuration, so the total number of parameters for specifying this conditional is given by \(2^{n-1}\). Hence, a tabular representation for the conditionals is impractical for learning the joint distribution factorized via chain rule.
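
Summing this count over all \(n\) conditionals makes the exponential blow-up explicit: the fully tabular factorization requires

\sum_{i=1}^{n} 2^{i-1} = 2^n - 1

parameters in total.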

-

In an autoregressive generative model, the conditionals are specified as parameterized functions with a fixed number of parameters. That is, we assume the conditional distributions to correspond to a Bernoulli random variable and learn a function that maps the preceeding random variables to the +

In an autoregressive generative model, the conditionals are specified as parameterized functions with a fixed number of parameters. That is, we assume the conditional distributions \(p(x_i \vert \mathbf{x}_{< i})\) to correspond to a Bernoulli random variable and learn a function that maps the preceding random variables \(x_1, x_2, \ldots, x_{i-1}\) to the mean of this distribution. Hence, we have

-

where denotes the set of parameters used to specify the mean -function .

+

where \(\theta_i\) denotes the set of parameters used to specify the mean function \(f_i: \{0,1\}^{i-1}\rightarrow [0,1]\).

-

The number of parameters of an autoregressive generative model are given by . As we shall see in the examples below, the number of parameters are much fewer than the tabular setting considered previously. Unlike the tabular setting however, an autoregressive generative model cannot represent all possible distributions. Its expressiveness is limited by the fact that we are limiting the conditional distributions to correspond to a Bernoulli random variable with the mean specified via a restricted class of parameterized functions.

+

The number of parameters of an autoregressive generative model is given by \(\sum_{i=1}^n \vert \theta_i \vert\). As we shall see in the examples below, this is much smaller than in the tabular setting considered previously. Unlike the tabular setting, however, an autoregressive generative model cannot represent all possible distributions. Its expressiveness is limited because we require each conditional distribution to correspond to a Bernoulli random variable whose mean is specified via a restricted class of parameterized functions.

drawing @@ -132,17 +128,17 @@

Representation

f_i(x_1, x_2, \ldots, x_{i-1}) =\sigma(\alpha^{(i)}_0 + \alpha^{(i)}_1 x_1 + \ldots + \alpha^{(i)}_{i-1} x_{i-1}) -

where denotes the sigmoid function and denote the parameters of the mean function. The conditional for variable requires parameters, and hence the total number of parameters in the model is given by . Note that the number of parameters are much fewer than the exponential complexity of the tabular case.

+

where \(\sigma\) denotes the sigmoid function and \(\theta_i=\{\alpha^{(i)}_0,\alpha^{(i)}_1, \ldots, \alpha^{(i)}_{i-1}\}\) denote the parameters of the mean function. The conditional for variable \(i\) requires \(i\) parameters, and hence the total number of parameters in the model is given by \(\sum_{i=1}^n i = O(n^2)\). Note that the number of parameters is much smaller than the exponential complexity of the tabular case.
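
As a concrete illustration, here is a minimal NumPy sketch of this logistic parameterization (a fully-visible sigmoid belief network); the helper names `sigmoid` and `log_likelihood` are illustrative, not part of the original text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_likelihood(x, alpha):
    """Log-probability of one binary vector x, where alpha[i] holds the
    parameters [alpha_0, alpha_1, ..., alpha_i] of the i-th conditional."""
    ll = 0.0
    for i in range(len(x)):
        # mean of the Bernoulli conditional p(x_i = 1 | x_{<i})
        mu = sigmoid(alpha[i][0] + alpha[i][1:] @ x[:i])
        ll += x[i] * np.log(mu) + (1 - x[i]) * np.log(1 - mu)
    return ll

# toy usage: n = 3 binary variables with randomly initialized parameters
rng = np.random.default_rng(0)
alpha = [rng.normal(size=i + 1) for i in range(3)]
print(log_likelihood(np.array([1.0, 0.0, 1.0]), alpha))
```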

-

A natural way to increase the expressiveness of an autoregressive generative model is to use more flexible parameterizations for the mean function e.g., multi-layer perceptrons (MLP). For example, consider the case of a neural network with 1 hidden layer. The mean function for variable can be expressed as

+

A natural way to increase the expressiveness of an autoregressive generative model is to use more flexible parameterizations for the mean function, e.g., multi-layer perceptrons (MLPs). For example, consider the case of a neural network with one hidden layer. The mean function for variable \(i\) can be expressed as

-

where denotes the hidden layer activations for the MLP and -are the set of parameters for the mean function . The total number of parameters in this model is dominated by the matrices and given by .

+

where \(\mathbf{h}_i \in \mathbb{R}^d\) denotes the hidden layer activations for the MLP and \(\theta_i = \{A_i \in \mathbb{R}^{d\times (i-1)}, \mathbf{c}_i \in \mathbb{R}^d, \boldsymbol{\alpha}^{(i)}\in \mathbb{R}^d, b_i \in \mathbb{R}\}\) is the set of parameters for the mean function \(\mu_i(\cdot)\). The total number of parameters in this model is dominated by the matrices \(A_i\) and given by \(O(n^2 d)\).

drawing @@ -157,26 +153,26 @@

Representation

\mathbf{h}_i = \sigma(W_{., < i} \mathbf{x_{< i}} + \mathbf{c})\\ f_i(x_1, x_2, \ldots, x_{i-1}) =\sigma(\boldsymbol{\alpha}^{(i)}\mathbf{h}_i +b_i ) -

where is -the full set of parameters for the mean functions . The weight matrix and the bias vector are shared across the conditionals. Sharing parameters offers two benefits:

+

where \(\theta=\{W\in \mathbb{R}^{d\times n}, \mathbf{c} \in \mathbb{R}^d, \{\boldsymbol{\alpha}^{(i)}\in \mathbb{R}^d\}^n_{i=1}, \{b_i \in \mathbb{R}\}^n_{i=1}\}\) is the full set of parameters for the mean functions \(f_1(\cdot), f_2(\cdot), \ldots, f_n(\cdot)\). The weight matrix \(W\) and the bias vector \(\mathbf{c}\) are shared across the conditionals. Sharing parameters offers two benefits:

  1. -

    The total number of parameters gets reduced from to [readers are encouraged to check!].

    +

    The total number of parameters gets reduced from \(O(n^2 d)\) to \(O(nd)\) [readers are encouraged to check!].

  2. -

    The hidden unit activations can be evaluated in time via the following recursive strategy:

    +

    The hidden unit activations can be evaluated in \(O(nd)\) time via the following recursive strategy:

    -

    with the base case given by .

    +

    with the base case given by \(\mathbf{a}_1=\mathbf{c}\); a short code sketch of this recursive evaluation follows the list.
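
To make the shared-parameter computation concrete, here is a minimal NumPy sketch of a NADE-style forward pass that evaluates all \(n\) Bernoulli means in \(O(nd)\) time using the recursion above; the function name `nade_means` is illustrative, while `W`, `c`, `alpha`, and `b` follow the notation in the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nade_means(x, W, c, alpha, b):
    """Return the Bernoulli means f_1, ..., f_n for a binary input vector x.

    W: (d, n) shared weight matrix, c: (d,) shared bias,
    alpha: (n, d) per-conditional output weights, b: (n,) output biases.
    """
    n = len(x)
    mu = np.zeros(n)
    a = c.copy()                        # base case: a_1 = c
    for i in range(n):
        h = sigmoid(a)                  # h_i = sigma(W_{.,<i} x_{<i} + c)
        mu[i] = sigmoid(alpha[i] @ h + b[i])
        a = a + W[:, i] * x[i]          # recursion: a_{i+1} = a_i + W_{.,i} x_i
    return mu

# toy usage with n = 4 variables and d = 8 hidden units
rng = np.random.default_rng(0)
n, d = 4, 8
print(nade_means(np.array([1.0, 0.0, 1.0, 1.0]),
                 rng.normal(size=(d, n)), rng.normal(size=d),
                 rng.normal(size=(n, d)), rng.normal(size=n)))
```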

Extensions to NADE

-

The RNADE algorithm extends NADE to learn generative models over real-valued data. Here, the conditionals are modeled via a continuous distribution such as a equi-weighted mixture of Gaussians. Instead of learning a mean function, we know learn the means and variances of the Gaussians for every conditional. For statistical and computational efficiency, a single function outputs all the means and variances of the Gaussians for the -th conditional distribution.

+

The RNADE algorithm extends NADE to learn generative models over real-valued data. Here, the conditionals are modeled via a continuous distribution such as an equi-weighted mixture of \(K\) Gaussians. Instead of learning a mean function, we now learn the means \(\mu_{i,1}, \mu_{i,2},\ldots, \mu_{i,K}\) and variances \(\Sigma_{i,1}, \Sigma_{i,2},\ldots, \Sigma_{i,K}\) of the \(K\) Gaussians for every conditional. For statistical and computational efficiency, a single function \(g_i: \mathbb{R}^{i-1}\rightarrow\mathbb{R}^{2K}\) outputs all the means and variances of the \(K\) Gaussians for the \(i\)-th conditional distribution.
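
As a sketch of how a single RNADE conditional could be evaluated, the snippet below computes the log-density of \(x_i\) under an equi-weighted mixture of \(K\) Gaussians, assuming \(g_i\) outputs \(K\) means and \(K\) log standard deviations; this particular output parameterization and the helper name `mixture_log_prob` are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def mixture_log_prob(x_i, means, log_stds):
    """Log-density of scalar x_i under an equi-weighted mixture of K Gaussians,
    whose parameters (means, log_stds) would be produced by g_i(x_{<i})."""
    K = len(means)
    # log p(x_i | x_{<i}) = logsumexp_k [ log(1/K) + log N(x_i; mu_k, sigma_k^2) ]
    comp = norm.logpdf(x_i, loc=means, scale=np.exp(log_stds)) - np.log(K)
    m = comp.max()
    return m + np.log(np.exp(comp - m).sum())

print(mixture_log_prob(0.3, np.array([0.0, 1.0]), np.array([0.0, -1.0])))
```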

Notice that NADE requires specifying a single, fixed ordering of the variables. The choice of ordering can lead to different models. The EoNADE algorithm allows training an ensemble of NADE models with different orderings.

@@ -189,33 +185,32 @@

Learning and inference

(p_{\mathrm{data}}, p_{\theta}) = \mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}} }\left[\log p_{\mathrm{data}}(\mathbf{x}) - \log p_{\theta}(\mathbf{x})\right] -

Before moving any further, we make two comments about the KL divergence. First, we note that the KL divergence between any two distributions is asymmetric. As we navigate through this chapter, the reader is encouraged to think what could go wrong if we decided to optimize the reverse KL divergence instead. Secondly, the KL divergences heavily penalizes any model distribution which assigns low probability to a datapoint that is likely to be sampled under . In the extreme case, if the density evaluates to zero for a datapoint sampled from , the objective evaluates to .

+

Before moving any further, we make two comments about the KL divergence. First, we note that the KL divergence between any two distributions is asymmetric. As we navigate through this chapter, the reader is encouraged to think about what could go wrong if we decided to optimize the reverse KL divergence instead. Secondly, the KL divergence heavily penalizes any model distribution \(p_\theta\) which assigns low probability to a datapoint that is likely to be sampled under \(p_{\mathrm{data}}\). In the extreme case, if the density \(p_\theta(\mathbf{x})\) evaluates to zero for a datapoint sampled from \(p_{\mathrm{data}}\), the objective evaluates to \(+\infty\).

-

Since does not depend on , we can equivalently recover the optimal parameters via maximizing likelihood estimation.

+

Since \(p_{\mathrm{data}}\) does not depend on \(\theta\), we can equivalently recover the optimal parameters via maximum likelihood estimation.

-

Here, is referred to as the log-likelihood of the datapoint with respect to the model distribution .

+

Here, \(\log p_{\theta}(\mathbf{x})\) is referred to as the log-likelihood of the datapoint \(\mathbf{x}\) with respect to the model distribution \(p_\theta\).

-

To approximate the expectation over the unknown , we make an assumption: points in the dataset are sampled i.i.d. from . This allows us to obtain an unbiased Monte Carlo estimate of the objective as

+

To approximate the expectation over the unknown \(p_{\mathrm{data}}\), we make an assumption: points in the dataset \(\mathcal{D}\) are sampled i.i.d. from \(p_{\mathrm{data}}\). This allows us to obtain an unbiased Monte Carlo estimate of the objective as

-

The maximum likelihood estimation (MLE) objective has an intuitive interpretation: pick the model parameters that maximize the log-probability of the observed datapoints in .

+

The maximum likelihood estimation (MLE) objective has an intuitive interpretation: pick the model parameters \(\theta \in \mathcal{M}\) that maximize the log-probability of the observed datapoints in \(\mathcal{D}\).

-

In practice, we optimize the MLE objective using mini-batch gradient ascent. The algorithm operates in iterations. At every iteration , we sample a mini-batch of datapoints sampled randomly from the dataset () and compute gradients of the objective evaluated for the mini-batch. These parameters at iteration are then given via the following update rule

+

In practice, we optimize the MLE objective using mini-batch gradient ascent. The algorithm operates in iterations. At every iteration \(t\), we sample a mini-batch \(\mathcal{B}_t\) of datapoints drawn randomly from the dataset (\(\vert \mathcal{B}_t\vert < \vert \mathcal{D} \vert\)) and compute gradients of the objective evaluated for the mini-batch. The parameters at iteration \(t+1\) are then given via the following update rule

-

where and are the parameters at iterations and respectively, and is the learning rate at iteration . Typically, we only specify the initial learning rate and update the rate based on a schedule. Variants of stochastic gradient ascent, such as RMS prop and Adam, employ modified update rules that work slightly better in practice.

+

where \(\theta^{(t+1)}\) and \(\theta^{(t)}\) are the parameters at iterations \(t+1\) and \(t\) respectively, and \(r_t\) is the learning rate at iteration \(t\). Typically, we only specify the initial learning rate \(r_1\) and update the rate based on a schedule. Variants of stochastic gradient ascent, such as RMSProp and Adam, employ modified update rules that work slightly better in practice.
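
The update rule above translates directly into a standard training loop. The sketch below uses PyTorch's Adam optimizer on a generic `model` exposing a `log_prob` method; the model class and data loader are assumed placeholders rather than objects defined in these notes (optimizers minimize, so we negate the average log-likelihood).

```python
import torch

def fit(model, loader, num_epochs=10, lr=1e-3):
    """Maximize the Monte Carlo MLE objective with mini-batch updates."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(num_epochs):
        for x in loader:                       # x: (batch, n) mini-batch B_t
            loss = -model.log_prob(x).mean()   # negative average log-likelihood
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                   # gradient step on theta
    return model
```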

-

From a practical standpoint, we must think about how to choose hyperaparameters (such as the initial learning rate) and a stopping criteria for the gradient descent. For both these questions, we follow the standard practice in machine learning of monitoring the objective on a validation dataset. Consequently, we choose the hyperparameters with the best performance on the validation dataset and stop updating the parameters when the validation log-likelihoods cease to improve1.

+

From a practical standpoint, we must think about how to choose hyperparameters (such as the initial learning rate) and a stopping criterion for the gradient ascent procedure. For both these questions, we follow the standard practice in machine learning of monitoring the objective on a validation dataset. Consequently, we choose the hyperparameters with the best performance on the validation dataset and stop updating the parameters when the validation log-likelihoods cease to improve1.

Now that we have a well-defined objective and optimization procedure, the only remaining task is to evaluate the objective in the context of an autoregressive generative model. To this end, we substitute the factorized joint distribution of an autoregressive model in the MLE objective to get

@@ -223,14 +218,12 @@

Learning and inference

\max_{\theta \in \mathcal{M}}\frac{1}{\vert D \vert} \sum_{\mathbf{x} \in\mathcal{D} }\sum_{i=1}^n\log p_{\theta_i}(x_i \vert \mathbf{x}_{< i}) -

where now denotes the +

where \(\theta = \{\theta_1, \theta_2, \ldots, \theta_n\}\) now denotes the collective set of parameters for the conditionals.

-

Inference in an autoregressive model is straightforward. For density estimation of an arbitrary point , we simply evaluate the log-conditionals for each and add these up to obtain the log-likelihood assigned by the model to . Since we know conditioning vector , each of the conditionals can be evaluated in parallel. Hence, density estimation is efficient on modern hardware.

+

Inference in an autoregressive model is straightforward. For density estimation of an arbitrary point \(\mathbf{x}\), we simply evaluate the log-conditionals \(\log p_{\theta_i}(x_i \vert \mathbf{x}_{< i})\) for each \(i\) and add these up to obtain the log-likelihood assigned by the model to \(\mathbf{x}\). Since we know the conditioning vector \(\mathbf{x}\), each of the conditionals can be evaluated in parallel. Hence, density estimation is efficient on modern hardware.

-

Sampling from an autoregressive model is a sequential procedure. Here, we first sample , then we sample conditioned on the sampled , followed by conditioned on both and and so on until we sample conditioned on the previously sampled . For applications requiring real-time generation of high-dimensional data such as audio synthesis, the sequential sampling can be an expensive process. Later in this course, we will discuss how parallel Wavenet, an autoregressive model sidesteps this expensive sampling process.

+

Sampling from an autoregressive model is a sequential procedure. Here, we first sample \(x_1\), then we sample \(x_2\) conditioned on the sampled \(x_1\), followed by \(x_3\) conditioned on both \(x_1\) and \(x_2\), and so on until we sample \(x_n\) conditioned on the previously sampled \(\mathbf{x}_{< n}\). For applications requiring real-time generation of high-dimensional data such as audio synthesis, the sequential sampling can be an expensive process. Later in this course, we will discuss how parallel WaveNet, an autoregressive model, sidesteps this expensive sampling process.
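
A minimal sketch of this ancestral sampling loop is given below; it assumes a callable `bernoulli_mean(i, x_prev)` that evaluates \(f_i(\mathbf{x}_{< i})\), standing in for whichever parameterization (logistic, MLP, or NADE) is used.

```python
import numpy as np

def sample(n, bernoulli_mean, rng=np.random.default_rng()):
    """Draw one sample x = (x_1, ..., x_n) by ancestral sampling."""
    x = np.zeros(n)
    for i in range(n):
        mu_i = bernoulli_mean(i, x[:i])   # f_i(x_{<i})
        x[i] = rng.random() < mu_i        # x_i ~ Bernoulli(mu_i)
    return x
```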

@@ -244,10 +237,10 @@

Learning and inference

Footnotes

-
+
  1. -

    Given the non-convex nature of such problems, the optimization procedure can get stuck in local optima. Hence, early stopping will generally not be optimal but is a very practical strategy. 

    +

    Given the non-convex nature of such problems, the optimization procedure can get stuck in local optima. Hence, early stopping will generally not be optimal but is a very practical strategy. 

@@ -283,8 +276,8 @@

Footnotes

- -Site created with Jekyll using the Tufte theme. © 2018 + +Site created with Jekyll using the Tufte theme. © 2025
diff --git a/docs/autoregressive/index.tex b/docs/autoregressive/index.tex index bc157b2..4972ba5 100644 --- a/docs/autoregressive/index.tex +++ b/docs/autoregressive/index.tex @@ -13,7 +13,7 @@ \section{Representation} \] where $\mathbf{x}_{
-

In the model above, and denote the latent and observed variables respectively. The joint distribution expressed by this model is given as

+

In the model above, \(\bz\) and \(\bx\) denote the latent and observed variables respectively. The joint distribution expressed by this model is given as

-

From a generative modeling perspective, this model describes a generative process for the observed data using the following procedure

+

From a generative modeling perspective, this model describes a generative process for the observed data \(\bx\) using the following procedure

-

If one adopts the belief that the latent variables somehow encode semantically meaningful information about , it is natural to view this generative process as first generating the “high-level” semantic information about first before fully generating . Such a perspective motivates generative models with rich latent variable structures such as hierarchical generative models —where information about is generated hierarchically—and temporal models such as the Hidden Markov Model—where temporally-related high-level information is generated first before constructing .

+

If one adopts the belief that the latent variables \(\bz\) somehow encode semantically meaningful information about \(\bx\), it is natural to view this generative process as first generating the “high-level” semantic information about \(\bx\) before fully generating \(\bx\). Such a perspective motivates generative models with rich latent variable structures such as hierarchical generative models \(p(\bx, \bz_1, \ldots, \bz_m) = p(\bx \giv \bz_1)\prod_i p(\bz_i \giv \bz_{i+1})\)—where information about \(\bx\) is generated hierarchically—and temporal models such as the Hidden Markov Model—where temporally-related high-level information is generated first before constructing \(\bx\).

-

We now consider a family of distributions where describes a probability distribution over . Next, consider a family of conditional distributions where describes a conditional probability distribution over given . Then our hypothesis class of generative models is the set of all possible combinations

+

We now consider a family of distributions \(\P_\bz\) where \(p(\bz) \in \P_\bz\) describes a probability distribution over \(\bz\). Next, consider a family of conditional distributions \(\P_{\bx\giv \bz}\) where \(p_\theta(\bx \giv \bz) \in \P_{\bx\giv \bz}\) describes a conditional probability distribution over \(\bx\) given \(\bz\). Then our hypothesis class of generative models is the set of all possible combinations

-

Given a dataset , we are interested in the following learning and inference tasks

+

Given a dataset \(\D = \set{\bx^{(1)}, \ldots, \bx^{(n)}}\), we are interested in the following learning and inference tasks

@@ -149,7 +149,7 @@

Representation

Learning Directed Latent Variable Models

-

One way to measure how closely fits the observed dataset is to measure the Kullback-Leibler (KL) divergence between the data distribution (which we denote as ) and the model’s marginal distribution . The distribution that ``best’’ fits the data is thus obtained by minimizing the KL divergence.

+

One way to measure how closely \(p(\bx, \bz)\) fits the observed dataset \(\D\) is to measure the Kullback-Leibler (KL) divergence between the data distribution (which we denote as \(p_{\mathrm{data}}(\bx)\)) and the model’s marginal distribution \(p(\bx) = \int p(\bx, \bz) \d \bz\). The distribution that ``best’’ fits the data is thus obtained by minimizing the KL divergence.

-

As we have seen previously, optimizing an empirical estimate of the KL divergence is equivalent to maximizing the marginal log-likelihood over

+

As we have seen previously, optimizing an empirical estimate of the KL divergence is equivalent to maximizing the marginal log-likelihood \(\log p(\bx)\) over \(\D\)

-

However, it turns out this problem is generally intractable for high-dimensional as it involves an integration (or sums in the case is discrete) over all the possible latent sources of variation . One option is to estimate the objective via Monte Carlo. For any given datapoint , we can obtain the following estimate for its marginal log-likelihood

+

However, it turns out this problem is generally intractable for high-dimensional \(\bz\) as it involves an integration (or a sum when \(\bz\) is discrete) over all the possible latent sources of variation \(\bz\). One option is to estimate the objective via Monte Carlo. For any given datapoint \(\bx\), we can obtain the following estimate for its marginal log-likelihood
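
A sketch of this naive Monte Carlo estimate is shown below; `prior_sample` and `log_cond` are illustrative placeholders for \(p(\bz)\) and \(\log p_\theta(\bx \giv \bz)\). In practice the estimate has very high variance, since most \(\bz\) drawn from the prior explain a given \(\bx\) poorly.

```python
import numpy as np

def naive_log_marginal(x, prior_sample, log_cond, k=1000):
    """Estimate log p(x) = log E_{z ~ p(z)}[p(x | z)] with k samples from the prior."""
    log_w = np.array([log_cond(x, prior_sample()) for _ in range(k)])
    m = log_w.max()
    # log (1/k) sum_i p(x | z^(i)), computed stably in log-space
    return m + np.log(np.exp(log_w - m).mean())
```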

is at least as difficult as as evaluating the posterior for any latent vector since by definition .

+

Rather than maximizing the log-likelihood directly, an alternative is to construct a lower bound that is more amenable to optimization. To do so, we note that evaluating the marginal likelihood \(p(\bx)\) is at least as difficult as evaluating the posterior \(p(\bz \mid \bx)\) for any latent vector \(\bz\) since by definition \(p(\bz \mid \bx) = p(\bx, \bz) / p(\bx)\).

-

Next, we introduce a variational family of distributions that approximate the true, but intractable posterior . Further henceforth, we will assume a parameteric setting where any distribution in the model family is specified via a set of parameters and distributions in the variational family are specified via a set of parameters .

+

Next, we introduce a variational family \(\Q\) of distributions that approximate the true, but intractable posterior \(p(\bz \mid \bx)\). Henceforth, we will assume a parametric setting where any distribution in the model family \(\P_{\bx, \bz}\) is specified via a set of parameters \(\theta \in \Theta\) and distributions in the variational family \(\Q\) are specified via a set of parameters \(\lambda \in \Lambda\).

-

Given and , we note that the following relationships hold true1 for any and all variational distributions

+

Given \(\P_{\bx, \bz}\) and \(\Q\), we note that the following relationships hold true1 for any \(\bx\) and all variational distributions \(q_\lambda(\bz) \in \Q\)

-

so long as it is easy to sample from and evaluate densities for .

+

so long as it is easy to sample from and evaluate densities for \(q_\lambda(\bz)\).

-

Which variational distribution should we pick? Even though the above derivation holds for any choice of variational parameters , the tightness of the lower bound depends on the specific choice of .

+

Which variational distribution should we pick? Even though the above derivation holds for any choice of variational parameters \(\lambda\), the tightness of the lower bound depends on the specific choice of \(q\).

drawing @@ -204,9 +204,9 @@

Learning Directed Latent Varia

-

In particular, the gap between the original objective(marginal log-likelihood ) and the ELBO equals the KL divergence between the approximate posterior and the true posterior . The gap is zero when the variational distribution exactly matches .

+

In particular, the gap between the original objective (the marginal log-likelihood \(\log p_\theta(\bx)\)) and the ELBO equals the KL divergence between the approximate posterior \(q(\bz)\) and the true posterior \(p(\bz \giv \bx)\). The gap is zero when the variational distribution \(q_\lambda(\bz)\) exactly matches \(p_\theta(\bz \giv \bx)\).
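
Written out in the notation used above, this standard decomposition reads

\log p_\theta(\bx) = \ELBO(\bx; \theta, \lambda) + D_{\mathrm{KL}}(q_\lambda(\bz) \,\Vert\, p_\theta(\bz \giv \bx)),

and since the KL term is non-negative, the ELBO is indeed a lower bound on \(\log p_\theta(\bx)\).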

-

In summary, we can learn a latent variable model by maximizing the ELBO with respect to both the model parameters and the variational parameters for any given datapoint

+

In summary, we can learn a latent variable model by maximizing the ELBO with respect to both the model parameters \(\theta\) and the variational parameters \(\lambda\) for any given datapoint \(\bx\)

, the following two steps are performed.

+This inspires Black-Box Variational Inference (BBVI), a general-purpose Expectation-Maximization-like algorithm for variational learning of latent variable models, where, for each mini-batch \(\M = \set{\bx^{(1)}, \ldots, \bx^{(m)}}\), the following two steps are performed.

Step 1

-

We first do per-sample optimization of by iteratively applying the update

+

We first do per-sample optimization of \(q\) by iteratively applying the update

-

where , and denotes an unbiased estimate of the ELBO gradient. This step seeks to approximate the log-likelihood .

+

where \(\text{ELBO}(\bx; \theta, \lambda) = \Expect_{q_\lambda(\bz)} \left[\log \frac{p_\theta(\bx, \bz)}{q_\lambda(\bz)}\right]\), and \(\tilde{\nabla}_\lambda\) denotes an unbiased estimate of the ELBO gradient. This step seeks to approximate the log-likelihood \(\log p_\theta(\bx^{(i)})\).

Step 2

@@ -237,30 +237,30 @@

Black-Box Variational Inference

\theta \gets \theta + \tilde{\nabla}_\theta \sum_{i} \ELBO(\bx^{(i)}; \theta, \lambda^{(i)}), \end{align}
-

which corresponds to the step that hopefully moves closer to .

+

which corresponds to the step that hopefully moves \(p_\theta\) closer to \(p_{\mathrm{data}}\).
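
To summarize the two steps, here is a schematic sketch of one BBVI pass over the data; `grad_lambda` and `grad_theta` are placeholders for the stochastic gradient estimators discussed in the next subsection, and the particular update scheme shown is illustrative rather than a prescribed implementation.

```python
def bbvi_epoch(minibatches, theta, lam_init, grad_lambda, grad_theta,
               inner_steps=50, lr=1e-2):
    """One pass of Black-Box Variational Inference over the mini-batches."""
    for batch in minibatches:
        # Step 1: per-sample optimization of the variational parameters lambda
        lams = []
        for x in batch:
            lam = lam_init()
            for _ in range(inner_steps):
                lam = lam + lr * grad_lambda(x, theta, lam)   # ascend the ELBO in lambda
            lams.append(lam)
        # Step 2: one ascent step on the model parameters theta
        theta = theta + lr * sum(grad_theta(x, theta, lam)
                                 for x, lam in zip(batch, lams))
    return theta
```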

Gradient Estimation

-

The gradients and can be estimated via Monte Carlo sampling. While it is straightforward to construct an unbiased estimate of by simply pushing through the expectation operator, the same cannot be said for . Instead, we see that

+

The gradients \(\nabla_\lambda \ELBO\) and \(\nabla_\theta \ELBO\) can be estimated via Monte Carlo sampling. While it is straightforward to construct an unbiased estimate of \(\nabla_\theta \ELBO\) by simply pushing \(\nabla_\theta\) through the expectation operator, the same cannot be said for \(\nabla_\lambda\). Instead, we see that

-

This equality follows from the log-derivative trick (also commonly referred to as the REINFORCE trick). The full derivation involves some simple algebraic manipulations and is left as an exercise for the reader. The gradient estimator is thus

+

This equality follows from the log-derivative trick (also commonly referred to as the REINFORCE trick). The full derivation involves some simple algebraic manipulations and is left as an exercise for the reader. The gradient estimator \(\tilde{\nabla}_\lambda \ELBO\) is thus

-

However, it is often noted that this estimator suffers from high variance. One of the key contributions of the variational autoencoder paper is the reparameterization trick, which introduces a fixed, auxiliary distribution and a differentiable function such that the procedure

+

However, it is often noted that this estimator suffers from high variance. One of the key contributions of the variational autoencoder paper is the reparameterization trick, which introduces a fixed, auxiliary distribution \(p(\veps)\) and a differentiable function \(T(\veps; \lambda)\) such that the procedure

-

is equivalent to sampling from . By the Law of the Unconscious Statistician, we can see that

+

is equivalent to sampling from \(q_\lambda(\bz)\). By the Law of the Unconscious Statistician, we can see that

and in the abstract. To instantiate these objects, we consider choices of parametric distributions for , , and . A popular choice for is the unit Gaussian

+

So far, we have described \(p_\theta(\bx, \bz)\) and \(q_\lambda(\bz)\) in the abstract. To instantiate these objects, we consider choices of parametric distributions for \(p_\theta(\bz)\), \(p_\theta(\bx \giv \bz)\), and \(q_\lambda(\bz)\). A popular choice for \(p_\theta(\bz)\) is the unit Gaussian

-

in which case is simply the empty set since the prior is a fixed distribution. Another alternative often used in practice is a mixture of Gaussians with trainable mean and covariance parameters.

+

in which case \(\theta\) is simply the empty set since the prior is a fixed distribution. Another alternative often used in practice is a mixture of Gaussians with trainable mean and covariance parameters.

-

The conditional distribution is where we introduce a deep neural network. We note that a conditional distribution can be constructed by defining a distribution family (parameterized by ) in the target space (i.e. defines an unconditional distribution over ) and a mapping function . +

The conditional distribution \(p_\theta(\bx \giv \bz)\) is where we introduce a deep neural network. We note that a conditional distribution can be constructed by defining a distribution family (parameterized by \(\omega \in \Omega\)) in the target space \(\bx\) (i.e. \(p_\omega(\bx)\) defines an unconditional distribution over \(\bx\)) and a mapping function \(g_\theta: \Z \to \Omega\). -In other words, defines the conditional distribution

+In other words, \(g_\theta(\cdot)\) defines the conditional distribution

-

The function is also referred to as the decoding distribution since it maps a latent code to the parameters of a distribution over observed variables . In practice, it is typical to specify as a deep neural network.
+

The function \(g_\theta\) is also referred to as the decoding distribution since it maps a latent code \(\bz\) to the parameters of a distribution over observed variables \(\bx\). In practice, it is typical to specify \(g_\theta\) as a deep neural network.
-In the case where is a Gaussian distribution, we can thus represent it as

+In the case where \(p_\theta(\bx \giv \bz)\) is a Gaussian distribution, we can thus represent it as

-

where and are neural networks that specify the mean and covariance matrix for the Gaussian distribution over when conditioned on .

+

where \(\mu_\theta(\bz)\) and \(\Sigma_\theta(\bz)\) are neural networks that specify the mean and covariance matrix for the Gaussian distribution over \(\bx\) when conditioned on \(\bz\).
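
A minimal PyTorch sketch of such a decoder is shown below; the layer sizes and the diagonal-covariance parameterization (the network outputs a mean and a per-dimension log-variance) are illustrative choices rather than prescriptions from the text.

```python
import torch
import torch.nn as nn

class GaussianDecoder(nn.Module):
    """g_theta: maps a latent code z to the parameters (mu, diagonal Sigma) of p(x | z)."""
    def __init__(self, z_dim=2, x_dim=784, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, x_dim)
        self.log_var = nn.Linear(hidden, x_dim)   # log of the diagonal of Sigma_theta(z)

    def forward(self, z):
        h = self.net(z)
        return self.mean(h), self.log_var(h)
```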

-

Finally, the variational family for the proposal distribution needs to be chosen judiciously so that the reparameterization trick is possible. Many continuous distributions in the location-scale family can be reparameterized. In practice, a popular choice is again the Gaussian distribution, where

+

Finally, the variational family for the proposal distribution \(q_\lambda(\bz)\) needs to be chosen judiciously so that the reparameterization trick is possible. Many continuous distributions in the location-scale family can be reparameterized. In practice, a popular choice is again the Gaussian distribution, where

-

where is the Cholesky decomposition of . For simplicity, practitioners often restrict to be a diagonal matrix (which restricts the distribution family to that of factorized Gaussians).

+

where \(\Sigma^{1/2}\) is the Cholesky decomposition of \(\Sigma\). For simplicity, practitioners often restrict \(\Sigma\) to be a diagonal matrix (which restricts the distribution family to that of factorized Gaussians).
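
For the factorized (diagonal-covariance) Gaussian case, the reparameterized sampler is a one-liner; the PyTorch sketch below is a minimal illustration.

```python
import torch

def reparameterized_sample(mu, log_var):
    """Draw z ~ N(mu, diag(exp(log_var))) as z = mu + sigma * eps with eps ~ N(0, I).

    Gradients flow through mu and log_var, which is what makes the trick useful."""
    eps = torch.randn_like(mu)        # auxiliary noise, p(eps) = N(0, I)
    return mu + torch.exp(0.5 * log_var) * eps
```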

Amortized Variational Inference

-

A noticable limitation of black-box variational inference is that Step 1 executes an optimization subroutine that is computationally expensive. Recall that the goal of the Step 1 is to find

+

A noticeable limitation of black-box variational inference is that Step 1 executes an optimization subroutine that is computationally expensive. Recall that the goal of Step 1 is to find

-

For a given choice of , there is a well-defined mapping from . A key realization is that this mapping can be learned. In particular, one can train an encoding function (parameterized by ) -(where is the space of parameters) +

For a given choice of \(\theta\), there is a well-defined mapping \(\bx \mapsto \lambda^\ast\). A key realization is that this mapping can be learned. In particular, one can train an encoding function (parameterized by \(\phi\)) \(f_\phi: \X \to \Lambda\) (where \(\Lambda\) is the space of \(\lambda\) parameters) on the following objective

-

It is worth noting at this point that can be interpreted as defining the conditional distribution . With a slight abuse of notation, we define

+

It is worth noting at this point that \(f_\phi(\bx)\) can be interpreted as defining the conditional distribution \(q_\phi(\bz \giv \bx)\). With a slight abuse of notation, we define

-

It is also worth noting that optimizing over the entire dataset as a subroutine everytime we sample a new mini-batch is clearly not reasonable. However, if we believe that is capable of quickly adapting to a close-enough approximation of given the current choice of , then we can interleave the optimization and . The yields the following procedure, where for each mini-batch , we perform the following two updates jointly

+

It is also worth noting that optimizing \(\phi\) over the entire dataset as a subroutine every time we sample a new mini-batch is clearly not reasonable. However, if we believe that \(f_\phi\) is capable of quickly adapting to a close-enough approximation of \(\lambda^\ast\) given the current choice of \(\theta\), then we can interleave the optimization of \(\phi\) and \(\theta\). This yields the following procedure, where for each mini-batch \(\M = \set{\bx^{(1)}, \ldots, \bx^{(m)}}\), we perform the following two updates jointly

-

rather than running BBVI’s Step 1 as a subroutine. By leveraging the learnability of , this optimization procedure amortizes the cost of variational inference. If one further chooses to define as a neural network, the result is the variational autoencoder.

+

rather than running BBVI’s Step 1 as a subroutine. By leveraging the learnability of \(\bx \mapsto \lambda^\ast\), this optimization procedure amortizes the cost of variational inference. If one further chooses to define \(f_\phi\) as a neural network, the result is the variational autoencoder.
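
Putting the pieces together, one joint update of \((\theta, \phi)\) for the resulting variational autoencoder might look as follows. This is a minimal sketch: `encoder` is assumed to return the mean and log-variance of \(q_\phi(\bz \giv \bx)\), `decoder` and `reparameterized_sample` are the sketches from above, the prior is the unit Gaussian, and the Gaussian likelihood is an illustrative choice.

```python
import math
import torch

def vae_step(x, encoder, decoder, optimizer):
    """One joint gradient update of (theta, phi) on the ELBO for a mini-batch x."""
    mu_q, log_var_q = encoder(x)                    # q_phi(z | x) = N(mu_q, diag(exp(log_var_q)))
    z = reparameterized_sample(mu_q, log_var_q)     # reparameterization trick
    mu_x, log_var_x = decoder(z)                    # parameters of p_theta(x | z)

    # E_q[log p_theta(x | z)] estimated with the single sample z (Gaussian likelihood)
    rec = -0.5 * (((x - mu_x) ** 2) / log_var_x.exp()
                  + log_var_x + math.log(2 * math.pi)).sum(dim=1)
    # KL(q_phi(z | x) || p(z)) with p(z) = N(0, I), available in closed form
    kl = 0.5 * (mu_q ** 2 + log_var_q.exp() - 1 - log_var_q).sum(dim=1)

    loss = (kl - rec).mean()                        # negative ELBO, averaged over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return -loss.item()                             # average ELBO, useful for monitoring
```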

Footnotes

-
+
  1. -

    The first equality only holds if the support of includes that of . If not, it is an inequality. 

    +

    The first equality only holds if the support of \(q\) includes that of \(p\). If not, it is an inequality. 

@@ -385,8 +385,8 @@

Footnotes

- -Site created with Jekyll using the Tufte theme. © 2018 + +Site created with Jekyll using the Tufte theme. © 2025