From 218ec651380bc889f1466ede68c72ebb13cf58d6 Mon Sep 17 00:00:00 2001
From: armenk

We begin our study into generative modeling with autoregressive models. As before, we assume we are given access to a dataset \(\mathcal{D}\) of \(n\)-dimensional datapoints \(\mathbf{x}\). For simplicity, we assume the datapoints are binary, i.e., \(\mathbf{x} \in \{0,1\}^n\). By the chain rule of probability, we can factorize the joint distribution over the \(n\) dimensions as

where \(\mathbf{x}_{< i}=[x_1, x_2, \ldots, x_{i-1}]\) denotes the vector of random variables with index less than \(i\). The chain rule factorization can be expressed graphically as a Bayesian network. Such a Bayesian network that makes no conditional independence assumptions is said to obey the autoregressive property.
-The term autoregressive originates from the literature on time-series models where observations from the previous time-steps are used to predict the value at the current time step. Here, we fix an ordering of the variables and the distribution for the -th random variable depends on the values of all the preceding random variables in the chosen ordering.

Autoregressive models

Representation
If we allow for every conditional to be specified in a tabular form, then such a representation is fully general and can represent any possible distribution over random variables. However, the space complexity for such a representation grows exponentially with .
+If we allow for every conditional \(p(x_i \vert \mathbf{x}_{< i})\) to be specified in a tabular form, then such a representation is fully general and can represent any possible distribution over \(n\) random variables. However, the space complexity for such a representation grows exponentially with \(n\).
-To see why, let us consider the conditional for the last dimension, given by . In order to fully specify this conditional, we need to specify a probability for configurations of the variables . Since the probabilities should sum to 1, the total number of parameters for specifying this conditional is given by . Hence, a tabular representation for the conditionals is impractical for learning the joint distribution factorized via chain rule.
+To see why, let us consider the conditional for the last dimension, given by \(p(x_n \vert \mathbf{x}_{< n})\). In order to fully specify this conditional, we need to specify the probability of \(x_n\) for each of the \(2^{n-1}\) configurations of the variables \(x_1, x_2, \ldots, x_{n-1}\). Since \(x_n\) is binary and its probabilities must sum to one, each configuration requires only one parameter, so the total number of parameters for specifying this conditional is given by \(2^{n-1}\). Hence, a tabular representation for the conditionals is impractical for learning the joint distribution factorized via chain rule.
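As a small worked example: for \(n=4\) binary variables, the last conditional \(p(x_4 \vert x_1, x_2, x_3)\) alone needs \(2^3 = 8\) parameters, and the full tabular factorization needs \(\sum_{i=1}^4 2^{i-1} = 2^4 - 1 = 15\) parameters, exactly the number of free parameters in a full joint table over the \(2^4\) possible outcomes.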
-In an autoregressive generative model, the conditionals are specified as parameterized functions with a fixed number of parameters. That is, we assume the conditional distributions to correspond to a Bernoulli random variable and learn a function that maps the preceeding random variables to the mean of this distribution. Hence, we have
+In an autoregressive generative model, the conditionals are specified as parameterized functions with a fixed number of parameters. That is, we assume the conditional distributions \(p(x_i \vert \mathbf{x}_{< i})\) to correspond to a Bernoulli random variable and learn a function that maps the preceding random variables \(x_1, x_2, \ldots, x_{i-1}\) to the mean of this distribution. Hence, we have
-where denotes the set of parameters used to specify the mean function.
+where \(\theta_i\) denotes the set of parameters used to specify the mean function \(f_i: \{0,1\}^{i-1}\rightarrow [0,1]\).
-The number of parameters of an autoregressive generative model are given by . As we shall see in the examples below, the number of parameters are much fewer than the tabular setting considered previously. Unlike the tabular setting however, an autoregressive generative model cannot represent all possible distributions. Its expressiveness is limited by the fact that we are limiting the conditional distributions to correspond to a Bernoulli random variable with the mean specified via a restricted class of parameterized functions.
+The number of parameters of an autoregressive generative model is given by \(\sum_{i=1}^n \vert \theta_i \vert\). As we shall see in the examples below, this is far fewer than in the tabular setting considered previously. Unlike the tabular setting, however, an autoregressive generative model cannot represent all possible distributions. Its expressiveness is limited by the fact that we restrict the conditional distributions to correspond to a Bernoulli random variable whose mean is specified via a restricted class of parameterized functions.
where denotes the sigmoid function and denote the parameters of the mean function. The conditional for variable requires parameters, and hence the total number of parameters in the model is given by . Note that the number of parameters are much fewer than the exponential complexity of the tabular case.
+where \(\sigma\) denotes the sigmoid function and \(\theta_i=\{\alpha^{(i)}_0,\alpha^{(i)}_1, \ldots, \alpha^{(i)}_{i-1}\}\) denotes the parameters of the mean function. The conditional for variable \(i\) requires \(i\) parameters, and hence the total number of parameters in the model is given by \(\sum_{i=1}^n i= O(n^2)\). Note that this is far fewer than the exponential complexity of the tabular case.
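To make this parameterization concrete, here is a minimal sketch (in NumPy) of evaluating the log-likelihood of a single binary datapoint under such a logistic-regression-style model; the per-conditional parameter layout `alpha[i]` is a hypothetical convention chosen only for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fvsbn_log_likelihood(x, alpha):
    """Log-likelihood of one binary datapoint x (shape [n]).

    alpha[i] is assumed to be a vector of length i + 1 holding the bias
    alpha_0^(i) followed by the i weights of the i-th conditional.
    """
    log_p = 0.0
    for i, a in enumerate(alpha):
        mu = sigmoid(a[0] + a[1:] @ x[:i])        # Bernoulli mean for x_i given x_{<i}
        log_p += x[i] * np.log(mu) + (1 - x[i]) * np.log(1 - mu)
    return log_p
```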
-A natural way to increase the expressiveness of an autoregressive generative model is to use more flexible parameterizations for the mean function e.g., multi-layer perceptrons (MLP). For example, consider the case of a neural network with 1 hidden layer. The mean function for variable can be expressed as
+A natural way to increase the expressiveness of an autoregressive generative model is to use more flexible parameterizations for the mean function, e.g., multi-layer perceptrons (MLPs). For example, consider the case of a neural network with one hidden layer. The mean function for variable \(i\) can be expressed as
-where denotes the hidden layer activations for the MLP and are the set of parameters for the mean function . The total number of parameters in this model is dominated by the matrices and given by .
+where \(\mathbf{h}_i \in \mathbb{R}^d\) denotes the hidden layer activations for the MLP and \(\theta_i = \{A_i \in \mathbb{R}^{d\times (i-1)}, \mathbf{c}_i \in \mathbb{R}^d, \boldsymbol{\alpha}^{(i)}\in \mathbb{R}^d, b_i \in \mathbb{R}\}\) is the set of parameters for the mean function \(\mu_i(\cdot)\). The total number of parameters in this model is dominated by the matrices \(A_i\) and given by \(O(n^2 d)\).
where is the full set of parameters for the mean functions . The weight matrix and the bias vector are shared across the conditionals. Sharing parameters offers two benefits:
+where \(\theta=\{W\in \mathbb{R}^{d\times n}, \mathbf{c} \in \mathbb{R}^d, \{\boldsymbol{\alpha}^{(i)}\in \mathbb{R}^d\}^n_{i=1}, \{b_i \in \mathbb{R}\}^n_{i=1}\}\) is the full set of parameters for the mean functions \(f_1(\cdot), f_2(\cdot), \ldots, f_n(\cdot)\). The weight matrix \(W\) and the bias vector \(\mathbf{c}\) are shared across the conditionals. Sharing parameters offers two benefits:
The total number of parameters gets reduced from to [readers are encouraged to check!].
+The total number of parameters gets reduced from \(O(n^2 d)\) to \(O(nd)\) [readers are encouraged to check!].
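One way to check this, using the parameter list above: the shared \(W \in \mathbb{R}^{d\times n}\) and \(\mathbf{c} \in \mathbb{R}^d\) contribute \(O(nd)\) parameters in total, while each of the \(n\) conditionals only adds its own \(\boldsymbol{\alpha}^{(i)} \in \mathbb{R}^d\) and \(b_i \in \mathbb{R}\), i.e., \(O(d)\) parameters each, giving \(O(nd) + n \cdot O(d) = O(nd)\) overall. The per-conditional matrices \(A_i\), which dominated the earlier \(O(n^2 d)\) count, are no longer needed.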
The hidden unit activations can be evaluated in time via the following recursive strategy:
+The hidden unit activations can be evaluated in \(O(nd)\) time via the following recursive strategy:
-with the base case given by .
+with the base case given by \(\mathbf{a}_1=\mathbf{c}\).
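A minimal sketch of this recursive evaluation of the NADE conditionals (the array shapes and NumPy layout below are assumptions made for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nade_means(x, W, c, alpha, b):
    """Conditional Bernoulli means for one binary datapoint x (shape [n]).

    Assumed shapes: W is [d, n], c is [d], alpha is [n, d], b is [n].
    The activations are built up recursively in O(nd) total time instead
    of recomputing W[:, :i] @ x[:i] from scratch for every conditional.
    """
    n = x.shape[0]
    a = c.copy()                        # base case: a_1 = c
    mus = np.empty(n)
    for i in range(n):
        h = sigmoid(a)                  # hidden activations h_i
        mus[i] = sigmoid(alpha[i] @ h + b[i])
        a = a + W[:, i] * x[i]          # recursion: a_{i+1} = a_i + W[:, i] * x_i
    return mus
```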
The RNADE algorithm extends NADE to learn generative models over real-valued data. Here, the conditionals are modeled via a continuous distribution such as a equi-weighted mixture of Gaussians. Instead of learning a mean function, we know learn the means and variances of the Gaussians for every conditional. For statistical and computational efficiency, a single function outputs all the means and variances of the Gaussians for the -th conditional distribution.
+The RNADE algorithm extends NADE to learn generative models over real-valued data. Here, the conditionals are modeled via a continuous distribution such as an equi-weighted mixture of \(K\) Gaussians. Instead of learning a mean function, we now learn the means \(\mu_{i,1}, \mu_{i,2},\ldots, \mu_{i,K}\) and variances \(\Sigma_{i,1}, \Sigma_{i,2},\ldots, \Sigma_{i,K}\) of the \(K\) Gaussians for every conditional. For statistical and computational efficiency, a single function \(g_i: \mathbb{R}^{i-1}\rightarrow\mathbb{R}^{2K}\) outputs all the means and variances of the \(K\) Gaussians for the \(i\)-th conditional distribution.
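As an illustration, here is a sketch of evaluating one RNADE conditional from the output of \(g_i\); splitting the \(2K\) outputs into means and log standard deviations is an assumed (but common) convention rather than something prescribed above.

```python
import numpy as np
from scipy.special import logsumexp

def rnade_conditional_logpdf(x_i, g_out):
    """Log-density of the scalar x_i under an equi-weighted mixture of K
    Gaussians whose parameters are read off the 2K-dimensional output of g_i."""
    K = g_out.shape[0] // 2
    mu, log_sigma = g_out[:K], g_out[K:]          # assumed layout: K means, K log-stds
    log_comp = (-0.5 * np.log(2 * np.pi) - log_sigma
                - 0.5 * ((x_i - mu) / np.exp(log_sigma)) ** 2)
    return logsumexp(log_comp) - np.log(K)        # equal mixture weights 1/K
```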
Notice that NADE requires specifying a single, fixed ordering of the variables. The choice of ordering can lead to different models. The EoNADE algorithm allows training an ensemble of NADE models with different orderings.
@@ -189,33 +185,32 @@
-Before moving any further, we make two comments about the KL divergence. First, we note that the KL divergence between any two distributions is asymmetric. As we navigate through this chapter, the reader is encouraged to think what could go wrong if we decided to optimize the reverse KL divergence instead. Secondly, the KL divergences heavily penalizes any model distribution which assigns low probability to a datapoint that is likely to be sampled under . In the extreme case, if the density evaluates to zero for a datapoint sampled from , the objective evaluates to .
+Before moving any further, we make two comments about the KL divergence. First, we note that the KL divergence between any two distributions is asymmetric. As we navigate through this chapter, the reader is encouraged to think what could go wrong if we decided to optimize the reverse KL divergence instead. Secondly, the KL divergence heavily penalizes any model distribution \(p_\theta\) which assigns low probability to a datapoint that is likely to be sampled under \(p_{\mathrm{data}}\). In the extreme case, if the density \(p_\theta(\mathbf{x})\) evaluates to zero for a datapoint sampled from \(p_{\mathrm{data}}\), the objective evaluates to \(+\infty\).
-Since does not depend on , we can equivalently recover the optimal parameters via maximizing likelihood estimation.
+Since \(p_{\mathrm{data}}\) does not depend on \(\theta\), we can equivalently recover the optimal parameters via maximum likelihood estimation.
-Here, is referred to as the log-likelihood of the datapoint with respect to the model distribution .
+Here, \(\log p_{\theta}(\mathbf{x})\) is referred to as the log-likelihood of the datapoint \(\mathbf{x}\) with respect to the model distribution \(p_\theta\).
-To approximate the expectation over the unknown , we make an assumption: points in the dataset are sampled i.i.d. from . This allows us to obtain an unbiased Monte Carlo estimate of the objective as
+To approximate the expectation over the unknown \(p_{\mathrm{data}}\), we make an assumption: points in the dataset \(\mathcal{D}\) are sampled i.i.d. from \(p_{\mathrm{data}}\). This allows us to obtain an unbiased Monte Carlo estimate of the objective as
-The maximum likelihood estimation (MLE) objective has an intuitive interpretation: pick the model parameters that maximize the log-probability of the observed datapoints in .
+The maximum likelihood estimation (MLE) objective has an intuitive interpretation: pick the model parameters \(\theta \in \mathcal{M}\) that maximize the log-probability of the observed datapoints in \(\mathcal{D}\).
-In practice, we optimize the MLE objective using mini-batch gradient ascent. The algorithm operates in iterations. At every iteration , we sample a mini-batch of datapoints sampled randomly from the dataset () and compute gradients of the objective evaluated for the mini-batch. These parameters at iteration are then given via the following update rule
+In practice, we optimize the MLE objective using mini-batch gradient ascent. The algorithm operates in iterations. At every iteration \(t\), we sample a mini-batch \(\mathcal{B}_t\) of datapoints drawn randomly from the dataset (\(\vert \mathcal{B}_t\vert < \vert \mathcal{D} \vert\)) and compute gradients of the objective evaluated for the mini-batch. The parameters at iteration \(t+1\) are then given via the following update rule
-where and are the parameters at iterations and respectively, and is the learning rate at iteration . Typically, we only specify the initial learning rate and update the rate based on a schedule. Variants of stochastic gradient ascent, such as RMS prop and Adam, employ modified update rules that work slightly better in practice.
+where \(\theta^{(t+1)}\) and \(\theta^{(t)}\) are the parameters at iterations \(t+1\) and \(t\) respectively, and \(r_t\) is the learning rate at iteration \(t\). Typically, we only specify the initial learning rate \(r_1\) and update the rate based on a schedule. Variants of stochastic gradient ascent, such as RMSprop and Adam, employ modified update rules that work slightly better in practice.
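A minimal sketch of this training loop in PyTorch, assuming a `model` object that exposes a `log_prob` method returning \(\log p_\theta(\mathbf{x})\) for each datapoint in a batch:

```python
import torch

def train_mle(model, dataset, batch_size=64, lr=1e-3, epochs=10):
    """Mini-batch maximum likelihood training (gradient ascent on log-likelihood)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x in loader:                      # x is a randomly sampled mini-batch B_t
            loss = -model.log_prob(x).mean()  # minimizing -log p is ascent on log p
            opt.zero_grad()
            loss.backward()
            opt.step()
```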
-From a practical standpoint, we must think about how to choose hyperaparameters (such as the initial learning rate) and a stopping criteria for the gradient descent. For both these questions, we follow the standard practice in machine learning of monitoring the objective on a validation dataset. Consequently, we choose the hyperparameters with the best performance on the validation dataset and stop updating the parameters when the validation log-likelihoods cease to improve1.
+From a practical standpoint, we must think about how to choose hyperparameters (such as the initial learning rate) and a stopping criterion for the gradient updates. For both these questions, we follow the standard practice in machine learning of monitoring the objective on a validation dataset. Consequently, we choose the hyperparameters with the best performance on the validation dataset and stop updating the parameters when the validation log-likelihoods cease to improve1.
Now that we have a well-defined objective and optimization procedure, the only remaining task is to evaluate the objective in the context of an autoregressive generative model. To this end, we substitute the factorized joint distribution of an autoregressive model in the MLE objective to get
@@ -223,14 +218,12 @@
-where now denotes the collective set of parameters for the conditionals.
+where \(\theta = \{\theta_1, \theta_2, \ldots, \theta_n\}\) now denotes the collective set of parameters for the conditionals.
-Inference in an autoregressive model is straightforward. For density estimation of an arbitrary point , we simply evaluate the log-conditionals for each and add these up to obtain the log-likelihood assigned by the model to . Since we know conditioning vector , each of the conditionals can be evaluated in parallel. Hence, density estimation is efficient on modern hardware.
+Inference in an autoregressive model is straightforward. For density estimation of an arbitrary point \(\mathbf{x}\), we simply evaluate the log-conditionals \(\log p_{\theta_i}(x_i \vert \mathbf{x}_{< i})\) for each \(i\) and add these up to obtain the log-likelihood assigned by the model to \(\mathbf{x}\). Since the full vector \(\mathbf{x}\) is known, each of the conditionals can be evaluated in parallel. Hence, density estimation is efficient on modern hardware.
-Sampling from an autoregressive model is a sequential procedure. Here, we first sample , then we sample conditioned on the sampled , followed by conditioned on both and and so on until we sample conditioned on the previously sampled . For applications requiring real-time generation of high-dimensional data such as audio synthesis, the sequential sampling can be an expensive process. Later in this course, we will discuss how parallel Wavenet, an autoregressive model sidesteps this expensive sampling process.
+Sampling from an autoregressive model is a sequential procedure. Here, we first sample \(x_1\), then we sample \(x_2\) conditioned on the sampled \(x_1\), followed by \(x_3\) conditioned on both \(x_1\) and \(x_2\), and so on until we sample \(x_n\) conditioned on the previously sampled \(\mathbf{x}_{< n}\). For applications requiring real-time generation of high-dimensional data such as audio synthesis, this sequential sampling can be an expensive process. Later in this course, we will discuss how parallel WaveNet, an autoregressive model, sidesteps this expensive sampling process.
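A sketch of the sequential procedure; the `cond_mean` callback standing in for the model's \(i\)-th conditional is a hypothetical interface.

```python
import numpy as np

def sample_autoregressive(cond_mean, n, rng=None):
    """Sequentially sample one binary datapoint of dimension n.

    cond_mean(i, x_prefix) is assumed to return the Bernoulli mean
    p(x_i = 1 | x_{<i}) under the model.
    """
    rng = rng or np.random.default_rng()
    x = np.zeros(n, dtype=int)
    for i in range(n):
        mu = cond_mean(i, x[:i])       # depends only on values sampled so far
        x[i] = rng.binomial(1, mu)
    return x
```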
@@ -244,10 +237,10 @@
-Given the non-convex nature of such problems, the optimization procedure can get stuck in local optima. Hence, early stopping will generally not be optimal but is a very practical strategy. ↩
+Given the non-convex nature of such problems, the optimization procedure can get stuck in local optima. Hence, early stopping will generally not be optimal but is a very practical strategy. ↩
We continue our study over another type of likelihood based generative models. As before, we assume we are given access to a dataset of -dimensional datapoints . So far we have learned two types of likelihood based generative models:
+We continue our study with another type of likelihood-based generative model. As before, we assume we are given access to a dataset \(\mathcal{D}\) of \(n\)-dimensional datapoints \(\mathbf{x}\). So far we have learned two types of likelihood-based generative models:
Autoregressive Models:
+Autoregressive Models: \(p_\theta(\mathbf{x}) = \prod_{i=1}^{n} p_\theta(x_i \vert \mathbf{x}_{<i})\)
Variational autoencoders:
+Variational autoencoders: \(p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}, \mathbf{z}) \text{d}\mathbf{z}\)
The two methods have relative strengths and weaknesses. Autoregressive models provide tractable likelihoods but no direct mechanism for learning features, whereas variational autoencoders can learn feature representations but have intractable marginal likelihoods.
-In this section, we introduce normalizing flows a type of method that combines the best of both worlds, allowing both feature learning and tractable marginal likelihood estimation.
+In this section, we introduce normalizing flows: a type of method that combines the best of both worlds, allowing both feature learning and tractable marginal likelihood estimation.
In normalizing flows, we wish to map simple distributions (easy to sample and evaluate densities) to complex ones (learned via data). The change of variables formula describes how to evaluate densities of a random variable that is a deterministic transformation of another variable.
-Change of Variables: and be random variables which are related by a mapping such that and . Then
+Change of Variables: Let \(Z\) and \(X\) be random variables which are related by a mapping \(f: \mathbb{R}^n \to \mathbb{R}^n\) such that \(X = f(Z)\) and \(Z = f^{-1}(X)\). Then
\(\mathbf{x}\) and \(\mathbf{z}\) need to be continuous and have the same dimension.
is a matrix of dimension , where each entry at location is defined as . This matrix is also known as the Jacobian matrix.
+\(\frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}}\) is a matrix of dimension \(n \times n\), where each entry at location \((i, j)\) is defined as \(\frac{\partial f^{-1}(\mathbf{x})_i}{\partial x_j}\). This matrix is also known as the Jacobian matrix.
denotes the determinant of a square matrix .
+\(\text{det}(A)\) denotes the determinant of a square matrix \(A\).
For any invertible matrix , , so for we have
+For any invertible matrix \(A\), \(\text{det}(A^{-1}) = \text{det}(A)^{-1}\), so for \(\mathbf{z} = f^{-1}(\mathbf{x})\) we have
If , then the mappings is volume preserving, which means that the transformed distribution will have the same “volume” compared to the original one .
+If \(\left \vert \text{det}\left(\frac{\partial f(\mathbf{z})}{\partial \mathbf{z}}\right) \right\vert = 1\), then the mapping is volume preserving, which means that the transformed distribution \(p_X\) will have the same “volume” compared to the original one \(p_Z\).
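A sketch of evaluating \(\log p_X(\mathbf{x})\) directly from the change of variables formula with automatic differentiation (PyTorch), assuming `f_inv` and `base_log_prob` are supplied; practical flow models avoid forming the full Jacobian and instead use transformations whose determinants are cheap, as discussed below.

```python
import torch

def log_prob_change_of_variables(x, f_inv, base_log_prob):
    """log p_X(x) = log p_Z(f^{-1}(x)) + log |det d f^{-1}(x) / d x|."""
    z = f_inv(x)
    J = torch.autograd.functional.jacobian(f_inv, x)   # n x n Jacobian of the inverse map
    _, logabsdet = torch.linalg.slogdet(J)
    return base_log_prob(z) + logabsdet
```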
We are ready to introduce normalizing flow models. Let us consider a directed, latent-variable model over observed variables and latent variables . In a normalizing flow model, the mapping between and , given by , is deterministic and invertible such that and 1.
+We are ready to introduce normalizing flow models. Let us consider a directed, latent-variable model over observed variables \(X\) and latent variables \(Z\). In a normalizing flow model, the mapping between \(Z\) and \(X\), given by \(f_\theta: \mathbb{R}^n \to \mathbb{R}^n\), is deterministic and invertible such that \(X = f_\theta(Z)\) and \(Z = f_\theta^{-1}(X)\)1.
Using change of variables, the marginal likelihood is given by
+Using change of variables, the marginal likelihood \(p(x)\) is given by
-where are parameters.
+where \(\mathbf{u}, \mathbf{w}, b\) are parameters.
The absolute value of the determinant of the Jacobian is given by
@@ -174,31 +173,31 @@
-However, need to be restricted in order to be invertible. For example, and . Note that while is invertible, computing could be difficult analytically. The following models address this problem, where both and have simple analytical forms.
+However, \(\mathbf{u}, \mathbf{w}, b, h(\cdot)\) need to be restricted in order to be invertible. For example, \(h = \tanh\) and \(h'(\mathbf{w}^\top \mathbf{z} + b) \mathbf{u}^\top \mathbf{w} \geq -1\). Note that while \(f_\theta(\mathbf{z})\) is invertible, computing \(f_\theta^{-1}(\mathbf{z})\) could be difficult analytically. The following models address this problem, where both \(f_\theta\) and \(f_\theta^{-1}\) have simple analytical forms.
-The Nonlinear Independent Components Estimation (NICE) model and Real Non-Volume Preserving (RealNVP) model composes two kinds of invertible transformations: additive coupling layers and rescaling layers. The coupling layer in NICE partitions a variable into two disjoints subsets, say and . Then it applies the following transformation:
+The Nonlinear Independent Components Estimation (NICE) model and the Real Non-Volume Preserving (RealNVP) model compose two kinds of invertible transformations: additive coupling layers and rescaling layers. The coupling layer in NICE partitions a variable \(\mathbf{z}\) into two disjoint subsets, say \(\mathbf{z}_1\) and \(\mathbf{z}_2\). Then it applies the following transformation:
-Forward mapping
+Forward mapping \(\mathbf{z} \to \mathbf{x}\)
, which is an identity mapping.
+\(\mathbf{x}_1 = \mathbf{z}_1\), which is an identity mapping.
, where is a neural network.
+\(\mathbf{x}_2 = \mathbf{z}_2 + m_\theta(\mathbf{z_1})\), where \(m_\theta\) is a neural network.
Inverse mapping :
+Inverse mapping \(\mathbf{x} \to \mathbf{z}\):
, which is an identity mapping.
+\(\mathbf{z}_1 = \mathbf{x}_1\), which is an identity mapping.
, which is the inverse of the forward transformation.
+\(\mathbf{z}_2 = \mathbf{x}_2 - m_\theta(\mathbf{x_1})\), which is the inverse of the forward transformation.
Therefore, the Jacobian of the forward mapping is lower trangular, whose determinant is simply the product of the elements on the diagonal, which is 1. Therefore, this defines a volume preserving transformation. RealNVP adds scaling factors to the transformation:
+Therefore, the Jacobian of the forward mapping is lower triangular, whose determinant is simply the product of the elements on the diagonal, which is 1. Therefore, this defines a volume preserving transformation. RealNVP adds scaling factors to the transformation:
-where denotes elementwise product. This results in a non-volume preserving transformation.
+where \(\odot\) denotes elementwise product. This results in a non-volume preserving transformation.
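A sketch of a RealNVP-style coupling layer with the scaling factors included (setting the log-scale to zero recovers NICE's additive, volume-preserving coupling); `scale_net` and `shift_net` are assumed to be small user-supplied networks.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Split the input in half; the first half passes through unchanged and
    parameterizes a scale and shift applied to the second half."""
    def __init__(self, scale_net, shift_net):
        super().__init__()
        self.scale_net, self.shift_net = scale_net, shift_net

    def forward(self, z):                          # forward mapping z -> x
        z1, z2 = z.chunk(2, dim=-1)
        log_s = self.scale_net(z1)
        x2 = z2 * torch.exp(log_s) + self.shift_net(z1)
        logdet = log_s.sum(dim=-1)                 # log |det J| of the forward map
        return torch.cat([z1, x2], dim=-1), logdet

    def inverse(self, x):                          # inverse mapping x -> z
        x1, x2 = x.chunk(2, dim=-1)
        z2 = (x2 - self.shift_net(x1)) * torch.exp(-self.scale_net(x1))
        return torch.cat([x1, z2], dim=-1)
```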
-Some autoregressive models can also be interpreted as flow models. For a Gaussian autoregressive model, one receive some Gaussian noise for each dimension of , which can be treated as the latent variables . Such transformations are also invertible, meaning that given and the model parameters, we can obtain exactly.
+Some autoregressive models can also be interpreted as flow models. For a Gaussian autoregressive model, one receives some Gaussian noise for each dimension of \(\mathbf{x}\), which can be treated as the latent variables \(\mathbf{z}\). Such transformations are also invertible, meaning that given \(\mathbf{x}\) and the model parameters, we can obtain \(\mathbf{z}\) exactly.
-Masked Autoregressive Flow (MAF) uses this interpretation, where the forward mapping is an autoregressive model. However, sampling is sequential and slow, in time where is the dimension of the samples.
+Masked Autoregressive Flow (MAF) uses this interpretation, where the forward mapping is an autoregressive model. However, sampling is sequential and slow, in \(O(n)\) time where \(n\) is the dimension of the samples.
To address the sampling problem, the Inverse Autoregressive Flow (IAF) simply inverts the generating process. In this case, generating from the noise can be parallelized, but computing the likelihood of new data points is slow. However, for generated points the likelihood can be computed efficiently (since the noise are already obtained).
+To address the sampling problem, the Inverse Autoregressive Flow (IAF) simply inverts the generating process. In this case, generating \(\mathbf{x}\) from the noise can be parallelized, but computing the likelihood of new data points is slow. However, for generated points the likelihood can be computed efficiently (since the noise variables are already available).
Why not? In fact, it is not so clear that better likelihood numbers necessarily correspond to higher sample quality. We know that the optimal generative model will give us the best sample quality and highest test log-likelihood. However, models with high test log-likelihoods can still yield poor samples, and vice versa. To see why, consider pathological cases in which our model is comprised almost entirely of noise, or our model simply memorizes the training set. Therefore, we turn to likelihood-free training with the hope that optimizing a different objective will allow us to disentangle our desiderata of obtaining high likelihoods as well as high-quality samples.
-Recall that maximum likelihood required us to evaluate the likelihood of the data under our model . A natural way to set up a likelihood-free objective is to consider the two-sample test, a statistical test that determines whether or not a finite set of samples from two distributions are from the same distribution using only samples from and . Concretely, given and , we compute a test statistic according to the difference in and that, when less than a threshold , accepts the null hypothesis that .
+Recall that maximum likelihood required us to evaluate the likelihood of the data under our model \(p_\theta\). A natural way to set up a likelihood-free objective is to consider the two-sample test, a statistical test that determines whether or not a finite set of samples from two distributions are from the same distribution using only samples from \(P\) and \(Q\). Concretely, given \(S_1 = \{\mathbf{x} \sim P\}\) and \(S_2 = \{\mathbf{x} \sim Q\}\), we compute a test statistic \(T\) according to the difference in \(S_1\) and \(S_2\) that, when less than a threshold \(\alpha\), accepts the null hypothesis that \(P = Q\).
-Analogously, we have in our generative modeling setup access to our training set and . The key idea is to train the model to minimize a two-sample test objective between and . But this objective becomes extremely difficult to work with in high dimensions, so we choose to optimize a surrogate objective that instead maximizes some distance between and .
+Analogously, we have in our generative modeling setup access to our training set \(S_1 = \mathcal{D} = \{\mathbf{x} \sim p_{\textrm{data}} \}\) and \(S_2 = \{\mathbf{x} \sim p_{\theta} \}\). The key idea is to train the model to minimize a two-sample test objective between \(S_1\) and \(S_2\). But this objective becomes extremely difficult to work with in high dimensions, so we choose to optimize a surrogate objective that instead maximizes some distance between \(S_1\) and \(S_2\).
We thus arrive at the generative adversarial network formulation. There are two components in a GAN: (1) a generator and (2) a discriminator. The generator is a directed latent variable model that deterministically generates samples from , and the discriminator is a function whose job is to distinguish samples from the real dataset and the generator. The image below is a graphical model of and . denotes samples (either from data or generator), denotes our noise vector, and denotes the discriminator’s prediction about .
+We thus arrive at the generative adversarial network formulation. There are two components in a GAN: (1) a generator and (2) a discriminator. The generator \(G_\theta\) is a directed latent variable model that deterministically generates samples \(\mathbf{x}\) from \(\mathbf{z}\), and the discriminator \(D_\phi\) is a function whose job is to distinguish samples from the real dataset and the generator. The image below is a graphical model of \(G_\theta\) and \(D_\phi\). \(\mathbf{x}\) denotes samples (either from data or generator), \(\mathbf{z}\) denotes our noise vector, and \(\mathbf{y}\) denotes the discriminator’s prediction about \(\mathbf{x}\).
The generator and discriminator both play a two player minimax game, where the generator minimizes a two-sample test objective () and the discriminator maximizes the objective (). Intuitively, the generator tries to fool the discriminator to the best of its ability by generating samples that look indisginguishable from .
+The generator and discriminator both play a two player minimax game, where the generator minimizes a two-sample test objective (\(p_{\textrm{data}} = p_\theta\)) and the discriminator maximizes the objective (\(p_{\textrm{data}} \neq p_\theta\)). Intuitively, the generator tries to fool the discriminator to the best of its ability by generating samples that look indistinguishable from \(p_{\textrm{data}}\).
Formally, the GAN objective can be written as:
@@ -107,40 +107,40 @@
-Let’s unpack this expression. We know that the discriminator is maximizing this function with respect to its parameters , where given a fixed generator it is performing binary classification: it assigns probability 1 to data points from the training set , and assigns probability 0 to generated samples . In this setup, the optimal discriminator is:
+Let’s unpack this expression. We know that the discriminator is maximizing this function with respect to its parameters \(\phi\), where given a fixed generator \(G_\theta\) it is performing binary classification: it assigns probability 1 to data points from the training set \(\mathbf{x} \sim p_{\textrm{data}}\), and assigns probability 0 to generated samples \(\mathbf{x} \sim p_G\). In this setup, the optimal discriminator is:
-On the other hand, the generator minimizes this objective for a fixed discriminator . And after performing some algebra, plugging in the optimal discriminator into the overall objective gives us:
+On the other hand, the generator minimizes this objective for a fixed discriminator \(D_\phi\). And after performing some algebra, plugging in the optimal discriminator \(D^*_G(\cdot)\) into the overall objective \(V(G_\theta, D^*_G(\mathbf{x}))\) gives us:
-The term is the Jenson-Shannon Divergence, which is also known as the symmetric form of the KL divergence:
+The \(D_{\textrm{JSD}}\) term is the Jensen-Shannon divergence, which is also known as the symmetric form of the KL divergence:
-The JSD satisfies all properties of the KL, and has the additional perk that . With this distance metric, the optimal generator for the GAN objective becomces , and the optimal objective value that we can achieve with optimal generators and discriminators and is .
+The JSD satisfies all properties of the KL, and has the additional perk that \(D_{\textrm{JSD}}[p,q] = D_{\textrm{JSD}}[q,p]\). With this distance metric, the optimal generator for the GAN objective becomes \(p_G = p_{\textrm{data}}\), and the optimal objective value that we can achieve with optimal generators and discriminators \(G^*(\cdot)\) and \(D^*_{G^*}(\mathbf{x})\) is \(-\log 4\).
Thus, the way in which we train a GAN is as follows:
-For epochs do:
+For epochs \(1, \ldots, N\) do:
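A sketch of one iteration of this alternating procedure, assuming generator `G` and discriminator `D` are PyTorch modules and that `D` outputs probabilities:

```python
import torch

def gan_iteration(G, D, opt_G, opt_D, x_real, z_dim):
    z = torch.randn(x_real.size(0), z_dim)

    # Discriminator step: ascend the objective, pushing D(x_real) -> 1 and D(G(z)) -> 0.
    x_fake = G(z).detach()                         # no generator gradients in this step
    d_loss = -(torch.log(D(x_real)).mean() + torch.log(1 - D(x_fake)).mean())
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator step: descend the same objective (minimax form; the
    # non-saturating loss -log D(G(z)) is a common practical alternative).
    g_loss = torch.log(1 - D(G(z))).mean()
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
```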
Next, we focus our attention to a few select types of GAN architectures and explore them in more detail.
The f-GAN optimizes the variant of the two-sample test objective that we have discussed so far, but using a very general notion of distance: the . Given two densities and , the -divergence can be written as:
+The f-GAN optimizes the variant of the two-sample test objective that we have discussed so far, but using a very general notion of distance: the \(f\)-divergence. Given two densities \(p\) and \(q\), the \(f\)-divergence can be written as:
-where is any convex1, lower-semicontinuous2 function with . Several of the distance “metrics” that we have seen so far fall under the class of f-divergences, such as KL, Jenson-Shannon, and total variation.
+where \(f\) is any convex1, lower-semicontinuous2 function with \(f(1) = 0\). Several of the distance “metrics” that we have seen so far fall under the class of f-divergences, such as KL, Jenson-Shannon, and total variation.
-To set up the f-GAN objective, we borrow two commonly used tools from convex optimization3: the Fenchel conjugate and duality. Specifically, we obtain a lower bound to any f-divergence via its Fenchel conjugate:
+To set up the f-GAN objective, we borrow two commonly used tools from convex optimization3: the Fenchel conjugate and duality. Specifically, we obtain a lower bound to any f-divergence via its Fenchel conjugate:
-Therefore we can choose any f-divergence that we desire, let and , parameterize by and by , and obtain the following fGAN objective:
+Therefore we can choose any f-divergence that we desire, let \(p = p_{\textrm{data}}\) and \(q = p_G\), parameterize \(T\) by \(\phi\) and \(G\) by \(\theta\), and obtain the following fGAN objective:
CycleGAN is a type of GAN that allows us to do unsupervised image-to-image translation, from two domains \(\mathcal{X} \leftrightarrow \mathcal{Y}\).
-Specifically, we learn two conditional generative models: and . There is a discriminator associated with that compares the true with the generated samples . Similarly, there is another discriminator associated with that compares the true with the generated samples . The figure below illustrates the CycleGAN setup:
+Specifically, we learn two conditional generative models: \(G: \mathcal{X} \leftrightarrow \mathcal{Y}\) and \(F: \mathcal{Y} \leftrightarrow \mathcal{X}\). There is a discriminator \(D_\mathcal{Y}\) associated with \(G\) that compares the true \(Y\) with the generated samples \(\hat{Y} = G(X)\). Similarly, there is another discriminator \(D_\mathcal{X}\) associated with \(F\) that compares the true \(X\) with the generated samples \(\hat{X} = F(Y)\). The figure below illustrates the CycleGAN setup:
CycleGAN enforces a property known as cycle consistency, which states that if we can go from to via , then we should also be able to go from to via . The overall loss function can be written as:
+CycleGAN enforces a property known as cycle consistency, which states that if we can go from \(X\) to \(\hat{Y}\) via \(G\), then we should also be able to go from \(\hat{Y}\) to \(X\) via \(F\). The overall loss function can be written as:
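A sketch of the cycle-consistency portion of that loss, using the L1 reconstruction penalty from the CycleGAN paper; the full objective adds the two adversarial terms with a weighting coefficient.

```python
import torch

def cycle_consistency_loss(G, F, x, y):
    """F(G(x)) should reconstruct x and G(F(y)) should reconstruct y."""
    loss_x = torch.mean(torch.abs(F(G(x)) - x))    # X -> Y_hat -> X
    loss_y = torch.mean(torch.abs(G(F(y)) - y))    # Y -> X_hat -> Y
    return loss_x + loss_y
```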
In this context, convex means a line joining any two points that lies above the function. ↩
+In this context, convex means that the line segment joining any two points on the function’s graph lies above the function. ↩
The function value at any point is close to or greater than . ↩
+For points near \(\mathbf{x}_0\), the function value is close to or greater than \(f(\mathbf{x}_0)\). ↩
This book is an excellent resource to learn more about these topics. ↩
+This book is an excellent resource to learn more about these topics. ↩
In this course, we will study generative models that view the world under the lens of probability. In such a worldview, we can think of any kind of observed data, say \(\mathcal{D}\), as a finite set of samples from an underlying distribution, say \(p_{\mathrm{data}}\). At its very core, the goal of any generative model is then to approximate this data distribution given access to the dataset \(\mathcal{D}\). The hope is that if we are able to learn a good generative model, we can use the learned model for downstream inference.
We will be primarily interested in parametric approximations to the data distribution, which summarize all the information about the dataset \(\mathcal{D}\) in a finite set of parameters. In contrast with non-parametric models, parametric models scale more efficiently with large datasets but are limited in the family of distributions they can represent.
In the parametric setting, we can think of the task of learning a generative model as picking the parameters within a family of model distributions that minimizes some notion of distance1 between the model distribution and the data distribution.
For instance, we might be given access to a dataset of dog images \(\mathcal{D}\) and our goal is to learn the parameters of a generative model \(\theta\) within a model family \(\mathcal{M}\) such that the model distribution \(p_\theta\) is close to the data distribution over dogs \(p_{\mathrm{data}}\). Mathematically, we can specify our goal as the following optimization problem:

\begin{equation}
\min_{\theta\in \mathcal{M}}d(p_{\mathrm{data}}, p_{\theta})
\label{eq:learning_gm}
\tag{1}
\end{equation}

where \(p_{\mathrm{data}}\) is accessed via the dataset \(\mathcal{D}\) and \(d(\cdot)\) is a notion of distance between probability distributions.

As we navigate through this course, it is interesting to take note of the difficulty of the problem at hand. A typical image from a modern phone camera has a resolution of approximately \(700 \times 1400\) pixels. Each pixel has three channels: R(ed), G(reen) and B(lue), and each channel can take a value between 0 and 255. Hence, the number of possible images is given by \(256^{700 \times 1400 \times 3}\approx 10^{7000000}\). In contrast, ImageNet, one of the largest publicly available datasets, consists of only about 15 million images. Hence, learning a generative model with such a limited dataset is a highly underdetermined problem.
@@ -142,13 +142,13 @@
In the next few lectures, we will take a deeper dive into certain
@@ -161,7 +161,7 @@
For a discriminative model such as logistic regression, the fundamental inference task is to predict a label for any given datapoint. Generative models, on the other hand, learn a joint distribution over the entire data.2
+data.2While the range of applications to which generative models have been used continue to grow, we can identify three fundamental inference @@ -169,23 +169,23 @@
Density estimation: Given a datapoint , what is the probability assigned by the model, i.e., ?
+Density estimation: Given a datapoint \(\mathbf{x}\), what is the probability assigned by the model, i.e., \(p_\theta(\mathbf{x})\)?
Sampling: How can we generate novel data from the model distribution, i.e., \(\mathbf{x}_{\mathrm{new}} \sim p_\theta(\mathbf{x})\)?

Unsupervised representation learning: How can we learn meaningful feature representations for a datapoint \(\mathbf{x}\)?

Going back to our example of learning a generative model over dog images, we can intuitively expect a good generative model to work as follows. For density estimation, we expect \(p_\theta(\mathbf{x})\) to be high for dog images and low otherwise. Alluding to the name generative model, sampling involves generating novel images of dogs beyond the ones we observe in our dataset. Finally, representation learning can
@@ -205,18 +205,18 @@
As we shall see later, functions that do not satisfy all properties of a distance metric are also used in practice, e.g., KL divergence. ↩

Technically, a probabilistic discriminative model is also a generative model of the labels conditioned on the data. However, the usage of the term generative models is typically reserved for high dimensional data. ↩

-In the model above, and denote the latent and observed variables respectively. The joint distribution expressed by this model is given as
+In the model above, \(\bz\) and \(\bx\) denote the latent and observed variables respectively. The joint distribution expressed by this model is given as
-From a generative modeling perspective, this model describes a generative process for the observed data using the following procedure
+From a generative modeling perspective, this model describes a generative process for the observed data \(\bx\) using the following procedure
-If one adopts the belief that the latent variables somehow encode semantically meaningful information about , it is natural to view this generative process as first generating the “high-level” semantic information about first before fully generating . Such a perspective motivates generative models with rich latent variable structures such as hierarchical generative models —where information about is generated hierarchically—and temporal models such as the Hidden Markov Model—where temporally-related high-level information is generated first before constructing .
+If one adopts the belief that the latent variables \(\bz\) somehow encode semantically meaningful information about \(\bx\), it is natural to view this generative process as first generating the “high-level” semantic information about \(\bx\) before fully generating \(\bx\). Such a perspective motivates generative models with rich latent variable structures such as hierarchical generative models \(p(\bx, \bz_1, \ldots, \bz_m) = p(\bx \giv \bz_1)\prod_i p(\bz_i \giv \bz_{i+1})\)—where information about \(\bx\) is generated hierarchically—and temporal models such as the Hidden Markov Model—where temporally-related high-level information is generated first before constructing \(\bx\).
-We now consider a family of distributions where describes a probability distribution over . Next, consider a family of conditional distributions where describes a conditional probability distribution over given . Then our hypothesis class of generative models is the set of all possible combinations
+We now consider a family of distributions \(\P_\bz\) where \(p(\bz) \in \P_\bz\) describes a probability distribution over \(\bz\). Next, consider a family of conditional distributions \(\P_{\bx\giv \bz}\) where \(p_\theta(\bx \giv \bz) \in \P_{\bx\giv \bz}\) describes a conditional probability distribution over \(\bx\) given \(\bz\). Then our hypothesis class of generative models is the set of all possible combinations
-Given a dataset , we are interested in the following learning and inference tasks
+Given a dataset \(\D = \set{\bx^{(1)}, \ldots, \bx^{(n)}}\), we are interested in the following learning and inference tasks
One way to measure how closely fits the observed dataset is to measure the Kullback-Leibler (KL) divergence between the data distribution (which we denote as ) and the model’s marginal distribution . The distribution that ``best’’ fits the data is thus obtained by minimizing the KL divergence.
+One way to measure how closely \(p(\bx, \bz)\) fits the observed dataset \(\D\) is to measure the Kullback-Leibler (KL) divergence between the data distribution (which we denote as \(p_{\mathrm{data}}(\bx)\)) and the model’s marginal distribution \(p(\bx) = \int p(\bx, \bz) \d \bz\). The distribution that “best” fits the data is thus obtained by minimizing the KL divergence.
-As we have seen previously, optimizing an empirical estimate of the KL divergence is equivalent to maximizing the marginal log-likelihood over
+As we have seen previously, optimizing an empirical estimate of the KL divergence is equivalent to maximizing the marginal log-likelihood \(\log p(\bx)\) over \(\D\)
-However, it turns out this problem is generally intractable for high-dimensional as it involves an integration (or sums in the case is discrete) over all the possible latent sources of variation . One option is to estimate the objective via Monte Carlo. For any given datapoint , we can obtain the following estimate for its marginal log-likelihood
+However, it turns out this problem is generally intractable for high-dimensional \(\bz\) as it involves an integration (or a sum in the case \(\bz\) is discrete) over all the possible latent sources of variation \(\bz\). One option is to estimate the objective via Monte Carlo. For any given datapoint \(\bx\), we can obtain the following estimate for its marginal log-likelihood
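A sketch of this simple Monte Carlo average over prior samples (`decoder_log_prob` and a `torch.distributions` prior are assumed interfaces); in practice the estimator has very high variance because most prior samples explain \(\bx\) poorly, which motivates the variational approach below.

```python
import math
import torch

def naive_marginal_log_likelihood(x, decoder_log_prob, prior, k=1000):
    """log of the Monte Carlo estimate (1/k) * sum_j p_theta(x | z^(j)), z^(j) ~ p(z)."""
    z = prior.sample((k,))                    # k samples from the prior p(z)
    log_px_given_z = decoder_log_prob(x, z)   # assumed to return a length-k tensor
    return torch.logsumexp(log_px_given_z, dim=0) - math.log(k)
```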
Rather than maximizing the log-likelihood directly, an alternative is to instead construct a lower bound that is more amenable to optimization. To do so, we note that evaluating the marginal likelihood \(p(\bx)\) is at least as difficult as evaluating the posterior \(p(\bz \mid \bx)\) for any latent vector \(\bz\) since by definition \(p(\bz \mid \bx) = p(\bx, \bz) / p(\bx)\).
-Next, we introduce a variational family of distributions that approximate the true, but intractable posterior . Further henceforth, we will assume a parameteric setting where any distribution in the model family is specified via a set of parameters and distributions in the variational family are specified via a set of parameters .
+Next, we introduce a variational family \(\Q\) of distributions that approximate the true, but intractable posterior \(p(\bz \mid \bx)\). Henceforth, we will assume a parametric setting where any distribution in the model family \(\P_{\bx, \bz}\) is specified via a set of parameters \(\theta \in \Theta\) and distributions in the variational family \(\Q\) are specified via a set of parameters \(\lambda \in \Lambda\).
-Given and , we note that the following relationships hold true1 for any and all variational distributions
+Given \(\P_{\bx, \bz}\) and \(\Q\), we note that the following relationships hold true1 for any \(\bx\) and all variational distributions \(q_\lambda(\bz) \in \Q\)
-so long as it is easy to sample from and evaluate densities for .
+so long as it is easy to sample from and evaluate densities for \(q_\lambda(\bz)\).
-Which variational distribution should we pick? Even though the above derivation holds for any choice of variational parameters , the tightness of the lower bound depends on the specific choice of .
+Which variational distribution should we pick? Even though the above derivation holds for any choice of variational parameters \(\lambda\), the tightness of the lower bound depends on the specific choice of \(q\).
In particular, the gap between the original objective(marginal log-likelihood ) and the ELBO equals the KL divergence between the approximate posterior and the true posterior . The gap is zero when the variational distribution exactly matches .
+In particular, the gap between the original objective (marginal log-likelihood \(\log p_\theta(\bx)\)) and the ELBO equals the KL divergence between the approximate posterior \(q(\bz)\) and the true posterior \(p(\bz \giv \bx)\). The gap is zero when the variational distribution \(q_\lambda(\bz)\) exactly matches \(p_\theta(\bz \giv \bx)\).
-In summary, we can learn a latent variable model by maximizing the ELBO with respect to both the model parameters and the variational parameters for any given datapoint
+In summary, we can learn a latent variable model by maximizing the ELBO with respect to both the model parameters \(\theta\) and the variational parameters \(\lambda\) for any given datapoint \(\bx\)
Step 1
-We first do per-sample optimization of by iteratively applying the update
+We first do per-sample optimization of \(q\) by iteratively applying the update
-where , and denotes an unbiased estimate of the ELBO gradient. This step seeks to approximate the log-likelihood .
+where \(\text{ELBO}(\bx; \theta, \lambda) = \Expect_{q_\lambda(\bz)} \left[\log \frac{p_\theta(\bx, \bz)}{q_\lambda(\bz)}\right]\), and \(\tilde{\nabla}_\lambda\) denotes an unbiased estimate of the ELBO gradient. This step seeks to approximate the log-likelihood \(\log p_\theta(\bx^{(i)})\).
Step 2
@@ -237,30 +237,30 @@
-which corresponds to the step that hopefully moves closer to .
+which corresponds to the step that hopefully moves \(p_\theta\) closer to \(p_{\mathrm{data}}\).
The gradients and can be estimated via Monte Carlo sampling. While it is straightforward to construct an unbiased estimate of by simply pushing through the expectation operator, the same cannot be said for . Instead, we see that
+The gradients \(\nabla_\lambda \ELBO\) and \(\nabla_\theta \ELBO\) can be estimated via Monte Carlo sampling. While it is straightforward to construct an unbiased estimate of \(\nabla_\theta \ELBO\) by simply pushing \(\nabla_\theta\) through the expectation operator, the same cannot be said for \(\nabla_\lambda\). Instead, we see that
-This equality follows from the log-derivative trick (also commonly referred to as the REINFORCE trick). The full derivation involves some simple algebraic manipulations and is left as an exercise for the reader. The gradient estimator is thus
+This equality follows from the log-derivative trick (also commonly referred to as the REINFORCE trick). The full derivation involves some simple algebraic manipulations and is left as an exercise for the reader. The gradient estimator \(\tilde{\nabla}_\lambda \ELBO\) is thus
-However, it is often noted that this estimator suffers from high variance. One of the key contributions of the variational autoencoder paper is the reparameterization trick, which introduces a fixed, auxiliary distribution and a differentiable function such that the procedure
+However, it is often noted that this estimator suffers from high variance. One of the key contributions of the variational autoencoder paper is the reparameterization trick, which introduces a fixed, auxiliary distribution \(p(\veps)\) and a differentiable function \(T(\veps; \lambda)\) such that the procedure
-is equivalent to sampling from . By the Law of the Unconscious Statistician, we can see that
+is equivalent to sampling from \(q_\lambda(\bz)\). By the Law of the Unconscious Statistician, we can see that
So far, we have described \(p_\theta(\bx, \bz)\) and \(q_\lambda(\bz)\) in the abstract. To instantiate these objects, we consider choices of parametric distributions for \(p_\theta(\bz)\), \(p_\theta(\bx \giv \bz)\), and \(q_\lambda(\bz)\). A popular choice for \(p_\theta(\bz)\) is the unit Gaussian
-in which case is simply the empty set since the prior is a fixed distribution. Another alternative often used in practice is a mixture of Gaussians with trainable mean and covariance parameters.
+in which case \(\theta\) is simply the empty set since the prior is a fixed distribution. Another alternative often used in practice is a mixture of Gaussians with trainable mean and covariance parameters.
-The conditional distribution is where we introduce a deep neural network. We note that a conditional distribution can be constructed by defining a distribution family (parameterized by ) in the target space (i.e. defines an unconditional distribution over ) and a mapping function . In other words, defines the conditional distribution
+The conditional distribution \(p_\theta(\bx \giv \bz)\) is where we introduce a deep neural network. We note that a conditional distribution can be constructed by defining a distribution family (parameterized by \(\omega \in \Omega\)) in the target space \(\bx\) (i.e. \(p_\omega(\bx)\) defines an unconditional distribution over \(\bx\)) and a mapping function \(g_\theta: \Z \to \Omega\). In other words, \(g_\theta(\cdot)\) defines the conditional distribution

-The function is also referred to as the decoding distribution since it maps a latent code to the parameters of a distribution over observed variables . In practice, it is typical to specify as a deep neural network.
+The function \(g_\theta\) is also referred to as the decoding distribution since it maps a latent code \(\bz\) to the parameters of a distribution over observed variables \(\bx\). In practice, it is typical to specify \(g_\theta\) as a deep neural network.
-In the case where is a Gaussian distribution, we can thus represent it as
+In the case where \(p_\theta(\bx \giv \bz)\) is a Gaussian distribution, we can thus represent it as
where and are neural networks that specify the mean and covariance matrix for the Gaussian distribution over when conditioned on .
+where \(\mu_\theta(\bz)\) and \(\Sigma_\theta(\bz)\) are neural networks that specify the mean and covariance matrix for the Gaussian distribution over \(\bx\) when conditioned on \(\bz\).
-Finally, the variational family for the proposal distribution needs to be chosen judiciously so that the reparameterization trick is possible. Many continuous distributions in the location-scale family can be reparameterized. In practice, a popular choice is again the Gaussian distribution, where
+Finally, the variational family for the proposal distribution \(q_\lambda(\bz)\) needs to be chosen judiciously so that the reparameterization trick is possible. Many continuous distributions in the location-scale family can be reparameterized. In practice, a popular choice is again the Gaussian distribution, where
-where is the Cholesky decomposition of . For simplicity, practitioners often restrict to be a diagonal matrix (which restricts the distribution family to that of factorized Gaussians).
+where \(\Sigma^{1/2}\) is the Cholesky decomposition of \(\Sigma\). For simplicity, practitioners often restrict \(\Sigma\) to be a diagonal matrix (which restricts the distribution family to that of factorized Gaussians).
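A sketch of a reparameterized draw from this factorized Gaussian family (using log standard deviations, an assumed but common parameterization); because the sample is a deterministic, differentiable function of \((\lambda, \veps)\), gradients with respect to the variational parameters can flow through the sampling step.

```python
import torch

def sample_diag_gaussian(mu, log_sigma):
    """z = T(eps; lambda) with lambda = (mu, log_sigma) and eps ~ N(0, I)."""
    eps = torch.randn_like(mu)               # auxiliary noise from p(eps)
    return mu + torch.exp(log_sigma) * eps   # reparameterized sample from q_lambda(z)
```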
A noticable limitation of black-box variational inference is that Step 1 executes an optimization subroutine that is computationally expensive. Recall that the goal of the Step 1 is to find
+A noticeable limitation of black-box variational inference is that Step 1 executes an optimization subroutine that is computationally expensive. Recall that the goal of Step 1 is to find
-For a given choice of , there is a well-defined mapping from . A key realization is that this mapping can be learned. In particular, one can train an encoding function (parameterized by ) (where is the space of parameters) on the following objective
+For a given choice of \(\theta\), there is a well-defined mapping from \(\bx \mapsto \lambda^\ast\). A key realization is that this mapping can be learned. In particular, one can train an encoding function (parameterized by \(\phi\)) \(f_\phi: \X \to \Lambda\) (where \(\Lambda\) is the space of \(\lambda\) parameters) on the following objective
-It is worth noting at this point that can be interpreted as defining the conditional distribution . With a slight abuse of notation, we define
+It is worth noting at this point that \(f_\phi(\bx)\) can be interpreted as defining the conditional distribution \(q_\phi(\bz \giv \bx)\). With a slight abuse of notation, we define
-It is also worth noting that optimizing over the entire dataset as a subroutine everytime we sample a new mini-batch is clearly not reasonable. However, if we believe that is capable of quickly adapting to a close-enough approximation of given the current choice of , then we can interleave the optimization and . The yields the following procedure, where for each mini-batch , we perform the following two updates jointly
+It is also worth noting that optimizing \(\phi\) over the entire dataset as a subroutine every time we sample a new mini-batch is clearly not reasonable. However, if we believe that \(f_\phi\) is capable of quickly adapting to a close-enough approximation of \(\lambda^\ast\) given the current choice of \(\theta\), then we can interleave the optimization of \(\phi\) and \(\theta\). This yields the following procedure, where for each mini-batch \(\M = \set{\bx^{(1)}, \ldots, \bx^{(m)}}\), we perform the following two updates jointly
-rather than running BBVI’s Step 1 as a subroutine. By leveraging the learnability of , this optimization procedure amortizes the cost of variational inference. If one further chooses to define as a neural network, the result is the variational autoencoder.
+rather than running BBVI’s Step 1 as a subroutine. By leveraging the learnability of \(\bx \mapsto \lambda^\ast\), this optimization procedure amortizes the cost of variational inference. If one further chooses to define \(f_\phi\) as a neural network, the result is the variational autoencoder.
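Putting the pieces together, a compact sketch of the resulting per-mini-batch loss; `encoder` (returning the mean and log standard deviation of \(q_\phi(\bz \giv \bx)\)), `decoder.log_prob`, and the unit Gaussian prior are assumed interfaces.

```python
import torch

def vae_loss(x, encoder, decoder):
    """Negative single-sample ELBO estimate; one optimizer step on this loss
    updates the decoder parameters theta and encoder parameters phi jointly."""
    mu, log_sigma = encoder(x)
    sigma = torch.exp(log_sigma)
    z = mu + sigma * torch.randn_like(mu)                                   # reparameterized z ~ q_phi(z|x)

    log_q = torch.distributions.Normal(mu, sigma).log_prob(z).sum(dim=-1)
    log_pz = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(dim=-1)   # unit Gaussian prior
    log_px_given_z = decoder.log_prob(x, z)

    elbo = log_px_given_z + log_pz - log_q
    return -elbo.mean()
```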