Tutorial on Variational Autoencoders
- Authors: Doersch, Carl
- Journal: arXiv
- Year: 2016
- Link: doersch2016tutorial.pdf
Introduction
Generative Modeling
- Generative Modeling is a branch of machine learning that focuses on creating models representing distributions of data, denoted as $P(X)$.
  - $X$ represents the data points, such as images.
  - Each data point (like an image) consists of many dimensions, typically corresponding to pixels.
  - A good generative model needs to capture the relationships and dependencies among these pixels.
  - For example, it should recognize that adjacent pixels in an image usually exhibit similar colors and likely form parts of objects.
- Simply computing $P(X)$ numerically is straightforward but can be insufficient.
  - A model that can recognize which images are real versus noise is useful, yet it does not by itself produce new, useful examples.
  - Knowing that an image has a low probability does not equate to being able to create new, high-probability examples.
- The real value of generative models lies in their ability to create new instances that resemble examples from a given database, such as generating new images that appear real or creating additional 3D models for applications like gaming.
Variational Autoencoder (VAE)
- VAEs are a powerful framework for achieving this: they learn a probabilistic model of the data distribution, denoted as $P(X)$, from which new instances similar to a target distribution $P_{gt}(X)$ can be sampled.
- Traditional generative modeling approaches often struggled due to:
  - Strong Assumptions: some models require predefined structures in the data, which might not reflect its complexity.
  - Severe Approximations: these can lead to suboptimal models that fail to capture the true nature of the data.
  - Computational Expense: classical methods like Markov Chain Monte Carlo can be too slow for practical applications.
- VAEs address many of these issues:
  - They make minimal assumptions about the data structure.
  - They can effectively approximate the data distribution using neural networks without the heavy computational burden normally associated with generative models.
  - The approximations they introduce are generally small, allowing for effective training using fast techniques like back-propagation.
Latent Variable Models
- A latent variable ($z$) represents hidden characteristics that influence the generation of observable data (e.g., images).
  - It acts as an intermediate step that the model conditions on when generating data.
- Example of Handwritten Characters:
  - When generating images of digits (0-9), the model first needs to decide which digit to create (e.g., 5 or 0).
  - This decision, represented by the latent variable $z$, ensures that the generated features of the digit are coherent and align with the chosen character.
- (Uniqueness of Latent Variables) Given an output character, the specific settings of the latent variables that produced it are unknown; inference methods (like those in computer vision) are required to determine which settings correspond to a specific output.
- The latent variables' values must effectively map to the data points in the dataset to ensure the generative model adequately represents the distribution of the data.
- To ensure that a VAE can effectively represent the dataset, latent variables $z$ are sampled from a probability density function (PDF) $P(z)$.
- A function $f(z; \theta)$, when optimized, should produce outputs $f(z; \theta)$ that are similar to the data points $X$ in the original dataset.
- The goal is to maximize the overall probability of generating the observed data, represented mathematically as:
  $$P(X) = \int P(X|z; \theta) P(z) \, dz$$
  - Here, $P(X|z; \theta)$ is the conditional probability of $X$ given the latent variable $z$ and the parameters $\theta$.
- Gaussian Distribution in VAEs:
  $$P(X|z; \theta) = N(X \mid f(z; \theta), \sigma^2 I)$$
  - $f(z; \theta)$ is a deterministic function that defines the mean of the Gaussian output.
  - $\sigma^2 I$ describes the covariance, ensuring that samples generated from the model won't be identical to any specific training data point but will be akin to the overall dataset.
  - The choice of a Gaussian distribution is crucial because it allows gradient-based optimization (like stochastic gradient descent) to be applied effectively.
- By using a Gaussian output, the model can create varied outputs, since the samples drawn will not always be identical to the input data points $X$, which facilitates learning.
- Why Not a Dirac Delta Function?
  - If $P(X|z)$ were a Dirac delta function, each $z$ would deterministically produce a fixed $X$, making it impossible to explore variations around $X$.
  - This would limit the model's ability to learn and generate diverse examples, hindering its capacity to perform well on similar but unseen data.
- The rectangle in the paper's graphical model is “plate notation,” meaning that we can sample $z$ and $X$ $N$ times while the model parameters $\theta$ remain fixed. A sampling sketch for this generative process follows below.
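A minimal sketch (not from the paper) of ancestral sampling from the latent variable model $P(X) = \int P(X|z; \theta) P(z) \, dz$. A small MLP stands in for the deterministic function $f(z; \theta)$; the layer sizes and $\sigma$ are arbitrary illustrative choices.

```python
# Ancestral sampling sketch: z ~ P(z) = N(0, I), then X ~ N(f(z; theta), sigma^2 I).
import torch
import torch.nn as nn

latent_dim, data_dim, sigma = 2, 784, 0.1    # hypothetical dimensions and noise level

f = nn.Sequential(                           # f(z; theta): maps z to the Gaussian mean
    nn.Linear(latent_dim, 128),
    nn.ReLU(),
    nn.Linear(128, data_dim),
)

z = torch.randn(16, latent_dim)              # z ~ P(z) = N(0, I), a batch of 16 samples
mean = f(z)                                  # f(z; theta)
X = mean + sigma * torch.randn_like(mean)    # X ~ N(f(z; theta), sigma^2 I)
print(X.shape)                               # torch.Size([16, 784])
```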
Variational Autoencoders
- Variational Autoencoders (VAEs) utilize latent variables to model complex data distributions.
- Unlike classical autoencoders, they do not directly copy the input data; instead, they model a distribution over the data.
- Latent Variables
  - In a VAE, latent variables $z$ represent underlying factors generating the observed data $X$. VAEs assume $z$ follows a simple distribution, typically $N(0, I)$ (a standard normal distribution), which allows them to learn complex mappings from $z$ to $X$ using a neural network.
- Sampling and Likelihood
  - VAEs approximate the likelihood of the data $P(X)$ by sampling from the latent space.
  - They sample from a simple distribution and maximize the likelihood of the data with stochastic gradient descent, avoiding computationally expensive methods like Markov Chain Monte Carlo.
Setting and Objective
- Objective of Sampling in VAEs:
  - The goal is to compute the likelihood $P(X)$ of the data $X$ using latent variables $z$ that are likely to have generated $X$.
- Use of the Function $Q(z|X)$
  - The function $Q(z|X)$ is introduced to provide a distribution over latent variables $z$ that are likely to produce a particular $X$.
  - This makes the process efficient, as we only need to consider the $z$ values that contribute meaningfully to $P(X)$.
- Kullback-Leibler Divergence $D$:
  - The KL divergence measures how one probability distribution diverges from a second, expected probability distribution:
  $$D[Q(z) \| P(z|X)] = E_{z \sim Q} [\log Q(z) - \log P(z|X)]$$
  - This quantifies the difference between $Q(z)$ and the true posterior $P(z|X)$.
- Relation Between Expectations
  - Applying Bayes' rule to $P(z|X)$ relates this divergence to both $P(X)$ and $P(X|z)$:
  $$D[Q(z) \| P(z|X)] = E_{z \sim Q} [\log Q(z) - \log P(X|z) - \log P(z)] + \log P(X)$$
- Maximizing the Log Probability
  - Rearranging the previous equation (and conditioning $Q$ on $X$) gives the core VAE identity, which lets us maximize $\log P(X)$ while minimizing $D[Q(z|X) \| P(z|X)]$:
  $$\log P(X) - D[Q(z|X) \| P(z|X)] = E_{z \sim Q} [\log P(X|z)] - D[Q(z|X) \| P(z)]$$
  - $Q(z|X)$ is ideally constructed to closely match $P(z|X)$, thereby allowing us to optimize the likelihood of observing $X$. A numeric check of this identity on a toy model appears at the end of this section.
- Convergence and Optimization
  - If $Q(z|X)$ can accurately approximate $P(z|X)$, the KL divergence becomes zero, and we can effectively optimize $P(X)$ directly.
  - The framework is essentially variational inference: the posterior $P(z|X)$ is modeled with a simpler distribution $Q(z|X)$.
- VAEs replace an intractable sampling procedure with an approximation based on simpler distributions, enabling efficient representation learning of the data likelihood.
  - The central mathematical tool here is the Kullback-Leibler divergence, which guides the optimization and ensures that $Q(z|X)$ captures the characteristics of the latent variables associated with the observed data.
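To make the identity above concrete, here is a small self-contained check on a 1-D linear-Gaussian toy model of my own choosing (not from the paper), where the marginal and posterior are available in closed form, so both sides of the equation can be evaluated exactly.

```python
# Numeric check of  log P(X) - D[Q(z|X) || P(z|X)] = E_{z~Q}[log P(X|z)] - D[Q(z|X) || P(z)]
# for the toy model P(z) = N(0, 1), P(X|z) = N(z, sigma2), which gives
# P(X) = N(0, 1 + sigma2) and P(z|X) = N(X / (1 + sigma2), sigma2 / (1 + sigma2)).
import math

def kl_gauss(m0, v0, m1, v1):
    """KL divergence D[N(m0, v0) || N(m1, v1)] between 1-D Gaussians."""
    return 0.5 * (v0 / v1 + (m1 - m0) ** 2 / v1 - 1.0 + math.log(v1 / v0))

X, sigma2 = 1.3, 0.25          # an observed data point and the observation noise
m, s2 = 0.7, 0.2               # an arbitrary approximate posterior Q(z|X) = N(m, s2)

# Exact marginal and exact posterior of the toy model.
log_pX = -0.5 * math.log(2 * math.pi * (1 + sigma2)) - X ** 2 / (2 * (1 + sigma2))
post_mean, post_var = X / (1 + sigma2), sigma2 / (1 + sigma2)

# E_{z~Q}[log P(X|z)] in closed form for the Gaussian likelihood.
expected_loglik = -0.5 * math.log(2 * math.pi * sigma2) - ((X - m) ** 2 + s2) / (2 * sigma2)

lhs = log_pX - kl_gauss(m, s2, post_mean, post_var)
rhs = expected_loglik - kl_gauss(m, s2, 0.0, 1.0)
print(lhs, rhs)                # the two sides agree up to floating-point error
```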
Optimizing the objective function
- A training-time variational autoencoder is implemented as a feed-forward neural network, where $P(X|z)$ is Gaussian.
- The objective involves maximizing the likelihood of the data while minimizing the Kullback-Leibler divergence between two probability distributions: the approximate posterior $Q(z|X)$ and the prior $P(z)$.
- The usual choice for $Q(z|X)$ is a multivariate Gaussian:
  $$Q(z|X) = N(z \mid \mu(X; \theta), \Sigma(X; \theta))$$
  - $\mu$ and $\Sigma$ are deterministic functions, parameterized by $\theta$, that are learned from data.
  - This choice simplifies computation and facilitates the optimization process.
- KL-divergence computation
  - The KL-divergence between two multivariate Gaussians has a closed form:
  $$D[N(\mu_0, \Sigma_0) \| N(\mu_1, \Sigma_1)] = \frac{1}{2} \left( \text{tr}(\Sigma_1^{-1} \Sigma_0) + (\mu_1 - \mu_0)^{\top} \Sigma_1^{-1} (\mu_1 - \mu_0) - k + \log\frac{\det \Sigma_1}{\det \Sigma_0} \right)$$
  - $k$ is the dimensionality of the distribution; this closed form allows the divergence term to be evaluated and optimized efficiently, as in the sketch below.
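A minimal sketch (my own, with assumed tensor shapes) of this formula specialized to the common VAE setup where $Q(z|X)$ has a diagonal covariance and the prior is $P(z) = N(0, I)$; the expression then reduces to $\frac{1}{2}\sum_i (\Sigma_{ii} + \mu_i^2 - 1 - \log \Sigma_{ii})$.

```python
import torch

def kl_to_standard_normal(mu, log_var):
    """D[N(mu, diag(exp(log_var))) || N(0, I)], summed over the latent dimensions."""
    return 0.5 * torch.sum(log_var.exp() + mu ** 2 - 1.0 - log_var, dim=-1)

mu = torch.zeros(4, 2)                       # a batch of 4 two-dimensional means
log_var = torch.zeros(4, 2)                  # log-variances of 0  =>  Sigma = I
print(kl_to_standard_normal(mu, log_var))    # tensor([0., 0., 0., 0.])
```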
- Gradient computation and the reparameterization trick:
  - To estimate the expected value $E_{z \sim Q}[\log P(X|z)]$, a sample $z$ is drawn from $Q(z|X)$.
  - Since $\log P(X|z)$ depends on both $P$ and $Q$, care must be taken to ensure that the sampling step does not break the gradient flow during backpropagation.
  - The solution is the reparameterization trick: instead of sampling directly from $Q(z|X)$, we express $z$ as
  $$z = \mu(X) + \Sigma^{1/2}(X) \cdot e, \quad e \sim N(0, I)$$
  - $z$ is thus a deterministic function of $X$ (plus the auxiliary noise $e$), which allows gradients to flow through $\mu$ and $\Sigma$ and makes training efficient.
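A minimal sketch of the reparameterization trick, assuming a diagonal covariance parameterized by its log-variance (a common convention, not something the paper prescribes).

```python
import torch

def reparameterize(mu, log_var):
    """Draw z = mu(X) + Sigma^{1/2}(X) * e with e ~ N(0, I), keeping gradients."""
    e = torch.randn_like(mu)                  # e ~ N(0, I); no gradient flows through e
    return mu + torch.exp(0.5 * log_var) * e

mu = torch.zeros(4, 2, requires_grad=True)
log_var = torch.zeros(4, 2, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()                            # gradients flow into mu and log_var
print(mu.grad.shape, log_var.grad.shape)
```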
- Final optimization equation
  - The overall objective to maximize becomes:
  $$E_{X \sim D} \left[ E_{z \sim Q} \left[ \log P(X|z) \right] - D[Q(z|X) \| P(z)] \right]$$
  - This formulation captures the dual objective of maximizing the likelihood while minimizing the divergence, and it can be optimized effectively with stochastic gradient descent; a training-step sketch follows below.
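Putting the pieces together, here is a self-contained sketch of one stochastic-gradient step on the negative of this objective, with Gaussian $P(X|z) = N(f(z), \sigma^2 I)$ and diagonal-Gaussian $Q(z|X)$. The architecture, layer sizes, optimizer, and $\sigma^2$ are my own illustrative assumptions, not the paper's.

```python
import torch
import torch.nn as nn

data_dim, latent_dim, sigma2 = 784, 2, 1.0

encoder = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(),
                        nn.Linear(128, 2 * latent_dim))     # outputs [mu, log_var]
decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                        nn.Linear(128, data_dim))            # outputs f(z)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

X = torch.rand(32, data_dim)                     # a stand-in minibatch

mu, log_var = encoder(X).chunk(2, dim=-1)        # parameters of Q(z|X)
z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)    # reparameterization trick
f_z = decoder(z)                                 # mean of P(X|z)

# -log P(X|z) up to a constant: squared error scaled by 1/(2 sigma^2).
recon = ((X - f_z) ** 2).sum(dim=-1) / (2 * sigma2)
kl = 0.5 * (log_var.exp() + mu ** 2 - 1.0 - log_var).sum(dim=-1)

loss = (recon + kl).mean()                       # negative ELBO, averaged over the batch
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```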
- The testing-time variational “autoencoder” allows us to generate new samples; the “encoder” pathway is simply discarded.

Testing the learned model
- Test-Time Process
  - During testing, the encoder is removed from the VAE architecture.
  - New samples are generated by feeding values $z \sim N(0, I)$, drawn from a standard normal distribution, into the decoder.
- Evaluating Probability
  - The probability $P(X)$ of a given sample is generally intractable to compute directly.
  - Because the KL divergence $D[Q(z|X) \| P(z|X)]$ is non-negative, the right-hand side of the training objective is a lower bound on $\log P(X)$.
  - To approximate $P(X)$ itself, sampling $z$ from the approximate posterior $Q(z|X)$ yields an estimator that tends to converge much faster than sampling from the prior $N(0, I)$; see the sketch below.
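A minimal sketch (my own 1-D toy, not from the paper) comparing the two estimators of $P(X)$: naive Monte Carlo over the prior, $E_{z \sim P(z)}[P(X|z)]$, versus importance sampling with $Q$ as the proposal, $E_{z \sim Q}[P(X|z)P(z)/Q(z|X)]$. In this toy, $Q$ is taken to be the exact posterior, so the second estimator has essentially zero variance.

```python
import torch

X, obs_var = 2.0, 0.01
prior = torch.distributions.Normal(0.0, 1.0)
# In this toy, Q(z|X) is the exact posterior N(X / (1 + obs_var), obs_var / (1 + obs_var)).
q = torch.distributions.Normal(X / (1 + obs_var), (obs_var / (1 + obs_var)) ** 0.5)

def log_lik(z):
    """log P(X|z) for the toy likelihood P(X|z) = N(z, obs_var)."""
    return torch.distributions.Normal(z, obs_var ** 0.5).log_prob(torch.tensor(X))

n = 10_000
z_prior = prior.sample((n,))
est_prior = log_lik(z_prior).exp().mean()              # E_{z~P(z)}[P(X|z)], noisy

z_q = q.sample((n,))
weights = (log_lik(z_q) + prior.log_prob(z_q) - q.log_prob(z_q)).exp()
est_q = weights.mean()                                 # E_{z~Q}[P(X|z)P(z)/Q(z)], low variance

exact = torch.distributions.Normal(0.0, (1 + obs_var) ** 0.5).log_prob(torch.tensor(X)).exp()
print(float(exact), float(est_prior), float(est_q))    # the Q-based estimate is far less noisy
```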
- Usefulness of the Lower Bound
  - This lower bound indicates how probable samples $X$ are under the learned model, and therefore gives insight into how well the VAE represents the training data.
Interpreting the objective
- The VAE framework seeks to optimize the likelihood of the data, $\log P(X)$, but it does so with approximations, and understanding them is crucial to understanding its performance.
- The learning objective incorporates two key components:
  - $D[Q(z|X) \| P(z|X)]$: the Kullback-Leibler divergence measuring how well the approximating distribution $Q(z|X)$ aligns with the true posterior $P(z|X)$. Working with this term, while necessary for making the model tractable and efficient, introduces some error, because the exact posterior is often complex and not easily computable.
  - $\log P(X)$: the likelihood of the observed data under the model; maximizing it ensures that reconstructed samples closely resemble the original data points.
VAE's potential error comes from balancing these two terms:
- If
$Q(z|X)$ is an accurate approximation of$P(z|X)$ , the divergence term becomes small, and the model performs well. - If not, larger divergences may indicate poor model performance, hence affecting the data generation capability of the VAE.
- If
- The relationship to information theory comes through the Minimum Description Length (MDL) principle:
  - Lower values of the objective imply a more efficient coding of the data, meaning the model captures more of the essential information with fewer bits.
  - This explains the efficiency gained through the variational inference framework: the chosen approximating distribution $Q(z|X)$ must minimize the additional coding overhead.
- The tuning of $\sigma$ in $P(X|z)$ can be seen as a form of regularization, akin to the parameters used in sparse autoencoders, which control the complexity of the model and prevent overfitting.
The error from $D[Q(z|X) \| P(z|X)]$
- $Q(z|X)$ is the approximate posterior distribution of the latent variable $z$ given the input $X$.
- $P(z|X)$ is the true posterior distribution of $z$ given the input $X$.
- Kullback-Leibler Divergence (KL Divergence)
  - The term $D[Q(z|X) \| P(z|X)]$ is the KL divergence between these two distributions.
  - It quantifies how much information is lost when the approximate distribution $Q(z|X)$ is used instead of the true distribution $P(z|X)$.
- Convergence to the True Distribution
  - For the model's output distribution $P(X)$ to converge to the true distribution, $D[Q(z|X) \| P(z|X)]$ must approach zero.
  - This means the approximate posterior $Q(z|X)$ needs to accurately represent the true posterior $P(z|X)$.
- Challenges in Achieving Zero Divergence
  - The author highlights that simply having high-capacity (i.e., complex) functions for $\mu(X)$ (the mean) and $\Sigma(X)$ (the covariance) does not guarantee that $Q(z|X)$ will resemble $P(z|X)$ closely enough to make the divergence zero.
  - The function $f$ mediating the relationship between $z$ and $X$ can greatly affect this outcome.
- Existence of a Suitable Function
  - The paper posits that there may exist a sufficiently flexible function $f$ that can make $P(z|X)$ Gaussian for all $X$ while simultaneously maximizing the likelihood $\log P(X)$.
  - If such a function exists, it would facilitate minimizing the divergence $D[Q(z|X) \| P(z|X)]$.
- The author acknowledges that proving general results for all distributions remains an open problem, but notes that it is theoretically proven in some 1-D cases.
- A small variance $\sigma$ can make such modeling easier, though it may also lead to complications in gradient scaling during training.
Information-theoretic interpretation
- Minimum Description Length Principle
  - This principle suggests that the best model is the one that minimizes the number of bits needed to encode the data. In the context of VAEs, $-\log P(X)$ represents the total number of bits required to encode the data $X$ under an ideal encoding strategy.
- Step 1 - Encoding the Latent Variable $z$
  - Some bits are used to determine the latent variable $z$.
  - The KL-divergence $D[Q(z|X) \| P(z)]$ quantifies the extra information needed to move from the (uninformative) prior distribution $P(z)$ to the posterior distribution $Q(z|X)$.
  - This measures how much information we gain about the latent variable $z$ when it is informed by the observed data $X$.
- Step 2 - Decoding
  - The term $-\log P(X|z)$ measures the amount of information needed to reconstruct $X$ once $z$ has been determined.
- The total number of bits needed to accurately represent $X$ is the sum of the bits used in both steps, minus a penalty term, $D[Q(z|X) \| P(z|X)]$, that we pay because $Q(z|X)$ is a sub-optimal encoding of $z$.
  - This penalty reflects the inefficiency of the encoding process: a sub-optimal encoding leads to excess bits.
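Rearranging the core identity from the "Setting and Objective" section makes this bit accounting explicit:

$$-\log P(X) = \underbrace{D[Q(z|X) \| P(z)]}_{\text{bits to encode } z} + \underbrace{E_{z \sim Q}\big[-\log P(X|z)\big]}_{\text{bits to reconstruct } X} - \underbrace{D[Q(z|X) \| P(z|X)]}_{\text{penalty for sub-optimal } Q}$$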
VAEs and the regularization parameter
- In a traditional sparse autoencoder, a regularization parameter $\lambda$ is used in the objective function, which can be represented as:
  $$L = \| \phi( \psi(X) ) - X \|^2 + \lambda \| \psi(X) \|_0$$
  - $\psi$ and $\phi$ are the encoder and decoder functions, respectively, and $\| \cdot \|_0$ is the L0 norm, which promotes sparsity in the encoding.
- Unlike sparse autoencoders, VAEs typically do not have a separate explicit regularization parameter to tune.
  - This is advantageous, as it reduces hyperparameter tuning for practitioners.
- Absorption of Constants
  - One might think of introducing such a parameter by scaling the latent variable $z$ (e.g., using $z' \sim N(0, \lambda I)$), but this does not fundamentally change the model.
  - The model remains the same because the constant can be absorbed into the probabilistic definitions of $P$ and $Q$:
  $$f'(z') = f(z'/\lambda), \quad \mu'(X) = \mu(X) \cdot \lambda, \quad \Sigma'(X) = \Sigma(X) \cdot \lambda^2$$
- Output Distribution and Regularization Parameter
  - The output distribution for continuous data is typically Gaussian:
  $$P(X|z) = N(X \mid f(z), \sigma^2 I)$$
  - The log-probability can be expressed as:
  $$\log P(X|z) = C - \frac{1}{2} \frac{\| X - f(z) \|^2}{\sigma^2}$$
- Here, $C$ is a constant. In this context, $\sigma$ acts like a regularization parameter that controls the balance between how well the model fits the data and how simple the model should be.
- Binary vs. Continuous Inputs
  - If the output $X$ is binary, this regularization behavior disappears altogether: both terms of the objective are then measured in the same information units, so no balancing parameter is needed.
  - For continuous outputs, however, a carefully chosen $\sigma$ is needed to keep the information content finite, and it determines the expected accuracy of the model's reconstruction of $X$.
Examples: MNIST & VAE
- The VAE is applied to the MNIST dataset, which consists of grayscale images of handwritten digits (0-9).
  - The values for each pixel are constrained between 0 and 1.
  - Instead of using pre-existing VAE architectures, the author adapts the basic autoencoder example from Caffe, demonstrating flexibility in implementation.
- Loss Function
  - The author uses the Sigmoid Cross Entropy loss for the probability distribution $P(X|z)$ of the data given the latent variables $z$.
  - This loss function is appropriate since the MNIST pixel values lie between 0 and 1.
  - New data points $X'$ are defined probabilistically, sampled according to:
  $$X'_i \sim \text{Bernoulli}(X_i)$$
  - Each pixel value is treated as a Bernoulli trial, where $X_i$ is the actual observed value from the training set.
  - This binarization captures the uncertainty in the pixel representations. A sketch of this reconstruction term follows below.
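A minimal sketch (tensor shapes and variable names are my own assumptions) of the Bernoulli observation model and its sigmoid cross-entropy reconstruction term:

```python
import torch
import torch.nn.functional as F

batch, pixels = 32, 784
X = torch.rand(batch, pixels)            # MNIST-like pixel values in [0, 1]
logits = torch.zeros(batch, pixels)      # stand-in decoder output (pre-sigmoid)

# -log P(X|z) per image under the Bernoulli model, i.e. the sigmoid cross entropy.
recon = F.binary_cross_entropy_with_logits(logits, X, reduction="none").sum(dim=-1)

# Sampling the binarized targets X'_i ~ Bernoulli(X_i).
X_binary = torch.bernoulli(X)
print(recon.shape, X_binary.unique())
```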
- Training Process
  - The model is trained once to completion, with multiple restarts used to identify the learning rate that best minimizes the loss.
  - This indicates that achieving good performance does not heavily depend on the initial setup or on deep structural modifications.
- The results show that while many generated digits appear realistic, some samples fall in between digits, exemplifying the VAE's tendency to interpolate between classes rather than producing distinctly different outputs.
  - For example, a generated digit might look like a blend between '7' and '9'.
- The dimensionality of the latent variable $z$ has varying impacts on model performance.
  - If the dimensionality is too low (e.g., fewer than 4), the model struggles to capture the complexity of the data, leading to poor performance; with too few dimensions it cannot adequately represent the variations present in the input.
  - Increasing the dimensionality of $z$ improves performance up to a point, but an excessively high dimensionality (e.g., 10,000) makes training harder to manage, especially during optimization with stochastic gradient descent.
  - This happens because the model has a harder time keeping the Kullback-Leibler divergence $D[Q(z|X) \| P(z)]$ (a measure of how far the approximate posterior is from the prior) low when $z$ is high-dimensional.

