
Tutorial on Variational Autoencoders #51

@standing-o

Description


Tutorial on Variational Autoencoders

Introduction

  • Generative Modeling

    • Generative Modeling is a branch of machine learning that focuses on creating models representing distributions of data, denoted as $P(X)$.
      • $X$ represents the data points, such as images.
    • Each data point (like an image) consists of many dimensions, typically corresponding to pixels.
    • A good generative model needs to understand and capture the relationships and dependencies among these pixels.
      • For example, it should recognize that adjacent pixels in an image usually exhibit similar colors and likely form parts of objects.
    • A model that merely assigns a numerical probability $P(X)$ to each example can be insufficient.
      • Such a model can recognize which images are real versus noise, which is useful, but it does not by itself produce new examples.
      • Knowing that an image has a low probability is not the same as being able to create new, high-probability examples.
      • The real value of generative models lies in their ability to create new instances that resemble examples from a given database, such as generating new images that appear real or creating additional 3D models for applications like gaming.
  • Variational Autoencoder (VAE)

    • VAEs are a powerful framework for this: they learn a probabilistic model $P(X)$ of the data distribution from which new instances can be sampled that resemble draws from the target distribution $P_{gt}(X)$.
    • Traditional generative modeling approaches often struggled due to:
      • Strong Assumptions: Some models require predefined structures in data which might not reflect its complexity.
      • Severe Approximations: These can lead to suboptimal models, failing to capture the true nature of the data.
      • Computational Expense: Classical methods like Markov Chain Monte Carlo can be too slow for practical applications.
    • VAEs address many of these issues:
      • They make minimal assumptions about the data structure.
      • They can effectively approximate the data distribution using neural networks without the heavy computational burden normally associated with generative models.
      • The approximations they introduce are generally small, allowing for effective training using fast techniques like back-propagation.

Latent Variable Models

  • A latent variable ($z$) represents hidden characteristics that influence the generation of observable data (ex. images).

    • It acts as an intermediate step that the model utilizes to condition the data it generates.
  • Example of Handwritten Characters:

    • When generating images of digits (0-9), the model first needs to decide which digit to create (e.g., 5 or 0).
    • This decision, represented by the latent variable $z$, ensures that the generated features of the digit are coherent and align with the chosen character.
    • (Hidden settings of the latent variable) Given an output character, the specific latent-variable settings that produced it are not observed; recovering them requires inference methods (like computer vision techniques) to understand which settings correspond to a specific output.
    • The latent variables’ values must effectively map to the data points in the dataset to ensure the generative model adequately represents the distribution of the data.
    • The process described focuses on how to ensure that a Variational Autoencoder (VAE) model can effectively represent the dataset.
    • It emphasizes the importance of having latent variables (denoted as $z$), which are sampled from a probability density function (PDF) $P(z)$.
    • The function $f(z; \theta)$, once its parameters $\theta$ are optimized, should map sampled values of $z$ to outputs that are similar to the data points $X$ in the original dataset.
    • The goal is to maximize the overall probability of generating the observed data, represented mathematically as:
      $$P(X) = \int P(X|z; \theta) P(z) dz$$
    • Here, $P(X|z; \theta)$ indicates the conditional probability of $X$ given the latent variable $z$ and the parameters $\theta$.
  • Gaussian Distribution in VAEs:
    $$P(X|z; \theta) = N(X|f(z; \theta), \sigma^2 I)$$

    • $f(z; \theta)$ is a deterministic function that defines the mean of the Gaussian output.
    • $\sigma^2 I$ describes the covariance, ensuring that samples generated from the model won't be identical to any specific training data point but will be akin to the overall dataset.
    • The key property is that $P(X|z)$ can be computed and varies smoothly with $\theta$, so gradient-based optimization (like stochastic gradient descent) can be applied effectively; a Gaussian is a convenient choice that satisfies this.
  • By using a Gaussian output, the model can create varied outputs since the samples drawn will not always be identical to the input data points $X$, which facilitates learning (a small sampling sketch appears at the end of this section).

  • Why Not a Dirac Delta Function?

    • If $P(X|z)$ were a Dirac delta function:
      • Each $z$ would deterministically produce a fixed $X$, making it impossible to explore variations around $X$.
      • This would limit the model's ability to learn and generate diverse examples, hindering its capacity to perform well on similar but unseen data.
  • (Figure note: the graphical model of this setup) The rectangle is “plate notation,” meaning that we can sample $z$ and $X$ $N$ times while the model parameters $\theta$ remain fixed.
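To make the generative process above concrete, here is a minimal sketch (assuming a hypothetical trained decoder `f`; the linear stand-in at the bottom exists only so the snippet runs): draw $z$ from the prior, compute the Gaussian mean $f(z; \theta)$, and add noise with covariance $\sigma^2 I$.

```python
import numpy as np

def sample_from_model(f, latent_dim, sigma=0.1, rng=None):
    """Draw one sample X ~ N(f(z), sigma^2 I) with z ~ P(z) = N(0, I).

    `f` stands for a (hypothetical) trained decoder f(z; theta) that maps a
    latent vector to the mean of the Gaussian output, e.g. a flattened image.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(latent_dim)                    # z ~ N(0, I)
    mean = f(z)                                            # f(z; theta)
    return mean + sigma * rng.standard_normal(mean.shape)  # add N(0, sigma^2 I) noise

# Toy usage with a stand-in linear "decoder", just so the sketch runs:
W = np.random.default_rng(1).standard_normal((784, 2)) * 0.1
X = sample_from_model(lambda z: W @ z, latent_dim=2)
```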

Variational Autoencoders

  • Variational Autoencoders (VAEs) utilize latent variables to model complex data distributions.
  • Unlike classical autoencoders, they do not learn to directly copy the input data; instead, they define a probability distribution over the data from which new examples can be sampled.
  • Latent Variables
    • In a VAE, latent variables $z$ represent underlying factors generating the observed data $X$. VAEs assume $z$ follows a simple distribution, typically $N(0, I)$ (a standard normal distribution), which allows them to learn complex mappings from $z$ to $X$ using a neural network.
  • Sampling and Likelihood
    • VAEs approximate the likelihood of data $P(X)$ through sampling from the latent space.
    • The naive strategy is to sample many $z$ values from the simple prior and average $P(X|z)$ (see the sketch after this list); because this needs far too many samples in high dimensions, VAEs instead maximize a tractable lower bound on the likelihood with stochastic gradient descent, avoiding computationally expensive methods like Markov Chain Monte Carlo.
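The naive Monte Carlo estimator mentioned above, as a toy sketch (again with a hypothetical decoder `f` and a Gaussian output): sample $z$ from the prior and average $P(X|z)$. For high-dimensional $X$, almost every prior sample gives a negligible $P(X|z)$, so this needs an enormous number of samples, which is what motivates introducing $Q(z|X)$ below.

```python
import numpy as np

def log_p_x_given_z(X, mean, sigma):
    """log N(X | mean, sigma^2 I) for a flattened data point X."""
    d = X.size
    return -0.5 * (np.sum((X - mean) ** 2) / sigma ** 2
                   + d * np.log(2 * np.pi * sigma ** 2))

def naive_log_px(X, f, latent_dim, n_samples=10_000, sigma=0.1, seed=0):
    """Monte Carlo estimate of log P(X) = log E_{z ~ N(0,I)}[P(X|z)]."""
    rng = np.random.default_rng(seed)
    log_ps = np.array([log_p_x_given_z(X, f(rng.standard_normal(latent_dim)), sigma)
                       for _ in range(n_samples)])
    m = log_ps.max()                                  # log-mean-exp for numerical stability
    return m + np.log(np.mean(np.exp(log_ps - m)))
```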

Setting and Objective

  • Objective of Sampling in VAEs:

    • The goal is to compute the likelihood $P(X)$ of the data $X$ based on latent variables $z$ that are likely to generate $X$.
  • Use of Function $Q(z|X)$

    • The function $Q(z|X)$ is introduced to provide a distribution of latent variables $z$ that are likely to produce a particular $X$.
    • This makes the process efficient, as you only need to focus on the $z$ values that contribute meaningfully to $P(X)$.
  • Kullback-Leibler Divergence $D$:

    • The KL divergence measures how one probability distribution diverges from a second expected probability distribution.
      $$D[Q(z) | P(z|X)] = E_{z \sim Q} [\log Q(z) - \log P(z|X)]$$
    • This quantifies the difference between $Q(z)$ and the true posterior $P(z|X)$.
  • Relation Between Expectations

    • Applying Bayes' rule, $\log P(z|X) = \log P(X|z) + \log P(z) - \log P(X)$; since $\log P(X)$ does not depend on $z$, it can be pulled out of the expectation, relating $P(X)$ and $P(X|z)$ (the full chain of steps is written out at the end of this section):
      $$D[Q(z) | P(z|X)] = E_{z \sim Q} [\log Q(z) - \log P(X|z) - \log P(z)] + \log P(X)$$
  • Maximizing the Log Probability

    • Rearranging, and letting $Q$ depend on $X$, gives the core VAE equation: the left-hand side is the quantity we want to maximize, $\log P(X)$, minus an error term $D[Q(z|X) | P(z|X)]$ that we want to drive to zero:
      $$\log P(X) - D[Q(z|X) | P(z|X)] = E_{z \sim Q} [\log P(X|z)] - D[Q(z|X) | P(z)]$$
    • This means that $Q(z|X)$ is ideally constructed to closely match $P(z|X)$, thereby allowing us to optimize the likelihood of observing $X$.
  • Convergence and Optimization

    • If $Q(z|X)$ can accurately approximate $P(z|X)$, the KL divergence becomes zero, and we can effectively optimize $P(X)$ directly.
    • The framework essentially introduces a method of variational inference, where one models the posterior $P(z|X)$ using a simpler distribution $Q(z|X)$.
  • In effect, VAEs replace an intractable likelihood computation with an optimization over a simpler distribution, enabling efficient representation learning.

    • The central mathematical tool used here is the Kullback-Leibler divergence, which guides the optimization of the model, ensuring that $Q(z|X)$ provides a distribution that effectively captures the characteristics of latent variables associated with observed data.
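For reference, the chain of algebra connecting the three equations above (a standard manipulation, written with $Q(z)$ before it is conditioned on $X$):

$$
\begin{aligned}
D[Q(z) | P(z|X)] &= E_{z \sim Q} [\log Q(z) - \log P(z|X)] \\
&= E_{z \sim Q} [\log Q(z) - \log P(X|z) - \log P(z)] + \log P(X) \\
\log P(X) - D[Q(z) | P(z|X)] &= E_{z \sim Q} [\log P(X|z)] - E_{z \sim Q} [\log Q(z) - \log P(z)] \\
&= E_{z \sim Q} [\log P(X|z)] - D[Q(z) | P(z)]
\end{aligned}
$$

The second line substitutes Bayes' rule $\log P(z|X) = \log P(X|z) + \log P(z) - \log P(X)$ and pulls the $z$-independent term $\log P(X)$ out of the expectation; setting $Q(z) = Q(z|X)$ then gives the equation used for training.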

Optimizing the objective function

  • (Figure note) A training-time variational autoencoder implemented as a feed-forward neural network, where $P(X|z)$ is Gaussian.

    • Left is without the “reparameterization trick”, and right is with it.
    • Red shows sampling operations that are non-differentiable.
    • Blue shows loss layers.
    • The feedforward behavior of these networks is identical, but backpropagation can be applied only to the right network.
  • The objective involves maximizing the likelihood of the data while minimizing the Kullback-Leibler divergence between two probability distributions: the approximate posterior $Q(z|X)$ and the prior $P(z)$.

  • The usual choice for $Q(z|X)$ is modeled as a multivariate Gaussian distribution:
    $$Q(z|X) = N(z|\mu(X; \theta), \Sigma(X; \theta))$$

    • where $\mu$ and $\Sigma$ are deterministic functions parameterized by $\theta$ that are learned from data; in practice $\Sigma$ is constrained to be a diagonal matrix.
    • This choice simplifies computation and facilitates the optimization process.
  • KL-divergence computation

    • The KL-divergence between two multivariate Gaussians $N(\mu_0, \Sigma_0)$ and $N(\mu_1, \Sigma_1)$ has a closed form:
      $$D[N(\mu_0, \Sigma_0) | N(\mu_1, \Sigma_1)] = \frac{1}{2} \left( \text{tr}(\Sigma_1^{-1} \Sigma_0) + (\mu_1 - \mu_0)^{\top} \Sigma_1^{-1} (\mu_1 - \mu_0) - k + \log\left(\frac{\det \Sigma_1}{\det \Sigma_0}\right) \right)$$
    • $k$ is the dimensionality of the distribution, and this closed form allows efficient evaluation and optimization of the divergence.
    • With $Q(z|X) = N(\mu(X), \Sigma(X))$ and $P(z) = N(0, I)$, this simplifies to:
      $$D[Q(z|X) | P(z)] = \frac{1}{2} \left( \text{tr}(\Sigma(X)) + \mu(X)^{\top} \mu(X) - k - \log \det \Sigma(X) \right)$$
  • Gradient computation and reparameterization trick:

    • To estimate the expected value $E_{z \sim Q}[\log P(X|z)]$, a sample $z$ is drawn from $Q(z|X)$.
    • Since this requires calculating $\log P(X|z)$ which depends on both $P$ and $Q$, care must be taken to ensure that sampling does not disrupt the gradient flow during backpropagation.
    • The solution is to employ the reparameterization trick, where instead of directly sampling from $Q(z|X)$, we express $z$ as:
      • $$z = \mu(X) + \Sigma^{1/2}(X) \cdot e$$ where $e \sim N(0, I)$.
        • $z$ is reparameterized into a form that allows gradients to flow through deterministic functions of $X$, enhancing training efficiency (see the training sketch after this list).
  • Final optimization equation

    • The overall equation we want to optimize becomes:
      $$E_{X \sim D} \left[ E_{z \sim Q} \left[ \log P(X|z) \right] - D[Q(z|X) | P(z)] \right]$$
    • This formulation captures the dual objective of maximizing the likelihood while minimizing the divergence, enabling effective optimization using stochastic gradient descent.
  • (Figure note) The testing-time variational “autoencoder,” which allows us to generate new samples. The “encoder” pathway is simply discarded.
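Putting the pieces of this section together, below is a minimal PyTorch sketch of the training-time objective (not the tutorial's code; the architecture, layer sizes, and the diagonal-covariance parameterization via `logvar` are assumptions). The encoder produces $\mu(X)$ and a diagonal $\Sigma(X)$, $z$ is drawn with the reparameterization trick, and the loss is the negative of the objective above: reconstruction error plus the closed-form KL to $N(0, I)$.

```python
import torch
import torch.nn as nn

class GaussianVAE(nn.Module):
    """Minimal encoder/decoder pair with a Gaussian P(X|z) = N(f(z), sigma^2 I)."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20, sigma=0.1):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)        # mu(X)
        self.logvar = nn.Linear(h_dim, z_dim)    # log of the diagonal of Sigma(X)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))   # f(z)
        self.sigma = sigma

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        e = torch.randn_like(mu)                 # e ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * e     # reparameterization: z = mu + Sigma^{1/2} e
        return self.dec(z), mu, logvar

def negative_elbo(x, x_mean, mu, logvar, sigma):
    # -E_{z~Q}[log P(X|z)] for the Gaussian decoder, up to an additive constant:
    recon = ((x - x_mean) ** 2).sum(dim=1) / (2 * sigma ** 2)
    # Closed-form D[Q(z|X) | P(z)] with diagonal Q(z|X) and P(z) = N(0, I):
    kl = 0.5 * (torch.exp(logvar) + mu ** 2 - 1.0 - logvar).sum(dim=1)
    return (recon + kl).mean()

# One optimization step on a dummy batch, just to show how the pieces connect:
model = GaussianVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)
x_mean, mu, logvar = model(x)
loss = negative_elbo(x, x_mean, mu, logvar, model.sigma)
opt.zero_grad(); loss.backward(); opt.step()
```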

Testing the learned model

  • Test-Time Process

    • During testing, the encoder is removed from the VAE architecture.
    • New samples are generated by feeding values drawn from the standard normal distribution $z \sim N(0, I)$ into the decoder (see the sketch at the end of this section).
  • Evaluating Probability

    • The probability $P(X)$ of a given sample is generally intractable to compute directly.
    • Because the KL divergence $D[Q(z|X) | P(z|X)]$ is non-negative, the right-hand side of the training objective is a lower bound on $\log P(X)$.
    • To approximate $P(X)$ itself, sampling $z$ from the approximate posterior $Q(z|X)$ gives an estimator that tends to converge much faster than sampling from the prior $N(0, I)$.
  • Usefulness of Lower Bound

    • This lower bound gives insights into how well the VAE model represents the training data by indicating how probable generated samples $X$ are under the learned model.
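A minimal sketch of the test-time procedure (the decoder here is an untrained stand-in; in practice it would be the trained decoder from the sketch above): the encoder pathway is never used, and prior samples are simply pushed through the decoder.

```python
import torch
import torch.nn as nn

# Stand-in decoder f(z); at test time this would be the trained decoder,
# while the encoder pathway is simply discarded.
decoder = nn.Sequential(nn.Linear(20, 400), nn.ReLU(), nn.Linear(400, 784))

with torch.no_grad():
    z = torch.randn(16, 20)      # 16 draws of z ~ N(0, I)
    x_mean = decoder(z)          # f(z): the means of P(X|z), typically displayed as the samples
```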

Interpreting the objective

  • The VAE framework seeks to maximize the likelihood of the data, $\log P(X)$, but it does so with some approximations, and understanding those approximations is crucial to understanding its performance.

  • The learning objective incorporates two key components:

    • $D[Q(z|X) | P(z|X)]$: This term is the Kullback-Leibler divergence which measures how well the approximating distribution $Q(z|X)$ aligns with the true posterior $P(z|X)$.
    • Optimizing this term, while necessary for making the model tractable and efficient, introduces some error; this error arises because the exact posterior is often complex and not easily computable.
    • $\log P(X)$: This represents the likelihood of the observed data under the model, and maximizing this ensures that the reconstructed samples closely resemble the original data points.
  • VAE's potential error comes from balancing these two terms:

    • If $Q(z|X)$ is an accurate approximation of $P(z|X)$, the divergence term becomes small, and the model performs well.
    • If not, larger divergences may indicate poor model performance, hence affecting the data generation capability of the VAE.
  • The relationship to information theory is through the concept of Minimum Description Length (MDL):

    • Lower values of the objective imply a more efficient coding of the data, meaning the model captures more essential information with fewer bits.
    • This helps in understanding the efficiencies gained through the variational inference framework, suggesting that the chosen approximating distribution $Q(z|X)$ must minimize additional overhead.
  • The tuning of $\sigma$ in $P(X|z)$ could be seen as a form of regularization, akin to parameters used in sparse autoencoders, which control the complexity of the model and prevent overfitting.

The error from $D[Q(z|X) | P(z|X)]$

  • $Q(z|X)$ is the approximate posterior distribution of the latent variable $z$ given the input $X$.
  • $P(z|X)$ is the true posterior distribution of $z$ given the input $X$.
  • Kullback-Leibler Divergence (KL Divergence)
    • The term $D[Q(z|X) | P(z|X)]$ represents the KL Divergence between these two distributions.
    • It quantifies how much information is lost when using the approximate distribution $Q(z|X)$ instead of the true distribution $P(z|X)$.
  • Convergence to True Distribution:
    • For the model's output distribution $P(X)$ to converge to the true distribution, $D[Q(z|X) | P(z|X)]$ must approach zero.
    • This means the approximate posterior $Q(z|X)$ needs to accurately represent the true posterior $P(z|X)$.
  • Challenges in Achieving Zero Divergence
    • The author highlights that simply having high capacity (i.e., complex) functions for $\mu(X)$ (mean) and $\Sigma(X)$ (covariance) does not guarantee that $Q(z|X)$ will resemble $P(z|X)$ closely enough to make the divergence zero.
    • The function $f$ modulating the relationship can greatly affect this outcome.
  • Existence of a Suitable Function
    • The text posits that there may exist a sufficiently flexible function $f$ that can ensure $P(z|X)$ is Gaussian for all $X$ while simultaneously maximizing the likelihood $\log P(X)$.
    • If such a function exists, it would facilitate minimizing the divergence $D[Q(z|X) | P(z|X)]$.
  • The author acknowledges that proving general results for all distributions remains an open problem, but notes that it's theoretically proven in some 1D cases.
    • A small output noise $\sigma$ in $P(X|z)$ can make modeling easier, though it may also lead to complications in gradient scaling during training.

Information-theoretic interpretation

  • Minimum Description Length Principle
    • This principle suggests that the best model is one that minimizes the number of bits needed to encode the data. In the context of VAEs, $-\log P(X)$ represents the total bits required for encoding data $X$ using an ideal encoding strategy.
  • Step 1 - Encoding Latent Variable $z$
    • Some bits are used to determine the latent variable $z$.
    • The KL-divergence $D[Q(z|X)||P(z)]$ quantifies the extra information needed to adjust from a prior distribution $P(z)$ (uninformative) to the posterior distribution $Q(z|X)$.
    • This measures how much information we gain about the latent variable $z$ when it is informed by the observed data $X$.
  • Step 2 - Decoding
    • The term $-\log P(X|z)$ measures the amount of information needed to reconstruct $X$ once $z$ has been determined.
  • The total cost of this two-step code is the sum of the bits used in both steps, $D[Q(z|X) | P(z)] + E_{z \sim Q}[-\log P(X|z)]$, which exceeds the ideal code length $-\log P(X)$ by exactly $D[Q(z|X) | P(z|X)]$ (see the identity after this list).
    • This penalty reflects the inefficiency of using the approximate posterior $Q(z|X)$ in place of the true posterior $P(z|X)$: a sub-optimal encoding costs excess bits.
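Rearranging the core equation from the "Setting and Objective" section makes this bit accounting explicit (quantities in nats; divide by $\ln 2$ for bits):

$$-\log P(X) = E_{z \sim Q} [-\log P(X|z)] + D[Q(z|X) | P(z)] - D[Q(z|X) | P(z|X)]$$

The first term is the cost of reconstructing $X$ from $z$, the second is the cost of encoding $z$, and the third (always $\geq 0$) is the coding inefficiency, so the two-step code can never beat the ideal $-\log P(X)$ and matches it exactly when $Q(z|X)$ equals the true posterior.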

VAEs and the regularization parameter

  • In a traditional sparse autoencoder, a regularization parameter $\lambda$ is used in the objective function, which can be represented as:
    $$L = \| \phi( \psi(X) ) - X \|^2 + \lambda \| \psi(X) \|_0$$

    • $\psi$ and $\phi$ are the encoder and decoder functions, respectively, and $\| \cdot \|_0$ is the L0 norm promoting sparsity in the encoding.
  • Unlike sparse autoencoders, VAEs typically do not have a separate explicit regularization parameter to tune.

    • This is advantageous as it reduces hyperparameter tuning for practitioners.
  • Absorption of Constants

    • One might think of introducing such a parameter by rescaling the latent variable, i.e. $z' = \lambda z$ so that $z' \sim N(0, \lambda^2 I)$, but this does not fundamentally change the model.
    • The model remains the same because the constant can be absorbed into the definitions of $P$ and $Q$:
      $$f'(z') = f(z'/\lambda), \quad \mu'(X) = \mu(X) \cdot \lambda, \quad \Sigma'(X) = \Sigma(X) \cdot \lambda^2$$
  • Output Distribution and Regularization Parameter

    • The output distribution for continuous data is typically Gaussian:
      $$P(X|z) = N(X|f(z), \sigma^2 I)$$
    • The log-probability can be expressed as:
      $$\log P(X|z) = C - \frac{1}{2} \frac{\| X - f(z) \|^2}{\sigma^2}$$
  • Here, $C$ is a constant. In this context, $\sigma$ acts like a regularization parameter that controls the balance between the two terms of the objective:

    • how closely the model must reconstruct the data vs. how closely $Q(z|X)$ must stay to the prior (illustrated in the sketch at the end of this section).
  • Binary vs. Continuous Inputs

    • If the output $X$ is binary (e.g., modeled with a Bernoulli distribution), there is no $\sigma$ and hence no such regularization parameter: both terms on the right-hand side are already measured in the same information units.
    • However, for continuous cases, we need a carefully chosen $\sigma$ to maintain finite information representation, which affects the expected accuracy of the model's reconstruction of data $X$.
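A tiny illustration of the point above (not from the tutorial; the function below is hypothetical): for a Gaussian decoder, the log-likelihood turns $1/(2\sigma^2)$ into an effective weight on the reconstruction term, which is exactly why $\sigma$ behaves like a regularization parameter.

```python
# With a Gaussian P(X|z) = N(f(z), sigma^2 I), the per-example loss is
#     ||X - f(z)||^2 / (2 * sigma^2) + D[Q(z|X) | P(z)]   (plus a constant),
# so choosing sigma is equivalent to weighting the reconstruction term:
def weighted_loss(squared_error, kl, sigma):
    return squared_error / (2 * sigma ** 2) + kl

# Small sigma: the reconstruction term dominates, so the model must fit X closely.
# Large sigma: the KL term dominates, pushing Q(z|X) toward the prior N(0, I).
```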

Examples: MNIST & VAE

  • The VAE is applied to the MNIST dataset, which consists of grayscale images of handwritten digits (0-9).

    • The values for each pixel are constrained between 0 and 1.
    • Instead of using pre-existing VAE architectures, the authors adapt the basic AutoEncoder example from Caffe, ensuring flexibility in implementation.
  • Loss Function

    • The authors mention using the Sigmoid Cross Entropy loss for the probability distribution $P(X|z)$ of the data given the latent variables $z$.
    • This loss function is appropriate since the MNIST pixel values are between 0 and 1.
    • They probabilistically define new data points as $X'$ sampled according to:
      $$X'_i \sim \text{Bernoulli}(X_i)$$
    • This means that each pixel value is treated as a Bernoulli trial, where $X_i$ is the actual observed value from the training set.
    • This binarization captures the uncertainty in the pixel representations (a minimal sketch of this setup appears at the end of this section).
  • Training Process

    • The model is trained once fully, but with multiple restarts to identify the optimal learning rate for minimizing the loss.
    • This indicates that achieving good performance does not heavily depend on the initial setup or deep structural modifications.
  • Generated Samples

    • The results from the VAE show that while many generated digits appear realistic, some samples fall in-between digits, exemplifying the VAE’s tendency to interpolate between classes rather than producing distinctly different outputs.
    • ex. Digits might look like a blend between '7' and '9'.
  • The dimensionality of the latent variable $z$ in VAEs appears to have varying impacts on model performance.

  • If the dimensionality is too low (ex. less than 4), the model struggles to capture the complexities in the data, leading to poor performance.

    • Specifically, a model with too few dimensions fails to adequately represent the variations present in the input data.
    • Conversely, increasing the dimensions of $z$ improves performance to a certain extent, but when the dimensionality is excessively high (ex. 10,000), it can lead to problems in effectively managing the training, especially during optimization with stochastic gradient descent.
    • This happens because the model has a harder time keeping the Kullback-Leibler divergence $D[Q(z|X) || P(z)]$ (essentially a measure of how far the approximate posterior is from the prior) low when $z$ has very many dimensions.
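To tie this example together, here is a minimal PyTorch sketch of the setup described (illustrative only; the tutorial itself adapts Caffe's autoencoder example, and the layer sizes, optimizer, and latent dimensionality here are assumptions). The decoder outputs per-pixel logits, sigmoid cross entropy plays the role of $-\log P(X|z)$ under the Bernoulli model, and the KL term is the same closed form as in the Gaussian sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MnistVAE(nn.Module):
    """Same structure as the earlier Gaussian sketch, but the decoder outputs
    per-pixel logits for a Bernoulli P(X|z)."""
    def __init__(self, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(784, 400), nn.ReLU())
        self.mu = nn.Linear(400, z_dim)
        self.logvar = nn.Linear(400, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 400), nn.ReLU(), nn.Linear(400, 784))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        return self.dec(z), mu, logvar                            # logits of P(X|z)

def loss_fn(x, logits, mu, logvar):
    # Sigmoid cross entropy plays the role of -log P(X|z): pixel values in [0, 1]
    # are treated as Bernoulli targets (the X'_i ~ Bernoulli(X_i) view above).
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction='sum') / x.size(0)
    # Same closed-form D[Q(z|X) | P(z)] with P(z) = N(0, I) as before:
    kl = 0.5 * (torch.exp(logvar) + mu ** 2 - 1.0 - logvar).sum(dim=1).mean()
    return recon + kl

# One training step on a dummy MNIST-shaped batch:
model = MnistVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                       # stand-in for MNIST images in [0, 1]
logits, mu, logvar = model(x)
loss = loss_fn(x, logits, mu, logvar)
opt.zero_grad(); loss.backward(); opt.step()

# Generation: discard the encoder, sample z ~ N(0, I), squash decoder output to [0, 1].
with torch.no_grad():
    samples = torch.sigmoid(model.dec(torch.randn(16, 20)))
```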
