
Tutorial on Variational Autoencoders #51

@standing-o

Description


Tutorial on Variational Autoencoders

Introduction

  • Generative Modeling

    • Generative Modeling is a branch of machine learning that focuses on creating models representing distributions of data, denoted as $P(X)$.
      • $X$ represents the data points, such as images.
    • Each data point (like an image) consists of many dimensions, typically corresponding to pixels.
    • A good generative model needs to understand and capture the relationships and dependencies among these pixels.
      • For example, it should recognize that adjacent pixels in an image usually exhibit similar colors and likely form parts of objects.
    • A model that merely assigns a numerical probability $P(X)$ to each example can be insufficient.
      • Such a model can recognize which images are real versus noise, which is useful, but it does not by itself produce new examples.
      • Knowing that an image has a low probability is not the same as being able to create new, high-probability examples.
      • The real value of generative models lies in their ability to create new instances that resemble examples from a given database, such as generating new images that appear real or creating additional 3D models for applications like gaming.
  • Variational Autoencoder (VAE)

    • VAEs are a powerful framework for this: they learn a probabilistic model $P(X)$ of the data distribution from which new instances can be sampled that resemble draws from the target distribution $P_{gt}(X)$.
    • Traditional generative modeling approaches often struggled due to:
      • Strong Assumptions: Some models require predefined structures in data which might not reflect its complexity.
      • Severe Approximations: These can lead to suboptimal models, failing to capture the true nature of the data.
      • Computational Expense: Classical methods like Markov Chain Monte Carlo can be too slow for practical applications.
    • VAEs address many of these issues:
      • They make minimal assumptions about the data structure.
      • They can effectively approximate the data distribution using neural networks without the heavy computational burden normally associated with generative models.
      • The approximations they introduce are generally small, allowing for effective training using fast techniques like back-propagation.

Latent Variable Models

  • A latent variable ($z$) represents hidden characteristics that influence the generation of observable data (ex. images).

    • It acts as an intermediate step that the model utilizes to condition the data it generates.
  • Example of Handwritten Characters:

    • When generating images of digits (0-9), the model first needs to decide which digit to create (e.g., 5 or 0).
    • This decision, represented by the latent variable $z$, ensures that the generated features of the digit are coherent and align with the chosen character.
    • (Hidden settings of the latent variable) Given an output character, the specific latent-variable settings that produced it are not observed; recovering them requires inference methods (like computer vision techniques) to understand which settings correspond to a specific output.
    • The latent variables’ values must effectively map to the data points in the dataset to ensure the generative model adequately represents the distribution of the data.
    • The process described focuses on how to ensure that a Variational Autoencoder (VAE) model can effectively represent the dataset.
    • It emphasizes the importance of having latent variables (denoted as $z$), which are sampled from a probability density function (PDF) $P(z)$.
    • The function $f(z; \theta)$, once its parameters $\theta$ are optimized, should map sampled values of $z$ to outputs that are similar to the data points $X$ in the original dataset.
    • The goal is to maximize the overall probability of generating the observed data, represented mathematically as:
      $$P(X) = \int P(X|z; \theta) P(z) dz$$
    • Here, $P(X|z; \theta)$ indicates the conditional probability of $X$ given the latent variable $z$ and the parameters $\theta$.
  • Gaussian Distribution in VAEs:
    $$P(X|z; \theta) = N(X|f(z; \theta), \sigma^2 I)$$

    • $f(z; \theta)$ is a deterministic function that defines the mean of the Gaussian output.
    • $\sigma^2 I$ describes the covariance, ensuring that samples generated from the model won't be identical to any specific training data point but will be akin to the overall dataset.
    • The key property is that $P(X|z)$ can be computed and varies smoothly with $\theta$, so gradient-based optimization (like stochastic gradient descent) can be applied effectively; a Gaussian is a convenient choice that satisfies this.
  • By using a Gaussian output, the model can create varied outputs since the samples drawn will not always be identical to the input data points $X$, which facilitates learning (a small sampling sketch appears at the end of this section).

  • Why Not a Dirac Delta Function?

    • If $P(X|z)$ were a Dirac delta function:
      • Each $z$ would deterministically produce a fixed $X$, making it impossible to explore variations around $X$.
      • This would limit the model's ability to learn and generate diverse examples, hindering its capacity to perform well on similar but unseen data.
  • (Figure note: the graphical model of this setup) The rectangle is “plate notation,” meaning that we can sample $z$ and $X$ $N$ times while the model parameters $\theta$ remain fixed.
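To make the generative process above concrete, here is a minimal sketch (assuming a hypothetical trained decoder `f`; the linear stand-in at the bottom exists only so the snippet runs): draw $z$ from the prior, compute the Gaussian mean $f(z; \theta)$, and add noise with covariance $\sigma^2 I$.

```python
import numpy as np

def sample_from_model(f, latent_dim, sigma=0.1, rng=None):
    """Draw one sample X ~ N(f(z), sigma^2 I) with z ~ P(z) = N(0, I).

    `f` stands for a (hypothetical) trained decoder f(z; theta) that maps a
    latent vector to the mean of the Gaussian output, e.g. a flattened image.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(latent_dim)                    # z ~ N(0, I)
    mean = f(z)                                            # f(z; theta)
    return mean + sigma * rng.standard_normal(mean.shape)  # add N(0, sigma^2 I) noise

# Toy usage with a stand-in linear "decoder", just so the sketch runs:
W = np.random.default_rng(1).standard_normal((784, 2)) * 0.1
X = sample_from_model(lambda z: W @ z, latent_dim=2)
```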

Variational Autoencoders

  • Variational Autoencoders (VAEs) utilize latent variables to model complex data distributions.
  • Unlike classical autoencoders, they do not learn to directly copy the input data; instead, they define a probability distribution over the data from which new examples can be sampled.
  • Latent Variables
    • In a VAE, latent variables $z$ represent underlying factors generating the observed data $X$. VAEs assume $z$ follows a simple distribution, typically $N(0, I)$ (a standard normal distribution), which allows them to learn complex mappings from $z$ to $X$ using a neural network.
  • Sampling and Likelihood
    • VAEs approximate the likelihood of data $P(X)$ through sampling from the latent space.
    • The naive strategy is to sample many $z$ values from the simple prior and average $P(X|z)$ (see the sketch after this list); because this needs far too many samples in high dimensions, VAEs instead maximize a tractable lower bound on the likelihood with stochastic gradient descent, avoiding computationally expensive methods like Markov Chain Monte Carlo.
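The naive Monte Carlo estimator mentioned above, as a toy sketch (again with a hypothetical decoder `f` and a Gaussian output): sample $z$ from the prior and average $P(X|z)$. For high-dimensional $X$, almost every prior sample gives a negligible $P(X|z)$, so this needs an enormous number of samples, which is what motivates introducing $Q(z|X)$ below.

```python
import numpy as np

def log_p_x_given_z(X, mean, sigma):
    """log N(X | mean, sigma^2 I) for a flattened data point X."""
    d = X.size
    return -0.5 * (np.sum((X - mean) ** 2) / sigma ** 2
                   + d * np.log(2 * np.pi * sigma ** 2))

def naive_log_px(X, f, latent_dim, n_samples=10_000, sigma=0.1, seed=0):
    """Monte Carlo estimate of log P(X) = log E_{z ~ N(0,I)}[P(X|z)]."""
    rng = np.random.default_rng(seed)
    log_ps = np.array([log_p_x_given_z(X, f(rng.standard_normal(latent_dim)), sigma)
                       for _ in range(n_samples)])
    m = log_ps.max()                                  # log-mean-exp for numerical stability
    return m + np.log(np.mean(np.exp(log_ps - m)))
```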

Setting and Objective

  • Objective of Sampling in VAEs:

    • The goal is to compute the likelihood $P(X)$ of the data $X$ based on latent variables $z$ that are likely to generate $X$.
  • Use of Function $Q(z|X)$

    • The function $Q(z|X)$ is introduced to provide a distribution of latent variables $z$ that are likely to produce a particular $X$.
    • This makes the process efficient, as you only need to focus on the $z$ values that contribute meaningfully to $P(X)$.
  • Kullback-Leibler Divergence $D$:

    • The KL divergence measures how one probability distribution diverges from a second expected probability distribution.
      $$D[Q(z) | P(z|X)] = E_{z \sim Q} [\log Q(z) - \log P(z|X)]$$
    • This quantifies the difference between $Q(z)$ and the true posterior $P(z|X)$.
  • Relation Between Expectations

    • Applying Bayes' rule, $\log P(z|X) = \log P(X|z) + \log P(z) - \log P(X)$; since $\log P(X)$ does not depend on $z$, it can be pulled out of the expectation, relating $P(X)$ and $P(X|z)$ (the full chain of steps is written out at the end of this section):
      $$D[Q(z) | P(z|X)] = E_{z \sim Q} [\log Q(z) - \log P(X|z) - \log P(z)] + \log P(X)$$
  • Maximizing the Log Probability

    • Rearranging, and letting $Q$ depend on $X$, gives the core VAE equation: the left-hand side is the quantity we want to maximize, $\log P(X)$, minus an error term $D[Q(z|X) | P(z|X)]$ that we want to drive to zero:
      $$\log P(X) - D[Q(z|X) | P(z|X)] = E_{z \sim Q} [\log P(X|z)] - D[Q(z|X) | P(z)]$$
    • This means that $Q(z|X)$ is ideally constructed to closely match $P(z|X)$, thereby allowing us to optimize the likelihood of observing $X$.
  • Convergence and Optimization

    • If $Q(z|X)$ can accurately approximate $P(z|X)$, the KL divergence becomes zero, and we can effectively optimize $P(X)$ directly.
    • The framework essentially introduces a method of variational inference, where one models the posterior $P(z|X)$ using a simpler distribution $Q(z|X)$.
  • In effect, VAEs replace an intractable likelihood computation with an optimization over a simpler distribution, enabling efficient representation learning.

    • The central mathematical tool used here is the Kullback-Leibler divergence, which guides the optimization of the model, ensuring that $Q(z|X)$ provides a distribution that effectively captures the characteristics of latent variables associated with observed data.
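For reference, the chain of algebra connecting the three equations above (a standard manipulation, written with $Q(z)$ before it is conditioned on $X$):

$$
\begin{aligned}
D[Q(z) | P(z|X)] &= E_{z \sim Q} [\log Q(z) - \log P(z|X)] \\
&= E_{z \sim Q} [\log Q(z) - \log P(X|z) - \log P(z)] + \log P(X) \\
\log P(X) - D[Q(z) | P(z|X)] &= E_{z \sim Q} [\log P(X|z)] - E_{z \sim Q} [\log Q(z) - \log P(z)] \\
&= E_{z \sim Q} [\log P(X|z)] - D[Q(z) | P(z)]
\end{aligned}
$$

The second line substitutes Bayes' rule $\log P(z|X) = \log P(X|z) + \log P(z) - \log P(X)$ and pulls the $z$-independent term $\log P(X)$ out of the expectation; setting $Q(z) = Q(z|X)$ then gives the equation used for training.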

Optimizing the objective function

  • (Figure note) A training-time variational autoencoder implemented as a feed-forward neural network, where $P(X|z)$ is Gaussian.

    • Left is without the “reparameterization trick”, and right is with it.
    • Red shows sampling operations that are non-differentiable.
    • Blue shows loss layers.
    • The feedforward behavior of these networks is identical, but backpropagation can be applied only to the right network.
  • The objective involves maximizing the likelihood of the data while minimizing the Kullback-Leibler divergence between two probability distributions: the approximate posterior $Q(z|X)$ and the prior $P(z)$.

  • The usual choice for $Q(z|X)$ is modeled as a multivariate Gaussian distribution:
    $$Q(z|X) = N(z|\mu(X; \theta), \Sigma(X; \theta))$$

    • where $\mu$ and $\Sigma$ are deterministic functions parameterized by $\theta$ that are learned from data; in practice $\Sigma$ is constrained to be a diagonal matrix.
    • This choice simplifies computation and facilitates the optimization process.
  • KL-divergence computation

    • The KL-divergence between two multivariate Gaussians $N(\mu_0, \Sigma_0)$ and $N(\mu_1, \Sigma_1)$ has a closed form:
      $$D[N(\mu_0, \Sigma_0) | N(\mu_1, \Sigma_1)] = \frac{1}{2} \left( \text{tr}(\Sigma_1^{-1} \Sigma_0) + (\mu_1 - \mu_0)^{\top} \Sigma_1^{-1} (\mu_1 - \mu_0) - k + \log\left(\frac{\det \Sigma_1}{\det \Sigma_0}\right) \right)$$
    • $k$ is the dimensionality of the distribution, and this closed form allows efficient evaluation and optimization of the divergence.
    • With $Q(z|X) = N(\mu(X), \Sigma(X))$ and $P(z) = N(0, I)$, this simplifies to:
      $$D[Q(z|X) | P(z)] = \frac{1}{2} \left( \text{tr}(\Sigma(X)) + \mu(X)^{\top} \mu(X) - k - \log \det \Sigma(X) \right)$$
  • Gradient computation and reparameterization trick:

    • To estimate the expected value $E_{z \sim Q}[\log P(X|z)]$, a sample $z$ is drawn from $Q(z|X)$.
    • Since this requires calculating $\log P(X|z)$ which depends on both $P$ and $Q$, care must be taken to ensure that sampling does not disrupt the gradient flow during backpropagation.
    • The solution is to employ the reparameterization trick, where instead of directly sampling from $Q(z|X)$, we express $z$ as:
      • $$z = \mu(X) + \Sigma^{1/2}(X) \cdot e$$ where $e \sim N(0, I)$.
        • $z$ is reparameterized into a form that allows gradients to flow through deterministic functions of $X$, enhancing training efficiency (see the training sketch after this list).
  • Final optimization equation

    • The overall equation we want to optimize becomes:
      $$E_{X \sim D} \left[ E_{z \sim Q} \left[ \log P(X|z) \right] - D[Q(z|X) | P(z)] \right]$$
    • This formulation captures the dual objective of maximizing the likelihood while minimizing the divergence, enabling effective optimization using stochastic gradient descent.
  • (Figure note) The testing-time variational “autoencoder,” which allows us to generate new samples. The “encoder” pathway is simply discarded.
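Putting the pieces of this section together, below is a minimal PyTorch sketch of the training-time objective (not the tutorial's code; the architecture, layer sizes, and the diagonal-covariance parameterization via `logvar` are assumptions). The encoder produces $\mu(X)$ and a diagonal $\Sigma(X)$, $z$ is drawn with the reparameterization trick, and the loss is the negative of the objective above: reconstruction error plus the closed-form KL to $N(0, I)$.

```python
import torch
import torch.nn as nn

class GaussianVAE(nn.Module):
    """Minimal encoder/decoder pair with a Gaussian P(X|z) = N(f(z), sigma^2 I)."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20, sigma=0.1):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)        # mu(X)
        self.logvar = nn.Linear(h_dim, z_dim)    # log of the diagonal of Sigma(X)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))   # f(z)
        self.sigma = sigma

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        e = torch.randn_like(mu)                 # e ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * e     # reparameterization: z = mu + Sigma^{1/2} e
        return self.dec(z), mu, logvar

def negative_elbo(x, x_mean, mu, logvar, sigma):
    # -E_{z~Q}[log P(X|z)] for the Gaussian decoder, up to an additive constant:
    recon = ((x - x_mean) ** 2).sum(dim=1) / (2 * sigma ** 2)
    # Closed-form D[Q(z|X) | P(z)] with diagonal Q(z|X) and P(z) = N(0, I):
    kl = 0.5 * (torch.exp(logvar) + mu ** 2 - 1.0 - logvar).sum(dim=1)
    return (recon + kl).mean()

# One optimization step on a dummy batch, just to show how the pieces connect:
model = GaussianVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)
x_mean, mu, logvar = model(x)
loss = negative_elbo(x, x_mean, mu, logvar, model.sigma)
opt.zero_grad(); loss.backward(); opt.step()
```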

Testing the learned model

  • Test-Time Process

    • During testing, the encoder is removed from the VAE architecture.
    • New samples are generated by feeding values drawn from the standard normal distribution $z \sim N(0, I)$ into the decoder (see the sketch at the end of this section).
  • Evaluating Probability

    • The probability $P(X)$ of a given sample is generally intractable to compute directly.
    • Because the KL divergence $D[Q(z|X) | P(z|X)]$ is non-negative, the right-hand side of the training objective is a lower bound on $\log P(X)$.
    • To approximate $P(X)$ itself, sampling $z$ from the approximate posterior $Q(z|X)$ gives an estimator that tends to converge much faster than sampling from the prior $N(0, I)$.
  • Usefulness of Lower Bound

    • This lower bound gives insights into how well the VAE model represents the training data by indicating how probable generated samples $X$ are under the learned model.
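A minimal sketch of the test-time procedure (the decoder here is an untrained stand-in; in practice it would be the trained decoder from the sketch above): the encoder pathway is never used, and prior samples are simply pushed through the decoder.

```python
import torch
import torch.nn as nn

# Stand-in decoder f(z); at test time this would be the trained decoder,
# while the encoder pathway is simply discarded.
decoder = nn.Sequential(nn.Linear(20, 400), nn.ReLU(), nn.Linear(400, 784))

with torch.no_grad():
    z = torch.randn(16, 20)      # 16 draws of z ~ N(0, I)
    x_mean = decoder(z)          # f(z): the means of P(X|z), typically displayed as the samples
```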

Interpreting the objective

  • The VAE framework seeks to maximize the likelihood of the data, $\log P(X)$, but it does so with some approximations, and understanding those approximations is crucial to understanding its performance.

  • The learning objective incorporates two key components:

    • $D[Q(z|X) | P(z|X)]$: This term is the Kullback-Leibler divergence which measures how well the approximating distribution $Q(z|X)$ aligns with the true posterior $P(z|X)$.
    • Optimizing this term, while necessary for making the model tractable and efficient, introduces some error; this error arises because the exact posterior is often complex and not easily computable.
    • $\log P(X)$: This represents the likelihood of the observed data under the model, and maximizing this ensures that the reconstructed samples closely resemble the original data points.
  • VAE's potential error comes from balancing these two terms:

    • If $Q(z|X)$ is an accurate approximation of $P(z|X)$, the divergence term becomes small, and the model performs well.
    • If not, larger divergences may indicate poor model performance, hence affecting the data generation capability of the VAE.
  • The relationship to information theory is through the concept of Minimum Description Length (MDL):

    • Lower values of the objective imply a more efficient coding of the data, meaning the model captures more essential information with fewer bits.
    • This helps in understanding the efficiencies gained through the variational inference framework, suggesting that the chosen approximating distribution $Q(z|X)$ must minimize additional overhead.
  • The tuning of $\sigma$ in $P(X|z)$ could be seen as a form of regularization, akin to parameters used in sparse autoencoders, which control the complexity of the model and prevent overfitting.

The error from $D[Q(z|X) | P(z|X)]$

  • $Q(z|X)$ is the approximate posterior distribution of the latent variable $z$ given the input $X$.
  • $P(z|X)$ is the true posterior distribution of $z$ given the input $X$.
  • Kullback-Leibler Divergence (KL Divergence)
    • The term $D[Q(z|X) | P(z|X)]$ represents the KL Divergence between these two distributions.
    • It quantifies how much information is lost when using the approximate distribution $Q(z|X)$ instead of the true distribution $P(z|X)$.
  • Convergence to True Distribution:
    • For the model's output distribution $P(X)$ to converge to the true distribution, $D[Q(z|X) | P(z|X)]$ must approach zero.
    • This means the approximate posterior $Q(z|X)$ needs to accurately represent the true posterior $P(z|X)$.
  • Challenges in Achieving Zero Divergence
    • The author highlights that simply having high capacity (i.e., complex) functions for $\mu(X)$ (mean) and $\Sigma(X)$ (covariance) does not guarantee that $Q(z|X)$ will resemble $P(z|X)$ closely enough to make the divergence zero.
    • The function $f$ modulating the relationship can greatly affect this outcome.
  • Existence of a Suitable Function
    • The text posits that there may exist a sufficiently flexible function $f$ that can ensure $P(z|X)$ is Gaussian for all $X$ while simultaneously maximizing the likelihood $\log P(X)$.
    • If such a function exists, it would facilitate minimizing the divergence $D[Q(z|X) | P(z|X)]$.
  • The author acknowledges that proving general results for all distributions remains an open problem, but notes that it's theoretically proven in some 1D cases.
    • A small output noise $\sigma$ in $P(X|z)$ can make modeling easier, though it may also lead to complications in gradient scaling during training.

Information-theoretic interpretation

  • Minimum Description Length Principle
    • This principle suggests that the best model is one that minimizes the number of bits needed to encode the data. In the context of VAEs, $-\log P(X)$ represents the total bits required for encoding data $X$ using an ideal encoding strategy.
  • Step 1 - Encoding Latent Variable $z$
    • Some bits are used to determine the latent variable $z$.
    • The KL-divergence $D[Q(z|X)||P(z)]$ quantifies the extra information needed to adjust from a prior distribution $P(z)$ (uninformative) to the posterior distribution $Q(z|X)$.
    • This measures how much information we gain about the latent variable $z$ when it is informed by the observed data $X$.
  • Step 2 - Decoding
    • The term $-\log P(X|z)$ measures the amount of information needed to reconstruct $X$ once $z$ has been determined.
  • The total cost of this two-step code is the sum of the bits used in both steps, $D[Q(z|X) | P(z)] + E_{z \sim Q}[-\log P(X|z)]$, which exceeds the ideal code length $-\log P(X)$ by exactly $D[Q(z|X) | P(z|X)]$ (see the identity after this list).
    • This penalty reflects the inefficiency of using the approximate posterior $Q(z|X)$ in place of the true posterior $P(z|X)$: a sub-optimal encoding costs excess bits.
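Rearranging the core equation from the "Setting and Objective" section makes this bit accounting explicit (quantities in nats; divide by $\ln 2$ for bits):

$$-\log P(X) = E_{z \sim Q} [-\log P(X|z)] + D[Q(z|X) | P(z)] - D[Q(z|X) | P(z|X)]$$

The first term is the cost of reconstructing $X$ from $z$, the second is the cost of encoding $z$, and the third (always $\geq 0$) is the coding inefficiency, so the two-step code can never beat the ideal $-\log P(X)$ and matches it exactly when $Q(z|X)$ equals the true posterior.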

VAEs and the regularization parameter

  • In a traditional sparse autoencoder, a regularization parameter $\lambda$ is used in the objective function, which can be represented as:
    $$L = \| \phi( \psi(X) ) - X \|^2 + \lambda \| \psi(X) \|_0$$

    • $\psi$ and $\phi$ are the encoder and decoder functions, respectively, and $\| \cdot \|_0$ is the L0 norm promoting sparsity in the encoding.
  • Unlike sparse autoencoders, VAEs typically do not have a separate explicit regularization parameter to tune.

    • This is advantageous as it reduces hyperparameter tuning for practitioners.
  • Absorption of Constants

    • One might think of introducing such a parameter by rescaling the latent variable, i.e. $z' = \lambda z$ so that $z' \sim N(0, \lambda^2 I)$, but this does not fundamentally change the model.
    • The model remains the same because the constant can be absorbed into the definitions of $P$ and $Q$:
      $$f'(z') = f(z'/\lambda), \quad \mu'(X) = \mu(X) \cdot \lambda, \quad \Sigma'(X) = \Sigma(X) \cdot \lambda^2$$
  • Output Distribution and Regularization Parameter

    • The output distribution for continuous data is typically Gaussian:
      $$P(X|z) = N(X|f(z), \sigma^2 I)$$
    • The log-probability can be expressed as:
      $$\log P(X|z) = C - \frac{1}{2} \frac{\| X - f(z) \|^2}{\sigma^2}$$
  • Here, $C$ is a constant. In this context, $\sigma$ acts like a regularization parameter that controls the balance between the two terms of the objective:

    • how closely the model must reconstruct the data vs. how closely $Q(z|X)$ must stay to the prior (illustrated in the sketch at the end of this section).
  • Binary vs. Continuous Inputs

    • If the output $X$ is binary (e.g., modeled with a Bernoulli distribution), there is no $\sigma$ and hence no such regularization parameter: both terms on the right-hand side are already measured in the same information units.
    • However, for continuous cases, we need a carefully chosen $\sigma$ to maintain finite information representation, which affects the expected accuracy of the model's reconstruction of data $X$.
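A tiny illustration of the point above (not from the tutorial; the function below is hypothetical): for a Gaussian decoder, the log-likelihood turns $1/(2\sigma^2)$ into an effective weight on the reconstruction term, which is exactly why $\sigma$ behaves like a regularization parameter.

```python
# With a Gaussian P(X|z) = N(f(z), sigma^2 I), the per-example loss is
#     ||X - f(z)||^2 / (2 * sigma^2) + D[Q(z|X) | P(z)]   (plus a constant),
# so choosing sigma is equivalent to weighting the reconstruction term:
def weighted_loss(squared_error, kl, sigma):
    return squared_error / (2 * sigma ** 2) + kl

# Small sigma: the reconstruction term dominates, so the model must fit X closely.
# Large sigma: the KL term dominates, pushing Q(z|X) toward the prior N(0, I).
```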

Examples: MNIST & VAE

  • The VAE is applied to the MNIST dataset, which consists of grayscale images of handwritten digits (0-9).

    • The values for each pixel are constrained between 0 and 1.
    • Instead of using pre-existing VAE architectures, the authors adapt the basic AutoEncoder example from Caffe, ensuring flexibility in implementation.
  • Loss Function

    • The authors mention using the Sigmoid Cross Entropy loss for the probability distribution $P(X|z)$ of the data given the latent variables $z$.
    • This loss function is appropriate since the MNIST pixel values are between 0 and 1.
    • They probabilistically define new data points as $X'$ sampled according to:
      $$X'_i \sim \text{Bernoulli}(X_i)$$
    • This means that each pixel value is treated as a Bernoulli trial, where $X_i$ is the actual observed value from the training set.
    • This binarization captures the uncertainty in the pixel representations (a minimal sketch of this setup appears at the end of this section).
  • Training Process

    • The model is trained once fully, but with multiple restarts to identify the optimal learning rate for minimizing the loss.
    • This indicates that achieving good performance does not heavily depend on the initial setup or deep structural modifications.
  • Generated Samples

    • The results from the VAE show that while many generated digits appear realistic, some samples fall in-between digits, exemplifying the VAE’s tendency to interpolate between classes rather than producing distinctly different outputs.
    • ex. Digits might look like a blend between '7' and '9'.
  • The dimensionality of the latent variable $z$ in VAEs appears to have varying impacts on model performance.

  • If the dimensionality is too low (ex. less than 4), the model struggles to capture the complexities in the data, leading to poor performance.

    • Specifically, a model with too few dimensions fails to adequately represent the variations present in the input data.
    • Conversely, increasing the dimensions of $z$ improves performance to a certain extent, but when the dimensionality is excessively high (ex. 10,000), it can lead to problems in effectively managing the training, especially during optimization with stochastic gradient descent.
    • This happens because the model has a harder time keeping the Kullback-Leibler divergence $D[Q(z|X) || P(z)]$ (essentially a measure of how far the approximate posterior is from the prior) low when $z$ has very many dimensions.
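To tie this example together, here is a minimal PyTorch sketch of the setup described (illustrative only; the tutorial itself adapts Caffe's autoencoder example, and the layer sizes, optimizer, and latent dimensionality here are assumptions). The decoder outputs per-pixel logits, sigmoid cross entropy plays the role of $-\log P(X|z)$ under the Bernoulli model, and the KL term is the same closed form as in the Gaussian sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MnistVAE(nn.Module):
    """Same structure as the earlier Gaussian sketch, but the decoder outputs
    per-pixel logits for a Bernoulli P(X|z)."""
    def __init__(self, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(784, 400), nn.ReLU())
        self.mu = nn.Linear(400, z_dim)
        self.logvar = nn.Linear(400, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 400), nn.ReLU(), nn.Linear(400, 784))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        return self.dec(z), mu, logvar                            # logits of P(X|z)

def loss_fn(x, logits, mu, logvar):
    # Sigmoid cross entropy plays the role of -log P(X|z): pixel values in [0, 1]
    # are treated as Bernoulli targets (the X'_i ~ Bernoulli(X_i) view above).
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction='sum') / x.size(0)
    # Same closed-form D[Q(z|X) | P(z)] with P(z) = N(0, I) as before:
    kl = 0.5 * (torch.exp(logvar) + mu ** 2 - 1.0 - logvar).sum(dim=1).mean()
    return recon + kl

# One training step on a dummy MNIST-shaped batch:
model = MnistVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                       # stand-in for MNIST images in [0, 1]
logits, mu, logvar = model(x)
loss = loss_fn(x, logits, mu, logvar)
opt.zero_grad(); loss.backward(); opt.step()

# Generation: discard the encoder, sample z ~ N(0, I), squash decoder output to [0, 1].
with torch.no_grad():
    samples = torch.sigmoid(model.dec(torch.randn(16, 20)))
```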
