diff --git a/README.md b/README.md
index 60f7b21..d4812b7 100644
--- a/README.md
+++ b/README.md
@@ -1,28 +1,32 @@
-# Notes on Deep Generative Models
+# CS236 Notes: Deep Generative Models

These notes form a concise introductory course on deep generative models. They are based on Stanford [CS236](https://deepgenerativemodels.github.io/), taught by [Aditya Grover](http://aditya-grover.github.io/) and [Stefano Ermon](http://cs.stanford.edu/~ermon/), and have been written by [Aditya Grover](http://aditya-grover.github.io/), with the [help](https://github.com/deepgenerativemodels/notes/commits/master) of many students and course staff.

-The compiled version is available [here](https://deepgenerativemodels.github.io/notes/index.html).
+The compiled notes are available [here](https://deepgenerativemodels.github.io/notes/index.html).

-## Contributing
+# Contributing

-This material is under construction! Although we have written up most of it, you will probably find several typos. If you do, please let us know, or submit a pull request with your fixes via Github.
+This material is under construction! Please help us resolve typos by submitting PRs to this repo.

+## Compilation

-The notes are written in Markdown and are compiled into HTML using Jekyll. Please add your changes directly to the Markdown source code. In order to install jekyll, you can follow the instructions posted on their website (https://jekyllrb.com/docs/installation/).
+The notes are written in Markdown and are compiled into HTML using Jekyll. Please add your changes directly to the Markdown source code. To install Jekyll, follow the instructions posted on its website (https://jekyllrb.com/docs/installation/).

-Note that jekyll is only supported on GNU/Linux, Unix, or macOS. Thus, if you run Windows 10 on your local machine, you will have to install Bash on Ubuntu on Windows. Windows gives instructions on how to do that here and Jekyll's website offers helpful instructions on how to proceed through the rest of the process.

+To compile Markdown to HTML, run the following commands from the root of your repo:

-To compile Markdown to HTML (i.e. after you have made changes to markdown and want them to be accessible to students viewing the docs),
-run the following commands from the root of your cloned version of the https://github.com/deepgenerativemodels/notes repo:

1) rm -r docs/
2) jekyll serve # This should create a folder called _site. Note: This creates a running server; press Ctrl-C to stop the server before proceeding
-3) mv _site docs # Change the name of the _site folder to "docs". This won't work if the server is still running.
-4) git add file_names
-5) git commit -am "your commit message describing what you did"
-6) git push origin master
+3) mv _site docs # Rename _site to "docs", the folder the compiled site is served from. This won't work if the server is still running.
+4) git add {...} # Add changed files here
+5) git commit -am "your commit message describing what you did"
+6) git push origin master

+## Notes on building the site on Windows

+Note that Jekyll is only supported on GNU/Linux, Unix, or macOS. Thus, if you run Windows 10 on your local machine, you will have to install Bash on Ubuntu on Windows. Windows gives instructions on how to do that here and Jekyll's website offers helpful instructions on how to proceed through the rest of the process.

+## Notes on Github permissions

-Note that if you cloned the ermongroup/cs228-notes repo directly onto your local machine (instead of forking it) then you may see an error like "remote: Permission to ermongroup/cs228-notes.git denied to userjanedoe".
If that is the case, then you need to fork their repo first. Then, if your github profile were userjanedoe, you would need to first push your local updates to your forked repo like so:

+Note that if you cloned the ermongroup/cs228-notes repo directly you may see an error like "remote: Permission to ermongroup/cs228-notes.git denied to userjanedoe". If that is the case, then you need to fork this repo first. Then, if your github profile were userjanedoe, you would need to first push your local updates to your forked repo like so:

git push https://github.com/userjanedoe/notes.git master

diff --git a/autoregressive/index.md b/autoregressive/index.md
index 9e01c9f..94be915 100644
--- a/autoregressive/index.md
+++ b/autoregressive/index.md
@@ -43,7 +43,7 @@
where $$\theta_i$$ denotes the set of parameters used to specify the mean function $$f_i: \{0,1\}^{i-1}\rightarrow [0,1]$$.
-The number of parameters of an autoregressive generative model are given by $$\sum_{i=1}^n \vert \theta_i \vert$$. As we shall see in the examples below, the number of parameters are much fewer than the tabular setting considered previously. Unlike the tabular setting however, an autoregressive generative model cannot represent all possible distributions. Its expressiveness is limited by the fact that we are limiting the conditional distributions to correspond to a Bernoulli random variable with the mean specified via a restricted class of parameterized functions.
+The number of parameters of an autoregressive generative model is given by $$\sum_{i=1}^n \vert \theta_i \vert$$. As we shall see in the examples below, the number of parameters is much smaller than in the tabular setting considered previously. Unlike the tabular setting, however, an autoregressive generative model cannot represent all possible distributions. Its expressiveness is limited by the fact that we restrict the conditional distributions to Bernoulli random variables whose means are specified via a restricted class of parameterized functions.
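To make these counts concrete, the following small Python sketch (an illustration added here, not code from the course) compares the tabular and logistic autoregressive parameter counts:

```python
# Illustrative comparison (not from the course materials): parameter counts
# for a full tabular model versus a logistic autoregressive model over n
# binary variables.

def tabular_params(n: int) -> int:
    # A full joint probability table over n binary variables has
    # 2^n - 1 free parameters.
    return 2 ** n - 1

def logistic_autoregressive_params(n: int) -> int:
    # Conditional i uses i parameters (a bias plus i - 1 weights), so the
    # total is sum_{i=1}^{n} i = n(n + 1) / 2 = O(n^2).
    return n * (n + 1) // 2

for n in [10, 20, 784]:  # 784 = 28 x 28, e.g., binarized MNIST-sized images
    print(n, tabular_params(n), logistic_autoregressive_params(n))
```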
drawing

@@ -59,7 +59,7 @@ f_i(x_1, x_2, \ldots, x_{i-1}) = \sigma(\alpha^{(i)}_0 + \alpha^{(i)}_1 x_1 + \ldots + \alpha^{(i)}_{i-1} x_{i-1})

where $$\sigma$$ denotes the sigmoid function and $$\theta_i=\{\alpha^{(i)}_0,\alpha^{(i)}_1, \ldots, \alpha^{(i)}_{i-1}\}$$ denotes the parameters of the mean function. The conditional for variable $$i$$ requires $$i$$ parameters, and hence the total number of parameters in the model is given by $$\sum_{i=1}^n i = O(n^2)$$. Note that the number of parameters is much smaller than the exponential complexity of the tabular case.

-A natural way to increase the expressiveness of an autoregressive generative model is to use more flexible parameterizations for the mean function e.g., multi-layer perceptrons (MLP). For example, consider the case of a neural network with 1 hidden layer. The mean function for variable $$i$$ can be expressed as
+A natural way to increase the expressiveness of an autoregressive generative model is to use more flexible parameterizations for the mean function, e.g., multi-layer perceptrons (MLPs). For example, consider the case of a neural network with one hidden layer. The mean function for variable $$i$$ can be expressed as

{% math %}
\mathbf{h}_i = \sigma(A_i \mathbf{x_{< i}} + \mathbf{c}_i)\\

@@ -105,14 +105,14 @@ Notice that NADE requires specifying a single, fixed ordering of the variables.

Learning and inference
======================

-Recall that learning a generative model involves optimizing the closeness between the data and model distributions. One commonly used notion of closeness in the KL divergence between the data and the model distributions.
+Recall that learning a generative model involves optimizing the closeness between the data and model distributions. One commonly used notion of closeness is the KL divergence between the data and the model distributions.

{% math %}
\min_{\theta\in \mathcal{M}}d_{KL} (p_{\mathrm{data}}, p_{\theta}) = \mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}} }\left[\log p_{\mathrm{data}}(\mathbf{x}) - \log p_{\theta}(\mathbf{x})\right]
{% endmath %}

-Before moving any further, we make two comments about the KL divergence. First, we note that the KL divergence between any two distributions is asymmetric. As we navigate through this chapter, the reader is encouraged to think what could go wrong if we decided to optimize the reverse KL divergence instead. Secondly, the KL divergences heavily penalizes any model distribution $$p_\theta$$ which assigns low probability to a datapoint that is likely to be sampled under $$p_{\mathrm{data}}$$. In the extreme case, if the density $$p_\theta(\mathbf{x})$$ evaluates to zero for a datapoint sampled from $$p_{\mathrm{data}}$$, the objective evaluates to $$+\infty$$.
+Before moving any further, we make two comments about the KL divergence. First, we note that the KL divergence between any two distributions is asymmetric. As we navigate through this chapter, the reader is encouraged to think about what could go wrong if we decided to optimize the reverse KL divergence instead. Secondly, the KL divergence heavily penalizes any model distribution $$p_\theta$$ which assigns a low probability to a datapoint that is likely to be sampled under $$p_{\mathrm{data}}$$. In the extreme case, if the density $$p_\theta(\mathbf{x})$$ evaluates to zero for a datapoint sampled from $$p_{\mathrm{data}}$$, the objective evaluates to $$+\infty$$.

Since $$p_{\mathrm{data}}$$ does not depend on $$\theta$$, we can equivalently recover the optimal parameters via maximum likelihood estimation.
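As a concrete instance of the logistic parameterization above, here is a minimal NumPy sketch (an added illustration with randomly initialized parameters, not the course's reference code) that evaluates $$\log p_\theta(\mathbf{x}) = \sum_i \log p_{\theta_i}(x_i \vert \mathbf{x}_{< i})$$:

```python
import numpy as np

# Minimal sketch of a fully-visible sigmoid autoregressive model over n
# binary variables: p(x) = prod_i Bern(x_i | f_i(x_{<i})) with a logistic
# mean function. Parameters are random placeholders for illustration.

rng = np.random.default_rng(0)
n = 5
# alpha[i] holds the i + 1 parameters of conditional i: a bias and i weights.
alpha = [rng.normal(size=i + 1) for i in range(n)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(x):
    """Sum of Bernoulli log-conditionals log p(x_i | x_{<i})."""
    ll = 0.0
    for i in range(n):
        mean = sigmoid(alpha[i][0] + alpha[i][1:] @ x[:i])
        ll += x[i] * np.log(mean) + (1 - x[i]) * np.log(1 - mean)
    return ll

x = np.array([1, 0, 1, 1, 0])
print(log_likelihood(x))
```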
@@ -138,7 +138,7 @@ In practice, we optimize the MLE objective using mini-batch gradient ascent. The

where $$\theta^{(t+1)}$$ and $$\theta^{(t)}$$ are the parameters at iterations $$t+1$$ and $$t$$ respectively, and $$r_t$$ is the learning rate at iteration $$t$$. Typically, we only specify the initial learning rate $$r_1$$ and update the rate based on a schedule. [Variants](http://cs231n.github.io/optimization-1/) of stochastic gradient ascent, such as RMSprop and Adam, employ modified update rules that work slightly better in practice.

-From a practical standpoint, we must think about how to choose hyperparameters (such as the initial learning rate) and a stopping criteria for the gradient descent. For both these questions, we follow the standard practice in machine learning of monitoring the objective on a validation dataset. Consequently, we choose the hyperparameters with the best performance on the validation dataset and stop updating the parameters when the validation log-likelihoods cease to improve[^1].
+From a practical standpoint, we must think about how to choose hyperparameters (such as the initial learning rate) and a stopping criterion for gradient ascent. For both of these questions, we follow the standard practice in machine learning of monitoring the objective on a validation dataset. Consequently, we choose the hyperparameters with the best performance on the validation dataset and stop updating the parameters when the validation log-likelihoods cease to improve[^1].

Now that we have a well-defined objective and optimization procedure, the only remaining task is to evaluate the objective in the context of an autoregressive generative model. To this end, we substitute the factorized joint distribution of an autoregressive model in the MLE objective to get

@@ -149,13 +149,13 @@ Now that we have a well-defined objective and optimization procedure, the only r

where $$\theta = \{\theta_1, \theta_2, \ldots, \theta_n\}$$ now denotes the collective set of parameters for the conditionals.

-Inference in an autoregressive model is straightforward. For density estimation of an arbitrary point $$\mathbf{x}$$, we simply evaluate the log-conditionals $$\log p_{\theta_i}(x_i \vert \mathbf{x}_{< i})$$ for each $$i$$ and add these up to obtain the log-likelihood assigned by the model to $$\mathbf{x}$$. Since we know conditioning vector $$\mathbf{x}$$, each of the conditionals can be evaluated in parallel. Hence, density estimation is efficient on modern hardware.
+Inference in an autoregressive model is straightforward. For density estimation of an arbitrary point $$\mathbf{x}$$, we simply evaluate the log-conditionals $$\log p_{\theta_i}(x_i \vert \mathbf{x}_{< i})$$ for each $$i$$ and add these up to obtain the log-likelihood assigned by the model to $$\mathbf{x}$$. Since we know the conditioning vector $$\mathbf{x}$$, each of the conditionals can be evaluated in parallel. Hence, density estimation is efficient on modern hardware.

Sampling from an autoregressive model is a sequential procedure. Here, we first sample $$x_1$$, then we sample $$x_2$$ conditioned on the sampled $$x_1$$, followed by $$x_3$$ conditioned on both $$x_1$$ and $$x_2$$ and so on until we sample $$x_n$$ conditioned on the previously sampled $$\mathbf{x}_{< n}$$. For applications requiring real-time generation of high-dimensional data such as audio synthesis, the sequential sampling can be an expensive process.
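For contrast with the parallel density evaluation, here is the corresponding sequential sampling loop on the same toy logistic model (again an added illustration); each draw depends on all previous draws, which is exactly what makes high-dimensional generation slow:

```python
import numpy as np

# Sequential (ancestral) sampling sketch: draw x_1, then x_2 | x_1, and so
# on. The loop is inherently serial. Parameters are random placeholders.

rng = np.random.default_rng(0)
n = 5
alpha = [rng.normal(size=i + 1) for i in range(n)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample():
    x = np.zeros(n)
    for i in range(n):
        mean = sigmoid(alpha[i][0] + alpha[i][1:] @ x[:i])
        x[i] = rng.binomial(1, mean)  # draw x_i | x_{<i}
    return x

print(sample())
```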
Later in this course, we will discuss how Parallel WaveNet, an autoregressive model, sidesteps this expensive sampling process.

-Finally, an autoregressive model does not directly learn unsupervised representations of the data. In the next few set of lectures, we will look at latent variable models (e.g., variational autoencoders) which explicitly learn latent representations of the data.
+Finally, an autoregressive model does not directly learn unsupervised representations of the data. In the next few lectures, we will look at latent variable models (e.g., variational autoencoders) which explicitly learn latent representations of the data.

-

Finally, an autoregressive model does not directly learn unsupervised representations of the data. In the next few set of lectures, we will look at latent variable models (e.g., variational autoencoders) which explicitly learn latent representations of the data.

+

Finally, an autoregressive model does not directly learn unsupervised representations of the data. In the next few lectures, we will look at latent variable models (e.g., variational autoencoders) which explicitly learn latent representations of the data.

-

The generator and discriminator both play a two player minimax game, where the generator minimizes a two-sample test objective () and the discriminator maximizes the objective (). Intuitively, the generator tries to fool the discriminator to the best of its ability by generating samples that look indisginguishable from .

+

The generator and discriminator play a two-player minimax game, where the generator minimizes a two-sample test objective ($$p_{\mathrm{data}} = p_\theta$$) and the discriminator maximizes the objective ($$p_{\mathrm{data}} \neq p_\theta$$). Intuitively, the generator tries to fool the discriminator to the best of its ability by generating samples that look indistinguishable from $$p_{\mathrm{data}}$$.

Formally, the GAN objective can be written as:

\min_\theta \max_\phi V(G_\theta, D_\phi) = \mathbb{E}_{\mathbf{x} \sim p_{\textrm{data}}}\left[\log D_\phi(\mathbf{x})\right] + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}\left[\log \left(1 - D_\phi(G_\theta(\mathbf{z}))\right)\right]
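To make the alternating optimization concrete, here is a minimal PyTorch sketch of one discriminator update followed by one generator update (an added illustration: the architectures and data are placeholders, and it uses the common non-saturating generator loss rather than the exact minimax form above):

```python
import torch
import torch.nn as nn

# Minimal GAN update sketch (illustrative; placeholder networks and data).
G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))  # generator G_theta
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator D_phi (logits)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

x_real = torch.randn(64, 2)  # stand-in for a mini-batch from p_data
z = torch.randn(64, 16)      # latent noise z ~ p(z)

# Discriminator step: push D(x_real) toward 1 and D(G(z)) toward 0.
d_loss = bce(D(x_real), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step (non-saturating loss): push D(G(z)) toward 1 to fool D.
g_loss = bce(D(G(z)), torch.ones(64, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```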

@@ -125,7 +125,7 @@

GAN Objective

D_{\textrm{JSD}}[p, q] = \frac{1}{2} \left( D_{\textrm{KL}}\left[p, \frac{p+q}{2} \right] + D_{\textrm{KL}}\left[q, \frac{p+q}{2} \right] \right) -

The JSD satisfies all properties of the KL, and has the additional perk that . With this distance metric, the optimal generator for the GAN objective becomces , and the optimal objective value that we can achieve with optimal generators and discriminators and is .

+

The JSD satisfies all properties of the KL, and has the additional perk that $$D_{\textrm{JSD}}[p, q] = D_{\textrm{JSD}}[q, p]$$. With this distance metric, the optimal generator for the GAN objective becomes $$p_G = p_{\textrm{data}}$$, and the optimal objective value that we can achieve with optimal generators and discriminators $$G^*(\cdot)$$ and $$D^*_{G^*}(\mathbf{x})$$ is $$-\log 4$$.
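A quick NumPy check of these properties (an added illustration on discrete distributions):

```python
import numpy as np

def kl(p, q):
    # KL divergence between discrete distributions, in nats.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    # Jensen-Shannon divergence: symmetric, and bounded above by log 2.
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.9, 0.1, 0.0])
q = np.array([0.0, 0.1, 0.9])
print(jsd(p, q), jsd(q, p))  # equal, unlike kl(p, q) and kl(q, p)
print(jsd(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # log 2 for disjoint supports
```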

GAN training algorithm

@@ -149,13 +149,13 @@

GAN training algorithm

Challenges

-

Although GANs have been successfully applied to several domains and tasks, working with them in practice is challenging because of their: (1) unstable optimization procedure, (2) potential for mode collapse, (3) difficulty in evaluation.

+

Although GANs have been successfully applied to several domains and tasks, working with them in practice is challenging because of (1) their unstable optimization procedure, (2) the potential for mode collapse, and (3) the difficulty of performance evaluation.

-

During optimization, the generator and discriminator loss often continue to oscillate without converging to a clear stopping point. Due to the lack of a robust stopping criteria, it is difficult to know when exactly the GAN has finished training. Additionally, the generator of a GAN can often get stuck producing one of a few types of samples over and over again (mode collapse). Most fixes to these challenges are empirically driven, and there has been a significant amount of work put into developing new architectures, regularization schemes, and noise perturbations in an attempt to circumvent these issues. Soumith Chintala has a nice link outlining various tricks of the trade to stabilize GAN training.

+

During optimization, the generator and discriminator losses often continue to oscillate without converging to a definite stopping point. Due to the lack of a robust stopping criterion, it is difficult to know when exactly the GAN has finished training. Additionally, the generator of a GAN can often get stuck producing one of a few types of samples over and over again (mode collapse). Most fixes to these challenges are empirically driven, and there has been a significant amount of work put into developing new architectures, regularization schemes, and noise perturbations in an attempt to circumvent these issues. Soumith Chintala has a nice link outlining various tricks of the trade to stabilize GAN training.

Selected GANs

-

Next, we focus our attention to a few select types of GAN architectures and explore them in more detail.

+

Next, we focus our attention on a few select types of GAN architectures and explore them in more detail.

f-GAN

The f-GAN optimizes the variant of the two-sample test objective that we have discussed so far, but using a very general notion of distance: the $$f$$-divergence. Given two densities $$p$$ and $$q$$, the $$f$$-divergence can be written as:

D_f(p, q) = \mathbb{E}_{\mathbf{x} \sim q}\left[f\left(\frac{p(\mathbf{x})}{q(\mathbf{x})}\right)\right]

where $$f$$ is any convex, lower semicontinuous function with $$f(1) = 0$$.

@@ -177,7 +177,7 @@

f-GAN

\min_\theta \max_\phi F(\theta,\phi) = \mathbb{E}_{x \sim p_{\textrm{data}}}[T_\phi(\mathbf{x})] - \mathbb{E}_{x \sim p_{G_\theta}}[f^*(T_\phi(\mathbf{x}))] -

Intuitively, we can think about this objective as the generator trying to minimize the divergence estimate, while the discriminator tries to tighten the lower bound.

+

Intuitively, we can think about this objective as the generator trying to minimize the divergence estimate, while the discriminator attempts to tighten the lower bound.
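As a toy numerical check of this bound (an added illustration: we choose $$f$$ so that the divergence is the KL, and plug in the closed-form optimal critic for two Gaussians instead of a learned discriminator):

```python
import numpy as np

# Monte Carlo estimate of the f-GAN lower bound E_p[T(x)] - E_q[f*(T(x))]
# for the KL divergence, where f(u) = u log u and f*(t) = exp(t - 1).
# Instead of learning T_phi, we use the optimal critic T(x) = 1 + log(p/q).

rng = np.random.default_rng(0)

def log_normal(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

mu_p, mu_q, sigma = 0.0, 1.0, 1.0
xs_p = rng.normal(mu_p, sigma, 100_000)
xs_q = rng.normal(mu_q, sigma, 100_000)

def T(x):  # optimal critic for the KL case
    return 1.0 + log_normal(x, mu_p, sigma) - log_normal(x, mu_q, sigma)

bound = T(xs_p).mean() - np.exp(T(xs_q) - 1.0).mean()
print(bound)  # approaches KL(p || q) = (mu_p - mu_q)^2 / 2 = 0.5
```

With a suboptimal critic the same expression evaluates to something strictly smaller, which is why the discriminator's job is to tighten the bound.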

BiGAN

We won’t worry too much about the BiGAN in these notes. However, we can think about this model as one that allows us to infer latent representations even within a GAN framework.

diff --git a/docs/introduction/index.html b/docs/introduction/index.html index f0f5990..53241c5 100644 --- a/docs/introduction/index.html +++ b/docs/introduction/index.html @@ -80,8 +80,7 @@

Introduction

Intelligent agents are constantly generating, acquiring, and processing data. This data could be in the form of images that we capture on our phones, text messages we share with our friends, graphs that model -interactions on social media, videos that record important events, -etc. Natural agents excel at discovering patterns, extracting +interactions on social media, or videos that record important events. Natural agents excel at discovering patterns, extracting knowledge, and performing complex reasoning based on the data they observe. How can we build artificial learning systems to do the same?

@@ -91,7 +90,7 @@

Introduction

underlying distribution, say . At its very core, the goal of any generative model is then to approximate this data distribution given access to the dataset . The hope is that -if we are able to learn a good generative model, we can use the +if we can learn a good generative model, we can use the learned model for downstream inference.

Learning

@@ -105,7 +104,7 @@

Learning

In the parametric setting, we can think of the task of learning a generative model as picking the parameters within a family of model distributions that minimizes some notion of distance1 between the -model distribution and the data distribution.

+model distribution and data distribution.

drawing

@@ -114,7 +113,7 @@

Learning

![goal](learning_2.png =100x20) -->

For instance, we might be given access to a dataset of dog images and -our goal is to learn the paraemeters of a generative model within a model family such that +our goal is to learn the parameters of a generative model within a model family such that the model distribution is close to the data distribution over dogs . Mathematically, we can specify our goal as the following optimization problem: \begin{equation} @@ -130,17 +129,17 @@

Learning

Each pixel has three channels: R(ed), G(reen) and B(lue) and each channel can take a value between 0 and 255. Hence, the number of possible images is given by . -In contrast, Imagenet, one of the largest publicly available datasets, +In contrast, ImageNet, one of the largest publicly available datasets, consists of only about 15 million images. Hence, learning a generative model with such a limited dataset is a highly underdetermined problem.
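A one-line computation makes the gap vivid (an added illustration that assumes 64x64 RGB images; the resolution is a placeholder, not the figure used in the original text):

```python
# Hypothetical 64x64 RGB images: each of the H * W pixels has three channels
# with 256 possible values, giving 256^(3 * H * W) distinct images.
H = W = 64
num_images = 256 ** (3 * H * W)
print(len(str(num_images)))  # prints 29593: a count with ~30,000 decimal
                             # digits, dwarfing ImageNet's ~15 million images
```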

Fortunately, the real world is highly structured and automatically discovering the underlying structure is key to learning generative models. For example, we can hope to learn some basic artifacts about -dogs even with just a few images: two eyes, two ears, fur etc. Instead +dogs even with just a few images: two eyes, two ears, fur, etc. Instead of incorporating this prior knowledge explicitly, we will hope the model learns the underlying structure directly from data. There is no free -lunch however, and indeed successful learning of generative models will +lunch, however, and indeed successful learning of generative models will involve instantiating the optimization problem in in a suitable way. In this course, we will be primarily interested in the following questions:

@@ -151,9 +150,9 @@

Learning

  • What is the optimization procedure for minimizing ?
  • -

    In the next few set of lectures, we will take a deeper dive into certain +

    In the next few lectures, we will take a deeper dive into certain families of generative models. For each model family, we will note how -the representation is closely tied with the choice of learning objective +the representation relates to the choice of learning objective and the optimization procedure.

    Inference

    @@ -165,7 +164,7 @@

    Inference

While the range of applications to which generative models have been applied continues to grow, we can identify three fundamental inference -queries for evaluating a generative model.:

    +queries for evaluating a generative model:

1.

diff --git a/docs/vae/index.html b/docs/vae/index.html
index c9d2f14..e5544e3 100644
--- a/docs/vae/index.html
+++ b/docs/vae/index.html
@@ -170,11 +170,11 @@

      Learning Directed Latent Varia \log p(\bx) \approx \log \frac{1}{k} \sum_{i=1}^k p(\bx \vert \bz^{(i)}) \text{, where } \bz^{(i)} \sim p(\bz) -

      In practice however, optimizing the above estimate suffers from high variance in gradient estimates.

      +

      In practice, however, optimizing the above estimate suffers from high variance in gradient estimates.
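Here is an added toy illustration of this estimator on a model whose exact marginal likelihood is known ($$z \sim \mathcal{N}(0,1)$$, $$x \vert z \sim \mathcal{N}(z,1)$$, so $$x \sim \mathcal{N}(0,2)$$); the high variance shows up already in the estimate itself, which is noisy and biased low for small $$k$$:

```python
import numpy as np

# Naive Monte Carlo estimate log p(x) ~= log (1/k) sum_i p(x | z_i),
# z_i ~ p(z), on a toy linear Gaussian model (illustrative sketch only).

rng = np.random.default_rng(0)

def log_normal(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

x = 3.0
for k in [10, 100, 10_000]:
    z = rng.normal(0.0, 1.0, size=k)  # z^(i) ~ p(z)
    est = np.log(np.mean(np.exp(log_normal(x, z, 1.0))))
    print(k, est)

print("true", log_normal(x, 0.0, 2.0))  # exact log N(3; 0, 2)
```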

Rather than maximizing the log-likelihood directly, an alternative is to construct a lower bound that is more amenable to optimization. To do so, we note that evaluating the marginal likelihood is at least as difficult as evaluating the posterior for any latent vector since by definition .

      -

      Next, we introduce a variational family of distributions that approximate the true, but intractable posterior . Further henceforth, we will assume a parameteric setting where any distribution in the model family is specified via a set of parameters and distributions in the variational family are specified via a set of parameters .

      +

Next, we introduce a variational family of distributions that approximate the true, but intractable posterior . Henceforth, we will assume a parametric setting where any distribution in the model family is specified via a set of parameters and distributions in the variational family are specified via a set of parameters .

Given and , we note that the following relationships hold true[^1] for any and all variational distributions
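Assuming the standard notation of model parameters $$\theta$$ and variational parameters $$\lambda$$ (the specific symbols here are an assumption), the relationship takes the usual form:

\log p_\theta(\mathbf{x}) = \mathbb{E}_{q_\lambda(\mathbf{z})}\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\lambda(\mathbf{z})}\right] + D_{\textrm{KL}}\left(q_\lambda(\mathbf{z}) \,\Vert\, p_\theta(\mathbf{z} \vert \mathbf{x})\right) \geq \mathbb{E}_{q_\lambda(\mathbf{z})}\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\lambda(\mathbf{z})}\right]

Since the KL term is non-negative, the right-hand side, the evidence lower bound (ELBO), lower-bounds the log marginal likelihood, with equality exactly when $$q_\lambda(\mathbf{z}) = p_\theta(\mathbf{z} \vert \mathbf{x})$$.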

      @@ -310,7 +310,7 @@

      Parameterizing Di

      Amortized Variational Inference

      -

      A noticable limitation of black-box variational inference is that Step 1 executes an optimization subroutine that is computationally expensive. Recall that the goal of the Step 1 is to find

      +

A noticeable limitation of black-box variational inference is that Step 1 executes an optimization subroutine that is computationally expensive. Recall that the goal of Step 1 is to find

      -

      It is also worth noting that optimizing over the entire dataset as a subroutine everytime we sample a new mini-batch is clearly not reasonable. However, if we believe that is capable of quickly adapting to a close-enough approximation of given the current choice of , then we can interleave the optimization and . The yields the following procedure, where for each mini-batch , we perform the following two updates jointly

      +

      It is also worth noting that optimizing over the entire dataset as a subroutine every time we sample a new mini-batch is clearly not reasonable. However, if we believe that is capable of quickly adapting to a close-enough approximation of given the current choice of , then we can interleave the optimization and . This yields the following procedure, where for each mini-batch , we perform the following two updates jointly
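A PyTorch-flavored sketch of these two joint updates (an added illustration under standard amortized-VI assumptions; the encoder and decoder architectures, the Gaussian likelihood, and all dimensions are placeholders):

```python
import torch
import torch.nn as nn

# One joint per-mini-batch update in amortized VI: ascend the mini-batch
# ELBO in both the model parameters (decoder) and the shared inference
# parameters (encoder) at the same time.

d_x, d_z = 8, 2
decoder = nn.Linear(d_z, d_x)      # defines p(x|z), here a Gaussian mean
encoder = nn.Linear(d_x, 2 * d_z)  # defines q(z|x): mean and log-variance
opt = torch.optim.Adam(list(decoder.parameters()) + list(encoder.parameters()), lr=1e-3)

x = torch.randn(32, d_x)  # stand-in mini-batch

mu, log_var = encoder(x).chunk(2, dim=-1)
z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterized sample
recon = decoder(z)

# ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)) with a standard normal prior.
log_px_z = -0.5 * ((x - recon) ** 2).sum(dim=-1)  # unit-variance Gaussian likelihood (up to a constant)
kl = 0.5 * (mu**2 + log_var.exp() - 1 - log_var).sum(dim=-1)
loss = -(log_px_z - kl).mean()

opt.zero_grad()
loss.backward()
opt.step()  # a single step updates the model and inference network jointly
```

The key point is that one gradient step moves the model and the shared inference network together, rather than re-solving the inner variational optimization from scratch for every mini-batch.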