diff --git a/README.md b/README.md
index 60f7b21..d4812b7 100644
--- a/README.md
+++ b/README.md
@@ -1,28 +1,32 @@
-# Notes on Deep Generative Models
+# CS236 Notes: Deep Generative Models

These notes form a concise introductory course on deep generative models. They are based on Stanford [CS236](https://deepgenerativemodels.github.io/), taught by [Aditya Grover](http://aditya-grover.github.io/) and [Stefano Ermon](http://cs.stanford.edu/~ermon/), and have been written by [Aditya Grover](http://aditya-grover.github.io/), with the [help](https://github.com/deepgenerativemodels/notes/commits/master) of many students and course staff.

-The compiled version is available [here](https://deepgenerativemodels.github.io/notes/index.html).
+The compiled notes are available [here](https://deepgenerativemodels.github.io/notes/index.html).

-## Contributing
+# Contributing

-This material is under construction! Although we have written up most of it, you will probably find several typos. If you do, please let us know, or submit a pull request with your fixes via Github.
+This material is under construction! Please help us resolve typos by submitting PRs to this repo.

+## Compilation

-The notes are written in Markdown and are compiled into HTML using Jekyll. Please add your changes directly to the Markdown source code. In order to install jekyll, you can follow the instructions posted on their website (https://jekyllrb.com/docs/installation/).
+The notes are written in Markdown and are compiled into HTML using Jekyll. Please add your changes directly to the Markdown source code. To install Jekyll, follow the instructions posted on its website (https://jekyllrb.com/docs/installation/).

-Note that jekyll is only supported on GNU/Linux, Unix, or macOS. Thus, if you run Windows 10 on your local machine, you will have to install Bash on Ubuntu on Windows. Windows gives instructions on how to do that here and Jekyll's website offers helpful instructions on how to proceed through the rest of the process.

+To compile Markdown to HTML, run the following commands from the root of your repo:

-To compile Markdown to HTML (i.e. after you have made changes to markdown and want them to be accessible to students viewing the docs),
-run the following commands from the root of your cloned version of the https://github.com/deepgenerativemodels/notes repo:

1) rm -r docs/
2) jekyll serve # This should create a folder called _site. Note: This creates a running server; press Ctrl-C to stop the server before proceeding
-3) mv _site docs # Change the name of the _site folder to "docs". This won't work if the server is still running.
-4) git add file_names
-5) git commit -am "your commit message describing what you did"
-6) git push origin master
+3) mv _site docs # Rename _site to "docs", the folder the compiled site is served from. This won't work if the server is still running.
+4) git add {...} # Add changed files here
+5) git commit -am "your commit message describing what you did"
+6) git push origin master

+## Notes on building the site on Windows

+Note that Jekyll is only supported on GNU/Linux, Unix, or macOS. Thus, if you run Windows 10 on your local machine, you will have to install Bash on Ubuntu on Windows. Windows gives instructions on how to do that here and Jekyll's website offers helpful instructions on how to proceed through the rest of the process.

+## Notes on Github permissions

-Note that if you cloned the ermongroup/cs228-notes repo directly onto your local machine (instead of forking it) then you may see an error like "remote: Permission to ermongroup/cs228-notes.git denied to userjanedoe".
If that is the case, then you need to fork their repo first. Then, if your github profile were userjanedoe, you would need to first push your local updates to your forked repo like so:

+Note that if you cloned the ermongroup/cs228-notes repo directly you may see an error like "remote: Permission to ermongroup/cs228-notes.git denied to userjanedoe". If that is the case, then you need to fork this repo first. Then, if your github profile were userjanedoe, you would need to first push your local updates to your forked repo like so:

git push https://github.com/userjanedoe/notes.git master

diff --git a/autoregressive/index.md b/autoregressive/index.md
index 9e01c9f..94be915 100644
--- a/autoregressive/index.md
+++ b/autoregressive/index.md
@@ -43,7 +43,7 @@
where $$\theta_i$$ denotes the set of parameters used to specify the mean function $$f_i: \{0,1\}^{i-1}\rightarrow [0,1]$$.
-The number of parameters of an autoregressive generative model are given by $$\sum_{i=1}^n \vert \theta_i \vert$$. As we shall see in the examples below, the number of parameters are much fewer than the tabular setting considered previously. Unlike the tabular setting however, an autoregressive generative model cannot represent all possible distributions. Its expressiveness is limited by the fact that we are limiting the conditional distributions to correspond to a Bernoulli random variable with the mean specified via a restricted class of parameterized functions.
+The number of parameters of an autoregressive generative model is given by $$\sum_{i=1}^n \vert \theta_i \vert$$. As we shall see in the examples below, the number of parameters is much smaller than in the tabular setting considered previously. Unlike the tabular setting, however, an autoregressive generative model cannot represent all possible distributions. Its expressiveness is limited by the fact that we restrict the conditional distributions to Bernoulli random variables whose means are specified via a restricted class of parameterized functions.
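To make these counts concrete, the following small Python sketch (an illustration added here, not code from the course) compares the tabular and logistic autoregressive parameter counts:

```python
# Illustrative comparison (not from the course materials): parameter counts
# for a full tabular model versus a logistic autoregressive model over n
# binary variables.

def tabular_params(n: int) -> int:
    # A full joint probability table over n binary variables has
    # 2^n - 1 free parameters.
    return 2 ** n - 1

def logistic_autoregressive_params(n: int) -> int:
    # Conditional i uses i parameters (a bias plus i - 1 weights), so the
    # total is sum_{i=1}^{n} i = n(n + 1) / 2 = O(n^2).
    return n * (n + 1) // 2

for n in [10, 20, 784]:  # 784 = 28 x 28, e.g., binarized MNIST-sized images
    print(n, tabular_params(n), logistic_autoregressive_params(n))
```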
drawing

@@ -59,7 +59,7 @@ f_i(x_1, x_2, \ldots, x_{i-1}) = \sigma(\alpha^{(i)}_0 + \alpha^{(i)}_1 x_1 + \ldots + \alpha^{(i)}_{i-1} x_{i-1})

where $$\sigma$$ denotes the sigmoid function and $$\theta_i=\{\alpha^{(i)}_0,\alpha^{(i)}_1, \ldots, \alpha^{(i)}_{i-1}\}$$ denotes the parameters of the mean function. The conditional for variable $$i$$ requires $$i$$ parameters, and hence the total number of parameters in the model is given by $$\sum_{i=1}^n i = O(n^2)$$. Note that the number of parameters is much smaller than the exponential complexity of the tabular case.

-A natural way to increase the expressiveness of an autoregressive generative model is to use more flexible parameterizations for the mean function e.g., multi-layer perceptrons (MLP). For example, consider the case of a neural network with 1 hidden layer. The mean function for variable $$i$$ can be expressed as
+A natural way to increase the expressiveness of an autoregressive generative model is to use more flexible parameterizations for the mean function, e.g., multi-layer perceptrons (MLPs). For example, consider the case of a neural network with one hidden layer. The mean function for variable $$i$$ can be expressed as

{% math %}
\mathbf{h}_i = \sigma(A_i \mathbf{x_{< i}} + \mathbf{c}_i)\\

@@ -105,14 +105,14 @@ Notice that NADE requires specifying a single, fixed ordering of the variables.

Learning and inference
======================

-Recall that learning a generative model involves optimizing the closeness between the data and model distributions. One commonly used notion of closeness in the KL divergence between the data and the model distributions.
+Recall that learning a generative model involves optimizing the closeness between the data and model distributions. One commonly used notion of closeness is the KL divergence between the data and the model distributions.

{% math %}
\min_{\theta\in \mathcal{M}}d_{KL} (p_{\mathrm{data}}, p_{\theta}) = \mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}} }\left[\log p_{\mathrm{data}}(\mathbf{x}) - \log p_{\theta}(\mathbf{x})\right]
{% endmath %}

-Before moving any further, we make two comments about the KL divergence. First, we note that the KL divergence between any two distributions is asymmetric. As we navigate through this chapter, the reader is encouraged to think what could go wrong if we decided to optimize the reverse KL divergence instead. Secondly, the KL divergences heavily penalizes any model distribution $$p_\theta$$ which assigns low probability to a datapoint that is likely to be sampled under $$p_{\mathrm{data}}$$. In the extreme case, if the density $$p_\theta(\mathbf{x})$$ evaluates to zero for a datapoint sampled from $$p_{\mathrm{data}}$$, the objective evaluates to $$+\infty$$.
+Before moving any further, we make two comments about the KL divergence. First, we note that the KL divergence between any two distributions is asymmetric. As we navigate through this chapter, the reader is encouraged to think about what could go wrong if we decided to optimize the reverse KL divergence instead. Secondly, the KL divergence heavily penalizes any model distribution $$p_\theta$$ which assigns a low probability to a datapoint that is likely to be sampled under $$p_{\mathrm{data}}$$. In the extreme case, if the density $$p_\theta(\mathbf{x})$$ evaluates to zero for a datapoint sampled from $$p_{\mathrm{data}}$$, the objective evaluates to $$+\infty$$.

Since $$p_{\mathrm{data}}$$ does not depend on $$\theta$$, we can equivalently recover the optimal parameters via maximum likelihood estimation.
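As a concrete instance of the logistic parameterization above, here is a minimal NumPy sketch (an added illustration with randomly initialized parameters, not the course's reference code) that evaluates $$\log p_\theta(\mathbf{x}) = \sum_i \log p_{\theta_i}(x_i \vert \mathbf{x}_{< i})$$:

```python
import numpy as np

# Minimal sketch of a fully-visible sigmoid autoregressive model over n
# binary variables: p(x) = prod_i Bern(x_i | f_i(x_{<i})) with a logistic
# mean function. Parameters are random placeholders for illustration.

rng = np.random.default_rng(0)
n = 5
# alpha[i] holds the i + 1 parameters of conditional i: a bias and i weights.
alpha = [rng.normal(size=i + 1) for i in range(n)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(x):
    """Sum of Bernoulli log-conditionals log p(x_i | x_{<i})."""
    ll = 0.0
    for i in range(n):
        mean = sigmoid(alpha[i][0] + alpha[i][1:] @ x[:i])
        ll += x[i] * np.log(mean) + (1 - x[i]) * np.log(1 - mean)
    return ll

x = np.array([1, 0, 1, 1, 0])
print(log_likelihood(x))
```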
@@ -138,7 +138,7 @@ In practice, we optimize the MLE objective using mini-batch gradient ascent. The

where $$\theta^{(t+1)}$$ and $$\theta^{(t)}$$ are the parameters at iterations $$t+1$$ and $$t$$ respectively, and $$r_t$$ is the learning rate at iteration $$t$$. Typically, we only specify the initial learning rate $$r_1$$ and update the rate based on a schedule. [Variants](http://cs231n.github.io/optimization-1/) of stochastic gradient ascent, such as RMSprop and Adam, employ modified update rules that work slightly better in practice.

-From a practical standpoint, we must think about how to choose hyperparameters (such as the initial learning rate) and a stopping criteria for the gradient descent. For both these questions, we follow the standard practice in machine learning of monitoring the objective on a validation dataset. Consequently, we choose the hyperparameters with the best performance on the validation dataset and stop updating the parameters when the validation log-likelihoods cease to improve[^1].
+From a practical standpoint, we must think about how to choose hyperparameters (such as the initial learning rate) and a stopping criterion for gradient ascent. For both of these questions, we follow the standard practice in machine learning of monitoring the objective on a validation dataset. Consequently, we choose the hyperparameters with the best performance on the validation dataset and stop updating the parameters when the validation log-likelihoods cease to improve[^1].

Now that we have a well-defined objective and optimization procedure, the only remaining task is to evaluate the objective in the context of an autoregressive generative model. To this end, we substitute the factorized joint distribution of an autoregressive model in the MLE objective to get

@@ -149,13 +149,13 @@ Now that we have a well-defined objective and optimization procedure, the only r

where $$\theta = \{\theta_1, \theta_2, \ldots, \theta_n\}$$ now denotes the collective set of parameters for the conditionals.

-Inference in an autoregressive model is straightforward. For density estimation of an arbitrary point $$\mathbf{x}$$, we simply evaluate the log-conditionals $$\log p_{\theta_i}(x_i \vert \mathbf{x}_{< i})$$ for each $$i$$ and add these up to obtain the log-likelihood assigned by the model to $$\mathbf{x}$$. Since we know conditioning vector $$\mathbf{x}$$, each of the conditionals can be evaluated in parallel. Hence, density estimation is efficient on modern hardware.
+Inference in an autoregressive model is straightforward. For density estimation of an arbitrary point $$\mathbf{x}$$, we simply evaluate the log-conditionals $$\log p_{\theta_i}(x_i \vert \mathbf{x}_{< i})$$ for each $$i$$ and add these up to obtain the log-likelihood assigned by the model to $$\mathbf{x}$$. Since we know the conditioning vector $$\mathbf{x}$$, each of the conditionals can be evaluated in parallel. Hence, density estimation is efficient on modern hardware.

Sampling from an autoregressive model is a sequential procedure. Here, we first sample $$x_1$$, then we sample $$x_2$$ conditioned on the sampled $$x_1$$, followed by $$x_3$$ conditioned on both $$x_1$$ and $$x_2$$ and so on until we sample $$x_n$$ conditioned on the previously sampled $$\mathbf{x}_{< n}$$. For applications requiring real-time generation of high-dimensional data such as audio synthesis, the sequential sampling can be an expensive process.
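For contrast with the parallel density evaluation, here is the corresponding sequential sampling loop on the same toy logistic model (again an added illustration); each draw depends on all previous draws, which is exactly what makes high-dimensional generation slow:

```python
import numpy as np

# Sequential (ancestral) sampling sketch: draw x_1, then x_2 | x_1, and so
# on. The loop is inherently serial. Parameters are random placeholders.

rng = np.random.default_rng(0)
n = 5
alpha = [rng.normal(size=i + 1) for i in range(n)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample():
    x = np.zeros(n)
    for i in range(n):
        mean = sigmoid(alpha[i][0] + alpha[i][1:] @ x[:i])
        x[i] = rng.binomial(1, mean)  # draw x_i | x_{<i}
    return x

print(sample())
```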
Later in this course, we will discuss how Parallel WaveNet, an autoregressive model, sidesteps this expensive sampling process.

-Finally, an autoregressive model does not directly learn unsupervised representations of the data. In the next few set of lectures, we will look at latent variable models (e.g., variational autoencoders) which explicitly learn latent representations of the data.
+Finally, an autoregressive model does not directly learn unsupervised representations of the data. In the next few lectures, we will look at latent variable models (e.g., variational autoencoders) which explicitly learn latent representations of the data.

-

Finally, an autoregressive model does not directly learn unsupervised representations of the data. In the next few set of lectures, we will look at latent variable models (e.g., variational autoencoders) which explicitly learn latent representations of the data.

+

Finally, an autoregressive model does not directly learn unsupervised representations of the data. In the next few lectures, we will look at latent variable models (e.g., variational autoencoders) which explicitly learn latent representations of the data.

-

The generator and discriminator both play a two player minimax game, where the generator minimizes a two-sample test objective () and the discriminator maximizes the objective (). Intuitively, the generator tries to fool the discriminator to the best of its ability by generating samples that look indisginguishable from .

+

The generator and discriminator play a two-player minimax game, where the generator minimizes a two-sample test objective ($$p_{\mathrm{data}} = p_\theta$$) and the discriminator maximizes the objective ($$p_{\mathrm{data}} \neq p_\theta$$). Intuitively, the generator tries to fool the discriminator to the best of its ability by generating samples that look indistinguishable from $$p_{\mathrm{data}}$$.

Formally, the GAN objective can be written as:

\min_\theta \max_\phi V(G_\theta, D_\phi) = \mathbb{E}_{\mathbf{x} \sim p_{\textrm{data}}}\left[\log D_\phi(\mathbf{x})\right] + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}\left[\log \left(1 - D_\phi(G_\theta(\mathbf{z}))\right)\right]
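To make the alternating optimization concrete, here is a minimal PyTorch sketch of one discriminator update followed by one generator update (an added illustration: the architectures and data are placeholders, and it uses the common non-saturating generator loss rather than the exact minimax form above):

```python
import torch
import torch.nn as nn

# Minimal GAN update sketch (illustrative; placeholder networks and data).
G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))  # generator G_theta
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator D_phi (logits)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

x_real = torch.randn(64, 2)  # stand-in for a mini-batch from p_data
z = torch.randn(64, 16)      # latent noise z ~ p(z)

# Discriminator step: push D(x_real) toward 1 and D(G(z)) toward 0.
d_loss = bce(D(x_real), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step (non-saturating loss): push D(G(z)) toward 1 to fool D.
g_loss = bce(D(G(z)), torch.ones(64, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```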

@@ -125,7 +125,7 @@

GAN Objective

D_{\textrm{JSD}}[p, q] = \frac{1}{2} \left( D_{\textrm{KL}}\left[p, \frac{p+q}{2} \right] + D_{\textrm{KL}}\left[q, \frac{p+q}{2} \right] \right) -

The JSD satisfies all properties of the KL, and has the additional perk that . With this distance metric, the optimal generator for the GAN objective becomces , and the optimal objective value that we can achieve with optimal generators and discriminators and is .

+

The JSD satisfies all properties of the KL, and has the additional perk that $$D_{\textrm{JSD}}[p, q] = D_{\textrm{JSD}}[q, p]$$. With this distance metric, the optimal generator for the GAN objective becomes $$p_G = p_{\textrm{data}}$$, and the optimal objective value that we can achieve with optimal generators and discriminators $$G^*(\cdot)$$ and $$D^*_{G^*}(\mathbf{x})$$ is $$-\log 4$$.
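A quick NumPy check of these properties (an added illustration on discrete distributions):

```python
import numpy as np

def kl(p, q):
    # KL divergence between discrete distributions, in nats.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    # Jensen-Shannon divergence: symmetric, and bounded above by log 2.
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.9, 0.1, 0.0])
q = np.array([0.0, 0.1, 0.9])
print(jsd(p, q), jsd(q, p))  # equal, unlike kl(p, q) and kl(q, p)
print(jsd(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # log 2 for disjoint supports
```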

GAN training algorithm

@@ -149,13 +149,13 @@

GAN training algorithm

Challenges

-

Although GANs have been successfully applied to several domains and tasks, working with them in practice is challenging because of their: (1) unstable optimization procedure, (2) potential for mode collapse, (3) difficulty in evaluation.

+

Although GANs have been successfully applied to several domains and tasks, working with them in practice is challenging because of (1) their unstable optimization procedure, (2) the potential for mode collapse, and (3) the difficulty of performance evaluation.

-

During optimization, the generator and discriminator loss often continue to oscillate without converging to a clear stopping point. Due to the lack of a robust stopping criteria, it is difficult to know when exactly the GAN has finished training. Additionally, the generator of a GAN can often get stuck producing one of a few types of samples over and over again (mode collapse). Most fixes to these challenges are empirically driven, and there has been a significant amount of work put into developing new architectures, regularization schemes, and noise perturbations in an attempt to circumvent these issues. Soumith Chintala has a nice link outlining various tricks of the trade to stabilize GAN training.

+

During optimization, the generator and discriminator losses often continue to oscillate without converging to a definite stopping point. Due to the lack of a robust stopping criterion, it is difficult to know when exactly the GAN has finished training. Additionally, the generator of a GAN can often get stuck producing one of a few types of samples over and over again (mode collapse). Most fixes to these challenges are empirically driven, and there has been a significant amount of work put into developing new architectures, regularization schemes, and noise perturbations in an attempt to circumvent these issues. Soumith Chintala has a nice link outlining various tricks of the trade to stabilize GAN training.

Selected GANs

-

Next, we focus our attention to a few select types of GAN architectures and explore them in more detail.

+

Next, we focus our attention on a few select types of GAN architectures and explore them in more detail.

f-GAN

The f-GAN optimizes the variant of the two-sample test objective that we have discussed so far, but using a very general notion of distance: the $$f$$-divergence. Given two densities $$p$$ and $$q$$, the $$f$$-divergence can be written as:

D_f(p, q) = \mathbb{E}_{\mathbf{x} \sim q}\left[f\left(\frac{p(\mathbf{x})}{q(\mathbf{x})}\right)\right]

where $$f$$ is any convex, lower semicontinuous function with $$f(1) = 0$$.

@@ -177,7 +177,7 @@

f-GAN

\min_\theta \max_\phi F(\theta,\phi) = \mathbb{E}_{x \sim p_{\textrm{data}}}[T_\phi(\mathbf{x})] - \mathbb{E}_{x \sim p_{G_\theta}}[f^*(T_\phi(\mathbf{x}))] -

Intuitively, we can think about this objective as the generator trying to minimize the divergence estimate, while the discriminator tries to tighten the lower bound.

+

Intuitively, we can think about this objective as the generator trying to minimize the divergence estimate, while the discriminator attempts to tighten the lower bound.
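As a toy numerical check of this bound (an added illustration: we choose $$f$$ so that the divergence is the KL, and plug in the closed-form optimal critic for two Gaussians instead of a learned discriminator):

```python
import numpy as np

# Monte Carlo estimate of the f-GAN lower bound E_p[T(x)] - E_q[f*(T(x))]
# for the KL divergence, where f(u) = u log u and f*(t) = exp(t - 1).
# Instead of learning T_phi, we use the optimal critic T(x) = 1 + log(p/q).

rng = np.random.default_rng(0)

def log_normal(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

mu_p, mu_q, sigma = 0.0, 1.0, 1.0
xs_p = rng.normal(mu_p, sigma, 100_000)
xs_q = rng.normal(mu_q, sigma, 100_000)

def T(x):  # optimal critic for the KL case
    return 1.0 + log_normal(x, mu_p, sigma) - log_normal(x, mu_q, sigma)

bound = T(xs_p).mean() - np.exp(T(xs_q) - 1.0).mean()
print(bound)  # approaches KL(p || q) = (mu_p - mu_q)^2 / 2 = 0.5
```

With a suboptimal critic the same expression evaluates to something strictly smaller, which is why the discriminator's job is to tighten the bound.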

BiGAN

We won’t worry too much about the BiGAN in these notes. However, we can think about this model as one that allows us to infer latent representations even within a GAN framework.

diff --git a/docs/introduction/index.html b/docs/introduction/index.html index f0f5990..53241c5 100644 --- a/docs/introduction/index.html +++ b/docs/introduction/index.html @@ -80,8 +80,7 @@

Introduction

Intelligent agents are constantly generating, acquiring, and processing data. This data could be in the form of images that we capture on our phones, text messages we share with our friends, graphs that model -interactions on social media, videos that record important events, -etc. Natural agents excel at discovering patterns, extracting +interactions on social media, or videos that record important events. Natural agents excel at discovering patterns, extracting knowledge, and performing complex reasoning based on the data they observe. How can we build artificial learning systems to do the same?

@@ -91,7 +90,7 @@

Introduction

underlying distribution, say . At its very core, the goal of any generative model is then to approximate this data distribution given access to the dataset . The hope is that -if we are able to learn a good generative model, we can use the +if we can learn a good generative model, we can use the learned model for downstream inference.

Learning

@@ -105,7 +104,7 @@

Learning

In the parametric setting, we can think of the task of learning a generative model as picking the parameters within a family of model distributions that minimizes some notion of distance1 between the -model distribution and the data distribution.

+model distribution and data distribution.

drawing

@@ -114,7 +113,7 @@

Learning

![goal](learning_2.png =100x20) -->

For instance, we might be given access to a dataset of dog images and -our goal is to learn the paraemeters of a generative model within a model family such that +our goal is to learn the parameters of a generative model within a model family such that the model distribution is close to the data distribution over dogs . Mathematically, we can specify our goal as the following optimization problem: \begin{equation} @@ -130,17 +129,17 @@

Learning

Each pixel has three channels: R(ed), G(reen) and B(lue) and each channel can take a value between 0 and 255. Hence, the number of possible images is given by . -In contrast, Imagenet, one of the largest publicly available datasets, +In contrast, ImageNet, one of the largest publicly available datasets, consists of only about 15 million images. Hence, learning a generative model with such a limited dataset is a highly underdetermined problem.
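A one-line computation makes the gap vivid (an added illustration that assumes 64x64 RGB images; the resolution is a placeholder, not the figure used in the original text):

```python
# Hypothetical 64x64 RGB images: each of the H * W pixels has three channels
# with 256 possible values, giving 256^(3 * H * W) distinct images.
H = W = 64
num_images = 256 ** (3 * H * W)
print(len(str(num_images)))  # prints 29593: a count with ~30,000 decimal
                             # digits, dwarfing ImageNet's ~15 million images
```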

Fortunately, the real world is highly structured and automatically discovering the underlying structure is key to learning generative models. For example, we can hope to learn some basic artifacts about -dogs even with just a few images: two eyes, two ears, fur etc. Instead +dogs even with just a few images: two eyes, two ears, fur, etc. Instead of incorporating this prior knowledge explicitly, we will hope the model learns the underlying structure directly from data. There is no free -lunch however, and indeed successful learning of generative models will +lunch, however, and indeed successful learning of generative models will involve instantiating the optimization problem in in a suitable way. In this course, we will be primarily interested in the following questions:

@@ -151,9 +150,9 @@

Learning

  • What is the optimization procedure for minimizing ?
  • -

    In the next few set of lectures, we will take a deeper dive into certain +

    In the next few lectures, we will take a deeper dive into certain families of generative models. For each model family, we will note how -the representation is closely tied with the choice of learning objective +the representation relates to the choice of learning objective and the optimization procedure.

    Inference

    @@ -165,7 +164,7 @@

    Inference

While the range of applications to which generative models have been applied continues to grow, we can identify three fundamental inference -queries for evaluating a generative model.:

    +queries for evaluating a generative model:

1.

diff --git a/docs/vae/index.html b/docs/vae/index.html
index c9d2f14..e5544e3 100644
--- a/docs/vae/index.html
+++ b/docs/vae/index.html
@@ -170,11 +170,11 @@

      Learning Directed Latent Varia \log p(\bx) \approx \log \frac{1}{k} \sum_{i=1}^k p(\bx \vert \bz^{(i)}) \text{, where } \bz^{(i)} \sim p(\bz) -

      In practice however, optimizing the above estimate suffers from high variance in gradient estimates.

      +

      In practice, however, optimizing the above estimate suffers from high variance in gradient estimates.
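Here is an added toy illustration of this estimator on a model whose exact marginal likelihood is known ($$z \sim \mathcal{N}(0,1)$$, $$x \vert z \sim \mathcal{N}(z,1)$$, so $$x \sim \mathcal{N}(0,2)$$); the high variance shows up already in the estimate itself, which is noisy and biased low for small $$k$$:

```python
import numpy as np

# Naive Monte Carlo estimate log p(x) ~= log (1/k) sum_i p(x | z_i),
# z_i ~ p(z), on a toy linear Gaussian model (illustrative sketch only).

rng = np.random.default_rng(0)

def log_normal(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

x = 3.0
for k in [10, 100, 10_000]:
    z = rng.normal(0.0, 1.0, size=k)  # z^(i) ~ p(z)
    est = np.log(np.mean(np.exp(log_normal(x, z, 1.0))))
    print(k, est)

print("true", log_normal(x, 0.0, 2.0))  # exact log N(3; 0, 2)
```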

Rather than maximizing the log-likelihood directly, an alternative is to construct a lower bound that is more amenable to optimization. To do so, we note that evaluating the marginal likelihood is at least as difficult as evaluating the posterior for any latent vector since by definition .

      -

      Next, we introduce a variational family of distributions that approximate the true, but intractable posterior . Further henceforth, we will assume a parameteric setting where any distribution in the model family is specified via a set of parameters and distributions in the variational family are specified via a set of parameters .

      +

Next, we introduce a variational family of distributions that approximate the true, but intractable posterior . Henceforth, we will assume a parametric setting where any distribution in the model family is specified via a set of parameters and distributions in the variational family are specified via a set of parameters .

Given and , we note that the following relationships hold true[^1] for any and all variational distributions
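Assuming the standard notation of model parameters $$\theta$$ and variational parameters $$\lambda$$ (the specific symbols here are an assumption), the relationship takes the usual form:

\log p_\theta(\mathbf{x}) = \mathbb{E}_{q_\lambda(\mathbf{z})}\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\lambda(\mathbf{z})}\right] + D_{\textrm{KL}}\left(q_\lambda(\mathbf{z}) \,\Vert\, p_\theta(\mathbf{z} \vert \mathbf{x})\right) \geq \mathbb{E}_{q_\lambda(\mathbf{z})}\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\lambda(\mathbf{z})}\right]

Since the KL term is non-negative, the right-hand side, the evidence lower bound (ELBO), lower-bounds the log marginal likelihood, with equality exactly when $$q_\lambda(\mathbf{z}) = p_\theta(\mathbf{z} \vert \mathbf{x})$$.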

      @@ -310,7 +310,7 @@

      Parameterizing Di

      Amortized Variational Inference

      -

      A noticable limitation of black-box variational inference is that Step 1 executes an optimization subroutine that is computationally expensive. Recall that the goal of the Step 1 is to find

      +

A noticeable limitation of black-box variational inference is that Step 1 executes an optimization subroutine that is computationally expensive. Recall that the goal of Step 1 is to find

      -

      It is also worth noting that optimizing over the entire dataset as a subroutine everytime we sample a new mini-batch is clearly not reasonable. However, if we believe that is capable of quickly adapting to a close-enough approximation of given the current choice of , then we can interleave the optimization and . The yields the following procedure, where for each mini-batch , we perform the following two updates jointly

      +

      It is also worth noting that optimizing over the entire dataset as a subroutine every time we sample a new mini-batch is clearly not reasonable. However, if we believe that is capable of quickly adapting to a close-enough approximation of given the current choice of , then we can interleave the optimization and . This yields the following procedure, where for each mini-batch , we perform the following two updates jointly
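A PyTorch-flavored sketch of these two joint updates (an added illustration under standard amortized-VI assumptions; the encoder and decoder architectures, the Gaussian likelihood, and all dimensions are placeholders):

```python
import torch
import torch.nn as nn

# One joint per-mini-batch update in amortized VI: ascend the mini-batch
# ELBO in both the model parameters (decoder) and the shared inference
# parameters (encoder) at the same time.

d_x, d_z = 8, 2
decoder = nn.Linear(d_z, d_x)      # defines p(x|z), here a Gaussian mean
encoder = nn.Linear(d_x, 2 * d_z)  # defines q(z|x): mean and log-variance
opt = torch.optim.Adam(list(decoder.parameters()) + list(encoder.parameters()), lr=1e-3)

x = torch.randn(32, d_x)  # stand-in mini-batch

mu, log_var = encoder(x).chunk(2, dim=-1)
z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterized sample
recon = decoder(z)

# ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)) with a standard normal prior.
log_px_z = -0.5 * ((x - recon) ** 2).sum(dim=-1)  # unit-variance Gaussian likelihood (up to a constant)
kl = 0.5 * (mu**2 + log_var.exp() - 1 - log_var).sum(dim=-1)
loss = -(log_px_z - kl).mean()

opt.zero_grad()
loss.backward()
opt.step()  # a single step updates the model and inference network jointly
```

The key point is that one gradient step moves the model and the shared inference network together, rather than re-solving the inner variational optimization from scratch for every mini-batch.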