
Commit c695f6b

More md -> mdx
1 parent ba5c00a commit c695f6b


7 files changed: +163 -144 lines


src/content/post/annealed-importance-sampling.md renamed to src/content/post/annealed-importance-sampling.mdx

Lines changed: 22 additions & 9 deletions
@@ -4,16 +4,21 @@ description: 'An introduction and implementation of Annealed Importance Sampling
 publishDate: 2017-12-23 07:00
 tags: [machine learning, bayesian]
 ---
+import BlogImage from '@/components/BlogImage.astro';

 Suppose we have this distribution:

-$$ p(x) = \frac{1}{Z} f(x) $$
+$$
+p(x) = \frac{1}{Z} f(x)
+$$

 where $ Z = \sum_x f(x) $. In high dimensions, this summation is intractable, as there is an exponential number of terms. We have no hope of computing $ Z $, and in turn we cannot evaluate this distribution.

 Now, how do we compute an expectation w.r.t. $ p(x) $, i.e.:

-$$ \mathbb{E}\_{p(x)}[x] = \sum_x x p(x) $$
+$$
+\mathbb{E}_{p(x)}[x] = \sum_x x p(x)
+$$

 It is impossible for us to do this exactly, as we don't know $ p(x) $. Our best hope is to approximate it. One of the popular ways is importance sampling. However, importance sampling has a hyperparameter that is hard to adjust: the proposal distribution $ q(x) $. Importance sampling works well if we can provide a $ q(x) $ that is a good approximation of $ p(x) $. Finding a good $ q(x) $ is problematic, and this is one of the motivations behind Annealed Importance Sampling (AIS) [1].
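
A minimal NumPy sketch of plain importance sampling for this expectation, with an assumed Gaussian target and proposal (these particular choices are for illustration only and are not taken from the post):

```python
import numpy as np

np.random.seed(0)

# Unnormalized target f(x); we pretend the normalizer Z is unknown
f = lambda x: np.exp(-0.5 * (x - 5)**2)

# Proposal q(x) = N(0, 3^2): easy to sample from and to evaluate
mu_q, sigma_q = 0.0, 3.0
q_pdf = lambda x: np.exp(-0.5 * ((x - mu_q) / sigma_q)**2) / (sigma_q * np.sqrt(2 * np.pi))

# Draw from the proposal and weight each sample by f(x) / q(x)
xs = mu_q + sigma_q * np.random.randn(100_000)
ws = f(xs) / q_pdf(xs)

# Self-normalized importance sampling estimate of E_p[x]
print(np.sum(ws * xs) / np.sum(ws))  # should land near 5

# If q were a poor match for p (e.g. centered far from the target), a handful
# of huge weights would dominate the estimate -- the problem AIS addresses.
```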

@@ -28,8 +33,8 @@ The construction of AIS is as follows:

 Then to sample from $ p_0(x) $, we need to:

-- Sample an independent point from $ x\_{n-1} \sim p_n(x) $.
-- Sample $ x\_{n-2} $ from $ x\_{n-1} $ by doing MCMC w.r.t. $ T\_{n-1} $.
+- Sample an independent point $ x_{n-1} \sim p_n(x) $.
+- Sample $ x_{n-2} $ from $ x_{n-1} $ by doing MCMC w.r.t. $ T_{n-1} $.
 - $ \dots $
 - Sample $ x_1 $ from $ x_2 $ by doing MCMC w.r.t. $ T_2 $.
 - Sample $ x_0 $ from $ x_1 $ by doing MCMC w.r.t. $ T_1 $.
@@ -38,13 +43,17 @@ Intuitively given two distributions, which might be disjoint in their support, w

 At this point, we have a sequence of points $ x_{n-1}, x_{n-2}, \dots, x_1, x_0 $. We can use them to compute the importance weight as follows:

-$$ w = \frac{f\_{n-1}(x\_{n-1})}{f\_n(x\_{n-1})} \frac{f\_{n-2}(x\_{n-2})}{f\_{n-1}(x\_{n-2})} \dots \frac{f_1(x_1)}{f_2(x_1)} \frac{f_0(x_0)}{f_1(x_0)} $$
+$$
+w = \frac{f_{n-1}(x_{n-1})}{f_n(x_{n-1})} \frac{f_{n-2}(x_{n-2})}{f_{n-1}(x_{n-2})} \dots \frac{f_1(x_1)}{f_2(x_1)} \frac{f_0(x_0)}{f_1(x_0)}
+$$

 Notice that $ w $ is telescoping, and without the intermediate distributions, it reduces to the usual weight used in importance sampling.

 With this importance weight, we can then compute the expectation as in importance sampling:

-$$ \mathbb{E}\_{p(x)}[x] = \frac{1}{\sum_i^N w_i} \sum_i^N x_i w_i $$
+$$
+\mathbb{E}_{p(x)}[x] = \frac{1}{\sum_i^N w_i} \sum_i^N x_i w_i
+$$

 where $ N $ is the number of samples.
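
To spell the telescoping claim out: with no intermediate distributions ($ n = 1 $, so $ f_1 = f_n $ is just the unnormalized proposal), the product collapses to

$$
w = \frac{f_0(x_0)}{f_1(x_0)} = \frac{f_0(x_0)}{f_n(x_0)},
$$

which is exactly the standard importance weight of the target against the proposal.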

@@ -54,15 +63,19 @@ We now have the full algorithm. However several things are missing, namely, the

 For the intermediate distributions, we can set them as an annealing between our target and proposal functions, i.e.:

-$$ f_j(x) = f_0(x)^{\beta_j} f_n(x)^{1-\beta_j} $$
+$$
+f_j(x) = f_0(x)^{\beta_j} f_n(x)^{1-\beta_j}
+$$

 where $ 1 = \beta_0 > \beta_1 > \dots > \beta_n = 0 $. As a visual example, annealing from $ N(0, I) $ to $ N(5, I) $ with 10 intermediate distributions gives us:

-![Annealing]({{ site.baseurl }}/img/2017-12-23-annealed-importance-sampling/intermediate_dists.png)
+<BlogImage imagePath="/img/annealed-importance-sampling/intermediate_dists.png" altText="Intermediate distributions." />

 For the transition functions, we can use Metropolis-Hastings with acceptance probability:

-$$ A_j(x, x') = \frac{f_j(x')}{f_j(x)} $$
+$$
+A_j(x, x') = \frac{f_j(x')}{f_j(x)}
+$$

 assuming we have a symmetric proposal, e.g. $ N(0, I) $.
src/content/post/boundary-seeking-gan.md

Lines changed: 0 additions & 73 deletions
This file was deleted. (Its `.mdx` replacement is the new file below.)
Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
+---
+title: 'Boundary Seeking GAN'
+description: 'Training GAN by moving the generated samples to the decision boundary.'
+publishDate: 2017-03-07 00:10
+tags: [machine learning, gan]
+---
+
+import BlogImage from '@/components/BlogImage.astro'
+
+Boundary Seeking GAN (BGAN) is a recently introduced modification of GAN training. In this post, we will look at the intuition behind BGAN, and also at its implementation, which is a one-line change from vanilla GAN.
+
+## Intuition of Boundary Seeking GAN
+
+Recall that in GAN the following objective is optimized:
+
+$$
+\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_\text{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log (1 - D(G(z)))]
+$$
+
+Following the objective above, as shown in the original GAN paper [2], the optimal discriminator $D^*_G(x)$ is given by:
+
+$$
+D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}
+$$
+
+Hence, if we know the optimal discriminator with respect to our generator, $D^*_G(x)$, we are good to go, as rearranging the equation above gives us:
+
+$$
+p_{data}(x) = p_g(x) \frac{D^*_G(x)}{1 - D^*_G(x)}
+$$
+
+What this tells us is that, even if we have a non-optimal generator $G$, we can still recover the true data distribution by weighting $p_g(x)$, the generator's distribution, with a ratio given by the optimal discriminator for that generator.
+
+Unfortunately, the perfect discriminator is hard to get. But we can work with its approximation $D(x)$ instead. The assumption is that as we train $D(x)$ more and more, it becomes closer and closer to $D^*_G(x)$, and our GAN training becomes better and better.
+
+If we look further at the above equation, we see that $p_{data}(x) = p_g(x)$, i.e. our generator is optimal, if the ratio $\frac{D(x)}{1 - D(x)}$ is equal to one. If that ratio is equal to one, then $D(x)$ must be equal to $0.5$. Therefore, the optimal generator is the one that makes the discriminator output $0.5$ everywhere. Notice that $D(x) = 0.5$ is the decision boundary. Hence, we want to generate $x \sim G(z)$ such that $D(x)$ is near the decision boundary. This is why the authors of the paper named this method _Boundary Seeking GAN_ (BGAN).
+
+That statement has a very intuitive explanation. If the generator were perfect, $D(x)$ could not distinguish the real data from the fake data. In other words, real and fake data would be equally likely as far as $D(x)$ is concerned. As $D(x)$ has two possible outputs (real or fake), each of those outputs has probability $0.5$.
+
+Now, we can modify the generator's objective so as to make the discriminator output $0.5$ for every sample we generate. One way to do this is to minimize the distance between $D(x)$ and $1 - D(x)$ for all $x$. If we do so, as $D(x)$ is a probability, we will get the minimum at $D(x) = 1 - D(x) = 0.5$, which is what we want.
+
+Therefore, the new objective for the generator is:
+
+$$
+\min_{G} \, \mathbb{E}_{z \sim p_z(z)} \left[ \frac{1}{2} (\log D(x) - \log(1 - D(x)))^2 \right]
+$$
+
+which is just an $L_2$ loss. We take the $\log$ because $D(x)$ is a probability and we want to undo that, as we are talking about a distance, not a divergence.
+
+## Implementation
+
+This should be the shortest implementation note ever on my blog.
+
+We just need to change the original GAN's $G$ objective from:
+
+```python
+G_loss = -torch.mean(log(D_fake))
+```
+
+to:
+
+```python
+G_loss = 0.5 * torch.mean((log(D_fake) - log(1 - D_fake))**2)
+```
+
+And we're done. For the full code, check out https://github.com/wiseodd/generative-modes.
+
+## Conclusion
+
+In this post we looked at a new GAN variation called Boundary Seeking GAN (BGAN). We looked at the intuition behind BGAN, and tried to understand why it is called "boundary seeking".
+
+We also implemented BGAN in Pytorch with just a one-line code change.
+
+## References
+
+1. Hjelm, R. Devon, et al. "Boundary-Seeking Generative Adversarial Networks." arXiv preprint arXiv:1702.08431 (2017). [arxiv](https://arxiv.org/abs/1702.08431)
+2. Goodfellow, Ian, et al. "Generative adversarial nets." Advances in Neural Information Processing Systems. 2014. [paper](http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf)
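
To see where that one-line change sits, here is a minimal PyTorch sketch of a full training step. The toy networks, stand-in data, batch size, and optimizer settings are assumptions for illustration only; they are not taken from the post above or from the linked repository.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

z_dim, x_dim, h_dim = 16, 2, 64  # toy sizes, assumed for illustration

G = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))
D = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, 1), nn.Sigmoid())

G_opt = torch.optim.Adam(G.parameters(), lr=1e-3)
D_opt = torch.optim.Adam(D.parameters(), lr=1e-3)

eps = 1e-8  # keep the logs away from log(0)

for step in range(1000):
    x_real = torch.randn(64, x_dim) + 5.0  # stand-in "real" data
    z = torch.randn(64, z_dim)

    # Discriminator step: unchanged from vanilla GAN
    D_real = D(x_real)
    D_fake = D(G(z).detach())
    D_loss = -torch.mean(torch.log(D_real + eps) + torch.log(1 - D_fake + eps))

    D_opt.zero_grad()
    D_loss.backward()
    D_opt.step()

    # Generator step: the boundary-seeking loss replaces -log D(G(z))
    D_fake = D(G(z))
    G_loss = 0.5 * torch.mean((torch.log(D_fake + eps) - torch.log(1 - D_fake + eps)) ** 2)

    G_opt.zero_grad()
    G_loss.backward()
    G_opt.step()
```

Only the `G_loss` line differs from a vanilla GAN loop, which is exactly the point of the post.
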
File renamed without changes.
Lines changed: 28 additions & 35 deletions
@@ -5,9 +5,9 @@ publishDate: 2017-09-07 11:56
 tags: [machine learning, bayesian]
 ---

-Latent Dirichlet Allocation (LDA) [1] is a mixed membership model for topic modeling. Given a set of documents in bag of words representation, we want to infer the underlying topics those documents represent. To get a better intuition, we shall look at LDA's generative story. Note, the full code is available at <https://github.com/wiseodd/mixture-models>.
+Latent Dirichlet Allocation (LDA) [1] is a mixed membership model for topic modeling. Given a set of documents in bag of words representation, we want to infer the underlying topics those documents represent. To get a better intuition, we shall look at LDA's generative story. Note, the full code is available at https://github.com/wiseodd/mixture-models.

-Given \\( i = \\{1, \dots, N_D\\} \\) the document index, \\( v = \\{1, \dots, N_W\\} \\) the word index, \\( k = \\{1, \dots, N_K\\} \\) the topic index, LDA assumes:
+Given $i = \\{1, \dots, N_D\\}$ the document index, $v = \\{1, \dots, N_W\\}$ the word index, $k = \\{1, \dots, N_K\\}$ the topic index, LDA assumes:

 $$
 \begin{align}
@@ -20,18 +20,18 @@ y_{iv} &\sim \text{Cat}(y_{iv} \, \vert \, z_{iv} = k, \mathbf{B})
 \end{align}
 $$

-where \\( \alpha \\) and \\( \gamma \\) are the parameters for the Dirichlet priors. They tell us how narrow or spread the document topic and topic word distributions are.
+where $\alpha$ and $\gamma$ are the parameters for the Dirichlet priors. They tell us how narrow or spread out the document topic and topic word distributions are.

 Details of the above generative process in words:

-1. Assume each document generated by selecting the topic first. Thus, sample \\( \mathbf{\pi}\_i \\), the topic distribution for \\( i \\)-th document.
-2. Assume each words in \\( i \\)-th document comes from one of the topics. Therefore, we sample \\( z\_{iv} \\), the topic for each word \\( v \\) in document \\( i \\).
-3. Assume each topic is composed of words, e.g. topic "computer" consits of words "cpu", "gpu", etc. Therefore, we sample \\( \mathbf{b}\_k \\), the distribution those words for particular topic \\( k \\).
-4. Finally, to actually generate the word, given that we already know it comes from topic \\( k \\), we sample the word \\( y\_{iv} \\) given the \\( k \\)-th topic word distribution.
+1. Assume each document is generated by selecting its topics first. Thus, sample $\mathbf{\pi}_i$, the topic distribution for the $i$-th document.
+2. Assume each word in the $i$-th document comes from one of the topics. Therefore, we sample $z_{iv}$, the topic for each word $v$ in document $i$.
+3. Assume each topic is composed of words, e.g. topic "computer" consists of words "cpu", "gpu", etc. Therefore, we sample $\mathbf{b}_k$, the distribution of those words for a particular topic $k$.
+4. Finally, to actually generate the word, given that we already know it comes from topic $k$, we sample the word $y_{iv}$ from the $k$-th topic word distribution.

 ## Inference

-The goal of inference in LDA is that given a corpus, we infer the underlying topics that explain those documents, according to the generative process above. Essentially, given \\( y\_{iv} \\), we are inverting the above process to find \\( z\_{iv} \\), \\( \mathbf{\pi}\_i \\), and \\( \mathbf{b}\_k \\).
+The goal of inference in LDA is, given a corpus, to infer the underlying topics that explain those documents, according to the generative process above. Essentially, given $y_{iv}$, we are inverting the above process to find $z_{iv}$, $\mathbf{\pi}_i$, and $\mathbf{b}_k$.

 We will infer those variables using the Gibbs sampling algorithm. In short, it works by sampling each of those variables given the other variables (its full conditional distribution). Because of conjugacy, the full conditionals are as follows:
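
The full conditionals themselves sit in unchanged lines between this hunk and the next, so they do not appear in the diff. As a rough sketch of what the conjugate updates amount to in this parameterization (Dirichlet priors plus counts, consistent with the resampling code further below), one could write something like the following; the helper names `resample_pi` and `resample_b` are hypothetical, not from the post:

```python
import numpy as np

def resample_pi(Z, alpha, N_K):
    # Pi_i | Z ~ Dirichlet(alpha + topic counts in document i)
    N_D = Z.shape[0]
    Pi = np.zeros([N_D, N_K])
    for i in range(N_D):
        counts = np.bincount(Z[i].astype(int), minlength=N_K)
        Pi[i] = np.random.dirichlet(alpha + counts)
    return Pi

def resample_b(Z, X, gamma, N_K, N_W):
    # B_k | Z, X ~ Dirichlet(gamma + counts of each word assigned to topic k)
    B = np.zeros([N_K, N_W])
    for k in range(N_K):
        counts = np.zeros(N_W)
        for i in range(X.shape[0]):
            for v in range(X.shape[1]):
                if Z[i, v] == k:
                    counts[X[i, v]] += 1
        B[k] = np.random.dirichlet(gamma + counts)
    return B
```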

@@ -51,46 +51,42 @@ Given those full conditionals, the rest is as easy as plugging those into the Gi

 ## Implementation

-We begin with randomly initializing topic assignment matrix \\( \mathbf{Z}\_{N_D \times N_W} \\). We also sample the initial values of \\( \mathbf{\Pi}\_{N_D \times N_K} \\) and \\( \mathbf{B}\_{N_K \times N_W} \\).
+We begin by randomly initializing the topic assignment matrix $\mathbf{Z}_{N_D \times N_W}$. We also sample the initial values of $\mathbf{\Pi}_{N_D \times N_K}$ and $\mathbf{B}_{N_K \times N_W}$.

 ```python
-
 # Dirichlet priors
-
 alpha = 1
 gamma = 1

 # Z := word topic assignment
-
 Z = np.zeros(shape=[N_D, N_W])

 for i in range(N_D):
-for l in range(N_W):
-Z[i, l] = np.random.randint(N_K) # randomly assign word's topic
+    for l in range(N_W):
+        Z[i, l] = np.random.randint(N_K) # randomly assign word's topic

 # Pi := document topic distribution
-
 Pi = np.zeros([N_D, N_K])
-
 for i in range(N_D):
-Pi[i] = np.random.dirichlet(alpha\*np.ones(N_K))
+    Pi[i] = np.random.dirichlet(alpha*np.ones(N_K))

 # B := word topic distribution
-
 B = np.zeros([N_K, N_W])
-
 for k in range(N_K):
-B[k] = np.random.dirichlet(gamma\*np.ones(N_W))
+    B[k] = np.random.dirichlet(gamma*np.ones(N_W))
 ```

 Then we sample the new values for each of those variables from the full conditionals in the previous section, and iterate:

 ```python
-for it in range(1000): # Sample from full conditional of Z # ---------------------------------
-for i in range(N_D):
-for v in range(N_W): # Calculate params for Z
-p_iv = np.exp(np.log(Pi[i]) + np.log(B[:, X[i, v]]))
-p_iv /= np.sum(p_iv)
+for it in range(1000):
+    # Sample from full conditional of Z
+    # ---------------------------------
+    for i in range(N_D):
+        for v in range(N_W):
+            # Calculate params for Z
+            p_iv = np.exp(np.log(Pi[i]) + np.log(B[:, X[i, v]]))
+            p_iv /= np.sum(p_iv)

             # Resample word topic assignment Z
             Z[i, v] = np.random.multinomial(1, p_iv).argmax()
@@ -130,28 +126,25 @@ And basically we are done. We could inspect the result by looking at those varia
 Let's say we have these data:

 ```python
-
 # Words
-
 W = np.array([0, 1, 2, 3, 4])

 # D := document words
-
 X = np.array([
-[0, 0, 1, 2, 2],
-[0, 0, 1, 1, 1],
-[0, 1, 2, 2, 2],
-[4, 4, 4, 4, 4],
-[3, 3, 4, 4, 4],
-[3, 4, 4, 4, 4]
+    [0, 0, 1, 2, 2],
+    [0, 0, 1, 1, 1],
+    [0, 1, 2, 2, 2],
+    [4, 4, 4, 4, 4],
+    [3, 3, 4, 4, 4],
+    [3, 4, 4, 4, 4]
 ])

 N_D = X.shape[0] # num of docs
 N_W = W.shape[0] # num of words
 N_K = 2 # num of topics
 ```

-Those data are already in bag of words representation, so it is a little abstract at a glance. However if we look at it, we could see two big clusters of documents based on their words: \\( \\{ 1, 2, 3 \\} \\) and \\( \\{ 4, 5, 6 \\} \\). Therefore, we expect after our sampler converges to the posterior, the topic distribution for those documents will follow our intuition.
+These data are already in bag of words representation, so they are a little abstract at a glance. However, if we look closely, we can see two big clusters of documents based on their words: $\\{ 1, 2, 3 \\}$ and $\\{ 4, 5, 6 \\}$. Therefore, we expect that after our sampler converges to the posterior, the topic distributions for those documents will follow our intuition.

 Here is the result:
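
The result itself lives in unchanged lines beyond this hunk. Continuing from the sampler above (this assumes the `Pi` and `B` arrays and the `np` import from the post's code are in scope), one rough way to eyeball it is:

```python
# After the last Gibbs iteration, Pi holds the most recent sampled
# document topic distributions and B the topic word distributions.
print(np.round(Pi, 2))  # docs 1-3 should favor one topic, docs 4-6 the other
print(np.round(B, 2))   # each topic should concentrate on its cluster's words
```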
