
Commit c695f6b

More md -> mdx
1 parent ba5c00a commit c695f6b


7 files changed: +163 -144 lines


src/content/post/annealed-importance-sampling.md renamed to src/content/post/annealed-importance-sampling.mdx

Lines changed: 22 additions & 9 deletions
@@ -4,16 +4,21 @@ description: 'An introduction and implementation of Annealed Importance Sampling
 publishDate: 2017-12-23 07:00
 tags: [machine learning, bayesian]
 ---
+import BlogImage from '@/components/BlogImage.astro';

 Suppose we have this distribution:

-$$ p(x) = \frac{1}{Z} f(x) $$
+$$
+p(x) = \frac{1}{Z} f(x)
+$$

 where $ Z = \sum_x f(x) $. In high dimensions, this summation is intractable, as there is an exponential number of terms. We have no hope of computing $ Z $, and in turn we cannot evaluate this distribution.

 Now, how do we compute an expectation w.r.t. $ p(x) $, i.e.:

-$$ \mathbb{E}\_{p(x)}[x] = \sum_x x p(x) $$
+$$
+\mathbb{E}_{p(x)}[x] = \sum_x x p(x)
+$$

 It is impossible for us to do this exactly, as we don't know $ p(x) $. Our best hope is to approximate it. One of the popular ways is importance sampling. However, importance sampling has a hyperparameter that is hard to adjust: the proposal distribution $ q(x) $. Importance sampling works well if we can provide a $ q(x) $ that is a good approximation of $ p(x) $. Finding a good $ q(x) $ is problematic, and this is one of the motivations behind Annealed Importance Sampling (AIS) [1].
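
A minimal NumPy sketch of plain importance sampling for this expectation, with an assumed Gaussian target and proposal (these particular choices are for illustration only and are not taken from the post):

```python
import numpy as np

np.random.seed(0)

# Unnormalized target f(x); we pretend the normalizer Z is unknown
f = lambda x: np.exp(-0.5 * (x - 5)**2)

# Proposal q(x) = N(0, 3^2): easy to sample from and to evaluate
mu_q, sigma_q = 0.0, 3.0
q_pdf = lambda x: np.exp(-0.5 * ((x - mu_q) / sigma_q)**2) / (sigma_q * np.sqrt(2 * np.pi))

# Draw from the proposal and weight each sample by f(x) / q(x)
xs = mu_q + sigma_q * np.random.randn(100_000)
ws = f(xs) / q_pdf(xs)

# Self-normalized importance sampling estimate of E_p[x]
print(np.sum(ws * xs) / np.sum(ws))  # should land near 5

# If q were a poor match for p (e.g. centered far from the target), a handful
# of huge weights would dominate the estimate -- the problem AIS addresses.
```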

@@ -28,8 +33,8 @@ The construction of AIS is as follows:

 Then to sample from $ p_0(x) $, we need to:

-- Sample an independent point from $ x\_{n-1} \sim p_n(x) $.
-- Sample $ x\_{n-2} $ from $ x\_{n-1} $ by doing MCMC w.r.t. $ T\_{n-1} $.
+- Sample an independent point $ x_{n-1} \sim p_n(x) $.
+- Sample $ x_{n-2} $ from $ x_{n-1} $ by doing MCMC w.r.t. $ T_{n-1} $.
 - $ \dots $
 - Sample $ x_1 $ from $ x_2 $ by doing MCMC w.r.t. $ T_2 $.
 - Sample $ x_0 $ from $ x_1 $ by doing MCMC w.r.t. $ T_1 $.
@@ -38,13 +43,17 @@ Intuitively given two distributions, which might be disjoint in their support, w

 At this point, we have a sequence of points $ x_{n-1}, x_{n-2}, \dots, x_1, x_0 $. We can use them to compute the importance weight as follows:

-$$ w = \frac{f\_{n-1}(x\_{n-1})}{f\_n(x\_{n-1})} \frac{f\_{n-2}(x\_{n-2})}{f\_{n-1}(x\_{n-2})} \dots \frac{f_1(x_1)}{f_2(x_1)} \frac{f_0(x_0)}{f_1(x_0)} $$
+$$
+w = \frac{f_{n-1}(x_{n-1})}{f_n(x_{n-1})} \frac{f_{n-2}(x_{n-2})}{f_{n-1}(x_{n-2})} \dots \frac{f_1(x_1)}{f_2(x_1)} \frac{f_0(x_0)}{f_1(x_0)}
+$$

 Notice that $ w $ is telescoping, and without the intermediate distributions, it reduces to the usual weight used in importance sampling.

 With this importance weight, we can then compute the expectation as in importance sampling:

-$$ \mathbb{E}\_{p(x)}[x] = \frac{1}{\sum_i^N w_i} \sum_i^N x_i w_i $$
+$$
+\mathbb{E}_{p(x)}[x] = \frac{1}{\sum_i^N w_i} \sum_i^N x_i w_i
+$$

 where $ N $ is the number of samples.
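
To spell the telescoping claim out: with no intermediate distributions ($ n = 1 $, so $ f_1 = f_n $ is just the unnormalized proposal), the product collapses to

$$
w = \frac{f_0(x_0)}{f_1(x_0)} = \frac{f_0(x_0)}{f_n(x_0)},
$$

which is exactly the standard importance weight of the target against the proposal.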

@@ -54,15 +63,19 @@ We now have the full algorithm. However several things are missing, namely, the

 For the intermediate distributions, we can set them as an annealing between our target and proposal functions, i.e.:

-$$ f_j(x) = f_0(x)^{\beta_j} f_n(x)^{1-\beta_j} $$
+$$
+f_j(x) = f_0(x)^{\beta_j} f_n(x)^{1-\beta_j}
+$$

 where $ 1 = \beta_0 > \beta_1 > \dots > \beta_n = 0 $. As a visual example, annealing from $ N(0, I) $ to $ N(5, I) $ with 10 intermediate distributions gives us:

-![Annealing]({{ site.baseurl }}/img/2017-12-23-annealed-importance-sampling/intermediate_dists.png)
+<BlogImage imagePath="/img/annealed-importance-sampling/intermediate_dists.png" altText="Intermediate distributions." />

 For the transition functions, we can use Metropolis-Hastings with acceptance probability:

-$$ A_j(x, x') = \frac{f_j(x')}{f_j(x)} $$
+$$
+A_j(x, x') = \frac{f_j(x')}{f_j(x)}
+$$

 assuming we have a symmetric proposal, e.g. $ N(0, I) $.
src/content/post/boundary-seeking-gan.md

Lines changed: 0 additions & 73 deletions
This file was deleted. (Its `.mdx` replacement is the new file below.)
Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
+---
+title: 'Boundary Seeking GAN'
+description: 'Training GAN by moving the generated samples to the decision boundary.'
+publishDate: 2017-03-07 00:10
+tags: [machine learning, gan]
+---
+
+import BlogImage from '@/components/BlogImage.astro'
+
+Boundary Seeking GAN (BGAN) is a recently introduced modification of GAN training. In this post, we will look at the intuition behind BGAN, and also at its implementation, which is a one-line change from vanilla GAN.
+
+## Intuition of Boundary Seeking GAN
+
+Recall that in GAN the following objective is optimized:
+
+$$
+\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_\text{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log (1 - D(G(z)))]
+$$
+
+Following the objective above, as shown in the original GAN paper [2], the optimal discriminator $D^*_G(x)$ is given by:
+
+$$
+D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}
+$$
+
+Hence, if we know the optimal discriminator with respect to our generator, $D^*_G(x)$, we are good to go, as rearranging the equation above gives us:
+
+$$
+p_{data}(x) = p_g(x) \frac{D^*_G(x)}{1 - D^*_G(x)}
+$$
+
+What this tells us is that, even if we have a non-optimal generator $G$, we can still recover the true data distribution by weighting $p_g(x)$, the generator's distribution, with a ratio given by the optimal discriminator for that generator.
+
+Unfortunately, the perfect discriminator is hard to get. But we can work with its approximation $D(x)$ instead. The assumption is that as we train $D(x)$ more and more, it becomes closer and closer to $D^*_G(x)$, and our GAN training becomes better and better.
+
+If we look further at the above equation, we see that $p_{data}(x) = p_g(x)$, i.e. our generator is optimal, if the ratio $\frac{D(x)}{1 - D(x)}$ is equal to one. If that ratio is equal to one, then $D(x)$ must be equal to $0.5$. Therefore, the optimal generator is the one that makes the discriminator output $0.5$ everywhere. Notice that $D(x) = 0.5$ is the decision boundary. Hence, we want to generate $x \sim G(z)$ such that $D(x)$ is near the decision boundary. This is why the authors of the paper named this method _Boundary Seeking GAN_ (BGAN).
+
+That statement has a very intuitive explanation. If the generator were perfect, $D(x)$ could not distinguish the real data from the fake data. In other words, real and fake data would be equally likely as far as $D(x)$ is concerned. As $D(x)$ has two possible outputs (real or fake), each of those outputs has probability $0.5$.
+
+Now, we can modify the generator's objective so as to make the discriminator output $0.5$ for every sample we generate. One way to do this is to minimize the distance between $D(x)$ and $1 - D(x)$ for all $x$. If we do so, as $D(x)$ is a probability, we will get the minimum at $D(x) = 1 - D(x) = 0.5$, which is what we want.
+
+Therefore, the new objective for the generator is:
+
+$$
+\min_{G} \, \mathbb{E}_{z \sim p_z(z)} \left[ \frac{1}{2} (\log D(x) - \log(1 - D(x)))^2 \right]
+$$
+
+which is just an $L_2$ loss. We take the $\log$ because $D(x)$ is a probability and we want to undo that, as we are talking about a distance, not a divergence.
+
+## Implementation
+
+This should be the shortest implementation note ever on my blog.
+
+We just need to change the original GAN's $G$ objective from:
+
+```python
+G_loss = -torch.mean(log(D_fake))
+```
+
+to:
+
+```python
+G_loss = 0.5 * torch.mean((log(D_fake) - log(1 - D_fake))**2)
+```
+
+And we're done. For the full code, check out https://github.com/wiseodd/generative-modes.
+
+## Conclusion
+
+In this post we looked at a new GAN variation called Boundary Seeking GAN (BGAN). We looked at the intuition behind BGAN, and tried to understand why it is called "boundary seeking".
+
+We also implemented BGAN in Pytorch with just a one-line code change.
+
+## References
+
+1. Hjelm, R. Devon, et al. "Boundary-Seeking Generative Adversarial Networks." arXiv preprint arXiv:1702.08431 (2017). [arxiv](https://arxiv.org/abs/1702.08431)
+2. Goodfellow, Ian, et al. "Generative adversarial nets." Advances in Neural Information Processing Systems. 2014. [paper](http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf)
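
To see where that one-line change sits, here is a minimal PyTorch sketch of a full training step. The toy networks, stand-in data, batch size, and optimizer settings are assumptions for illustration only; they are not taken from the post above or from the linked repository.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

z_dim, x_dim, h_dim = 16, 2, 64  # toy sizes, assumed for illustration

G = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))
D = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, 1), nn.Sigmoid())

G_opt = torch.optim.Adam(G.parameters(), lr=1e-3)
D_opt = torch.optim.Adam(D.parameters(), lr=1e-3)

eps = 1e-8  # keep the logs away from log(0)

for step in range(1000):
    x_real = torch.randn(64, x_dim) + 5.0  # stand-in "real" data
    z = torch.randn(64, z_dim)

    # Discriminator step: unchanged from vanilla GAN
    D_real = D(x_real)
    D_fake = D(G(z).detach())
    D_loss = -torch.mean(torch.log(D_real + eps) + torch.log(1 - D_fake + eps))

    D_opt.zero_grad()
    D_loss.backward()
    D_opt.step()

    # Generator step: the boundary-seeking loss replaces -log D(G(z))
    D_fake = D(G(z))
    G_loss = 0.5 * torch.mean((torch.log(D_fake + eps) - torch.log(1 - D_fake + eps)) ** 2)

    G_opt.zero_grad()
    G_loss.backward()
    G_opt.step()
```

Only the `G_loss` line differs from a vanilla GAN loop, which is exactly the point of the post.
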
File renamed without changes.
Lines changed: 28 additions & 35 deletions
@@ -5,9 +5,9 @@ publishDate: 2017-09-07 11:56
 tags: [machine learning, bayesian]
 ---

-Latent Dirichlet Allocation (LDA) [1] is a mixed membership model for topic modeling. Given a set of documents in bag of words representation, we want to infer the underlying topics those documents represent. To get a better intuition, we shall look at LDA's generative story. Note, the full code is available at <https://github.com/wiseodd/mixture-models>.
+Latent Dirichlet Allocation (LDA) [1] is a mixed membership model for topic modeling. Given a set of documents in bag of words representation, we want to infer the underlying topics those documents represent. To get a better intuition, we shall look at LDA's generative story. Note, the full code is available at https://github.com/wiseodd/mixture-models.

-Given \\( i = \\{1, \dots, N_D\\} \\) the document index, \\( v = \\{1, \dots, N_W\\} \\) the word index, \\( k = \\{1, \dots, N_K\\} \\) the topic index, LDA assumes:
+Given $i = \\{1, \dots, N_D\\}$ the document index, $v = \\{1, \dots, N_W\\}$ the word index, $k = \\{1, \dots, N_K\\}$ the topic index, LDA assumes:

 $$
 \begin{align}
@@ -20,18 +20,18 @@ y_{iv} &\sim \text{Cat}(y_{iv} \, \vert \, z_{iv} = k, \mathbf{B})
 \end{align}
 $$

-where \\( \alpha \\) and \\( \gamma \\) are the parameters for the Dirichlet priors. They tell us how narrow or spread the document topic and topic word distributions are.
+where $\alpha$ and $\gamma$ are the parameters for the Dirichlet priors. They tell us how narrow or spread out the document topic and topic word distributions are.

 Details of the above generative process in words:

-1. Assume each document generated by selecting the topic first. Thus, sample \\( \mathbf{\pi}\_i \\), the topic distribution for \\( i \\)-th document.
-2. Assume each words in \\( i \\)-th document comes from one of the topics. Therefore, we sample \\( z\_{iv} \\), the topic for each word \\( v \\) in document \\( i \\).
-3. Assume each topic is composed of words, e.g. topic "computer" consits of words "cpu", "gpu", etc. Therefore, we sample \\( \mathbf{b}\_k \\), the distribution those words for particular topic \\( k \\).
-4. Finally, to actually generate the word, given that we already know it comes from topic \\( k \\), we sample the word \\( y\_{iv} \\) given the \\( k \\)-th topic word distribution.
+1. Assume each document is generated by selecting its topics first. Thus, sample $\mathbf{\pi}_i$, the topic distribution for the $i$-th document.
+2. Assume each word in the $i$-th document comes from one of the topics. Therefore, we sample $z_{iv}$, the topic for each word $v$ in document $i$.
+3. Assume each topic is composed of words, e.g. topic "computer" consists of words "cpu", "gpu", etc. Therefore, we sample $\mathbf{b}_k$, the distribution of those words for a particular topic $k$.
+4. Finally, to actually generate the word, given that we already know it comes from topic $k$, we sample the word $y_{iv}$ from the $k$-th topic word distribution.

 ## Inference

-The goal of inference in LDA is that given a corpus, we infer the underlying topics that explain those documents, according to the generative process above. Essentially, given \\( y\_{iv} \\), we are inverting the above process to find \\( z\_{iv} \\), \\( \mathbf{\pi}\_i \\), and \\( \mathbf{b}\_k \\).
+The goal of inference in LDA is, given a corpus, to infer the underlying topics that explain those documents, according to the generative process above. Essentially, given $y_{iv}$, we are inverting the above process to find $z_{iv}$, $\mathbf{\pi}_i$, and $\mathbf{b}_k$.

 We will infer those variables using the Gibbs sampling algorithm. In short, it works by sampling each of those variables given the other variables (its full conditional distribution). Because of conjugacy, the full conditionals are as follows:
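
The full conditionals themselves sit in unchanged lines between this hunk and the next, so they do not appear in the diff. As a rough sketch of what the conjugate updates amount to in this parameterization (Dirichlet priors plus counts, consistent with the resampling code further below), one could write something like the following; the helper names `resample_pi` and `resample_b` are hypothetical, not from the post:

```python
import numpy as np

def resample_pi(Z, alpha, N_K):
    # Pi_i | Z ~ Dirichlet(alpha + topic counts in document i)
    N_D = Z.shape[0]
    Pi = np.zeros([N_D, N_K])
    for i in range(N_D):
        counts = np.bincount(Z[i].astype(int), minlength=N_K)
        Pi[i] = np.random.dirichlet(alpha + counts)
    return Pi

def resample_b(Z, X, gamma, N_K, N_W):
    # B_k | Z, X ~ Dirichlet(gamma + counts of each word assigned to topic k)
    B = np.zeros([N_K, N_W])
    for k in range(N_K):
        counts = np.zeros(N_W)
        for i in range(X.shape[0]):
            for v in range(X.shape[1]):
                if Z[i, v] == k:
                    counts[X[i, v]] += 1
        B[k] = np.random.dirichlet(gamma + counts)
    return B
```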

@@ -51,46 +51,42 @@ Given those full conditionals, the rest is as easy as plugging those into the Gi

 ## Implementation

-We begin with randomly initializing topic assignment matrix \\( \mathbf{Z}\_{N_D \times N_W} \\). We also sample the initial values of \\( \mathbf{\Pi}\_{N_D \times N_K} \\) and \\( \mathbf{B}\_{N_K \times N_W} \\).
+We begin by randomly initializing the topic assignment matrix $\mathbf{Z}_{N_D \times N_W}$. We also sample the initial values of $\mathbf{\Pi}_{N_D \times N_K}$ and $\mathbf{B}_{N_K \times N_W}$.

 ```python
-
 # Dirichlet priors
-
 alpha = 1
 gamma = 1

 # Z := word topic assignment
-
 Z = np.zeros(shape=[N_D, N_W])

 for i in range(N_D):
-for l in range(N_W):
-Z[i, l] = np.random.randint(N_K) # randomly assign word's topic
+    for l in range(N_W):
+        Z[i, l] = np.random.randint(N_K) # randomly assign word's topic

 # Pi := document topic distribution
-
 Pi = np.zeros([N_D, N_K])
-
 for i in range(N_D):
-Pi[i] = np.random.dirichlet(alpha\*np.ones(N_K))
+    Pi[i] = np.random.dirichlet(alpha*np.ones(N_K))

 # B := word topic distribution
-
 B = np.zeros([N_K, N_W])
-
 for k in range(N_K):
-B[k] = np.random.dirichlet(gamma\*np.ones(N_W))
+    B[k] = np.random.dirichlet(gamma*np.ones(N_W))
 ```

 Then we sample the new values for each of those variables from the full conditionals in the previous section, and iterate:

 ```python
-for it in range(1000): # Sample from full conditional of Z # ---------------------------------
-for i in range(N_D):
-for v in range(N_W): # Calculate params for Z
-p_iv = np.exp(np.log(Pi[i]) + np.log(B[:, X[i, v]]))
-p_iv /= np.sum(p_iv)
+for it in range(1000):
+    # Sample from full conditional of Z
+    # ---------------------------------
+    for i in range(N_D):
+        for v in range(N_W):
+            # Calculate params for Z
+            p_iv = np.exp(np.log(Pi[i]) + np.log(B[:, X[i, v]]))
+            p_iv /= np.sum(p_iv)

             # Resample word topic assignment Z
             Z[i, v] = np.random.multinomial(1, p_iv).argmax()
@@ -130,28 +126,25 @@ And basically we are done. We could inspect the result by looking at those varia
 Let's say we have these data:

 ```python
-
 # Words
-
 W = np.array([0, 1, 2, 3, 4])

 # D := document words
-
 X = np.array([
-[0, 0, 1, 2, 2],
-[0, 0, 1, 1, 1],
-[0, 1, 2, 2, 2],
-[4, 4, 4, 4, 4],
-[3, 3, 4, 4, 4],
-[3, 4, 4, 4, 4]
+    [0, 0, 1, 2, 2],
+    [0, 0, 1, 1, 1],
+    [0, 1, 2, 2, 2],
+    [4, 4, 4, 4, 4],
+    [3, 3, 4, 4, 4],
+    [3, 4, 4, 4, 4]
 ])

 N_D = X.shape[0] # num of docs
 N_W = W.shape[0] # num of words
 N_K = 2 # num of topics
 ```

-Those data are already in bag of words representation, so it is a little abstract at a glance. However if we look at it, we could see two big clusters of documents based on their words: \\( \\{ 1, 2, 3 \\} \\) and \\( \\{ 4, 5, 6 \\} \\). Therefore, we expect after our sampler converges to the posterior, the topic distribution for those documents will follow our intuition.
+These data are already in bag of words representation, so they are a little abstract at a glance. However, if we look closely, we can see two big clusters of documents based on their words: $\\{ 1, 2, 3 \\}$ and $\\{ 4, 5, 6 \\}$. Therefore, we expect that after our sampler converges to the posterior, the topic distributions for those documents will follow our intuition.

 Here is the result:
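
The result itself lives in unchanged lines beyond this hunk. Continuing from the sampler above (this assumes the `Pi` and `B` arrays and the `np` import from the post's code are in scope), one rough way to eyeball it is:

```python
# After the last Gibbs iteration, Pi holds the most recent sampled
# document topic distributions and B the topic word distributions.
print(np.round(Pi, 2))  # docs 1-3 should favor one topic, docs 4-6 the other
print(np.round(B, 2))   # each topic should concentrate on its cluster's words
```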
