where $ Z = \sum_x f(x) $. In high dimensions, this summation is intractable, as the number of terms grows exponentially. We therefore have no hope of computing $ Z $, and in turn we can't evaluate this distribution.
Now, how do we compute an expectation w.r.t. $p(x)$, i.e.:
$$
\mathbb{E}_{p(x)}[x] = \sum_x x p(x)
$$
We can't do this directly, as we don't know $ p(x) $. Our best hope is to approximate it. One popular way is importance sampling. However, importance sampling involves a choice that is hard to make well: the proposal distribution $ q(x) $. Importance sampling works well only if we can provide a $ q(x) $ that is a good approximation of $ p(x) $. Finding such a $ q(x) $ is problematic, and this is one of the motivations behind Annealed Importance Sampling (AIS) [1].
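For concreteness, here is a minimal self-normalized importance sampling sketch in numpy. The toy target and the Gaussian proposal are illustrative choices of mine, not anything prescribed by this post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target f(x), so p(x) = f(x) / Z -- a narrow Gaussian bump as a toy example
log_f = lambda x: -0.5 * ((x - 3.0) / 0.5) ** 2

# Proposal q(x) = N(0, 1): easy to sample from and evaluate, but a poor match for p(x)
xs = rng.standard_normal(10_000)
log_q = -0.5 * xs**2

# Self-normalized importance weights: w_i proportional to f(x_i) / q(x_i)
log_w = log_f(xs) - log_q
w = np.exp(log_w - log_w.max())

print((w * xs).sum() / w.sum())  # estimate of E_p[x]; noisy, since few samples land where p has mass
```

With a proposal this far from the target, a handful of samples carry almost all of the weight, which is exactly the failure mode AIS is designed to mitigate.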
Then to sample from $ p_0(x) $, we need to:
- Sample an independent point $ x_{n-1} \sim p_n(x) $.
- Sample $ x_{n-2} $ from $ x_{n-1} $ by doing MCMC w.r.t. $ T_{n-1} $.
- $ \dots $
- Sample $ x_1 $ from $ x_2 $ by doing MCMC w.r.t. $ T_2 $.
- Sample $ x_0 $ from $ x_1 $ by doing MCMC w.r.t. $ T_1 $.
At this point, we have a sequence of points $ x_{n-1}, x_{n-2}, \dots, x_1, x_0 $. We can use them to compute the importance weight as follows:
$$
w = \frac{f_{n-1}(x_{n-1})}{f_n(x_{n-1})} \frac{f_{n-2}(x_{n-2})}{f_{n-1}(x_{n-2})} \dots \frac{f_1(x_1)}{f_2(x_1)} \frac{f_0(x_0)}{f_1(x_0)}
$$
Notice that $ w $ is telescoping, and without the intermediate distributions, it reduces to the usual weight used in importance sampling.
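To make this concrete, below is a minimal numpy sketch of the whole procedure on a one-dimensional toy problem. It assumes geometric intermediate distributions $ f_j(x) = f_0(x)^{\beta_j} f_n(x)^{1 - \beta_j} $ and a random-walk Metropolis kernel for each $ T_j $; these are common choices for AIS, not the only ones.

```python
import numpy as np

def ais_run(log_f0, log_fn, sample_fn, betas, n_mcmc=5, step=0.5, rng=None):
    """One AIS run: returns a sample x_0 and its log importance weight log(w).

    log_f0    -- log of the unnormalized target f_0
    log_fn    -- log of the unnormalized base distribution f_n (easy to sample from)
    sample_fn -- draws one exact sample from p_n
    betas     -- annealing schedule from 0 (base) to 1 (target); the intermediate
                 distributions are f_beta(x) = f_0(x)^beta * f_n(x)^(1 - beta)
    """
    rng = np.random.default_rng() if rng is None else rng
    log_f = lambda x, beta: beta * log_f0(x) + (1.0 - beta) * log_fn(x)

    x = sample_fn(rng)  # x_{n-1} ~ p_n(x)
    log_w = 0.0
    for beta_prev, beta in zip(betas[:-1], betas[1:]):
        # Telescoping factor f_j(x_j) / f_{j+1}(x_j), evaluated at the current point
        log_w += log_f(x, beta) - log_f(x, beta_prev)
        if beta == 1.0:
            break  # x is now x_0; no further transition is needed
        # T_j: a few random-walk Metropolis steps that leave f_j invariant
        for _ in range(n_mcmc):
            prop = x + step * rng.standard_normal()
            if np.log(rng.uniform()) < log_f(prop, beta) - log_f(x, beta):
                x = prop
    return x, log_w

# Toy example: target f_0 is an unnormalized N(3, 0.5^2), base f_n is N(0, 1)
log_f0 = lambda x: -0.5 * ((x - 3.0) / 0.5) ** 2
log_fn = lambda x: -0.5 * x**2
betas = np.linspace(0.0, 1.0, 50)

runs = [ais_run(log_f0, log_fn, lambda r: r.standard_normal(), betas) for _ in range(500)]
xs, log_ws = map(np.array, zip(*runs))
ws = np.exp(log_ws - log_ws.max())   # shift for numerical stability
print((ws * xs).sum() / ws.sum())    # self-normalized estimate of E_p[x] (true value: 3)
```

Each factor in the product compares two neighbouring distributions that overlap well, which is what keeps the combined weight much better behaved than a single $ f_0 / f_n $ ratio would be.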
With this importance weight, we can then compute the expectation as in importance sampling:
description: 'Training GAN by moving the generated samples to the decision boundary.'
publishDate: 2017-03-07 00:10
tags: [machine learning, gan]
---

import BlogImage from '@/components/BlogImage.astro'
Boundary Seeking GAN (BGAN) is a recently introduced modification of GAN training. In this post, we will look at the intuition behind BGAN and at its implementation, which requires only a one-line change compared to vanilla GAN.
## Intuition of Boundary Seeking GAN
Recall that in GAN, the following objective is optimized:
Hence, if we know the optimal discriminator with respect to our generator, $D^*_G(x)$, we are good to go, as rearranging the above equation gives us:

$$
p_{data}(x) = p_g(x) \frac{D^*_G(x)}{1 - D^*_G(x)}
$$
What this tells us is that even if we have a non-optimal generator $G$, we can still recover the true data distribution by weighting $p_g(x)$, the generator's distribution, with the ratio given by the optimal discriminator for that generator.
Unfortunately, a perfect discriminator is hard to get, but we can work with its approximation $D(x)$ instead. The assumption is that the more we train $D(x)$, the closer it gets to $D^*_G(x)$, and the better our GAN training becomes.
If we look further at the above equation, we see that we get $p_{data}(x) = p_g(x)$, i.e. an optimal generator, when the discriminator ratio is equal to one. If that ratio is equal to one, then $D(x)$ must be equal to $0.5$. Therefore, the optimal generator is the one that makes the discriminator output $0.5$ everywhere. Notice that $D(x) = 0.5$ is the decision boundary. Hence, we want to generate $x \sim G(z)$ such that $D(x)$ is near the decision boundary, which is why the authors of the paper named this method _Boundary Seeking GAN_ (BGAN).
That statement has a very intuitive explanation: if the generator were perfect, $D(x)$ couldn't distinguish the real data from the fake data. In other words, real and fake data would be equally likely as far as $D(x)$ is concerned. Since $D(x)$ has two possible outcomes (real or fake), each of them then has probability $0.5$.
Now, we can modify the generator's objective so that the discriminator outputs $0.5$ for every sample we generate. One way to do this is to minimize the distance between $D(x)$ and $1 - D(x)$ for all $x$. As $D(x)$ is a probability, the minimum is attained at $D(x) = 1 - D(x) = 0.5$, which is exactly what we want.
Therefore, the new objective for the generator is:

$$
\min_G \, \frac{1}{2} \mathbb{E}_{z \sim p(z)} \left[ \left( \log D(G(z)) - \log (1 - D(G(z))) \right)^2 \right]
$$
which is just an $L_2$ loss. We add the $\log$ because $D(x)$ is a probability, and we want to undo that, since we are talking about a distance, not a divergence.
## Implementation
This should be the shortest implementation note ever on my blog.
We just need to change the original GAN's $G$ objective to the boundary seeking version above.
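A minimal sketch of that change in PyTorch. The variable names and the non-saturating form of the original loss are my assumptions for illustration, not necessarily the exact code in the repository:

```python
import torch

# D_fake stands in for D(G(z)): the discriminator's sigmoid outputs on a batch of
# generated samples. A dummy tensor is used here so the snippet runs on its own.
D_fake = torch.rand(64, 1)
eps = 1e-8  # avoid log(0)

# Original (non-saturating) GAN generator loss
G_loss_gan = -torch.mean(torch.log(D_fake + eps))

# Boundary seeking generator loss: push D(G(z)) towards the 0.5 decision boundary
G_loss_bgan = 0.5 * torch.mean((torch.log(D_fake + eps) - torch.log(1 - D_fake + eps)) ** 2)
```

Everything else in the training loop stays the same.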
And we're done. For the full code, check out https://github.com/wiseodd/generative-models.
## Conclusion
In this post we looked at a new GAN variation called Boundary Seeking GAN (BGAN), examined the intuition behind it, and saw why it is called "boundary seeking".
We also implemented BGAN in PyTorch with just a one-line code change.
## References
1. Hjelm, R. Devon, et al. "Boundary-Seeking Generative Adversarial Networks." arXiv preprint arXiv:1702.08431 (2017). [arxiv](https://arxiv.org/abs/1702.08431)
2. Goodfellow, Ian, et al. "Generative adversarial nets." Advances in Neural Information Processing Systems. 2014. [arxiv](http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf)
src/content/post/lda-gibbs.mdx
publishDate: 2017-09-07 11:56
tags: [machine learning, bayesian]
---
Latent Dirichlet Allocation (LDA) [1] is a mixed membership model for topic modeling. Given a set of documents in bag-of-words representation, we want to infer the underlying topics those documents represent. To get a better intuition, we shall look at LDA's generative story. Note: the full code is available at https://github.com/wiseodd/mixture-models.
Given $i \in \\{1, \dots, N_D\\}$ the document index, $v \in \\{1, \dots, N_W\\}$ the word index, and $k \in \\{1, \dots, N_K\\}$ the topic index, LDA assumes:
where $\alpha$ and $\gamma$ are the parameters of the Dirichlet priors. They tell us how narrow or spread out the document-topic and topic-word distributions are.
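As a quick illustration (the parameter values here are arbitrary), small Dirichlet parameters produce narrow, peaked distributions, while large ones produce spread-out, near-uniform distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small concentration: most of the probability mass lands on one or two topics
print(rng.dirichlet([0.1] * 5))

# Large concentration: the mass is spread almost uniformly over the topics
print(rng.dirichlet([10.0] * 5))
```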
Details of the above generative process, in words:
1. Assume each document is generated by first selecting its topics. Thus, we sample $\mathbf{\pi}_i$, the topic distribution for the $i$-th document.
2. Assume each word in the $i$-th document comes from one of those topics. Therefore, we sample $z_{iv}$, the topic assignment for each word $v$ in document $i$.
3. Assume each topic is composed of words, e.g. the topic "computer" consists of words like "cpu", "gpu", etc. Therefore, we sample $\mathbf{b}_k$, the word distribution for a particular topic $k$.
4. Finally, to actually generate a word, given that we already know it comes from topic $k$, we sample the word $y_{iv}$ from the $k$-th topic's word distribution.
## Inference
The goal of inference in LDA is, given a corpus, to infer the underlying topics that explain those documents according to the generative process above. Essentially, given $y_{iv}$, we are inverting the above process to find $z_{iv}$, $\mathbf{\pi}_i$, and $\mathbf{b}_k$.
We will infer those variables using the Gibbs sampling algorithm. In short, it works by sampling each of those variables in turn, conditioned on the current values of all the others (its full conditional distribution). Because of conjugacy, the full conditionals are as follows:
Given those full conditionals, the rest is as easy as plugging them into the Gibbs sampling framework.
## Implementation
We begin by randomly initializing the topic assignment matrix $\mathbf{Z}_{N_D \times N_W}$. We also sample the initial values of $\mathbf{\Pi}_{N_D \times N_K}$ and $\mathbf{B}_{N_K \times N_W}$.
Let's say we have these data:
```python
# Words
W = np.array([0, 1, 2, 3, 4])

# D := document words
X = np.array([
    [0, 0, 1, 2, 2],
    [0, 0, 1, 1, 1],
    [0, 1, 2, 2, 2],
    [4, 4, 4, 4, 4],
    [3, 3, 4, 4, 4],
    [3, 4, 4, 4, 4]
])

N_D = X.shape[0]  # num of docs
N_W = W.shape[0]  # num of words
N_K = 2  # num of topics
```
Those data are already in bag-of-words representation, so they are a little abstract at a glance. However, if we look closely, we can see two big clusters of documents based on their words: $\\{1, 2, 3\\}$ and $\\{4, 5, 6\\}$. Therefore, we expect that after our sampler converges to the posterior, the topic distributions for those documents will follow this intuition.
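To tie everything together, here is a minimal Gibbs sampler sketch for this toy data, continuing from the arrays defined above. It assumes the standard uncollapsed LDA full conditionals (categorical for $z_{iv}$, Dirichlet for $\mathbf{\pi}_i$ and $\mathbf{b}_k$) and symmetric priors $\alpha = \gamma = 1$; it is an illustration, not the exact code from the repository.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, gamma = 1.0, 1.0   # symmetric Dirichlet priors (illustrative values)
N_V = X.shape[1]          # number of word positions per document

# Initialization: random topic assignments; Pi and B drawn from their priors
Z = rng.integers(N_K, size=(N_D, N_V))
Pi = rng.dirichlet(alpha * np.ones(N_K), size=N_D)   # N_D x N_K
B = rng.dirichlet(gamma * np.ones(N_W), size=N_K)    # N_K x N_W

for _ in range(1000):
    # z_iv | rest  ~  Categorical with p(k) proportional to pi_ik * b_{k, y_iv}
    for i in range(N_D):
        for v in range(N_V):
            p = Pi[i] * B[:, X[i, v]]
            Z[i, v] = rng.choice(N_K, p=p / p.sum())

    # pi_i | rest  ~  Dirichlet(alpha + topic counts in document i)
    for i in range(N_D):
        Pi[i] = rng.dirichlet(alpha + np.bincount(Z[i], minlength=N_K))

    # b_k | rest  ~  Dirichlet(gamma + counts of each word assigned to topic k)
    for k in range(N_K):
        counts = np.array([np.sum(X[Z == k] == w) for w in W])
        B[k] = rng.dirichlet(gamma + counts)

print(np.round(Pi, 2))   # documents 1-3 and 4-6 should load on different topics
```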