Commit a95799b

Another md -> mdx
1 parent 5709e85 commit a95799b

File tree: 6 files changed (+151, -122 lines)

.astro/types.d.ts

Lines changed: 15 additions & 15 deletions
@@ -157,13 +157,13 @@ declare module 'astro:content' {
       collection: "post";
       data: InferEntrySchema<"post">
     } & { render(): Render[".md"] };
-    "bayesian-regression.md": {
-      id: "bayesian-regression.md";
+    "bayesian-regression.mdx": {
+      id: "bayesian-regression.mdx";
       slug: "bayesian-regression";
       body: string;
       collection: "post";
       data: InferEntrySchema<"post">
-    } & { render(): Render[".md"] };
+    } & { render(): Render[".mdx"] };
     "boundary-seeking-gan.mdx": {
       id: "boundary-seeking-gan.mdx";
       slug: "boundary-seeking-gan";
@@ -269,13 +269,13 @@ declare module 'astro:content' {
       collection: "post";
       data: InferEntrySchema<"post">
     } & { render(): Render[".md"] };
-    "gan-pytorch.md": {
-      id: "gan-pytorch.md";
+    "gan-pytorch.mdx": {
+      id: "gan-pytorch.mdx";
       slug: "gan-pytorch";
       body: string;
       collection: "post";
       data: InferEntrySchema<"post">
-    } & { render(): Render[".md"] };
+    } & { render(): Render[".mdx"] };
     "gan-tensorflow.md": {
       id: "gan-tensorflow.md";
       slug: "gan-tensorflow";
@@ -332,13 +332,13 @@ declare module 'astro:content' {
       collection: "post";
       data: InferEntrySchema<"post">
     } & { render(): Render[".md"] };
-    "kl-mle.md": {
-      id: "kl-mle.md";
+    "kl-mle.mdx": {
+      id: "kl-mle.mdx";
       slug: "kl-mle";
       body: string;
       collection: "post";
       data: InferEntrySchema<"post">
-    } & { render(): Render[".md"] };
+    } & { render(): Render[".mdx"] };
     "laplace.mdx": {
       id: "laplace.mdx";
       slug: "laplace";
@@ -486,27 +486,27 @@ declare module 'astro:content' {
       collection: "post";
       data: InferEntrySchema<"post">
     } & { render(): Render[".md"] };
-    "theano-pde.md": {
-      id: "theano-pde.md";
+    "theano-pde.mdx": {
+      id: "theano-pde.mdx";
       slug: "theano-pde";
       body: string;
       collection: "post";
       data: InferEntrySchema<"post">
-    } & { render(): Render[".md"] };
+    } & { render(): Render[".mdx"] };
     "twitter-auth-flask.md": {
       id: "twitter-auth-flask.md";
       slug: "twitter-auth-flask";
       body: string;
       collection: "post";
       data: InferEntrySchema<"post">
     } & { render(): Render[".md"] };
-    "vae-pytorch.md": {
-      id: "vae-pytorch.md";
+    "vae-pytorch.mdx": {
+      id: "vae-pytorch.mdx";
       slug: "vae-pytorch";
       body: string;
       collection: "post";
       data: InferEntrySchema<"post">
-    } & { render(): Render[".md"] };
+    } & { render(): Render[".mdx"] };
     "variational-autoencoder.md": {
       id: "variational-autoencoder.md";
       slug: "variational-autoencoder";
Lines changed: 48 additions & 26 deletions
@@ -11,25 +11,31 @@ Linear Regression could be intuitively interpreted in several points of view, e.
 
 ## Linear Regression: Refreshments
 
-Recall, in Linear Regression, we want to map our inputs into real numbers, i.e. \\( f: \mathbb{R}^N \to \mathbb{R} \\). For example, given some features, e.g. how many hours of studying, the number of subjects taken, and the IQ of a student, we want to predict his or her GPA.
+Recall, in Linear Regression, we want to map our inputs into real numbers, i.e. $f: \mathbb{R}^N \to \mathbb{R}$. For example, given some features, e.g. how many hours of studying, the number of subjects taken, and the IQ of a student, we want to predict his or her GPA.
 
-There are several types of Linear Regression, depending on the cost function and the regularizer. In this post, we will focus on Linear Regression with \\( \ell_2 \\) cost and \\( \ell_2 \\) regularization. In statistics, this kind of regression is called Ridge Regression.
+There are several types of Linear Regression, depending on the cost function and the regularizer. In this post, we will focus on Linear Regression with $\ell_2$ cost and $\ell_2$ regularization. In statistics, this kind of regression is called Ridge Regression.
 
 Formally, the objective is as follows:
 
-$$ L = \frac{1}{2} \Vert \hat{y} - y \Vert^2_2 + \frac{\lambda}{2} \Vert W \Vert^2_2 $$
+$$
+L = \frac{1}{2} \Vert \hat{y} - y \Vert^2_2 + \frac{\lambda}{2} \Vert W \Vert^2_2
+$$
 
-where \\( \hat{y} \\) is the ground truth value, and \\( y \\) is given by:
+where $\hat{y}$ is the ground truth value, and $y$ is given by:
 
-$$ y = W^Tx $$
+$$
+y = W^Tx
+$$
 
-which is a linear combination of the feature vector and the weight matrix. The additional \\( \frac{1}{2} \\) in both terms is just for mathematical convenience when taking the derivative.
+which is a linear combination of the feature vector and the weight matrix. The additional $\frac{1}{2}$ in both terms is just for mathematical convenience when taking the derivative.
 
-The idea is then to minimize this objective function with respect to \\( W \\). That is, we want to find the weight matrix \\( W \\) that minimizes the squared error.
+The idea is then to minimize this objective function with respect to $W$. That is, we want to find the weight matrix $W$ that minimizes the squared error.
 
 Of course we could ignore the regularization term. What we end up with then is vanilla Linear Regression:
 
-$$ L = \frac{1}{2} \Vert \hat{y} - y \Vert^2_2 $$
+$$
+L = \frac{1}{2} \Vert \hat{y} - y \Vert^2_2
+$$
 
 Minimizing this objective is the definition of the Linear Least Squares problem.
 
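As a quick aside, the Ridge objective above translates directly into code. Here is a minimal NumPy sketch (illustrative only; the toy data and names are assumptions, following the post's notation where `y_hat` is the ground truth and `y = W^T x` the prediction):

```python
import numpy as np

def ridge_loss(W, X, y_hat, lam):
    """L = 1/2 ||y_hat - y||_2^2 + (lam / 2) ||W||_2^2, with y = W^T x per row of X."""
    y = X @ W  # predictions: one linear combination of features per data point
    return 0.5 * np.sum((y_hat - y) ** 2) + 0.5 * lam * np.sum(W ** 2)

# Toy example: 4 students, 3 features (hours studied, subjects taken, IQ) -> GPA
X = np.array([[10.0, 4, 110], [2, 3, 100], [8, 5, 120], [12, 6, 130]])
y_hat = np.array([3.6, 2.4, 3.8, 3.9])

print(ridge_loss(np.zeros(3), X, y_hat, lam=0.1))  # loss at W = 0
```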
@@ -48,23 +54,33 @@
 
 or equivalently, we could say that the error is:
 
-$$ \epsilon = \hat{y} - y $$
+$$
+\epsilon = \hat{y} - y
+$$
 
-Now, let's say we model the regression target as a Gaussian random variable, i.e. \\( y \sim N(\mu, \sigma^2) \\), with \\( \mu = y = W^Tx \\), the prediction of our model. Formally:
+Now, let's say we model the regression target as a Gaussian random variable, i.e. $y \sim N(\mu, \sigma^2)$, with $\mu = y = W^Tx$, the prediction of our model. Formally:
 
-$$ P(\hat{y} \vert x, W) = N(\hat{y} \vert W^Tx, \sigma^2) $$
+$$
+P(\hat{y} \vert x, W) = N(\hat{y} \vert W^Tx, \sigma^2)
+$$
 
-Then, to find the optimum \\( W \\), we could use Maximum Likelihood Estimation (MLE). As the above model is a likelihood, i.e. it describes our data \\( y \\) under parameter \\( W \\), we will do MLE on it:
+Then, to find the optimum $W$, we could use Maximum Likelihood Estimation (MLE). As the above model is a likelihood, i.e. it describes our data $y$ under parameter $W$, we will do MLE on it:
 
-$$ W_{MLE} = \mathop{\rm arg\,max}\limits_{W} N(\hat{y} \vert W^Tx, \sigma^2) $$
+$$
+W_{MLE} = \mathop{\rm arg\,max}\limits_{W} N(\hat{y} \vert W^Tx, \sigma^2)
+$$
 
 The PDF of a Gaussian is given by:
 
-$$ P(\hat{y} \vert x, W) = \frac{1}{\sqrt{2 \sigma^2 \pi}} \, \exp \left( -\frac{(\hat{y} - W^Tx)^2}{2 \sigma^2} \right) $$
+$$
+P(\hat{y} \vert x, W) = \frac{1}{\sqrt{2 \sigma^2 \pi}} \, \exp \left( -\frac{(\hat{y} - W^Tx)^2}{2 \sigma^2} \right)
+$$
 
 As we are doing maximization, we could ignore the normalizing constant of the likelihood. Hence:
 
-$$ W_{MLE} = \mathop{\rm arg\,max}\limits_{W} \, \exp \left( -\frac{(\hat{y} - W^Tx)^2}{2 \sigma^2} \right) $$
+$$
+W_{MLE} = \mathop{\rm arg\,max}\limits_{W} \, \exp \left( -\frac{(\hat{y} - W^Tx)^2}{2 \sigma^2} \right)
+$$
 
 As always, it is easier to optimize the log-likelihood:
 
@@ -77,7 +93,7 @@ W_{MLE} &= \mathop{\rm arg\,max}\limits_{W} \, \log \left( \exp \left( -\frac{(\
 \end{align}
 $$
 
-For simplicity, let's say \\( \sigma^2 = 1 \\), then:
+For simplicity, let's say $\sigma^2 = 1$, then:
 
 $$
 \begin{align}
@@ -95,13 +111,17 @@ So we see, doing MLE on a Gaussian likelihood is equivalent to Linear Regression!
 
 But what if we want to go Bayesian, i.e. introduce a prior and work with the posterior instead? Well, then we are doing MAP estimation! The posterior is proportional to likelihood times prior:
 
-$$ P(W \vert \hat{y}, x) \propto P(\hat{y} \vert x, W) P(W \vert \mu_0, \sigma^2_0) $$
+$$
+P(W \vert \hat{y}, x) \propto P(\hat{y} \vert x, W) P(W \vert \mu_0, \sigma^2_0)
+$$
 
-Since we already know the likelihood, we now ask: what should the prior be? If we set it to be uniformly distributed, we will be back to MLE estimation, full detail [here]({% post_url 2017-01-01-mle-vs-map %}). So, for a non-trivial example, let's use a Gaussian prior for the weight \\( W \\):
+Since we already know the likelihood, we now ask: what should the prior be? If we set it to be uniformly distributed, we will be back to MLE estimation. So, for a non-trivial example, let's use a Gaussian prior for the weight $W$:
 
-$$ P(W \vert \mu_0, \sigma^2_0) = N(0, \sigma^2_0) $$
+$$
+P(W \vert \mu_0, \sigma^2_0) = N(0, \sigma^2_0)
+$$
 
-Expanding the PDF, again ignoring the normalizing constant and keeping in mind that \\( \mu_0 = 0 \\), we have:
+Expanding the PDF, again ignoring the normalizing constant and keeping in mind that $\mu_0 = 0$, we have:
 
 $$
 \begin{align}
@@ -134,11 +154,13 @@
 \end{align}
 $$
 
-Seems familiar, right? Now if we assume that \\( \sigma^2 = 1 \\) and \\( \lambda = \frac{1}{\sigma^2_0} \\), then our log posterior becomes:
+Seems familiar, right? Now if we assume that $\sigma^2 = 1$ and $\lambda = \frac{1}{\sigma^2_0}$, then our log posterior becomes:
 
-$$ \log P(W \vert \hat{y}, x) \propto -\frac{1}{2} \Vert \hat{y} - W^Tx \Vert^2_2 - \frac{\lambda}{2} \Vert W \Vert^2_2 $$
+$$
+\log P(W \vert \hat{y}, x) \propto -\frac{1}{2} \Vert \hat{y} - W^Tx \Vert^2_2 - \frac{\lambda}{2} \Vert W \Vert^2_2
+$$
 
-That is, the log posterior of a Gaussian likelihood with a Gaussian prior is the same as the objective function for Ridge Regression! Hence, a Gaussian prior is equivalent to \\( \ell_2 \\) regularization!
+That is, the log posterior of a Gaussian likelihood with a Gaussian prior is the same as the objective function for Ridge Regression! Hence, a Gaussian prior is equivalent to $\ell_2$ regularization!
 
 ## Full Bayesian Approach
 
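To make the equivalence just derived concrete, here is a minimal NumPy sketch (illustrative; toy data and names are assumptions, with $\sigma^2 = 1$ and $\lambda = 1/\sigma^2_0$ as above). The closed-form Ridge solution and a gradient-based MAP estimate land on the same $W$:

```python
import numpy as np

np.random.seed(0)

# Toy data: N points, D features, generated by a true weight vector plus noise
N, D = 100, 3
X = np.random.randn(N, D)
y_hat = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * np.random.randn(N)

lam = 1.0  # prior precision: lambda = 1 / sigma_0^2

# Closed-form Ridge solution: W = (X^T X + lambda I)^{-1} X^T y_hat
W_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y_hat)

# MAP: minimize the negative log posterior (the Ridge objective) by gradient descent
W_map = np.zeros(D)
for _ in range(5000):
    grad = -X.T @ (y_hat - X @ W_map) + lam * W_map  # gradient of the Ridge objective
    W_map -= 1e-3 * grad

print(np.allclose(W_ridge, W_map, atol=1e-4))  # True: same minimizer
```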
@@ -153,11 +175,11 @@ P(y' \vert \hat{y}, x) &= \int_W P(y' \vert x', W) P(W \vert \hat{y}, x) \\[10pt
 \end{align}
 $$
 
-that is, given a new data point \\( (x', y') \\), we compute its likelihood and weight it by the posterior.
+that is, given a new data point $(x', y')$, we compute its likelihood and weight it by the posterior.
 
-Intuitively, given all possible values of \\( W \\) under the posterior, we try those values one by one to predict the new data. The results are then averaged proportionally to the probability of those values; hence we are taking an expectation.
+Intuitively, given all possible values of $W$ under the posterior, we try those values one by one to predict the new data. The results are then averaged proportionally to the probability of those values; hence we are taking an expectation.
 
-And of course, that is the reason why we use a shortcut in the form of MAP. For illustration, if each component of \\( W \\) is binary, i.e. has two possible values, and there are \\( K \\) components in \\( W \\), we are talking about \\( 2^K \\) possible assignments for \\( W \\), which is exponential! In the real world, each component of \\( W \\) is a real number, which makes enumerating all possible values of \\( W \\) intractable!
+And of course, that is the reason why we use a shortcut in the form of MAP. For illustration, if each component of $W$ is binary, i.e. has two possible values, and there are $K$ components in $W$, we are talking about $2^K$ possible assignments for $W$, which is exponential! In the real world, each component of $W$ is a real number, which makes enumerating all possible values of $W$ intractable!
 
 Of course we could use approximate methods like Variational Bayes or MCMC, but they are still more costly than MAP. As MAP and MLE are guaranteed to find one of the modes (local maxima), they are good enough.
 
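For intuition about this expectation, here is a minimal sketch (illustrative; it relies on the closed-form Gaussian posterior of this particular model with $\sigma^2 = 1$, and the toy data is assumed): sample many $W$'s from the posterior, predict with each, and average.

```python
import numpy as np

np.random.seed(0)

# Toy data; the posterior over W is N(m, A^{-1}) with A = X^T X + lam I (sigma^2 = 1)
N, D, lam = 50, 3, 1.0
X = np.random.randn(N, D)
y_hat = X @ np.array([1.5, -2.0, 0.5]) + np.random.randn(N)

A = X.T @ X + lam * np.eye(D)
m = np.linalg.solve(A, X.T @ y_hat)  # posterior mean (= MAP estimate here)

# Monte Carlo posterior predictive for a new input x': try many W's drawn from
# the posterior (sampling does the probability-weighting for us), then average
x_new = np.random.randn(D)
W_samples = np.random.multivariate_normal(m, np.linalg.inv(A), size=10000)

print(np.mean(W_samples @ x_new))  # approximately the exact predictive mean
print(m @ x_new)                   # the MAP "shortcut" prediction
```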
Lines changed: 26 additions & 26 deletions
@@ -11,7 +11,7 @@ Those two libraries are different from the existing libraries like TensorFlow an
 
 Enter Pytorch, a Python port of Torch. The programming style of Pytorch is imperative, meaning that if we're already familiar with using Numpy to code our algorithms up, then jumping to Pytorch should be a breeze. One does not need to learn symbolic mathematical computation, as in TensorFlow and Theano.
 
-With that being said, let's try Pytorch by implementing Generative Adversarial Networks (GAN). As a reference point, here is the [TensorFlow version]({% post_url 2016-09-17-gan-tensorflow %}).
+With that being said, let's try Pytorch by implementing Generative Adversarial Networks (GAN).
 
 Let's start by importing stuff:
 
@@ -32,13 +32,13 @@ h_dim = 128
 lr = 1e-3
 ```
 
-Now let's construct our Generative Network \\( G(z) \\):
+Now let's construct our Generative Network $G(z)$:
 
 ```python
 def xavier_init(size):
-    in_dim = size[0]
-    xavier_stddev = 1. / np.sqrt(in_dim / 2.)
-    return Variable(torch.randn(*size) * xavier_stddev, requires_grad=True)
+    in_dim = size[0]
+    xavier_stddev = 1. / np.sqrt(in_dim / 2.)
+    return Variable(torch.randn(*size) * xavier_stddev, requires_grad=True)
 
 Wzh = xavier_init(size=[Z_dim, h_dim])
 bzh = Variable(torch.zeros(h_dim), requires_grad=True)
@@ -47,14 +47,14 @@ Whx = xavier_init(size=[h_dim, X_dim])
 bhx = Variable(torch.zeros(X_dim), requires_grad=True)
 
 def G(z):
-    h = nn.relu(z @ Wzh + bzh.repeat(z.size(0), 1))
-    X = nn.sigmoid(h @ Whx + bhx.repeat(h.size(0), 1))
-    return X
+    h = nn.relu(z @ Wzh + bzh.repeat(z.size(0), 1))
+    X = nn.sigmoid(h @ Whx + bhx.repeat(h.size(0), 1))
+    return X
 ```
 
 It is awfully similar to the TensorFlow version; what is the difference, then? It is subtle without more hints, but basically those variables `Wzh, bzh, Whx, bhx` are real tensors/ndarrays, just like in Numpy. That means that if we evaluate `print(Wzh)`, the value is immediately shown. Also, the function `G(z)` is a real function, in the sense that if we input a tensor, we will immediately get the return value back. Try doing those things in TensorFlow or Theano.
 
-Next is the Discriminator Network \\( D(X) \\):
+Next is the Discriminator Network $D(X)$:
 
 ```python
 Wxh = xavier_init(size=[X_dim, h_dim])
@@ -98,7 +98,7 @@ X = Variable(torch.from_numpy(X))
 
 ```
 
-First, let's define \\( D(X) \\)'s "forward-loss-backward-update" step, starting with the forward pass:
+First, let's define $D(X)$'s "forward-loss-backward-update" step, starting with the forward pass:
 
 ```python # D(X) forward and loss
 G_sample = G(z)
@@ -125,29 +125,29 @@ Of course we could code up our own optimizer. But Pytorch has built-in optimizer
 As we have two different optimizers, we need to clear the computed gradients in our computational graph once we no longer need them. This is also necessary so that the gradients won't mix with those from the subsequent call of `backward()`, as `D_solver` shares some subgraphs with `G_solver`.
 
 ```python
-def reset_grad():
-    for p in params:
-        p.grad.data.zero_()
+def reset_grad():
+    for p in params:
+        p.grad.data.zero_()
 ```
 
-We do similar things to implement the "forward-loss-backward-update" for \\( G(z) \\):
+We do similar things to implement the "forward-loss-backward-update" for $G(z)$:
 
-```python # Housekeeping - reset gradient
+```python
+# Housekeeping - reset gradient
 reset_grad()
 
-# Generator forward-loss-backward-update
-z = Variable(torch.randn(mb_size, Z_dim))
-G_sample = G(z)
-D_fake = D(G_sample)
-
-G_loss = nn.binary_cross_entropy(D_fake, ones_label)
+# Generator forward-loss-backward-update
+z = Variable(torch.randn(mb_size, Z_dim))
+G_sample = G(z)
+D_fake = D(G_sample)
 
-G_loss.backward()
-G_solver.step()
+G_loss = nn.binary_cross_entropy(D_fake, ones_label)
 
-# Housekeeping - reset gradient
-reset_grad()
+G_loss.backward()
+G_solver.step()
 
+# Housekeeping - reset gradient
+reset_grad()
 ```
 
 And that is it, really.
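Assembled, the two steps alternate inside a single training loop. Here is a minimal sketch of how the pieces fit together (illustrative; it assumes the definitions above plus the post's elided data loading — `sample_real_batch` is a hypothetical stand-in for that loading, and `zeros_label` is assumed to mirror `ones_label`):

```python
for it in range(100000):
    # Discriminator step: push D(real) toward 1 and D(fake) toward 0
    z = Variable(torch.randn(mb_size, Z_dim))
    X = sample_real_batch()  # hypothetical helper for the minibatch loading shown earlier

    G_sample = G(z)
    D_real = D(X)
    D_fake = D(G_sample)

    D_loss = (nn.binary_cross_entropy(D_real, ones_label)
              + nn.binary_cross_entropy(D_fake, zeros_label))
    D_loss.backward()
    D_solver.step()
    reset_grad()  # clear gradients so they don't leak into the generator step

    # Generator step: push D(G(z)) toward 1
    z = Variable(torch.randn(mb_size, Z_dim))
    G_loss = nn.binary_cross_entropy(D(G(z)), ones_label)
    G_loss.backward()
    G_solver.step()
    reset_grad()
```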
@@ -156,4 +156,4 @@ But we might ask, why do all of those things matter? Why not to just use TensorF
 
 In contrast, with imperative computation we can just use the `print()` function basically anywhere and anytime we want, and it will immediately display the value. Other "non-trivial" operations like loops and conditionals also become much easier in Pytorch, just like in good old Python. Hence, one could argue that this way of programming is more "natural".
 
-The full code is available in my Github repo: <https://github.com/wiseodd/generative-models>.
+The full code is available in my Github repo: https://github.com/wiseodd/generative-models.
Lines changed: 9 additions & 8 deletions
@@ -5,7 +5,11 @@ publishDate: 2017-01-26 03:53
 tags: [machine learning, probability]
 ---
 
-When reading Kevin Murphy's book, I came across this statement: _"... maxmizing likelihood is equivalent to minimizing \\( D_{KL}[P(. \vert \theta^{\ast}) \, \Vert \, P(. \vert \theta)] \\), where \\( P(. \vert \theta^{\ast}) \\) is the true distribution and \\( P(. \vert \theta) \\) is our estimate ..."_. So here is an attempt to prove that.
+When reading Kevin Murphy's book, I came across this statement:
+
+> ... maximizing likelihood is equivalent to minimizing $D_{KL}[P(. \vert \theta^*) \, \Vert \, P(. \vert \theta)]$, where $P(. \vert \theta^*)$ is the true distribution and $P(. \vert \theta)$ is our estimate ...
+
+So here is an attempt to prove that.
 
 $$
 \begin{align}
@@ -17,15 +21,12 @@ D_{KL}[P(x \vert \theta^*) \, \Vert \, P(x \vert \theta)] &= \mathbb{E}_{x \sim
 \end{align}
 $$
 
-If it looks familiar, the left term is the entropy of \\( P(x \vert \theta^*) \\). However, it does not depend on the estimated parameter \\( \theta \\), so we will ignore it.
+If it looks familiar, the left term is the entropy of $P(x \vert \theta^*)$. However, it does not depend on the estimated parameter $\theta$, so we will ignore it.
 
-Suppose we sample \\( N \\) of these \\( x \sim P(x \vert \theta^*) \\). Then, the [Law of Large Numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers) says that as \\( N \\) goes to infinity:
+Suppose we sample $N$ of these $x \sim P(x \vert \theta^*)$. Then, the [Law of Large Numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers) says that as $N$ goes to infinity:
 
 $$
-
 -\frac{1}{N} \sum_i^N \log \, P(x_i \vert \theta) = -\mathbb{E}_{x \sim P(x \vert \theta^*)}\left[\log \, P(x \vert \theta) \right]
-
-
 $$
 
 which is the right term of the above KL-Divergence. Notice that:
@@ -39,8 +40,8 @@
 \end{align}
 $$
 
-where NLL is the negative log-likelihood and \\( c \\) is a constant.
+where NLL is the negative log-likelihood and $c$ is a constant.
 
-Then, minimizing \\( D_{KL}[P(x \vert \theta^*) \, \Vert \, P(x \vert \theta)] \\) is equivalent to minimizing the NLL. In other words, it is equivalent to maximizing the log-likelihood.
+Then, minimizing $D_{KL}[P(x \vert \theta^*) \, \Vert \, P(x \vert \theta)]$ is equivalent to minimizing the NLL. In other words, it is equivalent to maximizing the log-likelihood.
 
 Why does this matter, though? Because this gives MLE a nice interpretation: maximizing the likelihood of the data under our estimate is equivalent to minimizing the difference between our estimate and the real data distribution. We can see MLE as a proxy for fitting our estimate to the real distribution, which cannot be done directly as the real distribution is unknown to us.
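As a sanity check of the whole argument, here is a minimal NumPy sketch (illustrative; the Gaussian family and grid are assumptions): for a unit-variance Gaussian family, the $\theta$ that minimizes the empirical NLL coincides with the one that minimizes the true KL.

```python
import numpy as np

np.random.seed(0)

# True distribution P(. | theta*) = N(2.0, 1); draw samples from it
theta_star = 2.0
x = np.random.normal(theta_star, 1.0, size=100000)

thetas = np.linspace(0.0, 4.0, 401)

# Empirical NLL under N(theta, 1): the Law-of-Large-Numbers proxy for
# -E[log P(x | theta)], dropping additive constants
nll = np.array([0.5 * np.mean((x - t) ** 2) for t in thetas])

# Exact KL between two unit-variance Gaussians: (theta - theta*)^2 / 2
kl = 0.5 * (thetas - theta_star) ** 2

print(thetas[np.argmin(nll)])  # ~2.0: the NLL minimizer...
print(thetas[np.argmin(kl)])   # 2.0: ...matches the KL minimizer
```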
