`src/content/post/bayesian-regression.mdx` (48 additions, 26 deletions)
Linear Regression could be intuitively interpreted from several points of view, e.g. ...
## Linear Regression: A Refresher
Recall that in Linear Regression we want to map our inputs to real numbers, i.e. $f: \mathbb{R}^N \to \mathbb{R}$. For example, given some features, e.g. hours of studying, number of subjects taken, and a student's IQ, we want to predict his or her GPA.
There are several types of Linear Regression, depending on the cost function and the regularizer. In this post, we will focus on Linear Regression with an $\ell_2$ cost and $\ell_2$ regularization. In statistics, this kind of regression is called Ridge Regression.
Formally, the objective is as follows:
$$
L = \frac{1}{2} \Vert \hat{y} - y \Vert^2_2 + \frac{\lambda}{2} \Vert W \Vert^2_2
$$
where $\hat{y}$ is the ground truth value, and $y$ is given by:
$$
y = W^Tx
$$
which is a linear combination of the feature vector and the weight matrix. The additional $\frac{1}{2}$ in both terms is just for mathematical convenience when taking the derivative.
The idea is then to minimize this objective function with respect to $W$. That is, we want to find the weight matrix $W$ that minimizes the regularized squared error.
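As a quick illustration (a minimal NumPy sketch, not part of the original post; `lam` stands in for $\lambda$ and the toy data is made up), this objective has the well-known closed-form minimizer $W = (X^TX + \lambda I)^{-1} X^T \hat{y}$:

```python
import numpy as np

def ridge_fit(X, y_true, lam=0.1):
    # Closed-form minimizer of 0.5 * ||y_true - X W||^2 + 0.5 * lam * ||W||^2
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y_true)

def ridge_loss(W, X, y_true, lam=0.1):
    resid = y_true - X @ W
    return 0.5 * resid @ resid + 0.5 * lam * W @ W

# Toy data: 3 features (hours studied, subjects taken, IQ) and a GPA-like target
X = np.random.randn(100, 3)
y_true = X @ np.array([0.3, 0.1, 0.5]) + 0.1 * np.random.randn(100)
W = ridge_fit(X, y_true)
```

Setting `lam` to zero gives the unregularized least-squares solution discussed next (provided $X^TX$ is invertible).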
Of course, we could ignore the regularization term. What we end up with, then, is vanilla Linear Regression:
$$
L = \frac{1}{2} \Vert \hat{y} - y \Vert^2_2
$$
Minimizing this objective is the definition of the Linear Least Squares problem.
[...]
or equivalently, we could say that the error is:
$$
\epsilon = \hat{y} - y
$$
Now, let's say we model the regression target as a Gaussian random variable, i.e. $y \sim N(\mu, \sigma^2)$, with $\mu = y = W^Tx$, the prediction of our model. Formally:
$$
P(y \vert x, W) = N(y \vert W^Tx, \sigma^2)
$$

Then, to find the optimum $W$, we could use Maximum Likelihood Estimation (MLE). As the above model is a likelihood, i.e. it describes our data $y$ under the parameter $W$, we will do MLE on it.
For simplicity, let's say $\sigma^2 = 1$; then:
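The algebra is elided in this excerpt; as a sketch (a reconstruction rather than the post's original derivation, evaluating the Gaussian at the observed targets $\hat{y}_i$ and absorbing the normalization constants into $\text{const}$):

$$
\begin{align}
\sum_i \log N(\hat{y}_i \vert W^Tx_i, 1) &= -\frac{1}{2} \sum_i (\hat{y}_i - W^Tx_i)^2 + \text{const} \\
&= -\frac{1}{2} \Vert \hat{y} - y \Vert^2_2 + \text{const},
\end{align}
$$

so maximizing this log-likelihood in $W$ is exactly minimizing $\frac{1}{2} \Vert \hat{y} - y \Vert^2_2$, the vanilla Linear Regression objective above.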
So we see, doing MLE on a Gaussian likelihood is equal to Linear Regression!
But what if we want to go Bayesian, i.e. introduce a prior and work with the posterior instead? Well, then we are doing MAP estimation! The posterior is proportional to the likelihood times the prior:

$$
P(W \vert y, x) \propto P(y \vert x, W) \, P(W)
$$
Since we already know the likelihood, we now ask: what should the prior be? If we set it to be uniform, we will be back to MLE estimation. So, for a non-trivial example, let's use a Gaussian prior for the weight $W$:
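The concrete prior and the intermediate algebra are elided here; as a sketch under one illustrative parametrization (taking $W \sim N(0, \lambda^{-1}I)$ and keeping $\sigma^2 = 1$, so the constants match the Ridge objective above):

$$
-\log P(W \vert y, x) = \frac{1}{2} \Vert \hat{y} - y \Vert^2_2 + \frac{\lambda}{2} \Vert W \Vert^2_2 + \text{const}
$$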
That is, the (negative) log posterior of a Gaussian likelihood with a Gaussian prior is the same as the objective function for Ridge Regression! Hence, a Gaussian prior is equivalent to $\ell_2$ regularization!
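The post then turns to the fully Bayesian treatment: instead of a single point estimate, prediction for a new input averages the likelihood over the whole posterior. As a sketch:

$$
P(y' \vert x', y, x) = \int P(y' \vert x', W) \, P(W \vert y, x) \, dW = \mathbb{E}_{W \sim P(W \vert y, x)} \left[ P(y' \vert x', W) \right]
$$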
that is, for a new data point $(x', y')$, we compute its likelihood and weigh it by the posterior.
Intuitively, given all possible values of $W$ under the posterior, we try those values one by one to predict the new data point. The results are then averaged in proportion to the probability of those values; hence, we are taking an expectation.
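To make the averaging concrete, here is a toy NumPy sketch (purely illustrative and not from the post; the posterior samples and their weights are made-up stand-ins):

```python
import numpy as np

def predict_bayesian(x_new, W_samples, probs):
    # One prediction W^T x' per candidate W, then a posterior-weighted
    # average, i.e. an expectation of the prediction under the posterior.
    preds = W_samples @ x_new
    return np.sum(probs * preds)

# Hypothetical "posterior": 1000 candidate weight vectors, equally weighted
W_samples = np.random.randn(1000, 3)
probs = np.full(1000, 1.0 / 1000)
y_new = predict_bayesian(np.array([1.0, 2.0, 0.5]), W_samples, probs)
```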
And of course, that is the reason why we use a shortcut in the form of MAP. For illustration, if each component of $W$ were binary, i.e. had two possible values, and there were $K$ components in $W$, we would already be talking about $2^K$ possible assignments for $W$, which is exponential! In the real world, each component of $W$ is a real number, which makes enumerating all possible values of $W$ intractable!
Of course, we could use approximate methods like Variational Bayes or MCMC, but they are still more costly than MAP. As MAP and MLE are guaranteed to find one of the modes (a local maximum), it is good enough.
`src/content/post/gan-pytorch.mdx` (26 additions, 26 deletions)
Those two libraries are different from the existing libraries like TensorFlow and ...
Enter Pytorch. It is a Torch port for Python. The programming style of Pytorch is imperative, meaning that if we are already familiar with using Numpy to code our algorithms up, then jumping to Pytorch should be a breeze. One does not need to learn symbolic mathematical computation, as in TensorFlow and Theano.
With that being said, let's try Pytorch by implementing Generative Adversarial Networks (GAN).
Let's start by importing stuff:
```python
# ... (the imports and most of the setup are not shown in this excerpt)
h_dim = 128
lr = 1e-3
```
Now let's construct our Generative Network $G(z)$:
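The network definition itself is not shown in this excerpt. A plausible sketch of what it looks like (not the post's verbatim code; it reuses `xavier_init`, `z_dim`, `h_dim`, and `X_dim` from the post's setup, plus the `Wzh, bzh, Whx, bhx` names the next paragraph refers to):

```python
import torch
import torch.nn.functional as F
from torch.autograd import Variable

# Generator parameters: z -> hidden -> data space
Wzh = xavier_init(size=[z_dim, h_dim])
bzh = Variable(torch.zeros(h_dim), requires_grad=True)
Whx = xavier_init(size=[h_dim, X_dim])
bhx = Variable(torch.zeros(X_dim), requires_grad=True)

def G(z):
    # Single hidden layer with ReLU; sigmoid output so pixels land in [0, 1]
    h = F.relu(z.mm(Wzh) + bzh.repeat(z.size(0), 1))
    return torch.sigmoid(h.mm(Whx) + bhx.repeat(h.size(0), 1))
```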
It is awfully similar to the TensorFlow version; what is the difference, then? It is subtle without more hints, but basically those variables `Wzh, bzh, Whx, bhx` are real tensors/ndarrays, just like in Numpy. That means that if we evaluate one with `print(Wzh)`, the value is immediately shown. Also, the function `G(z)` is a real function, in the sense that if we input a tensor, we immediately get the return value back. Try doing those things in TensorFlow or Theano.
Next is the Discriminator Network $D(X)$:
```python
Wxh = xavier_init(size=[X_dim, h_dim])
# ... (the rest of D's parameters and the D(X) definition are not shown in this excerpt)
```

Elsewhere in the post, the Numpy data is wrapped into a `Variable` before being fed to the networks:

```python
X = Variable(torch.from_numpy(X))
```
So first, let's define the "forward-loss-backward-update" step for $D(X)$. First, the forward step:
```python
# D(X) forward and loss
G_sample = G(z)
# ... (the loss computation and the backward/update calls are cut off in this excerpt)
```
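A sketch of how that step plausibly continues, using the `D_solver` and `reset_grad` that appear below (illustrative, not the post's verbatim code; `mb_size` is an assumed minibatch-size variable from the setup):

```python
z = Variable(torch.randn(mb_size, z_dim))  # fresh noise for the generator

G_sample = G(z)
D_real = D(X)        # X: a minibatch already wrapped as a Variable (see above)
D_fake = D(G_sample)

# Discriminator objective: assign 1 to real data, 0 to generated samples
D_loss = -torch.mean(torch.log(D_real) + torch.log(1.0 - D_fake))

D_loss.backward()    # compute gradients of D_loss w.r.t. D's parameters
D_solver.step()      # Adam update on D's parameters
reset_grad()         # clear gradients so they don't leak into G's update
```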
Of course, we could code up our own optimizer, but Pytorch has built-in optimizers ...
As we have two different optimizers, we need to clear the computed gradients in our computational graph, as we do not need them anymore. This is also necessary so that gradients won't get mixed up in the subsequent call of `backward()`, as `D_solver` shares some subgraphs with `G_solver`.
```python
def reset_grad():
    for p in params:
        p.grad.data.zero_()
```
We do similar things to implement the "forward-loss-backward-update" for $G(z)$:
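That block is elided here as well; by symmetry with the discriminator step, a sketch of the generator's step (again illustrative rather than verbatim):

```python
z = Variable(torch.randn(mb_size, z_dim))

G_sample = G(z)
D_fake = D(G_sample)

# Generator objective: make D label its samples as real
G_loss = -torch.mean(torch.log(D_fake))

G_loss.backward()
G_solver.step()
reset_grad()
```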
But we might ask: why do all of those things matter? Why not just use TensorFlow ...
In contrast, in imperative computation, we could just use the `print()` function basically anywhere and anytime we want, and it will immediately display the value. Other "non-trivial" operations like loops and conditionals also become much easier in Pytorch, just like in good old Python. Hence, one could argue that this way of programming is more "natural".
The full code is available in my Github repo: https://github.com/wiseodd/generative-models.
`src/content/post/kl-mle.mdx` (9 additions, 8 deletions)
publishDate: 2017-01-26 03:53
tags: [machine learning, probability]
When reading Kevin Murphy's book, I came across this statement:
> ... maximizing likelihood is equivalent to minimizing $D_{KL}[P(. \vert \theta^*) \, \Vert \, P(. \vert \theta)]$, where $P(. \vert \theta^*)$ is the true distribution and $P(. \vert \theta)$ is our estimate ...
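So here is an attempt to prove that.

The expansion of the KL divergence is elided in this excerpt; written out (a sketch in the post's notation), it is

$$
D_{KL}[P(x \vert \theta^*) \, \Vert \, P(x \vert \theta)] = \mathbb{E}_{x \sim P(x \vert \theta^*)} \left[ \log P(x \vert \theta^*) \right] - \mathbb{E}_{x \sim P(x \vert \theta^*)} \left[ \log P(x \vert \theta) \right]
$$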
If it looks familiar, the left term is (up to sign) the entropy of $P(x \vert \theta^*)$. However, it does not depend on the estimated parameter $\theta$, so we will ignore it.
Suppose we sample $N$ of these $x \sim P(x \vert \theta^*)$. Then, the [Law of Large Numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers) says that as $N$ goes to infinity:
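The displayed limit is elided here; as a sketch, it states that the sample average converges to the expectation:

$$
\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \vert \theta) \longrightarrow \mathbb{E}_{x \sim P(x \vert \theta^*)} \left[ \log P(x \vert \theta) \right]
$$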
which is the right term of the above KL-Divergence. Notice that:
$$
\begin{align}
D_{KL}[P(x \vert \theta^*) \, \Vert \, P(x \vert \theta)] &= c - \mathbb{E}_{x \sim P(x \vert \theta^*)} \left[ \log P(x \vert \theta) \right] \\
&\approx c - \frac{1}{N} \sum_{i=1}^{N} \log P(x_i \vert \theta) \\
&= c + \frac{1}{N} \, \text{NLL}
\end{align}
$$
where NLL is the negative log-likelihood and $c$ is a constant.
Then, if we minimize $D_{KL}[P(x \vert \theta^*) \, \Vert \, P(x \vert \theta)]$, it is equivalent to minimizing the NLL. In other words, it is equivalent to maximizing the log-likelihood.
Why does this matter, though? Because it gives MLE a nice interpretation: maximizing the likelihood of the data under our estimate is equivalent to minimizing the difference between our estimate and the real data distribution. We can see MLE as a proxy for fitting our estimate to the real distribution, which cannot be done directly, as the real distribution is unknown to us.