src/content/post/autoencoders.mdx

publishDate: 2016-12-03 12:20
tags: [programming, python, neuralnet]
---

Consider a neural net. Usually we use it for classification and regression tasks, that is, given an input vector $X$, we want to find $y$. In other words, we want the neural net to find a mapping $y = f(X)$.

Now, what happens if we use the same data as the codomain of the function? That is, we want to find a mapping $X = f(X)$. Well, the neural net will now learn an identity mapping of $X$. We might ask: how is that useful?

It turns out, the hidden layer(s) of the neural net learn a very interesting representation of the data. Hence, we can use the hidden layer representation for many things, for example data compression, dimensionality reduction, and feature learning. This was exactly the idea behind Deep Learning in the last decade: by stacking Autoencoders to learn the representation of the data, and training them greedily, we can hopefully train deep nets effectively.

## Vanilla Autoencoder

In its simplest form, an Autoencoder is a two layer net, i.e. a neural net with one hidden layer. The input and output are the same, and we learn how to reconstruct the input, for example using the $\ell_{2}$ norm.

```python
from tensorflow.examples.tutorials.mnist import input_data
```

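Only the import line of the snippet appears above. As a rough, hypothetical sketch of a vanilla Autoencoder in this style, assuming the TensorFlow 1.x API and MNIST loaded through `input_data` (the layer sizes, learning rate, and variable names are illustrative, not the post's own):

```python
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# Load MNIST; each image is a flattened 784-dimensional vector
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

X = tf.placeholder(tf.float32, [None, 784])

# Encoder and decoder weights; 128 hidden units is an arbitrary choice
W_enc = tf.Variable(tf.random_normal([784, 128], stddev=0.01))
b_enc = tf.Variable(tf.zeros([128]))
W_dec = tf.Variable(tf.random_normal([128, 784], stddev=0.01))
b_dec = tf.Variable(tf.zeros([784]))

h = tf.nn.sigmoid(tf.matmul(X, W_enc) + b_enc)      # hidden representation
X_hat = tf.nn.sigmoid(tf.matmul(h, W_dec) + b_dec)  # reconstruction

# l2 reconstruction loss between the input and its reconstruction
loss = tf.reduce_mean(tf.reduce_sum(tf.square(X - X_hat), axis=1))
train_step = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        X_batch, _ = mnist.train.next_batch(128)
        sess.run(train_step, feed_dict={X: X_batch})
```
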
Notice that in our hidden layer we added an $\ell_{1}$ penalty. As a result, the representation is now sparser compared to the vanilla Autoencoder. We can see that by looking at the statistics of the hidden layer: the mean activation of the vanilla Autoencoder is 0.512477, whereas for the Sparse Autoencoder it is 0.148664.

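As a hypothetical continuation of the sketch above, an $\ell_{1}$ activity penalty on the hidden layer could be added roughly like this (the weight `beta` is an illustrative value, not the post's):

```python
# Sparse variant: penalize the absolute values of the hidden activations,
# pushing most of them towards zero.
beta = 1e-3
sparsity_penalty = tf.reduce_mean(tf.reduce_sum(tf.abs(h), axis=1))
sparse_loss = loss + beta * sparsity_penalty
sparse_train_step = tf.train.AdamOptimizer(1e-3).minimize(sparse_loss)
```
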
## Multilayer Autoencoder

The learned representation of Autoencoder can be used for dimensionality reduction...

src/content/post/contractive-autoencoder.mdx

In the last post, we have seen many different flavors of a family of methods called Autoencoders.

The idea of Contractive Autoencoder is to make the learned representation robust towards small changes around the training examples. It achieves that by imposing a different penalty term on the representation.

The loss function for the reconstruction term is similar to those of the previous Autoencoders we have seen, i.e. the $\ell_2$ loss. The penalty term, however, is more complicated: we need to calculate the Jacobian matrix of the representation with respect to the training data.

Hence, the loss function is as follows:

$$
L = \lVert X - \hat{X} \rVert_2^2 + \lambda \lVert J_h(X) \rVert_F^2
$$

That is, the penalty term is the Frobenius norm of the Jacobian matrix, which is the sum of squares over all elements of the matrix. We can think of the Frobenius norm as a generalization of the Euclidean norm to matrices.

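For concreteness, the squared Frobenius norm of a matrix $A$ is simply:

$$
\lVert A \rVert_F^2 = \sum_{i} \sum_{j} A_{ij}^2
$$
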
Let's calculate the Jacobian of the hidden layer of our autoencoder then. Let's say:

$$
\begin{align}
Z_j &= W_i X_i \\[10pt]
h_j &= \phi(Z_j)
\end{align}
$$

where $\phi$ is the sigmoid nonlinearity. That is, to get the $j\text{-th}$ hidden unit, we take the dot product of the $i\text{-th}$ feature and the corresponding weight. Then, using the chain rule:

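To spell out that chain rule step in the post's index notation (the sigmoid's derivative is $\phi(Z_j)(1 - \phi(Z_j)) = h_j(1 - h_j)$, and $\partial Z_j / \partial X_i = W_i$), it works out to roughly:

$$
\frac{\partial h_j}{\partial X_i} = \frac{\partial \phi(Z_j)}{\partial Z_j} \frac{\partial Z_j}{\partial X_i} = h_j (1 - h_j) \, W_i
$$
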
It looks familiar, doesn't it? Because it's exactly how we calculate a gradient. The difference, however, is that we treat $h(X)$ as a vector valued function, that is, we treat each $h_{i}(X)$ as a separate output. Intuitively, say we have 64 hidden units; then we have 64 function outputs, and so we will have a gradient vector for each of those 64 hidden units. Hence, when we take the derivative of that hidden layer, what we get instead is a Jacobian matrix. And as we now know how to calculate the Jacobian, we can calculate the penalty term in our loss.

Let $diag(x)$ be a diagonal matrix; the matrix form of the above derivative is as follows:

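One way to write it, assuming $W$ has shape input $\times$ hidden as in the code further below:

$$
J_h(X) = diag\big[\, h (1 - h) \,\big] \, W^T
$$
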
We need to form a diagonal matrix of the gradient of $h$ because, if we look carefully at the original equation, the first term doesn't depend on $i$. Hence, for every value of $W_i$, we want to multiply it with the corresponding $h_j$, and the nice way to do that is to use a diagonal matrix.

As our main objective is to calculate the norm, we can simplify that in our implementation so that we don't need to construct the diagonal matrix:

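Squaring and summing every entry makes the diagonal matrix drop out, so the penalty reduces to something like:

$$
\lVert J_h(X) \rVert_F^2 = \sum_j \big[ h_j (1 - h_j) \big]^2 \sum_i W_{ij}^2
$$
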
Translated to code:

```python
import numpy as np


def sigmoid(x):
    # Sigmoid nonlinearity used for the hidden layer
    return 1. / (1. + np.exp(-x))


# Let's say we have a minibatch of 32, and 64 hidden units
# Our input is a 786 elements vector
X = np.random.randn(32, 786)
W = np.random.randn(786, 64)

Z = np.dot(X, W)
h = sigmoid(Z)  # 32x64

Wj_sqr = np.sum(W.T**2, axis=1)  # Marginalize i (note the transpose), 64x1
```

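The snippet is cut off at this point. Continuing the block above, a sketch of how the remaining lines could compute the penalty, following the simplified formula (the variable names beyond `Wj_sqr` are illustrative):

```python
# Squared derivative of the sigmoid for every hidden unit in the batch, 32x64
dh_sqr = (h * (1 - h))**2

# Broadcast Wj_sqr over the batch and sum everything:
# sum_j [h_j (1 - h_j)]^2 * sum_i W_ij^2
contractive_penalty = np.sum(dh_sqr * Wj_sqr)
```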