src/content/post/autoencoders.mdx

publishDate: 2016-12-03 12:20
tags: [programming, python, neuralnet]
---

Consider a neural net. Usually we use it for classification and regression tasks, that is, given an input vector $X$, we want to find $y$. In other words, we want the neural net to find a mapping $y = f(X)$.

Now, what happens if we use the same data as the codomain of the function? That is, we want to find a mapping $X = f(X)$. Well, the neural net will now learn an identity mapping of $X$. We might ask: how is that useful?

It turns out, the hidden layer(s) of the neural net learn a very interesting representation of the data. Hence, we can use the hidden layer representation for many things, for example data compression, dimensionality reduction, and feature learning. This was exactly the idea behind Deep Learning in the last decade: by stacking Autoencoders to learn the representation of the data, and training them greedily, we can hopefully train deep nets effectively.

## Vanilla Autoencoder

In its simplest form, an Autoencoder is a two layer net, i.e. a neural net with one hidden layer. The input and output are the same, and we learn how to reconstruct the input, for example using the $\ell_{2}$ norm.

```python
from tensorflow.examples.tutorials.mnist import input_data
```

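Only the import line of the snippet appears above. As a rough, hypothetical sketch of a vanilla Autoencoder in this style, assuming the TensorFlow 1.x API and MNIST loaded through `input_data` (the layer sizes, learning rate, and variable names are illustrative, not the post's own):

```python
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# Load MNIST; each image is a flattened 784-dimensional vector
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

X = tf.placeholder(tf.float32, [None, 784])

# Encoder and decoder weights; 128 hidden units is an arbitrary choice
W_enc = tf.Variable(tf.random_normal([784, 128], stddev=0.01))
b_enc = tf.Variable(tf.zeros([128]))
W_dec = tf.Variable(tf.random_normal([128, 784], stddev=0.01))
b_dec = tf.Variable(tf.zeros([784]))

h = tf.nn.sigmoid(tf.matmul(X, W_enc) + b_enc)      # hidden representation
X_hat = tf.nn.sigmoid(tf.matmul(h, W_dec) + b_dec)  # reconstruction

# l2 reconstruction loss between the input and its reconstruction
loss = tf.reduce_mean(tf.reduce_sum(tf.square(X - X_hat), axis=1))
train_step = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        X_batch, _ = mnist.train.next_batch(128)
        sess.run(train_step, feed_dict={X: X_batch})
```
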
Notice that in our hidden layer we added an $\ell_{1}$ penalty. As a result, the representation is now sparser compared to the vanilla Autoencoder. We can see that by looking at the statistics of the hidden layer: the mean activation of the vanilla Autoencoder is 0.512477, whereas for the Sparse Autoencoder it is 0.148664.

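As a hypothetical continuation of the sketch above, an $\ell_{1}$ activity penalty on the hidden layer could be added roughly like this (the weight `beta` is an illustrative value, not the post's):

```python
# Sparse variant: penalize the absolute values of the hidden activations,
# pushing most of them towards zero.
beta = 1e-3
sparsity_penalty = tf.reduce_mean(tf.reduce_sum(tf.abs(h), axis=1))
sparse_loss = loss + beta * sparsity_penalty
sparse_train_step = tf.train.AdamOptimizer(1e-3).minimize(sparse_loss)
```
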
## Multilayer Autoencoder

The learned representation of Autoencoder can be used for dimensionality reduction...

src/content/post/contractive-autoencoder.mdx

In the last post, we have seen many different flavors of a family of methods called Autoencoders.

The idea of Contractive Autoencoder is to make the learned representation robust towards small changes around the training examples. It achieves that by imposing a different penalty term on the representation.

The loss function for the reconstruction term is similar to those of the previous Autoencoders we have seen, i.e. the $\ell_2$ loss. The penalty term, however, is more complicated: we need to calculate the Jacobian matrix of the representation with respect to the training data.

Hence, the loss function is as follows:

$$
L = \lVert X - \hat{X} \rVert_2^2 + \lambda \lVert J_h(X) \rVert_F^2
$$

That is, the penalty term is the Frobenius norm of the Jacobian matrix, which is the sum of squares over all elements of the matrix. We can think of the Frobenius norm as a generalization of the Euclidean norm to matrices.

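For concreteness, the squared Frobenius norm of a matrix $A$ is simply:

$$
\lVert A \rVert_F^2 = \sum_{i} \sum_{j} A_{ij}^2
$$
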
Let's calculate the Jacobian of the hidden layer of our autoencoder then. Let's say:

$$
\begin{align}
Z_j &= W_i X_i \\[10pt]
h_j &= \phi(Z_j)
\end{align}
$$

where $\phi$ is the sigmoid nonlinearity. That is, to get the $j\text{-th}$ hidden unit, we take the dot product of the $i\text{-th}$ feature and the corresponding weight. Then, using the chain rule:

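To spell out that chain rule step in the post's index notation (the sigmoid's derivative is $\phi(Z_j)(1 - \phi(Z_j)) = h_j(1 - h_j)$, and $\partial Z_j / \partial X_i = W_i$), it works out to roughly:

$$
\frac{\partial h_j}{\partial X_i} = \frac{\partial \phi(Z_j)}{\partial Z_j} \frac{\partial Z_j}{\partial X_i} = h_j (1 - h_j) \, W_i
$$
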
It looks familiar, doesn't it? Because it's exactly how we calculate a gradient. The difference, however, is that we treat $h(X)$ as a vector valued function, that is, we treat each $h_{i}(X)$ as a separate output. Intuitively, say we have 64 hidden units; then we have 64 function outputs, and so we will have a gradient vector for each of those 64 hidden units. Hence, when we take the derivative of that hidden layer, what we get instead is a Jacobian matrix. And as we now know how to calculate the Jacobian, we can calculate the penalty term in our loss.

Let $diag(x)$ be a diagonal matrix; the matrix form of the above derivative is as follows:

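One way to write it, assuming $W$ has shape input $\times$ hidden as in the code further below:

$$
J_h(X) = diag\big[\, h (1 - h) \,\big] \, W^T
$$
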
We need to form a diagonal matrix of the gradient of $h$ because, if we look carefully at the original equation, the first term doesn't depend on $i$. Hence, for every value of $W_i$, we want to multiply it with the corresponding $h_j$, and the nice way to do that is to use a diagonal matrix.

As our main objective is to calculate the norm, we can simplify that in our implementation so that we don't need to construct the diagonal matrix:

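Squaring and summing every entry makes the diagonal matrix drop out, so the penalty reduces to something like:

$$
\lVert J_h(X) \rVert_F^2 = \sum_j \big[ h_j (1 - h_j) \big]^2 \sum_i W_{ij}^2
$$
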
Translated to code:

```python
import numpy as np


def sigmoid(x):
    # Sigmoid nonlinearity used for the hidden layer
    return 1. / (1. + np.exp(-x))


# Let's say we have a minibatch of 32, and 64 hidden units
# Our input is a 786 elements vector
X = np.random.randn(32, 786)
W = np.random.randn(786, 64)

Z = np.dot(X, W)
h = sigmoid(Z)  # 32x64

Wj_sqr = np.sum(W.T**2, axis=1)  # Marginalize i (note the transpose), 64x1
```

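The snippet is cut off at this point. Continuing the block above, a sketch of how the remaining lines could compute the penalty, following the simplified formula (the variable names beyond `Wj_sqr` are illustrative):

```python
# Squared derivative of the sigmoid for every hidden unit in the batch, 32x64
dh_sqr = (h * (1 - h))**2

# Broadcast Wj_sqr over the batch and sum everything:
# sum_j [h_j (1 - h_j)]^2 * sum_i W_ij^2
contractive_penalty = np.sum(dh_sqr * Wj_sqr)
```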