BatchNorm, a.k.a. Batch Normalization, is a relatively new technique proposed by Ioffe & Szegedy in 2015. It promises to accelerate the training of (deep) neural nets.
One difficult thing about training a neural net is choosing the initial weights. BatchNorm promises a remedy: it makes the network less dependent on the initialization strategy. Another key point is that it lets us use higher learning rates. The authors even go further and state that BatchNorm could reduce the network's dependency on Dropout.
The forward propagation of BatchNorm is shown below:
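In the standard formulation, over a minibatch of $m$ activations $x_1, \dots, x_m$ (per dimension), with a small constant $\epsilon$ for numerical stability:

$$
\mu = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad
\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2
$$

$$
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta
$$

Here $\gamma$ and $\beta$ are learnable scale and shift parameters.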
Pretty simple, right? We just need to compute the activations' mean and variance over the current minibatch and normalize the activations with them. What could go wrong?
Well, we're training the neural net with backpropagation here, so the forward pass is only half the story. We still need to derive the backprop scheme for the BatchNorm layer, which is given by this:
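Writing $L$ for the loss and summing over the $m$ examples of the minibatch, the standard BatchNorm gradients are:

$$
\frac{\partial L}{\partial \hat{x}_i} = \frac{\partial L}{\partial y_i} \, \gamma
$$

$$
\frac{\partial L}{\partial \sigma^2} = \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{x}_i} \, (x_i - \mu) \cdot \left(-\frac{1}{2}\right) (\sigma^2 + \epsilon)^{-3/2}
$$

$$
\frac{\partial L}{\partial \mu} = \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial L}{\partial \sigma^2} \cdot \frac{-2 \sum_{i=1}^{m} (x_i - \mu)}{m}
$$

$$
\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial L}{\partial \sigma^2} \cdot \frac{2 (x_i - \mu)}{m} + \frac{\partial L}{\partial \mu} \cdot \frac{1}{m}
$$

$$
\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} \, \hat{x}_i, \qquad
\frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i}
$$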
If the above derivation doesn't make any sense, you could try reading [this](https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html) in combination with the computational-graph approach to backprop in the [CS231n lectures](https://www.youtube.com/playlist?list=PLLvH2FwAQhnpj1WEB-jHmPuUeQ8mX-XXG).
Now that we know how to do forward and backward propagation for BatchNorm, let's implement it.
## Training with BatchNorm
As always, I will reuse the code from previous posts. It's in my repo here: https://github.com/wiseodd/hipsternet.
```python
import numpy as np


def batchnorm_forward(X, gamma, beta):
    # Minibatch mean and variance, one value per feature dimension
    mu = np.mean(X, axis=0)
    var = np.var(X, axis=0)

    # Normalize, then apply the learnable scale and shift
    X_norm = (X - mu) / np.sqrt(var + 1e-8)
    out = gamma * X_norm + beta

    # Cache the intermediate values needed for the backward pass
    cache = (X, X_norm, mu, var, gamma, beta)

    return out, cache, mu, var
```
This is the forward propagation algorithm. It's simple. However, remember that we're normalizing each dimension of the activations independently. So, if our activations over a minibatch form an MxN matrix, we want the mean and variance to be 1xN: one mean and one variance per dimension. If we normalize the activations matrix with those, each dimension will have zero mean and unit variance.
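As a quick sanity check (a hypothetical snippet, not part of the original code), we can verify that the statistics are computed per dimension and that the normalized output behaves as expected:

```python
import numpy as np

X = np.random.randn(64, 100)              # minibatch of 64 activations, 100 dimensions
gamma, beta = np.ones(100), np.zeros(100)

out, cache, mu, var = batchnorm_forward(X, gamma, beta)

print(mu.shape, var.shape)                            # (100,) (100,): one value per dimension
print(np.allclose(out.mean(axis=0), 0, atol=1e-6))    # each dimension has ~zero mean
print(np.allclose(out.std(axis=0), 1, atol=1e-3))     # each dimension has ~unit variance
```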
At the end, we also return the intermediate variables used in the normalization, as we will need them later when computing the backprop.
This is how we use the method above:
```python
# Input to hidden
h1 = X @ W1 + b1

# BatchNorm
h1, bn1_cache, mu, var = batchnorm_forward(h1, gamma1, beta1)

# ReLU
h1[h1 < 0] = 0
```
For the backprop, here's the implementation:
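(The exact version lives in the repo; the sketch below follows the gradient derivation above and unpacks the cache returned by `batchnorm_forward`.)

```python
def batchnorm_backward(dout, cache):
    # Unpack the intermediate values saved during the forward pass
    X, X_norm, mu, var, gamma, beta = cache
    N = X.shape[0]  # minibatch size

    X_mu = X - mu
    std_inv = 1. / np.sqrt(var + 1e-8)

    # Gradients flowing back through the normalization
    dX_norm = dout * gamma
    dvar = np.sum(dX_norm * X_mu, axis=0) * -.5 * std_inv**3
    dmu = np.sum(dX_norm * -std_inv, axis=0) + dvar * np.mean(-2. * X_mu, axis=0)

    # Gradients w.r.t. the input and the learnable scale/shift
    dX = (dX_norm * std_inv) + (dvar * 2 * X_mu / N) + (dmu / N)
    dgamma = np.sum(dout * X_norm, axis=0)
    dbeta = np.sum(dout, axis=0)

    return dX, dgamma, dbeta
```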
For an explanation of the code, refer to the derivation of the BatchNorm gradient above. As we can see, we also return the derivatives of gamma and beta, the parameters of BatchNorm's linear transform. They will be used to update the model, so the network can learn them as well.
Remember, the order of operations in the backprop is important! We will get the wrong result if we swap it around.
One more thing we need to take care of: we want to fix the normalization at test time, which means we don't want to normalize our activations with test-set statistics. Hence, since we're essentially using SGD, which is stochastic, we estimate the mean and variance of our activations with a running average over the training minibatches.
```python
# BatchNorm training forward propagation
h2, bn2_cache, mu, var = batchnorm_forward(h2, gamma2, beta2)
bn_params['bn2_mean'] = .9 * bn_params['bn2_mean'] + .1 * mu
bn_params['bn2_var'] = .9 * bn_params['bn2_var'] + .1 * var
```
There, during training, we store each BatchNorm layer's running mean and variance as an exponentially decaying running average.
Then, at test time, we just use those running averages for the normalization:
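(A sketch of that test-time step, assuming the same `bn_params`, `gamma2`, and `beta2` as above; the exact code is in the repo.)

```python
# BatchNorm test time forward propagation: normalize with the stored
# running statistics instead of the current minibatch's mean and variance
h2_norm = (h2 - bn_params['bn2_mean']) / np.sqrt(bn_params['bn2_var'] + 1e-8)
h2 = gamma2 * h2_norm + beta2
```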
## Dropout

Dropout is one of the recent advancements in Deep Learning that enable us to train deeper and deeper networks. Essentially, Dropout acts as a regularizer: it makes the network less prone to overfitting.
As we already know, the deeper the network, the more parameters it has. For example, VGGNet from the 2014 ImageNet competition has some 148 million parameters. That's a lot. With that many parameters, the network could easily overfit, especially on a small dataset.
Enter Dropout.
In the training phase, with Dropout, we keep each neuron in a hidden layer alive with probability `p` and 'kill' it otherwise. What it means to 'kill' a neuron is to set it to 0. As a neural net is a collection of multiplicative operations, those zeroed neurons won't propagate anything to the rest of the network.
Let `n` be the number of neurons in a hidden layer. The expected number of neurons that remain active under Dropout is then `p*n`, as we sample each neuron independently with probability `p`. Concretely, if we have 1024 neurons in a hidden layer and set `p = 0.5`, we can expect only half of the neurons (512) to be active at any given time.
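(A quick illustrative check of that expectation, not part of the original code:)

```python
import numpy as np

p, n = 0.5, 1024
mask = np.random.binomial(1, p, size=n)   # 1 = keep the neuron, 0 = kill it
print(mask.sum())                         # roughly p*n = 512 neurons stay active
```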
So, that's why Dropout will increase the test-time performance: it improves generalization.
Let's see the concrete code for Dropout:
```python
# Dropout training
u1 = np.random.binomial(1, p, size=h1.shape)
h1 *= u1
```
First, we sample an array of independent Bernoulli random variables, which is just a collection of zeros and ones indicating whether we keep each neuron or not. For example, the value of `u1` might be `np.array([1, 0, 0, 1, 1, 0, 1, 0])`. Then, if we multiply our hidden layer with this array, we get the original value of a neuron wherever the array element is 1, and 0 wherever the array element is 0.
Now, because we're only using `p*n` of the neurons, the output of the layer during training has expectation `p*x` rather than `x`.
As we don't use Dropout at test time, the expected output of the layer is `x`. That doesn't match the training phase. What we need to do is make it match the training-phase expectation, so we scale the layer output by `p`.
```python
# Test time forward pass
h1 = X_train @ W1 + b1
h1[h1 < 0] = 0

# Scale the hidden layer with p
h1 *= p
```
In practice, it's better to simplify things: it's cumbersome to maintain code in two places. So, we move that scaling into the Dropout training step itself; this is the so-called inverted Dropout.
```python
# Dropout training, notice the scaling of 1/p
u1 = np.random.binomial(1, p, size=h1.shape) / p
h1 *= u1
```
With that code, we make the expected layer output `x` instead of `p*x`, because we scale it back up by `1/p`. Hence, at test time, we don't need to do anything, as the expected output of the layer is already the same.
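To see why, consider a single unit with pre-Dropout value $x$ and keep probability $p$:

$$
\mathbb{E}[\text{output}] = p \cdot \frac{x}{p} + (1 - p) \cdot 0 = x
$$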
During the backprop, all we need to do is account for the Dropout mask. The killed neurons don't contribute anything to the network, so we won't flow the gradient through them.
```python
dh1 *= u1
```
For the full example, please refer to https://github.com/wiseodd/hipsternet/blob/master/hipsternet/neuralnet.py.
## Test and Comparison
Test time! But first, let's declare what kind of network we will use for testing.
```python
def make_network(D, C, H=100):
    # Three-layer net with He-style initialization for the weights
    model = dict(
        W1=np.random.randn(D, H) / np.sqrt(D / 2.),
        W2=np.random.randn(H, H) / np.sqrt(H / 2.),
        W3=np.random.randn(H, C) / np.sqrt(H / 2.),
        b1=np.zeros((1, H)),
        b2=np.zeros((1, H)),
        b3=np.zeros((1, C))
    )

    return model
```
We also implement Dropout in our model; implementing Dropout in our neural net means adding just those few lines of code.
We then compare the Dropout network with the non-Dropout network. The result is nice: the Dropout network consistently performs better at test time than the non-Dropout network.
To see more, check the full example on my GitHub page: https://github.com/wiseodd/hipsternet.