
Commit c373a9e

More md -> mdx
1 parent 6ca3d70 commit c373a9e

10 files changed, +105 -136 lines changed

.astro/types.d.ts

Lines changed: 15 additions & 15 deletions
@@ -152,13 +152,13 @@ declare module 'astro:content' {
 collection: "post";
 data: InferEntrySchema<"post">
 } & { render(): Render[".mdx"] };
-"batchnorm.md": {
-id: "batchnorm.md";
+"batchnorm.mdx": {
+id: "batchnorm.mdx";
 slug: "batchnorm";
 body: string;
 collection: "post";
 data: InferEntrySchema<"post">
-} & { render(): Render[".md"] };
+} & { render(): Render[".mdx"] };
 "bayesian-regression.mdx": {
 id: "bayesian-regression.mdx";
 slug: "bayesian-regression";
@@ -250,13 +250,13 @@ declare module 'astro:content' {
 collection: "post";
 data: InferEntrySchema<"post">
 } & { render(): Render[".md"] };
-"dropout.md": {
-id: "dropout.md";
+"dropout.mdx": {
+id: "dropout.mdx";
 slug: "dropout";
 body: string;
 collection: "post";
 data: InferEntrySchema<"post">
-} & { render(): Render[".md"] };
+} & { render(): Render[".mdx"] };
 "fisher-information.mdx": {
 id: "fisher-information.mdx";
 slug: "fisher-information";
@@ -425,34 +425,34 @@ declare module 'astro:content' {
 collection: "post";
 data: InferEntrySchema<"post">
 } & { render(): Render[".mdx"] };
-"nn-optimization.md": {
-id: "nn-optimization.md";
+"nn-optimization.mdx": {
+id: "nn-optimization.mdx";
 slug: "nn-optimization";
 body: string;
 collection: "post";
 data: InferEntrySchema<"post">
-} & { render(): Render[".md"] };
-"nn-sgd.md": {
-id: "nn-sgd.md";
+} & { render(): Render[".mdx"] };
+"nn-sgd.mdx": {
+id: "nn-sgd.mdx";
 slug: "nn-sgd";
 body: string;
 collection: "post";
 data: InferEntrySchema<"post">
-} & { render(): Render[".md"] };
+} & { render(): Render[".mdx"] };
 "optimization-riemannian-manifolds.mdx": {
 id: "optimization-riemannian-manifolds.mdx";
 slug: "optimization-riemannian-manifolds";
 body: string;
 collection: "post";
 data: InferEntrySchema<"post">
 } & { render(): Render[".mdx"] };
-"parallel-monte-carlo.md": {
-id: "parallel-monte-carlo.md";
+"parallel-monte-carlo.mdx": {
+id: "parallel-monte-carlo.mdx";
 slug: "parallel-monte-carlo";
 body: string;
 collection: "post";
 data: InferEntrySchema<"post">
-} & { render(): Render[".md"] };
+} & { render(): Render[".mdx"] };
 "plotting.mdx": {
 id: "plotting.mdx";
 slug: "plotting";

bun.lockb

340 Bytes
Binary file not shown.

package.json

Lines changed: 3 additions & 2 deletions
@@ -21,10 +21,11 @@
 "@fontsource/jetbrains-mono": "^5.0.20",
 "@vercel/analytics": "^1.3.1",
 "ajv": "^8.17.1",
-"astro": "^4.13.1",
+"astro": "^4.13.3",
 "astro-expressive-code": "^0.33.5",
 "astro-icon": "^1.1.0",
 "clsx": "^2.1.1",
+"jquery": "^3.7.1",
 "mdast-util-to-string": "^4.0.0",
 "punycode": "^2.3.1",
 "reading-time": "^1.5.0",
@@ -34,7 +35,7 @@
 "remark-smartypants": "^3.0.2",
 "remark-unwrap-images": "^4.0.0",
 "sharp": "^0.33.4",
-"tailwind-merge": "^2.4.0",
+"tailwind-merge": "^2.5.1",
 "tailwindcss": "^3.4.9",
 "typescript": "^5.5.4"
 },
Lines changed: 12 additions & 25 deletions
@@ -5,6 +5,8 @@ publishDate: 2016-07-04 10:53
 tags: [machine learning, programming, python, neural networks]
 ---

+import BlogImage from "@/components/BlogImage.astro";
+
 BatchNorm a.k.a Batch Normalization is a relatively new technique proposed by Ioffe & Szegedy in 2015. It promises us acceleration in training (deep) neural net.

 One difficult thing about training a neural net is to choose the initial weights. BatchNorm promises the remedy: it makes the network less dependant to the initialization strategy. Another key points are that it enables us to use higher learning rate. They even go further to state that BatchNorm could reduce the dependency on Dropout.
@@ -19,34 +21,33 @@ It enables us to be less careful with weights initialization as we don't need to

 The forward propagation of BatchNorm is shown below:

-![BatchNorm Forward]({{ site.baseurl }}/img/2016-07-04-batchnorm/00.png)
+<BlogImage imagePath='/img/batchnorm/00.png' />

 Pretty simple right? We just need to compute the activations mean and variance over the current minibatch and normalize the activations with that. What could go wrong.

 Well, we're training neural net with Backpropagation here, so that algorithm is half the story. We still need to derive the backprop scheme for the BatchNorm layer. Which is given by this:

-![BatchNorm Backward]({{ site.baseurl }}/img/2016-07-04-batchnorm/01.png)
+<BlogImage imagePath='/img/batchnorm/01.png' />

 If the above derivation doesn't make any sense, you could try reading [this](https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html) in combination with computational graph approach of backprop in [CS231 lecture](https://www.youtube.com/playlist?list=PLLvH2FwAQhnpj1WEB-jHmPuUeQ8mX-XXG).

 Now that we know how to do forward and backward propagation for BatchNorm, let's try to implement that.

 ## Training with BatchNorm

-As always, I will reuse the code from previous posts. It's in my repo here: <https://github.com/wiseodd/hipsternet>.
+As always, I will reuse the code from previous posts. It's in my repo here: https://github.com/wiseodd/hipsternet.

 ```python
 def batchnorm_forward(X, gamma, beta):
-mu = np.mean(X, axis=0)
-var = np.var(X, axis=0)
+mu = np.mean(X, axis=0)
+var = np.var(X, axis=0)

 X_norm = (X - mu) / np.sqrt(var + 1e-8)
 out = gamma * X_norm + beta

 cache = (X, X_norm, mu, var, gamma, beta)

 return out, cache, mu, var
-
 ```

 This is the forward propagation algorithm. It's simple. However, remember that we're normalizing each dimension of activations. So, if our activations over a minibatch is MxN matrix, then we want the mean and variance to be 1xN: one value of mean and variance for each dimension. So, if we normalize our activations matrix with that, each dimension will have zero mean and one variance.
@@ -56,17 +57,13 @@ At the end, we're also spitting out the intermediate variable used for normaliza
 This is how we use that above method:

 ```python
-
 # Input to hidden
-
 h1 = X @ W1 + b1

 # BatchNorm
-
 h1, bn1_cache, mu, var = batchnorm_forward(h1, gamma1, beta1)

 # ReLU
-
 h1[h1 < 0] = 0
 ```

@@ -76,7 +73,7 @@ For the backprop, here's the implementation:

 ```python
 def batchnorm_backward(dout, cache):
-X, X_norm, mu, var, gamma, beta = cache
+X, X_norm, mu, var, gamma, beta = cache

 N, D = X.shape

@@ -92,27 +89,21 @@ X, X_norm, mu, var, gamma, beta = cache
 dbeta = np.sum(dout, axis=0)

 return dX, dgamma, dbeta
-
 ```

 For the explanation of the code, refer to the derivation of the BatchNorm gradient in the last section. As we can see, we're also returning derivative of gamma and beta: the linear transform for BatchNorm. It will be used to update the model, so that the net could also learn them.

 ```python
-
 # h1
-
 dh1 = dh2 @ W2.T

 # ReLU
-
 dh1[h1 <= 0] = 0

 # Dropout h1
-
-dh1 \*= u1
+dh1 *= u1

 # BatchNorm
-
 dh1, dgamma1, dbeta1 = batchnorm_backward(dh2, bn2_cache)
 ```

@@ -123,22 +114,18 @@ Remember, the order of backprop is important! We will get wrong result if we swa
 One more thing we need to take care of is that we want to fix the normalization at test time. That means we don't want to normalize our activations with the test set. Hence, as we're essentially using SGD, which is stochastic, we're going to estimate the mean and variance of our activations using running average.

 ```python
-
 # BatchNorm training forward propagation
-
 h2, bn2*cache, mu, var = batchnorm_forward(h2, gamma2, beta2)
 bn_params['bn2_mean'] = .9 * bn*params['bn2_mean'] + .1 * mu
-bn*params['bn2_var'] = .9 * bn*params['bn2_var'] + .1 * var
+bn_params['bn2_var'] = .9 * bn*params['bn2_var'] + .1 * var
 ```

 There, we store each BatchNorm layer's running mean and variance while training. It's a decaying running average.

 Then, at the test time, we just use that running average for the normalization:

 ```python
-
 # BatchNorm inference forward propagation
-
 h2 = (h2 - bn_params['bn2_mean']) / np.sqrt(bn_params['bn2_var'] + 1e-8)
 h2 = gamma2 \* h2 + beta2
 ```
@@ -222,5 +209,5 @@ In the test, we found that by using BatchNorm, our network become more tolerant

 ## References

-- <http://arxiv.org/pdf/1502.03167v3.pdf>
-- <https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html>
+- http://arxiv.org/pdf/1502.03167v3.pdf
+- https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
Lines changed: 19 additions & 24 deletions
@@ -5,6 +5,8 @@ publishDate: 2016-06-25 10:00
 tags: [machine learning, programming, python, neural networks]
 ---

+import BlogImage from "@/components/BlogImage.astro";
+
 Dropout is one of the recent advancement in Deep Learning that enables us to train deeper and deeper network. Essentially, Dropout act as a regularization, and what it does is to make the network less prone to overfitting.

 As we already know, the deeper the network is, the more parameter it has. For example, VGGNet from ImageNet competition 2014, has some 148 million parameters. That's a lot. With that many parameters, the network could easily overfit, especially with small dataset.
@@ -13,7 +15,7 @@ Enter Dropout.

 In training phase, with Dropout, at each hidden layer, with probability `p`, we kill the neuron. What it means by 'kill' is to set the neuron to 0. As neural net is a collection multiplicative operations, then those 0 neuron won't propagate anything to the rest of the network.

-![Dropout]({{ site.baseurl }}/img/2016-06-25-dropout/00.png)
+<BlogImage imagePath='/img/dropout/00.png' fullWidth />

 Let `n` be the number of neuron in a hidden layer, then the expectation of the number of neuron to be active at each Dropout is `p*n`, as we sample the neurons uniformly with probability `p`. Concretely, if we have 1024 neurons in hidden layer, if we set `p = 0.5`, then we can expect that only half of the neurons (512) would be active at each given time.

@@ -26,11 +28,9 @@ So, that's why Dropout will increase the test time performance: it improves gene
 Let's see the concrete code for Dropout:

 ```python
-
 # Dropout training
-
 u1 = np.random.binomial(1, p, size=h1.shape)
-h1 \*= u1
+h1 *= u1
 ```

 First, we sample an array of independent Bernoulli Distribution, which is just a collection of zero or one to indicate whether we kill the neuron or not. For example, the value of `u1` would be `np.array([1, 0, 0, 1, 1, 0, 1, 0])`. Then, if we multiply our hidden layer with this array, what we get is the originial value of the neuron if the array element is 1, and 0 if the array element is also 0.
@@ -42,25 +42,20 @@ Now, because we're only using `p*n` of the neurons, the output then has the expe
 As we don't use Dropout in test time, then the expected output of the layer is `x`. That doesn't match with the training phase. What we need to do is to make it matches the training phase expectation, so we scale the layer output with `p`.

 ```python
-
 # Test time forward pass
-
 h1 = X_train @ W1 + b1
 h1[h1 < 0] = 0

 # Scale the hidden layer with p
-
-h1 \*= p
+h1 *= p
 ```

 In practice, it's better to simplify things. It's cumbersome to maintain codes in two places. So, we move that scaling into the Dropout training itself.

 ```python
-
 # Dropout training, notice the scaling of 1/p
-
 u1 = np.random.binomial(1, p, size=h1.shape) / p
-h1 \*= u1
+h1 *= u1
 ```

 With that code, we essentially make the expectation of layer output to be `x` instead of `px`, because we scale it back with `1/p`. Hence in the test time, we don't need to do anything as the expected output of the layer is the same.
@@ -70,25 +65,25 @@ With that code, we essentially make the expectation of layer output to be `x` in
 During the backprop, what we need to do is just to consider the Dropout. The killed neurons don't contribute anything to the network, so we won't flow the gradient through them.

 ```python
-dh1 \*= u1
+dh1 *= u1
 ```

-For full example, please refer to: <https://github.com/wiseodd/hipsternet/blob/master/hipsternet/neuralnet.py>.
+For full example, please refer to: https://github.com/wiseodd/hipsternet/blob/master/hipsternet/neuralnet.py.

 ## Test and Comparison

 Test time! But first, let's declare what kind of network we will use for testing.

 ```python
 def make_network(D, C, H=100):
-model = dict(
-W1=np.random.randn(D, H) / np.sqrt(D / 2.),
-W2=np.random.randn(H, H) / np.sqrt(H / 2.),
-W3=np.random.randn(H, C) / np.sqrt(H / 2.),
-b1=np.zeros((1, H)),
-b2=np.zeros((1, H)),
-b3=np.zeros((1, C))
-)
+model = dict(
+W1=np.random.randn(D, H) / np.sqrt(D / 2.),
+W2=np.random.randn(H, H) / np.sqrt(H / 2.),
+W3=np.random.randn(H, C) / np.sqrt(H / 2.),
+b1=np.zeros((1, H)),
+b2=np.zeros((1, H)),
+b3=np.zeros((1, C))
+)

 return model

@@ -147,9 +142,9 @@ We also implement Dropout in our model. Implementing Dropout in our neural net m

 We then compare the Dropout network with non Dropout network. The result is nice: Dropout network performs consistenly better in test time compared to the non Dropout Network.

-To see more about, check my full example in my Github page: <https://github.com/wiseodd/hipsternet>
+To see more about, check my full example in my Github page: https://github.com/wiseodd/hipsternet.

 ## References

-- <http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf>
-- <http://cs231n.github.io/neural-networks-2/#reg>
+- http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
+- http://cs231n.github.io/neural-networks-2/#reg
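Both post diffs above follow the same conversion pattern: drop the Jekyll-style `![...]({{ site.baseurl }}/img/...)` image syntax, add an `import BlogImage from "@/components/BlogImage.astro";` statement right after the frontmatter, and reference images with `<BlogImage imagePath='...' />` (optionally `fullWidth`). The `BlogImage.astro` component itself is not part of this commit; purely as a hedged sketch of what a component with that prop interface could look like — assuming images are served from `public/` and Tailwind classes are available for sizing — it might be something like:

```astro
---
// Hypothetical sketch only — not the repo's actual BlogImage.astro.
// Assumes `imagePath` points at a file under public/ and Tailwind is installed.
interface Props {
  imagePath: string;
  alt?: string;
  fullWidth?: boolean;
}

const { imagePath, alt = "", fullWidth = false } = Astro.props;
---

<img
  src={imagePath}
  alt={alt}
  loading="lazy"
  class={fullWidth ? "w-full" : "mx-auto max-w-full"}
/>
```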
