BatchNorm, a.k.a. Batch Normalization, is a relatively new technique proposed by Ioffe & Szegedy in 2015. It promises to accelerate the training of (deep) neural nets.
One difficult thing about training a neural net is choosing the initial weights. BatchNorm promises a remedy: it makes the network less dependent on the initialization strategy. Another key point is that it lets us use higher learning rates. The authors even go further and state that BatchNorm could reduce the network's dependency on Dropout.
The forward propagation of BatchNorm is shown below:
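In the standard formulation, over a minibatch of $m$ activations $x_1, \dots, x_m$ (per dimension), with a small constant $\epsilon$ for numerical stability:

$$
\mu = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad
\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2
$$

$$
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta
$$

Here $\gamma$ and $\beta$ are learnable scale and shift parameters.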
Pretty simple, right? We just need to compute the activations' mean and variance over the current minibatch and normalize the activations with them. What could go wrong?
Well, we're training the neural net with backpropagation here, so the forward pass is only half the story. We still need to derive the backprop scheme for the BatchNorm layer, which is given by this:
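Writing $L$ for the loss and summing over the $m$ examples of the minibatch, the standard BatchNorm gradients are:

$$
\frac{\partial L}{\partial \hat{x}_i} = \frac{\partial L}{\partial y_i} \, \gamma
$$

$$
\frac{\partial L}{\partial \sigma^2} = \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{x}_i} \, (x_i - \mu) \cdot \left(-\frac{1}{2}\right) (\sigma^2 + \epsilon)^{-3/2}
$$

$$
\frac{\partial L}{\partial \mu} = \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial L}{\partial \sigma^2} \cdot \frac{-2 \sum_{i=1}^{m} (x_i - \mu)}{m}
$$

$$
\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial L}{\partial \sigma^2} \cdot \frac{2 (x_i - \mu)}{m} + \frac{\partial L}{\partial \mu} \cdot \frac{1}{m}
$$

$$
\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} \, \hat{x}_i, \qquad
\frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i}
$$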
If the above derivation doesn't make any sense, you could try reading [this](https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html) in combination with the computational-graph approach to backprop in the [CS231n lectures](https://www.youtube.com/playlist?list=PLLvH2FwAQhnpj1WEB-jHmPuUeQ8mX-XXG).
Now that we know how to do forward and backward propagation for BatchNorm, let's implement it.
## Training with BatchNorm
As always, I will reuse the code from previous posts. It's in my repo here: https://github.com/wiseodd/hipsternet.
```python
import numpy as np


def batchnorm_forward(X, gamma, beta):
    # Minibatch mean and variance, one value per feature dimension
    mu = np.mean(X, axis=0)
    var = np.var(X, axis=0)

    # Normalize, then apply the learnable scale and shift
    X_norm = (X - mu) / np.sqrt(var + 1e-8)
    out = gamma * X_norm + beta

    # Cache the intermediate values needed for the backward pass
    cache = (X, X_norm, mu, var, gamma, beta)

    return out, cache, mu, var
```
This is the forward propagation algorithm. It's simple. However, remember that we're normalizing each dimension of the activations independently. So, if our activations over a minibatch form an MxN matrix, we want the mean and variance to be 1xN: one mean and one variance per dimension. If we normalize the activations matrix with those, each dimension will have zero mean and unit variance.
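As a quick sanity check (a hypothetical snippet, not part of the original code), we can verify that the statistics are computed per dimension and that the normalized output behaves as expected:

```python
import numpy as np

X = np.random.randn(64, 100)              # minibatch of 64 activations, 100 dimensions
gamma, beta = np.ones(100), np.zeros(100)

out, cache, mu, var = batchnorm_forward(X, gamma, beta)

print(mu.shape, var.shape)                            # (100,) (100,): one value per dimension
print(np.allclose(out.mean(axis=0), 0, atol=1e-6))    # each dimension has ~zero mean
print(np.allclose(out.std(axis=0), 1, atol=1e-3))     # each dimension has ~unit variance
```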
At the end, we also return the intermediate variables used in the normalization, as we will need them later when computing the backprop.
This is how we use the method above:
```python
# Input to hidden
h1 = X @ W1 + b1

# BatchNorm
h1, bn1_cache, mu, var = batchnorm_forward(h1, gamma1, beta1)

# ReLU
h1[h1 < 0] = 0
```
For the backprop, here's the implementation:
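(The exact version lives in the repo; the sketch below follows the gradient derivation above and unpacks the cache returned by `batchnorm_forward`.)

```python
def batchnorm_backward(dout, cache):
    # Unpack the intermediate values saved during the forward pass
    X, X_norm, mu, var, gamma, beta = cache
    N = X.shape[0]  # minibatch size

    X_mu = X - mu
    std_inv = 1. / np.sqrt(var + 1e-8)

    # Gradients flowing back through the normalization
    dX_norm = dout * gamma
    dvar = np.sum(dX_norm * X_mu, axis=0) * -.5 * std_inv**3
    dmu = np.sum(dX_norm * -std_inv, axis=0) + dvar * np.mean(-2. * X_mu, axis=0)

    # Gradients w.r.t. the input and the learnable scale/shift
    dX = (dX_norm * std_inv) + (dvar * 2 * X_mu / N) + (dmu / N)
    dgamma = np.sum(dout * X_norm, axis=0)
    dbeta = np.sum(dout, axis=0)

    return dX, dgamma, dbeta
```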
For an explanation of the code, refer to the derivation of the BatchNorm gradient above. As we can see, we also return the derivatives of gamma and beta, the parameters of BatchNorm's linear transform. They will be used to update the model, so the network can learn them as well.
Remember, the order of operations in the backprop is important! We will get the wrong result if we swap it around.
One more thing we need to take care of: we want to fix the normalization at test time, which means we don't want to normalize our activations with test-set statistics. Hence, since we're essentially using SGD, which is stochastic, we estimate the mean and variance of our activations with a running average over the training minibatches.
```python
# BatchNorm training forward propagation
h2, bn2_cache, mu, var = batchnorm_forward(h2, gamma2, beta2)
bn_params['bn2_mean'] = .9 * bn_params['bn2_mean'] + .1 * mu
bn_params['bn2_var'] = .9 * bn_params['bn2_var'] + .1 * var
```
There, during training, we store each BatchNorm layer's running mean and variance as an exponentially decaying running average.
Then, at test time, we just use those running averages for the normalization:
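(A sketch of that test-time step, assuming the same `bn_params`, `gamma2`, and `beta2` as above; the exact code is in the repo.)

```python
# BatchNorm test time forward propagation: normalize with the stored
# running statistics instead of the current minibatch's mean and variance
h2_norm = (h2 - bn_params['bn2_mean']) / np.sqrt(bn_params['bn2_var'] + 1e-8)
h2 = gamma2 * h2_norm + beta2
```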
## Dropout

Dropout is one of the recent advancements in Deep Learning that enable us to train deeper and deeper networks. Essentially, Dropout acts as a regularizer: it makes the network less prone to overfitting.
As we already know, the deeper the network, the more parameters it has. For example, VGGNet from the 2014 ImageNet competition has some 148 million parameters. That's a lot. With that many parameters, the network could easily overfit, especially on a small dataset.
Enter Dropout.
In the training phase, with Dropout, we keep each neuron in a hidden layer alive with probability `p` and 'kill' it otherwise. What it means to 'kill' a neuron is to set it to 0. As a neural net is a collection of multiplicative operations, those zeroed neurons won't propagate anything to the rest of the network.
Let `n` be the number of neurons in a hidden layer. The expected number of neurons that remain active under Dropout is then `p*n`, as we sample each neuron independently with probability `p`. Concretely, if we have 1024 neurons in a hidden layer and set `p = 0.5`, we can expect only half of the neurons (512) to be active at any given time.
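(A quick illustrative check of that expectation, not part of the original code:)

```python
import numpy as np

p, n = 0.5, 1024
mask = np.random.binomial(1, p, size=n)   # 1 = keep the neuron, 0 = kill it
print(mask.sum())                         # roughly p*n = 512 neurons stay active
```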
So, that's why Dropout will increase the test-time performance: it improves generalization.
Let's see the concrete code for Dropout:
```python
# Dropout training
u1 = np.random.binomial(1, p, size=h1.shape)
h1 *= u1
```
First, we sample an array of independent Bernoulli random variables, which is just a collection of zeros and ones indicating whether we keep each neuron or not. For example, the value of `u1` might be `np.array([1, 0, 0, 1, 1, 0, 1, 0])`. Then, if we multiply our hidden layer with this array, we get the original value of a neuron wherever the array element is 1, and 0 wherever the array element is 0.
Now, because we're only using `p*n` of the neurons, the output of the layer during training has expectation `p*x` rather than `x`.
As we don't use Dropout at test time, the expected output of the layer is `x`. That doesn't match the training phase. What we need to do is make it match the training-phase expectation, so we scale the layer output by `p`.
```python
# Test time forward pass
h1 = X_train @ W1 + b1
h1[h1 < 0] = 0

# Scale the hidden layer with p
h1 *= p
```
In practice, it's better to simplify things: it's cumbersome to maintain code in two places. So, we move that scaling into the Dropout training step itself; this is the so-called inverted Dropout.
```python
# Dropout training, notice the scaling of 1/p
u1 = np.random.binomial(1, p, size=h1.shape) / p
h1 *= u1
```
With that code, we make the expected layer output `x` instead of `p*x`, because we scale it back up by `1/p`. Hence, at test time, we don't need to do anything, as the expected output of the layer is already the same.
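To see why, consider a single unit with pre-Dropout value $x$ and keep probability $p$:

$$
\mathbb{E}[\text{output}] = p \cdot \frac{x}{p} + (1 - p) \cdot 0 = x
$$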
During the backprop, all we need to do is account for the Dropout mask. The killed neurons don't contribute anything to the network, so we won't flow the gradient through them.
```python
dh1 *= u1
```
For the full example, please refer to https://github.com/wiseodd/hipsternet/blob/master/hipsternet/neuralnet.py.
## Test and Comparison
Test time! But first, let's declare what kind of network we will use for testing.
```python
def make_network(D, C, H=100):
    # Three-layer net with He-style initialization for the weights
    model = dict(
        W1=np.random.randn(D, H) / np.sqrt(D / 2.),
        W2=np.random.randn(H, H) / np.sqrt(H / 2.),
        W3=np.random.randn(H, C) / np.sqrt(H / 2.),
        b1=np.zeros((1, H)),
        b2=np.zeros((1, H)),
        b3=np.zeros((1, C))
    )

    return model
```
We also implement Dropout in our model; implementing Dropout in our neural net means adding just those few lines of code.
We then compare the Dropout network with the non-Dropout network. The result is nice: the Dropout network consistently performs better at test time than the non-Dropout network.
To see more, check the full example on my GitHub page: https://github.com/wiseodd/hipsternet.