This part gives an overview of neural networks, vectorization, and backpropagation.
Neural-network-based models are nonlinear. For regression problems, the mean squared error (MSE) can still be used as the loss function, while for classification tasks the negative log-likelihood is commonly employed. In both cases, an optimization algorithm is needed to find good weight parameters. Because the hypothesis function is nonlinear, there is no closed-form solution for the optimal weights, so iterative gradient-based methods are used instead. The backpropagation algorithm, a special case of reverse-mode automatic differentiation that leverages the chain rule of calculus, is widely employed to compute these gradients efficiently. Backpropagation enables the network to adjust its weight parameters by propagating the gradients backward through the network.
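The two loss functions mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation; the function names and the example values are hypothetical.

```python
import numpy as np

def mse_loss(y_pred, y_true):
    # Mean squared error for regression: average of the squared residuals.
    return np.mean((y_pred - y_true) ** 2)

def nll_loss(probs, labels):
    # Negative log-likelihood for classification: -log of the probability
    # the model assigns to the correct class, averaged over the batch.
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

y_pred = np.array([2.5, 0.0, 2.0])
y_true = np.array([3.0, -0.5, 2.0])
print(mse_loss(y_pred, y_true))  # mean of [0.25, 0.25, 0.0] -> 0.1666...
```

Both losses are differentiable in the model's parameters, which is what allows gradient-based optimization in the first place.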
Neural networks can consist of multiple layers with high nonlinearity, achieved through the introduction of activation functions like ReLU (Rectified Linear Unit). This allows the network to capture complex underlying structures in the data by learning hierarchical representations.
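As a concrete example of such an activation function, ReLU is just an element-wise maximum with zero:

```python
import numpy as np

def relu(x):
    # ReLU keeps positive values and zeroes out negatives; this kink at
    # zero is what makes the layer's output nonlinear in its input.
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))  # [0.  0.  0.  1.5]
```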
Another significant advantage of neural networks is their ability to automatically learn features or representations from the data, eliminating the need for manual feature engineering by domain experts.
Additionally, the use of vectorization in neural network computations, which takes advantage of matrix algebra and optimized numerical linear algebra libraries, is crucial for efficient computation. This becomes particularly important when dealing with high-dimensional inputs and large datasets.
The MLP (Multi-Layer Perceptron) is one of the most basic types of neural network. Its architecture is composed of an input layer, one or more hidden layers, and an output layer, where the role of the hidden layers is to discover and capture complex features and patterns within the data. For nonlinearity, each neuron in a layer can apply an activation function that transforms the weighted sum of its inputs. The role of the activation function is to incorporate nonlinearity into the network so it can better represent the nonlinear underlying structure of the data.
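A forward pass through a one-hidden-layer MLP of this shape can be sketched as follows. The layer sizes (4 inputs, 8 hidden units, 3 outputs) are arbitrary choices for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 input features, 8 hidden units, 3 outputs.
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

def mlp_forward(x):
    # Hidden layer: weighted sum of inputs plus bias, then ReLU.
    h = np.maximum(0.0, x @ W1 + b1)
    # Output layer: a linear map (a softmax/loss would follow for
    # classification).
    return h @ W2 + b2

x = rng.normal(size=(5, 4))  # a batch of 5 samples, one per row
out = mlp_forward(x)
print(out.shape)  # (5, 3)
```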
In a very deep neural network, the vanishing gradient problem can occur when gradients become very small due to repeated multiplication of small gradients across multiple layers. This can result in negligible gradients, causing the weight parameters to remain unchanged and preventing effective learning from the training data. To address this issue, residual connections, also known as skip connections, are used. By introducing shortcut connections that bypass certain layers, the gradients can flow more directly during backpropagation. These skip connections enable the gradient to propagate through the network without being diminished by the multiplication of small gradients in intermediate layers.
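The core of a residual connection is a single addition: the block's input is added back to its output. A minimal sketch (the ReLU transformation and the weight scale here are illustrative assumptions):

```python
import numpy as np

def residual_block(x, W):
    # Skip connection: the input x is added back to the block's output,
    # giving gradients a direct identity path around the transformation.
    return x + np.maximum(0.0, x @ W)

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 6)) * 0.01  # near-zero weights
x = rng.normal(size=(3, 6))
y = residual_block(x, W)
# Even with tiny weights, the output stays close to x thanks to the
# shortcut, so the gradient through the block does not vanish.
```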
A neural network is a composition of many blocks, also known as modules. Data flows forward through the network from input to output in the forward pass. During this phase, the network computes the output for a given input, which includes computing the pre-activation (the weighted sum of inputs plus bias) and the activation (the output after applying the activation function) for each neuron in each layer.
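A forward pass that records exactly these quantities might look like the following sketch (ReLU is assumed as the activation throughout):

```python
import numpy as np

def forward_pass(x, weights, biases):
    # Records the pre-activation z and activation a of every layer --
    # precisely the quantities the backward pass will need.
    zs, activations = [], [x]
    a = x
    for W, b in zip(weights, biases):
        z = a @ W + b           # pre-activation: weighted sum plus bias
        a = np.maximum(0.0, z)  # activation: ReLU applied element-wise
        zs.append(z)
        activations.append(a)
    return zs, activations

zs, acts = forward_pass(np.array([[1.0, -1.0]]), [np.eye(2)], [np.zeros(2)])
print(acts[-1])  # [[1. 0.]] -- the negative pre-activation is zeroed by ReLU
```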
Then, during the backward pass, the network calculates the gradient of the loss function with respect to each parameter (weights and biases) using the chain rule, by starting from the output layer and working backward to the input layer. This involves calculating the partial derivatives of the loss function with respect to the pre-activations and activations at each layer. These gradients are used to update the parameters of the model in an attempt to minimize the loss function. This process is called backpropagation.
It's also worth noting that instead of calculating the gradient for the whole network at once, the network is broken down into smaller modules or layers, and the partial derivative of each is computed separately. The overall gradient is then obtained by chaining the outputs of these backward functions together.
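This modular forward/backward pattern can be sketched with a single linear module. Each module caches its input during the forward pass, then combines the upstream gradient with its own local derivatives in the backward pass; the class and attribute names here are illustrative, not from any particular framework.

```python
import numpy as np

class Linear:
    # One module: stores its input on the forward pass so the backward
    # pass can compute local gradients via the chain rule.
    def __init__(self, W, b):
        self.W, self.b = W, b

    def forward(self, x):
        self.x = x
        return x @ self.W + self.b

    def backward(self, grad_out):
        # Local partial derivatives, combined with the upstream gradient.
        self.grad_W = self.x.T @ grad_out
        self.grad_b = grad_out.sum(axis=0)
        return grad_out @ self.W.T  # gradient passed to the previous module

layer = Linear(np.array([[2.0]]), np.array([0.0]))
y = layer.forward(np.array([[3.0]]))      # [[6.0]]
g = layer.backward(np.array([[1.0]]))     # gradient flowing to earlier layers
```

Stacking such modules and calling their `backward` methods in reverse order is precisely the backpropagation procedure described above.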
Let's go over an example of using the chain rule to compute the partial derivative of a composite function J = f(g(z)). Writing u = g(z) for the intermediate variable, we multiply the partial derivative of the outer function J with respect to u by the partial derivative of u with respect to the inner variable z. The result is the partial derivative of J with respect to z, as expressed by ∂J/∂z = (∂J/∂u) * (∂u/∂z). This formula represents the application of the chain rule and enables us to compute the gradient efficiently during backpropagation in order to optimize the parameters of the neural network.
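The chain rule above can be checked numerically. Here a hypothetical composite is chosen for illustration, u = g(z) = z², J = f(u) = sin(u), and the analytic chain-rule result is compared against a finite-difference estimate:

```python
import numpy as np

g = lambda z: z ** 2       # inner function: u = g(z)
f = lambda u: np.sin(u)    # outer function: J = f(u)

def dJ_dz(z):
    u = g(z)
    dJ_du = np.cos(u)      # derivative of the outer function at u
    du_dz = 2 * z          # derivative of the inner function at z
    return dJ_du * du_dz   # chain rule: dJ/dz = (dJ/du) * (du/dz)

z = 0.7
numeric = (f(g(z + 1e-6)) - f(g(z - 1e-6))) / 2e-6  # finite-difference check
print(abs(dJ_dz(z) - numeric) < 1e-6)  # True
```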
Vectorization is a valuable technique that enables parallel computation of activation values in neural networks. Instead of calculating activations for each training sample individually, multiple samples can be stacked together, and the activation values can be computed efficiently using matrix multiplication. This approach significantly improves computational efficiency during both the forward and backward passes of the network.
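The stacking described above can be demonstrated directly: computing a layer's activations sample-by-sample in a loop gives the same result as a single batched matrix multiplication (sizes here are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(2)
W, b = rng.normal(size=(4, 3)), rng.normal(size=3)
X = rng.normal(size=(100, 4))  # 100 samples stacked as rows

# Loop version: one sample at a time.
loop_out = np.stack([np.maximum(0.0, x @ W + b) for x in X])

# Vectorized version: one matrix multiply for the whole batch.
vec_out = np.maximum(0.0, X @ W + b)

print(np.allclose(loop_out, vec_out))  # True
```

The vectorized form hands the whole computation to an optimized linear algebra library (BLAS, under NumPy), which is where the efficiency gain comes from.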
