🧠 Brain Made of Code, Created with Heart ❤︎

From Regression to Neural Nets: Learning with Gradient Descent & Beyond

Important

⚠️ Heads Up

Projects and deliverables may be made publicly available whenever possible.
The course prioritizes hands-on practice with real data in consulting scenarios.
All activities comply with the academic and ethical guidelines of PUC-SP.
Any confidential information related to this repository is kept in private repositories.

This project provides a comprehensive and hands-on guide to Machine Learning and Artificial Neural Networks (ANNs), combining theory, Python code, and visualizations.

It starts from the ground up:

Historical & Mathematical Background: Full explanations of the theory behind algorithms, including derivations, historical context, and all key formulas. Formulas are provided in LaTeX, ready to copy-paste for documentation or reports.
Foundations: understanding tensors as the core data structure in deep learning.
Datasets: loading and working with real datasets online (e.g., MNIST) for experimentation.
Regression & Optimization: training models with Batch Gradient Descent, Stochastic Gradient Descent, and Mini-batch GD.
Regularization: applying techniques like Elastic Net to reduce overfitting and improve generalization.
Model Evaluation: exploring accuracy metrics, detecting and handling outliers, and understanding error analysis.
Neural Networks: building progressively — from the artificial neuron model to Multilayer Perceptrons (MLP), forward and backward propagation, and optimization improvements like Momentum.

The repository provides:

Step-by-step explanations before each implementation.
Python code split into reproducible cells, ready for Jupyter Notebook or Google Colab.
Practical examples and visualizations to illustrate convergence, performance, overfitting behavior, and accuracy measurement.

Ideal for beginners and intermediate learners looking to build a solid foundation in machine learning optimization algorithms, and who want to go beyond running code to truly understand the underlying concepts of optimization, evaluation, and neural networks.

I - Artificial Neural Networks – From Perceptron to Modern Learning Algorithms

This repository provides a hands-on and structured overview of Artificial Neural Networks (ANNs) — starting from the Perceptron, moving through the Multilayer Perceptron (MLP), and exploring the learning algorithms that power Machine Learning, such as Gradient Descent and its variations.

The content is inspired by the official theoretical material Decreasing-Gradient.pdf from the Undergraduate Program in Humanistic AI and Data Science at PUC São Paulo, Brazil.

Motivation

The human brain processes information in a nonlinear, adaptive, and massively parallel way — very different from how conventional computers work.

For example, the brain can recognize a familiar face in milliseconds, even in a completely new environment. Meanwhile, traditional computers may take much longer to solve much simpler problems.

Taking inspiration from biology, Artificial Neural Networks are computational models designed to learn from data, adapt through experience, and replicate human-like problem solving.

Historical Context

McCulloch & Pitts (1943): Introduced the first neural network models.
Hebb (1949): Developed the basic model of self-organization.
Rosenblatt (1958): Introduced the perceptron, a supervised learning model.
Hopfield (1982) and Rumelhart, Hinton & Williams (1986): Revived the field with symmetric networks for optimization and with the backpropagation method, respectively.

Artificial Neuron Model

Each artificial neuron receives input signals $X_1, X_2, ..., X_p$ (binary or real values), each multiplied by a weight $w_1, w_2, ..., w_p$ (real values). The neuron computes a weighted sum (activity level):

$$ \Huge a = w_1 X_1 + w_2 X_2 + \cdots + w_p X_p $$


The output y is determined by an activation function, such as:

$$ \Huge y = \begin{cases} 1, & \text{if } a \geq t \\ 0, & \text{if } a < t \end{cases} $$
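The weighted sum and the threshold activation above can be sketched in a few lines of plain Python. This is only a minimal illustration: the function name `neuron_output` and the input, weight, and threshold values are hypothetical, chosen for the example.

```python
def neuron_output(inputs, weights, threshold):
    """Return 1 if the activity level a reaches the threshold t, else 0."""
    # Activity level: a = w_1*X_1 + w_2*X_2 + ... + w_p*X_p
    activity = sum(w * x for w, x in zip(weights, inputs))
    # Threshold (step) activation: y = 1 if a >= t, otherwise 0
    return 1 if activity >= threshold else 0


# Hypothetical example values (not taken from the course material)
x = [1.0, 0.0, 1.0]
w = [0.5, -0.3, 0.4]

print(neuron_output(x, w, threshold=0.8))  # activity 0.9 reaches 0.8
print(neuron_output(x, w, threshold=1.0))  # activity 0.9 stays below 1.0
```

Raising the threshold from 0.8 to 1.0 flips the output from 1 to 0 for the same inputs, which is the role the threshold plays in the activation function.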


Key Benefits of ANNs

Adaptability through learning
Ability to operate with partial knowledge
Fault tolerance
Generalization
Contextual information processing
Input-output mapping

Application Areas

Pattern classification
Clustering/categorization
Function approximation
Prediction
Optimization
Content-addressable memory
Control systems

Learning Process

ANNs operate in two main phases:

Training Phase: The network learns by adjusting its free parameters (weights) to perform a specific function.
Application Phase: The trained network is used for its intended purpose (e.g., pattern or image classification).

The learning process involves:

Stimulation by the environment (input).
Modification of free parameters (weights) as a result.
The network responds differently due to internal changes.

Learning is governed by a set of pre-established rules (learning algorithm) and a learning paradigm (model).

Error Correction Learning

The output of neuron $k$ at iteration $n$ is $y_k(n)$, and the desired response is $d_k(n)$. The error signal is:

$$ \Huge e_k(n) = d_k(n) - y_k(n) $$

The goal is to minimize the cost function (performance index):

$$ \Huge E(n) = \frac{1}{2} e_k^2(n) $$

Weights are updated as:

$$ \Huge w_{kj}(n+1) = w_{kj}(n) + \Delta w_{kj}(n) $$

The Perceptron

The perceptron, proposed by Rosenblatt (1958), is the simplest type of ANN. It uses supervised learning and error correction to adjust the weight vector.

In a perceptron with two inputs and a bias, the bias allows the threshold value in the activation function to be set, and is updated like any other weight.
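As a rough sketch of this idea, the snippet below trains a two-input perceptron on the logical AND function. It assumes the common error-correction update $\Delta w = \eta \, (d - y) \, x$ and treats the bias as the weight of a constant input $X_0 = 1$; the function names, the learning rate, and the AND dataset are illustrative choices, not taken from the course material.

```python
import numpy as np


def add_bias_input(X):
    """Prepend a constant input X_0 = 1 so the bias is just another weight."""
    return np.hstack([np.ones((X.shape[0], 1)), X])


def train_perceptron(X, d, eta=0.1, epochs=100):
    """Error-correction learning for a single threshold neuron."""
    Xb = add_bias_input(X)
    w = np.zeros(Xb.shape[1])            # w[0] is the bias weight
    for _ in range(epochs):
        for x, target in zip(Xb, d):
            y = 1 if np.dot(w, x) >= 0 else 0
            w += eta * (target - y) * x  # no change when the output is correct
    return w


def predict(w, X):
    return (add_bias_input(X) @ w >= 0).astype(int)


# Logical AND: linearly separable, so the perceptron rule converges
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 0, 0, 1])

w = train_perceptron(X, d)
print("weights (bias first):", w)
print("predictions:", predict(w, X))
```

Because the bias is stored as `w[0]`, it receives exactly the same update as the other weights, and the decision threshold is learned rather than fixed by hand.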

Nonlinearities and Activation Functions

Nonlinearities are inherent in most real-world problems.
They are incorporated into the network through nonlinear activation functions (e.g., sigmoid, tanh) and multiple layers.
MLPs commonly use sigmoid functions in hidden layers and linear functions in the output layer.

MLP (MultiLayer Perceptron)

Composed of neurons with nonlinear activation functions in intermediate (hidden) layers.
Only the output layer receives a desired output during training.
The error for hidden layers is estimated by the effect they cause on the output error (backpropagation).

Two-Layer Perceptron Architecture

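A two-layer perceptron (MLP with one hidden layer and one output layer) can approximate any continuous function, linear or not, to arbitrary accuracy on a bounded input domain, provided the hidden layer has enough neurons (Cybenko, 1989).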

Layer 1 (Hidden/Intermediate): Each neuron contributes lines (hyperplanes) to form surfaces in input space, "linearizing" the features.
Layer 2 (Output): Neurons combine these lines to form convex regions, enabling complex decision boundaries.

Number of Neurons:

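The representational capacity of the network increases with the number of neurons, but too many neurons can lead to overfitting and weaker generalization.
As an empirical rule of thumb for small problems, 3–5 neurons per layer strike a good balance between modeling power and computational cost.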

Layer Types:

Input Layer: Receives input patterns.
Hidden Layer(s): Main processing; feature extraction.
Output Layer: Produces the final result.

Main Concepts and Key Formulas

Neuron Activation:

$$ \Huge a = \sum_{i=1}^{p} w_i X_i $$

Output:

$$ \Huge y = f(a) $$

where $\Huge f$ is the activation function (e.g., sigmoid, tanh)

Error Calculation:

$$ \Huge e_k(n) = d_k(n) - y_k(n) $$

Cost Function (Squared Error):

$$ \Huge E(n) = \frac{1}{2} e_k^2(n) $$

Weight Update (Gradient Descent):

$$ \Huge w_{kj}(n+1) = w_{kj}(n) - \eta \frac{\partial E(n)}{\partial w_{kj}} $$

Backpropagation for Output Layer:

$$ \Huge \delta^{(2)}(t) = (d(t) - y(t)) \cdot f'^{(2)}(u) $$

Backpropagation for Hidden Layer:

$$ \Huge \delta_j^{(1)}(t) = \left( \sum_k \delta_k^{(2)} \, w_{kj}^{(2)} \right) \cdot f'^{(1)}(u_j^{(1)}) $$

Training: Two-Phase Process

1. Forward Phase

Set the learning rate $\eta$ and initialize the weight matrix $w$ with random values.
Present input to the first layer.
Each neuron in layer $i$ computes its output, which is passed to the next layer.
The final output is compared to the desired output.
The error for each output neuron is calculated.

- Example Calculation:

Forward Computation Example

- For input values:

$\Large ( X_0 = 1 )$

$\Large ( X_1 = 0.43 )$

$\Large ( X_2 = 0.78 )$

- And example weights:

$\Large ( w^{(1)}_{00} = 0.45 )$

$\Large ( w^{(1)}_{01} = 0.89 )$

etc...

Compute the activations and outputs for each layer using an activation function (e.g., `tanh`):

Compute pre-activation (input to each hidden neuron):

$$ u_j^{(1)} = \sum_i X_i \cdot w_{ji}^{(1)} $$

Compute activation (output from each hidden neuron):

$y^{(1)}_j = \tanh(u^{(1)}_j)$

Compute output layer pre-activation:

$u^{(2)} = \sum_j y^{(1)}_j w^{(2)}_j$

Output of network:

$y^{(2)} = \tanh(u^{(2)})$

Calculate error:

$e = d - y^{(2)}$

$E = \frac{1}{2} e^2$
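The forward computation above can be sketched with NumPy as follows. Only the inputs $X_0$, $X_1$, $X_2$ and the two weights $w^{(1)}_{00} = 0.45$ and $w^{(1)}_{01} = 0.89$ come from the example; every other weight, the desired output `d`, and the choice of two hidden neurons are hypothetical values filled in so that the sketch can run.

```python
import numpy as np

# Inputs from the example; X_0 = 1 acts as the bias input
X = np.array([1.0, 0.43, 0.78])

# W1[j, i] holds w_ji^(1). Only W1[0, 0] = 0.45 and W1[0, 1] = 0.89 are
# given in the text; all remaining entries are hypothetical.
W1 = np.array([[0.45, 0.89, -0.20],
               [0.30, -0.55, 0.70]])

# Output-layer weights w_j^(2) and desired output d: hypothetical values
w2 = np.array([0.60, -0.40])
d = 1.0

u1 = W1 @ X          # pre-activation of each hidden neuron
y1 = np.tanh(u1)     # hidden-layer outputs
u2 = w2 @ y1         # output-layer pre-activation
y2 = np.tanh(u2)     # network output
e = d - y2           # error signal
E = 0.5 * e**2       # cost

print("u1 =", u1)
print("y1 =", y1)
print("y2 =", y2, " e =", e, " E =", E)
```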

2. Backward Phase (Backpropagation)

Start from the output layer.
Each node adjusts its weights to reduce its error.
For hidden layers, the error is determined by the weighted errors of the next layer (chain rule).

Output layer weight update:

$w^{(2)}(t+1) = w^{(2)}(t) + \eta \delta^{(2)} y^{(1)}(t)$

where $\delta^{(2)}(t) = (d(t) - y(t)) \cdot f'^{(2)}(u)$

Hidden layer delta:

$\delta^{(1)}_j(t) = \left( \sum_k \delta^{(2)}_k w^{(2)}_{kj} \right) \cdot f'^{(1)}(u_j)$

Example: Training a Two-Layer Perceptron

1. Initialize all weights randomly.

2. Present an input vector $X$.

3. Compute outputs for the first (hidden) layer:

$u_j^{(1)} = \sum_i X_i w_{ji}^{(1)}$

$y_j^{(1)} = \tanh(u_j^{(1)})$

4. Compute output for the second (output) layer:

$u^{(2)} = \sum_j y^{(1)}_j \cdot w^{(2)}_j$

$y^{(2)} = \tanh(u^{(2)})$

5. Calculate error:

$e = d - y^{(2)}$

$E = \frac{1}{2} e^2$

6. Backward phase:

Compute $\delta^{(2)}$ and update output weights.
Compute $\delta^{(1)}$ for each hidden neuron and update hidden weights.
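The six steps can be put together in a short training loop. The sketch below is one possible reading of the procedure, not the course's reference implementation: it trains on a single input pattern, uses three hidden neurons, and assumes the hidden weights are updated with the usual analogue of the output rule, $w^{(1)}_{ji} \leftarrow w^{(1)}_{ji} + \eta \, \delta^{(1)}_j X_i$. The desired output, seed, and learning rate are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 inputs (X_0 = 1 is the bias input), 3 hidden neurons
X = np.array([1.0, 0.43, 0.78])
d = 0.5                                    # hypothetical desired output
W1 = rng.uniform(-0.5, 0.5, size=(3, 3))   # W1[j, i] = w_ji^(1)
w2 = rng.uniform(-0.5, 0.5, size=3)        # w2[j] = w_j^(2)
eta = 0.1

errors = []
for step in range(200):
    # Forward phase
    y1 = np.tanh(W1 @ X)
    y2 = np.tanh(w2 @ y1)
    e = d - y2
    errors.append(0.5 * e**2)

    # Backward phase; for tanh, f'(u) = 1 - tanh(u)^2
    delta2 = e * (1.0 - y2**2)
    delta1 = delta2 * w2 * (1.0 - y1**2)   # uses w2 before it is updated

    # Weight updates for the output and hidden layers
    w2 = w2 + eta * delta2 * y1
    W1 = W1 + eta * np.outer(delta1, X)

print(f"E at step 0: {errors[0]:.6f}")
print(f"E at step 199: {errors[-1]:.6f}")
```

Computing `delta1` before `w2` is overwritten matters: the hidden deltas should be based on the same weights that produced the forward pass.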

Why Two Layers and 3–5 Neurons per Layer?

Theoretical Power: Two-layer MLPs can approximate any continuous function (universal approximation theorem).
Practical Simplicity: Many problems do not require more than two layers.
Cost-Benefit: 3–5 neurons per layer often provide sufficient capacity for generalization without excessive computational cost.

Local Minima and Maxima

In gradient descent training, the algorithm updates weights to reduce error by following the negative gradient of the cost function. However, the cost function may have multiple local maxima or minima.

Local Minimum: A point where the cost function is lower than at all nearby points but is not the lowest point globally. A local maximum is the mirror case: a peak relative to nearby points that is not the highest point globally.
Because it only moves downhill, gradient descent can get "stuck" in a local minimum, or stall wherever the gradient is close to zero, preventing the network from reaching the best possible solution.
Techniques such as random restarts, momentum, or advanced optimization algorithms help mitigate this problem.
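A small one-dimensional experiment makes the effect of the starting point visible. The cost function below, $E(w) = w^4 - 3w^2 + w$, is a hypothetical example chosen because it has two minima of different depth; it is not a neural network cost. The same gradient descent routine is run from two starting points, and the better result is kept, which is the idea behind restarts.

```python
def cost(w):
    # Hypothetical non-convex cost with two minima
    return w**4 - 3 * w**2 + w


def grad(w):
    # Derivative of the cost above
    return 4 * w**3 - 6 * w + 1


def gradient_descent(w0, eta=0.01, steps=1000):
    w = w0
    for _ in range(steps):
        w -= eta * grad(w)   # step against the gradient
    return w


starts = [-2.0, 2.0]
solutions = [gradient_descent(w0) for w0 in starts]
best = min(solutions, key=cost)

for w0, w in zip(starts, solutions):
    print(f"start {w0:+.1f} -> w = {w:+.4f}, cost = {cost(w):+.4f}")
print(f"best of the restarts: w = {best:+.4f}")
```

The run started at $w = 2$ settles in the shallower minimum on the right, while the run started at $w = -2$ reaches the deeper one on the left. Neither run can tell on its own which kind of minimum it found.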

Usage

Artificial Neural Networks, especially perceptrons and MLPs, are widely used in various domains due to their adaptability and ability to model complex nonlinear relationships.

Strengths

Ability to learn from examples and generalize to unseen data.
Fault tolerance and robustness to noisy inputs.
Flexibility to model complex, nonlinear functions.
Parallel processing capability.

Weaknesses

Training can be computationally expensive, especially for large networks.
Susceptible to getting stuck in local minima.
Requires careful tuning of hyperparameters (learning rate, number of neurons, layers).

Lack of interpretability compared to simpler models.

Additional Relevant Points

Learning Rate (η) Importance

The learning rate $\eta$ controls the step size during weight updates:

If $\eta$ is too large, the training may overshoot minima and fail to converge.
If $\eta$ is too small, training will be very slow and may get stuck in local minima.
Adaptive learning rate methods (e.g., learning rate decay, Adam optimizer) can improve convergence.
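The effect of the step size can be checked on the simplest possible cost, $E(w) = w^2$, whose gradient is $2w$. The cost function, the starting point, and the three learning rates below are hypothetical values picked to show the slow, well-behaved, and divergent regimes.

```python
def run_gradient_descent(eta, w0=1.0, steps=50):
    """Minimize E(w) = w^2 (gradient 2w) and return the final w."""
    w = w0
    for _ in range(steps):
        w -= eta * 2 * w
    return w


for eta in (0.001, 0.1, 1.1):
    w_final = run_gradient_descent(eta)
    print(f"eta = {eta}: w after 50 steps = {w_final:.4g}")
```

Each update multiplies $w$ by $(1 - 2\eta)$. With $\eta = 0.001$ that factor is close to 1 and progress is slow; with $\eta = 0.1$ the iterates shrink quickly toward the minimum at 0; with $\eta = 1.1$ the factor has magnitude greater than 1, so every step overshoots and the iterates grow.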

Activation Functions

While the document mentions sigmoid and tanh, it is useful to note:

ReLU (Rectified Linear Unit):
Widely used in modern neural networks for faster convergence and to mitigate vanishing gradient problems.
Softmax:
Commonly used in output layers for multi-class classification problems.
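For reference, both functions fit in a few lines of NumPy. This is a minimal sketch; the input vector is a hypothetical example, and subtracting the maximum inside `softmax` is a common numerical-stability trick rather than part of the definition.

```python
import numpy as np


def relu(u):
    """ReLU: passes positive values through and maps negatives to 0."""
    return np.maximum(0.0, u)


def softmax(u):
    """Softmax: turns a vector of scores into probabilities that sum to 1."""
    shifted = u - np.max(u)          # avoids overflow in exp for large scores
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


u = np.array([-1.0, 0.0, 2.0])       # hypothetical pre-activations
print("relu:   ", relu(u))
print("softmax:", softmax(u))
```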

Overfitting and Regularization

Neural networks with too many parameters can overfit training data, performing poorly on unseen data.
Techniques such as early stopping, dropout, and L2 regularization help improve generalization.

Batch vs. Online Learning

The document discusses iterative weight updates per sample (online/stochastic gradient descent).
In practice, batch or mini-batch gradient descent is often used for computational efficiency and stability.

Practical Considerations

Data preprocessing (normalization, encoding) is crucial for effective training.
Initialization of weights affects convergence speed and final performance.
Monitoring training with validation sets helps detect overfitting.

Algorithms Used to Train Machine Learning Models

Gradient Descent

Gradient Descent is a mathematical optimization method primarily used for minimizing differentiable multivariate functions. It is a first-order iterative algorithm that adjusts model parameters to find the minimum value of a function, typically representing an error or cost to minimize.

The way gradient descent works can be explained as follows: Imagine standing on top of a hill wanting to reach the lowest point in a valley. In algorithm terms, you start with initial parameter values and calculate the slope (gradient) of the cost function with respect to these parameters. This slope shows the steepest ascent direction. To minimize the function, you take a step in the opposite direction, "descending the slope" toward the lowest point.

These steps are repeated iteratively, adjusting the model parameters opposite to the gradient direction until the algorithm converges to the minimum. The step size is controlled by a learning rate that defines how big the adjustments are at each iteration.

1. Gradient Descent (Batch)

Gradient Descent is an iterative algorithm to minimize a cost function by adjusting parameters opposite to the gradient direction. Batch Gradient Descent calculates the gradient using the entire dataset each step, resulting in stable but sometimes slow parameter updates.
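As a minimal sketch, batch gradient descent for a one-variable linear regression could look like this. The data are hypothetical and noiseless (generated from $y = 1 + 2x$), the cost is assumed to be $\frac{1}{2n}\sum (\hat{y} - y)^2$, and the learning rate and step count are illustrative.

```python
import numpy as np

# Hypothetical noiseless data generated from y = 1 + 2x
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x
X = np.column_stack([np.ones_like(x), x])   # column of ones for the intercept


def batch_gradient_descent(X, y, eta=0.5, steps=2000):
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(steps):
        # Gradient of (1 / 2n) * sum of squared errors, using ALL samples
        gradient = X.T @ (X @ w - y) / n
        w -= eta * gradient
    return w


w = batch_gradient_descent(X, y)
print("intercept, slope:", w)
```

Every update touches the whole dataset, which is why the path toward the minimum is smooth but each step becomes expensive as the dataset grows.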

2. Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent updates parameters based on a single random sample per iteration. This yields noisier but faster updates, suitable for large datasets and deep learning models.
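A possible sketch of SGD on a small linear regression follows. The data are hypothetical and noiseless (from $y = 1 + 2x$), and the samples are visited in a freshly shuffled order each epoch, which is one common way to implement the "single random sample" idea; the learning rate, epoch count, and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noiseless data generated from y = 1 + 2x
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x
X = np.column_stack([np.ones_like(x), x])   # column of ones for the intercept


def stochastic_gradient_descent(X, y, eta=0.1, epochs=500):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):   # visit samples in random order
            error = X[i] @ w - y[i]         # prediction error on ONE sample
            w -= eta * error * X[i]         # update from that sample alone
    return w


w = stochastic_gradient_descent(X, y)
print("intercept, slope:", w)
```

Each update is cheap because it uses one sample, but individual steps point in noisier directions than the full-batch gradient.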

3. Elastic Net Regularization

Elastic Net combines L1 (Lasso) and L2 (Ridge) penalties to improve the performance of linear regression models, particularly when many variables are correlated. It helps prevent overfitting and performs automatic feature selection, making it a powerful tool for machine learning modeling.
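To make the combined penalty concrete, the sketch below evaluates an Elastic Net cost with NumPy. It assumes the parameterization documented for scikit-learn's `ElasticNet`, with an overall strength `alpha` and a mixing parameter `l1_ratio`; the function name and the toy data are hypothetical.

```python
import numpy as np


def elastic_net_cost(X, y, w, alpha=1.0, l1_ratio=0.5):
    """Squared-error term plus a weighted mix of L1 and L2 penalties."""
    n = len(y)
    residual = X @ w - y
    data_term = (residual @ residual) / (2 * n)
    l1_penalty = alpha * l1_ratio * np.sum(np.abs(w))            # Lasso part
    l2_penalty = 0.5 * alpha * (1 - l1_ratio) * np.sum(w**2)     # Ridge part
    return data_term + l1_penalty + l2_penalty


# Hypothetical toy problem where the weights fit the data exactly,
# so the whole cost comes from the penalties
X = np.eye(2)
w = np.array([1.0, -2.0])
y = X @ w

print(elastic_net_cost(X, y, w, alpha=1.0, l1_ratio=0.5))
```

With `l1_ratio=1.0` only the L1 term remains (Lasso), and with `l1_ratio=0.0` only the L2 term remains (Ridge); values in between blend the two.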

4. Mini-batch Gradient Descent

Mini-batch Gradient Descent is a compromise between batch and stochastic descent. It updates parameters using small random batches, accelerating convergence with reduced noise.
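The compromise can be sketched with a single `batch_size` parameter. In the illustrative code below, setting `batch_size` to the dataset size gives batch gradient descent and setting it to 1 gives a stochastic variant; the data (noiseless, from $y = 1 + 2x$), learning rate, and seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noiseless data generated from y = 1 + 2x
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x
X = np.column_stack([np.ones_like(x), x])   # column of ones for the intercept


def minibatch_gradient_descent(X, y, batch_size=5, eta=0.5, epochs=500):
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)                  # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]   # indices of one mini-batch
            gradient = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= eta * gradient
    return w


w = minibatch_gradient_descent(X, y)
print("intercept, slope:", w)
```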

5. Adam (Adaptive Moment Estimation)

An algorithm combining momentum and adaptive learning rates to improve convergence and training efficiency, especially in deep neural networks.
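A bare-bones version of the Adam update rule, applied to a one-dimensional problem, might look like the sketch below. The hyperparameter defaults follow the values commonly quoted for Adam; the function name, the cost $E(w) = (w - 3)^2$, the learning rate, and the step count are assumptions made for the illustration.

```python
import math


def adam_minimize(grad, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # momentum: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1**t)             # bias correction for the zero start
        v_hat = v / (1 - beta2**t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)   # adaptive step size
    return w


# Hypothetical cost E(w) = (w - 3)^2 with gradient 2 * (w - 3)
w = adam_minimize(lambda w: 2 * (w - 3.0), w0=0.0)
print("w after Adam:", w)
```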

6. RMSProp

Adapts the learning rate for each parameter, useful to accelerate training and avoid oscillations.
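The per-parameter adaptation is easiest to see on a cost whose two directions have very different curvature. The sketch below applies a plain RMSProp update to the hypothetical cost $E(w) = \frac{1}{2}(100\,w_0^2 + w_1^2)$; the function name, decay rate, learning rate, and step count are illustrative assumptions.

```python
import numpy as np


def rmsprop_minimize(grad, w0, lr=0.01, decay=0.9, eps=1e-8, steps=1000):
    w = np.array(w0, dtype=float)
    s = np.zeros_like(w)                          # running mean of squared gradients
    for _ in range(steps):
        g = grad(w)
        s = decay * s + (1 - decay) * g**2
        w -= lr * g / (np.sqrt(s) + eps)          # each parameter gets its own scale
    return w


# Hypothetical ill-conditioned cost: E(w) = 0.5 * (100 * w0^2 + w1^2)
curvature = np.array([100.0, 1.0])
w = rmsprop_minimize(lambda w: curvature * w, w0=[1.0, 1.0])
print("w after RMSProp:", w)
```

Dividing by the running root-mean-square of each gradient component makes the two coordinates advance at a similar pace even though their raw gradients differ by a factor of 100. With a constant learning rate the iterates end up hovering close to the minimum rather than landing on it exactly.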

II - Artificial Neural Networks (ANN) - Comprehensive Theory, Use Cases, and Python Code Guide

Required Libraries Installation

Before running any code, install the necessary Python libraries appropriate for your operating system and environment:

Cell 1 - Installation Commands

# Jupyter Notebook or Google Colab (IPython magic command):

%pip install numpy matplotlib tensorflow scikit-learn tensorflow-datasets

# macOS or Linux Terminal, Windows Command Prompt or PowerShell:

pip install numpy matplotlib tensorflow scikit-learn tensorflow-datasets

This will install:

numpy: Numerical computations

matplotlib: Plotting and visualization

tensorflow: Deep learning framework

scikit-learn: Machine learning utilities

tensorflow-datasets: Loading datasets like MNIST easily

0. Understanding Tensors and Loading MNIST Dataset

0.1 What is a Tensor?

- Concept:

A tensor generalizes scalars (0-D), vectors (1-D), and matrices (2-D) to n-dimensional arrays. Data in neural networks (inputs, weights, activations) are represented as tensors.
Understanding tensors is essential for deep learning frameworks.

- Use Case:

Manages multi-dimensional data like images (3D tensors with height, width, channels) or batches of images (4D tensors).

- Code:

import tensorflow as tf

# Note: the literal values below are illustrative placeholders.

# 0-D scalar

scalar = tf.constant(42)
print("Scalar:", scalar, "Shape:", scalar.shape)

# 1-D vector

vector = tf.constant([1, 2, 3])
print("Vector:", vector, "Shape:", vector.shape)

# 2-D matrix

matrix = tf.constant([[1, 2], [3, 4]])
print("Matrix:\n", matrix.numpy())
print("Shape:", matrix.shape)

# 3-D tensor (example: color image 2x2 pixels with 3 color channels)

tensor_3d = tf.constant([[[255, 0, 0], [0, 255, 0]],
                         [[0, 0, 255], [255, 255, 0]]])
print("3D tensor:\n", tensor_3d.numpy())
print("Shape:", tensor_3d.shape)

0.2 Loading MNIST Dataset from TensorFlow Datasets

- Concept:

The MNIST dataset can be loaded with TensorFlow Datasets, which downloads and caches it automatically, so no manual download management is needed.

