
Commit ade6265

added content for calculus
1 parent 8231e14 commit ade6265

File tree

6 files changed: +586 −0 lines changed
Lines changed: 110 additions & 0 deletions

---
title: "Chain Rule - The Engine of Backpropagation"
sidebar_label: Chain Rule
description: "Mastering the Chain Rule, the fundamental calculus tool for differentiating composite functions, and its direct application in the Backpropagation algorithm for training neural networks."
tags:
  [
    chain-rule,
    calculus,
    mathematics-for-ml,
    backpropagation,
    composite-functions,
    neural-networks,
    gradient,
  ]
---

The **Chain Rule** is a formula used to compute the derivative of a **composite function**, a function built by nesting one function inside another. If a function is built like a chain, the Chain Rule shows us how to differentiate it link by link.

This is arguably the most important calculus concept for Deep Learning, as the entire structure of a neural network is one massive composite function.

## 1. What is a Composite Function?

A composite function is one where the output of an inner function becomes the input of an outer function.

If $y$ is a function of $u$, and $u$ is a function of $x$, then $y$ is ultimately a function of $x$.

$$
y = f(u) \quad \text{where} \quad u = g(x)
$$

The overall composite function is $y = f(g(x))$.

## 2. The Chain Rule Formula (Single Variable)

The Chain Rule states that the rate of change of $y$ with respect to $x$ is the product of the rate of change of $y$ with respect to $u$ and the rate of change of $u$ with respect to $x$:

$$
\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}
$$

### Example

Let $y = (x^2 + 1)^3$. This can be written as $y = u^3$ where $u = x^2 + 1$.

1. **Find $\frac{dy}{du}$ (Outer derivative):**

   $$
   \frac{dy}{du} = \frac{d}{du}(u^3) = 3u^2
   $$

2. **Find $\frac{du}{dx}$ (Inner derivative):**

   $$
   \frac{du}{dx} = \frac{d}{dx}(x^2 + 1) = 2x
   $$

3. **Apply the Chain Rule:**

   $$
   \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = (3u^2) \cdot (2x)
   $$

4. **Substitute $u$ back:**

   $$
   \frac{dy}{dx} = 3(x^2 + 1)^2 \cdot 2x = 6x(x^2 + 1)^2
   $$
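
A quick numerical sanity check makes the result concrete. This short Python sketch (our addition, not part of the lesson) compares the Chain Rule answer against a secant-slope approximation:

```python
# Numerical check of the Chain Rule result dy/dx = 6x(x^2 + 1)^2.

def f(x):
    """The composite function y = (x^2 + 1)^3."""
    return (x**2 + 1) ** 3

def chain_rule_derivative(x):
    """The derivative derived above via the Chain Rule."""
    return 6 * x * (x**2 + 1) ** 2

x, h = 2.0, 1e-6
numerical = (f(x + h) - f(x)) / h   # secant slope with a tiny step

print(chain_rule_derivative(x))  # 300.0
print(numerical)                 # ~300.0, matching to several decimals
```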

## 3. The Chain Rule with Multiple Variables (Partial Derivatives)

In neural networks, one variable can affect the final output through multiple different paths. This requires a slightly more general version of the Chain Rule involving partial derivatives and summation.

Suppose $z$ is a function of $x$ and $y$, and both $x$ and $y$ are functions of $t$: $z = f(x, y)$, where $x = g(t)$ and $y = h(t)$.

The total derivative of $z$ with respect to $t$ sums the contribution of each path:

$$
\frac{dz}{dt} = \frac{\partial z}{\partial x} \frac{dx}{dt} + \frac{\partial z}{\partial y} \frac{dy}{dt}
$$
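
To illustrate, here is a small sketch with an example of our own choosing ($z = x^2 y$ along the path $x = \cos t$, $y = \sin t$); it confirms that summing the two paths reproduces the numerical derivative:

```python
import math

# Total derivative dz/dt for z = x^2 * y with x = cos(t), y = sin(t).
# Each term below is one "path" through which t influences z.

def z_of_t(t):
    x, y = math.cos(t), math.sin(t)
    return x**2 * y

def total_derivative(t):
    x, y = math.cos(t), math.sin(t)
    dz_dx, dz_dy = 2 * x * y, x**2             # partials of z = x^2 * y
    dx_dt, dy_dt = -math.sin(t), math.cos(t)   # derivatives of the two paths
    return dz_dx * dx_dt + dz_dy * dy_dt       # sum over paths

t, h = 0.7, 1e-6
print(total_derivative(t))               # analytic total derivative
print((z_of_t(t + h) - z_of_t(t)) / h)   # numerical check, should agree
```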

## 4. The Chain Rule and Backpropagation

Backpropagation (short for "backward propagation of errors") is the algorithm used to train neural networks. It is nothing more than the repeated application of the multivariate Chain Rule.

### The Neural Network Chain

A neural network is a sequence of composite functions:

$$
\text{Loss} \leftarrow \text{Output Layer} \leftarrow \text{Hidden Layer 2} \leftarrow \text{Hidden Layer 1} \leftarrow \text{Input}
$$

The goal is to calculate how a small change in a parameter (weight $w$) in an **early layer** affects the final **Loss** $J$:

$$
\frac{\partial J}{\partial w_{\text{early}}} = \left(\frac{\partial J}{\partial \text{Output}}\right) \cdot \left(\frac{\partial \text{Output}}{\partial \text{Layer } 2}\right) \cdot \left(\frac{\partial \text{Layer } 2}{\partial \text{Layer } 1}\right) \cdot \left(\frac{\partial \text{Layer } 1}{\partial w_{\text{early}}}\right)
$$

:::important Backpropagation Flow
1. **Forward Pass:** Calculate the prediction and the Loss $J$.
2. **Backward Pass (Backprop):** Start at the end of the chain (the Loss $J$) and calculate the partial derivatives (gradients) layer by layer, multiplying them backward toward the input.
3. **Update:** Use the final calculated gradient $\frac{\partial J}{\partial w}$ to update the weight $w$ via Gradient Descent.
:::
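
The toy sketch below walks through these three steps for the smallest possible "network", one weight per layer. The sigmoid activation, squared-error loss, and all starting values are illustrative choices of ours, not prescribed by the lesson:

```python
import numpy as np

# Toy backpropagation: hidden h = sigmoid(w1 * x), prediction p = w2 * h,
# loss J = (p - target)^2. Gradients come from the Chain Rule, factor by factor.

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x, target = 1.5, 0.0
w1, w2, alpha = 0.8, -0.5, 0.1

for step in range(3):
    # 1. Forward pass: compute the prediction and the loss.
    a = w1 * x
    h = sigmoid(a)
    p = w2 * h
    J = (p - target) ** 2

    # 2. Backward pass: multiply local derivatives from the loss backward.
    dJ_dp = 2 * (p - target)
    dJ_dw2 = dJ_dp * h              # since p = w2 * h
    dJ_dh = dJ_dp * w2
    dJ_da = dJ_dh * h * (1 - h)     # sigmoid'(a) = h * (1 - h)
    dJ_dw1 = dJ_da * x              # since a = w1 * x

    # 3. Update: gradient descent on both weights.
    w1 -= alpha * dJ_dw1
    w2 -= alpha * dJ_dw2
    print(f"step {step}: loss = {J:.4f}")  # loss shrinks each step
```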

## 5. Summary of Calculus for ML

You have now covered the three foundational concepts of Calculus required for Machine Learning:

| Concept | Mathematical Tool | ML Application |
| :--- | :--- | :--- |
| **Derivatives** | $\frac{df}{dx}$ | Measures the slope of the loss function. |
| **Partial Derivatives** | $\nabla J$ (the Gradient) | Identifies the direction of steepest ascent on the high-dimensional loss surface. |
| **Chain Rule** | $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$ | Propagates the gradient backward through all layers of a neural network to calculate parameter updates. |

---

With the mathematical foundations of Linear Algebra and Calculus established, we are now ready to tackle the core optimization algorithm that brings these concepts together: Gradient Descent.

Lines changed: 100 additions & 0 deletions

---
title: "Derivatives - The Rate of Change"
sidebar_label: Derivatives
description: "An introduction to derivatives, their definition, rules, and their crucial role in calculating the slope of the loss function, essential for optimization algorithms like Gradient Descent."
tags:
  [
    derivatives,
    calculus,
    mathematics-for-ml,
    rate-of-change,
    slope,
    gradient-descent,
    optimization,
  ]
---

Calculus is the mathematical foundation for optimization in Machine Learning. Specifically, **Derivatives** are the primary tool used to train almost every ML model, from Linear Regression to complex Neural Networks, via algorithms like Gradient Descent.

## 1. What is a Derivative?

The derivative of a function measures the **instantaneous rate of change** of that function. Geometrically, the derivative at any point on a curve is the **slope of the tangent line** to the curve at that point.

### Formal Definition

The derivative of a function $f(x)$ with respect to $x$ is defined using limits:

$$
f'(x) = \frac{dy}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}
$$

* $\frac{dy}{dx}$ is the common notation, read as "the derivative of $y$ with respect to $x$."
* The expression $\frac{f(x+h) - f(x)}{h}$ is the slope of the secant line between $x$ and $x+h$.
* Taking the limit as $h$ approaches zero gives the exact slope of the tangent line at $x$, as the numerical sketch below illustrates.
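
A short Python sketch (our addition, using $f(x) = x^2$ as an assumed example) shows the secant slope converging to the true derivative as $h$ shrinks:

```python
# Secant slopes approaching the tangent slope as h -> 0.
# Example function: f(x) = x^2, whose exact derivative is f'(x) = 2x.

def f(x):
    return x**2

x = 3.0
for h in [0.1, 0.01, 0.001, 1e-6]:
    secant_slope = (f(x + h) - f(x)) / h
    print(f"h = {h:g}: secant slope = {secant_slope:.6f}")  # -> f'(3) = 6
```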

## 2. Derivatives in Machine Learning: Optimization

In Machine Learning, we define a **Loss Function** (or Cost Function) $J(\theta)$ which measures the error of our model, where $\theta$ represents the model's parameters (weights and biases).

The goal of training is to find the parameter values $\theta$ that **minimize** the loss function.

### A. Finding the Minimum

1. A function's minimum (or maximum) occurs where the slope is zero.
2. The derivative tells us the slope.
3. Therefore, by setting the derivative $\frac{dJ}{d\theta}$ to zero, we can find the optimal parameters $\theta$.

### B. Gradient Descent

For most ML models, the loss function is too complicated to minimize by setting the derivative to zero and solving directly. Instead, we use an iterative process called **Gradient Descent**.

The derivative $\frac{dJ}{d\theta}$ tells us two things:

* **Magnitude:** How steep the slope is (how quickly the loss is changing).
* **Direction (Sign):** Whether moving parameter $\theta$ in a positive direction will increase or decrease the loss.

In Gradient Descent, we update the parameter $\theta$ in the **opposite direction** of the derivative (down the slope) to find the minimum:

$$
\theta_{\text{new}} = \theta_{\text{old}} - \alpha \frac{dJ}{d\theta}
$$

* $\alpha$ (alpha) is the **learning rate** (a small scalar).
* $\frac{dJ}{d\theta}$ is the derivative (the slope/gradient).

## 3. Basic Differentiation Rules

You should be familiar with the following rules to understand how derivatives are calculated for model training.

| Rule Name | Function $f(x)$ | Derivative $\frac{d}{dx}f(x)$ | Example |
| :--- | :--- | :--- | :--- |
| **Constant Rule** | $c$ | $0$ | $\frac{d}{dx}(5) = 0$ |
| **Power Rule** | $x^n$ | $nx^{n-1}$ | $\frac{d}{dx}(x^3) = 3x^2$ |
| **Constant Multiple** | $c \cdot f(x)$ | $c \cdot f'(x)$ | $\frac{d}{dx}(4x^2) = 8x$ |
| **Sum/Difference** | $f(x) \pm g(x)$ | $f'(x) \pm g'(x)$ | $\frac{d}{dx}(x^2 - 3x) = 2x - 3$ |
| **Exponential** | $e^x$ | $e^x$ | $\frac{d}{dx}(3e^x) = 3e^x$ |

### Example: Quadratic Loss

Linear Regression often uses Mean Squared Error (MSE), which is a quadratic function of the weights $w$.

Let the simplified loss function be $J(w) = w^2 + 4w + 1$.
We apply the Sum and Power Rules:

$$
\frac{dJ}{dw} = \frac{d}{dw}(w^2) + \frac{d}{dw}(4w) + \frac{d}{dw}(1) = 2w + 4 + 0 = 2w + 4
$$

If the current weight is $w=1$, the slope is $2(1) + 4 = 6$: steep and positive, so Gradient Descent will push $w$ in the negative direction, as the sketch below shows.
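
Here is a minimal sketch of that update loop on $J(w) = w^2 + 4w + 1$ (the learning rate and iteration count are our own assumptions):

```python
# Gradient descent on J(w) = w^2 + 4w + 1, using dJ/dw = 2w + 4.
# Setting 2w + 4 = 0 shows the true minimizer is w = -2.

def dJ_dw(w):
    return 2 * w + 4

w, alpha = 1.0, 0.1   # start at w = 1, where the slope is +6
for step in range(25):
    w = w - alpha * dJ_dw(w)   # move against the slope

print(w)  # approximately -2.0, the minimizer
```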

## References and Resources

To solidify your understanding of differentiation, here are some excellent resources:

* **[Khan Academy - Differential Calculus](https://www.khanacademy.org/math/differential-calculus):** Comprehensive video tutorials covering limits, derivatives, and rules. Excellent for visual learners.
* **Calculus: Early Transcendentals** by James Stewart (or any similar major textbook): Provides rigorous definitions and practice problems.

---

Most functions in ML depend on more than one parameter (e.g., $w_1, w_2, \text{bias}$). To find the slope in these multi-variable spaces, we must use Partial Derivatives.

Lines changed: 93 additions & 0 deletions

---
title: "Gradients - The Direction of Steepest Ascent"
sidebar_label: Gradients
description: "Defining the Gradient vector, its mathematical composition from partial derivatives, its geometric meaning as the direction of maximum increase, and its role as the central mechanism for learning in Machine Learning."
tags:
  [
    gradients,
    calculus,
    mathematics-for-ml,
    gradient-descent,
    vector-calculus,
    optimization,
    loss-function,
  ]
---

The **Gradient** is the ultimate expression of calculus in Machine Learning. It is the vector that consolidates all the partial derivatives of a multi-variable function (like our Loss Function) and points in the direction in which the function increases most rapidly.

Understanding the gradient is essential because the primary optimization algorithm, **Gradient Descent**, simply involves moving in the direction *opposite* to the gradient.

## 1. Defining the Gradient Vector

The gradient of a scalar-valued function $f$ of several variables ($\theta_1, \theta_2, \dots, \theta_n$) is a **vector** that contains all of the function's partial derivatives.

### Notation

The gradient of a function $J(\mathbf{\theta})$ (our Loss Function, $J$) with respect to the parameter vector $\mathbf{\theta}$ is denoted by the $\nabla$ symbol (nabla or del):

$$
\nabla J(\mathbf{\theta}) \quad \text{or} \quad \nabla_{\mathbf{\theta}} J
$$

### Composition

If the loss function $J$ depends on $n$ parameters, $\mathbf{\theta} = (\theta_1, \theta_2, \dots, \theta_n)$, the gradient is the $n$-dimensional vector:

$$
\nabla J(\mathbf{\theta}) = \begin{bmatrix}
\frac{\partial J}{\partial \theta_1} \\
\frac{\partial J}{\partial \theta_2} \\
\vdots \\
\frac{\partial J}{\partial \theta_n}
\end{bmatrix}
$$
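
The sketch below builds this vector numerically for an assumed two-parameter loss, $J(\mathbf{\theta}) = \theta_1^2 + 3\theta_1\theta_2$ (our example, not from the text), by perturbing one parameter at a time:

```python
import numpy as np

# Assembling the gradient: one partial derivative per parameter.

def J(theta):
    # Example loss: J = theta_1^2 + 3 * theta_1 * theta_2
    return theta[0]**2 + 3 * theta[0] * theta[1]

def numerical_gradient(J, theta, h=1e-6):
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        bumped = theta.copy()
        bumped[i] += h                        # nudge only parameter i
        grad[i] = (J(bumped) - J(theta)) / h  # partial w.r.t. theta_i
    return grad

theta = np.array([1.0, 2.0])
print(numerical_gradient(J, theta))  # ~[8.0, 3.0] = [2*1 + 3*2, 3*1]
```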

## 2. Geometric Meaning

The gradient $\nabla J$ has two crucial geometric properties:

1. **Direction:** It points in the direction of the **steepest increase** (the fastest way uphill) on the function's surface.
2. **Magnitude (Length):** The length of the gradient vector, $||\nabla J||$, indicates the **steepness** of the slope in that direction.

## 3. The Central Role in Gradient Descent

Since the goal of training an ML model is to **minimize** the Loss Function $J(\mathbf{\theta})$, we must adjust the parameters $\mathbf{\theta}$ to move *downhill*.

The most effective local path downhill is to move in the exact opposite direction of the gradient.

### A. The Update Rule

The Gradient Descent update rule formalizes this movement:

$$
\mathbf{\theta}_{\text{new}} = \mathbf{\theta}_{\text{old}} - \alpha \nabla J(\mathbf{\theta}_{\text{old}})
$$

| Term | Role in Optimization | Calculation |
| :--- | :--- | :--- |
| $\mathbf{\theta}_{\text{old}}$ | Current position (weights/biases). | Vector of current model parameters. |
| $\alpha$ (Alpha) | **Learning Rate** (a small scalar). | Hyperparameter defining the step size. |
| $\nabla J(\mathbf{\theta})$ | **Gradient** of the Loss. | Vector of all partial derivatives. |
| $-\nabla J(\mathbf{\theta})$ | **Negative Gradient**. | The direction of steepest descent (downhill). |

### B. Convergence

As the parameters approach the minimum (the "valley floor"), the slope of the Loss Function flattens.

* At the minimum point, the Loss is flat, so all partial derivatives are zero.
* Therefore, the gradient $\nabla J$ is the zero vector ($\mathbf{0}$).
* The update step becomes $\mathbf{\theta}_{\text{new}} = \mathbf{\theta}_{\text{old}} - \alpha \cdot \mathbf{0}$: the parameters stop changing, and the model has converged. The sketch after this list shows the whole loop in action.
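
A minimal sketch of the full loop, assuming the simple bowl-shaped loss $J(\mathbf{\theta}) = \theta_1^2 + \theta_2^2$ (our choice) with gradient $[2\theta_1, 2\theta_2]$:

```python
import numpy as np

# Gradient descent on J(theta) = theta_1^2 + theta_2^2, minimized at the origin.

def grad_J(theta):
    return 2 * theta   # [2*theta_1, 2*theta_2]

theta = np.array([3.0, -4.0])   # arbitrary starting point
alpha = 0.1

for step in range(50):
    theta = theta - alpha * grad_J(theta)   # step in the downhill direction

print(theta)                          # ~[0, 0]
print(np.linalg.norm(grad_J(theta)))  # gradient norm ~0: converged
```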

## 4. Analogy: Descending a Mountain

Imagine being blindfolded on a vast mountain range (the Loss Surface). Your goal is to reach the valley floor (the minimum loss).

* **You can't see the whole mountain:** You only know your local height and slope (your current loss $J(\mathbf{\theta})$).
* **The Gradient ($\nabla J$):** A guide who tells you, "The fastest way to go **up** from here is to take 3 steps North and 1 step East."
* **Gradient Descent:** You do the opposite of what the guide says, taking 3 steps South and 1 step West.
* **Learning Rate ($\alpha$):** Determines whether your step size is a cautious hop or a giant leap.

---

The Gradient unites all the calculus concepts we have covered: it moves the model from an initial, poor starting position to an optimal, converged solution.

Lines changed: 84 additions & 0 deletions

---
title: "The Hessian Matrix"
sidebar_label: Hessian
description: "Understanding the Hessian matrix, second-order derivatives, and how the curvature of the loss surface impacts optimization and model stability."
tags:
  [
    hessian,
    calculus,
    mathematics-for-ml,
    optimization,
    second-order-derivatives,
    curvature,
  ]
---

While the **Gradient** tells us the direction of the steepest slope, it doesn't tell us about the "shape" of the ground. Is the slope getting steeper or flatter? Are we in a narrow canyon or a wide, shallow bowl? To answer these questions, we need second-order derivatives, organized into the **Hessian Matrix**.

## 1. What is the Hessian?

The Hessian is a square matrix of **second-order partial derivatives** of a scalar-valued function. It describes the local **curvature** of the function.

If we have a function $f(x_1, x_2, \dots, x_n)$, the Hessian $\mathbf{H}$ is an $n \times n$ matrix:

$$
\mathbf{H} = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \dots \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \dots \\
\vdots & \vdots & \ddots
\end{bmatrix}
$$

:::info Symmetry
If the second derivatives are continuous, the Hessian is a **symmetric matrix** (i.e., $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$). This makes it easier to work with using Linear Algebra tools like Eigen-decomposition.
:::

## 2. Why does the Hessian matter in ML?

The Hessian helps us understand the "topography" of the Loss Function $J(\theta)$.

### A. Determining Maxima and Minima

The gradient can only tell us that the slope is zero ($\nabla J = 0$) at a critical point, but that point could be a peak, a valley, or a saddle point. The Hessian tells us which one:

* **Positive Definite Hessian:** The surface curves upward in all directions (a **Local Minimum**).
* **Negative Definite Hessian:** The surface curves downward in all directions (a **Local Maximum**).
* **Indefinite Hessian:** The surface curves up in some directions and down in others (a **Saddle Point**).

### B. Curvature and Learning Rates

The Hessian determines the "width" of the valley:

* **High Curvature:** A narrow, steep valley. If the learning rate is too high, Gradient Descent will bounce back and forth across the valley walls.
* **Low Curvature:** A wide, flat valley. Gradient Descent will move very slowly toward the bottom.

## 3. Second-Order Optimization

Standard Gradient Descent is a **first-order** method; it uses only the gradient. **Second-order** methods, like **Newton's Method**, use the Hessian to take much more efficient steps.

Instead of just moving in the negative gradient direction, Newton's Method scales the step by the inverse of the Hessian:

$$
\theta_{\text{new}} = \theta_{\text{old}} - \mathbf{H}^{-1} \nabla J(\theta_{\text{old}})
$$

:::caution The Computational Catch
In modern Deep Learning, the Hessian is rarely used directly. If a model has 10 million parameters, the Hessian matrix would have $10^{14}$ elements (100 trillion!), which is impossible to store in memory or invert. We use "quasi-Newton" methods or adaptive optimizers (like Adam) that approximate this curvature information.
:::
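
For small problems, though, a Newton step is easy to demonstrate. The sketch below uses a convex quadratic of our own choosing; because the function is quadratic, a single Newton step lands exactly on the minimizer (note that we solve the linear system rather than explicitly inverting $\mathbf{H}$):

```python
import numpy as np

# One Newton step on the quadratic (our example)
# J(theta) = 2*t1^2 + t2^2 + t1*t2 - 4*t1, which has a constant Hessian.

def grad_J(theta):
    t1, t2 = theta
    return np.array([4 * t1 + t2 - 4,   # dJ/dt1
                     2 * t2 + t1])      # dJ/dt2

H = np.array([[4.0, 1.0],
              [1.0, 2.0]])              # Hessian: positive definite

theta = np.array([10.0, -10.0])                    # arbitrary start
theta = theta - np.linalg.solve(H, grad_J(theta))  # Newton step

print(theta)          # [1.1428..., -0.5714...] = (8/7, -4/7), the minimizer
print(grad_J(theta))  # ~[0, 0]: one step was enough
```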

## 4. Example Calculation

Let $f(x, y) = x^2 + 4xy + y^2$.

1. **First Partial Derivatives (Gradient):**
   * $f_x = 2x + 4y$
   * $f_y = 4x + 2y$
2. **Second Partial Derivatives (Hessian):**
   * $f_{xx} = \frac{\partial}{\partial x}(2x + 4y) = 2$
   * $f_{yy} = \frac{\partial}{\partial y}(4x + 2y) = 2$
   * $f_{xy} = \frac{\partial}{\partial y}(2x + 4y) = 4$

The Hessian matrix is:

$$
\mathbf{H} = \begin{bmatrix} 2 & 4 \\ 4 & 2 \end{bmatrix}
$$
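
We can close the loop with Section 2A: the eigenvalues of this Hessian classify the critical point. Both partials vanish at the origin, and a short check (our addition) shows the eigenvalues are $6$ and $-2$, one positive and one negative, so $\mathbf{H}$ is indefinite and the origin is a saddle point:

```python
import numpy as np

# Classifying the critical point of f(x, y) = x^2 + 4xy + y^2 at the origin.

H = np.array([[2.0, 4.0],
              [4.0, 2.0]])

eigenvalues = np.linalg.eigvalsh(H)  # eigvalsh is for symmetric matrices
print(eigenvalues)                   # [-2.  6.]
# Mixed signs -> indefinite Hessian -> saddle point (see Section 2A).
```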

---

Now that we have covered the mathematics of change (Calculus), we need to look at the mathematics of uncertainty. This allows us to handle noisy data and make predictions with confidence.
