Commit 9c3de6a

Convert ```math blocks to $$ for Obsidian compatibility
Replace all ```math fenced code blocks with $$...$$ display math across 12 files. This makes equations render correctly in Obsidian while remaining compatible with GitHub. Also updates the README contribution guidelines to reflect the new convention.
1 parent a65f440 commit 9c3de6a

12 files changed (+110, -111 lines)


README.md

Lines changed: 2 additions & 3 deletions
@@ -45,7 +45,6 @@ Over the past years working in AI/ML, I filled notebooks with intuition first, r
 - Suggest topics via GitHub issues.
 - PR corrections and better intuition.
 - Create SVG images in `../images/` for all diagrams.
-- For equations, use ` ```math ` fenced code blocks (NOT `$$`)
-- For display math — GitHub escapes `\\` inside `$$`, breaking matrices.
-- Inline math `$...$` is fine for simple expressions but move anything with `\\` into a ` ```math ` block.
+- For display math, use `$$...$$` blocks.
+- Inline math `$...$` is fine for simple expressions.
 - Use `\ast` instead of `*` for conjugate/adjoint in inline math.

chapter 01: vectors/05. basis and duality.md

Lines changed: 2 additions & 2 deletions
@@ -35,9 +35,9 @@
 - For every basis $\{\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_n\}$, there is a corresponding **dual basis** $\{\mathbf{e}_1^\ast, \mathbf{e}_2^\ast, \ldots, \mathbf{e}_n^\ast\}$. Each dual basis vector extracts exactly one coordinate:

-```math
+$$
 \mathbf{e}_i^\ast(\mathbf{e}_j) = \delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}
-```
+$$

 - $\mathbf{e}_1^\ast$ returns 1 when applied to $\mathbf{e}_1$ and 0 for everything else. It perfectly isolates the first coordinate.

chapter 02: matrices/01. matrix properties.md

Lines changed: 24 additions & 24 deletions
@@ -2,27 +2,27 @@
 - At its core, a **matrix** is a rectangular grid of numbers arranged in rows and columns. If a vector is a single list of numbers, a matrix is a table of them.

-```math
+$$
 A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}
-```
+$$

 - You can also think of a matrix as a stack of vectors.

 - If a single person is described by the vector $[\text{age}, \text{height}, \text{weight}]$, then three people form a matrix where each row is one person:

-```math
+$$
 \begin{bmatrix} 25 & 170 & 65 \\ 30 & 180 & 80 \\ 22 & 160 & 55 \end{bmatrix}
-```
+$$

 - This matrix has 3 rows and 3 columns, so we call it a $3 \times 3$ matrix.

 - Each number in the grid is called an **element** or **entry**, identified by its row and column: $A_{ij}$ is the element in row $i$, column $j$.

 - The **transpose** of a matrix flips it along its diagonal, turning rows into columns and columns into rows. If $A$ is $m \times n$, then $A^T$ is $n \times m$.

-```math
+$$
 A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \quad \Rightarrow \quad A^T = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}
-```
+$$

 - Multiplying a matrix by its transpose always gives a square matrix: $AA^T$ is $m \times m$ and $A^TA$ is $n \times n$.
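The transpose and shape facts in this hunk are easy to sanity-check numerically. A minimal sketch, assuming NumPy is available (it is not part of this commit), using the same matrix as the notes:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # a 2x3 matrix, as in the example above

At = A.T                    # transpose: rows become columns, so 3x2
G1 = A @ At                 # m x m  (2x2)
G2 = At @ A                 # n x n  (3x3)
```

Both products are square, as the notes claim, even though `A` itself is not.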

@@ -38,15 +38,15 @@ A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \quad \Rightarrow \quad
 - For example, the following matrix has rank 2 because neither row is a multiple of the other:

-```math
+$$
 \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}
-```
+$$

 But this matrix has rank 1 because the second row is just twice the first, so it adds no new information:

-```math
+$$
 \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}
-```
+$$

 - A $5 \times 3$ matrix can have rank at most 3. If some rows are just scaled or combined versions of others, the rank drops. A matrix with maximum possible rank is called **full rank**.
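The two rank examples in this hunk can be verified directly; a quick NumPy check (NumPy assumed, not part of the commit):

```python
import numpy as np

# Rank 2: neither row is a multiple of the other
full = np.array([[1, 2],
                 [3, 4]])

# Rank 1: the second row is exactly twice the first
deficient = np.array([[1, 2],
                      [2, 4]])

r_full = np.linalg.matrix_rank(full)
r_deficient = np.linalg.matrix_rank(deficient)
```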

@@ -66,17 +66,17 @@ But this matrix has rank 1 because the second row is just twice the first, so it
 - The **determinant** of a square matrix is a single number that captures how the matrix scales space. Think of a $2 \times 2$ matrix as transforming a unit square into a parallelogram. The determinant is the area of that parallelogram (with a sign).

-```math
+$$
 \det\begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc
-```
+$$

 ![Determinant: the area scaling factor of a linear transformation](../images/determinant.svg)

 - For example:

-```math
+$$
 \det\begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix} = 2 \cdot 3 - 1 \cdot 0 = 6
-```
+$$

 The transformation stretches the unit square into a parallelogram with area 6.
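The worked determinant above matches what a numerical routine returns; a minimal check assuming NumPy:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

d = np.linalg.det(A)   # ad - bc = 2*3 - 1*0 = 6: the area scaling factor
```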

@@ -94,9 +94,9 @@ The transformation stretches the unit square into a parallelogram with area 6.
 - For a $2 \times 2$ matrix, the inverse has a direct formula:

-```math
+$$
 \begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}
-```
+$$

 Notice the determinant in the denominator, which is why singular matrices (determinant zero) have no inverse.
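The closed-form $2 \times 2$ inverse can be coded directly and compared against a general routine. A sketch assuming NumPy; the helper name `inv2x2` is mine, not from the notes:

```python
import numpy as np

def inv2x2(M):
    # Closed-form 2x2 inverse: swap a and d, negate b and c, divide by det
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular matrix has no inverse")
    return np.array([[d, -b], [-c, a]]) / det

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
B = inv2x2(A)              # agrees with np.linalg.inv(A)
```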

@@ -106,31 +106,31 @@ Notice the determinant in the denominator, which is why singular matrices (deter
 - For example, the following matrix has condition number $10^8$. One direction is scaled normally while the other is nearly squashed to zero, so small perturbations along that direction get wildly distorted:

-```math
+$$
 \begin{bmatrix} 1 & 0 \\ 0 & 10^{-8} \end{bmatrix}
-```
+$$

 - Just as vectors have norms (length), matrices have **norms** that measure their "size." The most common is the **Frobenius norm**, which treats the matrix as a long vector and computes its length:

-```math
+$$
 \|A\|_F = \sqrt{\sum_{i}\sum_{j} A_{ij}^2}
-```
+$$

 - For example:

-```math
+$$
 \left\|\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}\right\|_F = \sqrt{1 + 4 + 9 + 16} = \sqrt{30} \approx 5.48
-```
+$$

 - The **spectral norm** $\|A\|_2$ is the largest singular value of $A$. It measures the maximum amount the matrix can stretch any unit vector. In ML, matrix norms are used for weight regularisation (penalising large weights) and monitoring training stability.

 - A symmetric matrix $A$ is **positive definite** if for every non-zero vector $\mathbf{x}$: $\mathbf{x}^T A \mathbf{x} > 0$. This quadratic form always produces a positive number.

 - For example, the following matrix is positive definite:

-```math
+$$
 A = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}
-```
+$$

 Pick any vector, say $\mathbf{x} = [1, -1]^T$: $\mathbf{x}^T A \mathbf{x} = 2 - 1 - 1 + 3 = 3 > 0$. No matter which non-zero $\mathbf{x}$ you try, you always get a positive result.
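The condition number, Frobenius norm, and positive-definiteness examples in this hunk all check out numerically. A minimal sketch assuming NumPy (eigenvalues of a symmetric matrix being all positive is an equivalent test for positive definiteness):

```python
import numpy as np

# Condition number of the nearly-singular diagonal example: ~1e8
K = np.array([[1.0, 0.0],
              [0.0, 1e-8]])
cond = np.linalg.cond(K)

# Frobenius norm of the worked example: sqrt(30) ~ 5.48
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
fro = np.linalg.norm(A, 'fro')

# Positive definite: quadratic form positive, all eigenvalues positive
S = np.array([[2.0, 1.0],
              [1.0, 3.0]])
x = np.array([1.0, -1.0])
quad = x @ S @ x                   # 2 - 1 - 1 + 3 = 3
eigvals = np.linalg.eigvalsh(S)   # symmetric eigenvalue routine
```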

chapter 02: matrices/02. matrix types.md

Lines changed: 22 additions & 22 deletions
@@ -6,29 +6,29 @@
 - The **identity matrix** $I$ is a square matrix with 1s on the diagonal and 0s everywhere else. It is the "do nothing" transformation: $AI = IA = A$ for any compatible matrix $A$.

-```math
+$$
 I = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
-```
+$$

 - The **zero matrix** $O$ has all elements equal to zero. It maps every vector to the zero vector, destroying all information.

 - A **diagonal matrix** is all zeros except on the main diagonal. Multiplying a vector by a diagonal matrix simply scales each component independently, making it very efficient.

-```math
+$$
 D = \begin{bmatrix} 3 & 0 \\ 0 & 7 \end{bmatrix}
-```
+$$

 - A **symmetric matrix** equals its own transpose: $A = A^T$, meaning $A_{ij} = A_{ji}$. Symmetric matrices have the special property that their eigenvectors are always perpendicular to each other. Covariance matrices are always symmetric.

-```math
+$$
 S = \begin{bmatrix} 3 & -1 \\ -1 & 6 \end{bmatrix}
-```
+$$

 - A **triangular matrix** has all zeros on one side of the diagonal. **Lower triangular** has zeros above, **upper triangular** has zeros below. They are essential for solving systems of equations efficiently through forward or back substitution.

-```math
+$$
 L = \begin{bmatrix} 2 & 0 & 0 \\ 1 & 3 & 0 \\ -1 & 2 & 4 \end{bmatrix} \qquad U = \begin{bmatrix} 5 & -1 & 2 \\ 0 & 1 & 3 \\ 0 & 0 & -2 \end{bmatrix}
-```
+$$

 - The determinant of a triangular matrix is simply the product of its diagonal elements.
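The triangular-determinant fact at the end of this hunk can be confirmed with the lower-triangular matrix from the notes; a quick check assuming NumPy:

```python
import numpy as np

L = np.array([[ 2.0, 0.0, 0.0],
              [ 1.0, 3.0, 0.0],
              [-1.0, 2.0, 4.0]])

det_L = np.linalg.det(L)             # general determinant routine
diag_product = np.prod(np.diag(L))   # product of diagonal: 2 * 3 * 4 = 24
```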

@@ -50,23 +50,23 @@ L = \begin{bmatrix} 2 & 0 & 0 \\ 1 & 3 & 0 \\ -1 & 2 & 4 \end{bmatrix} \qquad U
 - For example, the matrix below moves element 3 to position 1, element 1 to position 2, and element 2 to position 3:

-```math
+$$
 P = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}
-```
+$$

 - A **Toeplitz matrix** has the same value along every diagonal (upper-left to lower-right). Notice how each diagonal is constant:

-```math
+$$
 T = \begin{bmatrix} a & b & c \\ d & a & b \\ e & d & a \end{bmatrix}
-```
+$$

 - This structure appears in signal processing and convolution, because sliding a fixed filter across a signal is equivalent to multiplying by a Toeplitz matrix.

 - A **circulant matrix** is a special Toeplitz matrix where each row is a cyclic shift of the one above. When a row reaches the end, it wraps around:

-```math
+$$
 C = \begin{bmatrix} 1 & 3 & 2 \\ 2 & 1 & 3 \\ 3 & 2 & 1 \end{bmatrix}
-```
+$$

 - Circulant matrices are closely connected to the discrete Fourier transform (DFT) and are central to how circular convolution works.
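The DFT connection mentioned here is concrete: the eigenvalues of a circulant matrix are the DFT of its first column, and the eigenvectors are the Fourier basis vectors. A sketch assuming NumPy, using the circulant example from the notes:

```python
import numpy as np

C = np.array([[1.0, 3.0, 2.0],
              [2.0, 1.0, 3.0],
              [3.0, 2.0, 1.0]])   # each row is a cyclic shift of the one above
c = C[:, 0]                       # first column: [1, 2, 3]
n = len(c)

lam = np.fft.fft(c)               # eigenvalues = DFT of the first column

# Check the eigen-relation C v = lam[k] v for one Fourier mode
k = 1
v = np.exp(2j * np.pi * k * np.arange(n) / n)
ok = np.allclose(C @ v, lam[k] * v)
```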

@@ -80,31 +80,31 @@ C = \begin{bmatrix} 1 & 3 & 2 \\ 2 & 1 & 3 \\ 3 & 2 & 1 \end{bmatrix}
 - A **nilpotent matrix** satisfies $A^k = O$ (the zero matrix) for some power $k$. Apply the transformation enough times and everything collapses to zero. For example:

-```math
+$$
 \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}^2 = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}
-```
+$$

 - A **Boolean matrix** (or binary matrix) contains only 0s and 1s. It represents yes/no relationships. For example, in a graph with 3 nodes, the **adjacency matrix** records which nodes are connected:

-```math
+$$
 B = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{bmatrix}
-```
+$$

 - Here, node 1 connects to nodes 2 and 3, but nodes 2 and 3 are not connected to each other.

 - A **Vandermonde matrix** is built from consecutive powers of a set of values. Given values $x_1, x_2, x_3$:

-```math
+$$
 V = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ 1 & x_3 & x_3^2 \end{bmatrix}
-```
+$$

 - This structure appears in polynomial interpolation: finding the unique polynomial that passes through a given set of points.

 - A **Hessenberg matrix** is "almost" triangular, with zeros below the first subdiagonal:

-```math
+$$
 H = \begin{bmatrix} 4 & 2 & 1 \\ 3 & 5 & -1 \\ 0 & 1 & 6 \end{bmatrix}
-```
+$$

 - It is a useful intermediate form for computing eigenvalues efficiently. Reducing a matrix to Hessenberg form first makes iterative algorithms converge faster.
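Two of the claims in this hunk are directly checkable: the nilpotent example squares to zero, and the Vandermonde matrix solves polynomial interpolation. A sketch assuming NumPy; the data points are mine, chosen to lie on $1 + x^2$:

```python
import numpy as np

# Nilpotent: this matrix squares to the zero matrix
N = np.array([[0, 1],
              [0, 0]])
N2 = N @ N

# Vandermonde interpolation: solve V a = y for polynomial coefficients
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 5.0, 10.0])         # values of 1 + x^2 at these points
V = np.vander(x, increasing=True)      # rows are [1, x_i, x_i^2]
coeffs = np.linalg.solve(V, y)         # unique degree-2 polynomial
```

`coeffs` recovers $[1, 0, 1]$, i.e. the polynomial $1 + x^2$.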

chapter 02: matrices/03. operations.md

Lines changed: 18 additions & 18 deletions
@@ -4,21 +4,21 @@
 - For addition, both matrices must have the same dimensions, and you add element by element:

-```math
+$$
 \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} + \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 6 & 8 \\ 10 & 12 \end{bmatrix}
-```
+$$

 - For scalar multiplication, you multiply every element by the scalar:

-```math
+$$
 3 \times \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 3 & 6 \\ 9 & 12 \end{bmatrix}
-```
+$$

 - The simplest thing you can do with a matrix is multiply it by a vector. **Matrix-vector multiplication** $A\mathbf{x}$ combines the columns of $A$ using the entries of $\mathbf{x}$ as weights:

-```math
+$$
 \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 5 \\ 6 \end{bmatrix} = 5 \begin{bmatrix} 1 \\ 3 \end{bmatrix} + 6 \begin{bmatrix} 2 \\ 4 \end{bmatrix} = \begin{bmatrix} 17 \\ 39 \end{bmatrix}
-```
+$$

 - This is the core operation in ML. Every neural network layer computes $A\mathbf{x} + \mathbf{b}$: a matrix times an input vector, plus a bias.
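The column-combination view of matrix-vector multiplication shown in this hunk can be reproduced both ways; a minimal check assuming NumPy:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([5.0, 6.0])

direct = A @ x                               # the usual product: [17, 39]
# The same result, viewed as a weighted sum of A's columns:
as_columns = x[0] * A[:, 0] + x[1] * A[:, 1]
```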

@@ -34,9 +34,9 @@ $$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$
 - A useful special case: multiplying a matrix by its transpose always gives a square matrix. $AA^T$ is $m \times m$ and $A^TA$ is $n \times n$:

-```math
+$$
 \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix} = \begin{bmatrix} 14 & 32 \\ 32 & 77 \end{bmatrix}
-```
+$$

 - Matrix multiplication has important rules:

@@ -50,17 +50,17 @@ $$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$
 - The **Hadamard product** (element-wise product) multiplies two matrices of the same size entry by entry, written $A \odot B$:

-```math
+$$
 \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \odot \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 5 & 12 \\ 21 & 32 \end{bmatrix}
-```
+$$

 - Unlike standard matrix multiplication, the Hadamard product is commutative ($A \odot B = B \odot A$) and requires both matrices to have the same dimensions. It is used heavily in ML for gating: multiplying element-wise by a mask of values between 0 and 1 controls how much of each entry "passes through."

 - The **outer product** of two vectors $\mathbf{u}$ and $\mathbf{v}$ produces a matrix: $\mathbf{u}\mathbf{v}^T$. Each entry is the product of one element from $\mathbf{u}$ and one from $\mathbf{v}$:

-```math
+$$
 \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \begin{bmatrix} 4 & 5 \end{bmatrix} = \begin{bmatrix} 4 & 5 \\ 8 & 10 \\ 12 & 15 \end{bmatrix}
-```
+$$

 - The result always has rank 1, because every row is a scaled version of $\mathbf{v}^T$. Any matrix can be written as a sum of rank-1 outer products, which is exactly what SVD does (covered in decompositions).
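The Hadamard and outer-product examples in this hunk, including the rank-1 claim, can be checked in a few lines; a sketch assuming NumPy:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
had = A * B                    # Hadamard product: element-wise, commutative

u = np.array([1, 2, 3])
v = np.array([4, 5])
outer = np.outer(u, v)         # 3x2 matrix; every row is a multiple of v
rank = np.linalg.matrix_rank(outer)
```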

@@ -74,19 +74,19 @@ $$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$
 - For example, the matrix:

-```math
+$$
 A = \begin{bmatrix} 5 & 0 & 0 & 2 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & -1 \end{bmatrix}
-```
+$$

 - Is stored as: values = [5, 2, 3, -1], columns = [0, 3, 2, 3], row offsets = [0, 2, 3, 4]. This skips all the zeros and makes sparse operations much faster.

 - A core use of matrices is solving **systems of linear equations**. The system $A\mathbf{x} = \mathbf{b}$ asks: "what vector $\mathbf{x}$, when transformed by $A$, produces $\mathbf{b}$?"

 - For example, say you are buying fruit. Apples cost $x_1$ dollars each and bananas cost $x_2$ dollars each. You know that 2 apples and 1 banana cost \$5, and 1 apple and 3 bananas cost \$10. In matrix form:

-```math
+$$
 \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 5 \\ 10 \end{bmatrix}
-```
+$$

 - Multiplying the matrix by the vector row by row (each row dotted with $[x_1, x_2]^T$) gives two equations:

@@ -96,9 +96,9 @@ $$2x_1 + 1x_2 = 5 \qquad \text{(row 1)} \qquad \qquad x_1 + 3x_2 = 10 \qquad \te
 - Verify — it checks out:

-```math
+$$
 \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} 1 \\ 3 \end{bmatrix} = \begin{bmatrix} 2 + 3 \\ 1 + 9 \end{bmatrix} = \begin{bmatrix} 5 \\ 10 \end{bmatrix}
-```
+$$

 - If $A$ has an inverse, the solution is simply $\mathbf{x} = A^{-1}\mathbf{b}$. But computing the inverse directly is expensive and numerically unstable. In practice, we use decompositions instead.
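The fruit-price system worked through in this file can be solved the way the notes recommend, without forming the inverse explicitly. A minimal sketch assuming NumPy (`np.linalg.solve` uses a decomposition internally):

```python
import numpy as np

# 2 apples + 1 banana = $5;  1 apple + 3 bananas = $10
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])

x = np.linalg.solve(A, b)   # preferred over np.linalg.inv(A) @ b
```

The solution is $x_1 = 1$, $x_2 = 3$: apples cost \$1 and bananas \$3, matching the verification step above.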
