
Mathematical Foundation

SRIJA DE CHOWDHURY edited this page Jan 4, 2026 · 1 revision

🧮 Mathematical Foundation

Understanding the Math Behind Logistic Regression



📐 Core Concepts

| Concept | Role |
|:--------|:-----|
| 🎯 Sigmoid | Activation function |
| 💰 Cost Function | Loss measurement |
| ⚡ Gradient | Optimization direction |
| 🔄 Update Rule | Parameter learning |


1️⃣ The Sigmoid Function

Definition

The sigmoid function (also called logistic function) maps any real number to a value between 0 and 1:

$$σ(z) = \frac{1}{1 + e^{-z}}$$

Visual Representation

    σ(z)
    1 ┤        ┌────────
      │       /
  0.5 ┤------/
      │     /
    0 ┤────┘
      └────┴────┴────┴─── z
        -∞  0   ∞

Properties

| Property | Description | Value |
|:---------|:------------|:------|
| Range | Output values | (0, 1) |
| Domain | Input values | (-∞, +∞) |
| Midpoint | σ(0) | 0.5 |
| Symmetry | σ(-z) | 1 - σ(z) |

Python Implementation

import numpy as np

def sigmoid(z):
    """
    Compute the sigmoid function.

    Parameters:
    z : array-like, input values

    Returns:
    Sigmoid of z, elementwise
    """
    return 1 / (1 + np.exp(-z))

Key Insights

💡 **Why sigmoid?** It transforms linear combinations into probabilities!

💡 Interpretation: Output represents probability of belonging to class 1

💡 Decision boundary: When σ(z) ≥ 0.5, predict class 1; otherwise class 0
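The properties in the table above can be checked numerically. This is a quick sketch that inlines `sigmoid` (with NumPy, as elsewhere on this page) so it runs on its own:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid, elementwise over arrays."""
    return 1 / (1 + np.exp(-z))

# Midpoint: sigma(0) = 0.5
assert sigmoid(0) == 0.5

# Symmetry: sigma(-z) = 1 - sigma(z)
z = np.linspace(-5, 5, 11)
assert np.allclose(sigmoid(-z), 1 - sigmoid(z))

# Range: outputs stay strictly inside (0, 1)
assert np.all((sigmoid(z) > 0) & (sigmoid(z) < 1))
```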


2️⃣ Linear Combination (Hypothesis)

The hypothesis function combines features linearly before applying sigmoid:

$$h_θ(x) = σ(θ^T x) = σ(θ_0 + θ_1x_1 + θ_2x_2 + ... + θ_nx_n)$$

Where:

  • θ (theta) = weight parameters (also denoted as w)
  • x = input features
  • θ₀ = bias term (also denoted as b)

Matrix Form

$$z = Xθ = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & ... & x_n^{(1)} \\ x_1^{(2)} & x_2^{(2)} & ... & x_n^{(2)} \\ ... & ... & ... & ... \\ x_1^{(m)} & x_2^{(m)} & ... & x_n^{(m)} \end{bmatrix} \begin{bmatrix} θ_1 \\ θ_2 \\ ... \\ θ_n \end{bmatrix}$$

Python Implementation

def linear_combination(X, weights, bias):
    """
    Compute the linear combination z = Xw + b.

    Parameters:
    X : array, shape (m, n) - training examples
    weights : array, shape (n,) - feature weights
    bias : float - bias term

    Returns:
    z : array, shape (m,) - linear combinations
    """
    return np.dot(X, weights) + bias
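A quick shape check on hypothetical toy data (the values below are illustrative only, not from the page):

```python
import numpy as np

def linear_combination(X, weights, bias):
    # z = Xw + b, vectorized over all m examples
    return np.dot(X, weights) + bias

# Hypothetical toy data: m = 3 examples, n = 2 features
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
weights = np.array([0.5, -0.5])
bias = 0.1

z = linear_combination(X, weights, bias)
assert z.shape == (3,)            # one value per example
# First example: 1*0.5 + 2*(-0.5) + 0.1 = -0.4
assert np.allclose(z[0], -0.4)
```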

3️⃣ Cost Function (Binary Cross-Entropy)

The cost function measures how well our model performs:

Single Example Cost

$$L(h_θ(x), y) = -y \log(h_θ(x)) - (1-y) \log(1-h_θ(x))$$

Full Dataset Cost

$$J(θ) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(h_θ(x^{(i)})) + (1-y^{(i)}) \log(1-h_θ(x^{(i)}))]$$

Where:

  • m = number of training examples
  • y = actual label (0 or 1)
  • h_θ(x) = predicted probability

Intuition

| Scenario | y (actual) | h_θ(x) (predicted) | Cost |
|:---------|:-----------|:-------------------|:-----|
| Perfect prediction | 1 | 1.0 | 0 (low) |
| Wrong prediction | 1 | 0.0 | ∞ (high) |
| Perfect prediction | 0 | 0.0 | 0 (low) |
| Wrong prediction | 0 | 1.0 | ∞ (high) |

Python Implementation

def compute_cost(y_true, y_pred):
    """
    Compute binary cross-entropy cost.

    Parameters:
    y_true : array, shape (m,) - true labels
    y_pred : array, shape (m,) - predicted probabilities

    Returns:
    cost : float - average cost
    """
    m = len(y_true)
    epsilon = 1e-15  # Prevent log(0)
    
    # Clip predictions to avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    
    cost = -1/m * np.sum(
        y_true * np.log(y_pred) + 
        (1 - y_true) * np.log(1 - y_pred)
    )
    
    return cost
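The intuition table above can be confirmed numerically. A minimal sketch (inlining `compute_cost` so it runs standalone): a near-perfect prediction should cost almost nothing, while a confidently wrong one should cost a lot (clipping keeps it finite rather than infinite).

```python
import numpy as np

def compute_cost(y_true, y_pred):
    """Binary cross-entropy, averaged over m examples."""
    m = len(y_true)
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)  # avoid log(0)
    return -1/m * np.sum(
        y_true * np.log(y_pred) +
        (1 - y_true) * np.log(1 - y_pred)
    )

# Near-perfect prediction for y = 1 -> cost near 0
good = compute_cost(np.array([1.0]), np.array([0.999]))
# Confidently wrong prediction for y = 1 -> large cost
bad = compute_cost(np.array([1.0]), np.array([0.001]))
assert good < 0.01
assert bad > 5
```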

4️⃣ Gradient Descent

Gradient descent is the optimization algorithm that minimizes the cost function.

Gradient Computation

For each parameter θⱼ:

$$\frac{∂J(θ)}{∂θ_j} = \frac{1}{m} \sum_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) x_j^{(i)}$$

Vectorized Form

$$\nabla J(θ) = \frac{1}{m} X^T (h_θ(X) - y)$$

Update Rule

$$θ_j := θ_j - α \frac{∂J(θ)}{∂θ_j}$$

Where:

  • α (alpha) = learning rate
  • := means "update to"

Python Implementation

def gradient_descent(X, y, weights, bias, learning_rate, iterations):
    """
    Perform gradient descent optimization
    
    Parameters:
    X : array, shape (m, n)
    y : array, shape (m,)
    weights : array, shape (n,)
    bias : float
    learning_rate : float
    iterations : int
    
    Returns:
    weights, bias, cost_history
    """
    m = len(y)
    cost_history = []
    
    for i in range(iterations):
        # Forward propagation
        z = linear_combination(X, weights, bias)
        predictions = sigmoid(z)
        
        # Compute cost
        cost = compute_cost(y, predictions)
        cost_history.append(cost)
        
        # Backward propagation (gradients)
        dz = predictions - y
        dw = (1/m) * np.dot(X.T, dz)
        db = (1/m) * np.sum(dz)
        
        # Update parameters
        weights -= learning_rate * dw
        bias -= learning_rate * db
    
    return weights, bias, cost_history
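Putting the pieces together, here is a minimal end-to-end sketch of the same loop on a hypothetical 1-D toy set (class 1 when x > 0), inlined so it runs standalone:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical linearly separable toy data
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

weights, bias = np.zeros(1), 0.0
learning_rate, iterations = 0.5, 1000
m = len(y)

for _ in range(iterations):
    predictions = sigmoid(X @ weights + bias)  # forward pass
    dz = predictions - y                       # error term
    weights -= learning_rate * (X.T @ dz) / m  # gradient step on weights
    bias -= learning_rate * dz.sum() / m       # gradient step on bias

# Threshold at 0.5 to get class labels
labels = (sigmoid(X @ weights + bias) >= 0.5).astype(int)
assert np.array_equal(labels, y.astype(int))  # fits the toy set perfectly
```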

5️⃣ Learning Rate

The learning rate (α) controls how big steps we take during optimization.

Visualization

Cost
 │
 │  α too large        α just right       α too small
 │     ╱╲╱╲                ──╲              ──────╲
 │   ╱╲    ╱╲            ────╲           ────────╲
 │  ╱  ╲  ╱  ╲         ──────╲        ────────────╲
 └──────────────────────────────────────────────── Iterations
    (oscillates)        (converges)    (slow convergence)

Choosing Learning Rate

| Value | Effect | Recommendation |
|:------|:-------|:---------------|
| Too high (α > 1) | Oscillation, divergence | ❌ Avoid |
| Good (0.001 - 0.1) | Smooth convergence | ✅ Use |
| Too low (α < 0.0001) | Very slow learning | ⚠️ Inefficient |
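The slow-convergence row is easy to demonstrate: with the same iteration budget on a hypothetical toy problem, a moderate rate reaches a much lower cost than a tiny one. A sketch, assuming the page's sigmoid and cost definitions:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def final_cost(lr, iterations=200):
    """Train on a fixed toy problem; return the final cross-entropy cost."""
    X = np.array([[-1.0], [-0.5], [0.5], [1.0]])
    y = np.array([0.0, 0.0, 1.0, 1.0])
    w, b, m = np.zeros(1), 0.0, len(y)
    for _ in range(iterations):
        dz = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ dz) / m
        b -= lr * dz.sum() / m
    p = np.clip(sigmoid(X @ w + b), 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Same budget, different rates: alpha = 0.1 makes far more progress
assert final_cost(0.1) < final_cost(0.0001)
```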


6️⃣ Complete Algorithm

Pseudocode

Algorithm: Logistic Regression Training
────────────────────────────────────────
Input: X (features), y (labels), α (learning rate), iterations
Output: θ (parameters)

1. Initialize θ randomly or to zeros
2. Repeat for the given number of iterations:
   a. Compute z = Xθ
   b. Compute predictions: h = σ(z)
   c. Compute cost: J(θ)
   d. Compute gradients: ∇J(θ)
   e. Update: θ := θ - α∇J(θ)
3. Return θ

🎯 Decision Boundary

The decision boundary separates classes in feature space.

Mathematical Definition

Decision boundary occurs when:

$$h_θ(x) = 0.5 ⟹ θ^T x = 0$$

For 2D Features

$$θ_0 + θ_1x_1 + θ_2x_2 = 0$$

Rearranging:

$$x_2 = -\frac{θ_0}{θ_2} - \frac{θ_1}{θ_2}x_1$$

This is a straight line!
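Given fitted parameters, the rearranged formula above gives the boundary directly. A sketch with hypothetical parameter values (any point on the returned line satisfies θᵀx = 0; requires θ₂ ≠ 0):

```python
import numpy as np

# Hypothetical fitted parameters: theta0 (bias), theta1, theta2
theta0, theta1, theta2 = -3.0, 1.0, 2.0

def boundary_x2(x1):
    """x2 on the decision boundary for a given x1."""
    return -theta0 / theta2 - (theta1 / theta2) * x1

# Every point on the line satisfies theta0 + theta1*x1 + theta2*x2 = 0
x1 = np.array([0.0, 1.0, 2.0])
x2 = boundary_x2(x1)
assert np.allclose(theta0 + theta1 * x1 + theta2 * x2, 0)
```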

Visualization

 x₂
  │     Class 1 (y=1)
  │   ○ ○ ○ ○
  │  ○ ○ ○ ○ ○
  │ ──────────── ← Decision Boundary (θᵀx = 0)
  │  × × × ×
  │ × × × ×
  │   Class 0 (y=0)
  └─────────────── x₁

📊 Summary Table

| Component | Formula | Purpose |
|:----------|:--------|:--------|
| Sigmoid | σ(z) = 1/(1+e⁻ᶻ) | Convert to probability |
| Hypothesis | h_θ(x) = σ(θᵀx) | Make predictions |
| Cost | J(θ) = -1/m Σ[y log(h) + (1-y) log(1-h)] | Measure error |
| Gradient | ∇J = 1/m Xᵀ(h-y) | Find update direction |
| Update | θ := θ - α∇J | Learn parameters |

🧪 Practice Problems

Problem 1: Calculate Sigmoid

Calculate σ(0), σ(2), σ(-2)

Solution:

  • σ(0) = 1/(1+e⁰) = 1/2 = 0.5
  • σ(2) = 1/(1+e⁻²) ≈ 0.88
  • σ(-2) = 1/(1+e²) ≈ 0.12

Problem 2: Cost Function

If y=1 and h_θ(x)=0.7, what is the cost?

Solution: L = -[1×log(0.7) + 0×log(0.3)] = -log(0.7) ≈ 0.357
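Both solutions can be verified in a couple of lines (sigmoid inlined so the snippet is self-contained):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Problem 1: sigmoid values
assert sigmoid(0) == 0.5
assert round(sigmoid(2), 2) == 0.88
assert round(sigmoid(-2), 2) == 0.12

# Problem 2: y = 1, h = 0.7, so L = -log(0.7)
assert round(-np.log(0.7), 3) == 0.357
```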


🔗 Next Steps

Now that you understand the math, let's see the code!
