
Mathematical Foundation

SRIJA DE CHOWDHURY edited this page Jan 4, 2026 · 1 revision

🧮 Mathematical Foundation

Understanding the Math Behind Logistic Regression



📐 Core Concepts

| Concept | Role |
|:--------|:-----|
| 🎯 Sigmoid | Activation function |
| 💰 Cost Function | Loss measurement |
| ⚡ Gradient | Optimization direction |
| 🔄 Update Rule | Parameter learning |


1️⃣ The Sigmoid Function

Definition

The sigmoid function (also called logistic function) maps any real number to a value between 0 and 1:

$$σ(z) = \frac{1}{1 + e^{-z}}$$

Visual Representation

    σ(z)
    1 ┤        ┌────────
      │       /
  0.5 ┤------/
      │     /
    0 ┤────┘
      └────┴────┴────┴─── z
        -∞  0   ∞

Properties

| Property | Description | Value |
|:---------|:------------|:------|
| Range | Output values | (0, 1) |
| Domain | Input values | (-∞, +∞) |
| Midpoint | σ(0) | 0.5 |
| Symmetry | σ(-z) | 1 - σ(z) |

Python Implementation

import numpy as np

def sigmoid(z):
    """
    Compute the sigmoid function.

    Parameters:
    z : array-like, input values

    Returns:
    Sigmoid of z, elementwise
    """
    return 1 / (1 + np.exp(-z))

Key Insights

💡 **Why sigmoid?** It transforms linear combinations into probabilities!

💡 Interpretation: Output represents probability of belonging to class 1

💡 Decision boundary: When σ(z) ≥ 0.5, predict class 1; otherwise class 0
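The properties in the table above can be checked numerically. This is a quick sketch that inlines `sigmoid` (with NumPy, as elsewhere on this page) so it runs on its own:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid, elementwise over arrays."""
    return 1 / (1 + np.exp(-z))

# Midpoint: sigma(0) = 0.5
assert sigmoid(0) == 0.5

# Symmetry: sigma(-z) = 1 - sigma(z)
z = np.linspace(-5, 5, 11)
assert np.allclose(sigmoid(-z), 1 - sigmoid(z))

# Range: outputs stay strictly inside (0, 1)
assert np.all((sigmoid(z) > 0) & (sigmoid(z) < 1))
```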


2️⃣ Linear Combination (Hypothesis)

The hypothesis function combines features linearly before applying sigmoid:

$$h_θ(x) = σ(θ^T x) = σ(θ_0 + θ_1x_1 + θ_2x_2 + ... + θ_nx_n)$$

Where:

  • θ (theta) = weight parameters (also denoted as w)
  • x = input features
  • θ₀ = bias term (also denoted as b)

Matrix Form

$$z = Xθ = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & ... & x_n^{(1)} \\ x_1^{(2)} & x_2^{(2)} & ... & x_n^{(2)} \\ ... & ... & ... & ... \\ x_1^{(m)} & x_2^{(m)} & ... & x_n^{(m)} \end{bmatrix} \begin{bmatrix} θ_1 \\ θ_2 \\ ... \\ θ_n \end{bmatrix}$$

Python Implementation

def linear_combination(X, weights, bias):
    """
    Compute the linear combination z = Xw + b.

    Parameters:
    X : array, shape (m, n) - training examples
    weights : array, shape (n,) - feature weights
    bias : float - bias term

    Returns:
    z : array, shape (m,) - linear combinations
    """
    return np.dot(X, weights) + bias
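A quick shape check on hypothetical toy data (the values below are illustrative only, not from the page):

```python
import numpy as np

def linear_combination(X, weights, bias):
    # z = Xw + b, vectorized over all m examples
    return np.dot(X, weights) + bias

# Hypothetical toy data: m = 3 examples, n = 2 features
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
weights = np.array([0.5, -0.5])
bias = 0.1

z = linear_combination(X, weights, bias)
assert z.shape == (3,)            # one value per example
# First example: 1*0.5 + 2*(-0.5) + 0.1 = -0.4
assert np.allclose(z[0], -0.4)
```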

3️⃣ Cost Function (Binary Cross-Entropy)

The cost function measures how well our model performs:

Single Example Cost

$$L(h_θ(x), y) = -y \log(h_θ(x)) - (1-y) \log(1-h_θ(x))$$

Full Dataset Cost

$$J(θ) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(h_θ(x^{(i)})) + (1-y^{(i)}) \log(1-h_θ(x^{(i)}))]$$

Where:

  • m = number of training examples
  • y = actual label (0 or 1)
  • h_θ(x) = predicted probability

Intuition

| Scenario | y (actual) | h_θ(x) (predicted) | Cost |
|:---------|:-----------|:-------------------|:-----|
| Perfect prediction | 1 | 1.0 | 0 (low) |
| Wrong prediction | 1 | 0.0 | ∞ (high) |
| Perfect prediction | 0 | 0.0 | 0 (low) |
| Wrong prediction | 0 | 1.0 | ∞ (high) |

Python Implementation

def compute_cost(y_true, y_pred):
    """
    Compute binary cross-entropy cost.

    Parameters:
    y_true : array, shape (m,) - true labels
    y_pred : array, shape (m,) - predicted probabilities

    Returns:
    cost : float - average cost
    """
    m = len(y_true)
    epsilon = 1e-15  # Prevent log(0)
    
    # Clip predictions to avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    
    cost = -1/m * np.sum(
        y_true * np.log(y_pred) + 
        (1 - y_true) * np.log(1 - y_pred)
    )
    
    return cost
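The intuition table above can be confirmed numerically. A minimal sketch (inlining `compute_cost` so it runs standalone): a near-perfect prediction should cost almost nothing, while a confidently wrong one should cost a lot (clipping keeps it finite rather than infinite).

```python
import numpy as np

def compute_cost(y_true, y_pred):
    """Binary cross-entropy, averaged over m examples."""
    m = len(y_true)
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)  # avoid log(0)
    return -1/m * np.sum(
        y_true * np.log(y_pred) +
        (1 - y_true) * np.log(1 - y_pred)
    )

# Near-perfect prediction for y = 1 -> cost near 0
good = compute_cost(np.array([1.0]), np.array([0.999]))
# Confidently wrong prediction for y = 1 -> large cost
bad = compute_cost(np.array([1.0]), np.array([0.001]))
assert good < 0.01
assert bad > 5
```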

4️⃣ Gradient Descent

Gradient descent is the optimization algorithm that minimizes the cost function.

Gradient Computation

For each parameter θⱼ:

$$\frac{∂J(θ)}{∂θ_j} = \frac{1}{m} \sum_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) x_j^{(i)}$$

Vectorized Form

$$\nabla J(θ) = \frac{1}{m} X^T (h_θ(X) - y)$$

Update Rule

$$θ_j := θ_j - α \frac{∂J(θ)}{∂θ_j}$$

Where:

  • α (alpha) = learning rate
  • := means "update to"

Python Implementation

def gradient_descent(X, y, weights, bias, learning_rate, iterations):
    """
    Perform gradient descent optimization
    
    Parameters:
    X : array, shape (m, n)
    y : array, shape (m,)
    weights : array, shape (n,)
    bias : float
    learning_rate : float
    iterations : int
    
    Returns:
    weights, bias, cost_history
    """
    m = len(y)
    cost_history = []
    
    for i in range(iterations):
        # Forward propagation
        z = linear_combination(X, weights, bias)
        predictions = sigmoid(z)
        
        # Compute cost
        cost = compute_cost(y, predictions)
        cost_history.append(cost)
        
        # Backward propagation (gradients)
        dz = predictions - y
        dw = (1/m) * np.dot(X.T, dz)
        db = (1/m) * np.sum(dz)
        
        # Update parameters
        weights -= learning_rate * dw
        bias -= learning_rate * db
    
    return weights, bias, cost_history
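Putting the pieces together, here is a minimal end-to-end sketch of the same loop on a hypothetical 1-D toy set (class 1 when x > 0), inlined so it runs standalone:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical linearly separable toy data
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

weights, bias = np.zeros(1), 0.0
learning_rate, iterations = 0.5, 1000
m = len(y)

for _ in range(iterations):
    predictions = sigmoid(X @ weights + bias)  # forward pass
    dz = predictions - y                       # error term
    weights -= learning_rate * (X.T @ dz) / m  # gradient step on weights
    bias -= learning_rate * dz.sum() / m       # gradient step on bias

# Threshold at 0.5 to get class labels
labels = (sigmoid(X @ weights + bias) >= 0.5).astype(int)
assert np.array_equal(labels, y.astype(int))  # fits the toy set perfectly
```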

5️⃣ Learning Rate

The learning rate (α) controls how big steps we take during optimization.

Visualization

Cost
 │
 │  α too large        α just right       α too small
 │     ╱╲╱╲                ──╲              ──────╲
 │   ╱╲    ╱╲            ────╲           ────────╲
 │  ╱  ╲  ╱  ╲         ──────╲        ────────────╲
 └──────────────────────────────────────────────── Iterations
    (oscillates)        (converges)    (slow convergence)

Choosing Learning Rate

| Value | Effect | Recommendation |
|:------|:-------|:---------------|
| Too high (α > 1) | Oscillation, divergence | ❌ Avoid |
| Good (0.001 - 0.1) | Smooth convergence | ✅ Use |
| Too low (α < 0.0001) | Very slow learning | ⚠️ Inefficient |
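The slow-convergence row is easy to demonstrate: with the same iteration budget on a hypothetical toy problem, a moderate rate reaches a much lower cost than a tiny one. A sketch, assuming the page's sigmoid and cost definitions:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def final_cost(lr, iterations=200):
    """Train on a fixed toy problem; return the final cross-entropy cost."""
    X = np.array([[-1.0], [-0.5], [0.5], [1.0]])
    y = np.array([0.0, 0.0, 1.0, 1.0])
    w, b, m = np.zeros(1), 0.0, len(y)
    for _ in range(iterations):
        dz = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ dz) / m
        b -= lr * dz.sum() / m
    p = np.clip(sigmoid(X @ w + b), 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Same budget, different rates: alpha = 0.1 makes far more progress
assert final_cost(0.1) < final_cost(0.0001)
```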


6️⃣ Complete Algorithm

Pseudocode

Algorithm: Logistic Regression Training
────────────────────────────────────────
Input: X (features), y (labels), α (learning rate), iterations
Output: θ (parameters)

1. Initialize θ randomly or to zeros
2. Repeat for the given number of iterations:
   a. Compute z = Xθ
   b. Compute predictions: h = σ(z)
   c. Compute cost: J(θ)
   d. Compute gradients: ∇J(θ)
   e. Update: θ := θ - α∇J(θ)
3. Return θ

🎯 Decision Boundary

The decision boundary separates classes in feature space.

Mathematical Definition

Decision boundary occurs when:

$$h_θ(x) = 0.5 ⟹ θ^T x = 0$$

For 2D Features

$$θ_0 + θ_1x_1 + θ_2x_2 = 0$$

Rearranging:

$$x_2 = -\frac{θ_0}{θ_2} - \frac{θ_1}{θ_2}x_1$$

This is a straight line!
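Given fitted parameters, the rearranged formula above gives the boundary directly. A sketch with hypothetical parameter values (any point on the returned line satisfies θᵀx = 0; requires θ₂ ≠ 0):

```python
import numpy as np

# Hypothetical fitted parameters: theta0 (bias), theta1, theta2
theta0, theta1, theta2 = -3.0, 1.0, 2.0

def boundary_x2(x1):
    """x2 on the decision boundary for a given x1."""
    return -theta0 / theta2 - (theta1 / theta2) * x1

# Every point on the line satisfies theta0 + theta1*x1 + theta2*x2 = 0
x1 = np.array([0.0, 1.0, 2.0])
x2 = boundary_x2(x1)
assert np.allclose(theta0 + theta1 * x1 + theta2 * x2, 0)
```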

Visualization

 x₂
  │     Class 1 (y=1)
  │   ○ ○ ○ ○
  │  ○ ○ ○ ○ ○
  │ ──────────── ← Decision Boundary (θᵀx = 0)
  │  × × × ×
  │ × × × ×
  │   Class 0 (y=0)
  └─────────────── x₁

📊 Summary Table

| Component | Formula | Purpose |
|:----------|:--------|:--------|
| Sigmoid | σ(z) = 1/(1+e⁻ᶻ) | Convert to probability |
| Hypothesis | h_θ(x) = σ(θᵀx) | Make predictions |
| Cost | J(θ) = -1/m Σ[y log(h) + (1-y) log(1-h)] | Measure error |
| Gradient | ∇J = 1/m Xᵀ(h-y) | Find update direction |
| Update | θ := θ - α∇J | Learn parameters |

🧪 Practice Problems

Problem 1: Calculate Sigmoid

Calculate σ(0), σ(2), σ(-2)

Solution:

  • σ(0) = 1/(1+e⁰) = 1/2 = 0.5
  • σ(2) = 1/(1+e⁻²) ≈ 0.88
  • σ(-2) = 1/(1+e²) ≈ 0.12

Problem 2: Cost Function

If y=1 and h_θ(x)=0.7, what is the cost?

Solution: L = -[1×log(0.7) + 0×log(0.3)] = -log(0.7) ≈ 0.357
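Both solutions can be verified in a couple of lines (sigmoid inlined so the snippet is self-contained):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Problem 1: sigmoid values
assert sigmoid(0) == 0.5
assert round(sigmoid(2), 2) == 0.88
assert round(sigmoid(-2), 2) == 0.12

# Problem 2: y = 1, h = 0.7, so L = -log(0.7)
assert round(-np.log(0.7), 3) == 0.357
```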


🔗 Next Steps

Now that you understand the math, let's see the code!
