## Mathematical Foundation
Activation Function → Loss Measurement → Optimization Direction → Parameter Learning
The sigmoid function (also called logistic function) maps any real number to a value between 0 and 1:
```
σ(z)
  1 ┤            ┌────────
    │           /
0.5 ┤----------/
    │         /
  0 ┤────────┘
    └────┴────┴────┴─── z
       -∞    0    ∞
```
| Property | Description | Value |
|:---------|:------------|:------|
| Range | Output values | (0, 1) |
| Domain | Input values | (-∞, +∞) |
| Midpoint | σ(0) | 0.5 |
| Symmetry | σ(-z) | 1 - σ(z) |
```python
import numpy as np

def sigmoid(z):
    """
    Compute sigmoid function

    Parameters:
        z : array-like, input values
    Returns:
        Sigmoid of z
    """
    return 1 / (1 + np.exp(-z))
```

💡 **Why sigmoid?** It transforms linear combinations into probabilities!
💡 Interpretation: Output represents probability of belonging to class 1
💡 Decision boundary: When σ(z) ≥ 0.5, predict class 1; otherwise class 0
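The 0.5 decision rule above can be sketched in NumPy; the raw scores `z` here are hypothetical values chosen for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical raw scores z for three examples
z = np.array([-2.0, 0.0, 3.0])
probs = sigmoid(z)                   # probabilities of class 1 (≈ 0.12, 0.5, 0.95)
labels = (probs >= 0.5).astype(int)  # apply the 0.5 threshold
print(labels)  # [0 1 1]
```

Note that σ(z) ≥ 0.5 exactly when z ≥ 0, so thresholding the probability at 0.5 is the same as thresholding the raw score at 0.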
The hypothesis function combines features linearly before applying sigmoid:

h_θ(x) = σ(θᵀx) = 1 / (1 + e^(−θᵀx))
Where:
- θ (theta) = weight parameters (also denoted as w)
- x = input features
- θ₀ = bias term (also denoted as b)
```python
def linear_combination(X, weights, bias):
    """
    Compute linear combination

    Parameters:
        X : array, shape (m, n) - training examples
        weights : array, shape (n,) - feature weights
        bias : float - bias term
    Returns:
        z : array, shape (m,) - linear combinations
    """
    return np.dot(X, weights) + bias
```

The cost function (binary cross-entropy) measures how well our model performs:

J(θ) = -1/m Σ [y log(h_θ(x)) + (1-y) log(1-h_θ(x))]
Where:
- m = number of training examples
- y = actual label (0 or 1)
- h_θ(x) = predicted probability
| Scenario | y (actual) | h_θ(x) (predicted) | Cost |
|---|---|---|---|
| Perfect prediction | 1 | 1.0 | 0 (low) |
| Wrong prediction | 1 | 0.0 | ∞ (high) |
| Perfect prediction | 0 | 0.0 | 0 (low) |
| Wrong prediction | 0 | 1.0 | ∞ (high) |
```python
def compute_cost(y_true, y_pred):
    """
    Compute binary cross-entropy cost

    Parameters:
        y_true : array, shape (m,) - true labels
        y_pred : array, shape (m,) - predicted probabilities
    Returns:
        cost : float - average cost
    """
    m = len(y_true)
    epsilon = 1e-15  # Prevent log(0)
    # Clip predictions to avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    cost = -1/m * np.sum(
        y_true * np.log(y_pred) +
        (1 - y_true) * np.log(1 - y_pred)
    )
    return cost
```

Gradient descent is the optimization algorithm that minimizes the cost function.
For each parameter θⱼ:

θⱼ := θⱼ - α ∂J(θ)/∂θⱼ = θⱼ - α · 1/m Σ (h_θ(x⁽ⁱ⁾) - y⁽ⁱ⁾) xⱼ⁽ⁱ⁾

Where:
- α (alpha) = learning rate
- := means "update to"
```python
def gradient_descent(X, y, weights, bias, learning_rate, iterations):
    """
    Perform gradient descent optimization

    Parameters:
        X : array, shape (m, n)
        y : array, shape (m,)
        weights : array, shape (n,)
        bias : float
        learning_rate : float
        iterations : int
    Returns:
        weights, bias, cost_history
    """
    m = len(y)
    cost_history = []
    for i in range(iterations):
        # Forward propagation
        z = linear_combination(X, weights, bias)
        predictions = sigmoid(z)
        # Compute cost
        cost = compute_cost(y, predictions)
        cost_history.append(cost)
        # Backward propagation (gradients)
        dz = predictions - y
        dw = (1/m) * np.dot(X.T, dz)
        db = (1/m) * np.sum(dz)
        # Update parameters
        weights -= learning_rate * dw
        bias -= learning_rate * db
    return weights, bias, cost_history
```

The learning rate (α) controls how big a step we take during optimization.
```
Cost
 │
 │ α too large      α just right      α too small
 │  ╱╲╱╲             ──╲               ──────╲
 │ ╱╲  ╱╲               ────╲                 ────────╲
 │╱  ╲╱  ╲                   ──────╲                   ────────────╲
 └──────────────────────────────────────────────── Iterations
   (oscillates)     (converges)       (slow convergence)
```
| Value | Effect | Recommendation |
|:------|:-------|:---------------|
| Too high (α > 1) | Oscillation, divergence | ❌ Avoid |
| Good (0.001 - 0.1) | Smooth convergence | ✅ Use |
| Too low (α < 0.0001) | Very slow learning | ❌ Avoid |
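The effect of the learning rate can be seen with a small, self-contained sketch; the one-feature dataset and the two α values below are hypothetical choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, y, alpha, iterations):
    """Minimal gradient-descent loop; returns the cost at each iteration."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    history = []
    for _ in range(iterations):
        p = sigmoid(X @ w + b)
        p = np.clip(p, 1e-15, 1 - 1e-15)
        history.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
        dz = p - y
        w -= alpha * (X.T @ dz) / m
        b -= alpha * dz.mean()
    return history

# Toy separable data: class 1 when the single feature is positive
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

fast = train(X, y, alpha=0.1, iterations=200)
slow = train(X, y, alpha=0.0001, iterations=200)
print(fast[-1] < slow[-1])  # a well-chosen rate reaches a lower cost in the same budget
```

With α = 0.0001 the cost barely moves from its initial value of ln 2 ≈ 0.693 after 200 iterations, while α = 0.1 converges smoothly.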
```
Algorithm: Logistic Regression Training
────────────────────────────────────────
Input:  X (features), y (labels), α (learning rate), iterations
Output: θ (parameters)

1. Initialize θ randomly or to zeros
2. Repeat for number of iterations:
   a. Compute z = Xθ
   b. Compute predictions: h = σ(z)
   c. Compute cost: J(θ)
   d. Compute gradients: ∇J(θ)
   e. Update: θ := θ - α∇J(θ)
3. Return θ
```
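The algorithm can be run end to end as a compact sketch; the two-feature toy dataset here is hypothetical (class 1 whenever x₁ + x₂ > 0), and the bias is kept as a separate scalar:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_logistic(X, y, alpha=0.1, iterations=500):
    """Steps 1-3 of the algorithm above."""
    m, n = X.shape
    theta, bias = np.zeros(n), 0.0        # step 1: initialize to zeros
    for _ in range(iterations):           # step 2: repeat
        z = X @ theta + bias              # 2a: linear combination
        h = sigmoid(z)                    # 2b: predictions
        grad_theta = (X.T @ (h - y)) / m  # 2d: gradient w.r.t. theta
        grad_bias = (h - y).mean()        #     gradient w.r.t. bias
        theta -= alpha * grad_theta       # 2e: update
        bias -= alpha * grad_bias
    return theta, bias                    # step 3

# Hypothetical separable toy set
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
theta, bias = train_logistic(X, y)
preds = (sigmoid(X @ theta + bias) >= 0.5).astype(int)
print(preds)  # should recover the labels on this separable set
```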
The decision boundary separates classes in feature space.
Decision boundary occurs when:

σ(θᵀx) = 0.5, which happens exactly when θᵀx = 0

For two features: θ₀ + θ₁x₁ + θ₂x₂ = 0

Rearranging:

x₂ = -(θ₀ + θ₁x₁) / θ₂

This is a straight line!
```
x₂
 │        Class 1 (y=1)
 │      ○ ○ ○ ○
 │    ○ ○ ○ ○ ○
 │  ────────────── ← Decision Boundary (θᵀx = 0)
 │    × × × ×
 │  × × × ×
 │        Class 0 (y=0)
 └─────────────────── x₁
```
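Given a set of learned parameters (the values below are hypothetical, for illustration), points on the boundary line follow directly from solving θ₀ + θ₁x₁ + θ₂x₂ = 0 for x₂:

```python
import numpy as np

# Hypothetical learned parameters: theta0 (bias), theta1, theta2
theta0, theta1, theta2 = -3.0, 1.0, 2.0

def boundary_x2(x1):
    """Solve theta0 + theta1*x1 + theta2*x2 = 0 for x2."""
    return -(theta0 + theta1 * x1) / theta2

x1 = np.array([0.0, 1.0, 2.0, 3.0])
print(boundary_x2(x1))  # x2 values where the model predicts exactly 0.5
```

Plugging any of these (x₁, x₂) pairs back into σ(θ₀ + θ₁x₁ + θ₂x₂) returns exactly 0.5, confirming they sit on the boundary.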
| Component | Formula | Purpose |
|---|---|---|
| Sigmoid | σ(z) = 1/(1+e⁻ᶻ) | Convert to probability |
| Hypothesis | h_θ(x) = σ(θᵀx) | Make predictions |
| Cost | J(θ) = -1/m Σ[y log(h) + (1-y)log(1-h)] | Measure error |
| Gradient | ∇J = 1/m Xᵀ(h-y) | Find update direction |
| Update | θ := θ - α∇J | Learn parameters |
Problem 1: Calculate Sigmoid
Calculate σ(0), σ(2), σ(-2)
Solution:
- σ(0) = 1/(1+e⁰) = 1/2 = 0.5
- σ(2) = 1/(1+e⁻²) ≈ 0.88
- σ(-2) = 1/(1+e²) ≈ 0.12
Problem 2: Cost Function
If y=1 and h_θ(x)=0.7, what is the cost?
Solution: Cost = -[1×log(0.7) + 0×log(0.3)] = -log(0.7) ≈ 0.357
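Both worked solutions can be checked numerically:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Problem 1: sigmoid values
print(round(float(sigmoid(0)), 2),
      round(float(sigmoid(2)), 2),
      round(float(sigmoid(-2)), 2))  # 0.5 0.88 0.12

# Problem 2: cost for y=1, h=0.7 (only the y=1 term survives)
cost = -np.log(0.7)
print(round(float(cost), 3))  # 0.357
```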
Now that you understand the math, let's see the code!