Deep Dive: How the Mathematics Actually Works

This document explains the mechanisms and mechanics of every mathematical operation in train.py and predict.py.

PART 1: NORMALIZATION MECHANICS

What Actually Happens When We Normalize

Let's say we have raw mileage data: [10000, 50000, 100000, 200000, 300000]

Step 1: Find the minimum and maximum mileage_min = 10000 mileage_max = 300000

Step 2: Apply the normalization formula to EACH value normalized_value = (raw_value - min) / (max - min)

Let's normalize each value:

Value 1: (10000 - 10000) / (300000 - 10000) = 0 / 290000 = 0.0
Value 2: (50000 - 10000) / (300000 - 10000) = 40000 / 290000 = 0.138
Value 3: (100000 - 10000) / (300000 - 10000) = 90000 / 290000 = 0.310
Value 4: (200000 - 10000) / (300000 - 10000) = 190000 / 290000 = 0.655
Value 5: (300000 - 10000) / (300000 - 10000) = 290000 / 290000 = 1.0

Result: [0.0, 0.138, 0.310, 0.655, 1.0]

Why This Formula Works - The Mechanism

The formula (value - min) / (max - min) works because:

(value - min) shifts everything so the minimum becomes 0
- Before: [10000, 50000, 100000, 200000, 300000]
- After subtraction: [0, 40000, 90000, 190000, 290000]
/ (max - min) scales everything so the maximum becomes 1
- The denominator is 290000 (the range)
- Dividing by the range compresses all values into [0, 1]
- The smallest value (0) divided by range = 0
- The largest value (290000) divided by range = 1

In Code:

mileage_norm = (mileage_raw - mileage_min) / (mileage_max - mileage_min)

This operation happens to EVERY element in the array simultaneously (NumPy broadcasts it).

PART 2: GRADIENT DESCENT MECHANICS - The Core Algorithm

What We're Trying to Do

We start with random guesses for θ0 and θ1: θ0 = 0.0 θ1 = 0.0

These are terrible guesses. We need to adjust them to make better predictions.

The Prediction Mechanism

For each data point, we calculate what our model predicts: predictions = theta0 + theta1 * mileage

If we have 5 data points with normalized mileage [0.0, 0.138, 0.310, 0.655, 1.0]:

Prediction 1: 0.0 + 0.0 * 0.0 = 0.0
Prediction 2: 0.0 + 0.0 * 0.138 = 0.0
Prediction 3: 0.0 + 0.0 * 0.310 = 0.0
Prediction 4: 0.0 + 0.0 * 0.655 = 0.0
Prediction 5: 0.0 + 0.0 * 1.0 = 0.0

All predictions are 0 because we started with θ0=0 and θ1=0. This is obviously wrong.

The Error Calculation Mechanism

Now we compare predictions to actual values. Let's say actual normalized prices are [0.1, 0.3, 0.5, 0.7, 0.9]:

error = predictions - price

Error 1: 0.0 - 0.1 = -0.1
Error 2: 0.0 - 0.3 = -0.3
Error 3: 0.0 - 0.5 = -0.5
Error 4: 0.0 - 0.7 = -0.7
Error 5: 0.0 - 0.9 = -0.9

Negative errors mean we predicted too low. Positive errors mean we predicted too high.

The Gradient Calculation Mechanism - θ0 Gradient

grad0 = np.sum(error) / m

Where m = len(mileage) = 5

Let's calculate: sum(error) = -0.1 + (-0.3) + (-0.5) + (-0.7) + (-0.9) = -2.5 grad0 = -2.5 / 5 = -0.5

What does this number mean?

The gradient tells us: "If we increase θ0 by 1, the average error will change by -0.5"

Since the gradient is negative (-0.5), it means:

If we increase θ0, the error gets smaller (becomes more negative)
We should increase θ0 to reduce error

The Gradient Calculation Mechanism - θ1 Gradient

grad1 = np.sum(error * mileage) / m

This is trickier. We multiply each error by its corresponding mileage value:

error * mileage: -0.1 * 0.0 = 0.0 -0.3 * 0.138 = -0.0414 -0.5 * 0.310 = -0.155 -0.7 * 0.655 = -0.4585 -0.9 * 1.0 = -0.9

sum = 0.0 + (-0.0414) + (-0.155) + (-0.4585) + (-0.9) = -1.5549 grad1 = -1.5549 / 5 = -0.31098

Why multiply by mileage?

This is the key insight of linear regression. We weight the error by the input value because:

For small mileage values (close to 0), the error doesn't tell us much about θ1
For large mileage values (close to 1), the error tells us a lot about θ1
By multiplying, we give more weight to errors at larger mileage values

The Parameter Update Mechanism

Now we have:

grad0 = -0.5
grad1 = -0.31098
learning_rate = 0.1

tmp0 = learning_rate * grad0 = 0.1 * (-0.5) = -0.05 tmp1 = learning_rate * grad1 = 0.1 * (-0.31098) = -0.031098

theta0 -= tmp0 → theta0 = 0.0 - (-0.05) = 0.05 theta1 -= tmp1 → theta1 = 0.0 - (-0.031098) = 0.031098

What just happened?

We moved θ0 and θ1 in the direction that reduces error. The learning_rate (0.1) controls how big the step is.

Why We Subtract the Gradient

The gradient points in the direction of INCREASING error. We want to DECREASE error, so we subtract:

new_value = old_value - learning_rate * gradient

This is like walking downhill: the gradient points uphill, so we go the opposite direction.

The Iteration Loop

We repeat this process 1000 times:

Iteration 1:

θ0 = 0.0, θ1 = 0.0
predictions = [0.0, 0.0, 0.0, 0.0, 0.0]
errors = [-0.1, -0.3, -0.5, -0.7, -0.9]
grad0 = -0.5, grad1 = -0.31098
θ0 becomes 0.05, θ1 becomes 0.031098

Iteration 2:

θ0 = 0.05, θ1 = 0.031098
predictions = [0.05 + 0.0310980.0, 0.05 + 0.0310980.138, ...]
predictions = [0.05, 0.0543, 0.0596, 0.0704, 0.0811]
errors = [0.05-0.1, 0.0543-0.3, ...] = [-0.05, -0.2457, ...]
grad0 and grad1 are recalculated (smaller now)
θ0 and θ1 are updated again

Iteration 3, 4, 5, ... 1000:

Each iteration, the errors get smaller
The gradients get smaller
The parameter updates get smaller
Eventually, we converge to optimal values

The Convergence Mechanism

After many iterations, the gradients approach zero. When grad0 ≈ 0 and grad1 ≈ 0:

tmp0 ≈ 0
tmp1 ≈ 0
θ0 and θ1 stop changing significantly

This is when we've found the best fit line.

PART 3: DENORMALIZATION MECHANICS - Reversing the Transformation

The Problem We're Solving

After gradient descent, we have θ0_norm and θ1_norm that work on normalized data (0-1 range).

But when we use the model in predict.py, we get real mileage values like 150,000 km, not normalized values like 0.5.

We need to convert θ0_norm and θ1_norm into θ0 and θ1 that work with real data.

Understanding the Transformation Chain

Original data: x_raw = 150000 km y_raw = ? (what we want to predict)

After normalization: x_norm = (x_raw - x_min) / (x_max - x_min) = (150000 - 10000) / (300000 - 10000) = 140000 / 290000 = 0.483

y_norm = (y_raw - y_min) / (y_max - y_min)

Our model works on normalized data: y_norm = θ0_norm + θ1_norm * x_norm

The Denormalization Derivation - Step by Step

We need to reverse the normalization. Let's work backwards.

Start with the normalized model: y_norm = θ0_norm + θ1_norm * x_norm

Substitute the normalization formulas: (y_raw - y_min) / (y_max - y_min) = θ0_norm + θ1_norm * ((x_raw - x_min) / (x_max - x_min))

Multiply both sides by (y_max - y_min): y_raw - y_min = θ0_norm * (y_max - y_min) + θ1_norm * (x_raw - x_min) * ((y_max - y_min) / (x_max - x_min))

Add y_min to both sides: y_raw = y_min + θ0_norm * (y_max - y_min) + θ1_norm * (x_raw - x_min) * ((y_max - y_min) / (x_max - x_min))

Expand the last term: y_raw = y_min + θ0_norm * (y_max - y_min) + θ1_norm * (y_max - y_min) * ((x_raw - x_min) / (x_max - x_min))

Rearrange to match y = θ0 + θ1 * x form: y_raw = [y_min + θ0_norm * (y_max - y_min) - θ1_norm * (y_max - y_min) * (x_min / (x_max - x_min))] + [θ1_norm * (y_max - y_min) / (x_max - x_min)] * x_raw

Therefore: θ0 = y_min + θ0_norm * (y_max - y_min) - θ1_norm * (y_max - y_min) * (x_min / (x_max - x_min)) θ1 = θ1_norm * (y_max - y_min) / (x_max - x_min)

Breaking Down Each Component

For θ1 (the slope): θ1 = θ1_norm * (y_max - y_min) / (x_max - x_min)

This is a scaling factor. Let's say:

θ1_norm = 0.5 (in normalized space)
y_max - y_min = 40000 (price range: $50k - $10k)
x_max - x_min = 290000 (mileage range: 300k - 10k)

θ1 = 0.5 * 40000 / 290000 = 20000 / 290000 = 0.069

What does this mean?

In normalized space: for every 1 unit increase in normalized mileage, price increases by 0.5 units
In real space: for every 290,000 km increase in mileage, price increases by 40,000 dollars
So for every 1 km increase, price changes by 40,000/290,000 = 0.138 dollars per km

For θ0 (the intercept): θ0 = y_min + θ0_norm * (y_max - y_min) - θ1_norm * (y_max - y_min) * (x_min / (x_max - x_min))

Let's break this into three parts:

Part 1: y_min y_min = 10000 This is the baseline price (minimum price in the dataset).

Part 2: θ0_norm * (y_max - y_min) θ0_norm * (y_max - y_min) = 0.3 * 40000 = 12000 This scales the normalized intercept back to the price range.

Part 3: θ1_norm * (y_max - y_min) * (x_min / (x_max - x_min)) θ1_norm * (y_max - y_min) * (x_min / (x_max - x_min)) = 0.5 * 40000 * (10000 / 290000) = 20000 * 0.0345 = 690

This is a correction term. It accounts for the fact that when x_raw = 0, the normalized x would be negative (since x_min = 10000).

Final θ0: θ0 = 10000 + 12000 - 690 = 21310

Verification - Does It Work?

Let's verify with a data point. Say we have x_raw = 150000:

Using denormalized model: y_raw = θ0 + θ1 * x_raw = 21310 + 0.069 * 150000 = 21310 + 10350 = 31660

Using normalized model (should give same result): x_norm = (150000 - 10000) / 290000 = 0.483 y_norm = θ0_norm + θ1_norm * x_norm = 0.3 + 0.5 * 0.483 = 0.3 + 0.2415 = 0.5415

y_raw = y_min + y_norm * (y_max - y_min) = 10000 + 0.5415 * 40000 = 10000 + 21660 = 31660

Perfect! Both methods give the same answer. The denormalization formulas are correct.

PART 4: PREDICTION MECHANICS

The Simple Part

Once we have θ0 and θ1 in the original scale, prediction is straightforward:

price = theta0 + (theta1 * mileage)

This is just the equation of a line.

Example: θ0 = 21310 θ1 = 0.069 mileage = 150000

price = 21310 + (0.069 * 150000) = 21310 + 10350 = 31660

Why This Works

The line we found during training is the best fit through all the data points. It minimizes the total squared error. So when we plug in a new mileage value, we get the best prediction based on the pattern we learned.

PART 5: COMPLETE MATHEMATICAL FLOW

Iteration 1: Raw Data

mileage_raw = [10000, 50000, 100000, 200000, 300000] price_raw = [10000, 20000, 30000, 40000, 50000]

Iteration 2: Normalization

mileage_norm = [0.0, 0.138, 0.310, 0.655, 1.0] price_norm = [0.0, 0.333, 0.667, 1.0, 1.333] (wait, this goes above 1!)

Actually, let me recalculate with correct data: price_raw = [10000, 20000, 30000, 40000, 50000] price_min = 10000 price_max = 50000 price_norm = [(10000-10000)/(50000-10000), (20000-10000)/(50000-10000), ...] = [0.0, 0.25, 0.5, 0.75, 1.0]

Iteration 3: Gradient Descent (simplified - just 2 iterations)

Iteration 1:

θ0 = 0.0, θ1 = 0.0
predictions = [0.0, 0.0, 0.0, 0.0, 0.0]
errors = [0.0, 0.25, 0.5, 0.75, 1.0]
grad0 = (0.0 + 0.25 + 0.5 + 0.75 + 1.0) / 5 = 0.5
grad1 = (0.00.0 + 0.250.138 + 0.50.310 + 0.750.655 + 1.0*1.0) / 5 = (0 + 0.0345 + 0.155 + 0.4912 + 1.0) / 5 = 1.6807 / 5 = 0.3361
tmp0 = 0.1 * 0.5 = 0.05
tmp1 = 0.1 * 0.3361 = 0.03361
θ0 = 0.0 - 0.05 = -0.05
θ1 = 0.0 - 0.03361 = -0.03361

Wait, we got negative values. This means our initial errors were positive (we predicted too low), so we need to increase θ0 and θ1. But we subtracted, which made them negative. Let me reconsider...

Actually, the logic is correct. When errors are positive, the gradient is positive. Subtracting a positive gradient moves us in the negative direction. But that seems wrong...

Let me recalculate more carefully:

Iteration 1:

θ0 = 0.0, θ1 = 0.0
predictions = [0.0, 0.0, 0.0, 0.0, 0.0]
actual = [0.0, 0.25, 0.5, 0.75, 1.0]
errors = predictions - actual = [0.0, -0.25, -0.5, -0.75, -1.0]
grad0 = (-0.0 - 0.25 - 0.5 - 0.75 - 1.0) / 5 = -2.5 / 5 = -0.5
grad1 = (-0.00.0 - 0.250.138 - 0.50.310 - 0.750.655 - 1.0*1.0) / 5 = (0 - 0.0345 - 0.155 - 0.4912 - 1.0) / 5 = -1.6807 / 5 = -0.3361
tmp0 = 0.1 * (-0.5) = -0.05
tmp1 = 0.1 * (-0.3361) = -0.03361
θ0 = 0.0 - (-0.05) = 0.05
θ1 = 0.0 - (-0.03361) = 0.03361

Now it makes sense! Negative errors mean we predicted too low. Negative gradients mean we should increase the parameters. Subtracting a negative is adding, so we increase.

Iteration 2:

θ0 = 0.05, θ1 = 0.03361
predictions = [0.05 + 0.033610.0, 0.05 + 0.033610.138, ...] = [0.05, 0.0546, 0.0604, 0.0720, 0.0836]
errors = [0.05 - 0.0, 0.0546 - 0.25, 0.0604 - 0.5, ...] = [0.05, -0.1954, -0.4396, -0.6780, -0.9164]
grad0 = (0.05 - 0.1954 - 0.4396 - 0.6780 - 0.9164) / 5 = -2.1294 / 5 = -0.4259
grad1 = (0.050.0 - 0.19540.138 - 0.43960.310 - 0.67800.655 - 0.9164*1.0) / 5 = (0 - 0.0270 - 0.1363 - 0.4441 - 0.9164) / 5 = -1.5238 / 5 = -0.3048
tmp0 = 0.1 * (-0.4259) = -0.04259
tmp1 = 0.1 * (-0.3048) = -0.03048
θ0 = 0.05 - (-0.04259) = 0.09259
θ1 = 0.03361 - (-0.03048) = 0.06409

The errors are getting smaller! The gradients are getting smaller! This is convergence happening.

Iteration 4: Denormalization

After 1000 iterations, let's say we end up with: θ0_norm = 0.5 θ1_norm = 1.0

Now denormalize: θ1 = θ1_norm * (y_max - y_min) / (x_max - x_min) = 1.0 * (50000 - 10000) / (300000 - 10000) = 1.0 * 40000 / 290000 = 0.1379

θ0 = y_min + θ0_norm * (y_max - y_min) - θ1_norm * (y_max - y_min) * (x_min / (x_max - x_min)) = 10000 + 0.5 * 40000 - 1.0 * 40000 * (10000 / 290000) = 10000 + 20000 - 40000 * 0.0345 = 10000 + 20000 - 1379 = 28621

Iteration 5: Prediction

User enters mileage = 150000: price = 28621 + 0.1379 * 150000 = 28621 + 20685 = 49306

PART 6: The Mathematical Intuition

Why Gradient Descent Works

Imagine a 3D landscape where:

The x-axis is θ0
The y-axis is θ1
The z-axis (height) is the total error

We start at some random point on this landscape. The gradient tells us which direction is steepest uphill. We go the opposite direction (downhill). We repeat until we reach the valley (minimum error).

Why Normalization Helps

Without normalization, the landscape would be extremely skewed:

One direction (θ1) would have a very steep slope
Another direction (θ0) would have a very gentle slope
We'd zigzag inefficiently

With normalization, the landscape is more symmetric, so we descend straight to the minimum.

Why Denormalization is Necessary

The model learned on normalized data, but the user provides real data. Denormalization translates the learned parameters from normalized space back to real space, so the model works correctly.

PART 7: Key Mathematical Insights

The Gradient is the Derivative

grad0 = ∂(error) / ∂θ0 grad1 = ∂(error) / ∂θ1

These are partial derivatives. They tell us how much the error changes when we change each parameter by a tiny amount.

The Learning Rate Controls Step Size

learning_rate = 0.1 means we take 10% of the gradient as our step
If learning_rate = 1.0, we'd take 100% of the gradient (might overshoot)
If learning_rate = 0.01, we'd take 1% of the gradient (very slow)

The Normalization Formula is Linear

x_norm = (x_raw - x_min) / (x_max - x_min)

This is a linear transformation: x_norm = a * x_raw + b where:

a = 1 / (x_max - x_min)
b = -x_min / (x_max - x_min)

Linear transformations preserve the linear relationship between x and y, which is why we can denormalize the parameters.

The Denormalization Formula Accounts for Both Scaling and Shifting

θ1 is scaled by the ratio of output range to input range
θ0 is adjusted for both the scaling and the shift caused by the minimum values

This ensures that y_raw = θ0 + θ1 * x_raw produces the same predictions as y_norm = θ0_norm + θ1_norm * x_norm.

FilesExpand file tree

DEEP_MATHEMATICS_MECHANICS.md

Latest commit

History

DEEP_MATHEMATICS_MECHANICS.md

File metadata and controls

Deep Dive: How the Mathematics Actually Works

PART 1: NORMALIZATION MECHANICS

What Actually Happens When We Normalize

Why This Formula Works - The Mechanism

In Code:

PART 2: GRADIENT DESCENT MECHANICS - The Core Algorithm

What We're Trying to Do

The Prediction Mechanism

The Error Calculation Mechanism

The Gradient Calculation Mechanism - θ0 Gradient

The Gradient Calculation Mechanism - θ1 Gradient

The Parameter Update Mechanism

Why We Subtract the Gradient

The Iteration Loop

The Convergence Mechanism

PART 3: DENORMALIZATION MECHANICS - Reversing the Transformation

The Problem We're Solving

Understanding the Transformation Chain

The Denormalization Derivation - Step by Step

Breaking Down Each Component

Verification - Does It Work?

PART 4: PREDICTION MECHANICS

The Simple Part

Why This Works

PART 5: COMPLETE MATHEMATICAL FLOW

Iteration 1: Raw Data

Iteration 2: Normalization

Iteration 3: Gradient Descent (simplified - just 2 iterations)

Iteration 4: Denormalization

Iteration 5: Prediction

PART 6: The Mathematical Intuition

Why Gradient Descent Works

Why Normalization Helps

Why Denormalization is Necessary

PART 7: Key Mathematical Insights

The Gradient is the Derivative

The Learning Rate Controls Step Size

The Normalization Formula is Linear

The Denormalization Formula Accounts for Both Scaling and Shifting