This document explains, step by step, every mathematical operation in train.py and predict.py.
Let's say we have raw mileage data: [10000, 50000, 100000, 200000, 300000]
Step 1: Find the minimum and maximum:
mileage_min = 10000
mileage_max = 300000
Step 2: Apply the normalization formula to EACH value:
normalized_value = (raw_value - min) / (max - min)
Let's normalize each value:
- Value 1: (10000 - 10000) / (300000 - 10000) = 0 / 290000 = 0.0
- Value 2: (50000 - 10000) / (300000 - 10000) = 40000 / 290000 = 0.138
- Value 3: (100000 - 10000) / (300000 - 10000) = 90000 / 290000 = 0.310
- Value 4: (200000 - 10000) / (300000 - 10000) = 190000 / 290000 = 0.655
- Value 5: (300000 - 10000) / (300000 - 10000) = 290000 / 290000 = 1.0
Result: [0.0, 0.138, 0.310, 0.655, 1.0]
The formula (value - min) / (max - min) works because:
- (value - min) shifts everything so the minimum becomes 0
  - Before: [10000, 50000, 100000, 200000, 300000]
  - After subtraction: [0, 40000, 90000, 190000, 290000]
- / (max - min) scales everything so the maximum becomes 1
  - The denominator is 290000 (the range)
  - Dividing by the range compresses all values into [0, 1]
  - The smallest value (0) divided by range = 0
  - The largest value (290000) divided by range = 1
mileage_norm = (mileage_raw - mileage_min) / (mileage_max - mileage_min)
This operation is applied to EVERY element in the array at once: NumPy evaluates the expression elementwise, broadcasting the scalars mileage_min and mileage_max across the array.
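The normalization above can be sketched in a few lines of NumPy. The function name `min_max_normalize` and the guard against a zero range are assumptions added for illustration; they are not taken from train.py.

```python
import numpy as np

def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1.

    Hypothetical helper: the name and the zero-range check are illustrative.
    """
    values = np.asarray(values, dtype=float)
    v_min, v_max = values.min(), values.max()
    if v_max == v_min:
        raise ValueError("cannot normalize a constant array (range is zero)")
    # Elementwise: the scalars v_min and v_max are broadcast across the array.
    return (values - v_min) / (v_max - v_min), v_min, v_max

mileage_raw = [10000, 50000, 100000, 200000, 300000]
mileage_norm, mileage_min, mileage_max = min_max_normalize(mileage_raw)
print(np.round(mileage_norm, 3))
```

Returning the minimum and maximum alongside the normalized array is a design choice: the same two numbers are needed later to denormalize the parameters.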
We start with simple initial guesses for θ0 and θ1 (zeros, not random values):
θ0 = 0.0
θ1 = 0.0
These are terrible guesses. We need to adjust them to make better predictions.
For each data point, we calculate what our model predicts: predictions = theta0 + theta1 * mileage
If we have 5 data points with normalized mileage [0.0, 0.138, 0.310, 0.655, 1.0]:
- Prediction 1: 0.0 + 0.0 * 0.0 = 0.0
- Prediction 2: 0.0 + 0.0 * 0.138 = 0.0
- Prediction 3: 0.0 + 0.0 * 0.310 = 0.0
- Prediction 4: 0.0 + 0.0 * 0.655 = 0.0
- Prediction 5: 0.0 + 0.0 * 1.0 = 0.0
All predictions are 0 because we started with θ0=0 and θ1=0. This is obviously wrong.
Now we compare predictions to actual values. Let's say actual normalized prices are [0.1, 0.3, 0.5, 0.7, 0.9]:
error = predictions - price
- Error 1: 0.0 - 0.1 = -0.1
- Error 2: 0.0 - 0.3 = -0.3
- Error 3: 0.0 - 0.5 = -0.5
- Error 4: 0.0 - 0.7 = -0.7
- Error 5: 0.0 - 0.9 = -0.9
Negative errors mean we predicted too low. Positive errors mean we predicted too high.
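As a small sketch, the prediction and error steps can be reproduced with the example arrays from the text. The variable names mirror the formulas above; the arrays themselves are just the illustrative values used in this section.

```python
import numpy as np

# Example values from the text (already normalized).
mileage = np.array([0.0, 0.138, 0.310, 0.655, 1.0])
price = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

theta0, theta1 = 0.0, 0.0

# One prediction per data point, computed elementwise.
predictions = theta0 + theta1 * mileage

# Sign convention: prediction minus actual, so a negative error
# means the prediction was too low.
error = predictions - price
print(error)
```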
grad0 = np.sum(error) / m
Where m = len(mileage) = 5
Let's calculate:
sum(error) = -0.1 + (-0.3) + (-0.5) + (-0.7) + (-0.9) = -2.5
grad0 = -2.5 / 5 = -0.5
What does this number mean?
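The gradient is the slope of the cost (half the mean squared error) with respect to θ0. A value of -0.5 means that, from the current parameters, a small increase in θ0 lowers the cost at a rate of about 0.5 per unit of θ0.
Since the gradient is negative (-0.5), it means:
- On average our predictions are too low (the mean error is -0.5)
- We should increase θ0 to reduce the cost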
grad1 = np.sum(error * mileage) / m
This is trickier. We multiply each error by its corresponding mileage value:
error * mileage:
-0.1 * 0.0 = 0.0
-0.3 * 0.138 = -0.0414
-0.5 * 0.310 = -0.155
-0.7 * 0.655 = -0.4585
-0.9 * 1.0 = -0.9
sum = 0.0 + (-0.0414) + (-0.155) + (-0.4585) + (-0.9) = -1.5549
grad1 = -1.5549 / 5 = -0.31098
Why multiply by mileage?
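θ1 only affects a prediction through the product θ1 * mileage, so changing θ1 moves each prediction in proportion to that point's mileage. That is why we weight each error by its input value:
- For small mileage values (close to 0), θ1 barely changes the prediction, so the error there tells us little about θ1
- For large mileage values (close to 1), θ1 changes the prediction a lot, so the error there tells us a lot about θ1
- By multiplying, we give more weight to errors at larger mileage values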
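One way to check the two gradient formulas is to compare them with a numerical derivative. This sketch assumes the cost is half the mean squared error, J = sum(error^2) / (2m), which is the convention under which sum(error) / m is the exact derivative. The helpers `cost` and `gradients` are hypothetical names.

```python
import numpy as np

mileage = np.array([0.0, 0.138, 0.310, 0.655, 1.0])
price = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
m = len(mileage)

def cost(theta0, theta1):
    """Assumed cost function: half the mean squared error."""
    error = (theta0 + theta1 * mileage) - price
    return np.sum(error ** 2) / (2 * m)

def gradients(theta0, theta1):
    """Analytic gradients, using the formulas from the text."""
    error = (theta0 + theta1 * mileage) - price
    return np.sum(error) / m, np.sum(error * mileage) / m

grad0, grad1 = gradients(0.0, 0.0)

# Central finite differences: nudge one parameter at a time.
eps = 1e-6
numeric0 = (cost(eps, 0.0) - cost(-eps, 0.0)) / (2 * eps)
numeric1 = (cost(0.0, eps) - cost(0.0, -eps)) / (2 * eps)

print(grad0, numeric0)
print(grad1, numeric1)
```

The analytic and numeric values should agree to several decimal places, which is a cheap sanity check whenever a gradient formula is written by hand.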
Now we have:
- grad0 = -0.5
- grad1 = -0.31098
- learning_rate = 0.1
tmp0 = learning_rate * grad0 = 0.1 * (-0.5) = -0.05
tmp1 = learning_rate * grad1 = 0.1 * (-0.31098) = -0.031098
theta0 -= tmp0 → theta0 = 0.0 - (-0.05) = 0.05
theta1 -= tmp1 → theta1 = 0.0 - (-0.031098) = 0.031098
What just happened?
We moved θ0 and θ1 in the direction that reduces error. The learning_rate (0.1) controls how big the step is.
The gradient points in the direction of INCREASING error. We want to DECREASE error, so we subtract:
new_value = old_value - learning_rate * gradient
This is like walking downhill: the gradient points uphill, so we go the opposite direction.
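A single update can be sketched as a small function. The name `gradient_step` is hypothetical; the point it illustrates is that both gradients are computed from the same error vector before either parameter is changed, which is what the temporaries tmp0 and tmp1 are for.

```python
import numpy as np

def gradient_step(theta0, theta1, mileage, price, learning_rate):
    """Perform one simultaneous gradient descent update (illustrative helper)."""
    m = len(mileage)
    error = (theta0 + theta1 * mileage) - price
    grad0 = np.sum(error) / m
    grad1 = np.sum(error * mileage) / m
    # Compute both step sizes first, then apply them together.
    tmp0 = learning_rate * grad0
    tmp1 = learning_rate * grad1
    return theta0 - tmp0, theta1 - tmp1

mileage = np.array([0.0, 0.138, 0.310, 0.655, 1.0])
price = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

theta0, theta1 = gradient_step(0.0, 0.0, mileage, price, learning_rate=0.1)
print(theta0, theta1)
```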
We repeat this process 1000 times:
Iteration 1:
- θ0 = 0.0, θ1 = 0.0
- predictions = [0.0, 0.0, 0.0, 0.0, 0.0]
- errors = [-0.1, -0.3, -0.5, -0.7, -0.9]
- grad0 = -0.5, grad1 = -0.31098
- θ0 becomes 0.05, θ1 becomes 0.031098
Iteration 2:
- θ0 = 0.05, θ1 = 0.031098
- predictions = [0.05 + 0.031098 * 0.0, 0.05 + 0.031098 * 0.138, ...]
- predictions = [0.05, 0.0543, 0.0596, 0.0704, 0.0811]
- errors = [0.05-0.1, 0.0543-0.3, ...] = [-0.05, -0.2457, ...]
- grad0 and grad1 are recalculated (smaller now)
- θ0 and θ1 are updated again
Iteration 3, 4, 5, ... 1000:
- Each iteration, the errors get smaller
- The gradients get smaller
- The parameter updates get smaller
- Eventually, we converge to optimal values
After many iterations, the gradients approach zero. When grad0 ≈ 0 and grad1 ≈ 0:
- tmp0 ≈ 0
- tmp1 ≈ 0
- θ0 and θ1 stop changing significantly
This is when we've found the best fit line.
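The whole loop can be sketched as follows. This is a minimal version under the assumptions used so far (zero initial parameters, learning rate 0.1, 1000 iterations); the function name `train` and the comparison against `np.polyfit` are additions for illustration, not a description of train.py.

```python
import numpy as np

def train(mileage, price, learning_rate=0.1, iterations=1000):
    """Fit price ≈ theta0 + theta1 * mileage by batch gradient descent."""
    mileage = np.asarray(mileage, dtype=float)
    price = np.asarray(price, dtype=float)
    m = len(mileage)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = (theta0 + theta1 * mileage) - price
        grad0 = np.sum(error) / m
        grad1 = np.sum(error * mileage) / m
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1

mileage = np.array([0.0, 0.138, 0.310, 0.655, 1.0])
price = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
theta0, theta1 = train(mileage, price)

# Gradients at the final parameters should be close to zero.
final_error = (theta0 + theta1 * mileage) - price
final_grad0 = np.sum(final_error) / len(mileage)
final_grad1 = np.sum(final_error * mileage) / len(mileage)

# Closed-form least-squares line for comparison (slope first, then intercept).
slope, intercept = np.polyfit(mileage, price, 1)
print(theta0, intercept)
print(theta1, slope)
```

Comparing against a closed-form fit is a convenient way to confirm that the loop has actually converged rather than merely stopped.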
After gradient descent, we have θ0_norm and θ1_norm that work on normalized data (0-1 range).
But when we use the model in predict.py, we get real mileage values like 150,000 km, not normalized values like 0.5.
We need to convert θ0_norm and θ1_norm into θ0 and θ1 that work with real data.
Original data:
x_raw = 150000 km
y_raw = ? (what we want to predict)
After normalization:
x_norm = (x_raw - x_min) / (x_max - x_min) = (150000 - 10000) / (300000 - 10000) = 140000 / 290000 = 0.483
y_norm = (y_raw - y_min) / (y_max - y_min)
Our model works on normalized data:
y_norm = θ0_norm + θ1_norm * x_norm
We need to reverse the normalization. Let's work backwards.
Start with the normalized model:
y_norm = θ0_norm + θ1_norm * x_norm
Substitute the normalization formulas:
(y_raw - y_min) / (y_max - y_min) = θ0_norm + θ1_norm * ((x_raw - x_min) / (x_max - x_min))
Multiply both sides by (y_max - y_min):
y_raw - y_min = θ0_norm * (y_max - y_min) + θ1_norm * (x_raw - x_min) * ((y_max - y_min) / (x_max - x_min))
Add y_min to both sides:
y_raw = y_min + θ0_norm * (y_max - y_min) + θ1_norm * (x_raw - x_min) * ((y_max - y_min) / (x_max - x_min))
Regroup the factors in the last term:
y_raw = y_min + θ0_norm * (y_max - y_min) + θ1_norm * (y_max - y_min) * ((x_raw - x_min) / (x_max - x_min))
Expand (x_raw - x_min) and collect terms to match the y = θ0 + θ1 * x form:
y_raw = [y_min + θ0_norm * (y_max - y_min) - θ1_norm * (y_max - y_min) * (x_min / (x_max - x_min))] + [θ1_norm * (y_max - y_min) / (x_max - x_min)] * x_raw
Therefore:
θ0 = y_min + θ0_norm * (y_max - y_min) - θ1_norm * (y_max - y_min) * (x_min / (x_max - x_min))
θ1 = θ1_norm * (y_max - y_min) / (x_max - x_min)
For θ1 (the slope): θ1 = θ1_norm * (y_max - y_min) / (x_max - x_min)
This is a scaling factor. Let's say:
- θ1_norm = 0.5 (in normalized space)
- y_max - y_min = 40000 (price range: $50k - $10k)
- x_max - x_min = 290000 (mileage range: 300k - 10k)
θ1 = 0.5 * 40000 / 290000 = 20000 / 290000 = 0.069
What does this mean?
- In normalized space: for every 1 unit increase in normalized mileage, price increases by 0.5 units
- In real space: for every 290,000 km increase in mileage, price increases by 0.5 * 40,000 = 20,000 dollars
- So for every 1 km increase, price changes by 20,000/290,000 = 0.069 dollars per km
For θ0 (the intercept): θ0 = y_min + θ0_norm * (y_max - y_min) - θ1_norm * (y_max - y_min) * (x_min / (x_max - x_min))
Let's break this into three parts:
Part 1: y_min
y_min = 10000
This is the baseline price (minimum price in the dataset).
Part 2: θ0_norm * (y_max - y_min)
Taking θ0_norm = 0.3 for illustration:
θ0_norm * (y_max - y_min) = 0.3 * 40000 = 12000
This scales the normalized intercept back to the price range.
Part 3: θ1_norm * (y_max - y_min) * (x_min / (x_max - x_min))
θ1_norm * (y_max - y_min) * (x_min / (x_max - x_min)) = 0.5 * 40000 * (10000 / 290000) = 20000 * 0.0345 = 690
This is a correction term. It accounts for the fact that when x_raw = 0, the normalized x would be negative (since x_min = 10000).
Final θ0: θ0 = 10000 + 12000 - 690 = 21310
Let's verify with a data point. Say we have x_raw = 150000:
Using denormalized model:
y_raw = θ0 + θ1 * x_raw = 21310 + 0.069 * 150000 = 21310 + 10350 = 31660
Using normalized model (should give same result):
x_norm = (150000 - 10000) / 290000 = 0.483
y_norm = θ0_norm + θ1_norm * x_norm = 0.3 + 0.5 * 0.483 = 0.3 + 0.2415 = 0.5415
y_raw = y_min + y_norm * (y_max - y_min) = 10000 + 0.5415 * 40000 = 10000 + 21660 = 31660
Both methods give the same answer, so the denormalization formulas are correct. (The figure 31660 reflects the rounded intermediate values used here; without rounding, both routes give 31655.17.)
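The same verification can be scripted. This is a sketch using the illustrative values from the text (θ0_norm = 0.3, θ1_norm = 0.5); the function name `denormalize_thetas` is hypothetical.

```python
def denormalize_thetas(theta0_norm, theta1_norm, x_min, x_max, y_min, y_max):
    """Convert parameters learned on min-max normalized data to the raw scale."""
    x_range = x_max - x_min
    y_range = y_max - y_min
    theta1 = theta1_norm * y_range / x_range
    theta0 = y_min + theta0_norm * y_range - theta1_norm * y_range * (x_min / x_range)
    return theta0, theta1

x_min, x_max = 10000, 300000
y_min, y_max = 10000, 50000
theta0_norm, theta1_norm = 0.3, 0.5  # illustrative values from the text

theta0, theta1 = denormalize_thetas(theta0_norm, theta1_norm, x_min, x_max, y_min, y_max)

x_raw = 150000
# Route 1: use the denormalized parameters directly on raw mileage.
direct = theta0 + theta1 * x_raw
# Route 2: normalize the input, predict, then denormalize the output.
x_norm = (x_raw - x_min) / (x_max - x_min)
via_norm = y_min + (theta0_norm + theta1_norm * x_norm) * (y_max - y_min)

print(round(direct, 2), round(via_norm, 2))
```

Because no intermediate value is rounded here, both routes agree at about 31655.17, slightly below the hand-computed 31660.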
Once we have θ0 and θ1 in the original scale, prediction is straightforward:
price = theta0 + (theta1 * mileage)
This is just the equation of a line.
Example:
θ0 = 21310
θ1 = 0.069
mileage = 150000
price = 21310 + (0.069 * 150000) = 21310 + 10350 = 31660
The line we found during training is the best fit through all the data points. It minimizes the total squared error. So when we plug in a new mileage value, we get the best prediction based on the pattern we learned.
mileage_raw = [10000, 50000, 100000, 200000, 300000]
price_raw = [10000, 20000, 30000, 40000, 50000]
mileage_norm = [0.0, 0.138, 0.310, 0.655, 1.0]
For the prices, price_min = 10000 and price_max = 50000, so:
price_norm = [(10000-10000)/(50000-10000), (20000-10000)/(50000-10000), ...] = [0.0, 0.25, 0.5, 0.75, 1.0]
Iteration 1:
- θ0 = 0.0, θ1 = 0.0
- predictions = [0.0, 0.0, 0.0, 0.0, 0.0]
- actual = [0.0, 0.25, 0.5, 0.75, 1.0]
- errors = predictions - actual = [0.0, -0.25, -0.5, -0.75, -1.0]
- grad0 = (0.0 - 0.25 - 0.5 - 0.75 - 1.0) / 5 = -2.5 / 5 = -0.5
- grad1 = (0.0 * 0.0 - 0.25 * 0.138 - 0.5 * 0.310 - 0.75 * 0.655 - 1.0 * 1.0) / 5 = (0 - 0.0345 - 0.155 - 0.4912 - 1.0) / 5 = -1.6807 / 5 = -0.3361
- tmp0 = 0.1 * (-0.5) = -0.05
- tmp1 = 0.1 * (-0.3361) = -0.03361
- θ0 = 0.0 - (-0.05) = 0.05
- θ1 = 0.0 - (-0.03361) = 0.03361
The sign convention matters here: errors must be computed as predictions - actual. Computing actual - predictions instead gives errors of [0.0, 0.25, 0.5, 0.75, 1.0], which flips both gradients to positive (0.5 and 0.3361) and pushes θ0 and θ1 to -0.05 and -0.03361, away from the data. With the correct convention, negative errors mean we predicted too low. Negative gradients mean we should increase the parameters. Subtracting a negative is adding, so we increase.
Iteration 2:
- θ0 = 0.05, θ1 = 0.03361
- predictions = [0.05 + 0.03361 * 0.0, 0.05 + 0.03361 * 0.138, ...] = [0.05, 0.0546, 0.0604, 0.0720, 0.0836]
- errors = [0.05 - 0.0, 0.0546 - 0.25, 0.0604 - 0.5, ...] = [0.05, -0.1954, -0.4396, -0.6780, -0.9164]
- grad0 = (0.05 - 0.1954 - 0.4396 - 0.6780 - 0.9164) / 5 = -2.1794 / 5 = -0.4359
- grad1 = (0.05 * 0.0 - 0.1954 * 0.138 - 0.4396 * 0.310 - 0.6780 * 0.655 - 0.9164 * 1.0) / 5 = (0 - 0.0270 - 0.1363 - 0.4441 - 0.9164) / 5 = -1.5238 / 5 = -0.3048
- tmp0 = 0.1 * (-0.4359) = -0.04359
- tmp1 = 0.1 * (-0.3048) = -0.03048
- θ0 = 0.05 - (-0.04359) = 0.09359
- θ1 = 0.03361 - (-0.03048) = 0.06409
The errors are getting smaller! The gradients are getting smaller! This is convergence happening.
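The first two iterations of this example can be traced with a short script. It is a sketch that uses unrounded normalized values, so the printed numbers may differ from the hand calculation in the third or fourth decimal place. The `history` list is an illustrative addition.

```python
import numpy as np

mileage_raw = np.array([10000, 50000, 100000, 200000, 300000], dtype=float)
price_raw = np.array([10000, 20000, 30000, 40000, 50000], dtype=float)

# Min-max normalize both inputs and targets.
mileage = (mileage_raw - mileage_raw.min()) / (mileage_raw.max() - mileage_raw.min())
price = (price_raw - price_raw.min()) / (price_raw.max() - price_raw.min())

theta0, theta1 = 0.0, 0.0
learning_rate = 0.1
m = len(mileage)

history = []  # (grad0, grad1, theta0, theta1) after each iteration
for iteration in range(1, 3):
    error = (theta0 + theta1 * mileage) - price  # predictions - actual
    grad0 = np.sum(error) / m
    grad1 = np.sum(error * mileage) / m
    theta0 -= learning_rate * grad0
    theta1 -= learning_rate * grad1
    history.append((grad0, grad1, theta0, theta1))
    print(f"iteration {iteration}: grad0={grad0:.4f} grad1={grad1:.4f} "
          f"theta0={theta0:.5f} theta1={theta1:.5f}")
```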
After 1000 iterations, let's say we end up with the following (illustrative values chosen for easy arithmetic, not the actual optimum for this data):
θ0_norm = 0.5
θ1_norm = 1.0
Now denormalize:
θ1 = θ1_norm * (y_max - y_min) / (x_max - x_min) = 1.0 * (50000 - 10000) / (300000 - 10000) = 1.0 * 40000 / 290000 = 0.1379
θ0 = y_min + θ0_norm * (y_max - y_min) - θ1_norm * (y_max - y_min) * (x_min / (x_max - x_min)) = 10000 + 0.5 * 40000 - 1.0 * 40000 * (10000 / 290000) = 10000 + 20000 - 40000 * 0.0345 = 10000 + 20000 - 1379 = 28621
User enters mileage = 150000:
price = 28621 + 0.1379 * 150000 = 28621 + 20685 = 49306
Imagine a 3D landscape where:
- The x-axis is θ0
- The y-axis is θ1
- The z-axis (height) is the total error
We start at some initial point on this landscape (θ0 = 0, θ1 = 0 in our case). The gradient tells us which direction is steepest uphill. We go the opposite direction (downhill). We repeat until we reach the valley (minimum error).
Without normalization, the landscape would be extremely skewed:
- One direction (θ1) would have a very steep slope
- Another direction (θ0) would have a very gentle slope
- We'd zigzag inefficiently
With normalization, the landscape is more symmetric, so we descend straight to the minimum.
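A quick experiment illustrates why normalization matters. This sketch runs three gradient descent steps with the same learning rate (0.1) on raw and on normalized data. On raw data the steep θ1 direction makes the updates overshoot so badly that the cost grows instead of shrinking. The helper names `cost` and `cost_after` are hypothetical.

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Half the mean squared error (assumed cost function)."""
    return np.sum(((theta0 + theta1 * x) - y) ** 2) / (2 * len(x))

def cost_after(x, y, learning_rate, steps):
    """Run `steps` gradient descent updates from (0, 0) and return the cost."""
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    for _ in range(steps):
        error = (theta0 + theta1 * x) - y
        grad0 = np.sum(error) / m
        grad1 = np.sum(error * x) / m
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return cost(theta0, theta1, x, y)

x_raw = np.array([10000, 50000, 100000, 200000, 300000], dtype=float)
y_raw = np.array([10000, 20000, 30000, 40000, 50000], dtype=float)
x_norm = (x_raw - x_raw.min()) / (x_raw.max() - x_raw.min())
y_norm = (y_raw - y_raw.min()) / (y_raw.max() - y_raw.min())

raw_before = cost_after(x_raw, y_raw, 0.1, 0)
raw_after = cost_after(x_raw, y_raw, 0.1, 3)
norm_before = cost_after(x_norm, y_norm, 0.1, 0)
norm_after = cost_after(x_norm, y_norm, 0.1, 3)

print(f"raw:        {raw_before:.3e} -> {raw_after:.3e}")
print(f"normalized: {norm_before:.3e} -> {norm_after:.3e}")
```

The step count is kept at three on purpose: on raw data the cost grows so quickly that many more iterations would overflow floating-point range.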
The model learned on normalized data, but the user provides real data. Denormalization translates the learned parameters from normalized space back to real space, so the model works correctly.
grad0 = ∂J / ∂θ0
grad1 = ∂J / ∂θ1
Here J is the cost: half the mean squared error, J = sum(error^2) / (2 * m). These are partial derivatives. They tell us how much the cost changes when we change each parameter by a tiny amount.
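For reference, the gradient formulas used in training follow from differentiating a squared-error cost. The derivation below assumes the cost carries a factor of 1/(2m), the convention under which the gradients come out exactly as sum(error) / m and sum(error * mileage) / m. Here x_i is the normalized mileage and y_i the normalized price of data point i.

```latex
\begin{align*}
J(\theta_0, \theta_1)
  &= \frac{1}{2m} \sum_{i=1}^{m} \left(\theta_0 + \theta_1 x_i - y_i\right)^2 \\
\frac{\partial J}{\partial \theta_0}
  &= \frac{1}{m} \sum_{i=1}^{m} \left(\theta_0 + \theta_1 x_i - y_i\right)
   = \texttt{grad0} \\
\frac{\partial J}{\partial \theta_1}
  &= \frac{1}{m} \sum_{i=1}^{m} \left(\theta_0 + \theta_1 x_i - y_i\right) x_i
   = \texttt{grad1}
\end{align*}
```

The factor x_i in the second derivative comes from the chain rule, which is why each error is multiplied by its mileage when computing grad1.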
- learning_rate = 0.1 means we take 10% of the gradient as our step
- If learning_rate = 1.0, we'd take 100% of the gradient (might overshoot)
- If learning_rate = 0.01, we'd take 1% of the gradient (very slow)
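The effect of the learning rate can be checked empirically. This sketch trains on the normalized example data for 1000 iterations at each of the three rates mentioned above and reports the final cost. The helper `final_cost` is a hypothetical name, and the cost is assumed to be half the mean squared error.

```python
import numpy as np

mileage = np.array([0.0, 0.138, 0.310, 0.655, 1.0])
price = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

def final_cost(learning_rate, iterations=1000):
    """Train from (0, 0) and return the cost at the final parameters."""
    m = len(mileage)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = (theta0 + theta1 * mileage) - price
        grad0 = np.sum(error) / m
        grad1 = np.sum(error * mileage) / m
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    error = (theta0 + theta1 * mileage) - price
    return np.sum(error ** 2) / (2 * m)

costs = {lr: final_cost(lr) for lr in (0.01, 0.1, 1.0)}
for lr, c in costs.items():
    print(f"learning_rate={lr}: final cost={c:.6f}")
```

On this particular dataset a rate of 1.0 still converges; whether a large rate overshoots depends on the data, so the comparison is best rerun on the real training set.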
x_norm = (x_raw - x_min) / (x_max - x_min)
This is a linear transformation: x_norm = a * x_raw + b where:
- a = 1 / (x_max - x_min)
- b = -x_min / (x_max - x_min)
Linear transformations preserve the linear relationship between x and y, which is why we can denormalize the parameters.
- θ1 is scaled by the ratio of output range to input range
- θ0 is adjusted for both the scaling and the shift caused by the minimum values
This ensures that y_raw = θ0 + θ1 * x_raw produces the same predictions as y_norm = θ0_norm + θ1_norm * x_norm.