\textbf{Figure. Equality of Opportunity.} \underline{Left:} All groups have similar TPR ($\sim$80\%), meaning qualified individuals are equally likely to be identified.
\underline{Right:} Group A has 90\% TPR while Group C has only 45\%, meaning qualified individuals in Group C are frequently overlooked.
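The audit behind this figure can be sketched in a few lines: compute the TPR separately per group and compare. The labels, predictions, and group assignments below are synthetic, purely for illustration.

```python
import numpy as np

def true_positive_rate(y_true, y_pred):
    """TPR = TP / (TP + FN): the share of actual positives the model identifies."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_pred[y_true == 1] == 1).mean()

# Synthetic labels, predictions, and group membership
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0])
group  = np.array(["A"] * 6 + ["C"] * 6)

tpr = {g: true_positive_rate(y_true[group == g], y_pred[group == g])
       for g in ("A", "C")}
print(tpr)  # Group A finds 3 of 4 qualified individuals, Group C only 1 of 4
```

A large gap between the per-group TPRs, as here, signals a violation of Equality of Opportunity.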
\orangebox{Did you know that...}
{Equality of Opportunity was popularized in the 2016 paper “Equality of Opportunity in Supervised Learning” by Hardt, Price, and Srebro.
In that paper, they introduced both Equality of Opportunity and its stricter sibling, Equalized Odds. The terms have since become standard
\subsection{Equality of Odds}
\textbf{Figure. Equality of Odds.} \underline{Left:} Both TPR and FPR are consistent across groups (fair). \underline{Right:} Group A has high TPR but low FPR, while Group C has low TPR and high FPR — the model performs well only for Group A.
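Checking Equality of Odds means comparing both rates per group. A minimal sketch on synthetic data (both groups share the same label distribution, only the model's behavior differs):

```python
import numpy as np

def tpr_fpr(y_true, y_pred):
    """Return (TPR, FPR) for binary labels and predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = (y_pred[y_true == 1] == 1).mean()
    fpr = (y_pred[y_true == 0] == 1).mean()
    return tpr, fpr

# Synthetic data: identical labels in both groups, different model behavior
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0] * 2)
y_pred = np.array([1, 1, 1, 1, 0, 0, 0, 1,    # group A
                   1, 0, 0, 0, 1, 1, 0, 0])   # group C
group  = np.array(["A"] * 8 + ["C"] * 8)

for g in ("A", "C"):
    tpr, fpr = tpr_fpr(y_true[group == g], y_pred[group == g])
    print(f"{g}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Equality of Odds requires both TPR and FPR to match across groups; here group A dominates on both rates.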
\orangebox{Did you know that...}
{Equality of Odds was popularized in the 2016 paper “Equality of Opportunity in Supervised Learning” by Hardt, Price, and Srebro.
In that paper, they introduced both Equality of Odds and its less strict sibling, Equality of Opportunity. The terms have since become standard
\textbf{Figure. Predictive Parity.} \underline{Left:} All groups have similar PPV ($\sim$74\%), meaning a positive prediction is equally trustworthy across groups.
\underline{Right:} A positive prediction for Group A is correct 85\% of the time but only 40\% for Group C.
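Predictive parity is audited by computing the PPV per group. A small sketch on synthetic labels and predictions:

```python
import numpy as np

def ppv(y_true, y_pred):
    """PPV = TP / (TP + FP): how often a positive prediction is correct."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_true[y_pred == 1] == 1).mean()

# Synthetic data: both groups receive four positive predictions each
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0,   # group A
                   1, 0, 0, 0, 1, 0, 1, 0])  # group C
y_pred = np.array([1, 1, 1, 1, 0, 0, 0, 0,
                   1, 1, 1, 1, 0, 0, 0, 0])
group  = np.array(["A"] * 8 + ["C"] * 8)

ppv_by_group = {g: ppv(y_true[group == g], y_pred[group == g]) for g in ("A", "C")}
print(ppv_by_group)  # a positive prediction is right 75% of the time for A, 25% for C
```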
\orangebox{Did you know that...}
{Predictive parity gained attention during the debate around the COMPAS recidivism tool. ProPublica’s 2016 investigation argued that COMPAS was
unfair because it did not satisfy equalized odds, while its developers countered that the tool did satisfy predictive parity, illustrating how
\subsection{Calibration within Groups}
\textbf{Figure. Calibration within Groups.} Group A's calibration curve closely follows the diagonal (well calibrated), while Group B's curve deviates — a predicted score of 0.3 actually corresponds to a 40\% positive rate for Group B, meaning scores have different meanings across groups.
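Per-group calibration can be checked with scikit-learn's \texttt{calibration\_curve}. The sketch below uses simulated scores: group A is calibrated by construction, while group B's outcomes occur about ten points more often than its scores claim.

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
n = 5000

# Group A: calibrated scores, P(y=1 | score=s) = s
scores_a = rng.uniform(0, 1, n)
y_a = (rng.uniform(0, 1, n) < scores_a).astype(int)

# Group B: outcomes occur ~10 points more often than the scores claim
scores_b = rng.uniform(0, 1, n)
y_b = (rng.uniform(0, 1, n) < np.clip(scores_b + 0.10, 0, 1)).astype(int)

for name, y, s in (("A", y_a, scores_a), ("B", y_b, scores_b)):
    frac_pos, mean_score = calibration_curve(y, s, n_bins=5)
    print(name, np.round(frac_pos - mean_score, 2))  # ~0 per bin means calibrated
```

Plotting \texttt{frac\_pos} against \texttt{mean\_score} per group reproduces the curves in the figure.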
\orangebox{Did you know that...}
{Probability calibration first became popular in weather forecasting. In the 1950s, meteorologists asked whether a
“70\% chance of rain” really meant that it rained on 7 out of 10 such days. This led to the creation of the Brier Score in 1950, one of the
The smaller the MAE, the closer the model's predictions are to the actual targets.
MAE ranges from 0 (perfect) to +infinity. To interpret it in context, compare MAE to the standard deviation of your target:
MAE $\ll$ std($Y$) suggests a useful model, while MAE $\approx$ std($Y$) means your model is barely better than predicting the mean.
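The comparison is easy to run. In this simulated example (synthetic target with standard deviation around 20, model errors with standard deviation around 5), MAE lands well below std($Y$):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
y_true = rng.normal(loc=100, scale=20, size=1000)  # target with std ~ 20
y_pred = y_true + rng.normal(0, 5, size=1000)      # model errors with std ~ 5

mae = mean_absolute_error(y_true, y_pred)
print(f"MAE = {mae:.1f} vs std(Y) = {y_true.std():.1f}")  # MAE well below std(Y)
```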
\textbf{When to use MAE?}
Use MAE for demand forecasting, inventory planning, or any task where over-predicting by 5 is exactly as bad as under-predicting by 5.
Avoid MAE when large errors are disproportionately costly (use RMSE), when you need scale-free comparison across different targets (use MAPE),
or when you need a differentiable loss function for training (use MSE).
% strength and weakness box
\coloredboxes{
\item Robust to outliers. Unlike MSE, a single bad prediction doesn't dominate the score. If 99 predictions are off by 1 and one is off by 100, MAE = 1.99 while RMSE = 10.0.
\item Directly interpretable. MAE = 5.2 means ``on average, predictions miss by 5.2 units.'' No square roots or percentage conversions needed.
}
{
\item All errors are weighted equally. A model with many small errors gets the same MAE as one with fewer but larger errors. If large errors are costly, use RMSE.
\item Not differentiable at zero, so it cannot be directly used as a loss function in gradient descent. In practice, Huber loss or smooth L1 are used instead.
\textbf{Figure.} On imbalanced data (left), a naive ``always negative'' classifier gets 95\% accuracy but only 50\% balanced accuracy, exposing its failure. On balanced data (right), both metrics agree.
\orangebox{Did you know that...}
{Balanced Accuracy is mathematically equivalent to the macro-averaged recall. In scikit-learn, you can verify this:
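For instance, on a small made-up imbalanced dataset, the two scores coincide exactly:

```python
from sklearn.metrics import balanced_accuracy_score, recall_score

# Toy imbalanced data: 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

bal_acc = balanced_accuracy_score(y_true, y_pred)            # (7/8 + 1/2) / 2
macro_recall = recall_score(y_true, y_pred, average="macro")
print(bal_acc, macro_recall)  # identical: 0.6875
```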
\textbf{Figure.} F-beta vs Recall at fixed Precision=0.8. With $\beta=0.5$ (favors precision), the curve is highest. With $\beta=2$ (favors recall), the score increases more steeply with recall.
\orangebox{Did you know that...}
{Common choices for $\beta$ are: $\beta = 2$ (F2-score), which weights recall twice as much as precision, useful in medical screening where missing a disease
is worse than a false alarm; and $\beta = 0.5$ (F0.5-score), which weights precision twice as much, useful in search engines where irrelevant results are costly.}
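The effect of $\beta$ is easy to see numerically. With toy predictions where precision is $2/3$ and recall is $1/2$, F2 lands near the recall and F0.5 near the precision:

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)          # 2/3
r = recall_score(y_true, y_pred)             # 1/2
f2  = fbeta_score(y_true, y_pred, beta=2)    # pulled toward recall
f05 = fbeta_score(y_true, y_pred, beta=0.5)  # pulled toward precision
print(f"P={p:.3f} R={r:.3f} F2={f2:.3f} F0.5={f05:.3f}")
```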
\subsection{Area Under the Receiver Operating Characteristic Curve}
\item ROC AUC doesn't inform about precision and negative predictive value.
\textbf{Figure.} ROC curves for a good model (AUC $\approx$ 0.95, blue) and a weak model (AUC $\approx$ 0.62, red). The dashed diagonal represents random guessing (AUC = 0.50). The area under each curve is the AUC.
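A figure like this can be reproduced with simulated scores: the further apart the score distributions of the two classes, the higher the AUC. The Gaussian parameters below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = np.array([0] * 500 + [1] * 500)

# Scores drawn from two Gaussians; more separation between classes = higher AUC
good_scores = np.concatenate([rng.normal(0.0, 1, 500), rng.normal(2.5, 1, 500)])
weak_scores = np.concatenate([rng.normal(0.0, 1, 500), rng.normal(0.4, 1, 500)])

auc_good = roc_auc_score(y, good_scores)  # near 1: classes well separated
auc_weak = roc_auc_score(y, weak_scores)  # near 0.5: heavy overlap
print(round(auc_good, 2), round(auc_weak, 2))
```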
\orangebox{Did you know that...}
{The ROC curve was developed during World War II for analyzing radar signals. Radar operators needed to distinguish between enemy aircraft and noise,
leading to the development of signal detection theory and the receiver operating characteristic.}
\subsection{Area Under the Precision-Recall Curve}
\textbf{Figure.} PR curves for a good model (PR AUC $\approx$ 0.85) and a weak model (PR AUC $\approx$ 0.30). A good model maintains high precision even at high recall.
\orangebox{Did you know that...}
{Unlike ROC AUC where a random classifier always scores 0.5, the baseline for PR AUC depends on the class distribution. For a dataset with 10\% positive class,
a random classifier's PR AUC baseline is approximately 0.1, not 0.5. This makes PR AUC particularly sensitive to class imbalance.}
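This baseline behavior is easy to verify by scoring random, uninformative predictions on simulated imbalanced labels (here via average precision, scikit-learn's usual summary of the PR curve):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n = 100_000
y = (rng.uniform(size=n) < 0.10).astype(int)  # ~10% positive class
random_scores = rng.uniform(size=n)           # uninformative classifier

roc = roc_auc_score(y, random_scores)            # ~0.5 regardless of imbalance
ap  = average_precision_score(y, random_scores)  # ~0.1 = positive prevalence
print(round(roc, 2), round(ap, 2))
```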
\textbf{Figure.} The Jaccard Index measures the overlap between two sets. High overlap (left) yields a high score. Low overlap (right) yields a low score.
\orangebox{Did you know that...}
{The Jaccard Index was introduced by the Swiss botanist Paul Jaccard in 1901 to compare the similarity of plant species across different regions.
It is also known as Intersection over Union (IoU) in computer vision, where it is the standard metric for object detection and image segmentation tasks.}
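Both views of the metric, as set overlap and as a classification score on binary vectors, give the same number:

```python
from sklearn.metrics import jaccard_score

# As set overlap: |intersection| / |union|
a, b = {1, 2, 3, 4}, {3, 4, 5, 6}
set_iou = len(a & b) / len(a | b)   # 2 / 6

# Same computation on binary vectors: TP / (TP + FP + FN)
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [0, 0, 1, 1, 1, 1]
vec_iou = jaccard_score(y_true, y_pred)
print(set_iou, vec_iou)  # both 1/3
```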
\subsection{D-squared Log Loss Score}
\textbf{Figure.} D-squared Log Loss Score vs model log loss. D²=1 at zero loss (perfect), D²=0 at the null model's log loss, and negative for worse-than-baseline models.
\orangebox{Did you know that...}
{The D-squared framework generalizes R-squared to any deviance function, not just squared error. In scikit-learn, \texttt{d2\_log\_loss\_score} uses log loss as the deviance,
but you can also compute D-squared with other losses like absolute error or Poisson deviance.}
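The definition can be reproduced directly from \texttt{log\_loss}: D² is one minus the ratio of the model's log loss to the log loss of a null model that always predicts the positive-class prevalence. The probabilities below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])            # 30% positive
p_model = np.array([.8, .7, .4, .3, .2, .1, .2, .1, .1, .1])  # model probabilities
p_null = np.full(len(y_true), y_true.mean())                  # null model: prevalence

d2 = 1 - log_loss(y_true, p_model) / log_loss(y_true, p_null)
print(round(d2, 2))  # positive: the model beats the always-predict-prevalence baseline
```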
\textbf{Figure.} P4 vs F1 across scenarios. When a model ignores true negatives, F1 stays high but P4 drops significantly, revealing the imbalance that F1 misses.
\orangebox{Did you know that...}
{The P4-metric was introduced by Sitarz (2022) as a response to the F1-score's blindness to true negatives. The name P4 comes from the fact that it
considers all four Probabilities: Precision, Recall (or sensitivity), Specificity, and Negative Predictive Value.}
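As the harmonic mean of those four probabilities, P4 collapses whenever any one of them does. A small sketch with a hypothetical confusion matrix that nearly ignores true negatives, mirroring the figure above:

```python
def p4_score(tp, tn, fp, fn):
    """Harmonic mean of precision, recall, specificity, and NPV."""
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)
    specificity = tn / (tn + fp)
    npv         = tn / (tn + fn)
    return 4 / (1 / precision + 1 / recall + 1 / specificity + 1 / npv)

# A model that almost ignores true negatives: F1 looks healthy, P4 does not
tp, tn, fp, fn = 90, 5, 15, 10
f1 = 2 * tp / (2 * tp + fp + fn)
p4 = p4_score(tp, tn, fp, fn)
print(f"F1 = {f1:.2f}, P4 = {p4:.2f}")  # F1 ~ 0.88, P4 ~ 0.43
```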
\textbf{Figure.} Interpretation scale for Cohen's Kappa, from poor agreement (negative) to almost perfect agreement (0.8--1.0).
\orangebox{Did you know that...}
{Cohen's Kappa was introduced by Jacob Cohen in 1960 in his paper \textit{A Coefficient of Agreement for Nominal Scales}. Despite its popularity, Cohen himself
acknowledged its limitations and later introduced weighted kappa (1968) to handle ordinal categories where some disagreements are worse than others.}
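In scikit-learn, Cohen's Kappa is \texttt{cohen\_kappa\_score}. On a made-up pair of raters who agree on 8 of 10 items, with 50\% agreement expected by chance, kappa is $(0.8 - 0.5)/(1 - 0.5) = 0.6$:

```python
from sklearn.metrics import cohen_kappa_score

# Two raters labelling the same ten items
rater1 = ["yes", "yes", "yes", "no", "no", "no", "yes", "no", "yes", "no"]
rater2 = ["yes", "yes", "no",  "no", "no", "no", "yes", "no", "yes", "yes"]

kappa = cohen_kappa_score(rater1, rater2)
print(kappa)  # 0.6: observed agreement 0.8, chance agreement 0.5
```

For ordinal categories, the same function accepts \texttt{weights="linear"} or \texttt{weights="quadratic"} to compute weighted kappa.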
Consider a loan approval scenario with two true classes: $H_1$ (Creditworthy) and $H_2$ (Not Creditworthy), and two decisions: $D_1$ (Approve Loan) and $D_2$ (Reject Loan).
Cost matrix: we define a cost matrix $C$ that heavily penalizes approving a loan for a non-creditworthy customer (a false positive).
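A minimal sketch of the decision rule, with hypothetical costs (rejecting a good customer costs 1, approving a bad one costs 10): multiply the classifier's posterior over the true classes by the cost matrix and pick the decision with the lowest expected cost.

```python
import numpy as np

# Rows = true class (H1 creditworthy, H2 not creditworthy)
# Columns = decision (D1 approve, D2 reject); costs are hypothetical
C = np.array([[0.0,  1.0],   # rejecting a good customer: mild cost
              [10.0, 0.0]])  # approving a bad customer: heavy cost

p = np.array([0.7, 0.3])     # classifier's posteriors P(H1), P(H2) for one applicant

expected_cost = p @ C        # expected cost of each decision
decision = expected_cost.argmin()
print(expected_cost, f"-> decision D{decision + 1}")
# Approving costs 0.3 * 10 = 3.0, rejecting costs 0.7 * 1 = 0.7: reject wins
```

Note that the asymmetric costs lead to rejection even though the applicant is 70\% likely to be creditworthy.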