\textbf{Figure. Equality of Opportunity.} \underline{Left:} All groups have similar TPR ($\sim$80\%), meaning qualified individuals are equally likely to be identified.
\underline{Right:} Group A has 90\% TPR while Group C has only 45\%, meaning qualified individuals in Group C are frequently overlooked.
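The audit behind this figure can be sketched in a few lines: compute the TPR separately per group and compare. The labels, predictions, and group assignments below are synthetic, purely for illustration.

```python
import numpy as np

def true_positive_rate(y_true, y_pred):
    """TPR = TP / (TP + FN): the share of actual positives the model identifies."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_pred[y_true == 1] == 1).mean()

# Synthetic labels, predictions, and group membership
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0])
group  = np.array(["A"] * 6 + ["C"] * 6)

tpr = {g: true_positive_rate(y_true[group == g], y_pred[group == g])
       for g in ("A", "C")}
print(tpr)  # Group A finds 3 of 4 qualified individuals, Group C only 1 of 4
```

A large gap between the per-group TPRs, as here, signals a violation of Equality of Opportunity.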
\orangebox{Did you know that...}
{Equality of Opportunity was popularized in the 2016 paper “Equality of Opportunity in Supervised Learning” by Hardt, Price, and Srebro.
In that paper, they introduced both Equality of Opportunity and its stricter sibling, Equalized Odds. The terms have since become standard
\subsection{Equality of Odds}
\textbf{Figure. Equality of Odds.} \underline{Left:} Both TPR and FPR are consistent across groups (fair). \underline{Right:} Group A has high TPR but low FPR, while Group C has low TPR and high FPR — the model performs well only for Group A.
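Checking Equality of Odds means comparing both rates per group. A minimal sketch on synthetic data (both groups share the same label distribution, only the model's behavior differs):

```python
import numpy as np

def tpr_fpr(y_true, y_pred):
    """Return (TPR, FPR) for binary labels and predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = (y_pred[y_true == 1] == 1).mean()
    fpr = (y_pred[y_true == 0] == 1).mean()
    return tpr, fpr

# Synthetic data: identical labels in both groups, different model behavior
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0] * 2)
y_pred = np.array([1, 1, 1, 1, 0, 0, 0, 1,    # group A
                   1, 0, 0, 0, 1, 1, 0, 0])   # group C
group  = np.array(["A"] * 8 + ["C"] * 8)

for g in ("A", "C"):
    tpr, fpr = tpr_fpr(y_true[group == g], y_pred[group == g])
    print(f"{g}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Equality of Odds requires both TPR and FPR to match across groups; here group A dominates on both rates.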
\orangebox{Did you know that...}
{Equality of Odds was popularized in the 2016 paper “Equality of Opportunity in Supervised Learning” by Hardt, Price, and Srebro.
In that paper, they introduced both Equality of Odds and its less strict sibling, Equality of Opportunity. The terms have since become standard
\textbf{Figure. Predictive Parity.} \underline{Left:} All groups have similar PPV ($\sim$74\%), meaning a positive prediction is equally trustworthy across groups.
\underline{Right:} A positive prediction for Group A is correct 85\% of the time but only 40\% for Group C.
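Predictive parity is audited by computing the PPV per group. A small sketch on synthetic labels and predictions:

```python
import numpy as np

def ppv(y_true, y_pred):
    """PPV = TP / (TP + FP): how often a positive prediction is correct."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_true[y_pred == 1] == 1).mean()

# Synthetic data: both groups receive four positive predictions each
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0,   # group A
                   1, 0, 0, 0, 1, 0, 1, 0])  # group C
y_pred = np.array([1, 1, 1, 1, 0, 0, 0, 0,
                   1, 1, 1, 1, 0, 0, 0, 0])
group  = np.array(["A"] * 8 + ["C"] * 8)

ppv_by_group = {g: ppv(y_true[group == g], y_pred[group == g]) for g in ("A", "C")}
print(ppv_by_group)  # a positive prediction is right 75% of the time for A, 25% for C
```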
\orangebox{Did you know that...}
{Predictive parity gained attention during the debate around the COMPAS recidivism tool. ProPublica’s 2016 investigation argued that COMPAS was
unfair because it did not satisfy equalized odds, while its developers countered that the tool did satisfy predictive parity, illustrating how
\subsection{Calibration within Groups}
\textbf{Figure. Calibration within Groups.} Group A's calibration curve closely follows the diagonal (well calibrated), while Group B's curve deviates — a predicted score of 0.3 actually corresponds to a 40\% positive rate for Group B, meaning scores have different meanings across groups.
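Per-group calibration can be checked with scikit-learn's \texttt{calibration\_curve}. The sketch below uses simulated scores: group A is calibrated by construction, while group B's outcomes occur about ten points more often than its scores claim.

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
n = 5000

# Group A: calibrated scores, P(y=1 | score=s) = s
scores_a = rng.uniform(0, 1, n)
y_a = (rng.uniform(0, 1, n) < scores_a).astype(int)

# Group B: outcomes occur ~10 points more often than the scores claim
scores_b = rng.uniform(0, 1, n)
y_b = (rng.uniform(0, 1, n) < np.clip(scores_b + 0.10, 0, 1)).astype(int)

for name, y, s in (("A", y_a, scores_a), ("B", y_b, scores_b)):
    frac_pos, mean_score = calibration_curve(y, s, n_bins=5)
    print(name, np.round(frac_pos - mean_score, 2))  # ~0 per bin means calibrated
```

Plotting \texttt{frac\_pos} against \texttt{mean\_score} per group reproduces the curves in the figure.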
\orangebox{Did you know that...}
{Probability calibration first became popular in weather forecasting. In the 1950s, meteorologists asked whether a
“70\% chance of rain” really meant that it rained on 7 out of 10 such days. This led to the creation of the Brier Score in 1950, one of the
The smaller the MAE, the closer the model's predictions are to the actual targets.
MAE ranges from 0 (perfect) to +infinity. To interpret it in context, compare MAE to the standard deviation of your target:
MAE $\ll$ std($Y$) suggests a useful model, while MAE $\approx$ std($Y$) means your model is barely better than predicting the mean.
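The comparison is easy to run. In this simulated example (synthetic target with standard deviation around 20, model errors with standard deviation around 5), MAE lands well below std($Y$):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
y_true = rng.normal(loc=100, scale=20, size=1000)  # target with std ~ 20
y_pred = y_true + rng.normal(0, 5, size=1000)      # model errors with std ~ 5

mae = mean_absolute_error(y_true, y_pred)
print(f"MAE = {mae:.1f} vs std(Y) = {y_true.std():.1f}")  # MAE well below std(Y)
```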
\textbf{When to use MAE?}
Use MAE for demand forecasting, inventory planning, or any task where over-predicting by 5 is exactly as bad as under-predicting by 5.
Avoid MAE when large errors are disproportionately costly (use RMSE), when you need scale-free comparison across different targets (use MAPE),
or when you need a differentiable loss function for training (use MSE).
% strength and weakness box
\coloredboxes{
\item Robust to outliers. Unlike MSE, a single bad prediction doesn't dominate the score. If 99 predictions are off by 1 and one is off by 100, MAE = 1.99 while RMSE = 10.0.
\item Directly interpretable. MAE = 5.2 means ``on average, predictions miss by 5.2 units.'' No square roots or percentage conversions needed.
}
{
\item All errors are weighted equally. A model with many small errors gets the same MAE as one with fewer but larger errors. If large errors are costly, use RMSE.
\item Not differentiable at zero, so it cannot be directly used as a loss function in gradient descent. In practice, Huber loss or smooth L1 are used instead.
\textbf{Figure.} On imbalanced data (left), a naive ``always negative'' classifier gets 95\% accuracy but only 50\% balanced accuracy, exposing its failure. On balanced data (right), both metrics agree.
\orangebox{Did you know that...}
{Balanced Accuracy is mathematically equivalent to the macro-averaged recall. In scikit-learn, you can verify this:
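For instance, on a small made-up imbalanced dataset, the two scores coincide exactly:

```python
from sklearn.metrics import balanced_accuracy_score, recall_score

# Toy imbalanced data: 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

bal_acc = balanced_accuracy_score(y_true, y_pred)            # (7/8 + 1/2) / 2
macro_recall = recall_score(y_true, y_pred, average="macro")
print(bal_acc, macro_recall)  # identical: 0.6875
```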
\textbf{Figure.} F-beta vs Recall at fixed Precision=0.8. With $\beta=0.5$ (favors precision), the curve is highest. With $\beta=2$ (favors recall), the score increases more steeply with recall.
\orangebox{Did you know that...}
{Common choices for $\beta$ are: $\beta = 2$ (F2-score), which weights recall twice as much as precision, useful in medical screening where missing a disease
is worse than a false alarm; and $\beta = 0.5$ (F0.5-score), which weights precision twice as much, useful in search engines where irrelevant results are costly.}
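The effect of $\beta$ is easy to see numerically. With toy predictions where precision is $2/3$ and recall is $1/2$, F2 lands near the recall and F0.5 near the precision:

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)          # 2/3
r = recall_score(y_true, y_pred)             # 1/2
f2  = fbeta_score(y_true, y_pred, beta=2)    # pulled toward recall
f05 = fbeta_score(y_true, y_pred, beta=0.5)  # pulled toward precision
print(f"P={p:.3f} R={r:.3f} F2={f2:.3f} F0.5={f05:.3f}")
```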
\subsection{Area Under the Receiver Operating Characteristic Curve}
\item ROC AUC doesn't inform about precision and negative predictive value.
\textbf{Figure.} ROC curves for a good model (AUC $\approx$ 0.95, blue) and a weak model (AUC $\approx$ 0.62, red). The dashed diagonal represents random guessing (AUC = 0.50). The area under each curve is the AUC.
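A figure like this can be reproduced with simulated scores: the further apart the score distributions of the two classes, the higher the AUC. The Gaussian parameters below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = np.array([0] * 500 + [1] * 500)

# Scores drawn from two Gaussians; more separation between classes = higher AUC
good_scores = np.concatenate([rng.normal(0.0, 1, 500), rng.normal(2.5, 1, 500)])
weak_scores = np.concatenate([rng.normal(0.0, 1, 500), rng.normal(0.4, 1, 500)])

auc_good = roc_auc_score(y, good_scores)  # near 1: classes well separated
auc_weak = roc_auc_score(y, weak_scores)  # near 0.5: heavy overlap
print(round(auc_good, 2), round(auc_weak, 2))
```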
\orangebox{Did you know that...}
{The ROC curve was developed during World War II for analyzing radar signals. Radar operators needed to distinguish between enemy aircraft and noise,
leading to the development of signal detection theory and the receiver operating characteristic.}
\subsection{Area Under the Precision-Recall Curve}
\textbf{Figure.} PR curves for a good model (PR AUC $\approx$ 0.85) and a weak model (PR AUC $\approx$ 0.30). A good model maintains high precision even at high recall.
\orangebox{Did you know that...}
{Unlike ROC AUC where a random classifier always scores 0.5, the baseline for PR AUC depends on the class distribution. For a dataset with 10\% positive class,
a random classifier's PR AUC baseline is approximately 0.1, not 0.5. This makes PR AUC particularly sensitive to class imbalance.}
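This baseline behavior is easy to verify by scoring random, uninformative predictions on simulated imbalanced labels (here via average precision, scikit-learn's usual summary of the PR curve):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n = 100_000
y = (rng.uniform(size=n) < 0.10).astype(int)  # ~10% positive class
random_scores = rng.uniform(size=n)           # uninformative classifier

roc = roc_auc_score(y, random_scores)            # ~0.5 regardless of imbalance
ap  = average_precision_score(y, random_scores)  # ~0.1 = positive prevalence
print(round(roc, 2), round(ap, 2))
```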
\textbf{Figure.} The Jaccard Index measures the overlap between two sets. High overlap (left) yields a high score. Low overlap (right) yields a low score.
\orangebox{Did you know that...}
{The Jaccard Index was introduced by the Swiss botanist Paul Jaccard in 1901 to compare the similarity of plant species across different regions.
It is also known as Intersection over Union (IoU) in computer vision, where it is the standard metric for object detection and image segmentation tasks.}
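Both views of the metric, as set overlap and as a classification score on binary vectors, give the same number:

```python
from sklearn.metrics import jaccard_score

# As set overlap: |intersection| / |union|
a, b = {1, 2, 3, 4}, {3, 4, 5, 6}
set_iou = len(a & b) / len(a | b)   # 2 / 6

# Same computation on binary vectors: TP / (TP + FP + FN)
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [0, 0, 1, 1, 1, 1]
vec_iou = jaccard_score(y_true, y_pred)
print(set_iou, vec_iou)  # both 1/3
```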
\subsection{D-squared Log Loss Score}
\textbf{Figure.} D-squared Log Loss Score vs model log loss. D²=1 at zero loss (perfect), D²=0 at the null model's log loss, and negative for worse-than-baseline models.
\orangebox{Did you know that...}
{The D-squared framework generalizes R-squared to any deviance function, not just squared error. In scikit-learn, \texttt{d2\_log\_loss\_score} uses log loss as the deviance,
but you can also compute D-squared with other losses like absolute error or Poisson deviance.}
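The definition can be reproduced directly from \texttt{log\_loss}: D² is one minus the ratio of the model's log loss to the log loss of a null model that always predicts the positive-class prevalence. The probabilities below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])            # 30% positive
p_model = np.array([.8, .7, .4, .3, .2, .1, .2, .1, .1, .1])  # model probabilities
p_null = np.full(len(y_true), y_true.mean())                  # null model: prevalence

d2 = 1 - log_loss(y_true, p_model) / log_loss(y_true, p_null)
print(round(d2, 2))  # positive: the model beats the always-predict-prevalence baseline
```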
\textbf{Figure.} P4 vs F1 across scenarios. When a model ignores true negatives, F1 stays high but P4 drops significantly, revealing the imbalance that F1 misses.
\orangebox{Did you know that...}
{The P4-metric was introduced by Sitarz (2022) as a response to the F1-score's blindness to true negatives. The name P4 comes from the fact that it
considers all four Probabilities: Precision, Recall (or sensitivity), Specificity, and Negative Predictive Value.}
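As the harmonic mean of those four probabilities, P4 collapses whenever any one of them does. A small sketch with a hypothetical confusion matrix that nearly ignores true negatives, mirroring the figure above:

```python
def p4_score(tp, tn, fp, fn):
    """Harmonic mean of precision, recall, specificity, and NPV."""
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)
    specificity = tn / (tn + fp)
    npv         = tn / (tn + fn)
    return 4 / (1 / precision + 1 / recall + 1 / specificity + 1 / npv)

# A model that almost ignores true negatives: F1 looks healthy, P4 does not
tp, tn, fp, fn = 90, 5, 15, 10
f1 = 2 * tp / (2 * tp + fp + fn)
p4 = p4_score(tp, tn, fp, fn)
print(f"F1 = {f1:.2f}, P4 = {p4:.2f}")  # F1 ~ 0.88, P4 ~ 0.43
```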
\textbf{Figure.} Interpretation scale for Cohen's Kappa, from poor agreement (negative) to almost perfect agreement (0.8--1.0).
\orangebox{Did you know that...}
{Cohen's Kappa was introduced by Jacob Cohen in 1960 in his paper \textit{A Coefficient of Agreement for Nominal Scales}. Despite its popularity, Cohen himself
acknowledged its limitations and later introduced weighted kappa (1968) to handle ordinal categories where some disagreements are worse than others.}
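In scikit-learn, Cohen's Kappa is \texttt{cohen\_kappa\_score}. On a made-up pair of raters who agree on 8 of 10 items, with 50\% agreement expected by chance, kappa is $(0.8 - 0.5)/(1 - 0.5) = 0.6$:

```python
from sklearn.metrics import cohen_kappa_score

# Two raters labelling the same ten items
rater1 = ["yes", "yes", "yes", "no", "no", "no", "yes", "no", "yes", "no"]
rater2 = ["yes", "yes", "no",  "no", "no", "no", "yes", "no", "yes", "yes"]

kappa = cohen_kappa_score(rater1, rater2)
print(kappa)  # 0.6: observed agreement 0.8, chance agreement 0.5
```

For ordinal categories, the same function accepts \texttt{weights="linear"} or \texttt{weights="quadratic"} to compute weighted kappa.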
Consider a loan approval scenario with two true classes: $H_1$ (Creditworthy) and $H_2$ (Not Creditworthy), and two decisions: $D_1$ (Approve Loan) and $D_2$ (Reject Loan).
Cost matrix: we define a cost matrix $C$ that heavily penalizes approving a loan for a non-creditworthy customer (a false positive).
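A minimal sketch of the decision rule, with hypothetical costs (rejecting a good customer costs 1, approving a bad one costs 10): multiply the classifier's posterior over the true classes by the cost matrix and pick the decision with the lowest expected cost.

```python
import numpy as np

# Rows = true class (H1 creditworthy, H2 not creditworthy)
# Columns = decision (D1 approve, D2 reject); costs are hypothetical
C = np.array([[0.0,  1.0],   # rejecting a good customer: mild cost
              [10.0, 0.0]])  # approving a bad customer: heavy cost

p = np.array([0.7, 0.3])     # classifier's posteriors P(H1), P(H2) for one applicant

expected_cost = p @ C        # expected cost of each decision
decision = expected_cost.argmin()
print(expected_cost, f"-> decision D{decision + 1}")
# Approving costs 0.3 * 10 = 3.0, rejecting costs 0.7 * 1 = 0.7: reject wins
```

Note that the asymmetric costs lead to rejection even though the applicant is 70\% likely to be creditworthy.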