Commit a6d28e3

text edits

1 parent 9c3325d commit a6d28e3

File tree

1 file changed: +40 -44 lines changed


notebooks/19_machine_learning_techniques.ipynb

Lines changed: 40 additions & 44 deletions
@@ -79,43 +79,42 @@
7979
"\n",
8080
"Precision and recall are metrics that provide more insight into the accuracy of positive predictions and the classifier's ability to recover all relevant instances, respectively.\n",
8181
"\n",
82-
"- **Precision (Positive Predictive Value)**: The ratio of correctly predicted positive observations to the total predicted positives.\n",
82+
"**Precision (Positive Predictive Value)**: \n",
83+
"The ratio of correctly predicted positive observations to the total predicted positives.\n",
8384
"\n",
84-
" $$\n",
85-
" Precision = \\frac{TP}{TP + FP}\n",
86-
" $$\n",
87-
" \n",
88-
" High precision indicates a low false positive rate, which is crucial in scenarios where the cost of a false positive is high, such as in email spam classification.\n",
85+
"$$\n",
86+
"Precision = \\frac{TP}{TP + FP}\n",
87+
"$$\n",
88+
"\n",
89+
"High precision indicates a low false positive rate, which is crucial in scenarios where the cost of a false positive is high, such as in email spam classification.\n",
8990
"\n",
90-
"- **Recall (Sensitivity, True Positive Rate)**: The ratio of correctly predicted positive observations to all observations in actual class.\n",
91+
"**Recall (Sensitivity, True Positive Rate)**: \n",
92+
"The ratio of correctly predicted positive observations to all observations in actual class.\n",
9193
"\n",
92-
" $$\n",
93-
" Recall = \\frac{TP}{TP + FN}\n",
94-
" $$\n",
94+
"$$\n",
95+
"Recall = \\frac{TP}{TP + FN}\n",
96+
"$$\n",
9597
"\n",
96-
" High recall is vital in medical scenarios or fraud detection, where failing to detect an anomaly can have severe consequences.\n",
98+
"High recall is vital in medical scenarios or fraud detection, where failing to detect an anomaly can have severe consequences.\n",
9799
"\n",
98100
"#### 3. F1 Score\n",
99101
"\n",
100102
"The F1 Score is the weighted average of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. It is particularly useful when the class distribution is uneven (biased data).\n",
101103
"\n",
102-
"- Formula:\n",
103104
"\n",
104-
" $$\n",
105-
" F_1 = 2 \\cdot \\frac{Precision \\cdot Recall}{Precision + Recall}\n",
106-
" $$\n",
105+
"$$\n",
106+
"F_1 = 2 \\cdot \\frac{Precision \\cdot Recall}{Precision + Recall}\n",
107+
"$$\n",
107108
" \n",
108-
" The F1 score is an excellent measure to use if you need to seek a balance between Precision and Recall and there is an uneven class distribution (as in the case of your fraudulent vs. non-fraudulent calls scenario).\n",
109+
"The F1 score is an excellent measure to use if you need to seek a balance between Precision and Recall and there is an uneven class distribution (as in the case of your fraudulent vs. non-fraudulent calls scenario).\n",
109110
"\n",
110111
"#### 4. Accuracy\n",
111112
"\n",
112113
"As previously discussed, accuracy is the ratio of correctly predicted observations to the total observations and can be misleading in the presence of an imbalanced dataset.\n",
113114
"\n",
114-
"- Formula:\n",
115-
"\n",
116-
" $$\n",
117-
" Accuracy = \\frac{TP + TN}{TP + TN + FP + FN}\n",
118-
" $$\n",
115+
"$$\n",
116+
"Accuracy = \\frac{TP + TN}{TP + TN + FP + FN}\n",
117+
"$$\n",
119118
"\n",
120119
"\n",
121120
"### Detailed Model Evaluation Metrics (Regression)\n",
@@ -125,25 +124,21 @@
125124
"\n",
126125
"MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation where all individual differences have equal weight.\n",
127126
"\n",
128-
"- Formula:\n",
127+
"$$\n",
128+
"\\text{MAE} = \\frac{1}{n} \\sum_{i=1}^n |y_i - \\hat{y}_i|\n",
129+
"$$\n",
129130
"\n",
130-
" $$\n",
131-
" \\text{MAE} = \\frac{1}{n} \\sum_{i=1}^n |y_i - \\hat{y}_i|\n",
132-
" $$\n",
133-
"\n",
134-
" where $y_i$ are the actual values and $\\hat{y}_i$ are the predicted values.\n",
131+
"where $y_i$ are the actual values and $\\hat{y}_i$ are the predicted values.\n",
135132
"\n",
136133
"#### 2. Mean Squared Error (MSE)\n",
137134
"\n",
138135
"MSE is like MAE but squares the difference before summing them all instead of using the absolute value. This has the effect of heavily penalizing larger errors.\n",
139136
"\n",
140-
"- Formula:\n",
141-
"\n",
142-
" $$\n",
143-
" \\text{MSE} = \\frac{1}{n} \\sum_{i=1}^n (y_i - \\hat{y}_i)^2\n",
144-
" $$\n",
137+
"$$\n",
138+
"\\text{MSE} = \\frac{1}{n} \\sum_{i=1}^n (y_i - \\hat{y}_i)^2\n",
139+
"$$\n",
145140
"\n",
146-
" MSE is more sensitive to outliers than MAE and tends to emphasize larger differences.\n",
141+
"MSE is more sensitive to outliers than MAE and tends to emphasize larger differences.\n",
147142
"\n",
148143
"\n",
149144
"There are many more metrics available, such as `RMSE` or `R-squared`, but we will not cover them in this course.\n",
@@ -184,7 +179,6 @@
184179
" X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)\n",
185180
" ```\n",
186181
"\n",
187-
"\n",
188182
"**2. Train-Validation-Test Split**\n",
189183
"When tuning hyperparameters or making decisions about model architecture, it’s crucial to have a third split: the validation set.\n",
190184
"\n",
@@ -193,19 +187,20 @@
193187
"\n",
194188
"### Cross-Validation: Enhancing Model Validation\n",
195189
"\n",
196-
"**Cross-validation** is a robust method for estimating the effectiveness of your model which is especially useful when dealing with limited data.\n",
190+
"Cross-validation is a robust method for estimating the effectiveness of your model, which is especially useful when dealing with limited data.\n",
197191
"\n",
198-
"- **K-Fold Cross-Validation**: The data set is divided into 'k' smaller sets. The model is trained on 'k-1' of these folds, with the remaining part used as the test fold. This process is repeated 'k' times with each of the 'k' folds used exactly once as the test set.\n",
192+
"A **k-Fold Cross-Validation** means that the dataset is divided into 'k' smaller sets. The model is trained on 'k-1' of these folds, with the remaining part used as the test fold. This process is repeated 'k' times, with each of the 'k' folds used exactly once as the test set.\n",
199193
"\n",
200-
" ```python\n",
201-
" from sklearn.model_selection import cross_val_score, KFold\n",
202-
" kf = KFold(n_splits=5, random_state=42, shuffle=True)\n",
203-
" scores = cross_val_score(model, X, y, cv=kf)\n",
204-
" average_score = scores.mean()\n",
205-
" ```\n",
194+
"Using Scikit-Learn this can be realized like this:\n",
195+
"```python\n",
196+
"from sklearn.model_selection import cross_val_score, KFold\n",
206197
"\n",
198+
"kf = KFold(n_splits=5, random_state=42, shuffle=True)\n",
199+
"scores = cross_val_score(model, X, y, cv=kf)\n",
200+
"average_score = scores.mean()\n",
201+
"```\n",
207202
"\n",
208-
"- **Stratified K-Fold**: A variation of k-fold which is used when one has imbalanced classes. It ensures that each fold of the dataset has the same proportion of examples in each class as the complete set.\n",
203+
"A variation of a k-fold is the **stratified K-Fold** that is used for highly imbalanced classes. It ensures that each fold of the dataset has the same proportion of examples in each class as the complete set.\n",
209204
"\n",
210205
"\n",
211206
"```{figure} ../images/fig_cross_validation_training.png\n",
@@ -217,6 +212,7 @@
217212
"### Advanced Techniques for Unbalanced Data\n",
218213
"\n",
219214
"When dealing with imbalanced datasets, traditional training and testing strategies may not suffice, as they might lead to models biased towards the majority class.\n",
215+
"Common strategies to work with highly imbalanced data are:\n",
220216
"\n",
221217
"- **Oversampling the Minority Class**: Increasing the number of instances in the minority class by duplicating them to prevent the model from being biased toward the majority class.\n",
222218
"- **Undersampling the Majority Class**: Reducing the number of instances in the majority class to balance the dataset.\n",
@@ -230,7 +226,7 @@
230226
"\n",
231227
"\n",
232228
"### Beyond this course:\n",
233-
"Here, we did the sampling, i.e. the selection of data points for either training or testing, fully randomly. In many real world cases, however, the situation can be more complex. For instance, we might have many medical observations in a dataset with sometimes multiple entries for the same patient. In such a case we would have to split the data by patient and not fully randomly (see for instance here: {cite}`tougui2021impact`). In other cases, we will have strong biases and we would often like to compensate for this. For example to avoid that -purely by chance- a certain population or class is not well represented in a certain data split. The process to compensate for this is called **stratification**."
229+
"Here, we did the sampling, that is the selection of data points, for either training or testing, fully randomly. In many real-world cases, however, the situation can be more complex. For instance, we might have many medical observations in a dataset with sometimes multiple entries for the same patient. In such a case we would have to split the data by patient and not fully randomly (see, for instance, here: {cite}`tougui2021impact`). In other cases, we will have strong biases and we would often like to compensate for this. For example, to avoid that -purely by chance- a certain population or class is not well represented in a certain data split. The process to compensate for this is called **stratification**."
234230
]
235231
},
236232
{
