
Commit 8f62996

edits, updates
1 parent 2168001 commit 8f62996

File tree: 1 file changed (+66, -25 lines)

notebooks/16_machine_learning_algorithms.ipynb

Lines changed: 66 additions & 25 deletions
@@ -14,7 +14,13 @@
 {
 "cell_type": "markdown",
 "id": "ccfadb82-4be0-407d-a54e-2198f75c2bf1",
-"metadata": {},
+"metadata": {
+"editable": true,
+"slideshow": {
+"slide_type": ""
+},
+"tags": []
+},
 "source": [
 "## Introduction\n",
 "**$k$-nearest neighbors** is, for very good reasons, one of the most commonly known machine learning algorithms. It is relatively intuitive and simple, yet still powerful enough to find plenty of use cases even today (despite having much fancier techniques on the market).\n",
@@ -34,40 +40,80 @@
 "\n",
 "---\n",
 "\n",
-"### Pros, Cons, Caveats\n",
+"## Pros, Cons, Caveats\n",
 "Conceptually, the k-nearest neighbors algorithm is rather simple and intuitive. However, there are a few important aspects to consider when applying this algorithm.\n",
 "\n",
-"First of all, k-nearest kneighbors is a distance-based algorithm. This means that we have to ensure that closer really means \"more similar\" which is not as simple as it may sound. We have to decide on a *distance metric*, that is the measure (or function) by which we calculate the distance between data points. We can use common metrics like the Euclidean distance, but there are many different options to choose from.\n",
+"\n",
+"### Caveats\n",
+"Let's consider a situation as in {numref}`fig_knn_caveats`A. Here we see that a change in $k$ can lead to entirely different predictions for certain data points. In general, kNN predictions can be highly unstable close to border regions, and they also tend to be highly sensitive to the local density of data points. The latter can be a problem if we have far more points in one category than in another.\n",
+"\n",
+"```{figure} ../images/fig_knn_caveats.png\n",
+":name: fig_knn_caveats\n",
+"\n",
+"k-nearest neighbors has a few important caveats. **A** its predictions can change with changing $k$, and generally are very density sensitive. **B** it suffers (as many machine learning models) from overconfidence, which simply means that it will confidently output predictions even for data points that are entirely different from the training data (or even physically impossible).\n",
+"```\n",
+"\n",
+"Another common problem with kNN -but also many other models- is called **over-confidence** ({numref}`fig_knn_caveats`B). The algorithm described here creates its predictions on the $k$ closest neighbors. But for very unusual inputs or even entirely impossible inputs, the algorithm will still find $k$ closest neighbors and make a prediction. So if you ask for the shoe size of a person of 6.20m and 840 kg, your model might confidently answer your question and say: 48 (if nothing bigger occurred in the data). So much for the \"intelligent\" in *artificial intelligence* ...\n",
+"\n",
+"Finally, k-nearest neighbors is a distance-based algorithm. This means that we have to ensure that closer really means \"more similar\" which is not as simple as it may sound. We have to decide on a *distance metric*, that is, the measure (or function) by which we calculate the distance between data points. We can use common metrics like the Euclidean distance, but there are many different options to choose from.\n",
 "Even more critical is the proper *scaling* of our features. Just think of an example. We want to predict the shoe size of a person based on the person's height (measured in $m$) and weight (measured in $kg$). This means that we have two features here: height and weight. For a prediction on a new person, we simply need his/her height and weight. Then k-NN will compare those values to all known (\"learned\") data points in our model and find the closest $k$ other people. If we now use the Euclidean distance, the distance $d$ will simply be\n",
 "\n",
 "$$\n",
 " d = \\sqrt{(w_1 - w_2) ^ 2 + (h_1 - h_2) ^ 2}\n",
 "$$\n",
-"where $w$ and $h$ are the weights and heights of person 1 and 2.\n",
+"where $w$ and $h$ are the weights and heights of person 1 and 2. Let's say we have person-1 with 1.73m and 81kg and person-2 with 1.89m and 79kg.\n",
 "\n",
-"Try to answer the following question: What is the problem here?\n",
+"Take a moment to answer the following question: What is the problem here?\n",
 "\n",
 "...?"
 ]
 },
 {
 "cell_type": "markdown",
-"id": "54f4d3ba-3770-44c2-98be-c554df69cd96",
-"metadata": {},
+"id": "be9b139e-42f1-4a51-aeef-c2e65757b906",
+"metadata": {
+"editable": true,
+"slideshow": {
+"slide_type": ""
+},
+"tags": [
+"toggle"
+]
+},
 "source": [
-"Ok. The issue here is, that the weights are in kilograms ($kg$), so we are talking about values like 50, 60, 80, 100. The height, however, is measured in meters ($m$) such that values are many times smaller. As a result, having two people differ one meter in height (which is a lot) will count no more than one kilogram difference (which is close to nothing). Clearly not what we intuitively mean by \"nearest neighbors\"!\n",
+"Ok. The issue here is, that the weights are in kilograms ($kg$), so we are talking about values like 50, 60, 80, 100. The height, however, is measured in meters ($m$) such that values are many times smaller. As a result, having two people differ one meter in height (which is a lot) will count no more than one kilogram difference (which is close to nothing). Clearly not what we intuitively mean by \"nearest neighbors\"!"
+]
+},
+{
+"cell_type": "markdown",
+"id": "fc59bfe8-681d-4a38-a2fa-71e540558e57",
+"metadata": {
+"editable": true,
+"slideshow": {
+"slide_type": ""
+},
+"tags": []
+},
+"source": [
+"### Data Scaling\n",
 "\n",
 "The solution to this is a proper **scaling** of our data. Often, we will simply apply one of the following two scaling methods:\n",
-"1. MinMax Scaling - this means we linearly rescale our data such that the lowest occurring value becomes 0 and the highest value becomes 1.\n",
-"2. Standard Scaling - here we rescale our data such that the mean value will be 0 and the standard deviation will be 1.\n",
+"1. **MinMax Scaling** - We linearly rescale our data such that the lowest occurring value becomes 0 and the highest value becomes 1.\n",
+"2. **Standard Scaling** - We rescale our data such that the mean value will be 0 and the standard deviation will be 1.\n",
 "\n",
-"Both methods might give you values that look awkward at first. Standard scaling, for instance, gives both positive and negative values so that our height values in the example could be -1.04 or +0.27. But don't worry, the scaling is really only meant to be used for the machine learning algorithm itself."
+"Both methods might produce values that look awkward at first. Standard scaling, for instance, gives positive and negative values so that our height values in the example could be -1.04 or +0.27. But don't worry, the scaling is only meant to be used for the machine learning algorithm itself. For manual inspection or visualizations of the data we would still use the data in its original scaling."
 ]
 },
 {
 "cell_type": "markdown",
 "id": "97cf5881-c242-4d4b-8dd9-e5fdc7723f1a",
-"metadata": {},
+"metadata": {
+"editable": true,
+"slideshow": {
+"slide_type": ""
+},
+"tags": []
+},
 "source": [
 "Once we scaled our data, and maybe also picked the right distance metric (or used a good default, which will do for a start), we are technically good to apply k-NN.\n",
 "\n",
@@ -77,24 +123,19 @@
 "This is the model's main parameter and we are free to choose any value we like. And there is no simple best choice that always works. In practice, the choice of $k$ will depend on the number of data points we have, but also on the distribution of data and the number of classes or parameter ranges. We usually want to pick odd values here to avoid draws as much as possible (imagine two nearest neighbors are \"spam\" and two are \"no-spam\"). But whether 3, 5, 7, or 13 is the best choice will depend on our specific task at hand. \n",
 "\n",
 "\n",
-"In machine learning, we call such a thing a **hyperparameter** (or: fitting parameter). These are parameters that are not *learned* by the model, but have to be defined by us. We are free to change its value, and it might have a considerable impact on the quality of our predictions, or our \"model performance\". Ideally, we would compare several different models with different parameters and pick the one that performed best. We will later see, that machine learning models often have many hyperparameters. Luckily, not all of them are equally sensitive so they will often already be set to more-or-less ok-ish default values in libraries such as Scikit-Learn.\n",
-"\n",
-"#### Caveats\n",
-"Let's consider a situation as in {numref}`fig_knn_caveats`A. Here we see that a change in $k$ can lead to entirely different predictions for certain data points. In general, kNN predictions can be highly unstable close to border regions, and they also tend to be highly sensitive to the local density of data points. The latter can be a problem if we have far more points in one category than in another.\n",
-"\n",
-"```{figure} ../images/fig_knn_caveats.png\n",
-":name: fig_knn_caveats\n",
-"\n",
-"k-nearest neighbors has a few important caveats. **A** its predictions can change with changing $k$, and generally are very density sensitive. **B** it suffers (as many machine learning models) from overconfidence, which simply means that it will confidently output predictions even for data points that are entirely different from the training data (or even physically impossible).\n",
-"```\n",
-"\n",
-"Finally, another common problem with kNN -but also many other models- is called **over-confidence** ({numref}`fig_knn_caveats`B). The algorithm described here creates its predictions on the $k$ closest neighbors. But for very unusual inputs or even entirely impossible inputs, the algorithm will still find $k$ closest neighbors and make a prediction. So if you ask for the shoe size of a person of 6.20m and 840 kg, your model might confidently answer your question and say: 48 (if nothing bigger occurred in the data). So much for the \"intelligent\" in *artificial intelligence* ..."
+"In machine learning, we call such a thing a **hyperparameter** (or: fitting parameter). These are parameters that are not *learned* by the model, but have to be defined by us. We are free to change its value, and it might have a considerable impact on the quality of our predictions, or our \"model performance\". Ideally, we would compare several different models with different parameters and pick the one that performed best. We will later see, that machine learning models often have many hyperparameters. Luckily, not all of them are equally sensitive so they will often already be set to more-or-less ok-ish default values in libraries such as Scikit-Learn."
 ]
 },
 {
 "cell_type": "markdown",
 "id": "f8dc2519-24d0-482b-ad15-71ba3671cb79",
-"metadata": {},
+"metadata": {
+"editable": true,
+"slideshow": {
+"slide_type": ""
+},
+"tags": []
+},
 "source": [
 "In summary, k-NN has a number of Pros and Cons:\n",
 "\n",

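The over-confidence caveat moved into the new "Caveats" cell can also be made concrete with a small sketch (not part of the commit; the training data is invented and the regressor choice is an assumption): a kNN model still returns a shoe size for a physically impossible person of 6.20 m and 840 kg.

```python
# Sketch of kNN over-confidence: the model finds k nearest neighbors for any
# input, even one far outside the training data, and gives no warning.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# features: height in m, weight in kg; target: shoe size (EU) -- invented data
X = np.array([[1.60, 55], [1.70, 68], [1.75, 80], [1.83, 90], [1.95, 102]])
y = np.array([37, 41, 43, 45, 48])

model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3))
model.fit(X, y)

# A physically impossible query still gets a confident answer, roughly the
# average of the three largest shoe sizes seen in training.
print(model.predict([[6.20, 840.0]]))
```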