
Commit 73e12fb

Author: ArturoAmorQ
Message: Update notebooks
Parent: edbfca2

File tree

3 files changed: +54 −8 lines changed


notebooks/03_categorical_pipeline_visualization.ipynb

Lines changed: 40 additions & 4 deletions
@@ -166,14 +166,15 @@
     "        (\"preprocessor\", preprocessor),\n",
     "        (\"classifier\", LogisticRegression()),\n",
     "    ]\n",
-    ")"
+    ")\n",
+    "model"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Let's visualize it!"
+    "Let's fit it!"
    ]
   },
   {
@@ -182,14 +183,49 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "model"
+    "model.fit(data, target)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Notice that the diagram changes color once the estimator is fit.\n",
+    "\n",
+    "So far we used `Pipeline` and `ColumnTransformer`, which allow us to\n",
+    "customize the names of the steps in the pipeline. An alternative is to use\n",
+    "`make_column_transformer` and `make_pipeline`, which do not require, and do\n",
+    "not permit, naming the estimators. Instead, their names are set automatically\n",
+    "to the lowercase of their class names."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn.compose import make_column_transformer\n",
+    "from sklearn.pipeline import make_pipeline\n",
+    "\n",
+    "numeric_transformer = make_pipeline(\n",
+    "    SimpleImputer(strategy=\"median\"), StandardScaler()\n",
+    ")\n",
+    "categorical_transformer = OneHotEncoder(handle_unknown=\"ignore\")\n",
+    "\n",
+    "preprocessor = make_column_transformer(\n",
+    "    (numeric_transformer, numeric_features),\n",
+    "    (categorical_transformer, categorical_features),\n",
+    ")\n",
+    "model = make_pipeline(preprocessor, LogisticRegression())\n",
+    "model.fit(data, target)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Finally we score the model"
+    "## Finally we can score the model using cross-validation:"
    ]
   },
   {
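The hunk above introduces `make_pipeline`/`make_column_transformer` and notes that step names are then derived automatically from the lowercased class names. A minimal, self-contained sketch of that behavior; the tiny dataframe, `target`, and column lists below are hypothetical stand-ins for the notebook's adult-census data:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in data for the notebook's dataset.
data = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "hours-per-week": [40, 50, 38, 45],
    "workclass": ["Private", "Private", "State-gov", "Private"],
})
target = [0, 1, 1, 0]
numeric_features = ["age", "hours-per-week"]
categorical_features = ["workclass"]

numeric_transformer = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler()
)
preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features),
    (OneHotEncoder(handle_unknown="ignore"), categorical_features),
)
model = make_pipeline(preprocessor, LogisticRegression())
model.fit(data, target)

# Step names are set automatically from the lowercased class names:
print(list(model.named_steps))  # ['columntransformer', 'logisticregression']
```

Unlike `Pipeline(steps=[("preprocessor", ...), ...])`, there is no way to pick those names; they are always the lowercased class names.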

notebooks/ensemble_bagging.ipynb

Lines changed: 1 addition & 1 deletion
@@ -113,7 +113,7 @@
    "lines_to_next_cell": 2
   },
   "source": [
-    "Let's see how we can use bootstraping to learn several trees.\n",
+    "Let's see how we can use bootstrapping to learn several trees.\n",
    "\n",
    "## Bootstrap resampling\n",
    "\n",
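The typo fix above touches the notebook's introduction to bootstrap resampling. As a quick illustration of the resampling step itself (a sketch, not code from the commit):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(10)  # stand-in for a training set of 10 samples

# One bootstrap resample: draw n indices *with replacement* from n points.
bootstrap_indices = rng.integers(0, len(x), size=len(x))
bootstrap_sample = x[bootstrap_indices]

# Each resample typically repeats some points and omits others
# (on average ~63.2% of the distinct points appear in a given resample).
n_distinct = len(np.unique(bootstrap_sample))
```

Fitting one tree per such resample and averaging their predictions is the bagging idea the notebook goes on to develop.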

notebooks/parameter_tuning_grid_search.ipynb

Lines changed: 13 additions & 3 deletions
@@ -371,8 +371,9 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "With only 2 parameters, we might want to visualize the grid-search as a\n",
-    "heatmap. We need to transform our `cv_results` into a dataframe where:\n",
+    "Given that we are tuning only 2 parameters, we can visualize the results as a\n",
+    "heatmap. To do so, we first need to reshape the `cv_results` into a dataframe\n",
+    "where:\n",
    "\n",
    "- the rows correspond to the learning-rate values;\n",
    "- the columns correspond to the maximum number of leaf nodes;\n",
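The rewritten markdown describes reshaping `cv_results` so that rows are learning-rate values and columns are maximum-leaf-node values. One way to sketch that with `pandas.pivot_table`; the `cv_results` values and the `param_classifier__...` column names below are hypothetical stand-ins for the notebook's actual grid-search output:

```python
import pandas as pd

# Hypothetical excerpt of a grid-search's cv_results_ as a dataframe.
cv_results = pd.DataFrame({
    "param_classifier__learning_rate": [0.01, 0.01, 0.1, 0.1],
    "param_classifier__max_leaf_nodes": [3, 31, 3, 31],
    "mean_test_score": [0.80, 0.84, 0.85, 0.87],
})

# Reshape: rows = learning-rate values, columns = max-leaf-node values.
pivoted_cv_results = cv_results.pivot_table(
    values="mean_test_score",
    index="param_classifier__learning_rate",
    columns="param_classifier__max_leaf_nodes",
)
print(pivoted_cv_results)
```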
@@ -398,7 +399,8 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "We can use a heatmap representation to show the above dataframe visually."
+    "Now that we have the data in the right format, we can create the heatmap as\n",
+    "follows:"
   ]
  },
  {
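The new wording promises a heatmap built from the pivoted dataframe. A sketch using `seaborn.heatmap`; the pivoted values below are made up, standing in for the notebook's real grid-search scores:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs in a script
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical pivoted results: rows = learning rate, columns = max leaf nodes.
pivoted_cv_results = pd.DataFrame(
    [[0.80, 0.84], [0.85, 0.87]],
    index=pd.Index([0.01, 0.1], name="learning_rate"),
    columns=pd.Index([3, 31], name="max_leaf_nodes"),
)

ax = sns.heatmap(pivoted_cv_results, annot=True, cmap="YlGnBu")
ax.invert_yaxis()  # put the smallest learning rate at the bottom
plt.tight_layout()
```

`annot=True` prints each mean score inside its cell, which partly compensates for color alone being hard to compare precisely.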
@@ -424,6 +426,14 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
+    "The heatmap above shows the mean test accuracy (i.e., the average over\n",
+    "cross-validation splits) for each combination of hyperparameters, where\n",
+    "darker colors indicate better performance. However, the colors only let us\n",
+    "compare the mean test scores visually; they carry no information on the\n",
+    "standard deviation over splits, making it difficult to say whether the score\n",
+    "differences between combinations correspond to a significantly better model\n",
+    "or not.\n",
+    "\n",
    "The above table highlights the following things:\n",
    "\n",
    "* for too high values of `learning_rate`, the generalization performance of\n",
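The added paragraph argues that mean test scores alone cannot tell whether one hyperparameter combination is significantly better than another, since the heatmap hides the standard deviation over splits. A crude sketch of that comparison using both `mean_test_score` and `std_test_score`, which scikit-learn reports per candidate in `cv_results_` (all values below are hypothetical):

```python
import pandas as pd

# Hypothetical excerpt of a grid-search's cv_results_.
cv_results = pd.DataFrame({
    "param_classifier__learning_rate": [0.01, 0.1, 1.0],
    "mean_test_score": [0.84, 0.87, 0.86],
    "std_test_score": [0.02, 0.02, 0.03],
})

best = cv_results.loc[cv_results["mean_test_score"].idxmax()]

# A rough heuristic: a candidate is hard to distinguish from the best one
# when the gap in means is within the two standard deviations combined.
indistinguishable = (
    best["mean_test_score"] - cv_results["mean_test_score"]
    <= cv_results["std_test_score"] + best["std_test_score"]
)
print(cv_results[indistinguishable])
```

With the made-up numbers above, every candidate overlaps with the best one, which is exactly the ambiguity the paragraph warns about.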

0 commit comments
