INRIA
diff --git a/notebooks/01_tabular_data_exploration.ipynb b/notebooks/01_tabular_data_exploration.ipynb
Lines changed: 52 additions & 26 deletions
diff --git a/notebooks/02_numerical_pipeline_cross_validation.ipynb b/notebooks/02_numerical_pipeline_cross_validation.ipynb
Lines changed: 5 additions & 3 deletions
diff --git a/notebooks/02_numerical_pipeline_hands_on.ipynb b/notebooks/02_numerical_pipeline_hands_on.ipynb
Lines changed: 6 additions & 4 deletions
diff --git a/notebooks/02_numerical_pipeline_introduction.ipynb b/notebooks/02_numerical_pipeline_introduction.ipynb
Lines changed: 4 additions & 2 deletions
diff --git a/notebooks/02_numerical_pipeline_scaling.ipynb b/notebooks/02_numerical_pipeline_scaling.ipynb
Lines changed: 0 additions & 11 deletions
diff --git a/notebooks/03_categorical_pipeline.ipynb b/notebooks/03_categorical_pipeline.ipynb
Lines changed: 10 additions & 4 deletions
diff --git a/notebooks/03_categorical_pipeline_column_transformer.ipynb b/notebooks/03_categorical_pipeline_column_transformer.ipynb
Lines changed: 1 addition & 18 deletions
diff --git a/notebooks/video_pipeline.ipynb b/notebooks/03_categorical_pipeline_visualization.ipynb
rename from notebooks/video_pipeline.ipynb
rename to notebooks/03_categorical_pipeline_visualization.ipynb
diff --git a/notebooks/ensemble_bagging.ipynb b/notebooks/ensemble_bagging.ipynb
Lines changed: 10 additions & 14 deletions
@@ -30,7 +30,7 @@
     "<http://www.openml.org/d/1590>\n",
     "\n",
     "The dataset is available as a CSV (Comma-Separated Values) file and we will\n",
-    "use pandas to read it.\n",
+    "use `pandas` to read it.\n",
     "\n",
     "<div class=\"admonition note alert alert-info\">\n",
     "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
@@ -67,7 +67,7 @@
    "source": [
     "## The variables (columns) in the dataset\n",
     "\n",
-    "The data are stored in a pandas dataframe. A dataframe is a type of structured\n",
+    "The data are stored in a `pandas` dataframe. A dataframe is a type of structured\n",
     "data composed of 2 dimensions. This type of data is also referred as tabular\n",
     "data.\n",
     "\n",
@@ -105,7 +105,8 @@
     "The column named **class** is our target variable (i.e., the variable which\n",
     "we want to predict). The two possible classes are `<=50K` (low-revenue) and\n",
     "`>50K` (high-revenue). The resulting prediction problem is therefore a\n",
-    "binary classification problem, while we will use the other columns as input\n",
+    "binary classification problem, as `class` has only two possible values.\n",
+    "We will use the remaining columns (any column other than `class`) as input\n",
     "variables for our model."
    ]
   },
@@ -125,8 +126,9 @@
    "source": [
     "<div class=\"admonition note alert alert-info\">\n",
     "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
-    "<p>Classes are slightly imbalanced, meaning there are more samples of one or\n",
-    "more classes compared to others. Class imbalance happens often in practice\n",
+    "<p>Here, classes are slightly imbalanced, meaning there are more samples of one or\n",
+    "more classes compared to others. In this case, we have many more samples with\n",
+    "<tt class=\"docutils literal\">\" &lt;=50K\"</tt> than with <tt class=\"docutils literal\">\" &gt;50K\"</tt>. Class imbalance happens often in practice\n",
     "and may need special techniques when building a predictive model.</p>\n",
     "<p class=\"last\">For example in a medical setting, if we are trying to predict whether\n",
     "subjects will develop a rare disease, there will be a lot more healthy\n",
@@ -150,11 +152,22 @@
    "outputs": [],
    "source": [
     "numerical_columns = [\n",
-    "    \"age\", \"education-num\", \"capital-gain\", \"capital-loss\",\n",
-    "    \"hours-per-week\"]\n",
+    "    \"age\",\n",
+    "    \"education-num\",\n",
+    "    \"capital-gain\",\n",
+    "    \"capital-loss\",\n",
+    "    \"hours-per-week\",\n",
+    "]\n",
     "categorical_columns = [\n",
-    "    \"workclass\", \"education\", \"marital-status\", \"occupation\",\n",
-    "    \"relationship\", \"race\", \"sex\", \"native-country\"]\n",
+    "    \"workclass\",\n",
+    "    \"education\",\n",
+    "    \"marital-status\",\n",
+    "    \"occupation\",\n",
+    "    \"relationship\",\n",
+    "    \"race\",\n",
+    "    \"sex\",\n",
+    "    \"native-country\",\n",
+    "]\n",
     "all_columns = numerical_columns + categorical_columns + [target_column]\n",
     "\n",
     "adult_census = adult_census[all_columns]"
@@ -174,8 +187,10 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "print(f\"The dataset contains {adult_census.shape[0]} samples and \"\n",
-    "      f\"{adult_census.shape[1]} columns\")"
+    "print(\n",
+    "    f\"The dataset contains {adult_census.shape[0]} samples and \"\n",
+    "    f\"{adult_census.shape[1]} columns\"\n",
+    ")"
    ]
   },
   {
@@ -275,17 +290,26 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Note that there is an important imbalance on the data collection concerning\n",
-    "the number of male/female samples. Be aware that any kind of data imbalance\n",
-    "will impact the generalizability of a model trained on it. Moreover, it can\n",
-    "lead to\n",
+    "Note that the data collection process resulted in an important imbalance\n",
+    "between the number of male and female samples.\n",
+    "\n",
+    "Be aware that training a model with such data imbalance can cause\n",
+    "disproportionate prediction errors for the under-represented groups. This is a\n",
+    "typical cause of\n",
     "[fairness](https://docs.microsoft.com/en-us/azure/machine-learning/concept-fairness-ml#what-is-machine-learning-fairness)\n",
-    "problems if used naively when deploying a real life setting.\n",
+    "problems if used naively when deploying a machine learning based system in a\n",
+    "real life setting.\n",
     "\n",
     "We recommend our readers to refer to [fairlearn.org](https://fairlearn.org)\n",
     "for resources on how to quantify and potentially mitigate fairness\n",
     "issues related to the deployment of automated decision making\n",
-    "systems that relying on machine learning components."
+    "systems that rely on machine learning components.\n",
+    "\n",
+    "Studying why the data collection process of this dataset led to such an\n",
+    "unexpected gender imbalance is beyond the scope of this MOOC but we should\n",
+    "keep in mind that this dataset is not representative of the US population\n",
+    "before drawing any conclusions based on its statistics or the predictions of\n",
+    "models trained on it."
    ]
   },
   {
@@ -323,8 +347,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "This shows that `\"education\"` and `\"education-num\"` give you the same\n",
-    "information. For example, `\"education-num\"=2` is equivalent to\n",
+    "For every entry in `\"education\"`, there is exactly one corresponding\n",
+    "value in `\"education-num\"`. This shows that `\"education\"` and `\"education-num\"`\n",
+    "give you the same information. For example, `\"education-num\"=2` is equivalent to\n",
     "`\"education\"=\"1st-4th\"`. In practice that means we can remove\n",
     "`\"education-num\"` without losing information. Note that having redundant (or\n",
     "highly correlated) columns can be a problem for machine learning algorithms."
@@ -463,20 +488,21 @@
     "will choose the \"best\" splits based on data without human intervention or\n",
     "inspection. Decision trees will be covered more in detail in a future module.\n",
     "\n",
-    "Note that machine learning is really interesting when creating rules by hand\n",
-    "is not straightforward, for example because we are in high dimension (many\n",
-    "features) or because there are no simple and obvious rules that separate the\n",
-    "two classes as in the top-right region of the previous plot.\n",
+    "Note that machine learning is often used when creating rules by hand\n",
+    "is not straightforward, for example because we are in high dimension (many\n",
+    "features in a table) or because there are no simple and obvious rules that\n",
+    "separate the two classes, as in the top-right region of the previous plot.\n",
     "\n",
     "To sum up, the important thing to remember is that in a machine-learning\n",
-    "setting, a model automatically creates the \"rules\" from the data in order to\n",
-    "make predictions on new unseen data."
+    "setting, a model automatically creates the \"rules\" from the existing data in\n",
+    "order to make predictions on new unseen data."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## Notebook recap\n",
     "\n",
     "In this notebook we:\n",
     "\n",
@@ -487,7 +513,7 @@
     "  you to decide whether using machine learning is appropriate for your data\n",
     "  and to highlight potential peculiarities in your data.\n",
     "\n",
-    "Ideas which will be discussed more in detail later:\n",
+    "We made important observations (which will be discussed later in more detail):\n",
     "\n",
     "* if your target variable is imbalanced (e.g., you have more samples from one\n",
     "  target category than another), you may need special techniques for training\n",
 
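The two observations recapped in this notebook (an imbalanced target and a redundant pair of columns) can be checked with a few lines of `pandas`. The sketch below is illustrative only: it uses a small hypothetical sample (the `toy` dataframe) rather than the real `adult_census` data, and the variable names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy sample mimicking three columns of the adult census data.
toy = pd.DataFrame(
    {
        "education": ["1st-4th", "HS-grad", "HS-grad", "Bachelors", "1st-4th", "HS-grad"],
        "education-num": [2, 9, 9, 13, 2, 9],
        "class": [" <=50K", " <=50K", " >50K", " >50K", " <=50K", " <=50K"],
    }
)

# Proportion of each target class: uneven proportions signal class imbalance.
class_proportions = toy["class"].value_counts(normalize=True)

# Number of distinct "education-num" values for each "education" level.
# If every level maps to exactly one number, the two columns are redundant.
codes_per_level = toy.groupby("education")["education-num"].nunique()
is_redundant = bool((codes_per_level == 1).all())

print(class_proportions)
print(f"education-num is redundant with education: {is_redundant}")
```

On the real dataset, the same two calls would presumably be applied to `adult_census` directly.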
@@ -153,9 +153,9 @@
    "source": [
     "The output of `cross_validate` is a Python dictionary, which by default\n",
     "contains three entries:\n",
-    "- (i) the time to train the model on the training data for each fold,\n",
-    "- (ii) the time to predict with the model on the testing data for each fold,\n",
-    "- (iii) the default score on the testing data for each fold.\n",
+    "- (i) the time to train the model on the training data for each fold, `fit_time`;\n",
+    "- (ii) the time to predict with the model on the testing data for each fold, `score_time`;\n",
+    "- (iii) the default score on the testing data for each fold, `test_score`.\n",
     "\n",
     "Setting `cv=5` created 5 distinct splits to get 5 variations for the training\n",
     "and testing sets. Each training set is used to fit one model which is then\n",
@@ -215,6 +215,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## Notebook recap\n",
+    "\n",
     "In this notebook we assessed the generalization performance of our model via\n",
     "**cross-validation**."
    ]
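
As a quick illustration of the dictionary returned by `cross_validate`, here is a minimal sketch on synthetic data. The dataset (`make_classification`) and the bare `LogisticRegression` model are assumptions chosen to keep the example self-contained; they are not the notebook's own data or pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data standing in for the real dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# cv=5 fits and scores one model per fold, so each entry holds 5 values.
cv_results = cross_validate(LogisticRegression(), X, y, cv=5)

result_keys = sorted(cv_results)
mean_score = cv_results["test_score"].mean()

print(result_keys)
print(f"Mean test accuracy over 5 folds: {mean_score:.3f}")
```

Each value in the dictionary is an array with one element per fold, which is why the scores are usually summarized with their mean and standard deviation.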
 
@@ -132,8 +132,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We seem to have only two data types. We can make sure by checking the unique\n",
-    "data types."
+    "We seem to have only two data types: `int64` and `object`. We can make\n",
+    "sure by checking for unique data types."
    ]
   },
   {
@@ -149,7 +149,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Indeed, the only two types in the dataset are integer and object.\n",
+    "Indeed, the only two types in the dataset are integer (`int64`) and `object`.\n",
     "We can look at the first few lines of the dataframe to understand the\n",
     "meaning of the `object` data type."
    ]
@@ -379,9 +379,11 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## Notebook recap\n",
+    "\n",
     "In scikit-learn, the `score` method of a classification model returns the accuracy,\n",
     "i.e. the fraction of correctly classified samples. In this case, around\n",
-    "8 / 10 of the times, the logistic regression predicts the right income of a\n",
+    "8 / 10 of the times the logistic regression predicts the right income of a\n",
     "person. Now the real question is: is this generalization performance relevant\n",
     "of a good predictive model? Find out by solving the next exercise!\n",
     "\n",
 
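To make the meaning of `score` concrete, the following sketch compares it with an accuracy computed by hand. It assumes synthetic data and a default `LogisticRegression`, so the number it prints is unrelated to the roughly 8 / 10 figure quoted for the census data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data used only for illustration.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# Accuracy as reported by scikit-learn for a classifier.
accuracy = model.score(X_test, y_test)

# The same quantity computed manually: fraction of correct predictions.
manual_accuracy = np.mean(model.predict(X_test) == y_test)

print(f"score: {accuracy:.3f}, manual: {manual_accuracy:.3f}")
```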
@@ -402,7 +402,7 @@
     "\n",
     "It shows the importance to always testing the generalization performance of\n",
     "predictive models on a different set than the one used to train these models.\n",
-    "We will discuss later in more details how predictive models should be\n",
+    "We will discuss later in more detail how predictive models should be\n",
     "evaluated."
    ]
   },
@@ -417,7 +417,7 @@
     "prediction of a model and the true targets. Equivalent terms for\n",
     "<strong>generalization performance</strong> are predictive performance and statistical\n",
     "performance. We will refer to <strong>computational performance</strong> of a predictive\n",
-    "model when accessing the computational costs of training a predictive model\n",
+    "model when assessing the computational costs of training a predictive model\n",
     "or using it to make predictions.</p>\n",
     "</div>"
    ]
@@ -426,6 +426,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## Notebook recap\n",
+    "\n",
     "In this notebook we:\n",
     "\n",
     "* fitted a **k-nearest neighbors** model on a training dataset;\n",
 
@@ -30,17 +30,6 @@
     "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")"
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# to display nice model diagram\n",
-    "from sklearn import set_config\n",
-    "set_config(display='diagram')"
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
 
@@ -471,7 +471,9 @@
     "\n",
     "* list all the possible categories and provide it to the encoder via the\n",
     "  keyword argument `categories`;\n",
-    "* use the parameter `handle_unknown`.\n",
+    "* use the parameter `handle_unknown`, e.g. with `handle_unknown=\"ignore\"`: if\n",
+    "  an unknown category is encountered during transform, the resulting one-hot\n",
+    "  encoded columns for this feature will be all zeros.\n",
     "\n",
     "Here, we will use the latter solution for simplicity."
    ]
@@ -483,9 +485,13 @@
     "<div class=\"admonition tip alert alert-warning\">\n",
     "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
     "<p class=\"last\">Be aware the <tt class=\"docutils literal\">OrdinalEncoder</tt> exposes as well a parameter\n",
-    "<tt class=\"docutils literal\">handle_unknown</tt>. It can be set to <tt class=\"docutils literal\">use_encoded_value</tt> and by setting\n",
-    "<tt class=\"docutils literal\">unknown_value</tt> to handle rare categories. You are going to use these\n",
-    "parameters in the next exercise.</p>\n",
+    "<tt class=\"docutils literal\">handle_unknown</tt>. It can be set to <tt class=\"docutils literal\">use_encoded_value</tt>. If that option is chosen,\n",
+    "you can define a fixed value to which all unknowns will be set during\n",
+    "<tt class=\"docutils literal\">transform</tt>. For example,\n",
+    "<tt class=\"docutils literal\"><span class=\"pre\">OrdinalEncoder(handle_unknown='use_encoded_value',</span> unknown_value=42)</tt>\n",
+    "will encode as <tt class=\"docutils literal\">42</tt> every category encountered during <tt class=\"docutils literal\">transform</tt> that was\n",
+    "not part of the data seen during the <tt class=\"docutils literal\">fit</tt> call.\n",
+    "You are going to use these parameters in the next exercise.</p>\n",
     "</div>"
    ]
   },
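
The behaviour of `handle_unknown` for both encoders can be sketched on a tiny example. The category names and the choice of `unknown_value=-1` below are assumptions made for illustration; the notebook itself may use different values.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Categories seen during fit, and a test set containing an unseen one.
train = np.array([["Private"], ["State-gov"], ["Private"]])
test = np.array([["Private"], ["Never-worked"]])

# With handle_unknown="ignore", an unseen category yields an all-zero row.
one_hot = OneHotEncoder(handle_unknown="ignore").fit(train)
one_hot_encoded = one_hot.transform(test).toarray()

# With handle_unknown="use_encoded_value", an unseen category is mapped to
# the fixed unknown_value (here -1, which no known category can receive).
ordinal = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
).fit(train)
ordinal_encoded = ordinal.transform(test)

print(one_hot_encoded)
print(ordinal_encoded)
```

Note that `unknown_value` must differ from the integer codes assigned to the known categories, which is why a negative value is a convenient choice here.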
 
@@ -168,24 +168,7 @@
     "from sklearn.linear_model import LogisticRegression\n",
     "from sklearn.pipeline import make_pipeline\n",
     "\n",
-    "model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We can display an interactive diagram with the following command:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from sklearn import set_config\n",
-    "set_config(display='diagram')\n",
+    "model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))\n",
     "model"
    ]
   },
 
@@ -118,10 +118,8 @@
     "\n",
     "## Bootstrap resampling\n",
     "\n",
-    "Bootstrapping is a resampling \"with replacement\" of the original\n",
-    "dataset. It corresponds to sampling n out of n data points with\n",
-    "replacement uniformly at random from the original dataset. n is the\n",
-    "number of data points in the original dataset.\n",
+    "Given a dataset with `n` data points, bootstrapping corresponds to resampling\n",
+    "with replacement `n` out of these `n` data points uniformly at random.\n",
     "\n",
     "As a result, the output of the bootstrap sampling procedure is another\n",
     "dataset with also n data points, but likely with duplicates. As a consequence,\n",
@@ -219,19 +217,17 @@
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "lines_to_next_cell": 2
-   },
+   "metadata": {},
    "source": [
     "\n",
-    "On average, ~63.2% of the original data points of the original dataset will\n",
-    "be present in a given bootstrap sample. The other ~36.8% are repeated\n",
-    "samples.\n",
-    "\n",
-    "We are able to generate many datasets, all slightly different.\n",
+    "On average, roughly 63.2% of the data points of the original dataset will be\n",
+    "present in a given bootstrap sample. Since the bootstrap sample has the same\n",
+    "size as the original dataset, many data points appear in the bootstrap\n",
+    "sample multiple times.\n",
     "\n",
-    "Now, we can fit a decision tree for each of these datasets and they all shall\n",
-    "be slightly different as well."
+    "Using bootstrapping, we can generate many datasets, all slightly\n",
+    "different. We can then fit a decision tree on each of these datasets, and\n",
+    "the resulting trees will be slightly different as well."
    ]
   },
   {
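
The 63.2% figure quoted for bootstrap samples can be checked empirically. The sketch below assumes a dataset represented only by its indices and uses NumPy's random generator; the sample size and seed are arbitrary choices for this illustration. The theoretical value is 1 - (1 - 1/n)^n, which approaches 1 - 1/e as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # number of data points in the (hypothetical) original dataset

# Bootstrap sample: draw n indices out of n, uniformly and with replacement.
bootstrap_indices = rng.choice(n, size=n, replace=True)

# Fraction of distinct original data points present in the bootstrap sample.
fraction_present = np.unique(bootstrap_indices).size / n

# Expected fraction for this n, close to 1 - 1/e.
expected_fraction = 1 - (1 - 1 / n) ** n

print(f"observed: {fraction_present:.3f}, expected: {expected_fraction:.3f}")
```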