
Commit f51c4e6: Update notebooks
1 parent 7aee3bd commit f51c4e6

14 files changed: +282 -358 lines

notebooks/01_tabular_data_exploration.ipynb

Lines changed: 52 additions & 26 deletions
@@ -30,7 +30,7 @@
 "<http://www.openml.org/d/1590>\n",
 "\n",
 "The dataset is available as a CSV (Comma-Separated Values) file and we will\n",
-"use pandas to read it.\n",
+"use `pandas` to read it.\n",
 "\n",
 "<div class=\"admonition note alert alert-info\">\n",
 "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
@@ -67,7 +67,7 @@
 "source": [
 "## The variables (columns) in the dataset\n",
 "\n",
-"The data are stored in a pandas dataframe. A dataframe is a type of structured\n",
+"The data are stored in a `pandas` dataframe. A dataframe is a type of structured\n",
 "data composed of 2 dimensions. This type of data is also referred as tabular\n",
 "data.\n",
 "\n",
@@ -105,7 +105,8 @@
 "The column named **class** is our target variable (i.e., the variable which\n",
 "we want to predict). The two possible classes are `<=50K` (low-revenue) and\n",
 "`>50K` (high-revenue). The resulting prediction problem is therefore a\n",
-"binary classification problem, while we will use the other columns as input\n",
+"binary classification problem as `class` has only two possible values.\n",
+"We will use the left-over columns (any column other than `class`) as input\n",
 "variables for our model."
 ]
 },
@@ -125,8 +126,9 @@
 "source": [
 "<div class=\"admonition note alert alert-info\">\n",
 "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
-"<p>Classes are slightly imbalanced, meaning there are more samples of one or\n",
-"more classes compared to others. Class imbalance happens often in practice\n",
+"<p>Here, classes are slightly imbalanced, meaning there are more samples of one or\n",
+"more classes compared to others. In this case, we have many more samples with\n",
+"<tt class=\"docutils literal\">\" &lt;=50K\"</tt> than with <tt class=\"docutils literal\">\" &gt;50K\"</tt>. Class imbalance happens often in practice\n",
 "and may need special techniques when building a predictive model.</p>\n",
 "<p class=\"last\">For example in a medical setting, if we are trying to predict whether\n",
 "subjects will develop a rare disease, there will be a lot more healthy\n",
@@ -150,11 +152,22 @@
 "outputs": [],
 "source": [
 "numerical_columns = [\n",
-"    \"age\", \"education-num\", \"capital-gain\", \"capital-loss\",\n",
-"    \"hours-per-week\"]\n",
+"    \"age\",\n",
+"    \"education-num\",\n",
+"    \"capital-gain\",\n",
+"    \"capital-loss\",\n",
+"    \"hours-per-week\",\n",
+"]\n",
 "categorical_columns = [\n",
-"    \"workclass\", \"education\", \"marital-status\", \"occupation\",\n",
-"    \"relationship\", \"race\", \"sex\", \"native-country\"]\n",
+"    \"workclass\",\n",
+"    \"education\",\n",
+"    \"marital-status\",\n",
+"    \"occupation\",\n",
+"    \"relationship\",\n",
+"    \"race\",\n",
+"    \"sex\",\n",
+"    \"native-country\",\n",
+"]\n",
 "all_columns = numerical_columns + categorical_columns + [target_column]\n",
 "\n",
 "adult_census = adult_census[all_columns]"
@@ -174,8 +187,10 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"print(f\"The dataset contains {adult_census.shape[0]} samples and \"\n",
-"      f\"{adult_census.shape[1]} columns\")"
+"print(\n",
+"    f\"The dataset contains {adult_census.shape[0]} samples and \"\n",
+"    f\"{adult_census.shape[1]} columns\"\n",
+")"
 ]
 },
 {
@@ -275,17 +290,26 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Note that there is an important imbalance on the data collection concerning\n",
-"the number of male/female samples. Be aware that any kind of data imbalance\n",
-"will impact the generalizability of a model trained on it. Moreover, it can\n",
-"lead to\n",
+"Note that the data collection process resulted in an important imbalance\n",
+"between the number of male/female samples.\n",
+"\n",
+"Be aware that training a model with such data imbalance can cause\n",
+"disproportionate prediction errors for the under-represented groups. This is a\n",
+"typical cause of\n",
 "[fairness](https://docs.microsoft.com/en-us/azure/machine-learning/concept-fairness-ml#what-is-machine-learning-fairness)\n",
-"problems if used naively when deploying a real life setting.\n",
+"problems if used naively when deploying a machine learning based system in a\n",
+"real life setting.\n",
 "\n",
 "We recommend our readers to refer to [fairlearn.org](https://fairlearn.org)\n",
 "for resources on how to quantify and potentially mitigate fairness\n",
 "issues related to the deployment of automated decision making\n",
-"systems that relying on machine learning components."
+"systems that rely on machine learning components.\n",
+"\n",
+"Studying why the data collection process of this dataset led to such an\n",
+"unexpected gender imbalance is beyond the scope of this MOOC but we should\n",
+"keep in mind that this dataset is not representative of the US population\n",
+"before drawing any conclusions based on its statistics or the predictions of\n",
+"models trained on it."
 ]
 },
 {
@@ -323,8 +347,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"This shows that `\"education\"` and `\"education-num\"` give you the same\n",
-"information. For example, `\"education-num\"=2` is equivalent to\n",
+"For every entry in `\"education\"`, there is only one single corresponding\n",
+"value in `\"education-num\"`. This shows that `\"education\"` and `\"education-num\"`\n",
+"give you the same information. For example, `\"education-num\"=2` is equivalent to\n",
 "`\"education\"=\"1st-4th\"`. In practice that means we can remove\n",
 "`\"education-num\"` without losing information. Note that having redundant (or\n",
 "highly correlated) columns can be a problem for machine learning algorithms."
@@ -463,20 +488,21 @@
 "will choose the \"best\" splits based on data without human intervention or\n",
 "inspection. Decision trees will be covered more in detail in a future module.\n",
 "\n",
-"Note that machine learning is really interesting when creating rules by hand\n",
-"is not straightforward, for example because we are in high dimension (many\n",
-"features) or because there are no simple and obvious rules that separate the\n",
-"two classes as in the top-right region of the previous plot.\n",
+"Note that machine learning is often used when creating rules by hand\n",
+"is not straightforward, for example because we are in high dimension (many\n",
+"features in a table) or because there are no simple and obvious rules that\n",
+"separate the two classes as in the top-right region of the previous plot.\n",
 "\n",
 "To sum up, the important thing to remember is that in a machine-learning\n",
-"setting, a model automatically creates the \"rules\" from the data in order to\n",
-"make predictions on new unseen data."
+"setting, a model automatically creates the \"rules\" from the existing data in\n",
+"order to make predictions on new unseen data."
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
+"## Notebook recap\n",
 "\n",
 "In this notebook we:\n",
 "\n",
@@ -487,7 +513,7 @@
 "  you to decide whether using machine learning is appropriate for your data\n",
 "  and to highlight potential peculiarities in your data.\n",
 "\n",
-"Ideas which will be discussed more in detail later:\n",
+"We made important observations (which will be discussed later in more detail):\n",
 "\n",
 "* if your target variable is imbalanced (e.g., you have more samples from one\n",
 "  target category than another), you may need special techniques for training\n",
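The imbalance check the notebook describes boils down to counting target values. A minimal sketch on a toy target column (the real notebook uses the adult census `class` column; the series below is illustrative):

```python
# Hypothetical toy target imitating the census "class" column, where the
# notebook observes many more " <=50K" than " >50K" samples.
import pandas as pd

target = pd.Series([" <=50K"] * 7 + [" >50K"] * 3, name="class")
print(target.value_counts())                 # raw counts per class
print(target.value_counts(normalize=True))   # class proportions
```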

notebooks/02_numerical_pipeline_cross_validation.ipynb

Lines changed: 5 additions & 3 deletions
@@ -153,9 +153,9 @@
 "source": [
 "The output of `cross_validate` is a Python dictionary, which by default\n",
 "contains three entries:\n",
-"- (i) the time to train the model on the training data for each fold,\n",
-"- (ii) the time to predict with the model on the testing data for each fold,\n",
-"- (iii) the default score on the testing data for each fold.\n",
+"- (i) the time to train the model on the training data for each fold, `fit_time`\n",
+"- (ii) the time to predict with the model on the testing data for each fold, `score_time`\n",
+"- (iii) the default score on the testing data for each fold, `test_score`.\n",
 "\n",
 "Setting `cv=5` created 5 distinct splits to get 5 variations for the training\n",
 "and testing sets. Each training set is used to fit one model which is then\n",
@@ -215,6 +215,8 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
+"## Notebook recap\n",
+"\n",
 "In this notebook we assessed the generalization performance of our model via\n",
 "**cross-validation**."
 ]
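The three entries named in the diff above (`fit_time`, `score_time`, `test_score`) can be checked directly. A minimal sketch on a synthetic dataset (the data and model are illustrative, not taken from the notebook):

```python
# cross_validate returns a dict with one array per entry, one value per fold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=200, random_state=0)
cv_results = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=5)

print(sorted(cv_results))            # ['fit_time', 'score_time', 'test_score']
print(len(cv_results["test_score"]))  # 5, one score per fold
```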

notebooks/02_numerical_pipeline_hands_on.ipynb

Lines changed: 6 additions & 4 deletions
@@ -132,8 +132,8 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We seem to have only two data types. We can make sure by checking the unique\n",
-"data types."
+"We seem to have only two data types: `int64` and `object`. We can make\n",
+"sure by checking for unique data types."
 ]
 },
 {
@@ -149,7 +149,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Indeed, the only two types in the dataset are integer and object.\n",
+"Indeed, the only two types in the dataset are integer `int64` and `object`.\n",
 "We can look at the first few lines of the dataframe to understand the\n",
 "meaning of the `object` data type."
 ]
@@ -379,9 +379,11 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
+"## Notebook recap\n",
+"\n",
 "In scikit-learn, the `score` method of a classification model returns the accuracy,\n",
 "i.e. the fraction of correctly classified samples. In this case, around\n",
-"8 / 10 of the times, the logistic regression predicts the right income of a\n",
+"8 / 10 of the times the logistic regression predicts the right income of a\n",
 "person. Now the real question is: is this generalization performance relevant\n",
 "of a good predictive model? Find out by solving the next exercise!\n",
 "\n",
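The dtype check the notebook refers to is a one-liner in `pandas`. A minimal sketch on a toy dataframe (the columns are illustrative stand-ins for the census data):

```python
# A dataframe mixing a numeric column and a string column exposes exactly the
# two dtypes the notebook names: int64 and object.
import pandas as pd

df = pd.DataFrame({"age": [25, 38], "workclass": ["Private", "State-gov"]})
print(df.dtypes.unique())  # one integer dtype and one object dtype
```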

notebooks/02_numerical_pipeline_introduction.ipynb

Lines changed: 4 additions & 2 deletions
@@ -402,7 +402,7 @@
 "\n",
 "It shows the importance to always testing the generalization performance of\n",
 "predictive models on a different set than the one used to train these models.\n",
-"We will discuss later in more details how predictive models should be\n",
+"We will discuss later in more detail how predictive models should be\n",
 "evaluated."
 ]
 },
@@ -417,7 +417,7 @@
 "prediction of a model and the true targets. Equivalent terms for\n",
 "<strong>generalization performance</strong> are predictive performance and statistical\n",
 "performance. We will refer to <strong>computational performance</strong> of a predictive\n",
-"model when accessing the computational costs of training a predictive model\n",
+"model when assessing the computational costs of training a predictive model\n",
 "or using it to make predictions.</p>\n",
 "</div>"
 ]
@@ -426,6 +426,8 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
+"## Notebook recap\n",
+"\n",
 "In this notebook we:\n",
 "\n",
 "* fitted a **k-nearest neighbors** model on a training dataset;\n",
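The point about testing on a different set than the training set is easy to demonstrate with the same model family the notebook uses. A hedged sketch on synthetic data (not the census dataset): a 1-nearest-neighbor classifier scores perfectly on its own training set, so only the held-out score is meaningful.

```python
# 1-NN memorizes the training set: each training point is its own nearest
# neighbor, so the training accuracy is 1.0 regardless of generalization.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(model.score(X_train, y_train))  # 1.0 by construction for 1-NN
print(model.score(X_test, y_test))    # the honest generalization estimate
```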

notebooks/02_numerical_pipeline_scaling.ipynb

Lines changed: 0 additions & 11 deletions
@@ -30,17 +30,6 @@
 "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")"
 ]
 },
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"# to display nice model diagram\n",
-"from sklearn import set_config\n",
-"set_config(display='diagram')"
-]
-},
 {
 "cell_type": "markdown",
 "metadata": {},

notebooks/03_categorical_pipeline.ipynb

Lines changed: 10 additions & 4 deletions
@@ -471,7 +471,9 @@
 "\n",
 "* list all the possible categories and provide it to the encoder via the\n",
 "  keyword argument `categories`;\n",
-"* use the parameter `handle_unknown`.\n",
+"* use the parameter `handle_unknown`, i.e. if an unknown category is encountered\n",
+"  during transform, the resulting one-hot encoded columns for this feature will\n",
+"  be all zeros.\n",
 "\n",
 "Here, we will use the latter solution for simplicity."
 ]
@@ -483,9 +485,13 @@
 "<div class=\"admonition tip alert alert-warning\">\n",
 "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
 "<p class=\"last\">Be aware the <tt class=\"docutils literal\">OrdinalEncoder</tt> exposes as well a parameter\n",
-"<tt class=\"docutils literal\">handle_unknown</tt>. It can be set to <tt class=\"docutils literal\">use_encoded_value</tt> and by setting\n",
-"<tt class=\"docutils literal\">unknown_value</tt> to handle rare categories. You are going to use these\n",
-"parameters in the next exercise.</p>\n",
+"<tt class=\"docutils literal\">handle_unknown</tt>. It can be set to <tt class=\"docutils literal\">use_encoded_value</tt>. If that option is chosen,\n",
+"you can define a fixed value that all unknown categories are mapped to during\n",
+"<tt class=\"docutils literal\">transform</tt>. For example,\n",
+"<tt class=\"docutils literal\"><span class=\"pre\">OrdinalEncoder(handle_unknown='use_encoded_value',</span> unknown_value=42)</tt>\n",
+"will map any value not seen during the <tt class=\"docutils literal\">fit</tt> call to <tt class=\"docutils literal\">42</tt>\n",
+"during <tt class=\"docutils literal\">transform</tt>.\n",
+"You are going to use these parameters in the next exercise.</p>\n",
 "</div>"
 ]
 },
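Both behaviors described in this diff can be checked side by side. A small sketch on a toy column (the category values are illustrative, not from the census data):

```python
# OneHotEncoder(handle_unknown="ignore"): an unseen category encodes to all
# zeros. OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=42):
# an unseen category is mapped to the fixed unknown_value.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

train = pd.DataFrame({"workclass": ["Private", "State-gov"]})
test = pd.DataFrame({"workclass": ["Never-worked"]})  # unseen during fit

onehot = OneHotEncoder(handle_unknown="ignore")
onehot.fit(train)
print(onehot.transform(test).toarray())  # [[0. 0.]]

ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=42)
ordinal.fit(train)
print(ordinal.transform(test))  # [[42.]]
```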

notebooks/03_categorical_pipeline_column_transformer.ipynb

Lines changed: 1 addition & 18 deletions
@@ -168,24 +168,7 @@
 "from sklearn.linear_model import LogisticRegression\n",
 "from sklearn.pipeline import make_pipeline\n",
 "\n",
-"model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"We can display an interactive diagram with the following command:"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"from sklearn import set_config\n",
-"set_config(display='diagram')\n",
+"model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))\n",
 "model"
 ]
 },
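The pattern this diff simplifies, a column-wise preprocessor feeding a logistic regression inside one pipeline, can be sketched end to end. The toy data and column names below are illustrative, not the notebook's census dataframe:

```python
# A minimal preprocessor + classifier pipeline in the style of the notebook.
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data = pd.DataFrame({
    "age": [25, 38, 52, 46],
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
})
target = [0, 1, 1, 0]

preprocessor = make_column_transformer(
    (StandardScaler(), ["age"]),                            # scale numeric
    (OneHotEncoder(handle_unknown="ignore"), ["workclass"]),  # encode categorical
)
model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
model.fit(data, target)
print(model.predict(data))  # one prediction per input row
```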

notebooks/ensemble_bagging.ipynb

Lines changed: 10 additions & 14 deletions
@@ -118,10 +118,8 @@
 "\n",
 "## Bootstrap resampling\n",
 "\n",
-"Bootstrapping is a resampling \"with replacement\" of the original\n",
-"dataset. It corresponds to sampling n out of n data points with\n",
-"replacement uniformly at random from the original dataset. n is the\n",
-"number of data points in the original dataset.\n",
+"Given a dataset with `n` data points, bootstrapping corresponds to resampling\n",
+"with replacement `n` out of these `n` data points uniformly at random.\n",
 "\n",
 "As a result, the output of the bootstrap sampling procedure is another\n",
 "dataset with also n data points, but likely with duplicates. As a consequence,\n",
@@ -219,19 +217,17 @@
 },
 {
 "cell_type": "markdown",
-"metadata": {
-"lines_to_next_cell": 2
-},
+"metadata": {},
 "source": [
 "\n",
-"On average, ~63.2% of the original data points of the original dataset will\n",
-"be present in a given bootstrap sample. The other ~36.8% are repeated\n",
-"samples.\n",
-"\n",
-"We are able to generate many datasets, all slightly different.\n",
+"On average, roughly 63.2% of the original data points of the original dataset\n",
+"will be present in a given bootstrap sample. Since the bootstrap sample has\n",
+"the same size as the original dataset, there will be many samples that are in\n",
+"the bootstrap sample multiple times.\n",
 "\n",
-"Now, we can fit a decision tree for each of these datasets and they all shall\n",
-"be slightly different as well."
+"Using bootstrap we are able to generate many datasets, all slightly\n",
+"different. We can fit a decision tree for each of these datasets and they all\n",
+"shall be slightly different as well."
 ]
 },
 {
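The 63.2% figure in the rewritten text is the limit of 1 - (1 - 1/n)^n = 1 - 1/e as n grows, and a quick simulation confirms it. A sketch using an index range as the "dataset" (illustrative, not the notebook's data):

```python
# Draw n indices with replacement from range(n) and measure the fraction of
# distinct original points present in the bootstrap sample.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
bootstrap_indices = rng.integers(0, n, size=n)  # n draws with replacement
fraction_unique = np.unique(bootstrap_indices).size / n
print(round(fraction_unique, 3))  # close to 1 - 1/e, i.e. about 0.632
```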
