|
30 | 30 | "<http://www.openml.org/d/1590>\n", |
31 | 31 | "\n", |
32 | 32 | "The dataset is available as a CSV (Comma-Separated Values) file and we will\n", |
33 | | - "use pandas to read it.\n", |
| 33 | + "use `pandas` to read it.\n", |
34 | 34 | "\n", |
35 | 35 | "<div class=\"admonition note alert alert-info\">\n", |
36 | 36 | "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n", |
|
67 | 67 | "source": [ |
68 | 68 | "## The variables (columns) in the dataset\n", |
69 | 69 | "\n", |
70 | | - "The data are stored in a pandas dataframe. A dataframe is a type of structured\n", |
| 70 | + "The data are stored in a `pandas` dataframe. A dataframe is a type of structured\n", |
71 | 71 | "data composed of 2 dimensions. This type of data is also referred to as tabular\n", |
72 | 72 | "data.\n", |
73 | 73 | "\n", |
|
105 | 105 | "The column named **class** is our target variable (i.e., the variable which\n", |
106 | 106 | "we want to predict). The two possible classes are `<=50K` (low-revenue) and\n", |
107 | 107 | "`>50K` (high-revenue). The resulting prediction problem is therefore a\n", |
108 | | - "binary classification problem, while we will use the other columns as input\n", |
| 108 | + "binary classification problem, since `class` has only two possible values.\n", |
| 109 | + "We will use the remaining columns (any column other than `class`) as input\n", |
109 | 110 | "variables for our model." |
110 | 111 | ] |
111 | 112 | }, |
|
125 | 126 | "source": [ |
126 | 127 | "<div class=\"admonition note alert alert-info\">\n", |
127 | 128 | "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n", |
128 | | - "<p>Classes are slightly imbalanced, meaning there are more samples of one or\n", |
129 | | - "more classes compared to others. Class imbalance happens often in practice\n", |
| 129 | + "<p>Here, classes are slightly imbalanced, meaning there are more samples of one or\n", |
| 130 | + "more classes compared to others. In this case, we have many more samples with\n", |
| 131 | + "<tt class=\"docutils literal\">\" <=50K\"</tt> than with <tt class=\"docutils literal\">\" >50K\"</tt>. Class imbalance happens often in practice\n", |
130 | 132 | "and may need special techniques when building a predictive model.</p>\n", |
131 | 133 | "<p class=\"last\">For example in a medical setting, if we are trying to predict whether\n", |
132 | 134 | "subjects will develop a rare disease, there will be a lot more healthy\n", |
|
150 | 152 | "outputs": [], |
151 | 153 | "source": [ |
152 | 154 | "numerical_columns = [\n", |
153 | | - " \"age\", \"education-num\", \"capital-gain\", \"capital-loss\",\n", |
154 | | - " \"hours-per-week\"]\n", |
| 155 | + " \"age\",\n", |
| 156 | + " \"education-num\",\n", |
| 157 | + " \"capital-gain\",\n", |
| 158 | + " \"capital-loss\",\n", |
| 159 | + " \"hours-per-week\",\n", |
| 160 | + "]\n", |
155 | 161 | "categorical_columns = [\n", |
156 | | - " \"workclass\", \"education\", \"marital-status\", \"occupation\",\n", |
157 | | - " \"relationship\", \"race\", \"sex\", \"native-country\"]\n", |
| 162 | + " \"workclass\",\n", |
| 163 | + " \"education\",\n", |
| 164 | + " \"marital-status\",\n", |
| 165 | + " \"occupation\",\n", |
| 166 | + " \"relationship\",\n", |
| 167 | + " \"race\",\n", |
| 168 | + " \"sex\",\n", |
| 169 | + " \"native-country\",\n", |
| 170 | + "]\n", |
158 | 171 | "all_columns = numerical_columns + categorical_columns + [target_column]\n", |
159 | 172 | "\n", |
160 | 173 | "adult_census = adult_census[all_columns]" |
|
174 | 187 | "metadata": {}, |
175 | 188 | "outputs": [], |
176 | 189 | "source": [ |
177 | | - "print(f\"The dataset contains {adult_census.shape[0]} samples and \"\n", |
178 | | - " f\"{adult_census.shape[1]} columns\")" |
| 190 | + "print(\n", |
| 191 | + " f\"The dataset contains {adult_census.shape[0]} samples and \"\n", |
| 192 | + " f\"{adult_census.shape[1]} columns\"\n", |
| 193 | + ")" |
179 | 194 | ] |
180 | 195 | }, |
181 | 196 | { |
|
275 | 290 | "cell_type": "markdown", |
276 | 291 | "metadata": {}, |
277 | 292 | "source": [ |
278 | | - "Note that there is an important imbalance on the data collection concerning\n", |
279 | | - "the number of male/female samples. Be aware that any kind of data imbalance\n", |
280 | | - "will impact the generalizability of a model trained on it. Moreover, it can\n", |
281 | | - "lead to\n", |
| 293 | + "Note that the data collection process resulted in an important imbalance\n", |
| 294 | + "between the number of male/female samples.\n", |
| 295 | + "\n", |
| 296 | + "Be aware that training a model with such data imbalance can cause\n", |
| 297 | + "disproportionate prediction errors for the under-represented groups. This is a\n", |
| 298 | + "typical cause of\n", |
282 | 299 | "[fairness](https://docs.microsoft.com/en-us/azure/machine-learning/concept-fairness-ml#what-is-machine-learning-fairness)\n", |
283 | | - "problems if used naively when deploying a real life setting.\n", |
| 300 | + "problems if used naively when deploying a machine learning based system in a\n", |
| 301 | + "real life setting.\n", |
284 | 302 | "\n", |
285 | 303 | "We recommend our readers to refer to [fairlearn.org](https://fairlearn.org)\n", |
286 | 304 | "for resources on how to quantify and potentially mitigate fairness\n", |
287 | 305 | "issues related to the deployment of automated decision making\n", |
288 | | - "systems that relying on machine learning components." |
| 306 | + "systems that rely on machine learning components.\n", |
| 307 | + "\n", |
| 308 | + "Studying why the data collection process of this dataset led to such an\n", |
| 309 | + "unexpected gender imbalance is beyond the scope of this MOOC but we should\n", |
| 310 | + "keep in mind that this dataset is not representative of the US population\n", |
| 311 | + "before drawing any conclusions based on its statistics or the predictions of\n", |
| 312 | + "models trained on it." |
289 | 313 | ] |
290 | 314 | }, |
291 | 315 | { |
|
323 | 347 | "cell_type": "markdown", |
324 | 348 | "metadata": {}, |
325 | 349 | "source": [ |
326 | | - "This shows that `\"education\"` and `\"education-num\"` give you the same\n", |
327 | | - "information. For example, `\"education-num\"=2` is equivalent to\n", |
| 350 | + "For every entry in `\"education\"`, there is exactly one corresponding\n", |
| 351 | + "value in `\"education-num\"`. This shows that `\"education\"` and `\"education-num\"`\n", |
| 352 | + "give you the same information. For example, `\"education-num\"=2` is equivalent to\n", |
328 | 353 | "`\"education\"=\"1st-4th\"`. In practice that means we can remove\n", |
329 | 354 | "`\"education-num\"` without losing information. Note that having redundant (or\n", |
330 | 355 | "highly correlated) columns can be a problem for machine learning algorithms." |
|
463 | 488 | "will choose the \"best\" splits based on data without human intervention or\n", |
464 | 489 | "inspection. Decision trees will be covered in more detail in a future module.\n", |
465 | 490 | "\n", |
466 | | - "Note that machine learning is really interesting when creating rules by hand\n", |
467 | | - "is not straightforward, for example because we are in high dimension (many\n", |
468 | | - "features) or because there are no simple and obvious rules that separate the\n", |
469 | | - "two classes as in the top-right region of the previous plot.\n", |
| 491 | + "Note that machine learning is often used when creating rules by hand\n", |
| 492 | + "is not straightforward, for example because the data is high-dimensional (many\n", |
| 493 | + "features in a table) or because there are no simple and obvious rules that\n", |
| 494 | + "separate the two classes, as in the top-right region of the previous plot.\n", |
470 | 495 | "\n", |
471 | 496 | "To sum up, the important thing to remember is that in a machine-learning\n", |
472 | | - "setting, a model automatically creates the \"rules\" from the data in order to\n", |
473 | | - "make predictions on new unseen data." |
| 497 | + "setting, a model automatically creates the \"rules\" from the existing data in\n", |
| 498 | + "order to make predictions on new unseen data." |
474 | 499 | ] |
475 | 500 | }, |
476 | 501 | { |
477 | 502 | "cell_type": "markdown", |
478 | 503 | "metadata": {}, |
479 | 504 | "source": [ |
| 505 | + "## Notebook Recap\n", |
480 | 506 | "\n", |
481 | 507 | "In this notebook we:\n", |
482 | 508 | "\n", |
|
487 | 513 | " you to decide whether using machine learning is appropriate for your data\n", |
488 | 514 | " and to highlight potential peculiarities in your data.\n", |
489 | 515 | "\n", |
490 | | - "Ideas which will be discussed more in detail later:\n", |
| 516 | + "We made important observations (which will be discussed later in more detail):\n", |
491 | 517 | "\n", |
492 | 518 | "* if your target variable is imbalanced (e.g., you have more samples from one\n", |
493 | 519 | " target category than another), you may need special techniques for training\n", |
|
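The two observations reworded in this diff (the class imbalance and the redundancy between `"education"` and `"education-num"`) could be checked with a few lines of `pandas`. The following is only a minimal sketch: the `toy` dataframe and its values are made up for illustration and stand in for the real `adult_census` data, and the variable names are hypothetical.

```python
import pandas as pd

# Toy data standing in for the real dataset; the rows are invented for
# illustration and are NOT taken from the adult census file.
toy = pd.DataFrame(
    {
        "education": ["1st-4th", "Bachelors", "1st-4th", "HS-grad", "Bachelors"],
        "education-num": [2, 13, 2, 9, 13],
        "class": [" <=50K", " >50K", " <=50K", " <=50K", " <=50K"],
    }
)

# Class imbalance: count how many samples fall into each target value.
class_counts = toy["class"].value_counts()

# Redundancy: for each "education" entry, count the distinct
# "education-num" values that occur with it. If every count is 1, each
# "education" entry maps to a single "education-num" value.
nunique_per_education = toy.groupby("education")["education-num"].nunique()
single_valued = bool((nunique_per_education == 1).all())

print(class_counts)
print(f"Each education entry has a single education-num: {single_valued}")
```

On the real data, the same two calls (`value_counts` on the target column and `groupby(...).nunique()` on the two education columns) would presumably apply unchanged, assuming the column names match those listed in the notebook.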