
Commit 73e12fb

Author: ArturoAmorQ
Message: Update notebooks
Parent: edbfca2

File tree

3 files changed: +54 −8 lines changed


notebooks/03_categorical_pipeline_visualization.ipynb

Lines changed: 40 additions & 4 deletions
@@ -166,14 +166,15 @@
     "        (\"preprocessor\", preprocessor),\n",
     "        (\"classifier\", LogisticRegression()),\n",
     "    ]\n",
-    ")"
+    ")\n",
+    "model"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Let's visualize it!"
+    "Let's fit it!"
    ]
   },
   {
@@ -182,14 +183,49 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "model"
+    "model.fit(data, target)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Notice that the diagram changes color once the estimator is fit.\n",
+    "\n",
+    "So far we used `Pipeline` and `ColumnTransformer`, which allow us to\n",
+    "customize the names of the steps in the pipeline. An alternative is to use\n",
+    "`make_column_transformer` and `make_pipeline`, which do not require, and do\n",
+    "not permit, naming the estimators. Instead, their names are set automatically\n",
+    "to the lowercase of their class names."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn.compose import make_column_transformer\n",
+    "from sklearn.pipeline import make_pipeline\n",
+    "\n",
+    "numeric_transformer = make_pipeline(\n",
+    "    SimpleImputer(strategy=\"median\"), StandardScaler()\n",
+    ")\n",
+    "categorical_transformer = OneHotEncoder(handle_unknown=\"ignore\")\n",
+    "\n",
+    "preprocessor = make_column_transformer(\n",
+    "    (numeric_transformer, numeric_features),\n",
+    "    (categorical_transformer, categorical_features),\n",
+    ")\n",
+    "model = make_pipeline(preprocessor, LogisticRegression())\n",
+    "model.fit(data, target)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Finally we score the model"
+    "## Finally we can score the model using cross-validation:"
    ]
   },
   {
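The hunk above introduces `make_pipeline`/`make_column_transformer` and notes that step names are then derived automatically from the lowercased class names. A minimal, self-contained sketch of that behavior; the tiny dataframe, `target`, and column lists below are hypothetical stand-ins for the notebook's adult-census data:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in data for the notebook's dataset.
data = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "hours-per-week": [40, 50, 38, 45],
    "workclass": ["Private", "Private", "State-gov", "Private"],
})
target = [0, 1, 1, 0]
numeric_features = ["age", "hours-per-week"]
categorical_features = ["workclass"]

numeric_transformer = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler()
)
preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features),
    (OneHotEncoder(handle_unknown="ignore"), categorical_features),
)
model = make_pipeline(preprocessor, LogisticRegression())
model.fit(data, target)

# Step names are set automatically from the lowercased class names:
print(list(model.named_steps))  # ['columntransformer', 'logisticregression']
```

Unlike `Pipeline(steps=[("preprocessor", ...), ...])`, there is no way to pick those names; they are always the lowercased class names.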

notebooks/ensemble_bagging.ipynb

Lines changed: 1 addition & 1 deletion
@@ -113,7 +113,7 @@
    "lines_to_next_cell": 2
   },
   "source": [
-    "Let's see how we can use bootstraping to learn several trees.\n",
+    "Let's see how we can use bootstrapping to learn several trees.\n",
    "\n",
    "## Bootstrap resampling\n",
    "\n",
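The typo fix above touches the notebook's introduction to bootstrap resampling. As a quick illustration of the resampling step itself (a sketch, not code from the commit):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(10)  # stand-in for a training set of 10 samples

# One bootstrap resample: draw n indices *with replacement* from n points.
bootstrap_indices = rng.integers(0, len(x), size=len(x))
bootstrap_sample = x[bootstrap_indices]

# Each resample typically repeats some points and omits others
# (on average ~63.2% of the distinct points appear in a given resample).
n_distinct = len(np.unique(bootstrap_sample))
```

Fitting one tree per such resample and averaging their predictions is the bagging idea the notebook goes on to develop.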

notebooks/parameter_tuning_grid_search.ipynb

Lines changed: 13 additions & 3 deletions
@@ -371,8 +371,9 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "With only 2 parameters, we might want to visualize the grid-search as a\n",
-    "heatmap. We need to transform our `cv_results` into a dataframe where:\n",
+    "Given that we are tuning only 2 parameters, we can visualize the results as a\n",
+    "heatmap. To do so, we first need to reshape the `cv_results` into a dataframe\n",
+    "where:\n",
    "\n",
    "- the rows correspond to the learning-rate values;\n",
    "- the columns correspond to the maximum number of leaf nodes;\n",
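The rewritten markdown describes reshaping `cv_results` so that rows are learning-rate values and columns are maximum-leaf-node values. One way to sketch that with `pandas.pivot_table`; the `cv_results` values and the `param_classifier__...` column names below are hypothetical stand-ins for the notebook's actual grid-search output:

```python
import pandas as pd

# Hypothetical excerpt of a grid-search's cv_results_ as a dataframe.
cv_results = pd.DataFrame({
    "param_classifier__learning_rate": [0.01, 0.01, 0.1, 0.1],
    "param_classifier__max_leaf_nodes": [3, 31, 3, 31],
    "mean_test_score": [0.80, 0.84, 0.85, 0.87],
})

# Reshape: rows = learning-rate values, columns = max-leaf-node values.
pivoted_cv_results = cv_results.pivot_table(
    values="mean_test_score",
    index="param_classifier__learning_rate",
    columns="param_classifier__max_leaf_nodes",
)
print(pivoted_cv_results)
```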
@@ -398,7 +399,8 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "We can use a heatmap representation to show the above dataframe visually."
+    "Now that we have the data in the right format, we can create the heatmap as\n",
+    "follows:"
   ]
  },
  {
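The new wording promises a heatmap built from the pivoted dataframe. A sketch using `seaborn.heatmap`; the pivoted values below are made up, standing in for the notebook's real grid-search scores:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs in a script
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical pivoted results: rows = learning rate, columns = max leaf nodes.
pivoted_cv_results = pd.DataFrame(
    [[0.80, 0.84], [0.85, 0.87]],
    index=pd.Index([0.01, 0.1], name="learning_rate"),
    columns=pd.Index([3, 31], name="max_leaf_nodes"),
)

ax = sns.heatmap(pivoted_cv_results, annot=True, cmap="YlGnBu")
ax.invert_yaxis()  # put the smallest learning rate at the bottom
plt.tight_layout()
```

`annot=True` prints each mean score inside its cell, which partly compensates for color alone being hard to compare precisely.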
@@ -424,6 +426,14 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
+    "The heatmap above shows the mean test accuracy (i.e., the average over\n",
+    "cross-validation splits) for each combination of hyperparameters, where\n",
+    "darker colors indicate better performance. However, the colors only let us\n",
+    "compare the mean test scores visually; they carry no information on the\n",
+    "standard deviation over splits, making it difficult to say whether the score\n",
+    "differences between combinations correspond to a significantly better model\n",
+    "or not.\n",
+    "\n",
    "The above table highlights the following things:\n",
    "\n",
    "* for too high values of `learning_rate`, the generalization performance of\n",
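The added paragraph argues that mean test scores alone cannot tell whether one hyperparameter combination is significantly better than another, since the heatmap hides the standard deviation over splits. A crude sketch of that comparison using both `mean_test_score` and `std_test_score`, which scikit-learn reports per candidate in `cv_results_` (all values below are hypothetical):

```python
import pandas as pd

# Hypothetical excerpt of a grid-search's cv_results_.
cv_results = pd.DataFrame({
    "param_classifier__learning_rate": [0.01, 0.1, 1.0],
    "mean_test_score": [0.84, 0.87, 0.86],
    "std_test_score": [0.02, 0.02, 0.03],
})

best = cv_results.loc[cv_results["mean_test_score"].idxmax()]

# A rough heuristic: a candidate is hard to distinguish from the best one
# when the gap in means is within the two standard deviations combined.
indistinguishable = (
    best["mean_test_score"] - cv_results["mean_test_score"]
    <= cv_results["std_test_score"] + best["std_test_score"]
)
print(cv_results[indistinguishable])
```

With the made-up numbers above, every candidate overlaps with the best one, which is exactly the ambiguity the paragraph warns about.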

0 commit comments
