diff --git a/notebooks/datasets_adult_census.ipynb b/notebooks/datasets_adult_census.ipynb index 139287829..ae274ebf5 100644 --- a/notebooks/datasets_adult_census.ipynb +++ b/notebooks/datasets_adult_census.ipynb @@ -105,7 +105,7 @@ " dimensions=plot_list,\n", " )\n", ")\n", - "fig.show()" + "fig.show(renderer=\"notebook\")" ] }, { diff --git a/notebooks/linear_models_feature_engineering_classification.ipynb b/notebooks/linear_models_feature_engineering_classification.ipynb index 6781ef734..5043c40ad 100644 --- a/notebooks/linear_models_feature_engineering_classification.ipynb +++ b/notebooks/linear_models_feature_engineering_classification.ipynb @@ -641,7 +641,7 @@ "- Transformers such as `KBinsDiscretizer` and `SplineTransformer` can be used\n", " to engineer non-linear features independently for each original feature.\n", "- As a result, these transformers cannot capture interactions between the\n", - " orignal features (and then would fail on the XOR classification task).\n", + " original features (and then would fail on the XOR classification task).\n", "- Despite this limitation they already augment the expressivity of the\n", " pipeline, which can be sufficient for some datasets.\n", "- They also favor axis-aligned decision boundaries, in particular in the low\n", diff --git a/notebooks/parameter_tuning_grid_search.ipynb b/notebooks/parameter_tuning_grid_search.ipynb index 7f0e0f61f..a7fc56994 100644 --- a/notebooks/parameter_tuning_grid_search.ipynb +++ b/notebooks/parameter_tuning_grid_search.ipynb @@ -198,29 +198,33 @@ "source": [ "## Tuning using a grid-search\n", "\n", - "In the previous exercise we used one `for` loop for each hyperparameter to\n", - "find the best combination over a fixed grid of values. 
`GridSearchCV` is a\n",
- "scikit-learn class that implements a very similar logic with less repetitive\n",
- "code.\n",
+ "In the previous exercise (M3.01) we used two nested `for` loops (one for each\n",
+ "hyperparameter) to test different combinations over a fixed grid of\n",
+ "hyperparameter values. In each iteration of the loop, we used\n",
+ "`cross_val_score` to compute the mean score (as averaged across\n",
+ "cross-validation splits), and compared those mean scores to select the best\n",
+ "combination. `GridSearchCV` is a scikit-learn class that implements a very\n",
+ "similar logic with less repetitive code. The suffix `CV` refers to the\n",
+ "cross-validation it runs internally (instead of the `cross_val_score` we\n",
+ "\"hard\" coded).\n",
+ "\n",
+ "The `GridSearchCV` estimator takes a `param_grid` parameter which defines all\n",
+ "hyperparameters and their associated values. The grid-search is in charge of\n",
+ "creating all possible combinations and testing them.\n",
+ "\n",
+ "The number of combinations is equal to the product of the number of values to\n",
+ "explore for each parameter. Thus, adding new parameters with their associated\n",
+ "values to be explored rapidly becomes computationally expensive. Because of\n",
+ "that, here we only explore the combination of the learning-rate and the\n",
+ "maximum number of nodes for a total of 4 x 3 = 12 combinations.\n",
"\n",
- "Let's see how to use the `GridSearchCV` estimator for doing such search. Since\n",
- "the grid-search is costly, we only explore the combination learning-rate and\n",
- "the maximum number of nodes."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ "%%time\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "param_grid = {\n", - " \"classifier__learning_rate\": (0.01, 0.1, 1, 10),\n", - " \"classifier__max_leaf_nodes\": (3, 10, 30),\n", - "}\n", + " \"classifier__learning_rate\": (0.01, 0.1, 1, 10), # 4 possible values\n", + " \"classifier__max_leaf_nodes\": (3, 10, 30), # 3 possible values\n", + "} # 12 unique combinations\n", "model_grid_search = GridSearchCV(model, param_grid=param_grid, n_jobs=2, cv=2)\n", "model_grid_search.fit(data_train, target_train)" ] @@ -229,7 +233,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Finally, we check the accuracy of our model using the test set." + "You can access the best combination of hyperparameters found by the grid\n", + "search using the `best_params_` attribute." ] }, { @@ -238,46 +243,19 @@ "metadata": {}, "outputs": [], "source": [ - "accuracy = model_grid_search.score(data_test, target_test)\n", - "print(\n", - " f\"The test accuracy score of the grid-searched pipeline is: {accuracy:.2f}\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
<div class=\"admonition warning alert alert-danger\">\n",
- "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Warning</p>\n",
- "<p>Be aware that the evaluation should normally be performed through\n",
- "cross-validation by providing <tt class=\"docutils literal\">model_grid_search</tt> as a model to the\n",
- "<tt class=\"docutils literal\">cross_validate</tt> function.</p>\n",
- "<p class=\"last\">Here, we used a single train-test split to evaluate <tt class=\"docutils literal\">model_grid_search</tt>. In\n",
- "a future notebook will go into more detail about nested cross-validation, when\n",
- "you use cross-validation both for hyperparameter tuning and model evaluation.</p>\n",
- "</div>" + "print(f\"The best set of parameters is: {model_grid_search.best_params_}\")"
] }, { @@ -303,16 +280,43 @@ "metadata": {}, "outputs": [], "source": [ - "print(f\"The best set of parameters is: {model_grid_search.best_params_}\")" + "accuracy = model_grid_search.score(data_test, target_test)\n", + "print(\n", + " f\"The test accuracy score of the grid-search pipeline is: {accuracy:.2f}\"\n", + ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The accuracy and the best parameters of the grid-searched pipeline are similar\n", + "The accuracy and the best parameters of the grid-search pipeline are similar\n", "to the ones we found in the previous exercise, where we searched the best\n", - "parameters \"by hand\" through a double for loop.\n", + "parameters \"by hand\" through a double `for` loop.\n", + "\n", + "## The need for a validation set\n", + "\n", + "In the previous section, the selection of the best hyperparameters was done\n", + "using the train set, coming from the initial train-test split. Then, we\n", + "evaluated the generalization performance of our tuned model on the left out\n", + "test set. This can be shown schematically as follows:\n", + "\n", + "![Cross-validation tuning\n", + "diagram](../figures/cross_validation_train_test_diagram.png)\n", + "\n", + "
<div class=\"admonition note alert alert-info\">\n",
+ "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
+ "<p>This figure shows the particular case of K-fold cross-validation strategy\n",
+ "using <tt class=\"docutils literal\">n_splits=5</tt> to further split the train set coming from a train-test\n",
+ "split. For each cross-validation split, the procedure trains a model on all\n",
+ "the red samples, evaluates the score of a given set of hyperparameters on the\n",
+ "green samples. The best combination of hyperparameters <tt class=\"docutils literal\">best_params</tt> is selected\n",
+ "based on those intermediate scores.</p>\n",
+ "<p>Then a final model is refitted using <tt class=\"docutils literal\">best_params</tt> on the concatenation of the\n",
+ "red and green samples and evaluated on the blue samples.</p>\n",
+ "<p class=\"last\">The green samples are sometimes referred to as the validation set to\n",
+ "differentiate them from the final test set in blue.</p>\n",
+ "</div>
\n", "\n", "In addition, we can inspect all results which are stored in the attribute\n", "`cv_results_` of the grid-search. We filter some specific columns from these\n", diff --git a/notebooks/parameter_tuning_parallel_plot.ipynb b/notebooks/parameter_tuning_parallel_plot.ipynb index 32f411b35..806bbd9f7 100644 --- a/notebooks/parameter_tuning_parallel_plot.ipynb +++ b/notebooks/parameter_tuning_parallel_plot.ipynb @@ -145,7 +145,7 @@ " color=\"mean_test_score\",\n", " color_continuous_scale=px.colors.sequential.Viridis,\n", ")\n", - "fig.show()" + "fig.show(renderer=\"notebook\")" ] }, { diff --git a/notebooks/parameter_tuning_sol_03.ipynb b/notebooks/parameter_tuning_sol_03.ipynb index d52b48176..37c8f15da 100644 --- a/notebooks/parameter_tuning_sol_03.ipynb +++ b/notebooks/parameter_tuning_sol_03.ipynb @@ -266,7 +266,7 @@ " dimensions=[\"n_neighbors\", \"centering\", \"scaling\", \"mean test score\"],\n", " color_continuous_scale=px.colors.diverging.Tealrose,\n", ")\n", - "fig.show()" + "fig.show(renderer=\"notebook\")" ] }, { diff --git a/python_scripts/datasets_adult_census.py b/python_scripts/datasets_adult_census.py index f86bf40ef..d3d36d88f 100644 --- a/python_scripts/datasets_adult_census.py +++ b/python_scripts/datasets_adult_census.py @@ -91,7 +91,7 @@ def generate_dict(col): dimensions=plot_list, ) ) -fig.show() +fig.show(renderer="notebook") # %% [markdown] # The `Parcoords` plot is quite similar to the parallel coordinates plot that we diff --git a/python_scripts/parameter_tuning_parallel_plot.py b/python_scripts/parameter_tuning_parallel_plot.py index 340e75dd0..1be534206 100644 --- a/python_scripts/parameter_tuning_parallel_plot.py +++ b/python_scripts/parameter_tuning_parallel_plot.py @@ -102,7 +102,7 @@ def shorten_param(param_name): color="mean_test_score", color_continuous_scale=px.colors.sequential.Viridis, ) -fig.show() +fig.show(renderer="notebook") # %% [markdown] # ```{note} diff --git a/python_scripts/parameter_tuning_sol_03.py 
b/python_scripts/parameter_tuning_sol_03.py index 1cdb01191..3f50c0adf 100644 --- a/python_scripts/parameter_tuning_sol_03.py +++ b/python_scripts/parameter_tuning_sol_03.py @@ -160,7 +160,7 @@ dimensions=["n_neighbors", "centering", "scaling", "mean test score"], color_continuous_scale=px.colors.diverging.Tealrose, ) -fig.show() +fig.show(renderer="notebook") # %% [markdown] tags=["solution"] # We recall that it is possible to select a range of results by clicking and