# ---
# jupyter:
# kernelspec:
# display_name: Python 3
# name: python3
# ---
# %% [markdown]
# # 📃 Solution for Exercise M6.04
#
# The aim of this exercise is to become familiar with histogram gradient
# boosting in scikit-learn. We will also use this model within a
# cross-validation framework in order to inspect the internal parameters found
# via grid search.
#
# We will use the California housing dataset.
# %%
from sklearn.datasets import fetch_california_housing
data, target = fetch_california_housing(return_X_y=True, as_frame=True)
target *= 100 # rescale the target in k$
# %% [markdown]
# First, create a histogram gradient boosting regressor. You can set the
# number of trees to be large, and configure the model to use early stopping.
# %%
# solution
from sklearn.ensemble import HistGradientBoostingRegressor
hist_gbdt = HistGradientBoostingRegressor(
    max_iter=1000, early_stopping=True, random_state=0
)
# %% [markdown]
# We will use a grid search to find optimal parameters for this model. In
# this grid search, you should search over the following parameters:
#
# * `max_depth: [3, 8]`;
# * `max_leaf_nodes: [15, 31]`;
# * `learning_rate: [0.1, 1]`.
#
# Feel free to explore the space with additional values. Create the grid-search
# providing the previous gradient boosting instance as the model.
# %%
# solution
from sklearn.model_selection import GridSearchCV
params = {
    "max_depth": [3, 8],
    "max_leaf_nodes": [15, 31],
    "learning_rate": [0.1, 1],
}
search = GridSearchCV(hist_gbdt, params)
# %% [markdown]
# Finally, we will run our experiment through cross-validation. To do so,
# define a 5-fold cross-validation and be sure to shuffle the data. Then, use
# the function `sklearn.model_selection.cross_validate` to run the
# cross-validation. You should also set `return_estimator=True`, so that we
# can investigate the inner model trained via cross-validation.
# %%
# solution
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold
cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(
    search,
    data,
    target,
    cv=cv,
    return_estimator=True,
    # n_jobs=2  # Uncomment this if you run locally
)
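# %% [markdown] tags=["solution"]
# Passing a `GridSearchCV` as the estimator to `cross_validate` is what makes
# this a *nested* cross-validation. The same pattern can be sketched on a
# cheap synthetic problem; the `Ridge` model, dataset, and grid below are
# illustrative assumptions, not part of the exercise:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

# Cheap synthetic regression problem (illustrative only)
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# Inner loop: grid search over the regularization strength of a Ridge model
inner_search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]})

# Outer loop: 5-fold CV that evaluates the whole search procedure
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
nested = cross_validate(inner_search, X, y, cv=outer_cv, return_estimator=True)
print(nested["test_score"].mean())
```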
# %% [markdown]
# Now that we have the cross-validation results, print the mean and standard
# deviation of the test score.
# %%
# solution
print(
    "R2 score with cross-validation:\n"
    f"{results['test_score'].mean():.3f} ± "
    f"{results['test_score'].std():.3f}"
)
# %% [markdown]
# Then inspect the `estimator` entry of the results and check the best
# parameter values. Also check the number of trees used by each model.
# %%
# solution
for estimator in results["estimator"]:
    print(estimator.best_params_)
    print(f"# trees: {estimator.best_estimator_.n_iter_}")
# %% [markdown] tags=["solution"]
# We observe that the best parameters vary from one outer fold to the next.
# Intuitively, this happens because the inner CV scores of several parameter
# combinations are very close to one another.
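# %% [markdown] tags=["solution"]
# One way to quantify this variation is to count how often each combination is
# selected across the outer folds. A minimal, self-contained sketch on
# hypothetical `best_params_` dicts (the values below are made up):

```python
from collections import Counter

# Hypothetical best parameters found on each of 5 outer folds (made-up values)
best_params_per_fold = [
    {"learning_rate": 0.1, "max_depth": 3, "max_leaf_nodes": 31},
    {"learning_rate": 0.1, "max_depth": 8, "max_leaf_nodes": 15},
    {"learning_rate": 0.1, "max_depth": 3, "max_leaf_nodes": 31},
    {"learning_rate": 0.1, "max_depth": 8, "max_leaf_nodes": 31},
    {"learning_rate": 0.1, "max_depth": 3, "max_leaf_nodes": 31},
]

# dicts are not hashable, so count sorted (key, value) tuples instead
counts = Counter(tuple(sorted(p.items())) for p in best_params_per_fold)
for combo, n_folds in counts.most_common():
    print(dict(combo), "selected in", n_folds, "fold(s)")
```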
# %% [markdown]
# Inspect the results of the inner CV for each estimator of the outer CV.
# Aggregate the mean test score for each parameter combination and make a box
# plot of these scores.
# %%
# solution
import pandas as pd
index_columns = [f"param_{name}" for name in params.keys()]
columns = index_columns + ["mean_test_score"]
inner_cv_results = []
for cv_idx, estimator in enumerate(results["estimator"]):
    search_cv_results = pd.DataFrame(estimator.cv_results_)
    search_cv_results = search_cv_results[columns].set_index(index_columns)
    search_cv_results = search_cv_results.rename(
        columns={"mean_test_score": f"CV {cv_idx}"}
    )
    inner_cv_results.append(search_cv_results)
inner_cv_results = pd.concat(inner_cv_results, axis=1).T
# %% tags=["solution"]
import matplotlib.pyplot as plt
color = {"whiskers": "black", "medians": "black", "caps": "black"}
inner_cv_results.plot.box(vert=False, color=color)
plt.xlabel("R2 score")
plt.ylabel("Parameters")
_ = plt.title(
    "Inner CV results with parameters\n"
    "(max_depth, max_leaf_nodes, learning_rate)"
)
# %% [markdown] tags=["solution"]
# We see that the four top-ranked parameter combinations reach very close
# scores, so we could select any of them. This is consistent with the varying
# best parameters that we observed across the folds of the outer CV.
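# %% [markdown] tags=["solution"]
# The ranking behind the box plot can be computed directly. Since
# `inner_cv_results` depends on the fitted search objects, here is a
# self-contained sketch of the same aggregation on hypothetical scores (the
# column labels and values below are made up):

```python
import pandas as pd

# Hypothetical inner-CV mean test scores: one row per outer fold, one column
# per (max_depth, max_leaf_nodes, learning_rate) combination (made-up values)
scores = pd.DataFrame(
    {
        "(3, 15, 0.1)": [0.83, 0.84, 0.83],
        "(3, 31, 0.1)": [0.84, 0.85, 0.84],
        "(8, 31, 1.0)": [0.61, 0.60, 0.62],
    }
)

# Rank combinations by their mean score across the outer folds
ranking = scores.mean().sort_values(ascending=False)
print(ranking)
```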