|
17 | 17 | "<div class=\"admonition caution alert alert-warning\">\n", |
18 | 18 | "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n", |
19 | 19 | "<p class=\"last\">For the sake of clarity, no cross-validation will be used to estimate the\n", |
20 | | - "testing error. We are only showing the effect of the parameters\n", |
21 | | - "on the validation set of what should be the inner cross-validation.</p>\n", |
| 20 | + "variability of the testing error. We are only showing the effect of the\n", |
| 21 | + "parameters on the validation set of what should be the inner loop of a nested\n", |
| 22 | + "cross-validation.</p>\n", |
22 | 23 | "</div>\n", |
23 | 24 | "\n", |
24 | | - "## Random forest\n", |
25 | | - "\n", |
26 | | - "The main parameter to tune for random forest is the `n_estimators` parameter.\n", |
27 | | - "In general, the more trees in the forest, the better the generalization\n", |
28 | | - "performance will be. However, it will slow down the fitting and prediction\n", |
29 | | - "time. The goal is to balance computing time and generalization performance when\n", |
30 | | - "setting the number of estimators when putting such learner in production.\n", |
31 | | - "\n", |
32 | | - "Then, we could also tune a parameter that controls the depth of each tree in\n", |
33 | | - "the forest. Two parameters are important for this: `max_depth` and\n", |
34 | | - "`max_leaf_nodes`. They differ in the way they control the tree structure.\n", |
35 | | - "Indeed, `max_depth` will enforce to have a more symmetric tree, while\n", |
36 | | - "`max_leaf_nodes` does not impose such constraint.\n", |
37 | | - "\n", |
38 | | - "Be aware that with random forest, trees are generally deep since we are\n", |
39 | | - "seeking to overfit each tree on each bootstrap sample because this will be\n", |
40 | | - "mitigated by combining them altogether. Assembling underfitted trees (i.e.\n", |
41 | | - "shallow trees) might also lead to an underfitted forest." |
| 25 | + "We will start by loading the california housing dataset." |
42 | 26 | ] |
43 | 27 | }, |
44 | 28 | { |
|
56 | 40 | " data, target, random_state=0)" |
57 | 41 | ] |
58 | 42 | }, |
| 43 | + { |
| 44 | + "cell_type": "markdown", |
| 45 | + "metadata": {}, |
| 46 | + "source": [ |
| 47 | + "## Random forest\n", |
| 48 | + "\n", |
| 49 | + "The main parameter to select in random forest is the `n_estimators` parameter.\n", |
| 50 | + "In general, the more trees in the forest, the better the generalization\n", |
| 51 | + "performance will be. However, it will slow down the fitting and prediction\n", |
| 52 | + "time. The goal is to balance computing time and generalization performance\n", |
| 53 | + "when setting the number of estimators. Here, we fix `n_estimators=100`, which\n", |
| 54 | + "is already the default value.\n", |
| 55 | + "\n", |
| 56 | + "<div class=\"admonition caution alert alert-warning\">\n", |
| 57 | + "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n", |
| 58 | + "<p class=\"last\">Tuning the <tt class=\"docutils literal\">n_estimators</tt> for random forests generally result in a waste of\n", |
| 59 | + "computer power. We just need to ensure that it is large enough so that doubling\n", |
| 60 | + "its value does not lead to a significant improvement of the validation error.</p>\n", |
| 61 | + "</div>\n", |
| 62 | + "\n", |
| 63 | + "Instead, we can tune the hyperparameter `max_features`, which controls the\n", |
| 64 | + "size of the random subset of features to consider when looking for the best\n", |
| 65 | + "split when growing the trees: smaller values for `max_features` will lead to\n", |
| 66 | + "more random trees with hopefully more uncorrelated prediction errors. However\n", |
| 67 | + "if `max_features` is too small, predictions can be too random, even after\n", |
| 68 | + "averaging with the trees in the ensemble.\n", |
| 69 | + "\n", |
| 70 | + "If `max_features` is set to `None`, then this is equivalent to setting\n", |
| 71 | + "`max_features=n_features` which means that the only source of randomness in\n", |
| 72 | + "the random forest is the bagging procedure." |
| 73 | + ] |
| 74 | + }, |
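| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "As a minimal sketch of the doubling check mentioned in the caution above, we\n", |
| | + "can compare the cross-validated score of a forest using the default\n", |
| | + "`n_estimators=100` with a forest using twice as many trees. This cell is an\n", |
| | + "illustrative aside: the `data_train` and `target_train` variable names are\n", |
| | + "assumed from the train-test split above." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# Doubling check: if doubling `n_estimators` does not significantly improve\n", |
| | + "# the cross-validated score, the current value is already large enough.\n", |
| | + "from sklearn.ensemble import RandomForestRegressor\n", |
| | + "from sklearn.model_selection import cross_val_score\n", |
| | + "\n", |
| | + "for n_estimators in [100, 200]:\n", |
| | + "    forest = RandomForestRegressor(n_estimators=n_estimators, n_jobs=2,\n", |
| | + "                                   random_state=0)\n", |
| | + "    scores = cross_val_score(forest, data_train, target_train, cv=3)\n", |
| | + "    print(f\"n_estimators={n_estimators}: \"\n", |
| | + "          f\"{scores.mean():.3f} +/- {scores.std():.3f}\")" |
| | + ] |
| | + }, |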
| 75 | + { |
| 76 | + "cell_type": "code", |
| 77 | + "execution_count": null, |
| 78 | + "metadata": {}, |
| 79 | + "outputs": [], |
| 80 | + "source": [ |
| 81 | + "print(f\"In this case, n_features={len(data.columns)}\")" |
| 82 | + ] |
| 83 | + }, |
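| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "To illustrate the trade-off discussed above, here is a minimal sketch that\n", |
| | + "compares the cross-validated scores for a few values of `max_features`\n", |
| | + "(again assuming the `data_train` and `target_train` variables from the split\n", |
| | + "above)." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# Smaller `max_features` gives more random, less correlated trees; if it is\n", |
| | + "# too small, the averaged predictions degrade.\n", |
| | + "from sklearn.ensemble import RandomForestRegressor\n", |
| | + "from sklearn.model_selection import cross_val_score\n", |
| | + "\n", |
| | + "for max_features in [1, 3, None]:\n", |
| | + "    forest = RandomForestRegressor(max_features=max_features, n_jobs=2,\n", |
| | + "                                   random_state=0)\n", |
| | + "    scores = cross_val_score(forest, data_train, target_train, cv=3)\n", |
| | + "    print(f\"max_features={max_features}: \"\n", |
| | + "          f\"{scores.mean():.3f} +/- {scores.std():.3f}\")" |
| | + ] |
| | + }, |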
| 84 | + { |
| 85 | + "cell_type": "markdown", |
| 86 | + "metadata": {}, |
| 87 | + "source": [ |
| 88 | + "We can also tune the different parameters that control the depth of each tree\n", |
| 89 | + "in the forest. Two parameters are important for this: `max_depth` and\n", |
| 90 | + "`max_leaf_nodes`. They differ in the way they control the tree structure.\n", |
| 91 | + "Indeed, `max_depth` will enforce to have a more symmetric tree, while\n", |
| 92 | + "`max_leaf_nodes` does not impose such constraint. If `max_leaf_nodes=None`\n", |
| 93 | + "then the number of leaf nodes is unlimited.\n", |
| 94 | + "\n", |
| 95 | + "The hyperparameter `min_samples_leaf` controls the minimum number of samples\n", |
| 96 | + "required to be at a leaf node. This means that a split point (at any depth) is\n", |
| 97 | + "only done if it leaves at least `min_samples_leaf` training samples in each of\n", |
| 98 | + "the left and right branches. A small value for `min_samples_leaf` means that\n", |
| 99 | + "some samples can become isolated when a tree is deep, promoting overfitting. A\n", |
| 100 | + "large value would prevent deep trees, which can lead to underfitting.\n", |
| 101 | + "\n", |
| 102 | + "Be aware that with random forest, trees are expected to be deep since we are\n", |
| 103 | + "seeking to overfit each tree on each bootstrap sample. Overfitting is\n", |
| 104 | + "mitigated when combining the trees altogether, whereas assembling underfitted\n", |
| 105 | + "trees (i.e. shallow trees) might also lead to an underfitted forest." |
| 106 | + ] |
| 107 | + }, |
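| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "The following minimal sketch shows how `max_depth` and `max_leaf_nodes`\n", |
| | + "shape the individual trees differently: capping the depth yields trees of\n", |
| | + "uniform depth, while capping the number of leaves lets the depths vary\n", |
| | + "(variable names again assumed from the split above)." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# `max_depth` caps every tree at the same depth, while `max_leaf_nodes`\n", |
| | + "# lets individual trees grow to different depths.\n", |
| | + "from sklearn.ensemble import RandomForestRegressor\n", |
| | + "\n", |
| | + "for params in ({\"max_depth\": 5}, {\"max_leaf_nodes\": 32}):\n", |
| | + "    forest = RandomForestRegressor(n_estimators=10, random_state=0,\n", |
| | + "                                   n_jobs=2, **params)\n", |
| | + "    forest.fit(data_train, target_train)\n", |
| | + "    depths = [tree.get_depth() for tree in forest.estimators_]\n", |
| | + "    print(f\"{params}: tree depths between {min(depths)} and {max(depths)}\")" |
| | + ] |
| | + }, |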
59 | 108 | { |
60 | 109 | "cell_type": "code", |
61 | 110 | "execution_count": null, |
|
67 | 116 | "from sklearn.ensemble import RandomForestRegressor\n", |
68 | 117 | "\n", |
69 | 118 | "param_distributions = {\n", |
70 | | - " \"n_estimators\": [1, 2, 5, 10, 20, 50, 100, 200, 500],\n", |
71 | | - " \"max_leaf_nodes\": [2, 5, 10, 20, 50, 100],\n", |
| 119 | + " \"max_features\": [1, 2, 3, 5, None],\n", |
| 120 | + " \"max_leaf_nodes\": [10, 100, 1000, None],\n", |
| 121 | + " \"min_samples_leaf\": [1, 2, 5, 10, 20, 50, 100],\n", |
72 | 122 | "}\n", |
73 | 123 | "search_cv = RandomizedSearchCV(\n", |
74 | 124 | " RandomForestRegressor(n_jobs=2), param_distributions=param_distributions,\n", |
|
88 | 138 | "cell_type": "markdown", |
89 | 139 | "metadata": {}, |
90 | 140 | "source": [ |
91 | | - "We can observe in our search that we are required to have a large\n", |
92 | | - "number of leaves and thus deep trees. This parameter seems particularly\n", |
93 | | - "impactful in comparison to the number of trees for this particular dataset:\n", |
94 | | - "with at least 50 trees, the generalization performance will be driven by the\n", |
95 | | - "number of leaves.\n", |
96 | | - "\n", |
97 | | - "Now we will estimate the generalization performance of the best model by\n", |
98 | | - "refitting it with the full training set and using the test set for scoring on\n", |
99 | | - "unseen data. This is done by default when calling the `.fit` method." |
| 141 | + "We can observe in our search that we are required to have a large number of\n", |
| 142 | + "`max_leaf_nodes` and thus deep trees. This parameter seems particularly\n", |
| 143 | + "impactful with respect to the other tuning parameters, but large values of\n", |
| 144 | + "`min_samples_leaf` seem to reduce the performance of the model.\n", |
| 145 | + "\n", |
| 146 | + "In practice, more iterations of random search would be necessary to precisely\n", |
| 147 | + "assert the role of each parameters. Using `n_iter=10` is good enough to\n", |
| 148 | + "quickly inspect the hyperparameter combinations that yield models that work\n", |
| 149 | + "well enough without spending too much computational resources. Feel free to\n", |
| 150 | + "try more interations on your own.\n", |
| 151 | + "\n", |
| 152 | + "Once the `RandomizedSearchCV` has found the best set of hyperparameters, it\n", |
| 153 | + "uses them to refit the model using the full training set. To estimate the\n", |
| 154 | + "generalization performance of the best model it suffices to call `.score` on\n", |
| 155 | + "the unseen data." |
100 | 156 | ] |
101 | 157 | }, |
102 | 158 | { |
|
180 | 236 | "\n", |
181 | 237 | "<div class=\"admonition caution alert alert-warning\">\n", |
182 | 238 | "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n", |
183 | | - "<p class=\"last\">Here, we tune the <tt class=\"docutils literal\">n_estimators</tt> but be aware that using early-stopping as\n", |
184 | | - "in the previous exercise will be better.</p>\n", |
| 239 | + "<p class=\"last\">Here, we tune the <tt class=\"docutils literal\">n_estimators</tt> but be aware that is better to use\n", |
| 240 | + "<tt class=\"docutils literal\">early_stopping</tt> as done in the Exercise M6.04.</p>\n", |
185 | 241 | "</div>\n", |
186 | 242 | "\n", |
187 | 243 | "In this search, we see that the `learning_rate` is required to be large\n", |
|
196 | 252 | "cell_type": "markdown", |
197 | 253 | "metadata": {}, |
198 | 254 | "source": [ |
199 | | - "Now we estimate the generalization performance of the best model\n", |
200 | | - "using the test set." |
| 255 | + "Now we estimate the generalization performance of the best model using the\n", |
| 256 | + "test set." |
201 | 257 | ] |
202 | 258 | }, |
203 | 259 | { |
|
216 | 272 | "source": [ |
217 | 273 | "The mean test score in the held-out test set is slightly better than the score\n", |
218 | 274 | "of the best model. The reason is that the final model is refitted on the whole\n", |
219 | | - "training set and therefore, on more data than the inner cross-validated models\n", |
220 | | - "of the grid search procedure." |
| 275 | + "training set and therefore, on more data than the cross-validated models of\n", |
| 276 | + "the grid search procedure." |
221 | 277 | ] |
222 | 278 | } |
223 | 279 | ], |
|