@@ -1588,7 +1588,7 @@ cancer_irrelevant[
 ]
 ```
 
-Next, we build a sequence of $K$-NN classifiers that include `Smoothness`,
+Next, we build a sequence of KNN classifiers that include `Smoothness`,
 `Concavity`, and `Perimeter` as predictor variables, but also increasingly many irrelevant
 variables. In particular, we create 6 data sets with 0, 5, 10, 15, 20, and 40 irrelevant predictors.
 Then we build a model, tuned via 5-fold cross-validation, for each data set.
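As a concrete companion to the paragraph above, here is a minimal sketch of how this sequence of tuned classifiers could be assembled with scikit-learn. The `cancer_irrelevant` data frame, its `Class` label column, the `irrelevant1`, `irrelevant2`, ... column names, and the grid of neighbor values are assumptions for illustration; the book's own code cells may differ.

```python
# Minimal sketch, assuming `cancer_irrelevant` has a "Class" label column, the
# three real predictors, and extra columns named "irrelevant1", "irrelevant2", ...
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

real_predictors = ["Smoothness", "Concavity", "Perimeter"]
accuracies = {}
for n_irrelevant in [0, 5, 10, 15, 20, 40]:
    predictors = real_predictors + [f"irrelevant{i + 1}" for i in range(n_irrelevant)]
    X = cancer_irrelevant[predictors]
    y = cancer_irrelevant["Class"]
    knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
    grid = GridSearchCV(
        knn_pipeline,
        param_grid={"kneighborsclassifier__n_neighbors": range(1, 21)},
        cv=5,  # 5-fold cross-validation, as in the text
    )
    grid.fit(X, y)
    accuracies[n_irrelevant] = grid.best_score_  # best cross-validation accuracy
```

Plotting `accuracies` against the number of irrelevant predictors gives the kind of comparison shown in the figure referenced in the next hunk.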
@@ -1692,15 +1692,9 @@ glue("fig:06-performance-irrelevant-features", plt_irrelevant_accuracies)
 Effect of inclusion of irrelevant predictors.
 :::
 
-```{code-cell} ipython3
-:tags: [remove-cell]
-
-glue("cancer_propn_1", "{:0.0f}".format(cancer_proportions.loc["Benign", "percent"]))
-```
-
 Although the accuracy decreases as expected, one surprising thing about
 {numref}`fig:06-performance-irrelevant-features` is that it shows that the method
-still outperforms the baseline majority classifier (with about {glue:text}`cancer_propn_1`% accuracy)
+still outperforms the baseline majority classifier (with about {glue:text}`cancer_train_b_prop`% accuracy)
 even with 40 irrelevant variables.
 How could that be? {numref}`fig:06-neighbors-irrelevant-features` provides the answer:
 the tuning procedure for the $K$-nearest neighbors classifier combats the extra randomness from the irrelevant variables
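The baseline mentioned in this hunk is just the proportion of the most common class in the training set. A short hedged sketch, assuming a `cancer_train` data frame with a `Class` column (the variable names are illustrative, not necessarily the book's):

```python
# Majority-classifier baseline: always predict the most common training class.
# Assumes a pandas data frame `cancer_train` with a "Class" column.
class_percentages = cancer_train["Class"].value_counts(normalize=True) * 100
baseline_accuracy = class_percentages.max()  # accuracy of always guessing the majority class
print(f"Majority-class baseline: {baseline_accuracy:.0f}%")
```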
@@ -1803,13 +1797,13 @@ Best subset selection is applicable to any classification method ($K$-NN or othe
 However, it becomes very slow when you have even a moderate
 number of predictors to choose from (say, around 10). This is because the number of possible predictor subsets
 grows very quickly with the number of predictors, and you have to train the model (itself
-a slow process!) for each one. For example, if we have $2$ predictors&mdash;let's call
+a slow process!) for each one. For example, if we have 2 predictors&mdash;let's call
 them A and B&mdash;then we have 3 variable sets to try: A alone, B alone, and finally A
-and B together. If we have $3$ predictors&mdash;A, B, and C&mdash;then we have 7
+and B together. If we have 3 predictors&mdash;A, B, and C&mdash;then we have 7
 to try: A, B, C, AB, BC, AC, and ABC. In general, the number of models
 we have to train for $m$ predictors is $2^m-1$; in other words, when we
-get to $10$ predictors we have over *one thousand* models to train, and
-at $20$ predictors we have over *one million* models to train!
+get to 10 predictors we have over *one thousand* models to train, and
+at 20 predictors we have over *one million* models to train!
 So although it is a simple method, best subset selection is usually too computationally
 expensive to use in practice.
 
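The subset counts quoted in this hunk are easy to verify directly: with m predictors there are 2^m - 1 non-empty subsets to try. A short sketch:

```python
# Enumerate every non-empty subset of a small set of predictors,
# then check the 2**m - 1 growth for larger m.
from itertools import combinations

predictors = ["A", "B", "C"]
subsets = [
    combo
    for size in range(1, len(predictors) + 1)
    for combo in combinations(predictors, size)
]
print(len(subsets))  # 7 subsets: A, B, C, AB, AC, BC, ABC

for m in [2, 3, 10, 20]:
    print(m, 2**m - 1)  # 3, 7, 1023, 1048575
```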
@@ -1834,8 +1828,8 @@ This pattern continues for as many iterations as you want. If you run the method
 all the way until you run out of predictors to choose, you will end up training
 $\frac{1}{2}m(m+1)$ separate models. This is a *big* improvement from the $2^m-1$
 models that best subset selection requires you to train! For example, while best subset selection requires
-training over 1000 candidate models with $m=10$ predictors, forward selection requires training only 55 candidate models.
-Therefore we will continue the rest of this section using forward selection.
+training over 1000 candidate models with 10 predictors, forward selection requires training only 55 candidate models.
+Therefore we will continue the rest of this section using forward selection.
 
 ```{note}
 One word of caution before we move on. Every additional model that you train
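To make the comparison concrete, here is a hedged sketch of a forward selection loop run until the candidate predictors are exhausted, so the m(m+1)/2 model count is easy to check. The `cancer_train` data frame, its `Class` column, and the choice of a KNN pipeline scored with `cross_val_score` are assumptions for illustration, not the book's exact code.

```python
# Hedged sketch of forward selection: repeatedly add whichever remaining
# predictor yields the best cross-validation accuracy when appended to the
# already-selected set. Assumes `cancer_train` with a "Class" label column.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def forward_select(train_df, candidates, label="Class", cv=5):
    selected, models_trained = [], 0
    while candidates:
        best_score, best_var = -1.0, None
        for var in candidates:  # try each remaining predictor in turn
            pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
            score = cross_val_score(
                pipe, train_df[selected + [var]], train_df[label], cv=cv
            ).mean()
            models_trained += 1
            if score > best_score:
                best_score, best_var = score, var
        selected.append(best_var)  # keep the best addition and iterate
        candidates = [v for v in candidates if v != best_var]
    # With m candidates this trains m + (m-1) + ... + 1 = m(m+1)/2 candidate models.
    return selected, models_trained
```

Calling `forward_select(cancer_train, candidate_names)` with a list of 10 candidate column names would report 55 trained candidate models, matching the count quoted above.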