@@ -1536,7 +1536,6 @@ the $K$-NN here.
+++

- <!--
## Predictor variable selection

```{note}
@@ -1589,7 +1588,7 @@ cancer_irrelevant[
]
```

- Next, we build a sequence of $K$-NN classifiers that include `Smoothness`,
+ Next, we build a sequence of KNN classifiers that include `Smoothness`,
`Concavity`, and `Perimeter` as predictor variables, but also increasingly many irrelevant
variables. In particular, we create 6 data sets with 0, 5, 10, 15, 20, and 40 irrelevant predictors.
Then we build a model, tuned via 5-fold cross-validation, for each data set.
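The cells that actually construct and tune these six data sets fall outside the context shown in this diff. A rough sketch of the idea might look like the code below; it assumes that `cancer_irrelevant` contains noise columns named `Irrelevant1` through `Irrelevant40` (only `Irrelevant1` to `Irrelevant3` appear explicitly later in this diff), and the grid of neighbors is an arbitrary choice for illustration.

```python
# Illustrative sketch only: tune a KNN classifier on data sets with
# 0, 5, 10, 15, 20, and 40 irrelevant predictors added.
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

real_predictors = ["Smoothness", "Concavity", "Perimeter"]
results = {"n_irrelevant": [], "accuracy": []}

for n in [0, 5, 10, 15, 20, 40]:
    # assumed column naming: Irrelevant1, Irrelevant2, ...
    predictors = real_predictors + [f"Irrelevant{i}" for i in range(1, n + 1)]
    pipe = make_pipeline(
        make_column_transformer((StandardScaler(), predictors)),
        KNeighborsClassifier(),
    )
    grid = GridSearchCV(
        pipe,
        param_grid={"kneighborsclassifier__n_neighbors": range(1, 21)},
        cv=5,  # 5-fold cross-validation, as described above
    )
    grid.fit(cancer_irrelevant[predictors], cancer_irrelevant["Class"])
    results["n_irrelevant"].append(n)
    results["accuracy"].append(grid.best_score_)

pd.DataFrame(results)
```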
@@ -1693,15 +1692,9 @@ glue("fig:06-performance-irrelevant-features", plt_irrelevant_accuracies)
Effect of inclusion of irrelevant predictors.
:::

- ```{code-cell} ipython3
- :tags: [remove-cell]
-
- glue("cancer_propn_1", "{:0.0f}".format(cancer_proportions.loc["Benign", "percent"]))
- ```
-
Although the accuracy decreases as expected, one surprising thing about
{numref}`fig:06-performance-irrelevant-features` is that it shows that the method
- still outperforms the baseline majority classifier (with about {glue:text}`cancer_propn_1`% accuracy)
+ still outperforms the baseline majority classifier (with about {glue:text}`cancer_train_b_prop`% accuracy)
even with 40 irrelevant variables.
How could that be? {numref}`fig:06-neighbors-irrelevant-features` provides the answer:
the tuning procedure for the $K$-nearest neighbors classifier combats the extra randomness from the irrelevant variables
@@ -1804,13 +1797,13 @@ Best subset selection is applicable to any classification method ($K$-NN or othe
However, it becomes very slow when you have even a moderate
number of predictors to choose from (say, around 10). This is because the number of possible predictor subsets
grows very quickly with the number of predictors, and you have to train the model (itself
- a slow process!) for each one. For example, if we have $2$ predictors—let's call
+ a slow process!) for each one. For example, if we have 2 predictors&mdash;let's call
them A and B&mdash;then we have 3 variable sets to try: A alone, B alone, and finally A
- and B together. If we have $3$ predictors—A, B, and C—then we have 7
+ and B together. If we have 3 predictors&mdash;A, B, and C&mdash;then we have 7
to try: A, B, C, AB, BC, AC, and ABC. In general, the number of models
we have to train for $m$ predictors is $2^m-1$; in other words, when we
- get to $10$ predictors we have over *one thousand* models to train, and
- at $20$ predictors we have over *one million* models to train!
+ get to 10 predictors we have over *one thousand* models to train, and
+ at 20 predictors we have over *one million* models to train!
So although it is a simple method, best subset selection is usually too computationally
expensive to use in practice.
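As a quick check of the counts quoted above, the $2^m-1$ formula gives:

```python
# number of candidate models in best subset selection with m predictors
for m in [2, 3, 10, 20]:
    print(f"m = {m}: {2**m - 1} models")
# m = 2: 3, m = 3: 7, m = 10: 1023 (over one thousand), m = 20: 1048575 (over one million)
```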
@@ -1835,8 +1828,8 @@ This pattern continues for as many iterations as you want. If you run the method
all the way until you run out of predictors to choose, you will end up training
$\frac{1}{2}m(m+1)$ separate models. This is a *big* improvement from the $2^m-1$
models that best subset selection requires you to train! For example, while best subset selection requires
- training over 1000 candidate models with $m=10$ predictors, forward selection requires training only 55 candidate models.
- Therefore we will continue the rest of this section using forward selection.
+ training over 1000 candidate models with 10 predictors, forward selection requires training only 55 candidate models.
+ Therefore we will continue the rest of this section using forward selection.
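The corresponding count for forward selection, $\frac{1}{2}m(m+1)$, grows much more slowly:

```python
# number of candidate models in forward selection with m predictors
for m in [10, 20]:
    print(f"m = {m}: {m * (m + 1) // 2} models")
# m = 10: 55 and m = 20: 210, versus 1023 and 1048575 for best subset selection
```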
```{note}
One word of caution before we move on. Every additional model that you train
@@ -1856,31 +1849,9 @@ where to learn more about advanced predictor selection methods.
### Forward selection in `scikit-learn`

We now turn to implementing forward selection in Python.
- The function [`SequentialFeatureSelector`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html)
- in the `scikit-learn` can automate this for us, and a simple demo is shown below. However, for
- the learning purpose, we also want to show how each predictor is selected over iterations,
- so we will have to code it ourselves.
-
- +++
-
- First we will extract the "total" set of predictors that we are willing to work with.
- Here we will load the modified version of the cancer data with irrelevant
- predictors, and select `Smoothness`, `Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`
- as potential predictors, and the `Class` variable as the label.
- We will also extract the column names for the full set of predictor variables.
-
- ```{code-cell} ipython3
- :tags: [remove-cell]
-
- # We now turn to implementing forward selection in Python.
- # Unfortunately there is no built-in way to do this using the `tidymodels` framework,
- # so we will have to code it ourselves. First we will use the `select` function
- # to extract the "total" set of predictors that we are willing to work with.
- # Here we will load the modified version of the cancer data with irrelevant
- # predictors, and select `Smoothness`, `Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`
- # as potential predictors, and the `Class` variable as the label.
- # We will also extract the column names for the full set of predictor variables.
- ```
+ First we will extract a smaller set of predictors to work with in this illustrative example&mdash;`Smoothness`,
+ `Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`&mdash;as well as the `Class` variable as the label.
+ We will also extract the column names for the full set of predictors.

```{code-cell} ipython3
cancer_subset = cancer_irrelevant[
@@ -1902,151 +1873,79 @@ names = list(cancer_subset.drop(
cancer_subset
```
- ```{code-cell} ipython3
- :tags: []
-
- # Using scikit-learn SequentialFeatureSelector
- from sklearn.feature_selection import SequentialFeatureSelector
- cancer_preprocessor = make_column_transformer(
-     (
-         StandardScaler(),
-         list(cancer_subset.drop(columns=["Class"]).columns),
-     ),
- )
-
- cancer_pipe_forward = make_pipeline(
-     cancer_preprocessor,
-     SequentialFeatureSelector(KNeighborsClassifier(), direction="forward"),
-     KNeighborsClassifier(),
- )
-
- X = cancer_subset.drop(columns=["Class"])
- y = cancer_subset["Class"]
-
- cancer_pipe_forward.fit(X, y)
-
- cancer_pipe_forward.named_steps["sequentialfeatureselector"].n_features_to_select_
- ```
-
- ```{code-cell} ipython3
- :tags: [remove-cell]
-
- glue(
-     "sequentialfeatureselector_n_features",
-     "{:d}".format(cancer_pipe_forward.named_steps["sequentialfeatureselector"].n_features_to_select_),
- )
- ```
-
- This means that {glue:text}`sequentialfeatureselector_n_features` features were selected according to the forward selection algorithm.
-
- +++
-
- Now, let's code the actual algorithm by ourselves. The key idea of the forward
- selection code is to properly extract each subset of predictors for which we
- want to build a model, pass them to the preprocessor and fit the pipeline with
- them.
-
- ```{code-cell} ipython3
- :tags: [remove-cell]
-
- # The key idea of the forward selection code is to use the `paste` function (which concatenates strings
- # separated by spaces) to create a model formula for each subset of predictors for which we want to build a model.
- # The `collapse` argument tells `paste` what to put between the items in the list;
- # to make a formula, we need to put a `+` symbol between each variable.
- # As an example, let's make a model formula for all the predictors,
- # which should output something like
- # `Class ~ Smoothness + Concavity + Perimeter + Irrelevant1 + Irrelevant2 + Irrelevant3`:
- ```
-
- Finally, we need to write some code that performs the task of sequentially
- finding the best predictor to add to the model.
+ To perform forward selection, we could use the
+ [`SequentialFeatureSelector`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html)
+ from `scikit-learn`; but it is difficult to combine this approach with parameter tuning to find a good number of neighbors
+ for each set of features. Instead we will code the forward selection algorithm manually.
+ In particular, we need code that tries adding each available predictor to the model, picks the best one, and iterates.
If you recall the end of the wrangling chapter, we mentioned
that sometimes one needs more flexible forms of iteration than what
we have used earlier, and in these cases one typically resorts to
- a *for loop*; see [the section on control flow (for loops)](https://wesmckinney.com/book/python-basics.html#control_for) in *Python for Data Analysis* {cite:p}`mckinney2012python`.
- Here we will use two for loops:
- one over increasing predictor set sizes
+ a *for loop*; see
+ the [control flow section](https://wesmckinney.com/book/python-basics.html#control_for) in
+ *Python for Data Analysis* {cite:p}`mckinney2012python`.
+ Here we will use two for loops: one over increasing predictor set sizes
(where you see `for i in range(1, n_total + 1):` below),
and another to check which predictor to add in each round (where you see `for j in range(len(names))` below).
For each set of predictors to try, we extract the subset of predictors,
pass it into a preprocessor, build a `Pipeline` that tunes
- a $K$-NN classifier using 10-fold cross-validation,
+ a K-NN classifier using 10-fold cross-validation,
and finally records the estimated accuracy.
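As an aside, a minimal sketch of the `SequentialFeatureSelector` route mentioned above (adapted from the demo that this commit removes) is shown below. The variable names are only for illustration, and the number of neighbors stays at the classifier's default instead of being tuned for each candidate predictor set, which is exactly the limitation noted above; the manual loop in the next cell is what the chapter actually uses.

```python
# Sketch of the SequentialFeatureSelector approach, with a fixed (default) K.
from sklearn.compose import make_column_transformer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

sfs_preprocessor = make_column_transformer(
    (StandardScaler(), list(cancer_subset.drop(columns=["Class"]).columns)),
)
sfs_pipe = make_pipeline(
    sfs_preprocessor,
    SequentialFeatureSelector(KNeighborsClassifier(), direction="forward"),
    KNeighborsClassifier(),
)
sfs_pipe.fit(cancer_subset.drop(columns=["Class"]), cancer_subset["Class"])

# number of predictors the selector decided to keep
sfs_pipe.named_steps["sequentialfeatureselector"].n_features_to_select_
```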

```{code-cell} ipython3
- :tags: [remove-cell]
+ from sklearn.compose import make_column_selector

- # Finally, we need to write some code that performs the task of sequentially
- # finding the best predictor to add to the model.
- # If you recall the end of the wrangling chapter, we mentioned
- # that sometimes one needs more flexible forms of iteration than what
- # we have used earlier, and in these cases one typically resorts to
- # a *for loop*; see [the chapter on iteration](https://r4ds.had.co.nz/iteration.html) in *R for Data Science* [@wickham2016r].
- # Here we will use two for loops:
- # one over increasing predictor set sizes
- # (where you see `for (i in 1:length(names))` below),
- # and another to check which predictor to add in each round (where you see `for (j in 1:length(names))` below).
- # For each set of predictors to try, we construct a model formula,
- # pass it into a `recipe`, build a `workflow` that tunes
- # a $K$-NN classifier using 5-fold cross-validation,
- # and finally records the estimated accuracy.
- ```
-
- ```{code-cell} ipython3
accuracy_dict = {"size": [], "selected_predictors": [], "accuracy": []}

# store the total number of predictors
n_total = len(names)

+ # start with an empty list of selected predictors
selected = []

+
+ # create the pipeline and CV grid search objects
+ param_grid = {
+     "kneighborsclassifier__n_neighbors": range(1, 61, 5),
+ }
+ cancer_preprocessor = make_column_transformer(
+     (StandardScaler(), make_column_selector(dtype_include="number"))
+ )
+ cancer_tune_pipe = make_pipeline(cancer_preprocessor, KNeighborsClassifier())
+ cancer_tune_grid = GridSearchCV(
+     estimator=cancer_tune_pipe,
+     param_grid=param_grid,
+     cv=10,
+     n_jobs=-1
+ )
+
# for every possible number of predictors
for i in range(1, n_total + 1):
-     accs = []
-     models = []
+     accs = np.zeros(len(names))
+     # for every possible predictor to add
    for j in range(len(names)):
-         # create the preprocessor and pipeline with specified set of predictors
-         cancer_preprocessor = make_column_transformer(
-             (StandardScaler(), selected + [names[j]]),
-         )
-         cancer_tune_pipe = make_pipeline(cancer_preprocessor, KNeighborsClassifier())
-         # tune the KNN classifier with these predictors,
-         # and collect the accuracy for the best K
-         param_grid = {
-             "kneighborsclassifier__n_neighbors": range(1, 61, 5),
-         } ## double check
-
-         cancer_tune_grid = GridSearchCV(
-             estimator=cancer_tune_pipe,
-             param_grid=param_grid,
-             cv=10, ## double check
-             n_jobs=-1,
-             # return_train_score=True,
-         )
-
+         # Add remaining predictor j to the model
        X = cancer_subset[selected + [names[j]]]
        y = cancer_subset["Class"]
-
+
+         # Find the best K for this set of predictors
        cancer_model_grid = cancer_tune_grid.fit(X, y)
        accuracies_grid = pd.DataFrame(cancer_model_grid.cv_results_)
-         sorted_accuracies = accuracies_grid.sort_values(
-             by="mean_test_score", ascending=False
-         )

-         res = sorted_accuracies.iloc[0, :]
-         accs.append(res["mean_test_score"])
-         models.append(
-             selected + [names[j]]
-         ) # (res["param_kneighborsclassifier__n_neighbors"]) ## if want to know the best selection of K
-     # get the best selection of (newly added) feature which maximizes cv accuracy
-     best_set = models[accs.index(max(accs))]
+         # Store the tuned accuracy for this set of predictors
+         accs[j] = accuracies_grid["mean_test_score"].max()
+
+     # get the best new set of predictors that maximizes cv accuracy
+     best_set = selected + [names[accs.argmax()]]

+     # store the results for this round of forward selection
    accuracy_dict["size"].append(i)
    accuracy_dict["selected_predictors"].append(", ".join(best_set))
-     accuracy_dict["accuracy"].append(max(accs))
+     accuracy_dict["accuracy"].append(accs.max())

+     # update the selected & available sets of predictors
    selected = best_set
-     del names[accs.index(max(accs))]
+     del names[accs.argmax()]

accuracies = pd.DataFrame(accuracy_dict)
accuracies
@@ -2103,8 +2002,6 @@ part of tuning your classifier, you *cannot use your test data* for this
process!
```

- -->
-
## Exercises

Practice exercises for the material covered in this chapter