Commit 70b7d19

Merge pull request #266 from UBC-DSCI/predictor-selection
Re-introduce predictor selection
2 parents 682cd85 + ab4ec94 commit 70b7d19

File tree

1 file changed: +53 -156 lines changed

source/classification2.md

Lines changed: 53 additions & 156 deletions
@@ -1536,7 +1536,6 @@ the $K$-NN here.

+++

-<!--
## Predictor variable selection

```{note}
@@ -1589,7 +1588,7 @@ cancer_irrelevant[
]
```

-Next, we build a sequence of $K$-NN classifiers that include `Smoothness`,
+Next, we build a sequence of KNN classifiers that include `Smoothness`,
`Concavity`, and `Perimeter` as predictor variables, but also increasingly many irrelevant
variables. In particular, we create 6 data sets with 0, 5, 10, 15, 20, and 40 irrelevant predictors.
Then we build a model, tuned via 5-fold cross-validation, for each data set.
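As a rough sketch of how such data sets could be constructed (an illustrative aside, not the book's source code; the `cancer` data frame of real predictors and the random seed are assumed), one can append columns of pure noise to the real predictors:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # assumed seed, for reproducibility


def add_irrelevant_predictors(df, n):
    # append n columns of standard normal noise (irrelevant predictors) to df
    noise = pd.DataFrame(
        rng.standard_normal((len(df), n)),
        columns=[f"Irrelevant{i + 1}" for i in range(n)],
        index=df.index,
    )
    return pd.concat([df, noise], axis=1)


# `cancer` is assumed to hold the real predictors and the `Class` label
data_sets = {n: add_irrelevant_predictors(cancer, n) for n in [0, 5, 10, 15, 20, 40]}
```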
@@ -1693,15 +1692,9 @@ glue("fig:06-performance-irrelevant-features", plt_irrelevant_accuracies)
Effect of inclusion of irrelevant predictors.
:::

-```{code-cell} ipython3
-:tags: [remove-cell]
-
-glue("cancer_propn_1", "{:0.0f}".format(cancer_proportions.loc["Benign", "percent"]))
-```
-
Although the accuracy decreases as expected, one surprising thing about
{numref}`fig:06-performance-irrelevant-features` is that it shows that the method
-still outperforms the baseline majority classifier (with about {glue:text}`cancer_propn_1`% accuracy)
+still outperforms the baseline majority classifier (with about {glue:text}`cancer_train_b_prop`% accuracy)
even with 40 irrelevant variables.
How could that be? {numref}`fig:06-neighbors-irrelevant-features` provides the answer:
the tuning procedure for the $K$-nearest neighbors classifier combats the extra randomness from the irrelevant variables
@@ -1804,13 +1797,13 @@ Best subset selection is applicable to any classification method ($K$-NN or othe
However, it becomes very slow when you have even a moderate
number of predictors to choose from (say, around 10). This is because the number of possible predictor subsets
grows very quickly with the number of predictors, and you have to train the model (itself
-a slow process!) for each one. For example, if we have $2$ predictors&mdash;let's call
+a slow process!) for each one. For example, if we have 2 predictors&mdash;let's call
them A and B&mdash;then we have 3 variable sets to try: A alone, B alone, and finally A
-and B together. If we have $3$ predictors&mdash;A, B, and C&mdash;then we have 7
+and B together. If we have 3 predictors&mdash;A, B, and C&mdash;then we have 7
to try: A, B, C, AB, BC, AC, and ABC. In general, the number of models
we have to train for $m$ predictors is $2^m-1$; in other words, when we
-get to $10$ predictors we have over *one thousand* models to train, and
-at $20$ predictors we have over *one million* models to train!
+get to 10 predictors we have over *one thousand* models to train, and
+at 20 predictors we have over *one million* models to train!
So although it is a simple method, best subset selection is usually too computationally
expensive to use in practice.
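As a quick sanity check of that count (an illustrative snippet, not part of the book's source), enumerating every non-empty subset of $m$ predictors with `itertools` reproduces the $2^m-1$ figures quoted above:

```python
from itertools import combinations

def n_candidate_models(m):
    # count the non-empty subsets of m predictors
    return sum(1 for k in range(1, m + 1) for _ in combinations(range(m), k))

print(n_candidate_models(2))   # 3
print(n_candidate_models(3))   # 7
print(n_candidate_models(10))  # 1023  (over one thousand)
print(n_candidate_models(20))  # 1048575  (over one million)
```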

@@ -1835,8 +1828,8 @@ This pattern continues for as many iterations as you want. If you run the method
all the way until you run out of predictors to choose, you will end up training
$\frac{1}{2}m(m+1)$ separate models. This is a *big* improvement from the $2^m-1$
models that best subset selection requires you to train! For example, while best subset selection requires
-training over 1000 candidate models with $m=10$ predictors, forward selection requires training only 55 candidate models.
-Therefore we will continue the rest of this section using forward selection.
+training over 1000 candidate models with 10 predictors, forward selection requires training only 55 candidate models.
+Therefore we will continue the rest of this section using forward selection.
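For concreteness (again just an illustrative check, not from the book's source), comparing the two counts side by side shows how quickly the gap grows:

```python
# best subset selection trains 2^m - 1 models; forward selection trains m(m + 1)/2
for m in [3, 10, 20]:
    print(m, 2**m - 1, m * (m + 1) // 2)
# 3 7 6
# 10 1023 55
# 20 1048575 210
```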

```{note}
One word of caution before we move on. Every additional model that you train
@@ -1856,31 +1849,9 @@ where to learn more about advanced predictor selection methods.
### Forward selection in `scikit-learn`

We now turn to implementing forward selection in Python.
-The function [`SequentialFeatureSelector`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html)
-in the `scikit-learn` can automate this for us, and a simple demo is shown below. However, for
-the learning purpose, we also want to show how each predictor is selected over iterations,
-so we will have to code it ourselves.
-
-+++
-
-First we will extract the "total" set of predictors that we are willing to work with.
-Here we will load the modified version of the cancer data with irrelevant
-predictors, and select `Smoothness`, `Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`
-as potential predictors, and the `Class` variable as the label.
-We will also extract the column names for the full set of predictor variables.
-
-```{code-cell} ipython3
-:tags: [remove-cell]
-
-# We now turn to implementing forward selection in Python.
-# Unfortunately there is no built-in way to do this using the `tidymodels` framework,
-# so we will have to code it ourselves. First we will use the `select` function
-# to extract the "total" set of predictors that we are willing to work with.
-# Here we will load the modified version of the cancer data with irrelevant
-# predictors, and select `Smoothness`, `Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`
-# as potential predictors, and the `Class` variable as the label.
-# We will also extract the column names for the full set of predictor variables.
-```
+First we will extract a smaller set of predictors to work with in this illustrative example&mdash;`Smoothness`,
+`Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`&mdash;as well as the `Class` variable as the label.
+We will also extract the column names for the full set of predictors.

```{code-cell} ipython3
cancer_subset = cancer_irrelevant[
@@ -1902,151 +1873,79 @@ names = list(cancer_subset.drop(
cancer_subset
```

-```{code-cell} ipython3
-:tags: []
-
-# Using scikit-learn SequentialFeatureSelector
-from sklearn.feature_selection import SequentialFeatureSelector
-cancer_preprocessor = make_column_transformer(
-    (
-        StandardScaler(),
-        list(cancer_subset.drop(columns=["Class"]).columns),
-    ),
-)
-
-cancer_pipe_forward = make_pipeline(
-    cancer_preprocessor,
-    SequentialFeatureSelector(KNeighborsClassifier(), direction="forward"),
-    KNeighborsClassifier(),
-)
-
-X = cancer_subset.drop(columns=["Class"])
-y = cancer_subset["Class"]
-
-cancer_pipe_forward.fit(X, y)
-
-cancer_pipe_forward.named_steps["sequentialfeatureselector"].n_features_to_select_
-```
-
-```{code-cell} ipython3
-:tags: [remove-cell]
-
-glue(
-    "sequentialfeatureselector_n_features",
-    "{:d}".format(cancer_pipe_forward.named_steps["sequentialfeatureselector"].n_features_to_select_),
-)
-```
-
-This means that {glue:text}`sequentialfeatureselector_n_features` features were selected according to the forward selection algorithm.
-
-+++
-
-Now, let's code the actual algorithm by ourselves. The key idea of the forward
-selection code is to properly extract each subset of predictors for which we
-want to build a model, pass them to the preprocessor and fit the pipeline with
-them.
-
-```{code-cell} ipython3
-:tags: [remove-cell]
-
-# The key idea of the forward selection code is to use the `paste` function (which concatenates strings
-# separated by spaces) to create a model formula for each subset of predictors for which we want to build a model.
-# The `collapse` argument tells `paste` what to put between the items in the list;
-# to make a formula, we need to put a `+` symbol between each variable.
-# As an example, let's make a model formula for all the predictors,
-# which should output something like
-# `Class ~ Smoothness + Concavity + Perimeter + Irrelevant1 + Irrelevant2 + Irrelevant3`:
-```
-
-Finally, we need to write some code that performs the task of sequentially
-finding the best predictor to add to the model.
+To perform forward selection, we could use the
+[`SequentialFeatureSelector`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html)
+from `scikit-learn`, but it is difficult to combine this approach with parameter tuning to find a good number of neighbors
+for each set of features. Instead we will code the forward selection algorithm manually.
+In particular, we need code that tries adding each available predictor to the model, finds the best one, and iterates.
If you recall the end of the wrangling chapter, we mentioned
that sometimes one needs more flexible forms of iteration than what
we have used earlier, and in these cases one typically resorts to
-a *for loop*; see [the section on control flow (for loops)](https://wesmckinney.com/book/python-basics.html#control_for) in *Python for Data Analysis* {cite:p}`mckinney2012python`.
-Here we will use two for loops:
-one over increasing predictor set sizes
+a *for loop*; see
+the [control flow section](https://wesmckinney.com/book/python-basics.html#control_for) in
+*Python for Data Analysis* {cite:p}`mckinney2012python`.
+Here we will use two for loops: one over increasing predictor set sizes
(where you see `for i in range(1, n_total + 1):` below),
and another to check which predictor to add in each round (where you see `for j in range(len(names))` below).
For each set of predictors to try, we extract the subset of predictors,
pass it into a preprocessor, build a `Pipeline` that tunes
-a $K$-NN classifier using 10-fold cross-validation,
+a K-NN classifier using 10-fold cross-validation,
and finally record the estimated accuracy.
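As an aside before the manual implementation below, here is a minimal sketch of the `SequentialFeatureSelector` route mentioned above, adapted from the code removed in this commit (it assumes the `scikit-learn` imports used earlier in the chapter). Note that it selects features for a K-NN classifier with a fixed number of neighbors rather than re-tuning $K$ for each candidate set of predictors, which is exactly the limitation discussed above:

```python
from sklearn.feature_selection import SequentialFeatureSelector

# scale every candidate predictor, then greedily add features for a K-NN classifier
sfs_preprocessor = make_column_transformer(
    (StandardScaler(), list(cancer_subset.drop(columns=["Class"]).columns)),
)
sfs_pipe = make_pipeline(
    sfs_preprocessor,
    SequentialFeatureSelector(KNeighborsClassifier(), direction="forward"),
    KNeighborsClassifier(),
)

X = cancer_subset.drop(columns=["Class"])
y = cancer_subset["Class"]
sfs_pipe.fit(X, y)

# number of features chosen by forward selection (with default settings)
sfs_pipe.named_steps["sequentialfeatureselector"].n_features_to_select_
```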

```{code-cell} ipython3
-:tags: [remove-cell]
+from sklearn.compose import make_column_selector

-# Finally, we need to write some code that performs the task of sequentially
-# finding the best predictor to add to the model.
-# If you recall the end of the wrangling chapter, we mentioned
-# that sometimes one needs more flexible forms of iteration than what
-# we have used earlier, and in these cases one typically resorts to
-# a *for loop*; see [the chapter on iteration](https://r4ds.had.co.nz/iteration.html) in *R for Data Science* [@wickham2016r].
-# Here we will use two for loops:
-# one over increasing predictor set sizes
-# (where you see `for (i in 1:length(names))` below),
-# and another to check which predictor to add in each round (where you see `for (j in 1:length(names))` below).
-# For each set of predictors to try, we construct a model formula,
-# pass it into a `recipe`, build a `workflow` that tunes
-# a $K$-NN classifier using 5-fold cross-validation,
-# and finally records the estimated accuracy.
-```
-
-```{code-cell} ipython3
accuracy_dict = {"size": [], "selected_predictors": [], "accuracy": []}

# store the total number of predictors
n_total = len(names)

+# start with an empty list of selected predictors
selected = []

+
+# create the pipeline and CV grid search objects
+param_grid = {
+    "kneighborsclassifier__n_neighbors": range(1, 61, 5),
+}
+cancer_preprocessor = make_column_transformer(
+    (StandardScaler(), make_column_selector(dtype_include="number"))
+)
+cancer_tune_pipe = make_pipeline(cancer_preprocessor, KNeighborsClassifier())
+cancer_tune_grid = GridSearchCV(
+    estimator=cancer_tune_pipe,
+    param_grid=param_grid,
+    cv=10,
+    n_jobs=-1
+)
+
# for every possible number of predictors
for i in range(1, n_total + 1):
-    accs = []
-    models = []
+    accs = np.zeros(len(names))
+    # for every possible predictor to add
    for j in range(len(names)):
-        # create the preprocessor and pipeline with specified set of predictors
-        cancer_preprocessor = make_column_transformer(
-            (StandardScaler(), selected + [names[j]]),
-        )
-        cancer_tune_pipe = make_pipeline(cancer_preprocessor, KNeighborsClassifier())
-        # tune the KNN classifier with these predictors,
-        # and collect the accuracy for the best K
-        param_grid = {
-            "kneighborsclassifier__n_neighbors": range(1, 61, 5),
-        } ## double check
-
-        cancer_tune_grid = GridSearchCV(
-            estimator=cancer_tune_pipe,
-            param_grid=param_grid,
-            cv=10, ## double check
-            n_jobs=-1,
-            # return_train_score=True,
-        )
-
+        # Add remaining predictor j to the model
        X = cancer_subset[selected + [names[j]]]
        y = cancer_subset["Class"]
-
+
+        # Find the best K for this set of predictors
        cancer_model_grid = cancer_tune_grid.fit(X, y)
        accuracies_grid = pd.DataFrame(cancer_model_grid.cv_results_)
-        sorted_accuracies = accuracies_grid.sort_values(
-            by="mean_test_score", ascending=False
-        )

-        res = sorted_accuracies.iloc[0, :]
-        accs.append(res["mean_test_score"])
-        models.append(
-            selected + [names[j]]
-        ) # (res["param_kneighborsclassifier__n_neighbors"]) ## if want to know the best selection of K
-    # get the best selection of (newly added) feature which maximizes cv accuracy
-    best_set = models[accs.index(max(accs))]
+        # Store the tuned accuracy for this set of predictors
+        accs[j] = accuracies_grid["mean_test_score"].max()
+
+    # get the best new set of predictors that maximize cv accuracy
+    best_set = selected + [names[accs.argmax()]]

+    # store the results for this round of forward selection
    accuracy_dict["size"].append(i)
    accuracy_dict["selected_predictors"].append(", ".join(best_set))
-    accuracy_dict["accuracy"].append(max(accs))
+    accuracy_dict["accuracy"].append(accs.max())

+    # update the selected & available sets of predictors
    selected = best_set
-    del names[accs.index(max(accs))]
+    del names[accs.argmax()]

accuracies = pd.DataFrame(accuracy_dict)
accuracies
@@ -2103,8 +2002,6 @@ part of tuning your classifier, you *cannot use your test data* for this
process!
```

--->
-
## Exercises

Practice exercises for the material covered in this chapter
