Commit 70b7d19

Merge pull request #266 from UBC-DSCI/predictor-selection
Re-introduce predictor selection
2 parents 682cd85 + ab4ec94 commit 70b7d19

File tree

1 file changed: +53 -156 lines changed

source/classification2.md

Lines changed: 53 additions & 156 deletions
@@ -1536,7 +1536,6 @@ the $K$-NN here.

+++

-<!--
## Predictor variable selection

```{note}
@@ -1589,7 +1588,7 @@ cancer_irrelevant[
]
```

-Next, we build a sequence of $K$-NN classifiers that include `Smoothness`,
+Next, we build a sequence of KNN classifiers that include `Smoothness`,
`Concavity`, and `Perimeter` as predictor variables, but also increasingly many irrelevant
variables. In particular, we create 6 data sets with 0, 5, 10, 15, 20, and 40 irrelevant predictors.
Then we build a model, tuned via 5-fold cross-validation, for each data set.
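As a rough sketch of how such data sets could be constructed (an illustrative aside, not the book's source code; the `cancer` data frame of real predictors and the random seed are assumed), one can append columns of pure noise to the real predictors:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # assumed seed, for reproducibility


def add_irrelevant_predictors(df, n):
    # append n columns of standard normal noise (irrelevant predictors) to df
    noise = pd.DataFrame(
        rng.standard_normal((len(df), n)),
        columns=[f"Irrelevant{i + 1}" for i in range(n)],
        index=df.index,
    )
    return pd.concat([df, noise], axis=1)


# `cancer` is assumed to hold the real predictors and the `Class` label
data_sets = {n: add_irrelevant_predictors(cancer, n) for n in [0, 5, 10, 15, 20, 40]}
```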
@@ -1693,15 +1692,9 @@ glue("fig:06-performance-irrelevant-features", plt_irrelevant_accuracies)
Effect of inclusion of irrelevant predictors.
:::

-```{code-cell} ipython3
-:tags: [remove-cell]
-
-glue("cancer_propn_1", "{:0.0f}".format(cancer_proportions.loc["Benign", "percent"]))
-```
-
Although the accuracy decreases as expected, one surprising thing about
{numref}`fig:06-performance-irrelevant-features` is that it shows that the method
-still outperforms the baseline majority classifier (with about {glue:text}`cancer_propn_1`% accuracy)
+still outperforms the baseline majority classifier (with about {glue:text}`cancer_train_b_prop`% accuracy)
even with 40 irrelevant variables.
How could that be? {numref}`fig:06-neighbors-irrelevant-features` provides the answer:
the tuning procedure for the $K$-nearest neighbors classifier combats the extra randomness from the irrelevant variables
@@ -1804,13 +1797,13 @@ Best subset selection is applicable to any classification method ($K$-NN or othe
However, it becomes very slow when you have even a moderate
number of predictors to choose from (say, around 10). This is because the number of possible predictor subsets
grows very quickly with the number of predictors, and you have to train the model (itself
-a slow process!) for each one. For example, if we have $2$ predictors&mdash;let's call
+a slow process!) for each one. For example, if we have 2 predictors&mdash;let's call
them A and B&mdash;then we have 3 variable sets to try: A alone, B alone, and finally A
-and B together. If we have $3$ predictors&mdash;A, B, and C&mdash;then we have 7
+and B together. If we have 3 predictors&mdash;A, B, and C&mdash;then we have 7
to try: A, B, C, AB, BC, AC, and ABC. In general, the number of models
we have to train for $m$ predictors is $2^m-1$; in other words, when we
-get to $10$ predictors we have over *one thousand* models to train, and
-at $20$ predictors we have over *one million* models to train!
+get to 10 predictors we have over *one thousand* models to train, and
+at 20 predictors we have over *one million* models to train!
So although it is a simple method, best subset selection is usually too computationally
expensive to use in practice.
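As a quick sanity check of that count (an illustrative snippet, not part of the book's source), enumerating every non-empty subset of $m$ predictors with `itertools` reproduces the $2^m-1$ figures quoted above:

```python
from itertools import combinations

def n_candidate_models(m):
    # count the non-empty subsets of m predictors
    return sum(1 for k in range(1, m + 1) for _ in combinations(range(m), k))

print(n_candidate_models(2))   # 3
print(n_candidate_models(3))   # 7
print(n_candidate_models(10))  # 1023  (over one thousand)
print(n_candidate_models(20))  # 1048575  (over one million)
```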

@@ -1835,8 +1828,8 @@ This pattern continues for as many iterations as you want. If you run the method
all the way until you run out of predictors to choose, you will end up training
$\frac{1}{2}m(m+1)$ separate models. This is a *big* improvement from the $2^m-1$
models that best subset selection requires you to train! For example, while best subset selection requires
-training over 1000 candidate models with $m=10$ predictors, forward selection requires training only 55 candidate models.
-Therefore we will continue the rest of this section using forward selection.
+training over 1000 candidate models with 10 predictors, forward selection requires training only 55 candidate models.
+Therefore we will continue the rest of this section using forward selection.
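For concreteness (again just an illustrative check, not from the book's source), comparing the two counts side by side shows how quickly the gap grows:

```python
# best subset selection trains 2^m - 1 models; forward selection trains m(m + 1)/2
for m in [3, 10, 20]:
    print(m, 2**m - 1, m * (m + 1) // 2)
# 3 7 6
# 10 1023 55
# 20 1048575 210
```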

```{note}
One word of caution before we move on. Every additional model that you train
@@ -1856,31 +1849,9 @@ where to learn more about advanced predictor selection methods.
### Forward selection in `scikit-learn`

We now turn to implementing forward selection in Python.
-The function [`SequentialFeatureSelector`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html)
-in the `scikit-learn` can automate this for us, and a simple demo is shown below. However, for
-the learning purpose, we also want to show how each predictor is selected over iterations,
-so we will have to code it ourselves.
-
-+++
-
-First we will extract the "total" set of predictors that we are willing to work with.
-Here we will load the modified version of the cancer data with irrelevant
-predictors, and select `Smoothness`, `Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`
-as potential predictors, and the `Class` variable as the label.
-We will also extract the column names for the full set of predictor variables.
-
-```{code-cell} ipython3
-:tags: [remove-cell]
-
-# We now turn to implementing forward selection in Python.
-# Unfortunately there is no built-in way to do this using the `tidymodels` framework,
-# so we will have to code it ourselves. First we will use the `select` function
-# to extract the "total" set of predictors that we are willing to work with.
-# Here we will load the modified version of the cancer data with irrelevant
-# predictors, and select `Smoothness`, `Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`
-# as potential predictors, and the `Class` variable as the label.
-# We will also extract the column names for the full set of predictor variables.
-```
+First we will extract a smaller set of predictors to work with in this illustrative example&mdash;`Smoothness`,
+`Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`&mdash;as well as the `Class` variable as the label.
+We will also extract the column names for the full set of predictors.

```{code-cell} ipython3
cancer_subset = cancer_irrelevant[
@@ -1902,151 +1873,79 @@ names = list(cancer_subset.drop(
cancer_subset
```

-```{code-cell} ipython3
-:tags: []
-
-# Using scikit-learn SequentialFeatureSelector
-from sklearn.feature_selection import SequentialFeatureSelector
-cancer_preprocessor = make_column_transformer(
-    (
-        StandardScaler(),
-        list(cancer_subset.drop(columns=["Class"]).columns),
-    ),
-)
-
-cancer_pipe_forward = make_pipeline(
-    cancer_preprocessor,
-    SequentialFeatureSelector(KNeighborsClassifier(), direction="forward"),
-    KNeighborsClassifier(),
-)
-
-X = cancer_subset.drop(columns=["Class"])
-y = cancer_subset["Class"]
-
-cancer_pipe_forward.fit(X, y)
-
-cancer_pipe_forward.named_steps["sequentialfeatureselector"].n_features_to_select_
-```
-
-```{code-cell} ipython3
-:tags: [remove-cell]
-
-glue(
-    "sequentialfeatureselector_n_features",
-    "{:d}".format(cancer_pipe_forward.named_steps["sequentialfeatureselector"].n_features_to_select_),
-)
-```
-
-This means that {glue:text}`sequentialfeatureselector_n_features` features were selected according to the forward selection algorithm.
-
-+++
-
-Now, let's code the actual algorithm by ourselves. The key idea of the forward
-selection code is to properly extract each subset of predictors for which we
-want to build a model, pass them to the preprocessor and fit the pipeline with
-them.
-
-```{code-cell} ipython3
-:tags: [remove-cell]
-
-# The key idea of the forward selection code is to use the `paste` function (which concatenates strings
-# separated by spaces) to create a model formula for each subset of predictors for which we want to build a model.
-# The `collapse` argument tells `paste` what to put between the items in the list;
-# to make a formula, we need to put a `+` symbol between each variable.
-# As an example, let's make a model formula for all the predictors,
-# which should output something like
-# `Class ~ Smoothness + Concavity + Perimeter + Irrelevant1 + Irrelevant2 + Irrelevant3`:
-```
-
-Finally, we need to write some code that performs the task of sequentially
-finding the best predictor to add to the model.
+To perform forward selection, we could use the
+[`SequentialFeatureSelector`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html)
+from `scikit-learn`, but it is difficult to combine this approach with parameter tuning to find a good number of neighbors
+for each set of features. Instead we will code the forward selection algorithm manually.
+In particular, we need code that tries adding each available predictor to the model, finds the best one, and iterates.
If you recall the end of the wrangling chapter, we mentioned
that sometimes one needs more flexible forms of iteration than what
we have used earlier, and in these cases one typically resorts to
-a *for loop*; see [the section on control flow (for loops)](https://wesmckinney.com/book/python-basics.html#control_for) in *Python for Data Analysis* {cite:p}`mckinney2012python`.
-Here we will use two for loops:
-one over increasing predictor set sizes
+a *for loop*; see
+the [control flow section](https://wesmckinney.com/book/python-basics.html#control_for) in
+*Python for Data Analysis* {cite:p}`mckinney2012python`.
+Here we will use two for loops: one over increasing predictor set sizes
(where you see `for i in range(1, n_total + 1):` below),
and another to check which predictor to add in each round (where you see `for j in range(len(names))` below).
For each set of predictors to try, we extract the subset of predictors,
pass it into a preprocessor, build a `Pipeline` that tunes
-a $K$-NN classifier using 10-fold cross-validation,
+a K-NN classifier using 10-fold cross-validation,
and finally record the estimated accuracy.
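As an aside before the manual implementation below, here is a minimal sketch of the `SequentialFeatureSelector` route mentioned above, adapted from the code removed in this commit (it assumes the `scikit-learn` imports used earlier in the chapter). Note that it selects features for a K-NN classifier with a fixed number of neighbors rather than re-tuning $K$ for each candidate set of predictors, which is exactly the limitation discussed above:

```python
from sklearn.feature_selection import SequentialFeatureSelector

# scale every candidate predictor, then greedily add features for a K-NN classifier
sfs_preprocessor = make_column_transformer(
    (StandardScaler(), list(cancer_subset.drop(columns=["Class"]).columns)),
)
sfs_pipe = make_pipeline(
    sfs_preprocessor,
    SequentialFeatureSelector(KNeighborsClassifier(), direction="forward"),
    KNeighborsClassifier(),
)

X = cancer_subset.drop(columns=["Class"])
y = cancer_subset["Class"]
sfs_pipe.fit(X, y)

# number of features chosen by forward selection (with default settings)
sfs_pipe.named_steps["sequentialfeatureselector"].n_features_to_select_
```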

```{code-cell} ipython3
-:tags: [remove-cell]
+from sklearn.compose import make_column_selector

-# Finally, we need to write some code that performs the task of sequentially
-# finding the best predictor to add to the model.
-# If you recall the end of the wrangling chapter, we mentioned
-# that sometimes one needs more flexible forms of iteration than what
-# we have used earlier, and in these cases one typically resorts to
-# a *for loop*; see [the chapter on iteration](https://r4ds.had.co.nz/iteration.html) in *R for Data Science* [@wickham2016r].
-# Here we will use two for loops:
-# one over increasing predictor set sizes
-# (where you see `for (i in 1:length(names))` below),
-# and another to check which predictor to add in each round (where you see `for (j in 1:length(names))` below).
-# For each set of predictors to try, we construct a model formula,
-# pass it into a `recipe`, build a `workflow` that tunes
-# a $K$-NN classifier using 5-fold cross-validation,
-# and finally records the estimated accuracy.
-```
-
-```{code-cell} ipython3
accuracy_dict = {"size": [], "selected_predictors": [], "accuracy": []}

# store the total number of predictors
n_total = len(names)

+# start with an empty list of selected predictors
selected = []

+
+# create the pipeline and CV grid search objects
+param_grid = {
+    "kneighborsclassifier__n_neighbors": range(1, 61, 5),
+}
+cancer_preprocessor = make_column_transformer(
+    (StandardScaler(), make_column_selector(dtype_include="number"))
+)
+cancer_tune_pipe = make_pipeline(cancer_preprocessor, KNeighborsClassifier())
+cancer_tune_grid = GridSearchCV(
+    estimator=cancer_tune_pipe,
+    param_grid=param_grid,
+    cv=10,
+    n_jobs=-1
+)
+
# for every possible number of predictors
for i in range(1, n_total + 1):
-    accs = []
-    models = []
+    accs = np.zeros(len(names))
+    # for every possible predictor to add
    for j in range(len(names)):
-        # create the preprocessor and pipeline with specified set of predictors
-        cancer_preprocessor = make_column_transformer(
-            (StandardScaler(), selected + [names[j]]),
-        )
-        cancer_tune_pipe = make_pipeline(cancer_preprocessor, KNeighborsClassifier())
-        # tune the KNN classifier with these predictors,
-        # and collect the accuracy for the best K
-        param_grid = {
-            "kneighborsclassifier__n_neighbors": range(1, 61, 5),
-        } ## double check
-
-        cancer_tune_grid = GridSearchCV(
-            estimator=cancer_tune_pipe,
-            param_grid=param_grid,
-            cv=10, ## double check
-            n_jobs=-1,
-            # return_train_score=True,
-        )
-
+        # Add remaining predictor j to the model
        X = cancer_subset[selected + [names[j]]]
        y = cancer_subset["Class"]
-
+
+        # Find the best K for this set of predictors
        cancer_model_grid = cancer_tune_grid.fit(X, y)
        accuracies_grid = pd.DataFrame(cancer_model_grid.cv_results_)
-        sorted_accuracies = accuracies_grid.sort_values(
-            by="mean_test_score", ascending=False
-        )

-        res = sorted_accuracies.iloc[0, :]
-        accs.append(res["mean_test_score"])
-        models.append(
-            selected + [names[j]]
-        ) # (res["param_kneighborsclassifier__n_neighbors"]) ## if want to know the best selection of K
-    # get the best selection of (newly added) feature which maximizes cv accuracy
-    best_set = models[accs.index(max(accs))]
+        # Store the tuned accuracy for this set of predictors
+        accs[j] = accuracies_grid["mean_test_score"].max()
+
+    # get the best new set of predictors that maximize cv accuracy
+    best_set = selected + [names[accs.argmax()]]

+    # store the results for this round of forward selection
    accuracy_dict["size"].append(i)
    accuracy_dict["selected_predictors"].append(", ".join(best_set))
-    accuracy_dict["accuracy"].append(max(accs))
+    accuracy_dict["accuracy"].append(accs.max())

+    # update the selected & available sets of predictors
    selected = best_set
-    del names[accs.index(max(accs))]
+    del names[accs.argmax()]

accuracies = pd.DataFrame(accuracy_dict)
accuracies
@@ -2103,8 +2002,6 @@ part of tuning your classifier, you *cannot use your test data* for this
process!
```

--->
-
## Exercises

Practice exercises for the material covered in this chapter
