Commit ab4ec94

predictor selection works now
1 parent a3f785d commit ab4ec94

File tree

1 file changed (+45, −139 lines)

source/classification2.md

Lines changed: 45 additions & 139 deletions
@@ -1849,31 +1849,9 @@ where to learn more about advanced predictor selection methods.
 ### Forward selection in `scikit-learn`
 
 We now turn to implementing forward selection in Python.
-The function [`SequentialFeatureSelector`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html)
-in the `scikit-learn` can automate this for us, and a simple demo is shown below. However, for
-the learning purpose, we also want to show how each predictor is selected over iterations,
-so we will have to code it ourselves.
-
-+++
-
-First we will extract the "total" set of predictors that we are willing to work with.
-Here we will load the modified version of the cancer data with irrelevant
-predictors, and select `Smoothness`, `Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`
-as potential predictors, and the `Class` variable as the label.
-We will also extract the column names for the full set of predictor variables.
-
-```{code-cell} ipython3
-:tags: [remove-cell]
-
-# We now turn to implementing forward selection in Python.
-# Unfortunately there is no built-in way to do this using the `tidymodels` framework,
-# so we will have to code it ourselves. First we will use the `select` function
-# to extract the "total" set of predictors that we are willing to work with.
-# Here we will load the modified version of the cancer data with irrelevant
-# predictors, and select `Smoothness`, `Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`
-# as potential predictors, and the `Class` variable as the label.
-# We will also extract the column names for the full set of predictor variables.
-```
+First we will extract a smaller set of predictors to work with in this illustrative example—`Smoothness`,
+`Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`—as well as the `Class` variable as the label.
+We will also extract the column names for the full set of predictors.
 
 ```{code-cell} ipython3
 cancer_subset = cancer_irrelevant[
@@ -1895,151 +1873,79 @@ names = list(cancer_subset.drop(
 cancer_subset
 ```
 
-```{code-cell} ipython3
-:tags: []
-
-# Using scikit-learn SequentialFeatureSelector
-from sklearn.feature_selection import SequentialFeatureSelector
-cancer_preprocessor = make_column_transformer(
-    (
-        StandardScaler(),
-        list(cancer_subset.drop(columns=["Class"]).columns),
-    ),
-)
-
-cancer_pipe_forward = make_pipeline(
-    cancer_preprocessor,
-    SequentialFeatureSelector(KNeighborsClassifier(), direction="forward"),
-    KNeighborsClassifier(),
-)
-
-X = cancer_subset.drop(columns=["Class"])
-y = cancer_subset["Class"]
-
-cancer_pipe_forward.fit(X, y)
-
-cancer_pipe_forward.named_steps["sequentialfeatureselector"].n_features_to_select_
-```
-
-```{code-cell} ipython3
-:tags: [remove-cell]
-
-glue(
-    "sequentialfeatureselector_n_features",
-    "{:d}".format(cancer_pipe_forward.named_steps["sequentialfeatureselector"].n_features_to_select_),
-)
-```
-
-This means that {glue:text}`sequentialfeatureselector_n_features` features were selected according to the forward selection algorithm.
-
-+++
-
-Now, let's code the actual algorithm by ourselves. The key idea of the forward
-selection code is to properly extract each subset of predictors for which we
-want to build a model, pass them to the preprocessor and fit the pipeline with
-them.
-
-```{code-cell} ipython3
-:tags: [remove-cell]
-
-# The key idea of the forward selection code is to use the `paste` function (which concatenates strings
-# separated by spaces) to create a model formula for each subset of predictors for which we want to build a model.
-# The `collapse` argument tells `paste` what to put between the items in the list;
-# to make a formula, we need to put a `+` symbol between each variable.
-# As an example, let's make a model formula for all the predictors,
-# which should output something like
-# `Class ~ Smoothness + Concavity + Perimeter + Irrelevant1 + Irrelevant2 + Irrelevant3`:
-```
-
-Finally, we need to write some code that performs the task of sequentially
-finding the best predictor to add to the model.
+To perform forward selection, we could use the
+[`SequentialFeatureSelector`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html)
+from `scikit-learn`; but it is difficult to combine this approach with parameter tuning to find a good number of neighbors
+for each set of features. Instead we will code the forward selection algorithm manually.
+In particular, we need code that tries adding each available predictor to a model, finding the best, and iterating.
 If you recall the end of the wrangling chapter, we mentioned
 that sometimes one needs more flexible forms of iteration than what
 we have used earlier, and in these cases one typically resorts to
-a *for loop*; see [the section on control flow (for loops)](https://wesmckinney.com/book/python-basics.html#control_for) in *Python for Data Analysis* {cite:p}`mckinney2012python`.
-Here we will use two for loops:
-one over increasing predictor set sizes
+a *for loop*; see
+the [control flow section](https://wesmckinney.com/book/python-basics.html#control_for) in
+*Python for Data Analysis* {cite:p}`mckinney2012python`.
+Here we will use two for loops: one over increasing predictor set sizes
 (where you see `for i in range(1, n_total + 1):` below),
 and another to check which predictor to add in each round (where you see `for j in range(len(names))` below).
 For each set of predictors to try, we extract the subset of predictors,
 pass it into a preprocessor, build a `Pipeline` that tunes
-a $K$-NN classifier using 10-fold cross-validation,
+a K-NN classifier using 10-fold cross-validation,
 and finally records the estimated accuracy.
 
 ```{code-cell} ipython3
-:tags: [remove-cell]
+from sklearn.compose import make_column_selector
 
-# Finally, we need to write some code that performs the task of sequentially
-# finding the best predictor to add to the model.
-# If you recall the end of the wrangling chapter, we mentioned
-# that sometimes one needs more flexible forms of iteration than what
-# we have used earlier, and in these cases one typically resorts to
-# a *for loop*; see [the chapter on iteration](https://r4ds.had.co.nz/iteration.html) in *R for Data Science* [@wickham2016r].
-# Here we will use two for loops:
-# one over increasing predictor set sizes
-# (where you see `for (i in 1:length(names))` below),
-# and another to check which predictor to add in each round (where you see `for (j in 1:length(names))` below).
-# For each set of predictors to try, we construct a model formula,
-# pass it into a `recipe`, build a `workflow` that tunes
-# a $K$-NN classifier using 5-fold cross-validation,
-# and finally records the estimated accuracy.
-```
-
-```{code-cell} ipython3
 accuracy_dict = {"size": [], "selected_predictors": [], "accuracy": []}
 
 # store the total number of predictors
 n_total = len(names)
 
+# start with an empty list of selected predictors
 selected = []
 
+
+# create the pipeline and CV grid search objects
+param_grid = {
+    "kneighborsclassifier__n_neighbors": range(1, 61, 5),
+}
+cancer_preprocessor = make_column_transformer(
+    (StandardScaler(), make_column_selector(dtype_include="number"))
+)
+cancer_tune_pipe = make_pipeline(cancer_preprocessor, KNeighborsClassifier())
+cancer_tune_grid = GridSearchCV(
+    estimator=cancer_tune_pipe,
+    param_grid=param_grid,
+    cv=10,
+    n_jobs=-1
+)
+
 # for every possible number of predictors
 for i in range(1, n_total + 1):
-    accs = []
-    models = []
+    accs = np.zeros(len(names))
+    # for every possible predictor to add
     for j in range(len(names)):
-        # create the preprocessor and pipeline with specified set of predictors
-        cancer_preprocessor = make_column_transformer(
-            (StandardScaler(), selected + [names[j]]),
-        )
-        cancer_tune_pipe = make_pipeline(cancer_preprocessor, KNeighborsClassifier())
-        # tune the KNN classifier with these predictors,
-        # and collect the accuracy for the best K
-        param_grid = {
-            "kneighborsclassifier__n_neighbors": range(1, 61, 5),
-        } ## double check
-
-        cancer_tune_grid = GridSearchCV(
-            estimator=cancer_tune_pipe,
-            param_grid=param_grid,
-            cv=10, ## double check
-            n_jobs=-1,
-            # return_train_score=True,
-        )
-
+        # Add remaining predictor j to the model
         X = cancer_subset[selected + [names[j]]]
         y = cancer_subset["Class"]
-
+
+        # Find the best K for this set of predictors
         cancer_model_grid = cancer_tune_grid.fit(X, y)
         accuracies_grid = pd.DataFrame(cancer_model_grid.cv_results_)
-        sorted_accuracies = accuracies_grid.sort_values(
-            by="mean_test_score", ascending=False
-        )
 
-        res = sorted_accuracies.iloc[0, :]
-        accs.append(res["mean_test_score"])
-        models.append(
-            selected + [names[j]]
-        ) # (res["param_kneighborsclassifier__n_neighbors"]) ## if want to know the best selection of K
-    # get the best selection of (newly added) feature which maximizes cv accuracy
-    best_set = models[accs.index(max(accs))]
+        # Store the tuned accuracy for this set of predictors
+        accs[j] = accuracies_grid["mean_test_score"].max()
+
+    # get the best new set of predictors that maximize cv accuracy
+    best_set = selected + [names[accs.argmax()]]
 
+    # store the results for this round of forward selection
     accuracy_dict["size"].append(i)
     accuracy_dict["selected_predictors"].append(", ".join(best_set))
-    accuracy_dict["accuracy"].append(max(accs))
+    accuracy_dict["accuracy"].append(accs.max())
 
+    # update the selected & available sets of predictors
     selected = best_set
-    del names[accs.index(max(accs))]
+    del names[accs.argmax()]
 
 accuracies = pd.DataFrame(accuracy_dict)
 accuracies
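
Note (illustrative sketch, not part of this commit): once the loop finishes, the `accuracies` data frame can be used to compare the estimated accuracy across predictor set sizes. A minimal sketch is below; it assumes the `accuracies` data frame built by the loop above and that the `altair` plotting library is installed.

```python
import altair as alt

# Sketch only: plot estimated accuracy against the number of predictors
# chosen at each round of forward selection, using the `accuracies`
# data frame produced by the loop above.
forward_selection_plot = (
    alt.Chart(accuracies)
    .mark_line(point=True)
    .encode(
        x=alt.X("size", title="Number of predictors"),
        y=alt.Y("accuracy", title="Estimated accuracy", scale=alt.Scale(zero=False)),
    )
)
forward_selection_plot
```

A plot like this makes it easier to see where adding further (possibly irrelevant) predictors stops improving the estimated accuracy.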

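For comparison (also not part of this commit), the `SequentialFeatureSelector` demo that this commit removes can be wrapped in `GridSearchCV`, but as the new prose notes, that only tunes a single number of neighbors for an entire selection run rather than re-tuning it for each candidate set of predictors. A rough sketch, assuming `cancer_subset` and the scikit-learn imports used earlier in the chapter:

```python
from sklearn.feature_selection import SequentialFeatureSelector

# Rough sketch only; `cancer_subset`, make_column_transformer, make_pipeline,
# StandardScaler, KNeighborsClassifier, and GridSearchCV are assumed to be
# available as in the surrounding chapter.
X = cancer_subset.drop(columns=["Class"])
y = cancer_subset["Class"]

preprocessor = make_column_transformer(
    (StandardScaler(), list(X.columns)),
)
sfs_pipe = make_pipeline(
    preprocessor,
    SequentialFeatureSelector(KNeighborsClassifier(), direction="forward"),
    KNeighborsClassifier(),
)

# Each grid point fixes one K for the selector's internal classifier and one K
# for the final classifier across the whole selection run, so K is not re-tuned
# for every candidate predictor set the way the manual loop above does.
sfs_grid = GridSearchCV(
    sfs_pipe,
    param_grid={
        "sequentialfeatureselector__estimator__n_neighbors": range(1, 61, 10),
        "kneighborsclassifier__n_neighbors": range(1, 61, 10),
    },
    cv=10,
    n_jobs=-1,
)
sfs_grid.fit(X, y)
```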