
Commit 18b984f

consistent usage of fit in clsfn2
1 parent c3105b8 commit 18b984f

File tree: 1 file changed (+25 −30)

source/classification2.md (25 additions, 30 deletions)
@@ -594,9 +594,10 @@ knn = KNeighborsClassifier(n_neighbors=3)
 X = cancer_train[["Smoothness", "Concavity"]]
 y = cancer_train["Class"]
 
-knn_fit = make_pipeline(cancer_preprocessor, knn).fit(X, y)
+knn_pipeline = make_pipeline(cancer_preprocessor, knn)
+knn_pipeline.fit(X, y)
 
-knn_fit
+knn_pipeline
 ```
 
 ### Predict the labels in the test set
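A note on why this rename is safe: scikit-learn estimators follow the convention that `fit` returns `self`, so the old `knn_fit` and the new `knn_pipeline` were always the same object; the new name just stops implying that `fit` produces something separate. A minimal stdlib-only sketch of that convention (the `TinyKNN` class is hypothetical, not from the book):

```python
class TinyKNN:
    """Toy estimator following scikit-learn's fit-returns-self convention."""

    def __init__(self, n_neighbors=3):
        self.n_neighbors = n_neighbors
        self.X_ = None
        self.y_ = None

    def fit(self, X, y):
        # store the training data and return self, as scikit-learn estimators do
        self.X_ = X
        self.y_ = y
        return self

    def predict(self, X):
        # trivial 1-NN by absolute distance on 1-D data, for illustration only
        return [
            min(zip(self.X_, self.y_), key=lambda pair: abs(pair[0] - x))[1]
            for x in X
        ]


model = TinyKNN()
fitted = model.fit([1.0, 2.0, 10.0], ["A", "A", "B"])
print(fitted is model)      # True: fit hands back the same object
print(model.predict([9.0])) # ['B']
```

Because `fit` returns the estimator itself, `pipeline = make_pipeline(...).fit(X, y)` and the two-line form in the diff are equivalent; the commit only changes which reading the variable name suggests.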
@@ -614,7 +615,7 @@ variables in the output data frame.
 
 ```{code-cell} ipython3
 cancer_test_predictions = cancer_test.assign(
-    predicted = knn_fit.predict(cancer_test[["Smoothness", "Concavity"]])
+    predicted = knn_pipeline.predict(cancer_test[["Smoothness", "Concavity"]])
 )
 cancer_test_predictions[["ID", "Class", "predicted"]]
 ```
@@ -645,7 +646,7 @@ for the predictors that we originally passed into `predict` when making predicti
 and we provide the actual labels via the `cancer_test["Class"]` series.
 
 ```{code-cell} ipython3
-cancer_acc_1 = knn_fit.score(
+cancer_acc_1 = knn_pipeline.score(
     cancer_test[["Smoothness", "Concavity"]],
     cancer_test["Class"]
 )
@@ -662,11 +663,9 @@ glue("cancer_acc_1", "{:0.0f}".format(100*cancer_acc_1))
 
 The output shows that the estimated accuracy of the classifier on the test data
 was {glue:text}`cancer_acc_1`%.
-We can also look at the *confusion matrix* for the classifier
-using the `crosstab` function from `pandas`. A confusion matrix shows how many
-observations of each (actual) label were classified as each (predicted) label.
-The `crosstab` function
-takes two arguments: the actual labels first, then the predicted labels second.
+We can also look at the *confusion matrix* for the classifier
+using the `crosstab` function from `pandas`. The `crosstab` function takes two
+arguments: the actual labels first, then the predicted labels second.
 
 ```{code-cell} ipython3
 pd.crosstab(
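For readers skimming the diff: the sentence trimmed above defined a confusion matrix as a count of how many observations of each actual label received each predicted label. A stdlib-only sketch of that tally (the book itself uses `pd.crosstab`; the `confusion_counts` helper below is hypothetical):

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Tally (actual, predicted) label pairs, like a tiny pd.crosstab."""
    return Counter(zip(actual, predicted))

actual    = ["Benign", "Benign", "Malignant", "Malignant", "Malignant"]
predicted = ["Benign", "Malignant", "Malignant", "Malignant", "Benign"]

# each key is an (actual, predicted) cell of the confusion matrix
for (a, p), n in sorted(confusion_counts(actual, predicted).items()):
    print(f"actual={a:<9} predicted={p:<9} count={n}")
```

The argument order matters for reading the table: actual labels first, predicted labels second, matching the `pd.crosstab` call in the hunk above.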
@@ -884,10 +883,11 @@ cancer_subtrain, cancer_validation = train_test_split(
 knn = KNeighborsClassifier(n_neighbors=3)
 X = cancer_subtrain[["Smoothness", "Concavity"]]
 y = cancer_subtrain["Class"]
-knn_fit = make_pipeline(cancer_preprocessor, knn).fit(X, y)
+knn_pipeline = make_pipeline(cancer_preprocessor, knn)
+knn_pipeline.fit(X, y)
 
 # compute the score on validation data
-acc = knn_fit.score(
+acc = knn_pipeline.score(
     cancer_validation[["Smoothness", "Concavity"]],
     cancer_validation["Class"]
 )
@@ -908,10 +908,10 @@ for i in range(1, 5):
     knn = KNeighborsClassifier(n_neighbors=3)
     X = cancer_subtrain[["Smoothness", "Concavity"]]
     y = cancer_subtrain["Class"]
-    knn_fit = make_pipeline(cancer_preprocessor, knn).fit(X, y)
+    knn_pipeline = make_pipeline(cancer_preprocessor, knn).fit(X, y)
 
     # compute the score on validation data
-    accuracies.append(knn_fit.score(
+    accuracies.append(knn_pipeline.score(
         cancer_validation[["Smoothness", "Concavity"]],
         cancer_validation["Class"]
     ))
@@ -979,7 +979,6 @@ Since the `cross_validate` function outputs a dictionary, we use `pd.DataFrame`
 dataframe for better visualization.
 Note that the `cross_validate` function handles stratifying the classes in
 each train and validate fold automatically.
-We begin by importing the `cross_validate` function from `sklearn`.
 
 ```{code-cell} ipython3
 from sklearn.model_selection import cross_validate
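For context on what `cross_validate` automates: the data is split into folds, each fold is held out once as validation data, and the per-fold scores are averaged. A stdlib-only sketch of the fold bookkeeping (the `kfold_indices` helper is hypothetical, not the sklearn API):

```python
def kfold_indices(n, k):
    """Yield (train_idx, validation_idx) pairs for k roughly equal folds."""
    # distribute the remainder so fold sizes differ by at most one
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, validation
        start += size

# 7 observations split into 3 folds: each index is validated exactly once
for train, validation in kfold_indices(7, 3):
    print(train, validation)
```

The real `cross_validate` additionally stratifies the folds by class, as the retained prose above notes; this sketch only shows the hold-each-fold-out-once structure.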
@@ -1183,17 +1182,14 @@ format. We will wrap it in a `pd.DataFrame` to make it easier to understand,
 and print the `info` of the result.
 
 ```{code-cell} ipython3
-accuracies_grid = pd.DataFrame(
-    cancer_tune_grid.fit(
-        cancer_train[["Smoothness", "Concavity"]],
-        cancer_train["Class"]
-    ).cv_results_
+cancer_tune_grid.fit(
+    cancer_train[["Smoothness", "Concavity"]],
+    cancer_train["Class"]
 )
-```
-
-```{code-cell} ipython3
+accuracies_grid = pd.DataFrame(cancer_tune_grid.cv_results_)
 accuracies_grid.info()
 ```
+
 There is a lot of information to look at here, but we are most interested
 in three quantities: the number of neighbors (`param_kneighbors_classifier__n_neighbors`),
 the cross-validation accuracy estimate (`mean_test_score`),
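The pattern the hunk above settles on (fit first, then read `cv_results_`) works because a fitted grid search stores its results as a dict of parallel lists, one entry per candidate parameter setting. A stdlib-only sketch of that shape and of pulling out the best `mean_test_score` (the numbers below are made up for illustration):

```python
# hypothetical cv_results_-shaped dict: parallel lists, one slot per candidate
cv_results = {
    "param_kneighborsclassifier__n_neighbors": [1, 6, 11, 16],
    "mean_test_score": [0.84, 0.88, 0.87, 0.85],
    "std_test_score": [0.03, 0.02, 0.02, 0.04],
}

# index of the candidate with the best mean cross-validation accuracy
best_idx = max(
    range(len(cv_results["mean_test_score"])),
    key=cv_results["mean_test_score"].__getitem__,
)
best_k = cv_results["param_kneighborsclassifier__n_neighbors"][best_idx]
print(best_k, cv_results["mean_test_score"][best_idx])  # 6 0.88
```

Wrapping the dict in `pd.DataFrame`, as the new code does, turns these parallel lists into columns, which is why the book can then call `.info()` or chart `mean_test_score` against the neighbors parameter.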
@@ -1303,13 +1299,13 @@ large_cancer_tune_grid = GridSearchCV(
     cv=10
 )
 
-large_accuracies_grid = pd.DataFrame(
-    large_cancer_tune_grid.fit(
-        cancer_train[["Smoothness", "Concavity"]],
-        cancer_train["Class"]
-    ).cv_results_
+large_cancer_tune_grid.fit(
+    cancer_train[["Smoothness", "Concavity"]],
+    cancer_train["Class"]
 )
 
+large_accuracies_grid = pd.DataFrame(large_cancer_tune_grid.cv_results_)
+
 large_accuracy_vs_k = alt.Chart(large_accuracies_grid).mark_line(point=True).encode(
     x=alt.X("param_kneighborsclassifier__n_neighbors").title("Neighbors"),
     y=alt.Y("mean_test_score")
@@ -1903,7 +1899,6 @@ n_total = len(names)
 # start with an empty list of selected predictors
 selected = []
 
-
 # create the pipeline and CV grid search objects
 param_grid = {
     "kneighborsclassifier__n_neighbors": range(1, 61, 5),
@@ -1929,8 +1924,8 @@ for i in range(1, n_total + 1):
     y = cancer_subset["Class"]
 
     # Find the best K for this set of predictors
-    cancer_model_grid = cancer_tune_grid.fit(X, y)
-    accuracies_grid = pd.DataFrame(cancer_model_grid.cv_results_)
+    cancer_tune_grid.fit(X, y)
+    accuracies_grid = pd.DataFrame(cancer_tune_grid.cv_results_)
 
     # Store the tuned accuracy for this set of predictors
     accs[j] = accuracies_grid["mean_test_score"].max()
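The loop this final hunk sits in performs forward selection: repeatedly try adding each remaining predictor, score each candidate set, and keep the best. A stdlib-only sketch of that control flow (the `score` function and its toy values are made up; the book scores each candidate set with tuned KNN cross-validation accuracy):

```python
def forward_select(names, score_fn):
    """Greedy forward selection: grow the predictor set one best addition at a time."""
    selected = []
    best_overall = (float("-inf"), [])
    remaining = list(names)
    while remaining:
        # score every one-predictor extension of the current set
        scored = [(score_fn(selected + [n]), n) for n in remaining]
        best_score, best_name = max(scored)
        selected.append(best_name)
        remaining.remove(best_name)
        # remember the best-scoring set seen at any size
        if best_score > best_overall[0]:
            best_overall = (best_score, list(selected))
    return best_overall

# toy score: pretend {"Smoothness", "Concavity"} is the ideal predictor set,
# with a small penalty for irrelevant predictors
ideal = {"Smoothness", "Concavity"}
score = lambda subset: len(ideal & set(subset)) - 0.1 * len(set(subset) - ideal)

print(forward_select(["Smoothness", "Concavity", "Symmetry"], score))
```

Note how the sketch mirrors the diff: the grid-search object is reused across iterations, so only the freshly computed accuracy table needs to be kept, which is exactly what dropping the `cancer_model_grid` alias acknowledges.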
