Commit a829e72

resample to sample in clsfcn1 to align with inference
1 parent eccd5df commit a829e72

1 file changed, +8 -10 lines changed

source/classification1.md

Lines changed: 8 additions & 10 deletions
@@ -49,7 +49,7 @@ By the end of the chapter, readers will be able to do the following:
 - Explain the $K$-nearest neighbor classification algorithm.
 - Perform $K$-nearest neighbor classification in Python using `scikit-learn`.
 - Use `StandardScaler` and `make_column_transformer` to preprocess data to be centered and scaled.
-- Use `resample` to preprocess data to be balanced.
+- Use `sample` to preprocess data to be balanced.
 - Combine preprocessing and model training using `make_pipeline`.
 
 +++
@@ -1600,7 +1600,7 @@ Imbalanced data with background color indicating the decision of the classifier
 
 +++
 
-```{index} oversampling, scikit-learn; resample
+```{index} oversampling, scikit-learn; sample
 ```
 
 Despite the simplicity of the problem, solving it in a statistically sound manner is actually
@@ -1610,11 +1610,11 @@ In other words, we will replicate rare observations multiple times in our data s
 voting power in the $K$-nearest neighbor algorithm. In order to do this, we will
 first separate the classes out into their own data frames by filtering.
 Then, we will
-use the [`resample`](https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html) function
-from the `sklearn` package to increase the number of `Malignant` observations to be the same as the number
-of `Benign` observations. We set the `n_samples` argument to be the number of `Malignant` observations we want.
+use the `sample` method on the rare class data frame to increase the number of `Malignant` observations to be the same as the number
+of `Benign` observations. We set the `n` argument to be the number of `Malignant` observations we want, and set `replace=True`
+to indicate that we are sampling with replacement.
 Finally, we use the `value_counts` method to see that our classes are now balanced.
-Note that `resample` picks which data to replicate *randomly*; we will learn more about properly handling randomness
+Note that `sample` picks which data to replicate *randomly*; we will learn more about properly handling randomness
 in data analysis in {numref}`Chapter %s <classification2>`.
 
 ```{code-cell} ipython3
@@ -1626,12 +1626,10 @@ np.random.seed(1)
 ```
 
 ```{code-cell} ipython3
-from sklearn.utils import resample
-
 malignant_cancer = rare_cancer[rare_cancer["Class"] == "Malignant"]
 benign_cancer = rare_cancer[rare_cancer["Class"] == "Benign"]
-malignant_cancer_upsample = resample(
-    malignant_cancer, n_samples=benign_cancer.shape[0]
+malignant_cancer_upsample = malignant_cancer.sample(
+    n=benign_cancer.shape[0], replace=True
 )
 upsampled_cancer = pd.concat((malignant_cancer_upsample, benign_cancer))
 upsampled_cancer["Class"].value_counts()
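
Review note: the new `sample`-based cell can be tried outside the chapter with a small stand-in data frame. The following is only a minimal sketch under stated assumptions: the toy `rare_cancer` data, its `Perimeter` values, and the `random_state=1` argument are illustrative and not part of this commit, which relies on the chapter's data and its earlier `np.random.seed(1)` call instead.

```python
import pandas as pd

# Hypothetical stand-in for the chapter's `rare_cancer` data frame:
# 3 Malignant rows and 9 Benign rows (values are made up for illustration).
rare_cancer = pd.DataFrame({
    "Perimeter": [1.2, 0.9, 1.5, -0.3, -0.5, 0.1, -0.8, 0.0, -0.2, -0.6, 0.3, -0.4],
    "Class": ["Malignant"] * 3 + ["Benign"] * 9,
})

malignant_cancer = rare_cancer[rare_cancer["Class"] == "Malignant"]
benign_cancer = rare_cancer[rare_cancer["Class"] == "Benign"]

# Draw as many Malignant rows as there are Benign rows. replace=True is
# needed because n is larger than the number of Malignant rows available.
# random_state is an assumption added here so the sketch is reproducible.
malignant_cancer_upsample = malignant_cancer.sample(
    n=benign_cancer.shape[0], replace=True, random_state=1
)

upsampled_cancer = pd.concat((malignant_cancer_upsample, benign_cancer))
class_counts = upsampled_cancer["Class"].value_counts()
print(class_counts)
```

One behavioral difference is worth checking when reviewing this change: `sklearn.utils.resample` samples with replacement by default, whereas `DataFrame.sample` defaults to `replace=False` and raises a `ValueError` if `n` exceeds the number of rows. This is why the added `replace=True` is required rather than optional.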
