@@ -49,7 +49,7 @@ By the end of the chapter, readers will be able to do the following:
 - Explain the $K$-nearest neighbor classification algorithm.
 - Perform $K$-nearest neighbor classification in Python using `scikit-learn`.
 - Use `StandardScaler` and `make_column_transformer` to preprocess data to be centered and scaled.
-- Use `resample` to preprocess data to be balanced.
+- Use `sample` to preprocess data to be balanced.
 - Combine preprocessing and model training using `make_pipeline`.
 
 +++
@@ -1600,7 +1600,7 @@ Imbalanced data with background color indicating the decision of the classifier
 
 +++
 
-```{index} oversampling, scikit-learn; resample
+```{index} oversampling, scikit-learn; sample
 ```
 
 Despite the simplicity of the problem, solving it in a statistically sound manner is actually
@@ -1610,11 +1610,11 @@ In other words, we will replicate rare observations multiple times in our data s
 voting power in the $K$-nearest neighbor algorithm. In order to do this, we will
 first separate the classes out into their own data frames by filtering.
 Then, we will
-use the [`resample`](https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html) function
-from the `sklearn` package to increase the number of `Malignant` observations to be the same as the number
-of `Benign` observations. We set the `n_samples` argument to be the number of `Malignant` observations we want.
+use the `sample` method on the rare class data frame to increase the number of `Malignant` observations to be the same as the number
+of `Benign` observations. We set the `n` argument to be the number of `Malignant` observations we want, and set `replace=True`
+to indicate that we are sampling with replacement.
 Finally, we use the `value_counts` method to see that our classes are now balanced.
-Note that `resample` picks which data to replicate *randomly*; we will learn more about properly handling randomness
+Note that `sample` picks which data to replicate *randomly*; we will learn more about properly handling randomness
 in data analysis in {numref}`Chapter %s <classification2>`.
 
 ```{code-cell} ipython3
@@ -1626,12 +1626,10 @@ np.random.seed(1)
 ```
 
 ```{code-cell} ipython3
-from sklearn.utils import resample
-
 malignant_cancer = rare_cancer[rare_cancer["Class"] == "Malignant"]
 benign_cancer = rare_cancer[rare_cancer["Class"] == "Benign"]
-malignant_cancer_upsample = resample(
-    malignant_cancer, n_samples=benign_cancer.shape[0]
+malignant_cancer_upsample = malignant_cancer.sample(
+    n=benign_cancer.shape[0], replace=True
 )
 upsampled_cancer = pd.concat((malignant_cancer_upsample, benign_cancer))
 upsampled_cancer["Class"].value_counts()