Commit 68309a9

DOC add more details regarding effect of kind_sel in ENN
1 parent b1d8d4f commit 68309a9

3 files changed: 29 additions, 4 deletions

doc/under_sampling.rst

Lines changed: 17 additions & 4 deletions
@@ -244,10 +244,7 @@ Edited data set using nearest neighbours
 "edit" the dataset by removing samples which do not agree "enough" with their
 neighboorhood :cite:`wilson1972asymptotic`. For each sample in the class to be
 under-sampled, the nearest-neighbours are computed and if the selection
-criterion is not fulfilled, the sample is removed. Two selection criteria are
-currently available: (i) the majority (i.e., ``kind_sel='mode'``) or (ii) all
-(i.e., ``kind_sel='all'``) the nearest-neighbors have to belong to the same
-class than the sample inspected to keep it in the dataset::
+criterion is not fulfilled, the sample is removed::
 
 >>> sorted(Counter(y).items())
 [(0, 64), (1, 262), (2, 4674)]
@@ -257,6 +254,22 @@ class than the sample inspected to keep it in the dataset::
 >>> print(sorted(Counter(y_resampled).items()))
 [(0, 64), (1, 213), (2, 4568)]
 
+Two selection criteria are currently available: (i) the majority (i.e.,
+``kind_sel='mode'``) or (ii) all (i.e., ``kind_sel='all'``) the
+nearest-neighbors have to belong to the same class than the sample inspected to
+keep it in the dataset. Thus, it implies that `kind_sel='all'` will be less
+conservative than `kind_sel='mode'`, and more samples will be excluded in
+the former strategy than the latest::
+
+>>> enn = EditedNearestNeighbours(kind_sel="all")
+>>> X_resampled, y_resampled = enn.fit_resample(X, y)
+>>> print(sorted(Counter(y_resampled).items()))
+[(0, 64), (1, 213), (2, 4568)]
+>>> enn = EditedNearestNeighbours(kind_sel="mode")
+>>> X_resampled, y_resampled = enn.fit_resample(X, y)
+>>> print(sorted(Counter(y_resampled).items()))
+[(0, 64), (1, 234), (2, 4666)]
+
 The parameter ``n_neighbors`` allows to give a classifier subclassed from
 ``KNeighborsMixin`` from scikit-learn to find the nearest neighbors and make
 the decision to keep a given sample or not.
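
Reviewer note: the difference between the two criteria described in this hunk can be sketched in plain Python. This is a toy illustration only, not imbalanced-learn's implementation. The function name `enn_keep_mask`, the 1-D data, and the brute-force neighbour search are all assumptions made for the example, and unlike the real sampler it edits every class rather than only the classes selected for under-sampling.

```python
from collections import Counter


def enn_keep_mask(X, y, n_neighbors=3, kind_sel="all"):
    """Toy 1-D edited-nearest-neighbours rule (hypothetical helper).

    Returns one boolean per sample: True if the sample would be kept.
    """
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Brute-force search: rank every other sample by distance to xi.
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: abs(X[j] - xi),
        )
        neighbour_labels = [y[j] for j in others[:n_neighbors]]
        if kind_sel == "all":
            # Keep only if every neighbour shares the sample's class.
            agrees = all(label == yi for label in neighbour_labels)
        elif kind_sel == "mode":
            # Keep if the most common neighbour class is the sample's class.
            agrees = Counter(neighbour_labels).most_common(1)[0][0] == yi
        else:
            raise ValueError(f"unknown kind_sel: {kind_sel!r}")
        keep.append(agrees)
    return keep


# Two clusters, with one class-1 point (4.6) sitting next to the class-0 cluster.
X = [0.0, 1.0, 2.1, 3.3, 4.6, 10.0, 11.1, 12.3, 13.6]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

keep_all = enn_keep_mask(X, y, kind_sel="all")
keep_mode = enn_keep_mask(X, y, kind_sel="mode")
print(sum(keep_all), sum(keep_mode))
```

In this toy data the sample at 3.3 has neighbour labels 0, 1, 0: the majority agrees with its class, so `"mode"` keeps it, while `"all"` drops it because one neighbour disagrees. That matches the direction of the counts shown in the doctest above, where `kind_sel="all"` removes more samples than `kind_sel="mode"`.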

imblearn/under_sampling/_prototype_selection/_edited_nearest_neighbours.py

Lines changed: 9 additions & 0 deletions
@@ -52,6 +52,9 @@ class EditedNearestNeighbours(BaseCleaningSampler):
 - If ``'mode'``, the majority vote of the neighbours will be used in
 order to exclude a sample.
 
+The strategy `"all"` will be less conservative than `'mode'`. Thus,
+more samples will be removed when `kind_sel="all"` generally.
+
 {n_jobs}
 
 Attributes
@@ -195,6 +198,9 @@ class RepeatedEditedNearestNeighbours(BaseCleaningSampler):
 - If ``'mode'``, the majority vote of the neighbours will be used in
 order to exclude a sample.
 
+The strategy `"all"` will be less conservative than `'mode'`. Thus,
+more samples will be removed when `kind_sel="all"` generally.
+
 {n_jobs}
 
 Attributes
@@ -373,6 +379,9 @@ class AllKNN(BaseCleaningSampler):
 - If ``'mode'``, the majority vote of the neighbours will be used in
 order to exclude a sample.
 
+The strategy `"all"` will be less conservative than `'mode'`. Thus,
+more samples will be removed when `kind_sel="all"` generally.
+
 allow_minority : bool, default=False
 If ``True``, it allows the majority classes to become the minority
 class without early stopping.

imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py

Lines changed: 3 additions & 0 deletions
@@ -50,6 +50,9 @@ class NeighbourhoodCleaningRule(BaseCleaningSampler):
 - If ``'mode'``, the majority vote of the neighbours will be used in
 order to exclude a sample.
 
+The strategy `"all"` will be less conservative than `'mode'`. Thus,
+more samples will be removed when `kind_sel="all"` generally.
+
 threshold_cleaning : float, default=0.5
 Threshold used to whether consider a class or not during the cleaning
 after applying ENN. A class will be considered during cleaning when:
