Commit 7d5509c

DOC improve instance hardness threshold user guide (#1029)
1 parent ef2e75b commit 7d5509c

2 files changed: 30 additions and 10 deletions

doc/under_sampling.rst

Lines changed: 29 additions & 9 deletions
@@ -467,12 +467,32 @@ and the output a 3 nearest neighbors classifier. The class can be used as::
 
 .. _instance_hardness_threshold:
 
+Additional undersampling techniques
+-----------------------------------
+
 Instance hardness threshold
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-:class:`InstanceHardnessThreshold` is a specific algorithm in which a
-classifier is trained on the data and the samples with lower probabilities are
-removed :cite:`smith2014instance`. The class can be used as::
+**Instance Hardness** is a measure of how difficult it is to classify an instance or
+observation correctly. In other words, hard instances are observations that are hard to
+classify correctly.
+
+Fundamentally, instances that are hard to classify correctly are those for which the
+learning algorithm or classifier produces a low probability of predicting the correct
+class label.
+
+If we removed these hard instances from the dataset, the logic goes, we would help the
+classifier better identify the different classes :cite:`smith2014instance`.
+
+:class:`InstanceHardnessThreshold` trains a classifier on the data and then removes the
+samples with lower probabilities :cite:`smith2014instance`. Or in other words, it
+retains the observations with the higher class probabilities.
+
+In our implementation, :class:`InstanceHardnessThreshold` is (almost) a controlled
+under-sampling method: it will retain a specific number of observations of the target
+class(es), which is specified by the user (see caveat below).
+
+The class can be used as::
 
   >>> from sklearn.linear_model import LogisticRegression
   >>> from imblearn.under_sampling import InstanceHardnessThreshold
@@ -483,18 +503,18 @@ removed :cite:`smith2014instance`. The class can be used as::
   >>> print(sorted(Counter(y_resampled).items()))
   [(0, 64), (1, 64), (2, 64)]
 
-This class has 2 important parameters. ``estimator`` will accept any
-scikit-learn classifier which has a method ``predict_proba``. The classifier
-training is performed using a cross-validation and the parameter ``cv`` can set
-the number of folds to use.
+:class:`InstanceHardnessThreshold` has 2 important parameters. The parameter
+``estimator`` accepts any scikit-learn classifier with a method ``predict_proba``.
+This classifier will be used to identify the hard instances. The training is performed
+with cross-validation which can be specified through the parameter ``cv``.
 
 .. note::
 
    :class:`InstanceHardnessThreshold` could almost be considered as a
    controlled under-sampling method. However, due to the probability outputs, it
-   is not always possible to get a specific number of samples.
+   is not always possible to get the specified number of samples.
 
-The figure below gives another examples on some toy data.
+The figure below shows examples of instance hardness undersampling on a toy dataset.
 
 .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_006.png
    :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html
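
The procedure described in the revised user guide text can be sketched by hand with scikit-learn alone. This is a minimal illustration under stated assumptions, not the imbalanced-learn implementation: the toy dataset, the five-fold split, the decision to reduce the majority class to the minority count, and every variable name are choices made for this example only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Imbalanced two-class toy data (all parameters are illustrative assumptions).
X, y = make_classification(
    n_samples=1000, n_classes=2, weights=[0.9, 0.1],
    n_features=5, n_informative=3, n_redundant=0, random_state=0,
)

# Out-of-fold class probabilities: every sample is scored by a model
# that did not see it during training.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=cv, method="predict_proba"
)

# Probability assigned to each sample's true class.
# A low value marks a hard instance.
classes = np.unique(y)
true_class_proba = proba[np.arange(len(y)), np.searchsorted(classes, y)]

# Keep the n_keep easiest majority samples and every minority sample.
counts = np.bincount(y)
majority, minority = np.argmax(counts), np.argmin(counts)
n_keep = counts[minority]
maj_idx = np.flatnonzero(y == majority)
order = np.argsort(true_class_proba[maj_idx])[::-1]  # easiest first
easiest = maj_idx[order[:n_keep]]
keep = np.sort(np.concatenate([easiest, np.flatnonzero(y == minority)]))
X_res, y_res = X[keep], y[keep]

print(sorted(zip(*np.unique(y_res, return_counts=True))))
```

Because this sketch selects by rank, it always keeps exactly `n_keep` majority samples. It therefore does not reproduce the caveat in the note above, where the specified number of samples cannot always be obtained.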

imblearn/under_sampling/_prototype_selection/_instance_hardness_threshold.py

Lines changed: 1 addition & 1 deletion
@@ -51,7 +51,7 @@ class InstanceHardnessThreshold(BaseUnderSampler):
     ----------
     sampling_strategy_ : dict
         Dictionary containing the information to sample the dataset. The keys
-        corresponds to the class labels from which to sample and the values
+        correspond to the class labels from which to sample and the values
         are the number of samples to sample.
 
     estimator_ : estimator object
