@@ -467,12 +467,32 @@ and the output a 3 nearest neighbors classifier. The class can be used as::

.. _instance_hardness_threshold:

+ Additional undersampling techniques
+ -----------------------------------
+
Instance hardness threshold
^^^^^^^^^^^^^^^^^^^^^^^^^^^

- :class:`InstanceHardnessThreshold` is a specific algorithm in which a
- classifier is trained on the data and the samples with lower probabilities are
- removed :cite:`smith2014instance`. The class can be used as::
+ **Instance hardness** is a measure of how difficult it is to classify an instance or
+ observation correctly. Hard instances are the observations that a classifier is most
+ likely to misclassify.
+
+ Fundamentally, instances that are hard to classify correctly are those for which the
+ learning algorithm or classifier produces a low probability of predicting the correct
+ class label.
+
+ The reasoning is that removing these hard instances from the dataset helps the
+ classifier better separate the different classes :cite:`smith2014instance`.
+
+ :class:`InstanceHardnessThreshold` trains a classifier on the data and then removes the
+ samples with lower predicted probabilities :cite:`smith2014instance`. In other words,
+ it retains the observations with the higher class probabilities.
+
+ In our implementation, :class:`InstanceHardnessThreshold` is (almost) a controlled
+ under-sampling method: it retains a number of observations of the target class(es)
+ that is specified by the user (see the caveat below).
+
+ The class can be used as::

  >>> from sklearn.linear_model import LogisticRegression
  >>> from imblearn.under_sampling import InstanceHardnessThreshold
@@ -483,18 +503,18 @@ removed :cite:`smith2014instance`. The class can be used as::
  >>> print(sorted(Counter(y_resampled).items()))
  [(0, 64), (1, 64), (2, 64)]

- This class has 2 important parameters. ``estimator`` will accept any
- scikit-learn classifier which has a method ``predict_proba``. The classifier
- training is performed using a cross-validation and the parameter ``cv`` can set
- the number of folds to use.
+ :class:`InstanceHardnessThreshold` has 2 important parameters. The parameter
+ ``estimator`` accepts any scikit-learn classifier with a ``predict_proba`` method.
+ This classifier is used to identify the hard instances. It is trained with
+ cross-validation, which can be configured through the parameter ``cv``.

.. note::

   :class:`InstanceHardnessThreshold` could almost be considered as a
   controlled under-sampling method. However, due to the probability outputs, it
-    is not always possible to get a specific number of samples.
+    is not always possible to get the specified number of samples.

- The figure below gives another examples on some toy data.
+ The figure below shows examples of instance hardness undersampling on a toy dataset.

.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_006.png
   :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html
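
The idea described in this section can be illustrated with scikit-learn alone. The
sketch below is not the code of :class:`InstanceHardnessThreshold`: the helper name
``keep_easiest_samples``, the synthetic dataset and the strict per-class ranking are
assumptions made for illustration. It scores every sample with the out-of-fold
probability of its true class and keeps the highest-scoring samples of each class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def keep_easiest_samples(X, y, n_keep_per_class, n_splits=5, random_state=0):
    """Hypothetical helper: return the sorted indices of the retained samples."""
    estimator = LogisticRegression(max_iter=1000)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    # Out-of-fold probabilities: each sample is scored by a model that did not train on it.
    proba = cross_val_predict(estimator, X, y, cv=folds, method="predict_proba")
    classes = np.unique(y)
    # predict_proba columns follow the sorted class labels.
    true_class_column = np.searchsorted(classes, y)
    p_true = proba[np.arange(len(y)), true_class_column]
    kept = []
    for label in classes:
        idx = np.flatnonzero(y == label)
        # Rank the samples of this class from easiest (high probability) to hardest.
        order = np.argsort(p_true[idx])[::-1]
        kept.append(idx[order[:n_keep_per_class]])
    return np.sort(np.concatenate(kept))


# Assumed toy data: three imbalanced classes.
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, weights=[0.1, 0.3, 0.6], random_state=0,
)
n_minority = int(np.bincount(y).min())
kept = keep_easiest_samples(X, y, n_keep_per_class=n_minority)
print(np.bincount(y[kept]))  # one count per class, each equal to n_minority
```

Because this sketch ranks samples and takes a fixed number per class, it always returns
the requested count. A rule based on a probability threshold, as the note above
suggests the class uses, can keep more or fewer samples when probabilities are tied.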