
Commit 636dc5b

explain instance hardness in user guide

1 parent 4647a2b commit 636dc5b

File tree

1 file changed: +34 -0 lines changed

doc/cross_validation.rst

Lines changed: 34 additions & 0 deletions
@@ -22,6 +22,36 @@ We will discuss instance hardness in this document and explain how to use the

Instance hardness and average precision
=======================================
Instance hardness is defined as 1 minus the probability of the most probable class:

.. math::

    H(x) = 1 - P(\hat{y}|x)

In this equation :math:`H(x)` is the instance hardness for a sample with features
:math:`x`, and :math:`P(\hat{y}|x)` is the probability of the predicted label
:math:`\hat{y}` given the features. If the model predicts label 0 and gives a
`predict_proba` output of [0.9, 0.1], the probability of the most probable class
(0) is 0.9 and the instance hardness is 1 - 0.9 = 0.1.
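The definition above can be sketched in a few lines. This is a minimal illustration; `instance_hardness` is a hypothetical helper written for this sketch, not part of scikit-learn's API:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def instance_hardness(proba):
    """Instance hardness: 1 minus the probability of the most probable class."""
    return 1 - proba.max(axis=1)

# Fit any probabilistic classifier on a toy dataset.
X, y = make_classification(random_state=0)
model = LogisticRegression().fit(X, y)
hardness = instance_hardness(model.predict_proba(X))

# A sample with predict_proba output [0.9, 0.1] has hardness 1 - 0.9 = 0.1.
print(instance_hardness(np.array([[0.9, 0.1]])))
```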

Samples with large instance hardness have a significant effect on the area under
the precision-recall curve, or average precision. In particular, samples with
label 0 and large instance hardness (i.e., the model predicts label 1) reduce
the average precision considerably, as these points affect the precision-recall
curve on the left, where the area is largest: the precision is lowered in the
range of low recall and high thresholds. When doing cross validation, e.g. for
hyperparameter tuning or recursive feature elimination, a random concentration
of these points in some folds introduces variance in the CV results that
deteriorates the robustness of the cross validation task. The `InstanceHardnessCV`
splitter aims to distribute the samples with large instance hardness over the
folds in order to reduce this undesired variance. Note that one should use this
splitter to make model *selection* tasks robust, such as hyperparameter tuning
and feature selection, but not for model *performance estimation*, for which you
also want to know the variance of performance to be expected in production.
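The idea behind such a splitter can be illustrated with a simplified sketch (this is not the actual `InstanceHardnessCV` implementation): estimate hardness from out-of-fold predicted probabilities, sort the samples by it, and deal them out round-robin so every fold receives a similar share of hard samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

# Estimate instance hardness with out-of-fold predicted probabilities.
proba = cross_val_predict(LogisticRegression(), X, y,
                          cv=5, method="predict_proba")
hardness = 1 - proba.max(axis=1)

# Simplified fold assignment: sort by hardness (hardest first) and deal the
# samples out round-robin, so each fold gets a similar share of hard samples.
n_folds = 5
order = np.argsort(hardness)[::-1]
fold_of = np.empty(len(y), dtype=int)
fold_of[order] = np.arange(len(y)) % n_folds

# Each fold's mean hardness is now close to the overall mean.
print([round(hardness[fold_of == k].mean(), 3) for k in range(n_folds)])
```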

Create imbalanced dataset with samples with large instance hardness
===================================================================
2555

2656
Let’s start by creating a dataset to work with. We create a dataset with 5% class
2757
imbalance using scikit-learn’s `make_blobs` function.
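The exact parameters used in the guide are outside this hunk; a minimal sketch of such a dataset, with hypothetical centers and blob sizes, could look like this (passing a sequence as `n_samples` sets the size of each blob, giving ~5% of samples the minority label):

```python
from collections import Counter
from sklearn.datasets import make_blobs

# Hypothetical parameters: two blobs with 950 and 50 samples, so the
# minority class makes up 5% of the dataset.
X, y = make_blobs(n_samples=[950, 50], centers=[(-3, 0), (3, 0)],
                  random_state=10)
print(Counter(y))
```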
@@ -54,6 +84,9 @@ Now we add some samples with large instance hardness

   :target: ./auto_examples/cross_validation/plot_instance_hardness_cv.html
   :align: center

87+
Assess cross validation performance variance using InstanceHardnessCV splitter
==============================================================================

Then we take a `LogisticRegression` classifier and assess the cross validation
performance using a `StratifiedKFold` cv splitter and the `cross_validate`
function.
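That baseline measurement can be sketched as follows (a hypothetical dataset is used here; the guide builds its own dataset earlier in the document):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = make_classification(n_samples=400, weights=[0.95, 0.05], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
skf_result = cross_validate(LogisticRegression(), X, y, cv=skf,
                            scoring="average_precision")

# The spread of the per-fold scores reflects the variance of the CV estimate.
print(skf_result["test_score"].round(3))
```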
@@ -78,6 +111,7 @@ the `InstanceHardnessCV` splitter is lower than for the `StratifiedKFold` splitter

    >>> plt.boxplot([skf_result['test_score'], ih_result['test_score']],
    ...             tick_labels=["StratifiedKFold", "InstanceHardnessCV"],
    ...             vert=False)
    >>> plt.xlabel('Average precision')
    >>> plt.tight_layout()

.. image:: ./auto_examples/cross_validation/images/sphx_glr_plot_instance_hardness_cv_003.png
