@@ -22,6 +22,36 @@ We will discuss instance hardness in this document and explain how to use the

Instance hardness and average precision
=======================================
+ Instance hardness is defined as 1 minus the probability of the most probable class:
+
+ .. math::
+
+     H(x) = 1 - P(\hat{y}|x)
+
+ In this equation, :math:`H(x)` is the instance hardness of a sample with features
+ :math:`x`, and :math:`P(\hat{y}|x)` is the probability of the predicted label
+ :math:`\hat{y}` given the features. If the model predicts label 0 with a
+ `predict_proba` output of [0.9, 0.1], the probability of the most probable class
+ (0) is 0.9, so the instance hardness is 1 - 0.9 = 0.1.
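The definition above is a one-liner on top of `predict_proba`. A minimal sketch with a hypothetical probability matrix (illustrative data, not the library's implementation):

```python
import numpy as np

# Hypothetical predict_proba output for four samples (binary classification):
# each row holds [P(class 0), P(class 1)].
proba = np.array([
    [0.9, 0.1],    # confident in class 0 -> low hardness
    [0.2, 0.8],    # confident in class 1 -> low hardness
    [0.55, 0.45],  # uncertain -> high hardness
    [0.45, 0.55],  # uncertain -> high hardness
])

# Instance hardness: 1 minus the probability of the most probable class.
hardness = 1 - proba.max(axis=1)
print(hardness)  # -> approximately [0.1, 0.2, 0.45, 0.45]
```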
+
+ Samples with large instance hardness have a significant effect on the area under
+ the precision-recall curve, i.e. the average precision. In particular, samples
+ with label 0 and large instance hardness (so the model predicts label 1) reduce
+ the average precision considerably, because these points affect the
+ precision-recall curve on the left, where the area is largest: the precision is
+ lowered in the range of low recall and high thresholds. During cross validation,
+ e.g. for hyperparameter tuning or recursive feature elimination, a random
+ concentration of these points in some folds introduces variance in the CV
+ results that deteriorates the robustness of the cross validation task. The
+ `InstanceHardnessCV` splitter aims to distribute the samples with large instance
+ hardness evenly over the folds in order to reduce this undesired variance. Note
+ that this splitter should be used to make model *selection* tasks such as
+ hyperparameter tuning and feature selection robust, not for model *performance
+ estimation*, for which you also want to know the variance of the performance to
+ be expected in production.
+
+
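The effect on average precision can be seen with a toy example: starting from scores where both positives outrank every negative, turning a single negative into a hard one (scored above the positives) cuts the average precision sharply, because it lowers precision exactly in the low-recall, high-threshold region. The numbers below are illustrative, not taken from the document's dataset:

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# All negatives score low: both positives are ranked first -> perfect AP.
easy = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.8, 0.9])
print(average_precision_score(y_true, easy))  # 1.0

# One hard negative now outscores both positives: precision drops at the
# top of the ranking, i.e. at low recall, where the PR curve area is largest.
hard = easy.copy()
hard[0] = 0.95
print(average_precision_score(y_true, hard))  # ~0.58
```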
+ Create imbalanced dataset with samples with large instance hardness
+ ===================================================================

Let’s start by creating a dataset to work with. We create a dataset with 5% class
imbalance using scikit-learn’s `make_blobs` function.
@@ -54,6 +84,9 @@ Now we add some samples with large instance hardness
   :target: ./auto_examples/cross_validation/plot_instance_hardness_cv.html
   :align: center

+ Assess cross validation performance variance using InstanceHardnessCV splitter
+ ==============================================================================
+
Then we take a `LogisticRegression` classifier and assess the cross validation
performance using a `StratifiedKFold` cv splitter and the `cross_validate`
function.
@@ -78,6 +111,7 @@ the `InstanceHardnessCV` splitter is lower than for the `StratifiedKFold` splitt
>>> plt.boxplot([skf_result['test_score'], ih_result['test_score']],
...             tick_labels=["StratifiedKFold", "InstanceHardnessCV"],
...             vert=False)
+ >>> plt.xlabel('Average precision')
>>> plt.tight_layout()

.. image:: ./auto_examples/cross_validation/images/sphx_glr_plot_instance_hardness_cv_003.png
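The mechanism behind the reduced variance can be sketched without the library: estimate each sample's hardness from out-of-fold predicted probabilities, then deal the samples over the folds in order of decreasing hardness, so every fold receives a comparable share of hard samples. This is only an illustration of the idea; `InstanceHardnessCV` itself may differ, and the dataset and variable names below are ours:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Imbalanced toy data (~5% positives), in the spirit of the example above.
X, y = make_classification(n_samples=200, weights=[0.95], random_state=0)

# Out-of-fold probabilities give a hardness estimate per sample that is not
# biased by training on the sample itself.
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)
hardness = 1 - proba.max(axis=1)

# Deal samples over the folds from hardest to easiest (round-robin), so the
# hardest samples are spread evenly instead of clustering in one fold.
n_folds = 5
order = np.argsort(-hardness)
fold_of = np.empty(len(y), dtype=int)
fold_of[order] = np.arange(len(y)) % n_folds

print(np.bincount(fold_of))  # five folds of 40 samples each
```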