Cross validation
================

.. currentmodule:: imblearn.model_selection

.. _instance_hardness_threshold_cv:

The term instance hardness is used in the literature to express how difficult an
instance is to classify correctly. An instance for which the predicted probability of
the true class is low has large instance hardness. The way these hard-to-classify
instances are distributed over train and test sets in cross validation has a
significant effect on the test set performance metrics. The
:class:`~imblearn.model_selection.InstanceHardnessCV` splitter distributes samples
with large instance hardness equally over the folds, resulting in more robust cross
validation.

We will discuss instance hardness in this document and explain how to use the
:class:`~imblearn.model_selection.InstanceHardnessCV` splitter.

Instance hardness and average precision
=======================================

Instance hardness is defined as 1 minus the probability of the most probable class:

.. math::

   H(x) = 1 - P(\hat{y}|x)

In this equation :math:`H(x)` is the instance hardness for a sample with features
:math:`x` and :math:`P(\hat{y}|x)` the probability of the predicted label
:math:`\hat{y}` given the features. If the model predicts label 0 and gives a
`predict_proba` output of [0.9, 0.1], the probability of the most probable class (0)
is 0.9 and the instance hardness is `1 - 0.9 = 0.1`.

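As a quick sketch of this definition, instance hardness can be computed directly from
a classifier's `predict_proba` output. The toy dataset and model below are
illustrative assumptions, not part of the example that follows:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Assumed toy data and model, purely to illustrate the formula.
X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Instance hardness: 1 minus the probability of the most probable class.
proba = clf.predict_proba(X)
hardness = 1.0 - proba.max(axis=1)

# For the worked example in the text: predict_proba of [0.9, 0.1]
# gives hardness 1 - 0.9 = 0.1.
print(hardness.shape, float(1.0 - max([0.9, 0.1])))
```

For a binary problem the most probable class always has probability at least 0.5, so
instance hardness lies in the interval [0, 0.5].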
Samples with large instance hardness have a significant effect on the area under the
precision-recall curve, or average precision. Especially samples with label 0 that are
assigned a high predicted probability of label 1 reduce the average precision: they
affect the precision-recall curve where the area is largest; the precision is lowered
in the range of low recall and high thresholds. When doing cross validation, e.g. in
case of hyperparameter tuning or recursive feature elimination, random gathering of
these points in some folds introduces variance in the CV results that deteriorates the
robustness of the cross validation task. The
:class:`~imblearn.model_selection.InstanceHardnessCV` splitter aims to distribute the
samples with large instance hardness over the folds in order to reduce undesired
variance. Note that one should use this splitter to make model *selection* tasks
robust, like hyperparameter tuning and recursive feature elimination, rather than
model *assessment* tasks, for which one would want to know the variance of performance
to be expected in production.
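To see this variance concretely, the following sketch (on assumed toy data) repeats
stratified cross validation with different shuffle seeds; the spread of average
precision changes with how the hard points happen to land in the test folds:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Imbalanced blobs plus a few hard minority points placed inside the
# majority cluster (all parameters are illustrative assumptions).
rng = np.random.RandomState(0)
X, y = make_blobs(n_samples=[950, 50], centers=[(-3, 0), (3, 0)], random_state=0)
X = np.vstack([X, rng.normal((-3, 0), 0.5, size=(10, 2))])
y = np.hstack([y, np.ones(10, dtype=int)])

# The fold-to-fold spread of average precision depends on how the hard
# points are gathered into test folds, so it varies with the shuffle seed.
for seed in range(3):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_validate(LogisticRegression(), X, y, cv=cv,
                            scoring="average_precision")["test_score"]
    print(f"seed={seed}: std={scores.std():.3f}")
```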
Create imbalanced dataset with samples with large instance hardness
===================================================================

Let's start by creating a dataset to work with. We create a dataset with 5% class
imbalance using scikit-learn's :func:`~sklearn.datasets.make_blobs` function.

>>> import numpy as np
>>> from matplotlib import pyplot as plt
>>> plt.scatter(X[:, 0], X[:, 1], c=y)
>>> plt.show()

.. image:: ./auto_examples/model_selection/images/sphx_glr_plot_instance_hardness_cv_001.png
   :target: ./auto_examples/model_selection/plot_instance_hardness_cv.html
   :align: center

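A self-contained sketch of such a dataset follows; the sample counts match the 5%
imbalance described above, but the cluster centers and the seed are illustrative
assumptions:

```python
from sklearn.datasets import make_blobs

random_state = 10  # assumed seed; any fixed seed works

# Two blobs with 5% class imbalance: 950 majority vs 50 minority samples
# (counts follow the 5% imbalance above; centers are assumptions).
X, y = make_blobs(n_samples=[950, 50], centers=[(-3.0, 0.0), (3.0, 0.0)],
                  random_state=random_state)
print(X.shape, int(y.sum()))
```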
Now we add some samples with large instance hardness:

>>> plt.scatter(X[:, 0], X[:, 1], c=y)
>>> plt.show()

.. image:: ./auto_examples/model_selection/images/sphx_glr_plot_instance_hardness_cv_002.png
   :target: ./auto_examples/model_selection/plot_instance_hardness_cv.html
   :align: center

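A sketch of how such hard samples can be appended; the base dataset, point counts, and
locations below are illustrative assumptions — the idea is to place points of one
class inside the opposite class's cluster:

```python
import numpy as np
from sklearn.datasets import make_blobs

rng = np.random.RandomState(10)

# Base imbalanced dataset (assumed parameters, as in the sketch above).
X, y = make_blobs(n_samples=[950, 50], centers=[(-3.0, 0.0), (3.0, 0.0)],
                  random_state=10)

# Hard-to-classify samples: points labeled with one class but drawn inside
# the other class's cluster, so any reasonable classifier assigns them a
# low probability for their true class.
X_hard = np.vstack([rng.normal((-3.0, 0.0), 0.5, size=(10, 2)),
                    rng.normal((3.0, 0.0), 0.5, size=(10, 2))])
y_hard = np.hstack([np.ones(10, dtype=int), np.zeros(10, dtype=int)])

X, y = np.vstack([X, X_hard]), np.hstack([y, y_hard])
print(X.shape, int(y.sum()))
```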
Assess cross validation performance variance using `InstanceHardnessCV` splitter
================================================================================

Then we take a :class:`~sklearn.linear_model.LogisticRegression` and assess the
cross validation performance using a :class:`~sklearn.model_selection.StratifiedKFold`
cv splitter and the :func:`~sklearn.model_selection.cross_validate` function.

>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import StratifiedKFold, cross_validate
>>> clf = LogisticRegression(random_state=random_state)
>>> skf_cv = StratifiedKFold(n_splits=5, shuffle=True,
...                          random_state=random_state)
>>> skf_result = cross_validate(clf, X, y, cv=skf_cv, scoring="average_precision")

Now, we do the same using an :class:`~imblearn.model_selection.InstanceHardnessCV`
splitter. We provide our classifier to the splitter to calculate instance hardness
and distribute samples with large instance hardness equally over the folds.

>>> from imblearn.model_selection import InstanceHardnessCV
>>> ih_cv = InstanceHardnessCV(estimator=clf, n_splits=5,
...                            random_state=random_state)
>>> ih_result = cross_validate(clf, X, y, cv=ih_cv, scoring="average_precision")

When we plot the test scores for both cv splitters, we see that the variance using the
:class:`~imblearn.model_selection.InstanceHardnessCV` splitter is lower than for the
:class:`~sklearn.model_selection.StratifiedKFold` splitter.

>>> plt.boxplot([skf_result['test_score'], ih_result['test_score']],
...             tick_labels=["StratifiedKFold", "InstanceHardnessCV"],
...             vert=False)
>>> plt.xlabel('Average precision')
>>> plt.tight_layout()

.. image:: ./auto_examples/model_selection/images/sphx_glr_plot_instance_hardness_cv_003.png
   :target: ./auto_examples/model_selection/plot_instance_hardness_cv.html
   :align: center


Be aware that the most important property of a cross-validation splitter is that it
simulates the conditions one will encounter in production. Therefore, if difficult
samples are likely to occur in production, one should use a cross-validation splitter
that emulates this situation. In our case, the
:class:`~sklearn.model_selection.StratifiedKFold` splitter did not distribute the
difficult samples evenly over the folds, which was likely a problem for our use case.