.. currentmodule:: imblearn.metrics

Classification metrics
----------------------

Currently, scikit-learn only offers the
``sklearn.metrics.balanced_accuracy_score`` (in 0.20) as a metric to deal with
imbalanced datasets. The module :mod:`imblearn.metrics` offers a couple of
other metrics which are used in the literature to evaluate the quality of
classifiers.
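
As a small illustration (a toy binary problem of our own, not taken from the
original guide), the scikit-learn metric mentioned above averages the recall
obtained on each class::

    >>> from sklearn.metrics import balanced_accuracy_score
    >>> y_true = [0, 0, 0, 0, 1, 1]
    >>> y_pred = [0, 0, 1, 1, 1, 1]
    >>> # recall is 0.5 on class 0 and 1.0 on class 1, hence 0.75 on average
    >>> print(balanced_accuracy_score(y_true, y_pred))
    0.75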

.. _sensitivity_specificity:

Sensitivity and specificity metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sensitivity and specificity are metrics which are well known in medical
imaging. Sensitivity (also called true positive rate or recall) is the
proportion of the positive samples which are correctly classified, while
specificity is the proportion of the negative samples which are correctly
classified.
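
A minimal sketch (again with hand-made binary labels that are not part of the
original text) of how these two quantities can be computed with
:func:`sensitivity_score` and :func:`specificity_score`::

    >>> from imblearn.metrics import sensitivity_score, specificity_score
    >>> y_true = [0, 0, 0, 0, 1, 1]
    >>> y_pred = [0, 0, 1, 1, 1, 0]
    >>> # sensitivity: 1 of the 2 positive samples is recovered
    >>> print(sensitivity_score(y_true, y_pred))
    0.5
    >>> # specificity: 2 of the 4 negative samples are recovered
    >>> print(specificity_score(y_true, y_pred))
    0.5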

.. _imbalanced_metrics:

Additional metrics specific to imbalanced datasets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The :func:`geometric_mean_score`
:cite:`barandela2003strategies,kubat1997addressing` is the root of the product
of class-wise sensitivity.
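
As a rough illustration (toy labels of our own, in the binary case where the
geometric mean reduces to the square root of sensitivity times specificity)::

    >>> from imblearn.metrics import geometric_mean_score
    >>> y_true = [0, 0, 0, 0, 1, 1]
    >>> y_pred = [0, 0, 1, 1, 1, 1]
    >>> # sensitivity is 1.0 and specificity is 0.5, so sqrt(0.5) ~= 0.707
    >>> print(round(geometric_mean_score(y_true, y_pred), 3))
    0.707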

.. _macro_averaged_mean_absolute_error:

Macro-Averaged Mean Absolute Error (MA-MAE)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ordinal classification is used when there is a rank among classes, for example
levels of functionality or movie ratings. The
:func:`macro_averaged_mean_absolute_error` is used for imbalanced ordinal
classification: the mean absolute error is computed for each class and
averaged over classes, giving an equal weight to each class.
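
A minimal sketch (with hand-made ordinal labels that are not part of the
original text) contrasting the plain mean absolute error with its
macro-averaged counterpart::

    >>> from sklearn.metrics import mean_absolute_error
    >>> from imblearn.metrics import macro_averaged_mean_absolute_error
    >>> y_true = [1, 1, 1, 1, 1, 2]   # class 2 is under-represented
    >>> y_pred = [1, 1, 1, 1, 1, 1]
    >>> # the plain MAE is dominated by the majority class
    >>> print(round(mean_absolute_error(y_true, y_pred), 3))
    0.167
    >>> # MA-MAE averages the per-class errors (0 for class 1, 1 for class 2)
    >>> print(round(macro_averaged_mean_absolute_error(y_true, y_pred), 3))
    0.5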

.. _classification_report:

Summary of important metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The :func:`classification_report_imbalanced` will compute a set of metrics
per class and summarize it in a table. The parameter ``output_dict`` allows
getting a Python dictionary instead of a string. This dictionary can be
reused to create a pandas DataFrame, for instance.
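
A short sketch (reusing toy binary labels of our own) of how the report could
be produced in both forms::

    >>> from imblearn.metrics import classification_report_imbalanced
    >>> y_true = [0, 0, 0, 0, 1, 1]
    >>> y_pred = [0, 0, 1, 1, 1, 1]
    >>> # formatted string with per-class precision, recall, specificity, ...
    >>> report = classification_report_imbalanced(y_true, y_pred)
    >>> # nested dictionary, convenient to turn into a pandas DataFrame
    >>> report_dict = classification_report_imbalanced(
    ...     y_true, y_pred, output_dict=True
    ... )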

.. _pairwise_metrics:

Pairwise metrics
----------------

The :mod:`imblearn.metrics.pairwise` submodule implements pairwise distances
that are not available in scikit-learn but are used in some of the methods
within imbalanced-learn.

.. _vdm:

Value Difference Metric
~~~~~~~~~~~~~~~~~~~~~~~

The class :class:`~imblearn.metrics.pairwise.ValueDifferenceMetric`
implements the Value Difference Metric proposed in
:cite:`stanfill1986toward`. This measure is used to compute the proximity
of two samples composed of only nominal values.

Given a single feature, categories with a similar correlation with the target
vector will be considered closer. Let's give an example to illustrate this
behaviour, as given in :cite:`wilson1997improved`. `X` will be represented by
a single feature which will be some color, and the target will be whether or
not a sample is an apple::

    >>> import numpy as np
    >>> X = np.array(["green"] * 10 + ["red"] * 10 + ["blue"] * 10).reshape(-1, 1)
    >>> y = ["apple"] * 8 + ["not apple"] * 5 + ["apple"] * 7 + ["not apple"] * 9 + ["apple"]

In this dataset, the categories "red" and "green" are more correlated to the
target `y` and should therefore be at a smaller distance from each other than
from the category "blue". We illustrate this behaviour below. Be aware that we
need to encode `X` to work with numerical values::

    >>> from sklearn.preprocessing import OrdinalEncoder
    >>> encoder = OrdinalEncoder(dtype=np.int32)
    >>> X_encoded = encoder.fit_transform(X)

Now, we can compute the distance between three different samples representing
the different categories::

    >>> from imblearn.metrics.pairwise import ValueDifferenceMetric
    >>> vdm = ValueDifferenceMetric().fit(X_encoded, y)
    >>> X_test = np.array(["green", "red", "blue"]).reshape(-1, 1)
    >>> X_test_encoded = encoder.transform(X_test)
    >>> vdm.pairwise(X_test_encoded)
    array([[ 0.  ,  0.04,  1.96],
           [ 0.04,  0.  ,  1.44],
           [ 1.96,  1.44,  0.  ]])

We see that the smallest distance occurs when the categories "red" and "green"
are compared. When comparing with "blue", the distance is much larger.
+
126
+ **Mathematical formulation **
127
+
128
+ The distance between feature values of two samples is defined as:
129
+
130
+ .. math ::
131
+ \delta (x, y) = \sum _{c=1 }^{C} |p(c|x_{f}) - p(c|y_{f})|^{k} \ ,
132
+
133
+ where :math: `x` and :math: `y` are two samples and :math: `f` a given
134
+ feature, :math: `C` is the number of classes, :math: `p(c|x_{f})` is the
135
+ conditional probability that the output class is :math: `c` given that
136
+ the feature value :math: `f` has the value :math: `x` and :math: `k` an
137
+ exponent usually defined to 1 or 2.
138
+
139
+ The distance for the feature vectors :math: `X` and :math: `Y` is
140
+ subsequently defined as:
141
+
142
+ .. math ::
143
+ \Delta (X, Y) = \sum _{f=1 }^{F} \delta (X_{f}, Y_{f})^{r} \ ,
144
+
145
+ where :math: `F` is the number of feature and :math: `r` an exponent usually
146
+ defined equal to 1 or 2.
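
As a quick sanity check (a hand computation of ours, assuming the exponents
:math:`k = 1` and :math:`r = 2`, which reproduce the matrix shown above), the
entry for "green" versus "red" can be recovered from the conditional
probabilities estimated on `y`::

    >>> import numpy as np
    >>> # p(apple | color) estimated from `y`: green -> 0.8, red -> 0.7
    >>> p_green, p_red = np.array([0.8, 0.2]), np.array([0.7, 0.3])
    >>> delta = np.sum(np.abs(p_green - p_red) ** 1)   # k = 1
    >>> print(round(delta ** 2, 2))                    # r = 2, single feature
    0.04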