
Commit ce4e1f7

glemaitre and chkoar authored
FEA add ValueDifferenceMetric as a pairwise metric (#796)
Co-authored-by: Christos Aridas <[email protected]>
1 parent 1d77037 commit ce4e1f7

File tree: 10 files changed, +539 −12 lines changed

.flake8

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+[flake8]
+max-line-length = 88
+# Default flake8 3.5 ignored flags
+ignore=E121,E123,E126,E226,E24,E704,W503,W504,E203
+# It's fine not to put the import at the top of the file in the examples
+# folder.
+per-file-ignores =
+    examples/*: E402
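The `per-file-ignores` entry is what lets gallery scripts under `examples/` place imports after module-level code. A minimal sketch of the pattern it permits, with a hypothetical file path::

    # examples/plot_demo.py (hypothetical path) -- E402, "module level import
    # not at top of file", is suppressed here by the per-file-ignores rule.
    """A gallery example whose docstring is shown before the code runs."""
    print(__doc__)  # sphinx-gallery convention: echo the docstring first

    import numpy as np  # this late import would normally trigger E402

    print(np.arange(3))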

doc/api.rst

Lines changed: 20 additions & 0 deletions
@@ -205,6 +205,10 @@ Imbalance-learn provides some fast-prototyping tools.
 
 .. currentmodule:: imblearn
 
+Classification metrics
+----------------------
+See the :ref:`metrics` section of the user guide for further details.
+
 .. autosummary::
    :toctree: generated/
    :template: function.rst
@@ -217,6 +221,22 @@ Imbalance-learn provides some fast-prototyping tools.
    metrics.macro_averaged_mean_absolute_error
    metrics.make_index_balanced_accuracy
 
+Pairwise metrics
+----------------
+See the :ref:`pairwise_metrics` section of the user guide for further details.
+
+.. automodule:: imblearn.metrics.pairwise
+   :no-members:
+   :no-inherited-members:
+
+.. currentmodule:: imblearn
+
+.. autosummary::
+   :toctree: generated/
+   :template: class.rst
+
+   metrics.pairwise.ValueDifferenceMetric
+
 .. _datasets_ref:
 
 :mod:`imblearn.datasets`: Datasets

doc/bibtex/refs.bib

Lines changed: 21 additions & 1 deletion
@@ -223,4 +223,24 @@ @article{esuli2009ordinal
   publisher = {IEEE Computer Society},
   address = {Los Alamitos, CA, USA},
   month = {dec}
-}
+}
+
+@article{stanfill1986toward,
+  title={Toward memory-based reasoning},
+  author={Stanfill, Craig and Waltz, David},
+  journal={Communications of the ACM},
+  volume={29},
+  number={12},
+  pages={1213--1228},
+  year={1986},
+  publisher={ACM New York, NY, USA}
+}
+
+@article{wilson1997improved,
+  title={Improved heterogeneous distance functions},
+  author={Wilson, D Randall and Martinez, Tony R},
+  journal={Journal of artificial intelligence research},
+  volume={6},
+  pages={1--34},
+  year={1997}
+}

doc/conf.py

Lines changed: 7 additions & 5 deletions
@@ -21,7 +21,6 @@
 # documentation root, use os.path.abspath to make it absolute, like shown here.
 sys.path.insert(0, os.path.abspath("sphinxext"))
 from github_link import make_linkcode_resolve
-import sphinx_gallery
 
 # -- General configuration ------------------------------------------------
 
@@ -44,7 +43,7 @@
 ]
 
 # bibtex file
-bibtex_bibfiles = ['bibtex/refs.bib']
+bibtex_bibfiles = ["bibtex/refs.bib"]
 
 # this is needed for some reason...
 # see https://github.com/numpy/numpydoc/issues/69
@@ -77,8 +76,8 @@
 master_doc = "index"
 
 # General information about the project.
-project = 'imbalanced-learn'
-copyright = '2014-2020, The imbalanced-learn developers'
+project = "imbalanced-learn"
+copyright = "2014-2020, The imbalanced-learn developers"
 
 # The version info for the project you're documenting, acts as replacement for
 # |version| and |release|, also used in various other places throughout the
@@ -260,7 +259,10 @@
 
 # intersphinx configuration
 intersphinx_mapping = {
-    "python": ("https://docs.python.org/{.major}".format(sys.version_info), None,),
+    "python": (
+        "https://docs.python.org/{.major}".format(sys.version_info),
+        None,
+    ),
     "numpy": ("https://docs.scipy.org/doc/numpy/", None),
     "scipy": ("https://docs.scipy.org/doc/scipy/reference", None),
     "matplotlib": ("https://matplotlib.org/", None),

doc/metrics.rst

Lines changed: 82 additions & 4 deletions
@@ -6,6 +6,9 @@ Metrics
 
 .. currentmodule:: imblearn.metrics
 
+Classification metrics
+----------------------
+
 Currently, scikit-learn only offers the
 ``sklearn.metrics.balanced_accuracy_score`` (in 0.20) as metric to deal with
 imbalanced datasets. The module :mod:`imblearn.metrics` offers a couple of
@@ -15,7 +18,7 @@ classifiers.
 .. _sensitivity_specificity:
 
 Sensitivity and specificity metrics
------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Sensitivity and specificity are metrics which are well known in medical
 imaging. Sensitivity (also called true positive rate or recall) is the
@@ -34,7 +37,7 @@ use those metrics.
 .. _imbalanced_metrics:
 
 Additional metrics specific to imbalanced datasets
---------------------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The :func:`geometric_mean_score`
 :cite:`barandela2003strategies,kubat1997addressing` is the root of the product
@@ -48,7 +51,7 @@ parameter ``alpha``.
 .. _macro_averaged_mean_absolute_error:
 
 Macro-Averaged Mean Absolute Error (MA-MAE)
--------------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Ordinal classification is used when there is a rank among classes, for example
 levels of functionality or movie ratings.
@@ -60,9 +63,84 @@ each class and averaged over classes, giving an equal weight to each class.
 .. _classification_report:
 
 Summary of important metrics
-----------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The :func:`classification_report_imbalanced` will compute a set of metrics
 per class and summarize it in a table. The parameter `output_dict` allows
 one to get a string or a Python dictionary. This dictionary can be reused
 to create a Pandas dataframe, for instance.
+
+.. _pairwise_metrics:
+
+Pairwise metrics
+----------------
+
+The :mod:`imblearn.metrics.pairwise` submodule implements pairwise distances
+that are not available in scikit-learn but are used by some of the methods in
+imbalanced-learn.
+
+.. _vdm:
+
+Value Difference Metric
+~~~~~~~~~~~~~~~~~~~~~~~
+
+The class :class:`~imblearn.metrics.pairwise.ValueDifferenceMetric` implements
+the Value Difference Metric proposed in :cite:`stanfill1986toward`. This
+measure is used to compute the proximity of two samples composed only of
+nominal values.
+
+Given a single feature, categories with a similar correlation with the target
+vector are considered closer. Let's illustrate this behaviour with an example
+taken from :cite:`wilson1997improved`. `X` is represented by a single feature
+containing a color, and the target is whether or not a sample is an apple::
+
+    >>> import numpy as np
+    >>> X = np.array(["green"] * 10 + ["red"] * 10 + ["blue"] * 10).reshape(-1, 1)
+    >>> y = ["apple"] * 8 + ["not apple"] * 5 + ["apple"] * 7 + ["not apple"] * 9 + ["apple"]
+
+In this dataset, the categories "red" and "green" are more correlated with the
+target `y` and should have a smaller distance between them than with the
+category "blue". We show this behaviour below. Be aware that we need to encode
+`X` to work with numerical values::
+
+    >>> from sklearn.preprocessing import OrdinalEncoder
+    >>> encoder = OrdinalEncoder(dtype=np.int32)
+    >>> X_encoded = encoder.fit_transform(X)
+
+Now, we can compute the distance between three different samples representing
+the different categories::
+
+    >>> from imblearn.metrics.pairwise import ValueDifferenceMetric
+    >>> vdm = ValueDifferenceMetric().fit(X_encoded, y)
+    >>> X_test = np.array(["green", "red", "blue"]).reshape(-1, 1)
+    >>> X_test_encoded = encoder.transform(X_test)
+    >>> vdm.pairwise(X_test_encoded)
+    array([[0.  , 0.04, 1.96],
+           [0.04, 0.  , 1.44],
+           [1.96, 1.44, 0.  ]])
+
+We see that the distance is smallest when the categories "red" and "green" are
+compared. Whenever "blue" is involved in the comparison, the distance is much
+larger.
+
+**Mathematical formulation**
+
+The distance between feature values of two samples is defined as:
+
+.. math::
+    \delta(x, y) = \sum_{c=1}^{C} |p(c|x_{f}) - p(c|y_{f})|^{k} \ ,
+
+where :math:`x` and :math:`y` are two samples, :math:`f` is a given feature,
+:math:`C` is the number of classes, :math:`p(c|x_{f})` is the conditional
+probability that the output class is :math:`c` given that the feature
+:math:`f` has the value :math:`x`, and :math:`k` is an exponent usually set
+to 1 or 2.
+
+The distance between the feature vectors :math:`X` and :math:`Y` is
+subsequently defined as:
+
+.. math::
+    \Delta(X, Y) = \sum_{f=1}^{F} \delta(X_{f}, Y_{f})^{r} \ ,
+
+where :math:`F` is the number of features and :math:`r` is an exponent usually
+set to 1 or 2.
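To tie the doctest output to the formulation above, here is a hand-rolled sketch that recomputes the pairwise distances, assuming :math:`k = 1` and :math:`r = 2` (the values that reproduce the numbers shown; the helper function is ours, not part of the library)::

    import numpy as np

    # Conditional probabilities p(class | color) read off the toy dataset:
    # green -> 8 apple / 2 not apple, red -> 7 / 3, blue -> 1 / 9.
    p = {
        "green": np.array([0.8, 0.2]),
        "red": np.array([0.7, 0.3]),
        "blue": np.array([0.1, 0.9]),
    }

    def vdm(a, b, k=1, r=2):
        # delta(x, y) = sum_c |p(c|x) - p(c|y)|**k; with a single feature,
        # Delta(X, Y) reduces to delta**r.
        return np.sum(np.abs(p[a] - p[b]) ** k) ** r

    print(vdm("green", "red"))   # ~0.04 (up to float rounding)
    print(vdm("green", "blue"))  # ~1.96
    print(vdm("red", "blue"))    # ~1.44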

doc/whats_new/v0.8.rst

Lines changed: 4 additions & 0 deletions
@@ -15,6 +15,10 @@ New features
   classification.
   :pr:`780` by :user:`Aurélien Massiot <AurelienMassiot>`.
 
+- Add the class :class:`imblearn.metrics.pairwise.ValueDifferenceMetric` to
+  compute pairwise distances between samples containing only nominal values.
+  :pr:`796` by :user:`Guillaume Lemaitre <glemaitre>`.
+
 Enhancements
 ............
 

imblearn/metrics/_classification.py

Lines changed: 3 additions & 1 deletion
@@ -1,5 +1,7 @@
 # coding: utf-8
-"""Metrics to assess performance on classification task given class prediction
+"""Metrics to assess performance on a classification task given class
+predictions. The available metrics are complementary to the metrics available
+in scikit-learn.
 
 Functions named as ``*_score`` return a scalar value to maximize: the higher
 the better
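As a quick illustration of the ``*_score`` convention this docstring describes, a minimal sketch using :func:`imblearn.metrics.geometric_mean_score` (the toy labels are ours)::

    from imblearn.metrics import geometric_mean_score

    y_true = [0, 0, 0, 1, 1, 1]
    y_pred = [0, 0, 1, 1, 1, 1]

    # *_score functions return a scalar to maximize: the higher the better.
    # Here: sqrt(recall_class0 * recall_class1) = sqrt(2/3 * 1) ~ 0.816
    print(geometric_mean_score(y_true, y_pred))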
