
Commit ce4e1f7

glemaitre and chkoar authored
FEA add ValueDifferenceMetric as a pairwise metric (#796)
Co-authored-by: Christos Aridas <[email protected]>
1 parent 1d77037 commit ce4e1f7

File tree: 10 files changed, +539 −12 lines changed

.flake8

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+[flake8]
+max-line-length = 88
+# Default flake8 3.5 ignored flags
+ignore=E121,E123,E126,E226,E24,E704,W503,W504,E203
+# It's fine not to put the import at the top of the file in the examples
+# folder.
+per-file-ignores =
+    examples/*: E402
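The `per-file-ignores` entry is what lets gallery scripts under `examples/` place imports after module-level code. A minimal sketch of the pattern it permits, with a hypothetical file path::

    # examples/plot_demo.py (hypothetical path) -- E402, "module level import
    # not at top of file", is suppressed here by the per-file-ignores rule.
    """A gallery example whose docstring is shown before the code runs."""
    print(__doc__)  # sphinx-gallery convention: echo the docstring first

    import numpy as np  # this late import would normally trigger E402

    print(np.arange(3))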

doc/api.rst

Lines changed: 20 additions & 0 deletions
@@ -205,6 +205,10 @@ Imbalance-learn provides some fast-prototyping tools.
 
 .. currentmodule:: imblearn
 
+Classification metrics
+----------------------
+See the :ref:`metrics` section of the user guide for further details.
+
 .. autosummary::
    :toctree: generated/
    :template: function.rst
@@ -217,6 +221,22 @@ Imbalance-learn provides some fast-prototyping tools.
    metrics.macro_averaged_mean_absolute_error
    metrics.make_index_balanced_accuracy
 
+Pairwise metrics
+----------------
+See the :ref:`pairwise_metrics` section of the user guide for further details.
+
+.. automodule:: imblearn.metrics.pairwise
+   :no-members:
+   :no-inherited-members:
+
+.. currentmodule:: imblearn
+
+.. autosummary::
+   :toctree: generated/
+   :template: class.rst
+
+   metrics.pairwise.ValueDifferenceMetric
+
 .. _datasets_ref:
 
 :mod:`imblearn.datasets`: Datasets

doc/bibtex/refs.bib

Lines changed: 21 additions & 1 deletion
@@ -223,4 +223,24 @@ @article{esuli2009ordinal
   publisher = {IEEE Computer Society},
   address = {Los Alamitos, CA, USA},
   month = {dec}
-}
+}
+
+@article{stanfill1986toward,
+  title={Toward memory-based reasoning},
+  author={Stanfill, Craig and Waltz, David},
+  journal={Communications of the ACM},
+  volume={29},
+  number={12},
+  pages={1213--1228},
+  year={1986},
+  publisher={ACM New York, NY, USA}
+}
+
+@article{wilson1997improved,
+  title={Improved heterogeneous distance functions},
+  author={Wilson, D Randall and Martinez, Tony R},
+  journal={Journal of artificial intelligence research},
+  volume={6},
+  pages={1--34},
+  year={1997}
+}

doc/conf.py

Lines changed: 7 additions & 5 deletions
@@ -21,7 +21,6 @@
 # documentation root, use os.path.abspath to make it absolute, like shown here.
 sys.path.insert(0, os.path.abspath("sphinxext"))
 from github_link import make_linkcode_resolve
-import sphinx_gallery
 
 # -- General configuration ------------------------------------------------
 
@@ -44,7 +43,7 @@
 ]
 
 # bibtex file
-bibtex_bibfiles = ['bibtex/refs.bib']
+bibtex_bibfiles = ["bibtex/refs.bib"]
 
 # this is needed for some reason...
 # see https://github.com/numpy/numpydoc/issues/69
@@ -77,8 +76,8 @@
 master_doc = "index"
 
 # General information about the project.
-project = 'imbalanced-learn'
-copyright = '2014-2020, The imbalanced-learn developers'
+project = "imbalanced-learn"
+copyright = "2014-2020, The imbalanced-learn developers"
 
 # The version info for the project you're documenting, acts as replacement for
 # |version| and |release|, also used in various other places throughout the
@@ -260,7 +259,10 @@
 
 # intersphinx configuration
 intersphinx_mapping = {
-    "python": ("https://docs.python.org/{.major}".format(sys.version_info), None,),
+    "python": (
+        "https://docs.python.org/{.major}".format(sys.version_info),
+        None,
+    ),
     "numpy": ("https://docs.scipy.org/doc/numpy/", None),
     "scipy": ("https://docs.scipy.org/doc/scipy/reference", None),
     "matplotlib": ("https://matplotlib.org/", None),

doc/metrics.rst

Lines changed: 82 additions & 4 deletions
@@ -6,6 +6,9 @@ Metrics
 
 .. currentmodule:: imblearn.metrics
 
+Classification metrics
+----------------------
+
 Currently, scikit-learn only offers the
 ``sklearn.metrics.balanced_accuracy_score`` (in 0.20) as metric to deal with
 imbalanced datasets. The module :mod:`imblearn.metrics` offers a couple of
@@ -15,7 +18,7 @@ classifiers.
 .. _sensitivity_specificity:
 
 Sensitivity and specificity metrics
------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Sensitivity and specificity are metrics which are well known in medical
 imaging. Sensitivity (also called true positive rate or recall) is the
@@ -34,7 +37,7 @@ use those metrics.
 .. _imbalanced_metrics:
 
 Additional metrics specific to imbalanced datasets
---------------------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The :func:`geometric_mean_score`
 :cite:`barandela2003strategies,kubat1997addressing` is the root of the product
@@ -48,7 +51,7 @@ parameter ``alpha``.
 .. _macro_averaged_mean_absolute_error:
 
 Macro-Averaged Mean Absolute Error (MA-MAE)
--------------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Ordinal classification is used when there is a rank among classes, for example
 levels of functionality or movie ratings.
@@ -60,9 +63,84 @@ each class and averaged over classes, giving an equal weight to each class.
 .. _classification_report:
 
 Summary of important metrics
-----------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The :func:`classification_report_imbalanced` will compute a set of metrics
 per class and summarize it in a table. The parameter `output_dict` allows
 one to get a string or a Python dictionary. This dictionary can be reused
 to create a Pandas dataframe, for instance.
+
+.. _pairwise_metrics:
+
+Pairwise metrics
+----------------
+
+The :mod:`imblearn.metrics.pairwise` submodule implements pairwise distances
+that are not available in scikit-learn but are used by some of the methods in
+imbalanced-learn.
+
+.. _vdm:
+
+Value Difference Metric
+~~~~~~~~~~~~~~~~~~~~~~~
+
+The class :class:`~imblearn.metrics.pairwise.ValueDifferenceMetric` implements
+the Value Difference Metric proposed in :cite:`stanfill1986toward`. This
+measure is used to compute the proximity of two samples composed only of
+nominal values.
+
+Given a single feature, categories with a similar correlation with the target
+vector are considered closer. Let's illustrate this behaviour with an example
+taken from :cite:`wilson1997improved`. `X` is represented by a single feature
+containing a color, and the target is whether or not a sample is an apple::
+
+    >>> import numpy as np
+    >>> X = np.array(["green"] * 10 + ["red"] * 10 + ["blue"] * 10).reshape(-1, 1)
+    >>> y = ["apple"] * 8 + ["not apple"] * 5 + ["apple"] * 7 + ["not apple"] * 9 + ["apple"]
+
+In this dataset, the categories "red" and "green" are more correlated with the
+target `y` and should have a smaller distance between them than with the
+category "blue". We show this behaviour below. Be aware that we need to encode
+`X` to work with numerical values::
+
+    >>> from sklearn.preprocessing import OrdinalEncoder
+    >>> encoder = OrdinalEncoder(dtype=np.int32)
+    >>> X_encoded = encoder.fit_transform(X)
+
+Now, we can compute the distance between three different samples representing
+the different categories::
+
+    >>> from imblearn.metrics.pairwise import ValueDifferenceMetric
+    >>> vdm = ValueDifferenceMetric().fit(X_encoded, y)
+    >>> X_test = np.array(["green", "red", "blue"]).reshape(-1, 1)
+    >>> X_test_encoded = encoder.transform(X_test)
+    >>> vdm.pairwise(X_test_encoded)
+    array([[0.  , 0.04, 1.96],
+           [0.04, 0.  , 1.44],
+           [1.96, 1.44, 0.  ]])
+
+We see that the distance is smallest when the categories "red" and "green" are
+compared. Whenever "blue" is involved in the comparison, the distance is much
+larger.
+
+**Mathematical formulation**
+
+The distance between feature values of two samples is defined as:
+
+.. math::
+    \delta(x, y) = \sum_{c=1}^{C} |p(c|x_{f}) - p(c|y_{f})|^{k} \ ,
+
+where :math:`x` and :math:`y` are two samples, :math:`f` is a given feature,
+:math:`C` is the number of classes, :math:`p(c|x_{f})` is the conditional
+probability that the output class is :math:`c` given that the feature
+:math:`f` has the value :math:`x`, and :math:`k` is an exponent usually set
+to 1 or 2.
+
+The distance between the feature vectors :math:`X` and :math:`Y` is
+subsequently defined as:
+
+.. math::
+    \Delta(X, Y) = \sum_{f=1}^{F} \delta(X_{f}, Y_{f})^{r} \ ,
+
+where :math:`F` is the number of features and :math:`r` is an exponent usually
+set to 1 or 2.
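To tie the doctest output to the formulation above, here is a hand-rolled sketch that recomputes the pairwise distances, assuming :math:`k = 1` and :math:`r = 2` (the values that reproduce the numbers shown; the helper function is ours, not part of the library)::

    import numpy as np

    # Conditional probabilities p(class | color) read off the toy dataset:
    # green -> 8 apple / 2 not apple, red -> 7 / 3, blue -> 1 / 9.
    p = {
        "green": np.array([0.8, 0.2]),
        "red": np.array([0.7, 0.3]),
        "blue": np.array([0.1, 0.9]),
    }

    def vdm(a, b, k=1, r=2):
        # delta(x, y) = sum_c |p(c|x) - p(c|y)|**k; with a single feature,
        # Delta(X, Y) reduces to delta**r.
        return np.sum(np.abs(p[a] - p[b]) ** k) ** r

    print(vdm("green", "red"))   # ~0.04 (up to float rounding)
    print(vdm("green", "blue"))  # ~1.96
    print(vdm("red", "blue"))    # ~1.44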

doc/whats_new/v0.8.rst

Lines changed: 4 additions & 0 deletions
@@ -15,6 +15,10 @@ New features
   classification.
   :pr:`780` by :user:`Aurélien Massiot <AurelienMassiot>`.
 
+- Add the class :class:`imblearn.metrics.pairwise.ValueDifferenceMetric` to
+  compute pairwise distances between samples containing only nominal values.
+  :pr:`796` by :user:`Guillaume Lemaitre <glemaitre>`.
+
 Enhancements
 ............
 

imblearn/metrics/_classification.py

Lines changed: 3 additions & 1 deletion
@@ -1,5 +1,7 @@
 # coding: utf-8
-"""Metrics to assess performance on classification task given class prediction
+"""Metrics to assess performance on a classification task given class
+predictions. The available metrics are complementary to the metrics available
+in scikit-learn.
 
 Functions named as ``*_score`` return a scalar value to maximize: the higher
 the better
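As a quick illustration of the ``*_score`` convention this docstring describes, a minimal sketch using :func:`imblearn.metrics.geometric_mean_score` (the toy labels are ours)::

    from imblearn.metrics import geometric_mean_score

    y_true = [0, 0, 0, 1, 1, 1]
    y_pred = [0, 0, 1, 1, 1, 1]

    # *_score functions return a scalar to maximize: the higher the better.
    # Here: sqrt(recall_class0 * recall_class1) = sqrt(2/3 * 1) ~ 0.816
    print(geometric_mean_score(y_true, y_pred))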
