Commit a928f10

ENH: Add RUSBoostClassifier (#469)

1 parent 513203c commit a928f10

File tree

11 files changed: +516 -124 lines


doc/api.rst

Lines changed: 1 addition & 0 deletions
@@ -112,6 +112,7 @@ Prototype selection
    ensemble.BalancedRandomForestClassifier
    ensemble.EasyEnsemble
    ensemble.EasyEnsembleClassifier
+   ensemble.RUSBoostClassifier
 
 .. _keras_ref:
 

doc/ensemble.rst

Lines changed: 71 additions & 91 deletions
@@ -6,80 +6,30 @@ Ensemble of samplers
 
 .. currentmodule:: imblearn.ensemble
 
-.. _ensemble_samplers:
-
-Samplers
---------
-
-.. warning::
-
-   Note that :class:`EasyEnsemble` is deprecated and you should use
-   :class:`EasyEnsembleClassifier` instead. :class:`EasyEnsembleClassifier` is
-   presented in the next section.
-
-An imbalanced data set can be balanced by creating several balanced
-subsets. The module :mod:`imblearn.ensemble` allows to create such sets.
-
-:class:`EasyEnsemble` creates an ensemble of data sets by randomly
-under-sampling the original set::
-
-  >>> from collections import Counter
-  >>> from sklearn.datasets import make_classification
-  >>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
-  ...                            n_redundant=0, n_repeated=0, n_classes=3,
-  ...                            n_clusters_per_class=1,
-  ...                            weights=[0.01, 0.05, 0.94],
-  ...                            class_sep=0.8, random_state=0)
-  >>> print(sorted(Counter(y).items()))
-  [(0, 64), (1, 262), (2, 4674)]
-  >>> from imblearn.ensemble import EasyEnsemble
-  >>> ee = EasyEnsemble(random_state=0, n_subsets=10)  # doctest: +SKIP
-  >>> X_resampled, y_resampled = ee.fit_resample(X, y)  # doctest: +SKIP
-  >>> print(X_resampled.shape)  # doctest: +SKIP
-  (10, 192, 2)
-  >>> print(sorted(Counter(y_resampled[0]).items()))  # doctest: +SKIP
-  [(0, 64), (1, 64), (2, 64)]
-
-:class:`EasyEnsemble` has two important parameters: (i) ``n_subsets`` to set
-the number of subsets and (ii) ``replacement`` to randomly sample with or
-without replacement.
-
-:class:`BalanceCascade` differs from the previous method by using a classifier
-(set with the parameter ``estimator``) to ensure that misclassified samples
-can again be selected for the next subset. In fact, the classifier plays the
-role of a "smart" replacement method. The maximum number of subsets can be set
-with the parameter ``n_max_subset``, and an additional bootstrapping can be
-activated by setting ``bootstrap`` to ``True``::
-
-  >>> from imblearn.ensemble import BalanceCascade
-  >>> from sklearn.linear_model import LogisticRegression
-  >>> bc = BalanceCascade(random_state=0,
-  ...                     estimator=LogisticRegression(solver='lbfgs',
-  ...                                                  multi_class='auto',
-  ...                                                  random_state=0),
-  ...                     n_max_subset=4)
-  >>> X_resampled, y_resampled = bc.fit_resample(X, y)
-  >>> print(X_resampled.shape)
-  (4, 192, 2)
-  >>> print(sorted(Counter(y_resampled[0]).items()))
-  [(0, 64), (1, 64), (2, 64)]
+.. _ensemble_meta_estimators:
 
-See
-:ref:`sphx_glr_auto_examples_ensemble_plot_easy_ensemble.py` and
-:ref:`sphx_glr_auto_examples_ensemble_plot_balance_cascade.py`.
+Classifier including inner balancing samplers
+=============================================
 
-.. _ensemble_meta_estimators:
+.. _bagging:
 
-Chaining ensemble of samplers and estimators
---------------------------------------------
+Bagging classifier
+------------------
 
 In ensemble classifiers, bagging methods build several estimators on different
 randomly selected subsets of data. In scikit-learn, this classifier is named
 ``BaggingClassifier``. However, this classifier does not allow balancing each
 subset of data. Therefore, when training on an imbalanced data set, this
 classifier will favor the majority classes::
 
+  >>> from sklearn.datasets import make_classification
+  >>> X, y = make_classification(n_samples=10000, n_features=2, n_informative=2,
+  ...                            n_redundant=0, n_repeated=0, n_classes=3,
+  ...                            n_clusters_per_class=1,
+  ...                            weights=[0.01, 0.05, 0.94], class_sep=0.8,
+  ...                            random_state=0)
   >>> from sklearn.model_selection import train_test_split
-  >>> from sklearn.metrics import confusion_matrix
+  >>> from sklearn.metrics import balanced_accuracy_score
   >>> from sklearn.ensemble import BaggingClassifier
   >>> from sklearn.tree import DecisionTreeClassifier
   >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
@@ -88,10 +38,8 @@ classifier will favor the majority classes::
   >>> bc.fit(X_train, y_train)  # doctest: +ELLIPSIS
  BaggingClassifier(...)
   >>> y_pred = bc.predict(X_test)
-  >>> confusion_matrix(y_test, y_pred)
-  array([[   9,    1,    2],
-         [   0,   54,    5],
-         [   1,    6, 1172]])
+  >>> balanced_accuracy_score(y_test, y_pred)  # doctest: +ELLIPSIS
+  0.77...
 
 :class:`BalancedBaggingClassifier` allows resampling each subset of data
 before training each estimator of the ensemble. In short, it combines the
@@ -111,45 +59,77 @@ random under-sampler::
   >>> bbc.fit(X_train, y_train)  # doctest: +ELLIPSIS
   BalancedBaggingClassifier(...)
   >>> y_pred = bbc.predict(X_test)
-  >>> confusion_matrix(y_test, y_pred)
-  array([[   9,    1,    2],
-         [   0,   55,    4],
-         [  42,   46, 1091]])
+  >>> balanced_accuracy_score(y_test, y_pred)  # doctest: +ELLIPSIS
+  0.80...
+
+.. _forest:
+
+Forest of randomized trees
+--------------------------
 
 :class:`BalancedRandomForestClassifier` is another ensemble method in which
-each tree of the forest will be provided a balanced bootstrap sample. This class
+each tree of the forest will be provided a balanced bootstrap sample [CLB2004]_. This class
 provides all functionality of the
 :class:`sklearn.ensemble.RandomForestClassifier` and notably the
 `feature_importances_` attribute::
 
-
   >>> from imblearn.ensemble import BalancedRandomForestClassifier
-  >>> brf = BalancedRandomForestClassifier(n_estimators=10, random_state=0)
+  >>> brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
   >>> brf.fit(X_train, y_train)  # doctest: +ELLIPSIS
   BalancedRandomForestClassifier(...)
   >>> y_pred = brf.predict(X_test)
-  >>> confusion_matrix(y_test, y_pred)
-  array([[   9,    1,    2],
-         [   3,   54,    2],
-         [ 113,   47, 1019]])
-  >>> brf.feature_importances_
-  array([ 0.63501243,  0.36498757])
-
-A specific method which uses ``AdaBoost`` as learners in the bagging
-classifier is called EasyEnsemble. The :class:`EasyEnsembleClassifier` allows
-bagging AdaBoost learners which are trained on balanced bootstrap samples.
-Similarly to the :class:`BalancedBaggingClassifier` API, one can construct
-the ensemble as::
+  >>> balanced_accuracy_score(y_test, y_pred)  # doctest: +ELLIPSIS
+  0.81...
+  >>> brf.feature_importances_  # doctest: +ELLIPSIS
+  array([ 0.55...,  0.44...])
+
+.. _boosting:
+
+Boosting
+--------
+
+Several methods taking advantage of boosting have been designed.
+
+:class:`RUSBoostClassifier` randomly under-samples the dataset before
+performing each boosting iteration [SKHN2010]_::
+
+  >>> from imblearn.ensemble import RUSBoostClassifier
+  >>> rusboost = RUSBoostClassifier(random_state=0)
+  >>> rusboost.fit(X_train, y_train)  # doctest: +ELLIPSIS
+  RUSBoostClassifier(...)
+  >>> y_pred = rusboost.predict(X_test)
+  >>> balanced_accuracy_score(y_test, y_pred)  # doctest: +ELLIPSIS
+  0.74770070758043261
+
+A specific method which uses ``AdaBoost`` as learners in the bagging classifier
+is called EasyEnsemble. The :class:`EasyEnsembleClassifier` allows bagging
+AdaBoost learners which are trained on balanced bootstrap samples [LWZ2009]_.
+Similarly to the :class:`BalancedBaggingClassifier` API, one can construct the
+ensemble as::
 
   >>> from imblearn.ensemble import EasyEnsembleClassifier
   >>> eec = EasyEnsembleClassifier(random_state=0)
   >>> eec.fit(X_train, y_train)  # doctest: +ELLIPSIS
   EasyEnsembleClassifier(...)
   >>> y_pred = eec.predict(X_test)
-  >>> confusion_matrix(y_test, y_pred)
-  array([[  9,   1,   2],
-         [  5,  52,   2],
-         [252,  45, 882]])
+  >>> balanced_accuracy_score(y_test, y_pred)  # doctest: +ELLIPSIS
+  0.62484778593026025
 
 See
-:ref:`sphx_glr_auto_examples_ensemble_plot_comparison_ensemble_classifier.py`.
+:ref:`sphx_glr_auto_examples_ensemble_plot_comparison_ensemble_classifier.py`.
+
+.. topic:: References
+
+   .. [CLB2004] Chen, Chao, Andy Liaw, and Leo Breiman. "Using random forest
+      to learn imbalanced data." University of California, Berkeley 110
+      (2004): 1-12.
+
+   .. [LWZ2009] X. Y. Liu, J. Wu and Z. H. Zhou, "Exploratory undersampling
+      for class-imbalance learning," IEEE Transactions on Systems, Man, and
+      Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp. 539-550,
+      April 2009.
+
+   .. [SKHN2010] Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., and
+      Napolitano, A. "RUSBoost: A hybrid approach to alleviating class
+      imbalance." IEEE Transactions on Systems, Man, and Cybernetics,
+      Part A: Systems and Humans 40.1 (2010): 185-197.
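
The new user-guide text describes RUSBoost as random under-sampling before each
boosting iteration. For readers who want to see the mechanism rather than the
library call, the following is a simplified, self-contained sketch of that idea
for a binary problem; it is not the implementation added in this commit (see
imblearn/ensemble/_weight_boosting.py), and all names below are local to the
sketch.

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.tree import DecisionTreeClassifier

  # Imbalanced binary problem: class 0 is the minority class.
  X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=0)
  rng = np.random.RandomState(0)

  w = np.full(len(y), 1 / len(y))  # AdaBoost sample weights over the full set
  learners, alphas = [], []
  for k in range(10):
      # RUS step: keep every minority sample, draw as many majority samples.
      idx_min = np.flatnonzero(y == 0)
      idx_maj = rng.choice(np.flatnonzero(y == 1), size=idx_min.size,
                           replace=False)
      idx = np.concatenate([idx_min, idx_maj])
      # Boosting step: fit a weak learner on the balanced sample, carrying
      # the current boosting weights along.
      stump = DecisionTreeClassifier(max_depth=1, random_state=k)
      stump.fit(X[idx], y[idx], sample_weight=w[idx])
      # The weight update is computed on the full (imbalanced) training set.
      miss = stump.predict(X) != y
      err = np.clip(np.average(miss, weights=w), 1e-10, 1 - 1e-10)
      alpha = np.log((1 - err) / err)
      w *= np.exp(alpha * miss)
      w /= w.sum()
      learners.append(stump)
      alphas.append(alpha)

  # Weighted vote of the weak learners (labels mapped to {-1, +1}).
  score = sum(a * (2 * s.predict(X) - 1) for s, a in zip(learners, alphas))
  y_pred = (score > 0).astype(int)
  print('training accuracy:', (y_pred == y).mean())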

doc/whats_new/v0.0.4.rst

Lines changed: 4 additions & 0 deletions
@@ -37,6 +37,10 @@ New features
    each bootstrap provided to each tree of the forest.
    :issue:`459` by :user:`Guillaume Lemaitre <glemaitre>`.
 
+- Add :class:`imblearn.ensemble.RUSBoostClassifier` which applies a random
+  under-sampling stage before each boosting iteration of AdaBoost.
+  :issue:`469` by :user:`Guillaume Lemaitre <glemaitre>`.
+
 Enhancement
 ...........
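
A minimal usage example of the new class (the parameter values here are
illustrative, not taken from the commit):

  from sklearn.datasets import make_classification
  from imblearn.ensemble import RUSBoostClassifier

  X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=0)
  # Each of the 50 boosting rounds first balances the data by random
  # under-sampling, then performs the usual AdaBoost fit-and-reweight step.
  clf = RUSBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
  print(clf.predict(X[:5]))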

examples/ensemble/plot_comparison_ensemble_classifier.py

Lines changed: 15 additions & 2 deletions
@@ -34,6 +34,7 @@
 from imblearn.ensemble import BalancedBaggingClassifier
 from imblearn.ensemble import BalancedRandomForestClassifier
 from imblearn.ensemble import EasyEnsembleClassifier
+from imblearn.ensemble import RUSBoostClassifier
 
 from imblearn.metrics import geometric_mean_score
 
@@ -197,8 +198,20 @@ def plot_confusion_matrix(cm, classes, ax,
       .format(balanced_accuracy_score(y_test, y_pred_eec),
               geometric_mean_score(y_test, y_pred_eec)))
 cm_eec = confusion_matrix(y_test, y_pred_eec)
-fig, ax = plt.subplots()
-plot_confusion_matrix(cm_eec, classes=np.unique(satimage.target), ax=ax,
+fig, ax = plt.subplots(ncols=2)
+plot_confusion_matrix(cm_eec, classes=np.unique(satimage.target), ax=ax[0],
                       title='Easy ensemble classifier')
 
+rusboost = RUSBoostClassifier(n_estimators=10,
+                              base_estimator=base_estimator)
+rusboost.fit(X_train, y_train)
+y_pred_rusboost = rusboost.predict(X_test)
+print('RUSBoost classifier performance:')
+print('Balanced accuracy: {:.2f} - Geometric mean {:.2f}'
+      .format(balanced_accuracy_score(y_test, y_pred_rusboost),
              geometric_mean_score(y_test, y_pred_rusboost)))
+cm_rusboost = confusion_matrix(y_test, y_pred_rusboost)
+plot_confusion_matrix(cm_rusboost, classes=np.unique(satimage.target),
+                      ax=ax[1], title='RUSBoost classifier')
+
 plt.show()
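
The snippet added to this example relies on variables defined earlier in the
script (base_estimator, X_train, the fetched satimage data, the
plot_confusion_matrix helper). A self-contained variant of the same comparison
on synthetic data — with a shallow decision tree standing in for the example's
actual base_estimator, which is an assumption — could look like:

  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeClassifier
  from sklearn.metrics import balanced_accuracy_score

  from imblearn.ensemble import EasyEnsembleClassifier, RUSBoostClassifier
  from imblearn.metrics import geometric_mean_score

  X, y = make_classification(n_samples=5000, weights=[0.05, 0.95],
                             random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  # Fit both ensemble classifiers and report the two imbalance-aware metrics
  # used by the full example.
  for clf in (EasyEnsembleClassifier(n_estimators=10, random_state=0),
              RUSBoostClassifier(n_estimators=10,
                                 base_estimator=DecisionTreeClassifier(max_depth=2),
                                 random_state=0)):
      y_pred = clf.fit(X_train, y_train).predict(X_test)
      print('{} - balanced accuracy: {:.2f} - geometric mean: {:.2f}'.format(
          type(clf).__name__,
          balanced_accuracy_score(y_test, y_pred),
          geometric_mean_score(y_test, y_pred)))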

imblearn/ensemble/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -8,7 +8,8 @@
 from ._balance_cascade import BalanceCascade
 from ._bagging import BalancedBaggingClassifier
 from ._forest import BalancedRandomForestClassifier
+from ._weight_boosting import RUSBoostClassifier
 
 __all__ = ['EasyEnsemble', 'EasyEnsembleClassifier',
            'BalancedBaggingClassifier', 'BalanceCascade',
-           'BalancedRandomForestClassifier']
+           'BalancedRandomForestClassifier', 'RUSBoostClassifier']

imblearn/ensemble/_bagging.py

Lines changed: 3 additions & 11 deletions
@@ -23,14 +23,13 @@
     sampling_strategy=BaseUnderSampler._sampling_strategy_docstring,
     random_state=_random_state_docstring)
 class BalancedBaggingClassifier(BaggingClassifier):
-    """A Bagging classifier with additional balancing. It is similar to
-    ``EasyEnsemble`` [6]_.
+    """A Bagging classifier with additional balancing.
 
     This implementation of Bagging is similar to the scikit-learn
     implementation. It includes an additional step to balance the training set
     at fit time using a ``RandomUnderSampler``.
 
-    Read more in the :ref:`User Guide <ensemble_meta_estimators>`.
+    Read more in the :ref:`User Guide <bagging>`.
 
     Parameters
     ----------
@@ -68,9 +67,6 @@ class BalancedBaggingClassifier(BaggingClassifier):
         and add more estimators to the ensemble, otherwise, just fit
         a whole new ensemble.
 
-        .. versionadded:: 0.17
-           *warm_start* constructor parameter.
-
     {sampling_strategy}
 
     replacement : bool, optional (default=False)
@@ -127,7 +123,7 @@ class BalancedBaggingClassifier(BaggingClassifier):
     `max_features='auto'` as a base estimator.
 
     See
-    :ref:`sphx_glr_auto_examples_ensemble_plot_comparison_bagging_classifier.py`.
+    :ref:`sphx_glr_auto_examples_ensemble_plot_comparison_ensemble_classifier.py`.
 
     See also
     --------
@@ -147,10 +143,6 @@ class BalancedBaggingClassifier(BaggingClassifier):
     .. [5] Chen, Chao, Andy Liaw, and Leo Breiman. "Using random forest to
        learn imbalanced data." University of California, Berkeley 110,
        2004.
-    .. [6] X. Y. Liu, J. Wu and Z. H. Zhou, "Exploratory Undersampling for
-       Class-Imbalance Learning," in IEEE Transactions on Systems, Man, and
-       Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp. 539-550,
-       April 2009.
 
     Examples
     --------
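
The docstring above describes the classifier as scikit-learn bagging plus a
RandomUnderSampler step at fit time. A rough hand-rolled equivalent — a sketch
of that behavior, not the actual implementation — chains the sampler and a tree
inside each bagging member:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import BaggingClassifier
  from sklearn.tree import DecisionTreeClassifier

  from imblearn.pipeline import make_pipeline
  from imblearn.under_sampling import RandomUnderSampler

  X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=0)

  # Each bagging member first balances its bootstrap sample, then fits a tree.
  resampled_tree = make_pipeline(RandomUnderSampler(random_state=0),
                                 DecisionTreeClassifier(random_state=0))
  bagging_like = BaggingClassifier(base_estimator=resampled_tree,
                                   n_estimators=10, random_state=0).fit(X, y)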

imblearn/ensemble/_balance_cascade.py

Lines changed: 0 additions & 2 deletions
@@ -30,8 +30,6 @@ class BalanceCascade(BaseEnsembleSampler):
     This method iteratively selects subsets and makes an ensemble of the
     different sets. The selection is performed using a specific classifier.
 
-    Read more in the :ref:`User Guide <ensemble_samplers>`.
-
     Parameters
     ----------
     {sampling_strategy}

imblearn/ensemble/_easy_ensemble.py

Lines changed: 2 additions & 4 deletions
@@ -39,8 +39,6 @@ class EasyEnsemble(BaseEnsembleSampler):
     ``EasyEnsemble`` is deprecated in 0.4 and will be removed in 0.6. Use
     ``EasyEnsembleClassifier`` instead.
 
-    Read more in the :ref:`User Guide <ensemble_samplers>`.
-
     Parameters
     ----------
     {sampling_strategy}
@@ -151,7 +149,7 @@ class EasyEnsembleClassifier(BaggingClassifier):
     ensemble of AdaBoost learners trained on different balanced bootstrap
     samples. The balancing is achieved by random under-sampling.
 
-    Read more in the :ref:`User Guide <ensemble_samplers>`.
+    Read more in the :ref:`User Guide <boosting>`.
 
     Parameters
     ----------
@@ -203,7 +201,7 @@ class EasyEnsembleClassifier(BaggingClassifier):
 
     See also
     --------
-    BalanceCascade, BalancedBaggingClassifier
+    BalancedBaggingClassifier, BalancedRandomForestClassifier
 
     References
     ----------
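
Per this docstring, EasyEnsembleClassifier bags AdaBoost learners over balanced
bootstrap samples. A rough equivalent assembled from the other public classes —
again a sketch under that reading, not the implementation — is:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import AdaBoostClassifier

  from imblearn.ensemble import BalancedBaggingClassifier

  X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=0)

  # Bag AdaBoost learners, each trained on a balanced bootstrap sample.
  easy_like = BalancedBaggingClassifier(
      base_estimator=AdaBoostClassifier(n_estimators=10),
      n_estimators=10, random_state=0).fit(X, y)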
