Commit a928f10

ENH: Add RUSBoostClassifier (#469)

1 parent 513203c commit a928f10

File tree

11 files changed: +516 -124 lines


doc/api.rst

Lines changed: 1 addition & 0 deletions
@@ -112,6 +112,7 @@ Prototype selection
    ensemble.BalancedRandomForestClassifier
    ensemble.EasyEnsemble
    ensemble.EasyEnsembleClassifier
+   ensemble.RUSBoostClassifier
 
 .. _keras_ref:
 

doc/ensemble.rst

Lines changed: 71 additions & 91 deletions
@@ -6,80 +6,30 @@ Ensemble of samplers
 
 .. currentmodule:: imblearn.ensemble
 
-.. _ensemble_samplers:
-
-Samplers
---------
-
-.. warning::
-
-   Note that :class:`EasyEnsemble` is deprecated and you should use
-   :class:`EasyEnsembleClassifier` instead. :class:`EasyEnsembleClassifier` is
-   presented in the next section.
-
-An imbalanced data set can be balanced by creating several balanced
-subsets. The module :mod:`imblearn.ensemble` allows to create such sets.
-
-:class:`EasyEnsemble` creates an ensemble of data sets by randomly
-under-sampling the original set::
-
-  >>> from collections import Counter
-  >>> from sklearn.datasets import make_classification
-  >>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
-  ...                            n_redundant=0, n_repeated=0, n_classes=3,
-  ...                            n_clusters_per_class=1,
-  ...                            weights=[0.01, 0.05, 0.94],
-  ...                            class_sep=0.8, random_state=0)
-  >>> print(sorted(Counter(y).items()))
-  [(0, 64), (1, 262), (2, 4674)]
-  >>> from imblearn.ensemble import EasyEnsemble
-  >>> ee = EasyEnsemble(random_state=0, n_subsets=10)  # doctest: +SKIP
-  >>> X_resampled, y_resampled = ee.fit_resample(X, y)  # doctest: +SKIP
-  >>> print(X_resampled.shape)  # doctest: +SKIP
-  (10, 192, 2)
-  >>> print(sorted(Counter(y_resampled[0]).items()))  # doctest: +SKIP
-  [(0, 64), (1, 64), (2, 64)]
-
-:class:`EasyEnsemble` has two important parameters: (i) ``n_subsets`` to set
-the number of subsets and (ii) ``replacement`` to randomly sample with or
-without replacement.
-
-:class:`BalanceCascade` differs from the previous method by using a classifier
-(set with the parameter ``estimator``) to ensure that misclassified samples
-can again be selected for the next subset. In fact, the classifier plays the
-role of a "smart" replacement method. The maximum number of subsets can be set
-with the parameter ``n_max_subset``, and an additional bootstrapping can be
-activated by setting ``bootstrap`` to ``True``::
-
-  >>> from imblearn.ensemble import BalanceCascade
-  >>> from sklearn.linear_model import LogisticRegression
-  >>> bc = BalanceCascade(random_state=0,
-  ...                     estimator=LogisticRegression(solver='lbfgs',
-  ...                                                  multi_class='auto',
-  ...                                                  random_state=0),
-  ...                     n_max_subset=4)
-  >>> X_resampled, y_resampled = bc.fit_resample(X, y)
-  >>> print(X_resampled.shape)
-  (4, 192, 2)
-  >>> print(sorted(Counter(y_resampled[0]).items()))
-  [(0, 64), (1, 64), (2, 64)]
+.. _ensemble_meta_estimators:
 
-See
-:ref:`sphx_glr_auto_examples_ensemble_plot_easy_ensemble.py` and
-:ref:`sphx_glr_auto_examples_ensemble_plot_balance_cascade.py`.
+Classifier including inner balancing samplers
+=============================================
 
-.. _ensemble_meta_estimators:
+.. _bagging:
 
-Chaining ensemble of samplers and estimators
---------------------------------------------
+Bagging classifier
+------------------
 
 In ensemble classifiers, bagging methods build several estimators on different
 randomly selected subsets of data. In scikit-learn, this classifier is named
 ``BaggingClassifier``. However, this classifier does not allow balancing each
 subset of data. Therefore, when training on an imbalanced data set, this
 classifier will favor the majority classes::
 
+  >>> from sklearn.datasets import make_classification
+  >>> X, y = make_classification(n_samples=10000, n_features=2, n_informative=2,
+  ...                            n_redundant=0, n_repeated=0, n_classes=3,
+  ...                            n_clusters_per_class=1,
+  ...                            weights=[0.01, 0.05, 0.94], class_sep=0.8,
+  ...                            random_state=0)
   >>> from sklearn.model_selection import train_test_split
-  >>> from sklearn.metrics import confusion_matrix
+  >>> from sklearn.metrics import balanced_accuracy_score
   >>> from sklearn.ensemble import BaggingClassifier
   >>> from sklearn.tree import DecisionTreeClassifier
   >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
@@ -88,10 +38,8 @@ classifier will favor the majority classes::
   >>> bc.fit(X_train, y_train)  # doctest: +ELLIPSIS
  BaggingClassifier(...)
   >>> y_pred = bc.predict(X_test)
-  >>> confusion_matrix(y_test, y_pred)
-  array([[   9,    1,    2],
-         [   0,   54,    5],
-         [   1,    6, 1172]])
+  >>> balanced_accuracy_score(y_test, y_pred)  # doctest: +ELLIPSIS
+  0.77...
 
 :class:`BalancedBaggingClassifier` allows resampling each subset of data
 before training each estimator of the ensemble. In short, it combines the
@@ -111,45 +59,77 @@ random under-sampler::
   >>> bbc.fit(X_train, y_train)  # doctest: +ELLIPSIS
   BalancedBaggingClassifier(...)
   >>> y_pred = bbc.predict(X_test)
-  >>> confusion_matrix(y_test, y_pred)
-  array([[   9,    1,    2],
-         [   0,   55,    4],
-         [  42,   46, 1091]])
+  >>> balanced_accuracy_score(y_test, y_pred)  # doctest: +ELLIPSIS
+  0.80...
+
+.. _forest:
+
+Forest of randomized trees
+--------------------------
 
 :class:`BalancedRandomForestClassifier` is another ensemble method in which
-each tree of the forest will be provided a balanced bootstrap sample. This class
+each tree of the forest will be provided a balanced bootstrap sample [CLB2004]_. This class
 provides all functionality of the
 :class:`sklearn.ensemble.RandomForestClassifier` and notably the
 `feature_importances_` attribute::
 
-
   >>> from imblearn.ensemble import BalancedRandomForestClassifier
-  >>> brf = BalancedRandomForestClassifier(n_estimators=10, random_state=0)
+  >>> brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
   >>> brf.fit(X_train, y_train)  # doctest: +ELLIPSIS
   BalancedRandomForestClassifier(...)
   >>> y_pred = brf.predict(X_test)
-  >>> confusion_matrix(y_test, y_pred)
-  array([[   9,    1,    2],
-         [   3,   54,    2],
-         [ 113,   47, 1019]])
-  >>> brf.feature_importances_
-  array([ 0.63501243,  0.36498757])
-
-A specific method which uses ``AdaBoost`` as learners in the bagging
-classifier is called EasyEnsemble. The :class:`EasyEnsembleClassifier` allows
-bagging AdaBoost learners which are trained on balanced bootstrap samples.
-Similarly to the :class:`BalancedBaggingClassifier` API, one can construct
-the ensemble as::
+  >>> balanced_accuracy_score(y_test, y_pred)  # doctest: +ELLIPSIS
+  0.81...
+  >>> brf.feature_importances_  # doctest: +ELLIPSIS
+  array([ 0.55...,  0.44...])
+
+.. _boosting:
+
+Boosting
+--------
+
+Several methods taking advantage of boosting have been designed.
+
+:class:`RUSBoostClassifier` randomly under-samples the dataset before
+performing each boosting iteration [SKHN2010]_::
+
+  >>> from imblearn.ensemble import RUSBoostClassifier
+  >>> rusboost = RUSBoostClassifier(random_state=0)
+  >>> rusboost.fit(X_train, y_train)  # doctest: +ELLIPSIS
+  RUSBoostClassifier(...)
+  >>> y_pred = rusboost.predict(X_test)
+  >>> balanced_accuracy_score(y_test, y_pred)  # doctest: +ELLIPSIS
+  0.74770070758043261
+
+A specific method which uses ``AdaBoost`` as learners in the bagging classifier
+is called EasyEnsemble. The :class:`EasyEnsembleClassifier` allows bagging
+AdaBoost learners which are trained on balanced bootstrap samples [LWZ2009]_.
+Similarly to the :class:`BalancedBaggingClassifier` API, one can construct the
+ensemble as::
 
   >>> from imblearn.ensemble import EasyEnsembleClassifier
   >>> eec = EasyEnsembleClassifier(random_state=0)
   >>> eec.fit(X_train, y_train)  # doctest: +ELLIPSIS
   EasyEnsembleClassifier(...)
   >>> y_pred = eec.predict(X_test)
-  >>> confusion_matrix(y_test, y_pred)
-  array([[  9,   1,   2],
-         [  5,  52,   2],
-         [252,  45, 882]])
+  >>> balanced_accuracy_score(y_test, y_pred)  # doctest: +ELLIPSIS
+  0.62484778593026025
 
 See
-:ref:`sphx_glr_auto_examples_ensemble_plot_comparison_ensemble_classifier.py`.
+:ref:`sphx_glr_auto_examples_ensemble_plot_comparison_ensemble_classifier.py`.
+
+.. topic:: References
+
+   .. [CLB2004] Chen, Chao, Andy Liaw, and Leo Breiman. "Using random forest
+      to learn imbalanced data." University of California, Berkeley 110
+      (2004): 1-12.
+
+   .. [LWZ2009] X. Y. Liu, J. Wu and Z. H. Zhou, "Exploratory undersampling
+      for class-imbalance learning," IEEE Transactions on Systems, Man, and
+      Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp. 539-550,
+      April 2009.
+
+   .. [SKHN2010] Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., and
+      Napolitano, A. "RUSBoost: A hybrid approach to alleviating class
+      imbalance." IEEE Transactions on Systems, Man, and Cybernetics,
+      Part A: Systems and Humans 40.1 (2010): 185-197.
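
The new user-guide text describes RUSBoost as random under-sampling before each
boosting iteration. For readers who want to see the mechanism rather than the
library call, the following is a simplified, self-contained sketch of that idea
for a binary problem; it is not the implementation added in this commit (see
imblearn/ensemble/_weight_boosting.py), and all names below are local to the
sketch.

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.tree import DecisionTreeClassifier

  # Imbalanced binary problem: class 0 is the minority class.
  X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=0)
  rng = np.random.RandomState(0)

  w = np.full(len(y), 1 / len(y))  # AdaBoost sample weights over the full set
  learners, alphas = [], []
  for k in range(10):
      # RUS step: keep every minority sample, draw as many majority samples.
      idx_min = np.flatnonzero(y == 0)
      idx_maj = rng.choice(np.flatnonzero(y == 1), size=idx_min.size,
                           replace=False)
      idx = np.concatenate([idx_min, idx_maj])
      # Boosting step: fit a weak learner on the balanced sample, carrying
      # the current boosting weights along.
      stump = DecisionTreeClassifier(max_depth=1, random_state=k)
      stump.fit(X[idx], y[idx], sample_weight=w[idx])
      # The weight update is computed on the full (imbalanced) training set.
      miss = stump.predict(X) != y
      err = np.clip(np.average(miss, weights=w), 1e-10, 1 - 1e-10)
      alpha = np.log((1 - err) / err)
      w *= np.exp(alpha * miss)
      w /= w.sum()
      learners.append(stump)
      alphas.append(alpha)

  # Weighted vote of the weak learners (labels mapped to {-1, +1}).
  score = sum(a * (2 * s.predict(X) - 1) for s, a in zip(learners, alphas))
  y_pred = (score > 0).astype(int)
  print('training accuracy:', (y_pred == y).mean())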

doc/whats_new/v0.0.4.rst

Lines changed: 4 additions & 0 deletions
@@ -37,6 +37,10 @@ New features
    each bootstrap provided to each tree of the forest.
    :issue:`459` by :user:`Guillaume Lemaitre <glemaitre>`.
 
+- Add :class:`imblearn.ensemble.RUSBoostClassifier` which applies a random
+  under-sampling stage before each boosting iteration of AdaBoost.
+  :issue:`469` by :user:`Guillaume Lemaitre <glemaitre>`.
+
 Enhancement
 ...........
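
A minimal usage example of the new class (the parameter values here are
illustrative, not taken from the commit):

  from sklearn.datasets import make_classification
  from imblearn.ensemble import RUSBoostClassifier

  X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=0)
  # Each of the 50 boosting rounds first balances the data by random
  # under-sampling, then performs the usual AdaBoost fit-and-reweight step.
  clf = RUSBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
  print(clf.predict(X[:5]))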

examples/ensemble/plot_comparison_ensemble_classifier.py

Lines changed: 15 additions & 2 deletions
@@ -34,6 +34,7 @@
 from imblearn.ensemble import BalancedBaggingClassifier
 from imblearn.ensemble import BalancedRandomForestClassifier
 from imblearn.ensemble import EasyEnsembleClassifier
+from imblearn.ensemble import RUSBoostClassifier
 
 from imblearn.metrics import geometric_mean_score
 
@@ -197,8 +198,20 @@ def plot_confusion_matrix(cm, classes, ax,
       .format(balanced_accuracy_score(y_test, y_pred_eec),
               geometric_mean_score(y_test, y_pred_eec)))
 cm_eec = confusion_matrix(y_test, y_pred_eec)
-fig, ax = plt.subplots()
-plot_confusion_matrix(cm_eec, classes=np.unique(satimage.target), ax=ax,
+fig, ax = plt.subplots(ncols=2)
+plot_confusion_matrix(cm_eec, classes=np.unique(satimage.target), ax=ax[0],
                       title='Easy ensemble classifier')
 
+rusboost = RUSBoostClassifier(n_estimators=10,
+                              base_estimator=base_estimator)
+rusboost.fit(X_train, y_train)
+y_pred_rusboost = rusboost.predict(X_test)
+print('RUSBoost classifier performance:')
+print('Balanced accuracy: {:.2f} - Geometric mean {:.2f}'
+      .format(balanced_accuracy_score(y_test, y_pred_rusboost),
              geometric_mean_score(y_test, y_pred_rusboost)))
+cm_rusboost = confusion_matrix(y_test, y_pred_rusboost)
+plot_confusion_matrix(cm_rusboost, classes=np.unique(satimage.target),
+                      ax=ax[1], title='RUSBoost classifier')
+
 plt.show()
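
The snippet added to this example relies on variables defined earlier in the
script (base_estimator, X_train, the fetched satimage data, the
plot_confusion_matrix helper). A self-contained variant of the same comparison
on synthetic data — with a shallow decision tree standing in for the example's
actual base_estimator, which is an assumption — could look like:

  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeClassifier
  from sklearn.metrics import balanced_accuracy_score

  from imblearn.ensemble import EasyEnsembleClassifier, RUSBoostClassifier
  from imblearn.metrics import geometric_mean_score

  X, y = make_classification(n_samples=5000, weights=[0.05, 0.95],
                             random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  # Fit both ensemble classifiers and report the two imbalance-aware metrics
  # used by the full example.
  for clf in (EasyEnsembleClassifier(n_estimators=10, random_state=0),
              RUSBoostClassifier(n_estimators=10,
                                 base_estimator=DecisionTreeClassifier(max_depth=2),
                                 random_state=0)):
      y_pred = clf.fit(X_train, y_train).predict(X_test)
      print('{} - balanced accuracy: {:.2f} - geometric mean: {:.2f}'.format(
          type(clf).__name__,
          balanced_accuracy_score(y_test, y_pred),
          geometric_mean_score(y_test, y_pred)))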

imblearn/ensemble/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -8,7 +8,8 @@
 from ._balance_cascade import BalanceCascade
 from ._bagging import BalancedBaggingClassifier
 from ._forest import BalancedRandomForestClassifier
+from ._weight_boosting import RUSBoostClassifier
 
 __all__ = ['EasyEnsemble', 'EasyEnsembleClassifier',
            'BalancedBaggingClassifier', 'BalanceCascade',
-           'BalancedRandomForestClassifier']
+           'BalancedRandomForestClassifier', 'RUSBoostClassifier']

imblearn/ensemble/_bagging.py

Lines changed: 3 additions & 11 deletions
@@ -23,14 +23,13 @@
     sampling_strategy=BaseUnderSampler._sampling_strategy_docstring,
     random_state=_random_state_docstring)
 class BalancedBaggingClassifier(BaggingClassifier):
-    """A Bagging classifier with additional balancing. It is similar to
-    ``EasyEnsemble`` [6]_.
+    """A Bagging classifier with additional balancing.
 
     This implementation of Bagging is similar to the scikit-learn
     implementation. It includes an additional step to balance the training set
     at fit time using a ``RandomUnderSampler``.
 
-    Read more in the :ref:`User Guide <ensemble_meta_estimators>`.
+    Read more in the :ref:`User Guide <bagging>`.
 
     Parameters
     ----------
@@ -68,9 +67,6 @@ class BalancedBaggingClassifier(BaggingClassifier):
         and add more estimators to the ensemble, otherwise, just fit
         a whole new ensemble.
 
-        .. versionadded:: 0.17
-           *warm_start* constructor parameter.
-
     {sampling_strategy}
 
     replacement : bool, optional (default=False)
@@ -127,7 +123,7 @@ class BalancedBaggingClassifier(BaggingClassifier):
     `max_features='auto'` as a base estimator.
 
     See
-    :ref:`sphx_glr_auto_examples_ensemble_plot_comparison_bagging_classifier.py`.
+    :ref:`sphx_glr_auto_examples_ensemble_plot_comparison_ensemble_classifier.py`.
 
     See also
     --------
@@ -147,10 +143,6 @@ class BalancedBaggingClassifier(BaggingClassifier):
     .. [5] Chen, Chao, Andy Liaw, and Leo Breiman. "Using random forest to
        learn imbalanced data." University of California, Berkeley 110,
        2004.
-    .. [6] X. Y. Liu, J. Wu and Z. H. Zhou, "Exploratory Undersampling for
-       Class-Imbalance Learning," in IEEE Transactions on Systems, Man, and
-       Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp. 539-550,
-       April 2009.
 
     Examples
     --------
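
The docstring above describes the classifier as scikit-learn bagging plus a
RandomUnderSampler step at fit time. A rough hand-rolled equivalent — a sketch
of that behavior, not the actual implementation — chains the sampler and a tree
inside each bagging member:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import BaggingClassifier
  from sklearn.tree import DecisionTreeClassifier

  from imblearn.pipeline import make_pipeline
  from imblearn.under_sampling import RandomUnderSampler

  X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=0)

  # Each bagging member first balances its bootstrap sample, then fits a tree.
  resampled_tree = make_pipeline(RandomUnderSampler(random_state=0),
                                 DecisionTreeClassifier(random_state=0))
  bagging_like = BaggingClassifier(base_estimator=resampled_tree,
                                   n_estimators=10, random_state=0).fit(X, y)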

imblearn/ensemble/_balance_cascade.py

Lines changed: 0 additions & 2 deletions
@@ -30,8 +30,6 @@ class BalanceCascade(BaseEnsembleSampler):
     This method iteratively selects subsets and makes an ensemble of the
     different sets. The selection is performed using a specific classifier.
 
-    Read more in the :ref:`User Guide <ensemble_samplers>`.
-
     Parameters
     ----------
     {sampling_strategy}

imblearn/ensemble/_easy_ensemble.py

Lines changed: 2 additions & 4 deletions
@@ -39,8 +39,6 @@ class EasyEnsemble(BaseEnsembleSampler):
     ``EasyEnsemble`` is deprecated in 0.4 and will be removed in 0.6. Use
     ``EasyEnsembleClassifier`` instead.
 
-    Read more in the :ref:`User Guide <ensemble_samplers>`.
-
     Parameters
     ----------
     {sampling_strategy}
@@ -151,7 +149,7 @@ class EasyEnsembleClassifier(BaggingClassifier):
     ensemble of AdaBoost learners trained on different balanced bootstrap
     samples. The balancing is achieved by random under-sampling.
 
-    Read more in the :ref:`User Guide <ensemble_samplers>`.
+    Read more in the :ref:`User Guide <boosting>`.
 
     Parameters
     ----------
@@ -203,7 +201,7 @@ class EasyEnsembleClassifier(BaggingClassifier):
 
     See also
     --------
-    BalanceCascade, BalancedBaggingClassifier
+    BalancedBaggingClassifier, BalancedRandomForestClassifier
 
     References
     ----------
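
Per this docstring, EasyEnsembleClassifier bags AdaBoost learners over balanced
bootstrap samples. A rough equivalent assembled from the other public classes —
again a sketch under that reading, not the implementation — is:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import AdaBoostClassifier

  from imblearn.ensemble import BalancedBaggingClassifier

  X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=0)

  # Bag AdaBoost learners, each trained on a balanced bootstrap sample.
  easy_like = BalancedBaggingClassifier(
      base_estimator=AdaBoostClassifier(n_estimators=10),
      n_estimators=10, random_state=0).fit(X, y)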
