@@ -6,80 +6,30 @@ Ensemble of samplers

.. currentmodule:: imblearn.ensemble

- .. _ensemble_samplers:
-
- Samplers
- --------
-
- .. warning::
-    Note that :class:`EasyEnsemble` is deprecated and you should use
-    :class:`EasyEnsembleClassifier` instead. :class:`EasyEnsembleClassifier` is
-    presented in the next section.
-
- An imbalanced data set can be balanced by creating several balanced
- subsets. The module :mod:`imblearn.ensemble` allows creating such sets.
-
- :class:`EasyEnsemble` creates an ensemble of data sets by randomly
- under-sampling the original set::
-
-   >>> from collections import Counter
-   >>> from sklearn.datasets import make_classification
-   >>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
-   ...                            n_redundant=0, n_repeated=0, n_classes=3,
-   ...                            n_clusters_per_class=1,
-   ...                            weights=[0.01, 0.05, 0.94],
-   ...                            class_sep=0.8, random_state=0)
-   >>> print(sorted(Counter(y).items()))
-   [(0, 64), (1, 262), (2, 4674)]
-   >>> from imblearn.ensemble import EasyEnsemble
-   >>> ee = EasyEnsemble(random_state=0, n_subsets=10)  # doctest: +SKIP
-   >>> X_resampled, y_resampled = ee.fit_resample(X, y)  # doctest: +SKIP
-   >>> print(X_resampled.shape)  # doctest: +SKIP
-   (10, 192, 2)
-   >>> print(sorted(Counter(y_resampled[0]).items()))  # doctest: +SKIP
-   [(0, 64), (1, 64), (2, 64)]
-
- :class:`EasyEnsemble` has two important parameters: (i) ``n_subsets``, which
- controls the number of subsets to return, and (ii) ``replacement``, which
- decides whether to sample randomly with or without replacement.
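For illustration, a minimal sketch combining both parameters with the deprecated
sampler shown above (assuming the same ``X`` and ``y``; the subset shape follows
the same pattern as in the previous example)::

  >>> from imblearn.ensemble import EasyEnsemble
  >>> # three balanced subsets, drawing the retained samples with replacement
  >>> ee = EasyEnsemble(n_subsets=3, replacement=True, random_state=0)  # doctest: +SKIP
  >>> X_res, y_res = ee.fit_resample(X, y)  # doctest: +SKIP
  >>> print(X_res.shape)  # doctest: +SKIP
  (3, 192, 2)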
-
- :class:`BalanceCascade` differs from the previous method by using a classifier
- (set with the parameter ``estimator``) to ensure that misclassified samples
- can be selected again for the next subset. In fact, the classifier plays the
- role of a "smart" replacement method. The maximum number of subsets can be set
- with the parameter ``n_max_subset``, and an additional bootstrapping can be
- activated by setting ``bootstrap`` to ``True``::
-
-   >>> from imblearn.ensemble import BalanceCascade
-   >>> from sklearn.linear_model import LogisticRegression
-   >>> bc = BalanceCascade(random_state=0,
-   ...                     estimator=LogisticRegression(solver='lbfgs',
-   ...                                                  multi_class='auto',
-   ...                                                  random_state=0),
-   ...                     n_max_subset=4)
-   >>> X_resampled, y_resampled = bc.fit_resample(X, y)
-   >>> print(X_resampled.shape)
-   (4, 192, 2)
-   >>> print(sorted(Counter(y_resampled[0]).items()))
-   [(0, 64), (1, 64), (2, 64)]
+ .. _ensemble_meta_estimators:

- See
- :ref:`sphx_glr_auto_examples_ensemble_plot_easy_ensemble.py` and
- :ref:`sphx_glr_auto_examples_ensemble_plot_balance_cascade.py`.
+ Classifier including inner balancing samplers
+ =============================================

- .. _ensemble_meta_estimators:
+ .. _bagging:

- Chaining ensemble of samplers and estimators
- --------------------------------------------
+ Bagging classifier
+ ------------------

In ensemble classifiers, bagging methods build several estimators on different
randomly selected subsets of data. In scikit-learn, this classifier is named
``BaggingClassifier``. However, this classifier does not allow each subset of
data to be balanced. Therefore, when training on an imbalanced data set, this
classifier will favor the majority classes::

+   >>> from sklearn.datasets import make_classification
+   >>> X, y = make_classification(n_samples=10000, n_features=2, n_informative=2,
+   ...                            n_redundant=0, n_repeated=0, n_classes=3,
+   ...                            n_clusters_per_class=1,
+   ...                            weights=[0.01, 0.05, 0.94], class_sep=0.8,
+   ...                            random_state=0)
  >>> from sklearn.model_selection import train_test_split
-   >>> from sklearn.metrics import confusion_matrix
+   >>> from sklearn.metrics import balanced_accuracy_score
  >>> from sklearn.ensemble import BaggingClassifier
  >>> from sklearn.tree import DecisionTreeClassifier
  >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
@@ -88,10 +38,8 @@ classifier will favor the majority classes::
  >>> bc.fit(X_train, y_train)  # doctest: +ELLIPSIS
  BaggingClassifier(...)
  >>> y_pred = bc.predict(X_test)
-   >>> confusion_matrix(y_test, y_pred)
-   array([[   9,    1,    2],
-          [   0,   54,    5],
-          [   1,    6, 1172]])
+   >>> balanced_accuracy_score(y_test, y_pred)  # doctest: +ELLIPSIS
+   0.77...

:class:`BalancedBaggingClassifier` allows resampling each subset of data
before training each estimator of the ensemble. In short, it combines the
@@ -111,45 +59,77 @@ random under-sampler::
  >>> bbc.fit(X_train, y_train)  # doctest: +ELLIPSIS
  BalancedBaggingClassifier(...)
  >>> y_pred = bbc.predict(X_test)
-   >>> confusion_matrix(y_test, y_pred)
-   array([[   9,    1,    2],
-          [   0,   55,    4],
-          [  42,   46, 1091]])
+   >>> balanced_accuracy_score(y_test, y_pred)  # doctest: +ELLIPSIS
+   0.80...
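
For illustration, the inner resampling of :class:`BalancedBaggingClassifier` can
be tuned; the following is a minimal sketch, assuming the ``sampling_strategy``
and ``replacement`` constructor parameters and reusing ``X_train``/``y_train``
from above::

  >>> from imblearn.ensemble import BalancedBaggingClassifier
  >>> from sklearn.tree import DecisionTreeClassifier
  >>> # 'auto' under-samples every class but the minority one;
  >>> # replacement=False draws the kept samples without replacement
  >>> bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
  ...                                 sampling_strategy='auto',
  ...                                 replacement=False,
  ...                                 random_state=0)
  >>> bbc.fit(X_train, y_train)  # doctest: +SKIP
  BalancedBaggingClassifier(...)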
+
+ .. _forest:
+
+ Forest of randomized trees
+ --------------------------

:class:`BalancedRandomForestClassifier` is another ensemble method in which
- each tree of the forest will be provided a balanced bootstrap sample. This class
+ each tree of the forest will be provided a balanced bootstrap sample [CLB2004]_. This class
provides all functionality of the
:class:`sklearn.ensemble.RandomForestClassifier` and notably the
``feature_importances_`` attribute::

-
  >>> from imblearn.ensemble import BalancedRandomForestClassifier
-   >>> brf = BalancedRandomForestClassifier(n_estimators=10, random_state=0)
+   >>> brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
  >>> brf.fit(X_train, y_train)  # doctest: +ELLIPSIS
  BalancedRandomForestClassifier(...)
  >>> y_pred = brf.predict(X_test)
-   >>> confusion_matrix(y_test, y_pred)
-   array([[   9,    1,    2],
-          [   3,   54,    2],
-          [ 113,   47, 1019]])
-   >>> brf.feature_importances_
-   array([ 0.63501243, 0.36498757])
-
- A specific method which uses ``AdaBoost`` learners in the bagging
- classifier is called EasyEnsemble. The :class:`EasyEnsembleClassifier` allows
- bagging AdaBoost learners which are trained on balanced bootstrap samples.
- Similarly to the :class:`BalancedBaggingClassifier` API, one can construct
- the ensemble as::
+   >>> balanced_accuracy_score(y_test, y_pred)  # doctest: +ELLIPSIS
+   0.81...
+   >>> brf.feature_importances_  # doctest: +ELLIPSIS
+   array([ 0.55..., 0.44...])
+
+ .. _boosting:
+
+ Boosting
+ --------
+
+ Several methods that take advantage of boosting have been designed.
+
+ :class:`RUSBoostClassifier` randomly under-samples the dataset before performing
+ a boosting iteration [SKHN2010]_::
+
+   >>> from imblearn.ensemble import RUSBoostClassifier
+   >>> rusboost = RUSBoostClassifier(random_state=0)
+   >>> rusboost.fit(X_train, y_train)  # doctest: +ELLIPSIS
+   RUSBoostClassifier(...)
+   >>> y_pred = rusboost.predict(X_test)
+   >>> balanced_accuracy_score(y_test, y_pred)  # doctest: +ELLIPSIS
+   0.74770070758043261
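
For illustration, the boosting itself can be adjusted through the usual
AdaBoost-style parameters; the following is a minimal sketch, assuming the
``base_estimator`` and ``n_estimators`` parameters mirror
:class:`sklearn.ensemble.AdaBoostClassifier` and reusing the training set from
above::

  >>> from imblearn.ensemble import RUSBoostClassifier
  >>> from sklearn.tree import DecisionTreeClassifier
  >>> # more boosting rounds on decision stumps; each round is fitted on a
  >>> # randomly under-sampled (balanced) version of the training set
  >>> rusboost = RUSBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
  ...                               n_estimators=200, random_state=0)
  >>> rusboost.fit(X_train, y_train)  # doctest: +SKIP
  RUSBoostClassifier(...)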
+
+ A specific method which uses ``AdaBoost`` learners in the bagging classifier
+ is called EasyEnsemble. The :class:`EasyEnsembleClassifier` allows bagging
+ AdaBoost learners which are trained on balanced bootstrap samples [LWZ2009]_.
+ Similarly to the :class:`BalancedBaggingClassifier` API, one can construct the
+ ensemble as::

  >>> from imblearn.ensemble import EasyEnsembleClassifier
  >>> eec = EasyEnsembleClassifier(random_state=0)
  >>> eec.fit(X_train, y_train)  # doctest: +ELLIPSIS
  EasyEnsembleClassifier(...)
  >>> y_pred = eec.predict(X_test)
-   >>> confusion_matrix(y_test, y_pred)
-   array([[  9,   1,   2],
-          [  5,  52,   2],
-          [252,  45, 882]])
+   >>> balanced_accuracy_score(y_test, y_pred)  # doctest: +ELLIPSIS
+   0.62484778593026025
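
As a further illustration, the number of bagged AdaBoost learners can be
increased; the following is a minimal sketch, assuming an ``n_estimators``
parameter as in the other bagging-style classifiers and reusing the training
set from above::

  >>> from imblearn.ensemble import EasyEnsembleClassifier
  >>> # bag 20 AdaBoost learners, each trained on its own balanced bootstrap sample
  >>> eec = EasyEnsembleClassifier(n_estimators=20, random_state=0)
  >>> eec.fit(X_train, y_train)  # doctest: +SKIP
  EasyEnsembleClassifier(...)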

See
- :ref:`sphx_glr_auto_examples_ensemble_plot_comparison_ensemble_classifier.py`.
+ :ref:`sphx_glr_auto_examples_ensemble_plot_comparison_ensemble_classifier.py`.
+
+ .. topic:: References
+
+    .. [CLB2004] Chen, Chao, Andy Liaw, and Leo Breiman. "Using random forest
+       to learn imbalanced data." University of California, Berkeley 110
+       (2004): 1-12.
+
+    .. [LWZ2009] X. Y. Liu, J. Wu and Z. H. Zhou, "Exploratory Undersampling for
+       Class-Imbalance Learning," IEEE Transactions on Systems, Man, and
+       Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp. 539-550,
+       April 2009.
+
+    .. [SKHN2010] Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., and
+       Napolitano, A. "RUSBoost: A hybrid approach to alleviating class
+       imbalance." IEEE Transactions on Systems, Man, and Cybernetics, Part A:
+       Systems and Humans 40.1 (2010): 185-197.