Commit 6e9eeff

Andrea Lorenzon authored

Add Random Over-Sampling Examples (ROSE) class

Co-authored-by: Andrea Lorenzon <[email protected]>
1 parent 0acd717 commit 6e9eeff

File tree

10 files changed: +383 -1 lines changed

README.rst

Lines changed: 4 additions & 1 deletion

@@ -155,6 +155,7 @@ Below is a list of the methods currently implemented in this module.
     5. SVM SMOTE - Support Vectors SMOTE [10]_
     6. ADASYN - Adaptive synthetic sampling approach for imbalanced learning [15]_
     7. KMeans-SMOTE [17]_
+    8. ROSE - Random OverSampling Examples [19]_

 * Over-sampling followed by under-sampling
     1. SMOTE + Tomek links [12]_
@@ -210,4 +211,6 @@ References:

 .. [17] : Felix Last, Georgios Douzas, Fernando Bacao, "Oversampling for Imbalanced Learning Based on K-Means and SMOTE"

-.. [18] : Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., & Napolitano, A. "RUSBoost: A hybrid approach to alleviating class imbalance." IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 40.1 (2010): 185-197.
+.. [18] : Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., & Napolitano, A. "RUSBoost: A hybrid approach to alleviating class imbalance." IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 40.1 (2010): 185-197.
+
+.. [19] : Menardi, G., Torelli, N. "Training and assessing classification rules with imbalanced data." Data Mining and Knowledge Discovery 28.1 (2014): 92-122.

doc/api.rst

Lines changed: 1 addition & 0 deletions

@@ -76,6 +76,7 @@ Prototype selection
     over_sampling.SMOTE
     over_sampling.SMOTENC
     over_sampling.SVMSMOTE
+    over_sampling.ROSE


 .. _combine_ref:

doc/bibtex/refs.bib

Lines changed: 15 additions & 0 deletions

@@ -193,3 +193,18 @@ @article{smith2014instance
   year={2014},
   publisher={Springer}
 }
+
+@article{torelli2014rose,
+  author={Menardi, Giovanna and Torelli, Nicola},
+  title={Training and assessing classification rules with imbalanced data},
+  journal={Data Mining and Knowledge Discovery},
+  volume={28},
+  number={1},
+  pages={92-122},
+  year={2014},
+  publisher={Springer},
+  issn={1573-756X},
+  url={https://doi.org/10.1007/s10618-012-0295-5},
+  doi={10.1007/s10618-012-0295-5}
+}

doc/over_sampling.rst

Lines changed: 17 additions & 0 deletions

@@ -198,6 +198,23 @@ Therefore, it can be seen that the samples generated in the first and last
 columns are belonging to the same categories originally presented without any
 other extra interpolation.

+.. _rose:
+
+ROSE (Random Over-Sampling Examples)
+------------------------------------
+
+ROSE generates artificial samples by smoothed bootstrapping. First, random
+samples are drawn with replacement from the classes to be over-sampled.
+Then a multivariate Gaussian kernel is placed around each selected sample,
+yielding the smoothed density estimate
+:math:`\hat{f}(x|y=Y_j) = \sum_{i=1}^{n_j} p_i \Pr(x|x_i)
+= \sum_{i=1}^{n_j} \frac{1}{n_j} \Pr(x|x_i)
+= \sum_{i=1}^{n_j} \frac{1}{n_j} K_{H_j}(x - x_i)`,
+where :math:`n_j` is the number of samples in class :math:`Y_j` and
+:math:`H_j` is its smoothing matrix.
+
+Finally, the new samples are drawn from this estimated distribution.
+

 Mathematical formulation
 ========================

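The smoothed bootstrap described above is straightforward to reproduce directly in NumPy. The following sketch is illustrative only, not part of the commit (the toy data and variable names are invented): it draws synthetic points for a single class by resampling centres with replacement and jittering them with Gaussian noise scaled by the AMISE-optimal bandwidth, the same rule `_make_samples` in `_rose.py` below applies through a diagonal smoothing matrix.

import numpy as np

rng = np.random.default_rng(0)

# Toy minority class: 50 samples, 2 features.
X_min = rng.normal(loc=[2.0, -1.0], scale=[0.5, 0.2], size=(50, 2))
n, d = X_min.shape
n_new = 100  # number of synthetic samples to draw

# AMISE-optimal constant for a multivariate Gaussian kernel,
# matching `minimize_amise` in _rose.py.
h = (4 / ((d + 2) * n)) ** (1 / (d + 4))

# Per-feature bandwidths; element-wise multiplication here is
# equivalent to the diagonal-matrix product used in _rose.py.
bandwidth = h * X_min.std(axis=0, ddof=1)

# Smoothed bootstrap: resample centres with replacement, then add
# kernel-scaled Gaussian noise around each centre.
idx = rng.integers(0, n, size=n_new)
X_new = X_min[idx] + rng.standard_normal((n_new, d)) * bandwidth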
doc/whats_new/v0.7.rst

Lines changed: 3 additions & 0 deletions

@@ -63,6 +63,9 @@ Enhancements
 - Lazy import `keras` module when importing `imblearn.keras`
   :pr:`719` by :user:`Guillaume Lemaitre <glemaitre>`.

+- Added Random Over-Sampling Examples (ROSE) class.
+  :pr:`754` by :user:`Andrea Lorenzon <andrealorenzon>`.
+
 Deprecation
 ...........

imblearn/over_sampling/__init__.py

Lines changed: 2 additions & 0 deletions

@@ -10,6 +10,7 @@
 from ._smote import KMeansSMOTE
 from ._smote import SVMSMOTE
 from ._smote import SMOTENC
+from ._rose import ROSE

 __all__ = [
     "ADASYN",
@@ -19,4 +20,5 @@
     "BorderlineSMOTE",
     "SVMSMOTE",
     "SMOTENC",
+    "ROSE",
 ]

imblearn/over_sampling/_rose.py

Lines changed: 202 additions & 0 deletions

@@ -0,0 +1,202 @@

"""Class to perform over-sampling using ROSE."""

import numpy as np
from scipy import sparse
from sklearn.utils import check_random_state

from .base import BaseOverSampler
from ..utils._validation import _deprecate_positional_args


class ROSE(BaseOverSampler):
    """Random Over-Sampling Examples (ROSE).

    This object implements the ROSE algorithm. It generates new
    samples by a smoothed bootstrap approach: it takes a random
    subsample of the original data and draws new samples from a
    multivariate Gaussian kernel density estimate :math:`f(x|y=Y_j)`
    built around the subsample, with a per-class smoothing matrix
    :math:`H_j`. Per-class shrink factors can be provided to
    compress or dilate the bandwidth of the Gaussian kernel.

    Read more in the :ref:`User Guide <rose>`.

    Parameters
    ----------
    sampling_strategy : float, str, dict or callable, default='auto'
        Sampling information to resample the data set.

        - When ``float``, it corresponds to the desired ratio of the
          number of samples in the minority class over the number of
          samples in the majority class after resampling. Therefore,
          the ratio is expressed as
          :math:`\\alpha_{os} = N_{rm} / N_{M}` where :math:`N_{rm}`
          is the number of samples in the minority class after
          resampling and :math:`N_{M}` is the number of samples in
          the majority class.

          .. warning::
             ``float`` is only available for **binary**
             classification. An error is raised for multi-class
             classification.

        - When ``str``, specify the class targeted by the resampling.
          The number of samples in the different classes will be
          equalized. Possible choices are:

          ``'minority'``: resample only the minority class;

          ``'not minority'``: resample all classes but the minority class;

          ``'not majority'``: resample all classes but the majority class;

          ``'all'``: resample all classes;

          ``'auto'``: equivalent to ``'not majority'``.

        - When ``dict``, the keys correspond to the targeted classes.
          The values correspond to the desired number of samples for
          each targeted class.

        - When callable, function taking ``y`` and returning a
          ``dict``. The keys correspond to the targeted classes. The
          values correspond to the desired number of samples for each
          class.

    shrink_factors : dict, default=None
        Dict of ``{class_label: shrink_factor}`` items applied to the
        Gaussian kernels, to compress/dilate the kernel of each
        class. When ``None``, a factor of 1 is used for every class.

    random_state : int, RandomState instance, default=None
        Control the randomization of the algorithm.

        - If int, ``random_state`` is the seed used by the random
          number generator;
        - If ``RandomState`` instance, random_state is the random
          number generator;
        - If ``None``, the random number generator is the
          ``RandomState`` instance used by ``np.random``.

    n_jobs : int, default=None
        Number of CPU cores used during resampling.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend`
        context. ``-1`` means using all processors. See
        `Glossary <https://scikit-learn.org/stable/glossary.html#term-n-jobs>`_
        for more details.

    See Also
    --------
    SMOTE : Over-sample using SMOTE.

    Notes
    -----
    Supports multi-class resampling by resampling each targeted class
    independently. Supports sparse input.

    References
    ----------
    .. [1] N. Lunardon, G. Menardi, N. Torelli, "ROSE: A Package for
       Binary Imbalanced Learning," R Journal, 6(1), 2014.

    .. [2] G. Menardi, N. Torelli, "Training and assessing
       classification rules with imbalanced data," Data Mining and
       Knowledge Discovery, 28(1), pp.92-122, 2014.

    Examples
    --------
    >>> from imblearn.over_sampling import ROSE
    >>> from sklearn.datasets import make_classification
    >>> from collections import Counter
    >>> r = ROSE(shrink_factors={0: 1, 1: 0.5, 2: 0.7})
    >>> X, y = make_classification(n_classes=3, class_sep=2,
    ... weights=[0.1, 0.7, 0.2], n_informative=3, n_redundant=1,
    ... flip_y=0, n_features=20, n_clusters_per_class=1,
    ... n_samples=2000, random_state=10)
    >>> print('Original dataset shape %s' % Counter(y))
    Original dataset shape Counter({1: 1400, 2: 400, 0: 200})
    >>> X_res, y_res = r.fit_resample(X, y)
    >>> print('Resampled dataset shape %s' % Counter(y_res))
    Resampled dataset shape Counter({2: 1400, 1: 1400, 0: 1400})
    """

    @_deprecate_positional_args
    def __init__(self, *, sampling_strategy="auto", shrink_factors=None,
                 random_state=None, n_jobs=None):
        super().__init__(sampling_strategy=sampling_strategy)
        self.random_state = random_state
        self.shrink_factors = shrink_factors
        self.n_jobs = n_jobs

    def _make_samples(self, X, class_indices, n_class_samples, h_shrink):
        """Return artificial samples for one class, constructed from a
        random subsample of the data by placing a multivariate
        Gaussian kernel around each drawn point and sampling from it.
        The shrink factor compresses/dilates the kernel.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape (n_samples, n_features)
            Observations from which the samples will be created.

        class_indices : ndarray, shape (n_class_samples,)
            Indices of the samples in the target class.

        n_class_samples : int
            The number of samples to generate for the class.

        h_shrink : float
            The shrink factor applied to the kernel bandwidth.

        Returns
        -------
        X_new : {ndarray, sparse matrix}, shape (n_class_samples, n_features)
            Synthetically generated samples.
        """
        number_of_features = X.shape[1]
        random_state = check_random_state(self.random_state)

        # Bootstrap step: draw the kernel centres with replacement.
        samples_indices = random_state.choice(
            class_indices, size=n_class_samples, replace=True)

        # AMISE-optimal constant for a multivariate Gaussian kernel.
        minimize_amise = (4 / ((number_of_features + 2) * len(
            class_indices))) ** (1 / (number_of_features + 4))

        # Diagonal smoothing matrix built from the per-feature
        # standard deviations of the class.
        if sparse.issparse(X):
            variances = np.diagflat(
                np.std(X[class_indices, :].toarray(), axis=0, ddof=1))
        else:
            variances = np.diagflat(
                np.std(X[class_indices, :], axis=0, ddof=1))
        h_opt = h_shrink * minimize_amise * variances

        # Sample from the kernel density estimate: each new point is
        # a centre plus bandwidth-scaled standard-normal noise.
        randoms = random_state.standard_normal(size=(n_class_samples,
                                                     number_of_features))
        Xrose = np.matmul(randoms, h_opt) + X[samples_indices, :]
        if sparse.issparse(X):
            return sparse.csr_matrix(Xrose)
        return Xrose

    def _fit_resample(self, X, y):
        X_resampled = X.copy()
        y_resampled = y.copy()

        # Default to a neutral shrink factor of 1 for every class,
        # without mutating the estimator's parameters.
        shrink_factors = self.shrink_factors
        if shrink_factors is None:
            shrink_factors = {key: 1 for key in self.sampling_strategy_}

        for class_sample, n_samples in self.sampling_strategy_.items():
            class_indices = np.flatnonzero(y == class_sample)
            X_new = self._make_samples(
                X, class_indices, n_samples,
                shrink_factors[class_sample])
            y_new = np.full(n_samples, class_sample, dtype=y.dtype)

            if sparse.issparse(X_new):
                X_resampled = sparse.vstack([X_resampled, X_new])
            else:
                X_resampled = np.concatenate((X_resampled, X_new))

            y_resampled = np.hstack((y_resampled, y_new))

        return X_resampled.astype(X.dtype), y_resampled.astype(y.dtype)
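A property of this implementation worth noting, demonstrated in the hedged sketch below (not part of the commit): a shrink factor of 0 zeroes the kernel bandwidth, so ROSE degenerates to plain random over-sampling with replacement.

import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import ROSE

# Binary toy data; class 1 is the minority (50 of 500 samples).
X, y = make_classification(weights=[0.9, 0.1], n_samples=500,
                           flip_y=0, random_state=0)

# With h_shrink = 0, h_opt is the zero matrix, so every "synthetic"
# row is an exact copy of an original minority row.
r = ROSE(shrink_factors={1: 0.0}, random_state=0)
X_res, y_res = r.fit_resample(X, y)

new_rows = X_res[len(X):]      # samples appended by ROSE
minority = X[y == 1]
assert all(any(np.allclose(row, m) for m in minority)
           for row in new_rows)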

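Since `sampling_strategy` also accepts per-class target counts, it combines naturally with per-class `shrink_factors`. Another illustrative sketch (the counts and factors are arbitrary choices, not from the commit); note that a user-supplied `shrink_factors` dict must cover every class being resampled:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ROSE

X, y = make_classification(n_classes=3, weights=[0.1, 0.7, 0.2],
                           n_informative=3, n_samples=2000,
                           random_state=10)

# Grow class 0 to 600 samples and class 2 to 800, leaving the
# majority class untouched; class 2 gets a narrower kernel.
r = ROSE(sampling_strategy={0: 600, 2: 800},
         shrink_factors={0: 1.0, 2: 0.5},
         random_state=0)
X_res, y_res = r.fit_resample(X, y)
print(Counter(y_res))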