Commit eede039
Author: Guillaume Lemaitre
Update the version and the README file
1 parent 8303f1a commit eede039

2 files changed: 43 additions, 25 deletions


README.md

Lines changed: 42 additions & 24 deletions
@@ -18,7 +18,9 @@ Installation
 
 ### Dependencies
 
-* scipy(>=0.17.1)
+UnbalancedDataset is tested to work under Python 2.7 and Python 3.5.
+
+* scipy(>=0.17.0)
 * numpy(>=1.10.4)
 * scikit-learn(>=0.17.1)
 
@@ -36,6 +38,14 @@ copy from Github and install all dependencies:
 cd UnbalancedDataset
 python setup.py install
 
+### Testing
+
+After installation, you can use `nose` to run the test suite:
+
+```
+make coverage
+```
+
 About
 =====
 
@@ -46,45 +56,53 @@ One way of addresing this issue is by re-sampling the dataset as to offset this
 Re-sampling techniques are divided in two categories:
 1. Under-sampling the majority class(es).
 2. Over-sampling the minority class.
+3. Combining over- and under-sampling.
+4. Create ensemble balanced sets.
 
 Bellow is a list of the methods currently implemented in this module.
 
 * Under-sampling
 1. Random majority under-sampling with replacement
-2. Extraction of majority-minority Tomek links
+2. [Extraction of majority-minority Tomek links][1]
 3. Under-sampling with Cluster Centroids
-4. NearMiss-(1 & 2 & 3)
-5. Condensend Nearest Neighbour
-6. One-Sided Selection
-7. Neighboorhood Cleaning Rule
+4. [NearMiss-(1 & 2 & 3)][2]
+5. [Condensend Nearest Neighbour][3]
+6. [One-Sided Selection][4]
+7. [Neighboorhood Cleaning Rule][5]
+8. [Edited Nearest Neighbours][6]
+9. [Instance Hardness Threshold][7]
 
 * Over-sampling
 1. Random minority over-sampling with replacement
-2. SMOTE - Synthetic Minority Over-sampling Technique
-3. bSMOTE(1&2) - Borderline SMOTE of types 1 and 2
-4. SVM_SMOTE - Support Vectors SMOTE
+2. [SMOTE - Synthetic Minority Over-sampling Technique][8]
+3. [bSMOTE(1 & 2) - Borderline SMOTE of types 1 and 2][9]
+4. [SVM SMOTE - Support Vectors SMOTE][10]
 
 * Over-sampling followed by under-sampling
-1. SMOTE + Tomek links
-2. SMOTE + ENN
+1. [SMOTE + Tomek links][12]
+2. [SMOTE + ENN][11]
 
 * Ensemble sampling
-1. EasyEnsemble
-2. BalanceCascade
+1. [EasyEnsemble][13]
+2. [BalanceCascade][13]
 
 The different algorithms are presented in the [following notebook](https://github.com/fmfn/UnbalancedDataset/blob/master/examples/plot_unbalanced_dataset.ipynb).
 
 This is a work in progress. Any comments, suggestions or corrections are welcome.
 
 References:
-
-1. NearMiss - ["kNN approach to unbalanced data distributions: A case study involving information extraction"](http://web0.site.uottawa.ca:4321/~nat/Workshop2003/jzhang.pdf), by Zhang et al., 2003.
-1. CNN - ["Addressing the Curse of Imbalanced Training Sets: One-Sided Selection"](http://sci2s.ugr.es/keel/pdf/algorithm/congreso/kubat97addressing.pdf), by Kubat et al., 1997.
-1. One-Sided Selection - ["Addressing the Curse of Imbalanced Training Sets: One-Sided Selection"](http://sci2s.ugr.es/keel/pdf/algorithm/congreso/kubat97addressing.pdf), by Kubat et al., 1997.
-1. NCL - ["Improving identification of difficult small classes by balancing class distribution"](http://sci2s.ugr.es/keel/pdf/algorithm/congreso/2001-Laurikkala-LNCS.pdf), by Laurikkala et al., 2001.
-1. SMOTE - ["SMOTE: synthetic minority over-sampling technique"](https://www.jair.org/media/953/live-953-2037-jair.pdf), by Chawla et al., 2002.
-1. Borderline SMOTE - ["Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning"](http://sci2s.ugr.es/keel/keel-dataset/pdfs/2005-Han-LNCS.pdf), by Han et al., 2005
-1. SVM_SMOTE - ["Borderline Over-sampling for Imbalanced Data Classification"](https://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CDAQFjABahUKEwjH7qqamr_HAhWLthoKHUr0BIo&url=http%3A%2F%2Fousar.lib.okayama-u.ac.jp%2Ffile%2F19617%2FIWCIA2009_A1005.pdf&ei=a7zZVYeNDIvtasrok9AI&usg=AFQjCNHoQ6oC_dH1M1IncBP0ZAaKj8a8Cw&sig2=lh32CHGjs5WBqxa_l0ylbg), Nguyen et al., 2011.
-1. SMOTE + Tomek - ["Balancing training data for automated annotation of keywords: a case study"](http://www.icmc.usp.br/~gbatista/files/wob2003.pdf), Batista et al., 2003.
-1. SMOTE + ENN - ["A study of the behavior of several methods for balancing machine learning training data"](http://www.sigkdd.org/sites/default/files/issues/6-1-2004-06/batista.pdf), Batista et al., 2004.
-1. EasyEnsemble & BalanceCascade - ["Exploratory Understanding for Class-Imbalance Learning"](http://cse.seu.edu.cn/people/xyliu/publication/tsmcb09.pdf), by Liu et al., 2009.
+===========
+
+[1]: I. Tomek, [“Two modifications of CNN,”](http://sci2s.ugr.es/keel/pdf/algorithm/articulo/1976-Tomek-IEEETSMC(2).pdf) In Systems, Man, and Cybernetics, IEEE Transactions on, vol. 6, pp 769-772, 2010.
+[2]: I. Mani, I. Zhang. [“kNN approach to unbalanced data distributions: a case study involving information extraction,”](http://web0.site.uottawa.ca:4321/~nat/Workshop2003/jzhang.pdf) In Proceedings of workshop on learning from imbalanced datasets, 2003.
+[3]: P. Hart, [“The condensed nearest neighbor rule,”](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1054155&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1054155) In Information Theory, IEEE Transactions on, vol. 14(3), pp. 515-516, 1968.
+[4]: M. Kubat, S. Matwin, [“Addressing the curse of imbalanced training sets: one-sided selection,”](http://sci2s.ugr.es/keel/pdf/algorithm/congreso/kubat97addressing.pdf) In ICML, vol. 97, pp. 179-186, 1997.
+[5]: J. Laurikkala, [“Improving identification of difficult small classes by balancing class distribution,”](http://sci2s.ugr.es/keel/pdf/algorithm/congreso/2001-Laurikkala-LNCS.pdf) Springer Berlin Heidelberg, 2001.
+[6]: D. Wilson, [“Asymptotic Properties of Nearest Neighbor Rules Using Edited Data,”](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4309137&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4309137) In IEEE Transactions on Systems, Man, and Cybernetrics, vol. 2 (3), pp. 408-421, 1972.
+[7]: D. Smith, Michael R., Tony Martinez, and Christophe Giraud-Carrier. [“An instance level analysis of data complexity.”](http://axon.cs.byu.edu/papers/smith.ml2013.pdf) Machine learning 95.2 (2014): 225-256.
+[8]: N. V. Chawla, K. W. Bowyer, L. O.Hall, W. P. Kegelmeyer, [“SMOTE: synthetic minority over-sampling technique,”](https://www.jair.org/media/953/live-953-2037-jair.pdf) Journal of artificial intelligence research, 321-357, 2002.
+[9]: H. Han, W. Wen-Yuan, M. Bing-Huan, [“Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning,”](http://sci2s.ugr.es/keel/keel-dataset/pdfs/2005-Han-LNCS.pdf) Advances in intelligent computing, 878-887, 2005.
+[10]: H. M. Nguyen, E. W. Cooper, K. Kamei, [“Borderline over-sampling for imbalanced data classification,”](https://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CDAQFjABahUKEwjH7qqamr_HAhWLthoKHUr0BIo&url=http%3A%2F%2Fousar.lib.okayama-u.ac.jp%2Ffile%2F19617%2FIWCIA2009_A1005.pdf&ei=a7zZVYeNDIvtasrok9AI&usg=AFQjCNHoQ6oC_dH1M1IncBP0ZAaKj8a8Cw&sig2=lh32CHGjs5WBqxa_l0ylbg) International Journal of Knowledge Engineering and Soft Data Paradigms, 3(1), pp.4-21, 2001.
+[11]: G. Batista, R. C. Prati, M. C. Monard. [“A study of the behavior of several methods for balancing machine learning training data,”](http://www.sigkdd.org/sites/default/files/issues/6-1-2004-06/batista.pdf) ACM Sigkdd Explorations Newsletter 6 (1), 20-29, 2004.
+[12]: G. Batista, B. Bazzan, M. Monard, [“Balancing Training Data for Automated Annotation of Keywords: a Case Study,”)[(http://www.icmc.usp.br/~gbatista/files/wob2003.pdf)] In WOB, 10-18, 2003.
+[13]: X. Y. Liu, J. Wu and Z. H. Zhou, [“Exploratory Undersampling for Class-Imbalance Learning,”](http://cse.seu.edu.cn/people/xyliu/publication/tsmcb09.pdf) in IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp. 539-550, April 2009.
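
Reviewer note: the README hunk above lists "Random majority under-sampling with replacement" without showing what it does. The following is a minimal standard-library sketch of that idea, not the package's implementation. The function name `random_under_sample` and its `seed` parameter are hypothetical and do not come from UnbalancedDataset; the sketch assumes every class larger than the smallest one is cut down to the smallest class size by drawing indices with replacement.

```python
import random
from collections import defaultdict


def random_under_sample(X, y, seed=0):
    """Hypothetical helper: balance classes by random under-sampling.

    Every class with more samples than the smallest class is reduced to
    the smallest class size by drawing indices WITH replacement, so the
    same sample may be picked more than once.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)

    n_min = min(len(indices) for indices in by_class.values())

    kept = []
    for indices in by_class.values():
        if len(indices) > n_min:
            # random.Random.choices samples with replacement.
            indices = rng.choices(indices, k=n_min)
        kept.extend(indices)

    return [X[i] for i in kept], [y[i] for i in kept]


# Toy data: 8 samples of class 0 and 2 samples of class 1.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_res, y_res = random_under_sample(X, y)
print(len(X_res), y_res.count(0), y_res.count(1))
```

After resampling, both classes contain the same number of samples (two each for the toy data). The smallest class is kept untouched, which is one possible design choice; a library may expose a target ratio instead.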

unbalanced_dataset/version.py

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@
 'required_at_installation': True,
 'install_info': _UNBALANCED_DATASET_INSTALL_MSG}),
 ('scipy', {
-'min_version': '0.17.1',
+'min_version': '0.17.0',
 'required_at_installation': True,
 'install_info': _UNBALANCED_DATASET_INSTALL_MSG}),
 ('sklearn', {
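
Reviewer note: the hunk lowers scipy's `min_version` from `'0.17.1'` to `'0.17.0'`, but the diff does not show how `version.py` consumes that value. As a hedged illustration only, a minimum-version check could compare versions as integer tuples rather than as strings. The helpers `version_tuple` and `meets_minimum` below are hypothetical names, not part of the package.

```python
import re


def version_tuple(version):
    """Turn '0.17.1' into (0, 17, 1).

    Non-numeric suffixes such as 'rc1' are ignored, and short versions
    are padded with zeros so that '0.17' compares equal to '0.17.0'.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, min_version):
    """Return True when the installed version is at least min_version."""
    return version_tuple(installed) >= version_tuple(min_version)


# With the old requirement, scipy 0.17.0 is rejected; with the new one
# it is accepted.
print(meets_minimum("0.17.0", "0.17.1"))  # False
print(meets_minimum("0.17.0", "0.17.0"))  # True
```

Comparing tuples avoids the string-comparison pitfall where `"0.9.0" >= "0.17.0"` evaluates to `True`.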
