Commit 80a74c1

- unified boruta_py and boruta_py_plus; boruta_py now has all previously added features of the latter (two-step multiple-testing correction, percentile threshold)
- default perc=100 equals choosing the maximum of the shadow importances, as in the original boruta.R
- added a two_step Boolean param; if set to False, a simple Bonferroni correction is performed, as in boruta.R
- replaced the nanrankdata method with _nanrankdata to decouple the bottleneck dependency
- included multipletests from statsmodels as _fdrcorrection, to decouple that dependency
- deleted _check_pandas; it adds an unnecessary dependency, as Mike said, and check_X_y returns a numpy array anyway
- updated __init__.py
- updated the class documentation and README.md to reflect these changes
- renamed the multi_alpha param to alpha
- renamed iter to _iter because iter is a Python built-in
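The two-step correction described above — Benjamini-Hochberg FDR within an iteration, then a Bonferroni-style division of alpha by the iteration index — can be sketched roughly as follows. The helper names below are illustrative, not the exact vendored `_fdrcorrection` code:

```python
import numpy as np

def fdr_reject(p_values, alpha):
    """Benjamini-Hochberg step-up procedure: returns a boolean reject mask."""
    p = np.asarray(p_values, dtype=float)
    n = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, n + 1) / n
    reject = np.zeros(n, dtype=bool)
    if below.any():
        # reject every hypothesis up to the largest rank that passes
        reject[order[:np.max(np.nonzero(below)[0]) + 1]] = True
    return reject

def two_step_reject(p_values, alpha, current_iter):
    """Step 1: FDR across the features tested in this iteration.
    Step 2: Bonferroni across iterations, dividing alpha by the iteration index.
    A feature counts as significant only if it survives both steps."""
    step1 = fdr_reject(p_values, alpha)
    step2 = np.asarray(p_values) <= alpha / current_iter
    return step1 & step2
```

In early iterations step 2 is mild, but as `current_iter` grows it progressively tightens the per-feature threshold, compensating for testing the same features repeatedly.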
1 parent 4f17e2d commit 80a74c1

File tree

5 files changed: +216 −643 lines


README.md

Lines changed: 33 additions & 47 deletions
```diff
@@ -8,10 +8,7 @@ This project hosts Python implementations of the [Boruta all-relevant feature se
 
 * numpy
 * scipy
-* bottleneck
 * scikit-learn
-* statsmodels
-
 
 ## How to use ##
 Download, import and do as you would with any other scikit-learn method:
@@ -46,7 +43,6 @@ by definition depends on your classifier choice).
 
 ### BorutaPy ###
 
-
 It is the original R package recoded in Python with a few added extra features.
 Some improvements include:
 
@@ -59,48 +55,48 @@ Some improvements include:
 * Automatic n_estimator selection
 
 * Ranking of features
-
+
 For more details, please check the top of the docstring.
 
 We highly recommend using pruned trees with a depth between 3-7.
 
-### BorutaPyPlus ###
-
-After playing around a lot with the original code I identified a few areas
-where the core algorithm could be improved. I basically ran lots of
-benchmarking tests on simulated datasets (scikit-learn's amazing
-make_classification was used for generating these).
+Also, after playing around a lot with the original code I identified a few areas
+where the core algorithm could be improved/altered to make it less strict and
+more applicable to biological data, where the Bonferroni correction might be
+overly harsh.
 
 __Percentile as threshold__
 The original method uses the maximum of the shadow features as a threshold in
 deciding which real feature is doing better than the shadow ones. This could be
-overly harsh.
+overly harsh.
 
-To control this, in the 2nd version the perc parameter sets the
-percentile of the shadow features' importances, the algorithm uses as the
-threshold. The default of 99 is close to taking the maximum, but as it's a
-percentile, it changes with the size of the dataset. With several thousands of
+To control this, I added the perc parameter, which sets the
+percentile of the shadow features' importances, the algorithm uses as the
+threshold. The default of 100 which is equivalent to taking the maximum as the
+R version of Boruta does, but it could be relaxed. Note, since this is the
+percentile, it changes with the size of the dataset. With several thousands of
 features it isn't as stringent as with a few dozens at the end of a Boruta run.
 
 
 __Two step correction for multiple testing__
-The correction for multiple testing was improved by making it a two step
+The correction for multiple testing was relaxed by making it a two step
 process, rather than a harsh one step Bonferroni correction.
 
-We need to correct firstly because in each iteration we test a number of
+We need to correct firstly because in each iteration we test a number of
 features against the null hypothesis (does a feature perform better than
-expected by random). For this the Bonferroni correction is used in the original
-code which is known to be too stringent in such scenarios, and also the
-original code corrects for n features, even if we are in the 50th iteration
-where we only have k<<n features left. For this reason the first step of
-correction is the widely used Benjamini Hochberg FDR.
-
+expected by random). For this the Bonferroni correction is used in the original
+code which is known to be too stringent in such scenarios (at least for
+biological data), and also the original code corrects for n features, even if
+we are in the 50th iteration where we only have k<<n features left. For this
+reason the first step of correction is the widely used Benjamini Hochberg FDR.
+
 Following that however we also need to account for the fact that we have been
 testing the same features over and over again in each iteration with the
 same test. For this scenario the Bonferroni is perfect, so it is applied by
 dividing the p-value threshold with the current iteration index.
-
-We highly recommend using pruned trees with a depth between 3-7.
+
+If this two step correction is not required, the two_step parameter has to be
+set to False, then (with perc=100) BorutaPy behaves exactly as the R version.
 
 * * *
 
@@ -119,32 +115,22 @@ __n_estimators__ : int or string, default = 1000
 > dataset. The other parameters of the used estimators need to be set
 > with initialisation.
 
-__multi_corr_method__ : string, default = 'bonferroni' - only in BorutaPy
->Method for correcting for multiple testing during the feature selection process. statsmodels' multiple test is used, so one of the following:
-
->* 'bonferroni' : one-step correction
->* 'sidak' : one-step correction
->* 'holm-sidak' : step down method using Sidak adjustments
->* 'holm' : step-down method using Bonferroni adjustments
->* 'simes-hochberg' : step-up method (independent)
->* 'hommel' : closed method based on Simes tests (non-negative)
->* 'fdr_bh' : Benjamini/Hochberg (non-negative)
->* 'fdr_by' : Benjamini/Yekutieli (negative)
->* 'fdr_tsbh' : two stage fdr correction (non-negative)
->* 'fdr_tsbky' : two stage fdr correction (non-negative)
-
-__perc__ : int, default = 99 - only in BorutaPy2
+__perc__ : int, default = 100
 > Instead of the max we use the percentile defined by the user, to pick
 > our threshold for comparison between shadow and real features. The max
 > tend to be too stringent. This provides a finer control over this. The
-> lower perc is the more false positives will be picked as relevant but
+> lower perc is the more false positives will be picked as relevant but
 > also the less relevant features will be left out. The usual trade-off.
+> The default is essentially the vanilla Boruta corresponding to the max.
 
-
-__multi_alpha__ : float, default = 0.05
+__alpha__ : float, default = 0.05
 > Level at which the corrected p-values will get rejected in both correction
 steps.
 
+__two_step__ : Boolean, default = True
+> If you want to use the original implementation of Boruta with Bonferroni
+> correction only set this to False.
+
 __max_iter__ : int, default = 100
 > The number of maximum iterations to perform.
 
@@ -177,10 +163,10 @@ __verbose__ : int, default=0
 
 import pandas as pd
 from sklearn.ensemble import RandomForestClassifier
-from boruta_py import boruta_py
+from boruta_py import BorutaPy
 
 # load X and y
-# NOTE BorutaPy accepts numpy arrays only, hence .values
+# NOTE BorutaPy accepts numpy arrays only, hence the .values attribute
 X = pd.read_csv('my_X_table.csv', index_col=0).values
 y = pd.read_csv('my_y_vector.csv', index_col=0).values
 
@@ -189,7 +175,7 @@ __verbose__ : int, default=0
 rf = RandomForestClassifier(n_jobs=-1, class_weight='auto', max_depth=5)
 
 # define Boruta feature selection method
-feat_selector = boruta_py.BorutaPy(rf, n_estimators='auto', verbose=2)
+feat_selector = BorutaPy(rf, n_estimators='auto', verbose=2)
 
 # find all relevant features
 feat_selector.fit(X, y)
```
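As the README diff notes, perc=100 reproduces the original max-of-shadow-importances rule, while lower values relax it. A quick numpy check (with made-up importance values) illustrates the difference:

```python
import numpy as np

# hypothetical shadow-feature importances from one Boruta iteration
shadow_imp = np.array([0.01, 0.02, 0.03, 0.08])

# perc=100 is identical to the original rule: threshold = max(shadow importances)
assert np.percentile(shadow_imp, 100) == shadow_imp.max()

# a lower percentile gives a softer threshold, so more real features can beat it
threshold = np.percentile(shadow_imp, 90)  # 0.065 with linear interpolation
```

With thousands of shadow features the 90th percentile sits close to the maximum; with only a few dozen it can be noticeably lower, which is why the stringency of perc changes over a Boruta run.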

boruta/__init__.py

Lines changed: 0 additions & 1 deletion
```diff
@@ -1,2 +1 @@
 from .boruta_py import BorutaPy
-from .boruta_py_plus import BorutaPyPlus
```
