@@ -8,10 +8,7 @@ This project hosts Python implementations of the [Boruta all-relevant feature se

* numpy
* scipy
- * bottleneck
* scikit-learn
- * statsmodels
-

## How to use ##
Download, import and do as you would with any other scikit-learn method:
@@ -46,7 +43,6 @@ by definition depends on your classifier choice).

### BorutaPy ###
-
It is the original R package recoded in Python with a few added extra features.
Some improvements include:
@@ -59,48 +55,48 @@ Some improvements include:

* Automatic n_estimator selection
* Ranking of features
-
+

For more details, please check the top of the docstring.

We highly recommend using pruned trees with a depth between 3-7.

- ### BorutaPyPlus ###
-
- After playing around a lot with the original code I identified a few areas
- where the core algorithm could be improved. I basically ran lots of
- benchmarking tests on simulated datasets (scikit-learn's amazing
- make_classification was used for generating these).
+ Also, after playing around a lot with the original code, I identified a few areas
+ where the core algorithm could be improved or altered to make it less strict and
+ more applicable to biological data, where the Bonferroni correction might be
+ overly harsh.
__Percentile as threshold__

The original method uses the maximum of the shadow features as a threshold in
deciding which real feature is doing better than the shadow ones. This could be
- overly harsh.
+ overly harsh.

- To control this, in the 2nd version the perc parameter sets the
- percentile of the shadow features' importances, the algorithm uses as the
- threshold. The default of 99 is close to taking the maximum, but as it's a
- percentile, it changes with the size of the dataset. With several thousands of
+ To control this, I added the perc parameter, which sets the percentile of
+ the shadow features' importances that the algorithm uses as the threshold.
+ The default of 100 is equivalent to taking the maximum, as the R version of
+ Boruta does, but it can be relaxed. Note that since this is a percentile, it
+ changes with the size of the dataset. With several thousands of
features it isn't as stringent as with a few dozens at the end of a Boruta run.
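To make the effect of perc concrete, here is a small numpy sketch with made-up importance values (illustrative only, not the package's internals):

```python
import numpy as np

# Hypothetical importances: 1000 shadow features and 3 real ones.
rng = np.random.RandomState(42)
shadow_imp = rng.uniform(0.0, 0.10, size=1000)  # importances of the shuffled copies
real_imp = np.array([0.02, 0.095, 0.25])        # importances of the real features

# perc=100 reproduces the vanilla/R rule: beat the best shadow feature.
max_threshold = np.percentile(shadow_imp, 100)  # same as shadow_imp.max()
# perc=90 relaxes the rule to the 90th percentile of shadow importances.
p90_threshold = np.percentile(shadow_imp, 90)

hits_max = real_imp > max_threshold  # only the strongest feature passes
hits_p90 = real_imp > p90_threshold  # the borderline feature now passes too
```

A lower perc lets more borderline features survive at the cost of more false positives, exactly the trade-off described above.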
__Two step correction for multiple testing__

- The correction for multiple testing was improved by making it a two step
+ The correction for multiple testing was relaxed by making it a two step
process, rather than a harsh one step Bonferroni correction.

- We need to correct firstly because in each iteration we test a number of
+ We need to correct firstly because in each iteration we test a number of
features against the null hypothesis (does a feature perform better than
- expected by random). For this the Bonferroni correction is used in the original
- code which is known to be too stringent in such scenarios, and also the
- original code corrects for n features, even if we are in the 50th iteration
- where we only have k<<n features left. For this reason the first step of
- correction is the widely used Benjamini Hochberg FDR.
-
+ expected by random). For this, the Bonferroni correction is used in the original
+ code, which is known to be too stringent in such scenarios (at least for
+ biological data), and the original code also corrects for n features, even if
+ we are in the 50th iteration where we only have k<<n features left. For this
+ reason, the first step of correction is the widely used Benjamini-Hochberg FDR.
+

Following that, however, we also need to account for the fact that we have been
testing the same features over and over again in each iteration with the
same test. For this scenario the Bonferroni is perfect, so it is applied by
dividing the p-value threshold by the current iteration index.
-
- We highly recommend using pruned trees with a depth between 3-7.
+
+ If this two step correction is not required, the two_step parameter can be
+ set to False; then (with perc=100) BorutaPy behaves exactly like the R version.
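The two step procedure can be sketched as follows (an illustrative reimplementation with hypothetical helper names and made-up p-values, not the package's actual code):

```python
import numpy as np

def fdr_bh_reject(pvals, alpha):
    """Step 1: Benjamini-Hochberg step-up FDR across the features tested in
    this iteration. Reject the k smallest p-values, where k is the largest
    rank with p_(k) <= (k / m) * alpha."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    passed = pvals[order] <= (np.arange(1, m + 1) / m) * alpha
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()
        reject[order[:k + 1]] = True
    return reject

def two_step_reject(pvals, alpha, iteration):
    step1 = fdr_bh_reject(pvals, alpha)
    # Step 2: Bonferroni across iterations - the same features are retested
    # every round, so the threshold is divided by the iteration index.
    step2 = np.asarray(pvals) <= alpha / iteration
    return step1 & step2

# Four features in iteration 5: the third survives BH FDR alone but fails
# the per-iteration Bonferroni bound (0.05 / 5 = 0.01).
rejected = two_step_reject([0.001, 0.010, 0.030, 0.500], alpha=0.05, iteration=5)
```

Only features passing both steps count as significant in that iteration.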
* * *
@@ -119,32 +115,22 @@ __n_estimators__ : int or string, default = 1000
> dataset. The other parameters of the used estimators need to be set
> with initialisation.

- __multi_corr_method__ : string, default = 'bonferroni' - only in BorutaPy
- > Method for correcting for multiple testing during the feature selection process. statsmodels' multiple test is used, so one of the following:
-
- > * 'bonferroni' : one-step correction
- > * 'sidak' : one-step correction
- > * 'holm-sidak' : step down method using Sidak adjustments
- > * 'holm' : step-down method using Bonferroni adjustments
- > * 'simes-hochberg' : step-up method (independent)
- > * 'hommel' : closed method based on Simes tests (non-negative)
- > * 'fdr_bh' : Benjamini/Hochberg (non-negative)
- > * 'fdr_by' : Benjamini/Yekutieli (negative)
- > * 'fdr_tsbh' : two stage fdr correction (non-negative)
- > * 'fdr_tsbky' : two stage fdr correction (non-negative)
-
- __perc__ : int, default = 99 - only in BorutaPy2
+ __perc__ : int, default = 100
> Instead of the max we use the percentile defined by the user, to pick
> our threshold for comparison between shadow and real features. The max
> tends to be too stringent. This provides finer control over this. The
- > lower perc is the more false positives will be picked as relevant but
+ > lower perc is, the more false positives will be picked as relevant, but
> also the less relevant features will be left out. The usual trade-off.
+ > The default is essentially the vanilla Boruta, corresponding to the max.

-
- __multi_alpha__ : float, default = 0.05
+ __alpha__ : float, default = 0.05
> Level at which the corrected p-values will get rejected in both correction
steps.

+ __two_step__ : Boolean, default = True
+ > If you want to use the original implementation of Boruta with Bonferroni
+ > correction only, set this to False.
+
__max_iter__ : int, default = 100
> The number of maximum iterations to perform.
@@ -177,10 +163,10 @@ __verbose__ : int, default=0

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
- from boruta_py import boruta_py
+ from boruta_py import BorutaPy

# load X and y
- # NOTE BorutaPy accepts numpy arrays only, hence .values
+ # NOTE BorutaPy accepts numpy arrays only, hence the .values attribute
X = pd.read_csv('my_X_table.csv', index_col=0).values
y = pd.read_csv('my_y_vector.csv', index_col=0).values

@@ -189,7 +175,7 @@ __verbose__ : int, default=0
rf = RandomForestClassifier(n_jobs=-1, class_weight='auto', max_depth=5)

# define Boruta feature selection method
- feat_selector = boruta_py.BorutaPy(rf, n_estimators='auto', verbose=2)
+ feat_selector = BorutaPy(rf, n_estimators='auto', verbose=2)

# find all relevant features
feat_selector.fit(X, y)
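For intuition about what happens inside fit, here is a minimal numpy-only sketch of a single Boruta iteration. The absolute Pearson correlation with the target is a hypothetical stand-in for the tree importances the package actually uses:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # only the first two features matter

# Shadow features: each column shuffled independently, destroying any
# relationship with y while preserving each feature's distribution.
X_shadow = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

def importance(M, y):
    # Stand-in importance for this sketch: |Pearson correlation| with y.
    Mc = M - M.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Mc.T @ yc) / (np.linalg.norm(Mc, axis=0) * np.linalg.norm(yc))

imp_real = importance(X, y)
imp_shadow = importance(X_shadow, y)

# perc=100: a real feature scores a hit only if it beats the best shadow.
hits = imp_real > np.percentile(imp_shadow, 100)
```

Over up to max_iter such rounds, the hit counts of the features are tested against the null hypothesis with the correction scheme described earlier, confirming or rejecting each feature.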