[ENH] Preprocess: implement Select Relevant Feature's percentile #3588
janezd merged 7 commits into biolab:master
Orange/preprocess/fss.py
Outdated
if isinstance(self.k, float):
    idx_attr = np.ceil(self.k * n_attrs).astype(int)
    # edge case: 0th percentile would result in selection of `(n_attrs + 1)` attrs
    self.k = min(n_attrs - idx_attr + 1, n_attrs)
__call__ shouldn't change the state of the object (except when caching, for instance). In this case, imagine you have
fss = SelectBestFeatures(k=0.5)
data1 = <some data set with 10 attributes>
data2 = <some data set with 100 attributes>
fss(data1) # this sets fss.k to 5!
fss(data2) # and so this selects just 5 instead of 50 attributes
One option is to introduce
    effective_k = min(n_attrs - idx_attr + 1, n_attrs)
else:
    effective_k = self.k
and replace further occurrences of self.k in this method with effective_k.
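A minimal sketch of the suggestion above, using only the standard library. `PercentileSelector` is a hypothetical stand-in rather than Orange's actual class, and it takes a plain list of scores instead of a data table; only the `effective_k` logic follows the lines quoted in this review.

```python
import math


class PercentileSelector:
    """Hypothetical stand-in for a feature selector; not Orange's actual class."""

    def __init__(self, k):
        # int: number of features to keep; float: percentile given as a fraction
        self.k = k

    def __call__(self, scores):
        n_attrs = len(scores)
        if isinstance(self.k, float):
            idx_attr = math.ceil(self.k * n_attrs)
            # edge case: the 0th percentile would otherwise select n_attrs + 1
            effective_k = min(n_attrs - idx_attr + 1, n_attrs)
        else:
            effective_k = self.k
        # rank feature indices by score, best first, and keep the top effective_k
        ranked = sorted(range(n_attrs), key=lambda i: scores[i], reverse=True)
        return ranked[:effective_k]


fss = PercentileSelector(k=0.5)
small = fss(list(range(10)))    # 10 features
large = fss(list(range(100)))   # 100 features
```

Because the computed value lives in a local variable, `fss.k` stays at 0.5 between calls, and the number of selected features scales with each data set instead of being frozen by the first call.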
Good catch, I didn't think about that.
I've fixed it (and did the same for the selection of random features, since it had the same problem) and added a test for this.
…call__ does not change state of the object
Codecov Report
@@ Coverage Diff @@
## master #3588 +/- ##
=========================================
Coverage ? 83.98%
=========================================
Files ? 370
Lines ? 66981
Branches ? 0
=========================================
Hits ? 56253
Misses ? 10728
Partials ? 0
Codecov Report
@@ Coverage Diff @@
## master #3588 +/- ##
==========================================
+ Coverage 84.02% 84.06% +0.04%
==========================================
Files 370 370
Lines 67232 67252 +20
==========================================
+ Hits 56489 56538 +49
+ Misses 10743 10714 -29
I added another commit to switch from percentiles to proportion, so that |
fee790e to 326ea19
I changed percentiles to proportions, so that |
1e687c2 to 8407200
8407200 to 4b2599c
Issue
Fixes #3226.
Description of changes
Implements the "Percentile" option for Select Relevant Features preprocessor.
The option selects all features that rank at or above the entered percentile. The rank of the first feature to take is computed using the nearest-rank method. There are two special cases: the 0th percentile selects all features, and the 100th percentile selects only the best feature.
I've also renamed "Strategy" to "Number of features" in Select Random Features and Select Relevant Features, as suggested in the original issue.
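The nearest-rank rule and its two special cases can be sketched as below. This is an illustration under stated assumptions, not the code merged in this PR: `n_features_at_percentile` is a hypothetical helper, and it assumes the percentile is given on a 0 to 100 scale.

```python
import math


def n_features_at_percentile(percentile, n_attrs):
    """Hypothetical helper: how many top-ranked features to keep."""
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    # nearest-rank method: 1-based rank (worst to best) of the first feature taken
    rank = math.ceil(percentile * n_attrs / 100)
    # keep that feature and everything ranked above it; the cap handles the
    # 0th percentile, where rank is 0 and the count would be n_attrs + 1
    return min(n_attrs - rank + 1, n_attrs)


all_features = n_features_at_percentile(0, 10)     # 0th percentile
best_only = n_features_at_percentile(100, 10)      # 100th percentile
upper_part = n_features_at_percentile(90, 10)
```

With 10 features, the 0th percentile keeps all 10, the 100th keeps only the best one, and the 90th keeps the top 2.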
Includes