[ENH] Preprocess: implement Select Relevant Feature's percentile#3588

Merged
janezd merged 7 commits into biolab:master from matejklemen:preprocess-percentile on Feb 16, 2019

@matejklemen

Issue

Fixes #3226.

Description of changes

Implements the "Percentile" option for Select Relevant Features preprocessor.
The option selects all features that rank at or above the entered percentile. The rank of the first feature to be taken is computed using the nearest-rank method. There are two special cases: the 0th percentile is defined as all the features, and the 100th percentile is defined as the single best feature.
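
A minimal sketch of the nearest-rank rule described above. The helper name, the integer 0 to 100 input, and the ranking convention (1 = worst, `n_attrs` = best) are assumptions made for illustration; this is not the code from the PR.

```python
def n_features_for_percentile(percentile: int, n_attrs: int) -> int:
    """Hypothetical helper: count of top-ranked features at or above a percentile.

    Assumes `percentile` is an integer in [0, 100] and that features are
    ranked from 1 (worst) to n_attrs (best).
    """
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    # Nearest-rank: rank = ceil(percentile / 100 * n_attrs), done in integer
    # arithmetic to avoid floating-point surprises (0.7 * 10 is slightly > 7).
    rank = (percentile * n_attrs + 99) // 100
    # Features ranked rank..n_attrs are kept. The 0th percentile would give
    # n_attrs + 1 features, so the count is capped at n_attrs.
    return min(n_attrs - rank + 1, n_attrs)


counts = {p: n_features_for_percentile(p, 10) for p in (0, 50, 70, 100)}
```

With 10 features, the two special cases fall out of the same formula: the 0th percentile keeps all 10 and the 100th keeps only the best one.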

I've also renamed "Strategy" to "Number of features" in Select Random Features and Select Relevant Features as was suggested in the original issue.

Includes
  • Code changes
  • Tests
  • Documentation

        if isinstance(self.k, float):
            idx_attr = np.ceil(self.k * n_attrs).astype(int)
            # edge case: 0th percentile would result in selection of `(n_attrs + 1)` attrs
            self.k = min(n_attrs - idx_attr + 1, n_attrs)

@janezd janezd Feb 11, 2019

__call__ shouldn't change the state of the object (except, for instance, when caching). In this case, imagine you have

fss = SelectBestFeatures(k=0.5)
data1 = <some data set with 10 attributes>
data2 = <some data set with 100 attributes>
fss(data1)  # this sets fss.k to 5!
fss(data2)  # and so this selects just 5 instead of 50 attributes

One option is to introduce

            effective_k = min(n_attrs - idx_attr + 1, n_attrs)
        else:
            effective_k = self.k

and replace further occurrences of self.k in this method with effective_k.
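
The suggestion can be sketched as a self-contained toy class. `SelectBestSketch` and its list-of-scores input are hypothetical stand-ins, not Orange's `SelectBestFeatures`; only the `effective_k` branch mirrors the snippet quoted in this comment, with `math.ceil` used in place of `np.ceil` to keep the example dependency-free.

```python
import math


class SelectBestSketch:
    """Hypothetical selector that keeps the indices of the best-scored features."""

    def __init__(self, k):
        # int: number of features; float: percentile given as a fraction in [0, 1]
        self.k = k

    def __call__(self, scores):
        n_attrs = len(scores)
        if isinstance(self.k, float):
            idx_attr = math.ceil(self.k * n_attrs)
            # Local variable: self.k is read but never reassigned.
            effective_k = min(n_attrs - idx_attr + 1, n_attrs)
        else:
            effective_k = self.k
        # Indices of features sorted by descending score.
        ranked = sorted(range(n_attrs), key=lambda i: scores[i], reverse=True)
        return ranked[:effective_k]


fss = SelectBestSketch(k=0.5)
small = fss([float(i) for i in range(10)])    # data with 10 attributes
large = fss([float(i) for i in range(100)])   # data with 100 attributes
```

Because `fss.k` is still `0.5` after the first call, the second call scales with the larger data set. With the nearest-rank formula quoted above, the two calls keep 6 and 51 features, not the rounded 5 and 50 used in the illustration.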

Contributor Author

Good catch, I didn't think about that.

I've fixed it (and did the same for the selection of random features, since it had the same problem) and added a test for this.

@codecov

codecov bot commented Feb 12, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@2056481).
The diff coverage is 80%.

@@            Coverage Diff            @@
##             master    #3588   +/-   ##
=========================================
  Coverage          ?   83.98%           
=========================================
  Files             ?      370           
  Lines             ?    66981           
  Branches          ?        0           
=========================================
  Hits              ?    56253           
  Misses            ?    10728           
  Partials          ?        0

@codecov

codecov bot commented Feb 12, 2019

Codecov Report

Merging #3588 into master will increase coverage by 0.04%.
The diff coverage is 94.44%.

@@            Coverage Diff             @@
##           master    #3588      +/-   ##
==========================================
+ Coverage   84.02%   84.06%   +0.04%     
==========================================
  Files         370      370              
  Lines       67232    67252      +20     
==========================================
+ Hits        56489    56538      +49     
+ Misses      10743    10714      -29

@janezd janezd self-assigned this Feb 14, 2019
@janezd

janezd commented Feb 15, 2019

I added another commit to switch from percentiles to proportion, so that k (in the class) and spin boxes (in the widget) have parallel meanings, and that higher numbers always mean more attributes.

@janezd janezd force-pushed the preprocess-percentile branch from fee790e to 326ea19 Compare February 15, 2019 14:06
@janezd

janezd commented Feb 15, 2019

I changed percentiles to proportions, so that k (in the class) and combos (in the widget) have parallel meanings (more is always more).
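
A small sketch of why the switch makes "more is always more" hold. Both function names are hypothetical, and the rounding used for proportions (ceiling, with at least one feature kept) is an assumption; the merged code may round differently.

```python
import math


def count_from_percentile(p: float, n_attrs: int) -> int:
    # Earlier semantics: keep features at or above the p-th percentile,
    # so a larger p keeps fewer features.
    rank = math.ceil(p * n_attrs)
    return min(n_attrs - rank + 1, n_attrs)


def count_from_proportion(p: float, n_attrs: int) -> int:
    # Proportion semantics (rounding assumed): keep roughly p * n_attrs of the
    # best features, so a larger p keeps more features.
    return max(1, min(n_attrs, math.ceil(p * n_attrs)))


fractions = (0.25, 0.5, 0.75)
by_percentile = [count_from_percentile(p, 20) for p in fractions]
by_proportion = [count_from_proportion(p, 20) for p in fractions]
```

For 20 features the percentile counts shrink as the value grows, while the proportion counts grow with it, which matches how an integer `k` behaves.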

@janezd janezd force-pushed the preprocess-percentile branch 3 times, most recently from 1e687c2 to 8407200 Compare February 15, 2019 22:08
@janezd janezd force-pushed the preprocess-percentile branch from 8407200 to 4b2599c Compare February 15, 2019 22:50
@janezd janezd merged commit 9817fa8 into biolab:master Feb 16, 2019
@matejklemen matejklemen deleted the preprocess-percentile branch February 17, 2019 18:44