[ENH] Preprocess: Add filtering by missing values #4266

Merged
lanzagar merged 4 commits into biolab:master from AndrejaKovacic:filter_nans on Jan 10, 2020

AndrejaKovacic (Contributor) commented Dec 13, 2019

Description of changes

I extended the Filter sparse features preprocessor with filtering columns by NaNs. We spoke about having three options: filter by 0, by NaNs, or by both. I chose not to implement the third option, since the order of operations matters here, and the user can simply apply this preprocessor twice and have complete control that way.

Includes
  • Code changes
  • Tests
  • Documentation

codecov bot commented Dec 13, 2019

Codecov Report

Merging #4266 into master will increase coverage by 0.76%.
The diff coverage is 98.91%.

@@            Coverage Diff             @@
##           master    #4266      +/-   ##
==========================================
+ Coverage   86.05%   86.82%   +0.76%     
==========================================
  Files         394      396       +2     
  Lines       70228    71622    +1394     
==========================================
+ Hits        60435    62185    +1750     
+ Misses       9793     9437     -356


class RemoveSparseEditor(BaseEditor):

options = ["Nan's", "0's"]

Contributor comment:

NaNs and 0s, no apostrophe. I prefer missing instead of NaN, but that's ok, too. 0 should probably be written with a word, 'zeros'.

Contributor comment:

I think we should use "missing". Muggles won't understand NaN.

ajdapretnar (Contributor) commented:

Other preprocessors have a vertical layout, so I suggest placing the two filtering options vertically instead of horizontally.
Also, 'Select random features' and 'Select relevant features' have the threshold options as 'Fixed' and 'Percentage'. I think it would be nice to have the same layout across all preprocessors. That said, there's an option in the Text add-on where words are filtered by their absolute frequency if the input is an integer and by their relative frequency if the input is a float (e.g. 0.1 == 10%). Perhaps that is something we could have in preprocessors, too. Not sure about the user perspective here, though.
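
For illustration, the integer-versus-float convention mentioned above could look roughly like the sketch below. This is not the Text add-on's code; the function name `allowed_count` and its argument names are hypothetical, and the range check on floats is an assumption.

```python
def allowed_count(threshold, n_rows):
    """Translate a threshold into an absolute count (hypothetical helper).

    An int is taken as an absolute number of rows; a float is taken as a
    proportion of n_rows (e.g. 0.1 means 10%).
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(threshold, bool):
        raise TypeError("threshold must be an int or a float")
    if isinstance(threshold, int):
        return threshold
    if isinstance(threshold, float):
        # Assumption: a relative threshold must lie between 0 and 1.
        if not 0.0 <= threshold <= 1.0:
            raise ValueError("a float threshold must be between 0 and 1")
        return threshold * n_rows
    raise TypeError("threshold must be an int or a float")


absolute = allowed_count(5, 200)    # interpreted as 5 rows
relative = allowed_count(0.1, 200)  # interpreted as 10% of 200 rows
```

Dispatching on the type rather than on the value means that `1` (one row) and `1.0` (all rows) are different thresholds, which may or may not be obvious to users.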

janezd force-pushed the filter_nans branch 2 times, most recently from 1b930a3 to e565f56 on December 20, 2019, 14:18
Minimal proportion of non-zero entries of a feature
threshold: int or float
if >= 1, the argument represents the allowed number of 0s or NaNs;
if below 0, it represents the allowed proportion of 0s or NaNs

Contributor comment:

below 0 -> below 1

if >= 1, the argument represents the allowed number of 0s or NaNs;
if below 0, it represents the allowed proportion of 0s or NaNs
filter0: bool
if True (default), preprocessor counts 0s, otherwise NoNs

Contributor comment:

NoNs -> NaNs

"""
Remove sparse features. Sparseness is determined according to
user-defined treshold.
Filter out the features with too many nan's or 0. Threshold is user defined.

Contributor comment:

Filter out features with too many (>threshold) zeros or missing values.
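
As a rough sketch of the behaviour discussed in this thread (not the code from this PR), the following pure-Python function marks which columns survive. It assumes the corrected docstring semantics: a threshold of 1 or more is an allowed count, a threshold below 1 is an allowed proportion, and `filter0` selects between counting zeros and counting missing values. The function name `sparse_column_mask` and the sample data are hypothetical.

```python
import math


def sparse_column_mask(rows, threshold=0.05, filter0=True):
    """Return one boolean per column; True means the column is kept.

    Hypothetical stand-in for the preprocessor's logic, for illustration only.
    """
    n_rows = len(rows)
    if n_rows == 0:
        return []
    # Below 1: allowed proportion of rows. Otherwise: allowed absolute count.
    allowed = threshold * n_rows if threshold < 1 else threshold

    def is_counted(value):
        if filter0:
            return value == 0  # NaN == 0 is False, so NaNs are not counted
        return isinstance(value, float) and math.isnan(value)

    keep = []
    for column in zip(*rows):  # transpose rows into columns
        n_counted = sum(1 for value in column if is_counted(value))
        # A column is dropped only when it has more than the allowed number.
        keep.append(n_counted <= allowed)
    return keep


nan = float("nan")
data = [
    [1.0, 0.0, nan],
    [2.0, 0.0, nan],
    [0.0, 0.0, 3.0],
    [4.0, 5.0, nan],
]

keep_by_zeros = sparse_column_mask(data, threshold=0.5, filter0=True)
keep_by_missing = sparse_column_mask(data, threshold=0.5, filter0=False)
# Applying the filter twice, as the PR description suggests for the "both" case.
keep_by_both = [a and b for a, b in zip(keep_by_zeros, keep_by_missing)]
```

With this sample data, the second column has three zeros and the third has three missing values out of four rows, so each is dropped by the corresponding filter at a threshold of 0.5.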

"""

def __init__(self, threshold=0.05):
def __init__(self, threshold=5, filter0=True):

Contributor comment:

I suggest leaving the default value at threshold=0.05 so it remains backwards compatible.
(This does not impact the widget, as it always calls it with a defined threshold anyway.)

lanzagar changed the title from "[ENH] Add filtering by nans" to "[ENH] Preprocess: Add filtering by missing values" on Jan 10, 2020
lanzagar merged commit 27aabbf into biolab:master on Jan 10, 2020