[ENH] Preprocess: Add filtering by missing values #4266
lanzagar merged 4 commits into biolab:master
d70df3e to 3a2b007
Codecov Report
@@            Coverage Diff             @@
##           master    #4266      +/-   ##
==========================================
+ Coverage   86.05%   86.82%   +0.76%
==========================================
  Files         394      396       +2
  Lines       70228    71622    +1394
==========================================
+ Hits        60435    62185    +1750
+ Misses       9793     9437     -356
f72f29c to ce493b0
Orange/widgets/data/owpreprocess.py
Outdated
class RemoveSparseEditor(BaseEditor):

    options = ["Nan's", "0's"]
NaNs and 0s, no apostrophe. I prefer missing instead of NaN, but that's ok, too. 0 should probably be written with a word, 'zeros'.
I think we should use "missing". Muggles won't understand NaN.
Other preprocessors have a vertical layout, so I suggest placing the two filtering options vertically instead of horizontally.
1b930a3 to e565f56
Orange/preprocess/preprocess.py
Outdated
Minimal proportion of non-zero entries of a feature
threshold: int or float
    if >= 1, the argument represents the allowed number of 0s or NaNs;
    if below 0, it represents the allowed proportion of 0s or NaNs
Orange/preprocess/preprocess.py
Outdated
    if >= 1, the argument represents the allowed number of 0s or NaNs;
    if below 0, it represents the allowed proportion of 0s or NaNs
filter0: bool
    if True (default), preprocessor counts 0s, otherwise NoNs
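The threshold semantics in this docstring can be illustrated with a small plain-Python sketch. It is not the code from the PR: the function name `keep_columns` and the list-of-rows input are hypothetical, and it assumes that a threshold below 1 is meant as a proportion of rows (the docstring says "below 0", which is presumably a slip), while a threshold of 1 or more is an absolute count. A column is dropped only when its count exceeds the allowed amount.

```python
import math


def keep_columns(rows, threshold, filter0=True):
    """Return indices of columns whose number of zeros (filter0=True)
    or missing values (filter0=False) does not exceed the threshold.

    Hypothetical helper for illustration only.
    """
    n_rows = len(rows)
    # Assumption: a threshold below 1 is a proportion of rows,
    # anything else is an absolute count.
    allowed = threshold * n_rows if threshold < 1 else threshold

    if filter0:
        def is_hit(value):
            return value == 0
    else:
        def is_hit(value):
            return isinstance(value, float) and math.isnan(value)

    kept = []
    for j in range(len(rows[0])):
        hits = sum(1 for row in rows if is_hit(row[j]))
        if hits <= allowed:  # drop only columns with more hits than allowed
            kept.append(j)
    return kept


nan = float("nan")
data = [
    [1.0, 0.0, nan],
    [2.0, 0.0, 3.0],
    [0.0, 0.0, nan],
    [4.0, 5.0, nan],
]

# Proportion: allow zeros in at most half of the rows.
print(keep_columns(data, 0.5))
# Count: allow at most one missing value per column.
print(keep_columns(data, 1, filter0=False))
```

With this toy data, the second column has three zeros and the third column has three missing values, so each call drops a different column.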
Orange/preprocess/preprocess.py
Outdated
"""
Remove sparse features. Sparseness is determined according to
user-defined treshold.
Filter out the features with too many nan's or 0. Threshold is user defined.
Filter out features with too many (>threshold) zeros or missing values.
Orange/preprocess/preprocess.py
Outdated
"""

- def __init__(self, threshold=0.05):
+ def __init__(self, threshold=5, filter0=True):
I suggest leaving the default value at threshold=0.05 so it remains backwards compatible.
(This does not affect the widget, which always calls it with an explicit threshold anyway.)
e565f56 to 5cc1a3b
5cc1a3b to bb801df
Description of changes
I extended the Filter sparse features preprocessor with filtering of columns by NaNs. We discussed having three options: filter by 0, by NaNs, or by both. I chose not to implement the third option, since the order of operations matters here; the user can simply use this preprocessor twice and have complete control that way.
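As a rough illustration of the "use it twice" approach, the sketch below chains two passes, one counting zeros and one counting missing values, each with its own limit. The helper `drop_columns`, its parameters, and the dict-of-columns input are hypothetical stand-ins written in plain Python; they are not the preprocessor's actual interface.

```python
import math


def drop_columns(columns, max_count, count_zeros=True):
    """Keep columns with at most max_count zeros (count_zeros=True)
    or at most max_count missing values (count_zeros=False).

    Hypothetical helper for illustration only.
    """
    def hits(values):
        if count_zeros:
            return sum(1 for v in values if v == 0)
        return sum(1 for v in values if math.isnan(v))

    return {name: vals for name, vals in columns.items()
            if hits(vals) <= max_count}


nan = float("nan")
table = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [0.0, 0.0, 0.0, 1.0],   # three zeros
    "c": [nan, nan, 1.0, 2.0],   # two missing values
    "d": [0.0, nan, 1.0, 2.0],   # one zero, one missing value
}

# First pass: filter by zeros; second pass: filter by missing values.
step1 = drop_columns(table, max_count=1, count_zeros=True)
step2 = drop_columns(step1, max_count=1, count_zeros=False)
print(sorted(step1))
print(sorted(step2))
```

Each pass takes its own limit, so the user can be strict about zeros and lenient about missing values, or the reverse, without a combined third option.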
Includes