Description
Problem Description
In the Cleaner and TableVectorizer we can set drop_null_fraction; columns with a larger fraction of nulls than this threshold are dropped. Rather than dropping them, we could consider replacing each such column with a boolean column that indicates the positions of nulls, as done by scikit-learn's MissingIndicator. This still achieves the goal of not wasting many feature dimensions on such columns, but it retains the information of whether the field was present, which is often informative. For example, for a column like "blood pressure medication" we may not want to spend many dimensions one-hot or string-encoding it, but we would still like to keep the information of whether the patient is taking such a medication.
Feature Description
Instead of being dropped, columns with too many nulls would be replaced by their missingness indicator. Columns that are entirely null should still be dropped; this will happen by default, because we drop all constant columns and the missingness mask of such a column is constant.
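To make the proposal concrete, here is a rough pandas sketch of the intended behavior. The helper name `replace_sparse_columns_with_indicator` and the `_missing` column suffix are made up for illustration and are not part of skrub; the threshold comparison (strictly greater than `drop_null_fraction`) is an assumption based on the description above.

```python
import pandas as pd


def replace_sparse_columns_with_indicator(df, drop_null_fraction=0.5):
    """Hypothetical helper (not skrub API) illustrating the proposal."""
    out = {}
    for name in df.columns:
        null_mask = df[name].isna()
        if null_mask.all():
            # Entirely null: the mask would be constant, so drop the column.
            continue
        if null_mask.mean() > drop_null_fraction:
            # Too many nulls: keep only the boolean missingness indicator.
            out[f"{name}_missing"] = null_mask
        else:
            # Few enough nulls: keep the column unchanged.
            out[name] = df[name]
    return pd.DataFrame(out, index=df.index)


df = pd.DataFrame(
    {
        "age": [34, 51, 47, 29],
        "blood_pressure_medication": [None, "lisinopril", None, None],
        "unused": [None, None, None, None],
    }
)
result = replace_sparse_columns_with_indicator(df, drop_null_fraction=0.5)
print(result)
```

In this toy example, "age" is kept as-is, "blood_pressure_medication" is reduced to a single boolean column, and "unused" is dropped.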
Alternative Solutions
Keep the current behavior (dropping), and let users apply the missing indicator themselves where relevant.
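For reference, a minimal sketch of what applying scikit-learn's MissingIndicator by hand looks like. The array is toy data, and `features="all"` is chosen here so that one mask column is returned per input column.

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Toy numeric data: the second column is mostly missing.
X = np.array(
    [
        [1.0, np.nan],
        [2.0, 3.0],
        [4.0, np.nan],
    ]
)

# features="all" returns a mask for every column, not only those with
# missing values seen during fit.
indicator = MissingIndicator(features="all")
mask = indicator.fit_transform(X)
print(mask)
```

The user would then have to select the sparse columns and combine the mask with the rest of the table, which is the manual step the feature would automate.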
Additional Context
No response