Add missing indicator instead of dropping columns above null fraction threshold #1714

@jeromedockes


Problem Description

In the Cleaner and TableVectorizer we can set drop_null_fraction: columns whose fraction of nulls exceeds it are dropped. Rather than dropping them, we could consider replacing each such column with a boolean column that indicates the positions of its nulls, as done by scikit-learn's MissingIndicator. This still achieves the goal of not wasting many feature dimensions on such columns, but it retains the information of whether the field was present or not, which is often informative. For example, for a column like "blood pressure medication" we may not want to spend many dimensions one-hot or string-encoding it, but we may still want to retain whether the patient is taking such a medication.

Feature Description

Instead of being dropped, columns with too many nulls would be replaced by their missingness indicator. (Columns that are entirely null should still be dropped, which will happen by default because we drop all constant columns and the missingness mask would be constant.)
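To make the proposal concrete, here is a minimal pandas sketch of the behavior described above. It is not skrub code: the helper name `replace_sparse_columns_with_indicator` and the `_missing` suffix are hypothetical, and the strict `>` comparison against the threshold is an assumption about how the cutoff would be applied.

```python
import pandas as pd


def replace_sparse_columns_with_indicator(df, drop_null_fraction=0.5):
    """Hypothetical helper (not part of skrub) sketching the proposed behavior."""
    out = {}
    for name in df.columns:
        is_null = df[name].isna()
        if is_null.all():
            # Entirely null: the indicator would be constant, so drop the column.
            continue
        if is_null.mean() > drop_null_fraction:
            # Too many nulls: keep only the boolean missingness mask.
            out[f"{name}_missing"] = is_null
        else:
            # Few enough nulls: keep the column unchanged.
            out[name] = df[name]
    return pd.DataFrame(out, index=df.index)


df = pd.DataFrame(
    {
        "age": [30, 41, None, 52],
        "bp_medication": [None, None, "lisinopril", None],
        "empty": [None, None, None, None],
    }
)
result = replace_sparse_columns_with_indicator(df, drop_null_fraction=0.5)
print(result)
```

In this toy table, `age` is kept as is, `bp_medication` is reduced to a single boolean column, and `empty` disappears.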

Alternative Solutions

Keep the current behavior (dropping); users can apply a missing indicator themselves where relevant.
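For reference, this is roughly what that alternative looks like on the user side with scikit-learn's `MissingIndicator`. The array is made-up illustrative data, and numeric values with `np.nan` are used to keep the sketch simple.

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Toy data: column 0 has no missing values, column 1 is mostly missing.
X = np.array(
    [
        [30.0, np.nan],
        [41.0, np.nan],
        [29.0, 5.0],
        [52.0, np.nan],
    ]
)

# "missing-only" produces a mask only for features that had missing values at fit time.
indicator = MissingIndicator(features="missing-only")
mask = indicator.fit_transform(X)

print(indicator.features_)  # indices of the columns that got an indicator
print(mask)
```

With this approach the user still has to decide which columns to pass to the indicator and drop the original sparse columns themselves, which is the step the proposal would automate.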

Additional Context

No response


Labels: discussion, enhancement
