
Investigate whether we should update the default missing_value_replacement strategy to be 'random' (instead of 'mean') #943

@npatki


Background

Many of our transformers have a parameter called missing_value_replacement that controls how missing values are replaced during the forward transform phase. Options for this parameter include:

  • 'mean' -- aka replace all missing values with the average value, as learned during fit
  • 'random' -- aka replace missing values with a random value chosen between the [min, max] range, as learned during fit
  • None -- aka do not replace missing values at all
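The three options above can be sketched in plain Python. This is only an illustration of the behavior described in the list, not RDT's actual implementation: the helper names `fit_stats` and `replace_missing` are hypothetical, and uniform sampling for 'random' is an assumption.

```python
import random
from statistics import mean


def fit_stats(values):
    """Learn the mean, min and max from the non-missing values (the 'fit' step)."""
    observed = [v for v in values if v is not None]
    return {'mean': mean(observed), 'min': min(observed), 'max': max(observed)}


def replace_missing(values, stats, strategy='mean', seed=None):
    """Replace None entries according to the chosen strategy (the forward transform)."""
    if strategy is None:
        # Do not replace missing values at all.
        return list(values)
    if strategy == 'mean':
        # Every missing value becomes the same learned average.
        return [stats['mean'] if v is None else v for v in values]
    if strategy == 'random':
        # Assumption: each missing value is drawn uniformly from [min, max].
        rng = random.Random(seed)
        return [
            rng.uniform(stats['min'], stats['max']) if v is None else v
            for v in values
        ]
    raise ValueError(f'Unknown strategy: {strategy!r}')


column = [1.0, None, 3.0, None, 8.0]
stats = fit_stats(column)

mean_filled = replace_missing(column, stats, 'mean')
random_filled = replace_missing(column, stats, 'random', seed=0)
untouched = replace_missing(column, stats, None)
```

With 'mean', both missing entries receive the same value; with 'random', each one gets its own draw from the learned range.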

When we first created the RDT library, the default was 'mean', as this is a common practice in many data science algorithms.

I'm filing this issue as a follow-up to #730, to investigate whether 'mean' or 'random' is the better default strategy.

Note that regardless of what the default is, you can always apply any strategy you want by updating the parameter.

Description

We should compare the quality of the synthetic data (particularly the missing values, but also the overall column shapes/correlations) for the two cases:

  • (control) All transformers use 'mean' as the missing value replacement strategy
  • (experiment) All transformers use 'random' as the missing value replacement strategy

We should also see how the quality is correlated with the overall % of missing values present in the real data.

Hypothesis: If a high % of values are missing, then the 'mean' strategy will NOT work as well because we are replacing all the missing values with a single, static value. This leads to a very peaky distribution that may not be easy to model and create synthetic data for. Otherwise, the 'mean' strategy will probably be fine.

Note that even the 'random' strategy affects the distribution, because it makes the data more uniform (by randomly choosing values within the [min, max] range). However, the resulting distribution is probably easier to model and create synthetic data for.
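To make the "peaky" concern concrete, here is a rough simulation on made-up Gaussian data. It only measures how much of the column collapses onto one value after replacement; it says nothing about synthetic data quality, which still needs the real experiment. The helpers `peak_share` and `simulate`, the data parameters, and the uniform sampling for 'random' are all assumptions for illustration.

```python
import random
from collections import Counter


def peak_share(values):
    """Fraction of rows taken up by the single most frequent value."""
    return Counter(values).most_common(1)[0][1] / len(values)


def simulate(missing_rate, n_rows=2000, seed=0):
    """Return (peak share with 'mean', peak share with 'random') for one missing rate."""
    rng = random.Random(seed)
    real = [rng.gauss(50.0, 10.0) for _ in range(n_rows)]
    is_missing = [rng.random() < missing_rate for _ in range(n_rows)]

    # Statistics are learned from the observed (non-missing) values only.
    observed = [v for v, m in zip(real, is_missing) if not m]
    mean_value = sum(observed) / len(observed)
    low, high = min(observed), max(observed)

    mean_filled = [mean_value if m else v for v, m in zip(real, is_missing)]
    random_filled = [
        rng.uniform(low, high) if m else v for v, m in zip(real, is_missing)
    ]
    return peak_share(mean_filled), peak_share(random_filled)


results = {rate: simulate(rate) for rate in (0.05, 0.25, 0.50)}
for rate, (mean_peak, random_peak) in results.items():
    print(
        f'missing={rate:.0%}  '
        f'peak share with mean={mean_peak:.3f}  with random={random_peak:.3f}'
    )
```

Under 'mean', the share of rows sitting on a single value tracks the missing rate, while under 'random' it stays near zero. The 'random' column instead gains a uniform component spread across the [min, max] range.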

Labels: question (General question about the software)