Description
Background
Many of our transformers have a parameter called missing_value_replacement that controls how missing values are replaced during the forward transform phase. Options for this parameter include:
- 'mean': replace all missing values with the average value, as learned during fit
- 'random': replace each missing value with a random value chosen from the [min, max] range, as learned during fit
- None: do not replace missing values at all
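To make the three options concrete, here is a minimal standard-library sketch of what each strategy does to a numerical column. It is an illustration only, not RDT's implementation: the helper names `fit_missing_value_stats` and `replace_missing` are hypothetical, and drawing the 'random' values uniformly from [min, max] is an assumption based on the description above.

```python
import math
import random
import statistics


def _is_missing(value):
    """Treat both None and NaN as missing."""
    return value is None or math.isnan(value)


def fit_missing_value_stats(values):
    """Learn the mean, min and max from the non-missing values (the 'fit' step)."""
    observed = [v for v in values if not _is_missing(v)]
    return {
        "mean": statistics.fmean(observed),
        "min": min(observed),
        "max": max(observed),
    }


def replace_missing(values, stats, strategy, rng=None):
    """Apply one of the three strategies during the forward transform.

    Hypothetical helper: mirrors the option names in this issue only.
    """
    rng = rng or random.Random()
    if strategy is None:
        # Leave missing values untouched.
        return list(values)
    if strategy == "mean":
        # Every missing slot gets the same learned value.
        return [stats["mean"] if _is_missing(v) else v for v in values]
    if strategy == "random":
        # Every missing slot gets its own draw from the learned range.
        return [
            rng.uniform(stats["min"], stats["max"]) if _is_missing(v) else v
            for v in values
        ]
    raise ValueError(f"Unknown strategy: {strategy!r}")


column = [1.0, None, 3.0, float("nan"), 5.0]
stats = fit_missing_value_stats(column)
mean_filled = replace_missing(column, stats, "mean")
random_filled = replace_missing(column, stats, "random", rng=random.Random(0))
```

With 'mean', both missing slots receive the same value (3.0 here), while 'random' gives each slot a different value somewhere between 1.0 and 5.0.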
When we first created the RDT library, the default was 'mean', as this is a common practice in many data science algorithms.
- In "Replace missing values with variable (random) values from the dataset" #606, we added the 'random' strategy.
- In "Make the default missing value imputation 'mean'" #730, we decided to keep the default as 'mean' until we had more evidence that the 'random' strategy is the better default.
I'm filing this issue as a follow-up to #730, to investigate whether 'mean' or 'random' is the better default strategy.
Note that regardless of what the default is, you can always apply any strategy you want by updating the parameter.
Description
We should compare the quality of the synthetic data (particularly the missing values, but also the overall column shapes/correlations) for the two cases:
- (control) All transformers use 'mean' as the missing value replacement strategy
- (experiment) All transformers use 'random' as the missing value replacement strategy
We should also see how the quality is correlated with the overall % of missing values present in the real data.
Hypothesis: If a high % of values are missing, then the 'mean' strategy will NOT work as well, because we are replacing all the missing values with a single, static value. This leads to a very peaky distribution that may not be easy to model and create synthetic data for. Otherwise, the 'mean' strategy will probably be fine.
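The "peaky" effect in this hypothesis can be illustrated with a small simulation. The sketch below is standard-library only and uses assumed toy data (a Gaussian column with values dropped at random); the function name `spike_share_after_mean_imputation` is hypothetical. It measures what share of the transformed column sits on the single most common value after mean replacement.

```python
import random
import statistics
from collections import Counter


def spike_share_after_mean_imputation(missing_fraction, n=10_000, seed=0):
    """Return the share of rows that land on the most common value after 'mean' fill."""
    rng = random.Random(seed)
    # Toy column: Gaussian values, each dropped with probability `missing_fraction`.
    column = [
        None if rng.random() < missing_fraction else rng.gauss(50.0, 10.0)
        for _ in range(n)
    ]
    observed = [v for v in column if v is not None]
    mean = statistics.fmean(observed)
    # 'mean' strategy: every missing slot gets the same learned value.
    imputed = [mean if v is None else v for v in column]
    _, count = Counter(imputed).most_common(1)[0]
    return count / n


low_missing = spike_share_after_mean_imputation(0.05)
high_missing = spike_share_after_mean_imputation(0.50)
```

Because the observed values are continuous, the spike at the mean holds roughly the same share of rows as the missing fraction, so a column with 50% missing values ends up with about half of its mass on one point.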
Note that even the 'random' strategy does affect the distribution because it essentially makes the data more uniform (by randomly choosing between the [min, max] values). However, the output distribution is probably easier to model and create synthetic data for.
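A similar toy simulation shows how the 'random' strategy reshapes the column. This is a sketch under the same assumptions as the note above (a Gaussian column, uniform draws from the learned [min, max] range), with a hypothetical function name. It compares the standard deviation of the observed values with that of the filled column.

```python
import random
import statistics


def spread_before_and_after_random_fill(missing_fraction, n=10_000, seed=0):
    """Return (stdev of observed values, stdev after 'random' fill)."""
    rng = random.Random(seed)
    # Toy column: Gaussian values, each dropped with probability `missing_fraction`.
    column = [
        None if rng.random() < missing_fraction else rng.gauss(50.0, 10.0)
        for _ in range(n)
    ]
    observed = [v for v in column if v is not None]
    low, high = min(observed), max(observed)
    # 'random' strategy: each missing slot gets its own uniform draw from [min, max].
    filled = [rng.uniform(low, high) if v is None else v for v in column]
    return statistics.stdev(observed), statistics.stdev(filled)


stdev_observed, stdev_filled = spread_before_and_after_random_fill(0.50)
```

The filled column has no single spike, but its spread grows because uniform draws put more mass in the tails than the original bell shape does.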