Investigate whether we should update the default `missing_value_replacement` strategy to be `'random'` (instead of `'mean'`) #943

### Background

Many of our transformers have a parameter called `missing_value_replacement` that controls how missing values are replaced during the **forward transform** phase. Options for this parameter include:

- `'mean'` -- aka replace all missing values with the average value, as learned during fit
- `'random'` -- aka replace missing values with a random value chosen between the [min, max] range, as learned during fit
- `None` -- aka do not replace missing values at all
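
To make the three options concrete, here is a minimal pure-Python sketch of what each one does to a single numeric column. The helper `replace_missing` is hypothetical and is not RDT's implementation. It assumes the `'random'` strategy draws uniformly from the fitted [min, max] range, which the description above does not specify.

```python
import math
import random
import statistics


def replace_missing(values, strategy="mean", seed=None):
    """Hypothetical stand-in for the three strategies (not RDT's code)."""
    # Statistics are "learned during fit" from the observed values only.
    observed = [v for v in values if not math.isnan(v)]

    if strategy is None:
        # Leave missing values untouched.
        return list(values)
    if strategy == "mean":
        fill = statistics.fmean(observed)
        return [fill if math.isnan(v) else v for v in values]
    if strategy == "random":
        # Assumption: a uniform draw between the fitted min and max.
        rng = random.Random(seed)
        low, high = min(observed), max(observed)
        return [rng.uniform(low, high) if math.isnan(v) else v for v in values]
    raise ValueError(f"Unknown strategy: {strategy!r}")


column = [1.0, float("nan"), 3.0, float("nan"), 8.0]
mean_filled = replace_missing(column, "mean")
random_filled = replace_missing(column, "random", seed=0)
untouched = replace_missing(column, None)
```

With `'mean'`, both missing entries receive the same value (the mean of 1.0, 3.0 and 8.0), while `'random'` gives each missing entry its own value within [1.0, 8.0].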

When we first created the RDT library, the default was `'mean'`, as this is a common practice in many data science algorithms. 

- In #606, we added the `'random'` strategy
- In #730, we decided to keep the default as `'mean'` until we had more evidence that the `'random'` strategy is the better default.

I'm filing this issue as a follow-up to #730, to investigate whether `'mean'` or `'random'` is the better default strategy.

Note that **regardless of what the default is, you can always apply any strategy you want by updating the parameter.**

### Description

We should compare the quality of the synthetic data (particularly the missing values, but also the overall column shapes/correlations) for the two cases:
- (control) All transformers use `'mean'` as the missing value replacement strategy
- (experiment) All transformers use `'random'` as the missing value replacement strategy

We should also see how the quality is correlated with the overall % of missing values present in the real data.

Hypothesis: If a high % of values are missing, then the `'mean'` strategy will NOT work as well because we are replacing all the missing values with a single, static value. This leads to a very peaky distribution that may not be easy to model and create synthetic data for. Otherwise, the `'mean'` strategy will probably be fine.

Note that the `'random'` strategy also affects the distribution, because it makes the data more uniform (by randomly choosing values within the [min, max] range). However, the resulting distribution is probably easier to model and create synthetic data for.
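
The "peaky distribution" concern in the hypothesis can be illustrated with a small simulation. This sketch uses only the standard library and does not measure synthetic data quality. It only shows what share of a column ends up at one identical value after replacement. The Gaussian column, the missing fractions, and the helper names `share_of_most_common` and `simulate` are all assumptions chosen for illustration.

```python
import random
import statistics
from collections import Counter


def share_of_most_common(values):
    """Fraction of rows that hold the single most frequent value."""
    count = Counter(values).most_common(1)[0][1]
    return count / len(values)


def simulate(missing_fraction, strategy, n=10_000, seed=0):
    """Fill a toy column with the given strategy and report its peakiness."""
    rng = random.Random(seed)
    real = [rng.gauss(50.0, 10.0) for _ in range(n)]  # toy "real" column
    is_missing = [rng.random() < missing_fraction for _ in range(n)]

    # Statistics are learned from the observed (non-missing) rows only.
    observed = [v for v, m in zip(real, is_missing) if not m]
    mean = statistics.fmean(observed)
    low, high = min(observed), max(observed)

    if strategy == "mean":
        filled = [mean if m else v for v, m in zip(real, is_missing)]
    else:
        # Assumption: 'random' is a uniform draw in the fitted [min, max] range.
        filled = [rng.uniform(low, high) if m else v
                  for v, m in zip(real, is_missing)]
    return share_of_most_common(filled)


for fraction in (0.05, 0.5):
    print(fraction, simulate(fraction, "mean"), simulate(fraction, "random"))
```

Under `'mean'`, the share of rows at a single value grows roughly in line with the missing fraction, whereas under `'random'` it stays near zero. Whether that difference translates into better synthetic data is what this issue proposes to investigate.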
