ClusterBasedNormalizer should only select the minimum number of required components #700

@fealho

Problem Description

The ClusterBasedNormalizer usually ends up using the maximum number of clusters allowed, even when fewer clusters would be sufficient to represent the data. This hurts the performance of CTGAN, so ideally the transformer would select only as many components as necessary.

Investigation

There are three values that can be tweaked to improve the component selection process:

  • weight_threshold: this attribute controls which components are selected in the line below. However, the threshold is usually too small to properly filter the components, so it should either be increased, removed, or detected automatically based on the data.
    self.valid_component_indicator = self._bgm_transformer.weights_ > self.weight_threshold
  • weight_concentration_prior: it's not obvious that this parameter helps achieve our goal at all. If it doesn't, it should be removed.
  • max_clusters: the default value of 10 is frequently higher than what the dataset actually needs. If we cannot find a good value for weight_threshold, perhaps we can detect max_clusters automatically instead (in which case we can remove the entire logic for valid_component_indicator).
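
To make the three knobs above concrete, here is a minimal sketch of how they interact. It assumes the transformer wraps scikit-learn's `BayesianGaussianMixture` with a Dirichlet-process prior; the helper `count_valid_components`, the synthetic two-mode column, and the values 0.005 and 0.001 are illustrative assumptions, not taken from the actual implementation.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture


def count_valid_components(data, max_clusters=10, weight_threshold=0.005,
                           weight_concentration_prior=0.001, random_state=0):
    """Fit a Bayesian GMM and count the components whose weight exceeds the threshold.

    Hypothetical helper for illustration; the default threshold and prior
    values are assumptions, not confirmed library defaults.
    """
    bgm = BayesianGaussianMixture(
        n_components=max_clusters,
        weight_concentration_prior_type="dirichlet_process",
        weight_concentration_prior=weight_concentration_prior,
        n_init=1,
        random_state=random_state,
    )
    # scikit-learn expects a 2D array of shape (n_samples, n_features).
    bgm.fit(np.asarray(data, dtype=float).reshape(-1, 1))

    # Same comparison as the valid_component_indicator line quoted above.
    valid = bgm.weights_ > weight_threshold
    return int(valid.sum()), bgm.weights_


# Synthetic column with two well-separated modes (illustrative data only).
rng = np.random.default_rng(0)
column = np.concatenate([rng.normal(-5.0, 1.0, 500), rng.normal(5.0, 1.0, 500)])

for threshold in (0.005, 0.05, 0.1):
    n_valid, weights = count_valid_components(column, weight_threshold=threshold)
    print(f"weight_threshold={threshold}: {n_valid} of {len(weights)} components kept")
```

Running a sweep like this on representative columns is one way to check whether a larger weight_threshold, a different weight_concentration_prior, or a smaller max_clusters actually reduces the number of selected components.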

Additional Notes

Ensure that CTGAN works well with these changes and that it still works for any type of dataset. If it is not possible to find a strict improvement over the current implementation, then it may be best to leave the code as is.
