Add a "mixed data" transformer to RDT, for use in CTGAN. #955

@wilcovanvorstenbosch

I've come across the paper for CTAB-GAN, which implements a transformer I would very much like to see in RDT, to use with SDV's synthesizers.

They created a mixed-type encoder to deal with continuous variables that have some categorical property.
I've been running into this issue myself. When dealing with loan data, the amount of outstanding debt is treated as a continuous variable by CTGAN, but this approach misses some of the nuances.

In many cases, the outstanding debt is 0. Exactly 0. I've found that CTGAN, using the FloatFormatter, has a hard time grasping this idea. The synthesized data will have lots of values close to zero, but not exactly zero. Post-processing would be an option, but I feel like this does not solve the underlying problem. Plausibly, the occurrence of such mixed variables makes the task very easy for the discriminator and difficult for the generator, because "exactly 0" in some columns could become an easy-to-spot characteristic of the real data.
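
To make the idea concrete, here is a minimal sketch of what I mean by a mixed encoding: the column is split into a categorical flag ("is this exactly the special value?") and a continuous part that only covers the remaining rows. This is only an illustration in plain Python. The names `encode_mixed` and `decode_mixed` are hypothetical and are not part of RDT or SDV, and this is not the CTAB-GAN implementation.

```python
from typing import List, Optional, Tuple


def encode_mixed(
    values: List[float], special: float = 0.0
) -> Tuple[List[int], List[Optional[float]]]:
    """Split a mixed column into a flag column and a continuous column.

    Hypothetical helper for illustration only (not an RDT/SDV API).
    flags[i] is 1 where the value equals `special` exactly, else 0.
    continuous[i] is None where the flag is 1, else the original value.
    """
    flags = [1 if v == special else 0 for v in values]
    continuous = [None if v == special else v for v in values]
    return flags, continuous


def decode_mixed(
    flags: List[int], continuous: List[Optional[float]], special: float = 0.0
) -> List[Optional[float]]:
    """Recombine the two columns; the flag decides where `special` goes."""
    return [special if f == 1 else c for f, c in zip(flags, continuous)]


# Toy outstanding-debt column with exact zeros.
debt = [0.0, 1250.5, 0.0, 310.0]
flags, continuous = encode_mixed(debt)
restored = decode_mixed(flags, continuous)
```

With this split, the exact zeros are carried by the flag column, so the continuous part no longer has to place a spike at exactly 0.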

I'm very interested in what your opinion is on this.
Particularly, do you think this would have an impact on the CTGAN loss function?
Is there, currently, an easy way to mimic the idea of CTAB-GAN, using just the SDV package?
Would the implementation of this mixed-encoder be a valuable addition to the SDV ecosystem?

Kind regards,
Wilco
