Description
I've come across the paper for CTAB-GAN, which implements a transformer I would very much like to see in RDT, to use with SDV's synthesizers.
They created a mixed-type encoder to deal with continuous variables that have some categorical property.
I've been running into this issue myself. When dealing with loan data, the amount of outstanding debt is treated as a continuous variable by CTGAN, but this approach misses some of the nuances.
In many cases, the outstanding debt is 0. Exactly 0. I've found that CTGAN, using the FloatFormatter, has a hard time grasping this idea. The synthesized data will have lots of values close to, but not exactly, zero. Post-processing would be an option, but I feel like this does not solve the underlying problem. Plausibly, the occurrence of such mixed variables makes the task very easy for the discriminator and difficult for the generator, because "exactly 0" in some columns might stand out as an easy-to-spot characteristic of the real data.
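To make the idea concrete, here is a minimal sketch of the kind of encoding I have in mind, using only pandas. It is a simplification, not CTAB-GAN's actual mixed-type encoder and not an existing RDT transformer: the helper names `encode_mixed` and `decode_mixed`, the column names `is_special` and `value`, and the example data are all hypothetical. The column is split into a binary flag for "exactly 0" plus a continuous part, so a synthesizer could treat the flag as categorical and the rest as continuous.

```python
import pandas as pd


def encode_mixed(series, special_value=0.0):
    """Split a column into a flag for the special value and a continuous part."""
    is_special = series == special_value
    # Assumption: fill the special rows with the mean of the remaining values so the
    # continuous part has no artificial spike. (If every row is special, this is NaN.)
    fill = series[~is_special].mean()
    value = series.where(~is_special, fill)
    return pd.DataFrame(
        {"is_special": is_special.astype(int), "value": value},
        index=series.index,
    )


def decode_mixed(encoded, special_value=0.0, threshold=0.5):
    """Reverse the encoding: rows flagged as special get the exact special value back."""
    is_special = encoded["is_special"] >= threshold
    return encoded["value"].where(~is_special, special_value)


# Hypothetical loan data: outstanding debt is often exactly 0.
debt = pd.Series([0.0, 1200.5, 0.0, 300.0], name="outstanding_debt")
encoded = encode_mixed(debt)
decoded = decode_mixed(encoded)
```

With something like this, the zeros would be reproduced exactly after decoding, because they come from the flag rather than from the continuous output.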
I'm very interested in what your opinion is on this.
In particular, do you think this would have an impact on the CTGAN loss function?
Is there, currently, an easy way to mimic the idea of CTAB-GAN, using just the SDV package?
Would the implementation of this mixed-encoder be a valuable addition to the SDV ecosystem?
Kind regards,
Wilco