Description
This feedback was first noted in #2618.
Problem Description
I have a dataset that contains information about different patients' visits to the hospital. In this dataset, the patient_id is the sequence key, and the patient_birthdate and patient_sex are context columns (they do not vary per patient). The remainder of the columns are sequential columns that vary based on every visit a patient makes to the hospital.
| patient_id | patient_birthdate | patient_sex | visit_date | weight | ... |
|---|---|---|---|---|---|
| p_1934 | 2009-02-12 | M | 2025-01-29 | 169 | |
| p_1934 | 2009-02-12 | M | 2025-04-08 | 174 | |
| p_1934 | 2009-02-12 | M | 2025-07-23 | 171 | |
| p_1210 | 1995-06-15 | F | 2025-02-19 | 135 | |
| p_1210 | 1995-06-15 | F | 2025-05-02 | 128 | |
Based on this data, I would like to input a constraint to ensure that the patient_birthdate <= visit_date for every single row of the table. Unfortunately, I am unable to do this right now because PARSynthesizer doesn't support constraints between contextual and non-contextual columns.
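To make the requirement concrete, here is a minimal pandas-only sketch that checks both properties on the sample rows above: the context columns are constant per patient, and no visit precedes the birthdate. The helper names `context_is_constant` and `rows_violating_inequality` are made up for illustration and are not part of SDV.

```python
import pandas as pd

# Sample rows copied from the table above (the elided "..." columns are omitted).
data = pd.DataFrame({
    'patient_id': ['p_1934', 'p_1934', 'p_1934', 'p_1210', 'p_1210'],
    'patient_birthdate': ['2009-02-12', '2009-02-12', '2009-02-12',
                          '1995-06-15', '1995-06-15'],
    'patient_sex': ['M', 'M', 'M', 'F', 'F'],
    'visit_date': ['2025-01-29', '2025-04-08', '2025-07-23',
                   '2025-02-19', '2025-05-02'],
    'weight': [169, 174, 171, 135, 128],
})


def context_is_constant(df, sequence_key, context_columns):
    """Return True if every context column has one distinct value per sequence."""
    distinct_counts = df.groupby(sequence_key)[context_columns].nunique(dropna=False)
    return bool((distinct_counts == 1).all().all())


def rows_violating_inequality(df, low_column, high_column):
    """Return the rows where low_column > high_column (dates are parsed first)."""
    low = pd.to_datetime(df[low_column])
    high = pd.to_datetime(df[high_column])
    return df[low > high]


context_ok = context_is_constant(
    data, 'patient_id', ['patient_birthdate', 'patient_sex']
)
violations = rows_violating_inequality(data, 'patient_birthdate', 'visit_date')
```

The same `rows_violating_inequality` check could be run on synthetic output to see whether sampled visits ever fall before the sampled birthdate.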
Expected behavior
Allow me to add an Inequality constraint where one of the columns is a context column and the other is a non-context column. I expect to be able to apply this to the PARSynthesizer just like any other constraint.
```python
from sdv.cag import Inequality
from sdv.sequential import PARSynthesizer

my_constraint = Inequality(
    low_column_name='patient_birthdate',
    high_column_name='visit_date'
)
synthesizer = PARSynthesizer(metadata, context_columns=['patient_birthdate', 'patient_sex'])
synthesizer.add_constraints([my_constraint])
synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_sequences=2)
```
Workarounds
Until this is fixed, it's possible to work around the problem with a custom constraint. However, since custom constraints are not supported on PARSynthesizer at the moment, you'd have to run the pre- and post-processing outside of the synthesizer.
```python
my_constraint = MyCustomInequalityConstraint()

# allow the constraint to transform the input data and metadata
new_data = my_constraint.transform(data)
new_metadata = my_constraint.get_updated_metadata(metadata)

synth = PARSynthesizer(new_metadata, epochs=2, context_columns=['patient_birthdate', 'patient_sex'])
synth.fit(new_data)
synthetic_data = synth.sample(2)

# allow the constraint to reverse transform the outputted synthetic data
post_synthetic_data = my_constraint.reverse_transform(synthetic_data)
```
Note that MyCustomInequalityConstraint here would do the following:
- On the transform, modify the visit_date to represent the # of days after the birthdate instead
- On the reverse transform, recalculate the actual visit date
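The two steps above can be sketched with plain pandas. This is only an illustration under stated assumptions: `DaysSinceBirthTransform` is a hypothetical standalone helper, not an SDV class, it does not cover the metadata update, and clipping negative synthetic offsets to zero is one possible way to guarantee the inequality rather than a confirmed design.

```python
import pandas as pd


class DaysSinceBirthTransform:
    """Hypothetical helper (pandas only, not an SDV class)."""

    def __init__(self, low_column='patient_birthdate', high_column='visit_date'):
        self.low_column = low_column
        self.high_column = high_column

    def transform(self, df):
        """Replace the visit date with the number of days since the birthdate."""
        out = df.copy()
        low = pd.to_datetime(out[self.low_column])
        high = pd.to_datetime(out[self.high_column])
        out[self.high_column] = (high - low).dt.days
        return out

    def reverse_transform(self, df):
        """Rebuild the visit date as birthdate + day offset."""
        out = df.copy()
        low = pd.to_datetime(out[self.low_column])
        # Assumption: clip negative synthetic offsets to 0 so that
        # birthdate <= visit date holds for every reconstructed row.
        days = pd.to_numeric(out[self.high_column]).clip(lower=0).round()
        out[self.high_column] = low + pd.to_timedelta(days, unit='D')
        return out


visits = pd.DataFrame({
    'patient_id': ['p_1934', 'p_1934', 'p_1210'],
    'patient_birthdate': ['2009-02-12', '2009-02-12', '1995-06-15'],
    'visit_date': ['2025-01-29', '2025-04-08', '2025-02-19'],
})
helper = DaysSinceBirthTransform()
encoded = helper.transform(visits)            # visit_date is now a day count
restored = helper.reverse_transform(encoded)  # visit_date is a datetime again
```

Because the synthesizer only ever sees a non-negative day offset, any value it samples maps back to a date on or after the birthdate, which is why the transformed column also needs a numerical type in the updated metadata.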
Additional context
If we can do this for Inequality, we should also be able to support the related constraints:
- Range
- ChainedInequality