Description
Hello,
As part of my master's thesis on synthetic data generation for healthcare, we tested and evaluated SDV's `PARSynthesizer` on different segments of a clinical dataset.
Our main conclusions and findings were:
-> The model is easy to configure and performs efficiently on smaller data samples.
-> However, the generated data showed consistency and quality issues: it struggled to preserve correlations between fields and missed subtle but important patterns present in the original data. For example, we observed diagnostic exams dated before the corresponding medical consultations, and medical consultations dated after the recorded date of death.
-> Additionally, due to its full in-memory processing design, we faced scalability issues when attempting to synthesize larger datasets, even after filtering and simplifying the input.
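To make the consistency issues concrete, here is a minimal sketch of the kind of check we mean, written with plain pandas. The column names (`consultation_date`, `exam_date`, `death_date`) and the sample rows are hypothetical placeholders, not our real schema or data, and the helper `count_temporal_violations` is only an illustration of the rule, not part of SDV.

```python
import pandas as pd


def count_temporal_violations(df: pd.DataFrame) -> dict:
    """Count rows that break two date-ordering rules.

    Assumes one row per (consultation, exam) pair with hypothetical
    columns: consultation_date, exam_date, death_date.
    """
    consult = pd.to_datetime(df["consultation_date"], errors="coerce")
    exam = pd.to_datetime(df["exam_date"], errors="coerce")
    death = pd.to_datetime(df["death_date"], errors="coerce")

    # Comparisons against missing dates (NaT) evaluate to False,
    # so patients without a recorded death are never flagged.
    return {
        "exam_before_consultation": int((exam < consult).sum()),
        "consultation_after_death": int((consult > death).sum()),
    }


# Made-up rows standing in for synthesizer output.
synthetic = pd.DataFrame(
    {
        "consultation_date": ["2021-01-10", "2021-03-05", "2022-06-01"],
        "exam_date": ["2021-01-12", "2021-03-01", "2022-06-02"],
        "death_date": [None, None, "2022-05-20"],
    }
)

violations = count_temporal_violations(synthetic)
print(violations)
```

Running the same function on the real and the synthetic tables would give a simple side-by-side count of how often each rule is broken.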
Could you please share any suggestions for improving these results, such as advanced configurations, alternative preprocessing steps, or known limitations to consider? We would be happy to incorporate them into our testing.
Best regards.