Merged
Changes from 22 commits
28 commits
2550c85
WIP needs to save results
lotif Nov 11, 2025
8146b8a
Done the single table
lotif Nov 11, 2025
00d879a
Finished adding the multi table example
lotif Nov 12, 2025
4148922
Adding test for the bug fix
lotif Nov 12, 2025
f075e46
Better docstrings
lotif Nov 12, 2025
7d890f9
Fixing typo
lotif Nov 12, 2025
402e35d
Fixing the config yaml link
lotif Nov 12, 2025
5fd0cc8
CR by coderabbit
lotif Nov 12, 2025
e9d385b
Fixing the config yaml link
lotif Nov 12, 2025
ac1a885
CR by coderabbit
lotif Nov 12, 2025
c904e9b
Actually fixing the config file links
lotif Nov 12, 2025
8537535
Merge branch 'marcelo/trainer-example' into marcelo/synthesizer-example
lotif Nov 12, 2025
3d0b9c6
Synthesizing single table first files
lotif Nov 13, 2025
baa9824
Finishing the sythesizer single table example
lotif Nov 13, 2025
d480dd9
Small tweak in the readmes
lotif Nov 13, 2025
a397025
Merge branch 'marcelo/trainer-example' into marcelo/synthesizer-example
lotif Nov 13, 2025
e81aa91
Final synthesizer config
lotif Nov 13, 2025
1c55e91
actual final configs
lotif Nov 13, 2025
37ebe7d
removing one extra zero
lotif Nov 13, 2025
8974cfe
finishing the synthsizer example code
lotif Nov 13, 2025
a1a12f1
Merge branch 'main' into marcelo/synthesizer-example
lotif Nov 13, 2025
3a3814d
making the save dir in case it does't exist
lotif Nov 13, 2025
32e9ab3
Fixing tests
lotif Nov 13, 2025
12f6491
Finishing the instructions for the multi table example
lotif Nov 13, 2025
bc9af11
CR by coderabbit
lotif Nov 13, 2025
02bcf42
David's CR
lotif Nov 17, 2025
4ffa4c0
Merge branch 'main' into marcelo/synthesizer-example
lotif Nov 17, 2025
ea1f5e8
Merge branch 'main' into marcelo/synthesizer-example
lotif Nov 17, 2025
4 changes: 4 additions & 0 deletions .gitignore
@@ -47,3 +47,7 @@ examples/training/single_table/data/**
examples/training/single_table/results/**
examples/training/multi_table/data/**
examples/training/multi_table/results/**
examples/synthesizing/single_table/data/**
examples/synthesizing/single_table/results/**
examples/synthesizing/multi_table/data/**
examples/synthesizing/multi_table/results/**
60 changes: 60 additions & 0 deletions examples/synthesizing/multi_table/README.md
@@ -0,0 +1,60 @@
# Multi-Table Synthesizing Example

This example will go over synthesizing data for a multi-table dataset from the ground
up using the code in this toolkit.


## Downloading data

First, we need the data. Download it from this
[Google Drive link](https://drive.google.com/file/d/1Ao222l4AJjG54-HDEGCWkIfzRbl9_IKa/view?usp=drive_link),
extract the files, and place them in a `/data` folder within this folder
(`examples/synthesizing/multi_table`).

> [!NOTE]
> If you wish to change the data folder, you can do so by editing the `base_data_dir` attribute
> of the [`config.yaml`](config.yaml) file.

It will contain data for 8 tables: `account`, `card`, `client`, `disp`, `district`, `loan`, `order`,
and `trans`. For each table there will be two files:
- `{table_name}.csv`: The table's data.
- `{table_name}_domain.json`: Metadata about the columns in the table's data, such as data types and sizes.

Additionally, you will find one more file:
- `dataset_meta.json`: Metadata about the relationship between the tables. It will describe which tables
are associated with which other tables.
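
A quick way to confirm the extraction worked is to check for the files listed above. The sketch below is not part of the toolkit: `missing_data_files` is a hypothetical helper, and the path in the usage section assumes the default `/data` location.

```python
from pathlib import Path

# Table names as listed in this README.
TABLE_NAMES = ["account", "card", "client", "disp", "district", "loan", "order", "trans"]


def missing_data_files(data_dir: Path, table_names: list[str] = TABLE_NAMES) -> list[Path]:
    """Return the expected data files that are absent from `data_dir`."""
    expected = [data_dir / "dataset_meta.json"]
    for name in table_names:
        expected.append(data_dir / f"{name}.csv")
        expected.append(data_dir / f"{name}_domain.json")
    return [path for path in expected if not path.exists()]


if __name__ == "__main__":
    # Assumed default location; adjust if `base_data_dir` was changed.
    missing = missing_data_files(Path("examples/synthesizing/multi_table/data"))
    print("All files present." if not missing else f"Missing files: {missing}")
```

Run it from the project's root folder so the relative path resolves.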


## Kicking off synthesizing

If there is a `/results` folder within this folder (`examples/synthesizing/multi_table`)
from a previous training run, the script reuses its checkpoints (the per-relation model
files under `/results/models` and `/results/cluster_ckpt.pkl`) instead of training again.
If any of them is missing, it trains a new model first.
For example, you can copy the results of another run (e.g. `examples.training.multi_table.run_training`)
into this folder and this example will pick them up.

The [`config.yaml`](config.yaml) file contains the parameters for the synthesizing and also
for training, in case there is a need to run that. Please take a look at them before kicking
off the synthesizing process and edit them as necessary.

To kick off synthesizing, simply run the command below from the project's root folder:

```bash
python -m examples.synthesizing.multi_table.run_synthesizing
```

## Results

The script saves the result files inside a `/results` folder within this folder
(`examples/synthesizing/multi_table`).

> [!NOTE]
> If you wish to change the save folder, you can do so by editing the `results_dir` attribute
> of the [`config.yaml`](config.yaml) file.

In the `/results/before_matching/` folder, there will be a file called `synthetic_tables.pkl`,
which is a pickle file containing the synthetic data before the matching process, in case
it's needed.
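
If you want to look at the pre-matching data, a minimal sketch for opening the pickle could look like the one below. The internal structure of `synthetic_tables.pkl` is not described in this README, so the code only prints the object's type and, if it happens to be a dictionary, its keys. `load_pickle` is a hypothetical helper, and unpickling may require `midst_toolkit` to be importable if the file stores toolkit objects.

```python
import pickle
from pathlib import Path
from typing import Any


def load_pickle(path: Path) -> Any:
    """Load a pickle file. Only use this on files you created or trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    # Assumed default location; adjust if `results_dir` was changed.
    pkl_path = Path("examples/synthesizing/multi_table/results/before_matching/synthetic_tables.pkl")
    if pkl_path.exists():
        synthetic_tables = load_pickle(pkl_path)
        print(type(synthetic_tables))
        if isinstance(synthetic_tables, dict):
            print(list(synthetic_tables.keys()))
    else:
        print(f"{pkl_path} not found; run the synthesizing example first.")
```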

The `/results/multi_table_synthesizing` folder (named after `exp_name` in the config) will
contain the final synthesized data, organized per table. Each table's output should follow the
same layout as in the single-table example: `/results/multi_table_synthesizing/{table_name}/_final/{table_name}_synthetic.csv`.
46 changes: 46 additions & 0 deletions examples/synthesizing/multi_table/config.yaml
@@ -0,0 +1,46 @@
# Synthesizing example configuration (also used for training when no checkpoints are found)
# Base data directory (can be overridden from command line)
base_data_dir: examples/synthesizing/multi_table/data
results_dir: examples/synthesizing/multi_table/results

diffusion_config:
  d_layers: [512, 1024, 1024, 1024, 1024, 512]
  dropout: 0.0
  num_timesteps: 2000
  model_type: mlp
  iterations: 20000
  batch_size: 4096
  lr: 0.0006
  gaussian_loss_type: mse
  weight_decay: 1e-05
  scheduler: cosine
  data_split_ratios: [0.99, 0.005, 0.005]

clustering_config:
  parent_scale: 1.0
  num_clusters: 50
  clustering_method: kmeans_and_gmm

classifier_config:
  d_layers: [128, 256, 512, 1024, 512, 256, 128]
  lr: 0.0001
  dim_t: 128
  batch_size: 4096
  iterations: 20000

general_config:
  data_dir: examples/synthesizing/multi_table/data
  test_data_dir: examples/synthesizing/multi_table/data
  exp_name: multi_table_synthesizing
  workspace_dir: examples/synthesizing/multi_table/results
  sample_prefix: ""

sampling_config:
  batch_size: 20000
  classifier_scale: 1.0

matching_config:
  num_matching_clusters: 1
  matching_batch_size: 1000
  unique_matching: True
  no_matching: False
81 changes: 81 additions & 0 deletions examples/synthesizing/multi_table/run_synthesizing.py
@@ -0,0 +1,81 @@
import pickle
from logging import INFO
from pathlib import Path

import hydra
from omegaconf import DictConfig

from examples.training.multi_table import run_training
from midst_toolkit.common.config import GeneralConfig, MatchingConfig, SamplingConfig
from midst_toolkit.common.logger import TOOLKIT_LOGGER, log
from midst_toolkit.models.clavaddpm.data_loaders import load_tables
from midst_toolkit.models.clavaddpm.synthesizer import clava_synthesizing


# Preventing some excessive logging
TOOLKIT_LOGGER.setLevel(INFO)


@hydra.main(config_path=".", config_name="config", version_base=None)
def main(config: DictConfig) -> None:
    """
    Run the synthesizing pipeline for a multi-table diffusion model.

    It will load the config and then data from the `config.base_data_dir` folder,
    train the model, synthesize the data and save the results in the
    `config.results_dir` folder.

    It will first look for a pre-trained model in the `config.results_dir` folder.
    If it doesn't find one, it will train a new model from scratch.

    Args:
        config: Training and synthesizing configuration as an OmegaConf DictConfig object.
    """
    log(INFO, f"Checking for a pre-trained model in {config.results_dir}...")

    _, relation_order, _ = load_tables(Path(config.base_data_dir))

    # One checkpoint file is expected per (parent, child) relation.
    model_file_paths = {}
    for relation in relation_order:
        model_file_path = Path(config.results_dir) / "models" / f"{relation[0]}_{relation[1]}_ckpt.pkl"
        model_file_paths[relation] = model_file_path

    clustering_results_file = Path(config.results_dir) / "cluster_ckpt.pkl"

    if all(model_file.exists() for model_file in model_file_paths.values()) and clustering_results_file.exists():
        log(INFO, f"Found pre-trained models in {config.results_dir}. Skipping training.")
    else:
        log(INFO, "No pre-trained models found, training a new model from scratch...")
Collaborator

Perhaps a more accurate message would be that "Not all required checkpoints were found..." since we need a number to exist to synthesize from?

It also might be helpful to log the ones that are missing?
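
The suggestion above could be sketched roughly as follows. This is an illustration rather than the change that was merged: `find_missing_checkpoints` is a hypothetical helper, the example relations are made up, and the standard `logging` module stands in for the toolkit's `log` function. The file naming mirrors the pattern used in `run_synthesizing.py`.

```python
from logging import INFO, basicConfig, getLogger
from pathlib import Path
from typing import Optional

logger = getLogger(__name__)


def find_missing_checkpoints(
    results_dir: Path,
    relation_order: list[tuple[Optional[str], str]],
) -> list[Path]:
    """Return every checkpoint file the synthesizer needs that is not on disk."""
    required = [
        results_dir / "models" / f"{parent}_{child}_ckpt.pkl"
        for parent, child in relation_order
    ]
    required.append(results_dir / "cluster_ckpt.pkl")
    return [path for path in required if not path.exists()]


if __name__ == "__main__":
    basicConfig(level=INFO)
    # Illustrative relations only; the real ones come from load_tables().
    example_relations = [(None, "account"), ("account", "trans")]
    missing = find_missing_checkpoints(Path("examples/synthesizing/multi_table/results"), example_relations)
    if missing:
        logger.info("Not all required checkpoints were found. Missing: %s", [str(p) for p in missing])
    else:
        logger.info("All required checkpoints found. Skipping training.")
```

Returning the list of missing paths, rather than a boolean, lets the caller both decide whether to train and report exactly which files triggered the decision.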

        run_training.main(config)

    log(INFO, "Loading models...")

    models = {}
    for relation in relation_order:
        with open(model_file_paths[relation], "rb") as f:
            models[relation] = pickle.load(f)

    with open(clustering_results_file, "rb") as f:
        clustering_result = pickle.load(f)

    tables = clustering_result["tables"]
    all_group_lengths_prob_dicts = clustering_result["all_group_lengths_prob_dicts"]

    log(INFO, "Synthesizing data...")

    clava_synthesizing(
        tables,
        relation_order,
        Path(config.results_dir),
        models,
        GeneralConfig(**config.general_config),
        SamplingConfig(**config.sampling_config),
        MatchingConfig(**config.matching_config),
        all_group_lengths_prob_dicts,
    )

    log(INFO, "Data synthesized successfully.")


if __name__ == "__main__":
    main()
58 changes: 58 additions & 0 deletions examples/synthesizing/single_table/README.md
@@ -0,0 +1,58 @@
# Single-Table Synthesizing Example

This example will go over synthesizing data for a single-table dataset from the ground
up using the code in this toolkit.


## Downloading data

First, we need the data. Download it from this
[Google Drive link](https://drive.google.com/file/d/1J5qDuMHHg4dm9c3ISmb41tcTHSu1SVUC/view?usp=drive_link),
extract the files, and place them in a `/data` folder within this folder
(`examples/synthesizing/single_table`).

> [!NOTE]
> If you wish to change the data folder, you can do so by editing the `base_data_dir` attribute
> of the [`config.yaml`](config.yaml) file.

Here is a description of the files that have been extracted:
- `trans.csv`: The training data. It consists of information about bank transactions and it
contains 20,000 data points.
- `trans_domain.json`: Metadata about the columns in `trans.csv`, such as data types and sizes.
- `dataset_meta.json`: Metadata about the relationship between the tables. Since this is a
single-table example, it will only contain information about the `trans` table.


## Kicking off synthesizing

If there is a `/results` folder within this folder (`examples/synthesizing/single_table`)
from a previous training run, the script reuses the model checkpoints under `/results/models`
instead of training again. If any of them is missing, it trains a new model first.
For example, you can copy the results of another run (e.g. `examples.training.single_table.run_training`)
into this folder and this example will pick them up.

The [`config.yaml`](config.yaml) file contains the parameters for the synthesizing and also
for training, in case there is a need to run that. Please take a look at them before kicking
off the synthesizing process and edit them as necessary.

To kick off synthesizing, simply run the command below from the project's root folder:

```bash
python -m examples.synthesizing.single_table.run_synthesizing
```

## Results

The script saves the result files inside a `/results` folder within this folder
(`examples/synthesizing/single_table`).

> [!NOTE]
> If you wish to change the save folder, you can do so by editing the `results_dir` attribute
> of the [`config.yaml`](config.yaml) file.

In the `/results/before_matching/` folder, there will be a file called `synthetic_tables.pkl`,
which is a pickle file containing the synthetic data before the matching process, in case
it's needed.

The `/results/single_table_synthesizing` folder will contain the final synthesized
data, organized per table. In this single-table example, there is only going to be one
synthesized table under `/results/single_table_synthesizing/trans/_final/trans_synthetic.csv`.
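
To check the output after a run, something like the sketch below can locate the final CSV files and count their rows. It relies only on the folder layout described above. `find_synthetic_csvs` and `count_rows` are hypothetical helpers, not toolkit functions, and the row count assumes the CSV starts with a header line.

```python
import csv
from pathlib import Path


def find_synthetic_csvs(experiment_dir: Path) -> dict[str, Path]:
    """Map each table name to its `{table}/_final/{table}_synthetic.csv` file, if present."""
    found = {}
    for csv_path in sorted(experiment_dir.glob("*/_final/*_synthetic.csv")):
        table_name = csv_path.parent.parent.name
        found[table_name] = csv_path
    return found


def count_rows(csv_path: Path) -> int:
    """Count data rows, assuming the first line is a header."""
    with open(csv_path, newline="") as f:
        return max(sum(1 for _ in csv.reader(f)) - 1, 0)


if __name__ == "__main__":
    # Assumed default location; adjust if `results_dir` was changed.
    experiment_dir = Path("examples/synthesizing/single_table/results/single_table_synthesizing")
    for table, path in find_synthetic_csvs(experiment_dir).items():
        print(f"{table}: {count_rows(path)} rows in {path}")
```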
34 changes: 34 additions & 0 deletions examples/synthesizing/single_table/config.yaml
@@ -0,0 +1,34 @@
# Synthesizing example configuration (also used for training when no checkpoints are found)
# Base data directory (can be overridden from command line)
base_data_dir: examples/synthesizing/single_table/data
results_dir: examples/synthesizing/single_table/results

diffusion_config:
Collaborator (emersodb, Nov 14, 2025)

Same potential suggestion about a comment here 🙂

  d_layers: [512, 1024, 1024, 1024, 1024, 512]
  dropout: 0.0
  num_timesteps: 2000
  model_type: mlp
  iterations: 20000
  batch_size: 4096
  lr: 0.0006
  gaussian_loss_type: mse
  weight_decay: 1e-05
  scheduler: cosine
  data_split_ratios: [0.99, 0.005, 0.005]

general_config:
  data_dir: examples/synthesizing/single_table/data
  test_data_dir: examples/synthesizing/single_table/data
  exp_name: single_table_synthesizing
  workspace_dir: examples/synthesizing/single_table/results
  sample_prefix: ""

sampling_config:
  batch_size: 20000
  classifier_scale: 1.0

matching_config:
  num_matching_clusters: 1
  matching_batch_size: 1000
  unique_matching: True
  no_matching: False
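
As a rough illustration of what `data_split_ratios: [0.99, 0.005, 0.005]` implies for the 20,000-row `trans` table, the sketch below turns ratios into row counts. It assumes the three values are train, validation, and test fractions in that order, which this config does not state, and it uses plain rounding, which may differ from how the toolkit actually splits the data. `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(num_rows: int, ratios: list[float]) -> list[int]:
    """Convert split ratios into row counts (assumed semantics, simple rounding)."""
    if not math.isclose(sum(ratios), 1.0):
        raise ValueError(f"Split ratios must sum to 1, got {sum(ratios)}")
    return [round(num_rows * ratio) for ratio in ratios]


if __name__ == "__main__":
    print(split_sizes(20000, [0.99, 0.005, 0.005]))
```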
72 changes: 72 additions & 0 deletions examples/synthesizing/single_table/run_synthesizing.py
@@ -0,0 +1,72 @@
import pickle
from logging import INFO
from pathlib import Path

import hydra
from omegaconf import DictConfig

from examples.training.single_table import run_training
from midst_toolkit.common.config import GeneralConfig, MatchingConfig, SamplingConfig
from midst_toolkit.common.logger import TOOLKIT_LOGGER, log
from midst_toolkit.models.clavaddpm.data_loaders import load_tables
from midst_toolkit.models.clavaddpm.synthesizer import clava_synthesizing


# Preventing some excessive logging
TOOLKIT_LOGGER.setLevel(INFO)


@hydra.main(config_path=".", config_name="config", version_base=None)
def main(config: DictConfig) -> None:
    """
    Run the synthesizing pipeline for a single-table diffusion model.

    It will load the config and then data from the `config.base_data_dir` folder,
    train the model, synthesize the data and save the results in the
    `config.results_dir` folder.

    It will first look for a pre-trained model in the `config.results_dir` folder.
    If it doesn't find one, it will train a new model from scratch.

    Args:
        config: Training and synthesizing configuration as an OmegaConf DictConfig object.
    """
    log(INFO, f"Checking for a pre-trained model in {config.results_dir}...")

    tables, relation_order, _ = load_tables(Path(config.base_data_dir))

    model_file_paths = {}
Collaborator

This is going to be a naive question, but if we're doing single table synthesis, what role does relation_order have in synthesis and should we actually be loading multiple pickle files below?

Collaborator

Perhaps we're treating each table as an individual unrelated table in this synthesis, so we'll still have multiple tables to generate? If so we should also be clear in the readme that we'll still generate multiple tables, they'll just be "independent"

Collaborator Author

We need those files and their relations as synthesizing works with the table relations as well. If the tables are not related, you can declare the relation as (None, table_name) as it's being done on single table, but the relationship between the tables is taken into account when synthesizing data for each one of them.

Collaborator

The relationship between the tables is taken into account even when doing single table synthesis or you're just saying that we need those files to "make it work"?

Collaborator

Should we assert that the first relation is None somewhere do you think?

Collaborator Author

We can put that assert for single table, but multi table I think that's not necessarily true for all cases.

Collaborator Author

> The relationship between the tables is taken into account even when doing single table synthesis or you're just saying that we need those files to "make it work"?

The relationship is not taken into account for single table. I think I get your point, the config can be made simpler for single table. I'll put a ticket in the backlog so we can track and come back to it later.

Collaborator

Yeah for multi-table, I don't think the assert is valid.

Yeah, I guess that's what I'm getting at (sorry for the round-about way of communicating it). Basically, it seems like relationship data isn't leveraged in single table, so was thinking it could be left out in some way.

Collaborator Author

It definitely can, we could add some code that would make that relationship from just the table name, for example. This way, it's simpler for the user to set up as they won't need a dataset_meta.json file anymore. I added the ticket:

https://app.clickup.com/t/868gd4w1c
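
The idea discussed in this thread, deriving the relation from just the table name and asserting its shape for single-table runs, might look roughly like the sketch below. None of these helpers exist in the toolkit; the names are hypothetical, and the only things taken from the PR are the `(None, table_name)` convention and the checkpoint naming pattern in `run_synthesizing.py`.

```python
from pathlib import Path
from typing import Optional

# A relation is a (parent, child) pair; the parent is None for a table with no parent.
Relation = tuple[Optional[str], str]


def single_table_relation_order(table_name: str) -> list[Relation]:
    """Build the relation order for one table with no parent: [(None, table_name)]."""
    return [(None, table_name)]


def assert_single_table(relation_order: list[Relation]) -> None:
    """Check the shape discussed above: exactly one relation whose parent is None."""
    assert len(relation_order) == 1, "single-table synthesis expects exactly one relation"
    parent, _ = relation_order[0]
    assert parent is None, "single-table synthesis expects the parent to be None"


def checkpoint_path(results_dir: Path, relation: Relation) -> Path:
    """Mirror the checkpoint naming used in run_synthesizing.py."""
    return results_dir / "models" / f"{relation[0]}_{relation[1]}_ckpt.pkl"


if __name__ == "__main__":
    relation_order = single_table_relation_order("trans")
    assert_single_table(relation_order)
    print(checkpoint_path(Path("examples/synthesizing/single_table/results"), relation_order[0]))
```

As noted in the thread, the assert only makes sense for the single-table case; multi-table relation orders can legitimately contain relations with a non-`None` parent.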

    for relation in relation_order:
        model_file_path = Path(config.results_dir) / "models" / f"{relation[0]}_{relation[1]}_ckpt.pkl"
        model_file_paths[relation] = model_file_path

    if all(model_file.exists() for model_file in model_file_paths.values()):
        log(INFO, f"Found pre-trained models in {config.results_dir}. Skipping training.")
    else:
        log(INFO, "No pre-trained models found, training a new model from scratch...")
        run_training.main(config)

    log(INFO, "Loading models...")

    models = {}
    for relation in relation_order:
        with open(model_file_paths[relation], "rb") as f:
            models[relation] = pickle.load(f)

    log(INFO, "Synthesizing data...")

    clava_synthesizing(
        tables,
        relation_order,
        Path(config.results_dir),
        models,
        GeneralConfig(**config.general_config),
        SamplingConfig(**config.sampling_config),
        MatchingConfig(**config.matching_config),
    )

    log(INFO, "Data synthesized successfully.")


if __name__ == "__main__":
    main()
4 changes: 2 additions & 2 deletions examples/training/multi_table/README.md
@@ -13,7 +13,7 @@ extract the files and place them in a `/data` folder in within this folder

> [!NOTE]
> If you wish to change the data folder, you can do so by editing the `base_data_dir` attribute
> of the [`config.yaml`](config.yaml) file.
> of the (`config.yaml`)[config.yaml] file.

It will contain data for 8 tables: `account`, `card`, `client`, `disp`, `district`, `loan`, `order`,
and `trans`. For each table there will be two files:
@@ -44,7 +44,7 @@ The result files will be saved inside a `/results` folder within this folder

> [!NOTE]
> If you wish to change the save folder, you can do so by editing the `results_dir` attribute
> of the [`config.yaml`](config.yaml) file.
> of the (`config.yaml`)[config.yaml] file.

One of the results file is `/results/cluster_ckpt.pkl`, which will contain the results
of the clustering step.