
Conversation


@lotif lotif commented Nov 13, 2025

PR Type

Feature

Short Description

Clickup Ticket(s): https://app.clickup.com/t/868g5jctq

Adding examples for synthesizing single-table and multi-table data.

Also, fixing a couple of bugs that appeared while testing the examples.

Tests Added

NA


coderabbitai bot commented Nov 13, 2025

📝 Walkthrough


This pull request introduces multi-table synthesis capabilities to the toolkit, alongside documentation and configuration for single-table synthesis. Key changes:

  • New example scripts (run_synthesizing.py) and configuration files for both single- and multi-table synthesis workflows.
  • Updates to the core synthesis logic in clava_synthesizing to accept an optional all_group_lengths_prob_dicts parameter.
  • Minor logic improvements: floating-point tolerance in validation, directory creation before file writes, and relation-order tuple conversion.
  • Documentation formatting updates.
  • Consistent reordering of function arguments across call sites to reflect the new parameter position.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

  • Argument positioning changes: The all_group_lengths_prob_dicts parameter is repositioned across multiple call sites (shadow_model_utils.py, test_synthesizer.py, synthesizer.py signature). Verify consistency and correctness across all invocations to prevent runtime mismatches.
  • Core synthesizer modifications: Changes to clava_synthesizing signature and logic (new optional parameter, floating-point tolerance relaxation, NO_PARENT_COLUMN_NAME placeholder handling) require careful verification against both single-table and multi-table code paths.
  • New public entry points: Two new main(config: DictConfig) functions in the example scripts need review for proper workflow orchestration and error handling.
  • Multi-table assertion logic: The runtime assertion that all_group_lengths_prob_dicts is not None for multi-table synthesis must be validated to ensure correct branching behavior.
  • Heterogeneous changes across layers: Logic changes span example orchestration, core modeling, and test files, requiring context switching during review.

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'Adding synthesizer example' is partially related but overly broad and vague. It does not clearly convey the scope (single-table and multi-table examples) or that bug fixes are included. Consider a more specific title like 'Add single-table and multi-table synthesizing examples' to better reflect the comprehensive scope of changes.
✅ Passed checks (2 passed)
Check name Status Explanation
Description check ✅ Passed The description includes all required template sections (PR Type, Short Description, Tests Added) with appropriate detail about the examples being added and bugs fixed.
Docstring Coverage ✅ Passed Docstring coverage is 88.89% which is sufficient. The required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (2)
examples/synthesizing/multi_table/config.yaml (1)

1-46: Config shape matches runtime expectations; only minor duplication

The structure and keys here line up with how examples.synthesizing.multi_table.run_synthesizing and GeneralConfig / SamplingConfig / MatchingConfig consume the config. Indentation is valid YAML and shouldn’t cause issues.

You do have base_data_dir / results_dir and also repeat the same paths under general_config.{data_dir,test_data_dir,workspace_dir}. Not a blocker, but if this example evolves it might be worth deriving those from a single source to avoid drift.

examples/synthesizing/single_table/config.yaml (1)

1-34: Single‑table config is consistent with the orchestrator; minor duplication

This config lines up with examples.synthesizing.single_table.run_synthesizing and the GeneralConfig / SamplingConfig / MatchingConfig schemas. Paths also match what the README describes.

As with the multi‑table config, base_data_dir / results_dir duplicate the values in general_config.{data_dir,test_data_dir,workspace_dir}. Not urgent, but consider centralizing these to avoid divergence later.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc2ae2d and 3a3814d.

📒 Files selected for processing (15)
  • .gitignore (1 hunks)
  • examples/synthesizing/multi_table/README.md (1 hunks)
  • examples/synthesizing/multi_table/config.yaml (1 hunks)
  • examples/synthesizing/multi_table/run_synthesizing.py (1 hunks)
  • examples/synthesizing/single_table/README.md (1 hunks)
  • examples/synthesizing/single_table/config.yaml (1 hunks)
  • examples/synthesizing/single_table/run_synthesizing.py (1 hunks)
  • examples/training/multi_table/README.md (2 hunks)
  • examples/training/single_table/README.md (2 hunks)
  • examples/training/single_table/config.yaml (1 hunks)
  • src/midst_toolkit/attacks/ensemble/shadow_model_utils.py (2 hunks)
  • src/midst_toolkit/models/clavaddpm/clustering.py (1 hunks)
  • src/midst_toolkit/models/clavaddpm/data_loaders.py (1 hunks)
  • src/midst_toolkit/models/clavaddpm/synthesizer.py (6 hunks)
  • tests/integration/models/clavaddpm/test_synthesizer.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/synthesizing/multi_table/run_synthesizing.py (4)
src/midst_toolkit/common/config.py (3)
  • GeneralConfig (11-18)
  • MatchingConfig (77-83)
  • SamplingConfig (70-74)
src/midst_toolkit/models/clavaddpm/data_loaders.py (1)
  • load_tables (72-128)
src/midst_toolkit/models/clavaddpm/synthesizer.py (1)
  • clava_synthesizing (710-824)
examples/synthesizing/single_table/run_synthesizing.py (1)
  • main (20-68)
examples/synthesizing/single_table/run_synthesizing.py (4)
src/midst_toolkit/common/config.py (3)
  • GeneralConfig (11-18)
  • MatchingConfig (77-83)
  • SamplingConfig (70-74)
src/midst_toolkit/models/clavaddpm/data_loaders.py (1)
  • load_tables (72-128)
src/midst_toolkit/models/clavaddpm/synthesizer.py (1)
  • clava_synthesizing (710-824)
examples/synthesizing/multi_table/run_synthesizing.py (1)
  • main (20-77)
🪛 LanguageTool
examples/synthesizing/multi_table/README.md

[style] ~36-~36: To make your writing clearer, consider a more direct alternative.
Context: ...ase there is a need to run that. Please take a look at them before kicking off the syn...

(TAKE_A_LOOK)

examples/synthesizing/single_table/README.md

[style] ~34-~34: To make your writing clearer, consider a more direct alternative.
Context: ...ase there is a need to run that. Please take a look at them before kicking off the syn...

(TAKE_A_LOOK)

🪛 Ruff (0.14.4)
examples/synthesizing/multi_table/run_synthesizing.py

56-56: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue

(S301)


59-59: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue

(S301)

examples/synthesizing/single_table/run_synthesizing.py

54-54: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue

(S301)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: integration-tests
  • GitHub Check: run-code-check
  • GitHub Check: unit-tests
🔇 Additional comments (16)
src/midst_toolkit/models/clavaddpm/clustering.py (1)

62-63: LGTM! Good defensive fix.

Ensuring the directory exists before writing the pickle file prevents FileNotFoundError when save_dir doesn't exist. The use of parents=True and exist_ok=True is appropriate.
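A minimal sketch of this defensive pattern (the paths and checkpoint payload here are illustrative, not the toolkit's actual ones):

```python
import pickle
import tempfile
from pathlib import Path

# Illustrative: create the save directory (and any missing parents) before
# writing, so pickle.dump cannot fail with FileNotFoundError on a fresh run.
save_dir = Path(tempfile.mkdtemp()) / "workspace" / "clustering"
save_dir.mkdir(parents=True, exist_ok=True)  # idempotent if it already exists

with open(save_dir / "cluster_ckpt.pkl", "wb") as f:
    pickle.dump({"tables": {}, "all_group_lengths_prob_dicts": {}}, f)

print((save_dir / "cluster_ckpt.pkl").exists())  # → True
```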

examples/training/single_table/config.yaml (1)

11-11: LGTM! Reasonable reduction for example code.

Reducing iterations from 200,000 to 20,000 makes the example run faster while still demonstrating the functionality.

.gitignore (1)

50-53: LGTM! Consistent with existing patterns.

The new ignore patterns for synthesizing examples match the structure used for training examples.

src/midst_toolkit/models/clavaddpm/data_loaders.py (1)

99-99: LGTM! Tuples are the right choice for immutable relations.

Converting relation_order entries to tuples ensures they cannot be accidentally modified and better represents the immutable nature of table relationships.

src/midst_toolkit/attacks/ensemble/shadow_model_utils.py (2)

141-151: LGTM! Argument reordering matches updated signature.

The repositioning of all_group_lengths_prob_dicts is consistent with the updated clava_synthesizing signature for multi-table synthesis support.


231-241: LGTM! Consistent argument reordering.

The same signature update is correctly applied here as in the train_tabddpm_and_synthesize function above.

tests/integration/models/clavaddpm/test_synthesizer.py (1)

91-100: LGTM! Test updated for new signature.

The test correctly reflects the updated argument ordering in clava_synthesizing, consistent with changes throughout the codebase.

src/midst_toolkit/models/clavaddpm/synthesizer.py (1)

20-21: All clava_synthesizing call sites properly updated; no positional sample_scale conflicts detected

Verification shows all 5 call sites in the codebase have been correctly updated for the signature change:

  • examples/synthesizing/single_table/run_synthesizing.py: 7 positional args (through matching_config), uses both new parameter defaults
  • examples/synthesizing/multi_table/run_synthesizing.py: 8 positional args including all_group_lengths_prob_dicts, correct placement
  • tests/integration/models/clavaddpm/test_synthesizer.py: 8 positional args including all_group_lengths_prob_dicts, correct placement
  • src/midst_toolkit/attacks/ensemble/shadow_model_utils.py (both calls): 8 positional args with all_group_lengths_prob_dicts positioned correctly, sample_scale passed as keyword argument

No call site passes sample_scale positionally after matching_config, eliminating the risk of accidental argument collision the review comment flagged. The implementation is sound.

examples/synthesizing/single_table/run_synthesizing.py (1)

19-68: Hydra path resolution concern confirmed; pickle warning and grammar nit remain valid

The documentation in the README confirms running "from the project's root folder," but this doesn't address Hydra's default behavior: @hydra.main changes the working directory to hydra.run.dir (typically ./outputs/YYYY-MM-DD/HH-MM-SS). The config uses relative paths like examples/synthesizing/single_table/data without hydra.utils.to_absolute_path() wrapping them, so these paths will fail to resolve after the directory change. To fix:

  • Either wrap config.base_data_dir and config.results_dir with to_absolute_path(), or
  • Set hydra.job.chdir: false in the config to prevent Hydra from changing directories.

Verify the example actually works when run as documented; if it fails to find the data or results directory, the Hydra path fix is required.

The pickle.load() call at line 51 deserves a comment noting that checkpoints must be produced locally by run_training; loading untrusted pickles is unsafe. Line 44: "Found a pre-trained models" should be "Found pre-trained models" (grammar).
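For reference, the second option can be expressed as a small Hydra config override (a sketch assuming Hydra >= 1.2, where this key is available):

```yaml
# Sketch: keep the original working directory so relative paths like
# examples/synthesizing/single_table/data still resolve after @hydra.main.
hydra:
  job:
    chdir: false
```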

examples/synthesizing/multi_table/run_synthesizing.py (7)

1-12: LGTM!

Imports are well-organized and all necessary for the multi-table synthesis workflow.


15-16: LGTM!

Setting the logger to INFO level is appropriate for an example script.


19-33: LGTM!

The function signature and documentation clearly explain the multi-table synthesis pipeline workflow.


51-62: LGTM!

The model and clustering result loading logic is correct. The static analysis warnings about pickle (S301) are false positives—this example script loads trusted local artifacts generated by its own training process.


64-77: LGTM!

The call to clava_synthesizing correctly passes all required parameters for multi-table synthesis, including the all_group_lengths_prob_dicts parameter that was added to support multi-table workflows.


80-81: LGTM!

Standard entry point for a Hydra-based script.


36-62: Verify that relation_order from load_tables() matches the relation structure expected by clustered tables.

The cluster_ckpt dictionary saved to disk contains only "tables" and "all_group_lengths_prob_dicts" (it does not include relation_order). This means the script relies on the assumption that the relation_order loaded from load_tables() at line 36 matches the same order used during clustering in training. If the base data or load_tables() behavior changes between training and synthesis runs, relation_order could become inconsistent with the clustering structure, potentially causing failures. Verify this assumption or add a safeguard to detect such mismatches.

    dataset_meta = json.load(f)

-   relation_order = dataset_meta["relation_order"]
+   relation_order = [tuple(relation) for relation in dataset_meta["relation_order"]]
Collaborator Author:

Converting those to tuples as they should be, since .yaml and .json files have no concept of tuple.
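The round-trip behavior behind this fix can be seen with a toy metadata dict (the key name mirrors the PR; the values are illustrative):

```python
import json

# JSON has no tuple type: tuples are serialized as arrays and come back as
# lists, so they must be converted back on load.
meta = json.loads(json.dumps({"relation_order": [(None, "trans")]}))
print(meta["relation_order"])  # → [[None, 'trans']]

relation_order = [tuple(relation) for relation in meta["relation_order"]]
print(relation_order)  # → [(None, 'trans')]
```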

    The sampled key.
    """
-   assert sum(probabilities.values()) == 1.0, "The sum of all probabilities must be 1.0."
+   assert np.isclose(sum(probabilities.values()), 1), "The sum of all probabilities must be 1."
Collaborator Author:

There was a little precision error here in one of the tables in the multi-table example.
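A toy illustration of the failure mode (the probability values here are made up):

```python
import numpy as np

# Ten equal probabilities of 0.1 do not sum to exactly 1.0 in binary
# floating point, so a strict `== 1.0` assert can fail spuriously.
probabilities = {f"group_{i}": 0.1 for i in range(10)}
total = sum(probabilities.values())

print(total == 1.0)                # → False (total is 0.9999999999999999)
print(bool(np.isclose(total, 1)))  # → True, within default tolerances
```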

    general_config: GeneralConfig,
    sampling_config: SamplingConfig,
    matching_config: MatchingConfig,
+   all_group_lengths_prob_dicts: GroupLengthsProbDicts | None = None,
Collaborator Author:

This is not really required for single-table training, so making it optional.


@emersodb emersodb left a comment


Awesome examples, just a few minor comments and one that is most likely me misunderstanding the requirements for single-table generation.

if all(model_file.exists() for model_file in model_file_paths.values()) and clustering_results_file.exists():
log(INFO, f"Found pre-trained models in {config.results_dir}. Skipping training.")
else:
log(INFO, "No pre-trained models found, training a new model from scratch...")
Collaborator:

Perhaps a more accurate message would be that "Not all required checkpoints were found..." since we need a number to exist to synthesize from?

It also might be helpful to log the ones that are missing?
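One way to act on both suggestions, sketched with hypothetical names (model_file_paths mirrors the example script; log is replaced by print here):

```python
from pathlib import Path


def missing_checkpoints(model_file_paths: dict[str, Path]) -> list[str]:
    # Hypothetical sketch: report exactly which checkpoints are missing
    # rather than logging a blanket "no pre-trained models found" message.
    missing = sorted(name for name, path in model_file_paths.items() if not path.exists())
    if missing:
        print(f"Not all required checkpoints were found; missing: {missing}")
    return missing


print(missing_checkpoints({"trans": Path("does_not_exist/trans_model.pkl")}))  # → ['trans']
```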

base_data_dir: examples/synthesizing/single_table/data
results_dir: examples/synthesizing/single_table/results

diffusion_config:
@emersodb (Collaborator) commented Nov 14, 2025:

Same potential suggestion about a comment here 🙂


tables, relation_order, _ = load_tables(Path(config.base_data_dir))

model_file_paths = {}
Collaborator:

This is going to be a naive question, but if we're doing single table synthesis, what role does relation_order have in synthesis and should we actually be loading multiple pickle files below?

Collaborator:

Perhaps we're treating each table as an individual unrelated table in this synthesis, so we'll still have multiple tables to generate? If so we should also be clear in the readme that we'll still generate multiple tables, they'll just be "independent"

Collaborator Author:

We need those files and their relations because synthesizing works with the table relations as well. If the tables are not related, you can declare the relation as (None, table_name), as is done in the single-table example, but the relationship between the tables is taken into account when synthesizing data for each one of them.

Collaborator:

The relationship between the tables is taken into account even when doing single table synthesis or you're just saying that we need those files to "make it work"?

Collaborator:

Should we assert that the first relation is None somewhere do you think?

Collaborator Author:

We can put that assert in for single table, but for multi table I think that's not necessarily true in all cases.

Collaborator Author:

The relationship between the tables is taken into account even when doing single table synthesis or you're just saying that we need those files to "make it work"?

The relationship is not taken into account for single table. I think I get your point, the config can be made simpler for single table. I'll put a ticket in the backlog so we can track and come back to it later.

Collaborator:

Yeah for multi-table, I don't think the assert is valid.

Yeah, I guess that's what I'm getting at (sorry for the roundabout way of communicating it). Basically, it seems like relationship data isn't leveraged in single table, so I was thinking it could be left out in some way.

Collaborator Author:

It definitely can, we could add some code that would make that relationship from just the table name, for example. This way, it's simpler for the user to set up as they won't need a dataset_meta.json file anymore. I added the ticket:

https://app.clickup.com/t/868gd4w1c

@emersodb (Collaborator) left a comment:

Generally, I think my comments are addressed. Just a touch of confusion for me on the single- vs. multi-table relation requirements. I'll leave it to your judgement how you want to document/address it (if at all).

@lotif lotif merged commit 45f4689 into main Nov 18, 2025
6 checks passed
@lotif lotif deleted the marcelo/synthesizer-example branch November 18, 2025 16:25
