
Conversation


@lotif lotif commented Nov 13, 2025

PR Type

Feature

Short Description

Clickup Ticket(s): https://app.clickup.com/t/868g5jctq

Adding examples for synthesizing single-table and multi-table data.

Also, fixing a couple of bugs that appeared while testing the examples.

Tests Added

NA


coderabbitai bot commented Nov 13, 2025

📝 Walkthrough


This pull request introduces multi-table synthesis capabilities to the toolkit, alongside documentation and configuration for single-table synthesis. Key changes:

  • New example scripts (run_synthesizing.py) and configuration files for both single- and multi-table synthesis workflows.
  • Updates to the core synthesis logic in clava_synthesizing to accept an optional all_group_lengths_prob_dicts parameter.
  • Minor logic improvements: floating-point tolerance in validation, directory creation before file writes, and relation-order tuple conversion.
  • Documentation formatting updates.
  • Consistent reordering of function arguments across call sites to reflect the new parameter position.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

  • Argument positioning changes: The all_group_lengths_prob_dicts parameter is repositioned across multiple call sites (shadow_model_utils.py, test_synthesizer.py, synthesizer.py signature). Verify consistency and correctness across all invocations to prevent runtime mismatches.
  • Core synthesizer modifications: Changes to clava_synthesizing signature and logic (new optional parameter, floating-point tolerance relaxation, NO_PARENT_COLUMN_NAME placeholder handling) require careful verification against both single-table and multi-table code paths.
  • New public entry points: Two new main(config: DictConfig) functions in the example scripts need review for proper workflow orchestration and error handling.
  • Multi-table assertion logic: The runtime assertion that all_group_lengths_prob_dicts is not None for multi-table synthesis must be validated to ensure correct branching behavior.
  • Heterogeneous changes across layers: Logic changes span example orchestration, core modeling, and test files, requiring context switching during review.

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'Adding synthesizer example' is partially related but overly broad and vague. It does not clearly convey the scope (single-table and multi-table examples) or that bug fixes are included. Consider a more specific title like 'Add single-table and multi-table synthesizing examples' to better reflect the comprehensive scope of changes.
✅ Passed checks (2 passed)
Check name Status Explanation
Description check ✅ Passed The description includes all required template sections (PR Type, Short Description, Tests Added) with appropriate detail about the examples being added and bugs fixed.
Docstring Coverage ✅ Passed Docstring coverage is 88.89% which is sufficient. The required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (2)
examples/synthesizing/multi_table/config.yaml (1)

1-46: Config shape matches runtime expectations; only minor duplication

The structure and keys here line up with how examples.synthesizing.multi_table.run_synthesizing and GeneralConfig / SamplingConfig / MatchingConfig consume the config. Indentation is valid YAML and shouldn’t cause issues.

You do have base_data_dir / results_dir and also repeat the same paths under general_config.{data_dir,test_data_dir,workspace_dir}. Not a blocker, but if this example evolves it might be worth deriving those from a single source to avoid drift.

examples/synthesizing/single_table/config.yaml (1)

1-34: Single‑table config is consistent with the orchestrator; minor duplication

This config lines up with examples.synthesizing.single_table.run_synthesizing and the GeneralConfig / SamplingConfig / MatchingConfig schemas. Paths also match what the README describes.

As with the multi‑table config, base_data_dir / results_dir duplicate the values in general_config.{data_dir,test_data_dir,workspace_dir}. Not urgent, but consider centralizing these to avoid divergence later.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc2ae2d and 3a3814d.

📒 Files selected for processing (15)
  • .gitignore (1 hunks)
  • examples/synthesizing/multi_table/README.md (1 hunks)
  • examples/synthesizing/multi_table/config.yaml (1 hunks)
  • examples/synthesizing/multi_table/run_synthesizing.py (1 hunks)
  • examples/synthesizing/single_table/README.md (1 hunks)
  • examples/synthesizing/single_table/config.yaml (1 hunks)
  • examples/synthesizing/single_table/run_synthesizing.py (1 hunks)
  • examples/training/multi_table/README.md (2 hunks)
  • examples/training/single_table/README.md (2 hunks)
  • examples/training/single_table/config.yaml (1 hunks)
  • src/midst_toolkit/attacks/ensemble/shadow_model_utils.py (2 hunks)
  • src/midst_toolkit/models/clavaddpm/clustering.py (1 hunks)
  • src/midst_toolkit/models/clavaddpm/data_loaders.py (1 hunks)
  • src/midst_toolkit/models/clavaddpm/synthesizer.py (6 hunks)
  • tests/integration/models/clavaddpm/test_synthesizer.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/synthesizing/multi_table/run_synthesizing.py (4)
src/midst_toolkit/common/config.py (3)
  • GeneralConfig (11-18)
  • MatchingConfig (77-83)
  • SamplingConfig (70-74)
src/midst_toolkit/models/clavaddpm/data_loaders.py (1)
  • load_tables (72-128)
src/midst_toolkit/models/clavaddpm/synthesizer.py (1)
  • clava_synthesizing (710-824)
examples/synthesizing/single_table/run_synthesizing.py (1)
  • main (20-68)
examples/synthesizing/single_table/run_synthesizing.py (4)
src/midst_toolkit/common/config.py (3)
  • GeneralConfig (11-18)
  • MatchingConfig (77-83)
  • SamplingConfig (70-74)
src/midst_toolkit/models/clavaddpm/data_loaders.py (1)
  • load_tables (72-128)
src/midst_toolkit/models/clavaddpm/synthesizer.py (1)
  • clava_synthesizing (710-824)
examples/synthesizing/multi_table/run_synthesizing.py (1)
  • main (20-77)
🪛 LanguageTool
examples/synthesizing/multi_table/README.md

[style] ~36-~36: To make your writing clearer, consider a more direct alternative.
Context: ...ase there is a need to run that. Please take a look at them before kicking off the syn...

(TAKE_A_LOOK)

examples/synthesizing/single_table/README.md

[style] ~34-~34: To make your writing clearer, consider a more direct alternative.
Context: ...ase there is a need to run that. Please take a look at them before kicking off the syn...

(TAKE_A_LOOK)

🪛 Ruff (0.14.4)
examples/synthesizing/multi_table/run_synthesizing.py

56-56: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue

(S301)


59-59: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue

(S301)

examples/synthesizing/single_table/run_synthesizing.py

54-54: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue

(S301)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: integration-tests
  • GitHub Check: run-code-check
  • GitHub Check: unit-tests
🔇 Additional comments (16)
src/midst_toolkit/models/clavaddpm/clustering.py (1)

62-63: LGTM! Good defensive fix.

Ensuring the directory exists before writing the pickle file prevents FileNotFoundError when save_dir doesn't exist. The use of parents=True and exist_ok=True is appropriate.
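A minimal sketch of this defensive pattern (the paths and checkpoint payload here are illustrative, not the toolkit's actual ones):

```python
import pickle
import tempfile
from pathlib import Path

# Illustrative: create the save directory (and any missing parents) before
# writing, so pickle.dump cannot fail with FileNotFoundError on a fresh run.
save_dir = Path(tempfile.mkdtemp()) / "workspace" / "clustering"
save_dir.mkdir(parents=True, exist_ok=True)  # idempotent if it already exists

with open(save_dir / "cluster_ckpt.pkl", "wb") as f:
    pickle.dump({"tables": {}, "all_group_lengths_prob_dicts": {}}, f)

print((save_dir / "cluster_ckpt.pkl").exists())  # → True
```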

examples/training/single_table/config.yaml (1)

11-11: LGTM! Reasonable reduction for example code.

Reducing iterations from 200,000 to 20,000 makes the example run faster while still demonstrating the functionality.

.gitignore (1)

50-53: LGTM! Consistent with existing patterns.

The new ignore patterns for synthesizing examples match the structure used for training examples.

src/midst_toolkit/models/clavaddpm/data_loaders.py (1)

99-99: LGTM! Tuples are the right choice for immutable relations.

Converting relation_order entries to tuples ensures they cannot be accidentally modified and better represents the immutable nature of table relationships.

src/midst_toolkit/attacks/ensemble/shadow_model_utils.py (2)

141-151: LGTM! Argument reordering matches updated signature.

The repositioning of all_group_lengths_prob_dicts is consistent with the updated clava_synthesizing signature for multi-table synthesis support.


231-241: LGTM! Consistent argument reordering.

The same signature update is correctly applied here as in the train_tabddpm_and_synthesize function above.

tests/integration/models/clavaddpm/test_synthesizer.py (1)

91-100: LGTM! Test updated for new signature.

The test correctly reflects the updated argument ordering in clava_synthesizing, consistent with changes throughout the codebase.

src/midst_toolkit/models/clavaddpm/synthesizer.py (1)

20-21: All clava_synthesizing call sites properly updated; no positional sample_scale conflicts detected

Verification shows all 5 call sites in the codebase have been correctly updated for the signature change:

  • examples/synthesizing/single_table/run_synthesizing.py: 7 positional args (through matching_config), uses both new parameter defaults
  • examples/synthesizing/multi_table/run_synthesizing.py: 8 positional args including all_group_lengths_prob_dicts, correct placement
  • tests/integration/models/clavaddpm/test_synthesizer.py: 8 positional args including all_group_lengths_prob_dicts, correct placement
  • src/midst_toolkit/attacks/ensemble/shadow_model_utils.py (both calls): 8 positional args with all_group_lengths_prob_dicts positioned correctly, sample_scale passed as keyword argument

No call site passes sample_scale positionally after matching_config, eliminating the risk of accidental argument collision the review comment flagged. The implementation is sound.

examples/synthesizing/single_table/run_synthesizing.py (1)

19-68: Hydra path resolution concern confirmed; pickle warning and grammar nit remain valid

The documentation in the README confirms running "from the project's root folder," but this doesn't address Hydra's default behavior: @hydra.main changes the working directory to hydra.run.dir (typically ./outputs/YYYY-MM-DD/HH-MM-SS). The config uses relative paths like examples/synthesizing/single_table/data without hydra.utils.to_absolute_path() wrapping them, so these paths will fail to resolve after the directory change. To fix:

  • Either wrap config.base_data_dir and config.results_dir with to_absolute_path(), or
  • Set hydra.job.chdir: false in the config to prevent Hydra from changing directories.

Verify the example actually works when run as documented; if it fails to find the data or results directory, the Hydra path fix is required.

The pickle.load() call at line 51 deserves a comment noting that checkpoints must be produced locally by run_training; loading untrusted pickles is unsafe. Line 44: "Found a pre-trained models" should be "Found pre-trained models" (grammar).
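For reference, the second option can be expressed as a small Hydra config override (a sketch assuming Hydra >= 1.2, where this key is available):

```yaml
# Sketch: keep the original working directory so relative paths like
# examples/synthesizing/single_table/data still resolve after @hydra.main.
hydra:
  job:
    chdir: false
```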

examples/synthesizing/multi_table/run_synthesizing.py (7)

1-12: LGTM!

Imports are well-organized and all necessary for the multi-table synthesis workflow.


15-16: LGTM!

Setting the logger to INFO level is appropriate for an example script.


19-33: LGTM!

The function signature and documentation clearly explain the multi-table synthesis pipeline workflow.


51-62: LGTM!

The model and clustering result loading logic is correct. The static analysis warnings about pickle (S301) are false positives—this example script loads trusted local artifacts generated by its own training process.


64-77: LGTM!

The call to clava_synthesizing correctly passes all required parameters for multi-table synthesis, including the all_group_lengths_prob_dicts parameter that was added to support multi-table workflows.


80-81: LGTM!

Standard entry point for a Hydra-based script.


36-62: Verify that relation_order from load_tables() matches the relation structure expected by clustered tables.

The cluster_ckpt dictionary saved to disk contains only "tables" and "all_group_lengths_prob_dicts" (it does not include relation_order). This means the script relies on the assumption that the relation_order loaded from load_tables() at line 36 matches the same order used during clustering in training. If the base data or load_tables() behavior changes between training and synthesis runs, relation_order could become inconsistent with the clustering structure, potentially causing failures. Verify this assumption or add a safeguard to detect such mismatches.

    dataset_meta = json.load(f)

-   relation_order = dataset_meta["relation_order"]
+   relation_order = [tuple(relation) for relation in dataset_meta["relation_order"]]
Collaborator Author:

Converting those to tuples as they should be, since .yaml and .json files have no concept of tuple.
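The round-trip behavior behind this fix can be seen with a toy metadata dict (the key name mirrors the PR; the values are illustrative):

```python
import json

# JSON has no tuple type: tuples are serialized as arrays and come back as
# lists, so they must be converted back on load.
meta = json.loads(json.dumps({"relation_order": [(None, "trans")]}))
print(meta["relation_order"])  # → [[None, 'trans']]

relation_order = [tuple(relation) for relation in meta["relation_order"]]
print(relation_order)  # → [(None, 'trans')]
```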

    The sampled key.
    """
-   assert sum(probabilities.values()) == 1.0, "The sum of all probabilities must be 1.0."
+   assert np.isclose(sum(probabilities.values()), 1), "The sum of all probabilities must be 1."
Collaborator Author:

There was a little precision error here in one of the tables in the multi-table example.
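A toy illustration of the failure mode (the probability values here are made up):

```python
import numpy as np

# Ten equal probabilities of 0.1 do not sum to exactly 1.0 in binary
# floating point, so a strict `== 1.0` assert can fail spuriously.
probabilities = {f"group_{i}": 0.1 for i in range(10)}
total = sum(probabilities.values())

print(total == 1.0)                # → False (total is 0.9999999999999999)
print(bool(np.isclose(total, 1)))  # → True, within default tolerances
```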

    general_config: GeneralConfig,
    sampling_config: SamplingConfig,
    matching_config: MatchingConfig,
+   all_group_lengths_prob_dicts: GroupLengthsProbDicts | None = None,
Collaborator Author:

This is not really required for single-table training, so making it optional.


@emersodb emersodb left a comment


Awesome examples, just a few minor comments and one that is most likely me misunderstanding the requirements for single-table generation.

if all(model_file.exists() for model_file in model_file_paths.values()) and clustering_results_file.exists():
log(INFO, f"Found pre-trained models in {config.results_dir}. Skipping training.")
else:
log(INFO, "No pre-trained models found, training a new model from scratch...")
Collaborator:

Perhaps a more accurate message would be that "Not all required checkpoints were found..." since we need a number to exist to synthesize from?

It also might be helpful to log the ones that are missing?
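One way to act on both suggestions, sketched with hypothetical names (model_file_paths mirrors the example script; log is replaced by print here):

```python
from pathlib import Path


def missing_checkpoints(model_file_paths: dict[str, Path]) -> list[str]:
    # Hypothetical sketch: report exactly which checkpoints are missing
    # rather than logging a blanket "no pre-trained models found" message.
    missing = sorted(name for name, path in model_file_paths.items() if not path.exists())
    if missing:
        print(f"Not all required checkpoints were found; missing: {missing}")
    return missing


print(missing_checkpoints({"trans": Path("does_not_exist/trans_model.pkl")}))  # → ['trans']
```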

base_data_dir: examples/synthesizing/single_table/data
results_dir: examples/synthesizing/single_table/results

diffusion_config:
@emersodb (Collaborator) commented Nov 14, 2025:

Same potential suggestion about a comment here 🙂


tables, relation_order, _ = load_tables(Path(config.base_data_dir))

model_file_paths = {}
Collaborator:

This is going to be a naive question, but if we're doing single table synthesis, what role does relation_order have in synthesis and should we actually be loading multiple pickle files below?

Collaborator:

Perhaps we're treating each table as an individual unrelated table in this synthesis, so we'll still have multiple tables to generate? If so we should also be clear in the readme that we'll still generate multiple tables, they'll just be "independent"

Collaborator Author:

We need those files and their relations because synthesizing works with the table relations as well. If the tables are not related, you can declare the relation as (None, table_name), as is done in the single-table example, but the relationship between the tables is taken into account when synthesizing data for each one of them.

Collaborator:

The relationship between the tables is taken into account even when doing single table synthesis or you're just saying that we need those files to "make it work"?

Collaborator:

Should we assert that the first relation is None somewhere do you think?

Collaborator Author:

We can put that assert in for single table, but for multi table I think that's not necessarily true in all cases.

Collaborator Author:

The relationship between the tables is taken into account even when doing single table synthesis or you're just saying that we need those files to "make it work"?

The relationship is not taken into account for single table. I think I get your point, the config can be made simpler for single table. I'll put a ticket in the backlog so we can track and come back to it later.

Collaborator:

Yeah for multi-table, I don't think the assert is valid.

Yeah, I guess that's what I'm getting at (sorry for the roundabout way of communicating it). Basically, it seems like relationship data isn't leveraged in single table, so I was thinking it could be left out in some way.

Collaborator Author:

It definitely can, we could add some code that would make that relationship from just the table name, for example. This way, it's simpler for the user to set up as they won't need a dataset_meta.json file anymore. I added the ticket:

https://app.clickup.com/t/868gd4w1c

@emersodb (Collaborator) left a comment:

Generally, I think my comments are addressed. Just a touch of confusion for me on the single- vs. multi-table relation requirements. I'll leave it to your judgement how you want to document/address it (if at all).

@lotif lotif merged commit 45f4689 into main Nov 18, 2025
6 checks passed
@lotif lotif deleted the marcelo/synthesizer-example branch November 18, 2025 16:25
