-
Notifications
You must be signed in to change notification settings - Fork 1
Adding synthesizer example #93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
📝 WalkthroughWalkthroughThis pull request introduces multi-table synthesis capabilities to the toolkit alongside documentation and configuration for single-table synthesis. Key changes include: new example scripts ( Estimated code review effort🎯 4 (Complex) | ⏱️ ~65 minutes
Pre-merge checks and finishing touches❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 8
🧹 Nitpick comments (2)
examples/synthesizing/multi_table/config.yaml (1)
1-46: Config shape matches runtime expectations; only minor duplicationThe structure and keys here line up with how
examples.synthesizing.multi_table.run_synthesizingandGeneralConfig/SamplingConfig/MatchingConfigconsume the config. Indentation is valid YAML and shouldn’t cause issues.You do have
base_data_dir/results_dirand also repeat the same paths undergeneral_config.{data_dir,test_data_dir,workspace_dir}. Not a blocker, but if this example evolves it might be worth deriving those from a single source to avoid drift.examples/synthesizing/single_table/config.yaml (1)
1-34: Single‑table config is consistent with the orchestrator; minor duplicationThis config lines up with
examples.synthesizing.single_table.run_synthesizingand theGeneralConfig/SamplingConfig/MatchingConfigschemas. Paths also match what the README describes.As with the multi‑table config,
base_data_dir/results_dirduplicate the values ingeneral_config.{data_dir,test_data_dir,workspace_dir}. Not urgent, but consider centralizing these to avoid divergence later.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
.gitignore(1 hunks)examples/synthesizing/multi_table/README.md(1 hunks)examples/synthesizing/multi_table/config.yaml(1 hunks)examples/synthesizing/multi_table/run_synthesizing.py(1 hunks)examples/synthesizing/single_table/README.md(1 hunks)examples/synthesizing/single_table/config.yaml(1 hunks)examples/synthesizing/single_table/run_synthesizing.py(1 hunks)examples/training/multi_table/README.md(2 hunks)examples/training/single_table/README.md(2 hunks)examples/training/single_table/config.yaml(1 hunks)src/midst_toolkit/attacks/ensemble/shadow_model_utils.py(2 hunks)src/midst_toolkit/models/clavaddpm/clustering.py(1 hunks)src/midst_toolkit/models/clavaddpm/data_loaders.py(1 hunks)src/midst_toolkit/models/clavaddpm/synthesizer.py(6 hunks)tests/integration/models/clavaddpm/test_synthesizer.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/synthesizing/multi_table/run_synthesizing.py (4)
src/midst_toolkit/common/config.py (3)
GeneralConfig(11-18)MatchingConfig(77-83)SamplingConfig(70-74)src/midst_toolkit/models/clavaddpm/data_loaders.py (1)
load_tables(72-128)src/midst_toolkit/models/clavaddpm/synthesizer.py (1)
clava_synthesizing(710-824)examples/synthesizing/single_table/run_synthesizing.py (1)
main(20-68)
examples/synthesizing/single_table/run_synthesizing.py (4)
src/midst_toolkit/common/config.py (3)
GeneralConfig(11-18)MatchingConfig(77-83)SamplingConfig(70-74)src/midst_toolkit/models/clavaddpm/data_loaders.py (1)
load_tables(72-128)src/midst_toolkit/models/clavaddpm/synthesizer.py (1)
clava_synthesizing(710-824)examples/synthesizing/multi_table/run_synthesizing.py (1)
main(20-77)
🪛 LanguageTool
examples/synthesizing/multi_table/README.md
[style] ~36-~36: To make your writing clearer, consider a more direct alternative.
Context: ...ase there is a need to run that. Please take a look at them before kicking off the syn...
(TAKE_A_LOOK)
examples/synthesizing/single_table/README.md
[style] ~34-~34: To make your writing clearer, consider a more direct alternative.
Context: ...ase there is a need to run that. Please take a look at them before kicking off the syn...
(TAKE_A_LOOK)
🪛 Ruff (0.14.4)
examples/synthesizing/multi_table/run_synthesizing.py
56-56: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue
(S301)
59-59: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue
(S301)
examples/synthesizing/single_table/run_synthesizing.py
54-54: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue
(S301)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: integration-tests
- GitHub Check: run-code-check
- GitHub Check: unit-tests
🔇 Additional comments (16)
src/midst_toolkit/models/clavaddpm/clustering.py (1)
62-63: LGTM! Good defensive fix.Ensuring the directory exists before writing the pickle file prevents
FileNotFoundErrorwhensave_dirdoesn't exist. The use ofparents=Trueandexist_ok=Trueis appropriate.examples/training/single_table/config.yaml (1)
11-11: LGTM! Reasonable reduction for example code.Reducing iterations from 200,000 to 20,000 makes the example run faster while still demonstrating the functionality.
.gitignore (1)
50-53: LGTM! Consistent with existing patterns.The new ignore patterns for synthesizing examples match the structure used for training examples.
src/midst_toolkit/models/clavaddpm/data_loaders.py (1)
99-99: LGTM! Tuples are the right choice for immutable relations.Converting relation_order entries to tuples ensures they cannot be accidentally modified and better represents the immutable nature of table relationships.
src/midst_toolkit/attacks/ensemble/shadow_model_utils.py (2)
141-151: LGTM! Argument reordering matches updated signature.The repositioning of
all_group_lengths_prob_dictsis consistent with the updatedclava_synthesizingsignature for multi-table synthesis support.
231-241: LGTM! Consistent argument reordering.The same signature update is correctly applied here as in the
train_tabddpm_and_synthesizefunction above.tests/integration/models/clavaddpm/test_synthesizer.py (1)
91-100: LGTM! Test updated for new signature.The test correctly reflects the updated argument ordering in
clava_synthesizing, consistent with changes throughout the codebase.src/midst_toolkit/models/clavaddpm/synthesizer.py (1)
20-21: All clava_synthesizing call sites properly updated; no positional sample_scale conflicts detectedVerification shows all 5 call sites in the codebase have been correctly updated for the signature change:
examples/synthesizing/single_table/run_synthesizing.py: 7 positional args (throughmatching_config), uses both new parameter defaultsexamples/synthesizing/multi_table/run_synthesizing.py: 8 positional args includingall_group_lengths_prob_dicts, correct placementtests/integration/models/clavaddpm/test_synthesizer.py: 8 positional args includingall_group_lengths_prob_dicts, correct placementsrc/midst_toolkit/attacks/ensemble/shadow_model_utils.py(both calls): 8 positional args withall_group_lengths_prob_dictspositioned correctly,sample_scalepassed as keyword argumentNo call site passes
sample_scalepositionally aftermatching_config, eliminating the risk of accidental argument collision the review comment flagged. The implementation is sound.examples/synthesizing/single_table/run_synthesizing.py (1)
19-68: Hydra path resolution concern confirmed; pickle warning and grammar nit remain validThe documentation in the README confirms running "from the project's root folder," but this doesn't address Hydra's default behavior:
@hydra.mainchanges the working directory tohydra.run.dir(typically./outputs/YYYY-MM-DD/HH-MM-SS). The config uses relative paths likeexamples/synthesizing/single_table/datawithouthydra.utils.to_absolute_path()wrapping them, so these paths will fail to resolve after the directory change. To fix:
- Either wrap
config.base_data_dirandconfig.results_dirwithto_absolute_path(), or- Set
hydra.job.chdir: falsein the config to prevent Hydra from changing directories.Verify the example actually works when run as documented; if it fails to find the data or results directory, the Hydra path fix is required.
The
pickle.load()call at line 51 deserves a comment noting that checkpoints must be produced locally byrun_training; loading untrusted pickles is unsafe. Line 44:"Found a pre-trained models"should be"Found pre-trained models"(grammar).examples/synthesizing/multi_table/run_synthesizing.py (7)
1-12: LGTM!Imports are well-organized and all necessary for the multi-table synthesis workflow.
15-16: LGTM!Setting the logger to INFO level is appropriate for an example script.
19-33: LGTM!The function signature and documentation clearly explain the multi-table synthesis pipeline workflow.
51-62: LGTM!The model and clustering result loading logic is correct. The static analysis warnings about pickle (S301) are false positives—this example script loads trusted local artifacts generated by its own training process.
64-77: LGTM!The call to
clava_synthesizingcorrectly passes all required parameters for multi-table synthesis, including theall_group_lengths_prob_dictsparameter that was added to support multi-table workflows.
80-81: LGTM!Standard entry point for a Hydra-based script.
36-62: Verify thatrelation_orderfromload_tables()matches the relation structure expected by clustered tables.The
cluster_ckptdictionary saved to disk contains only"tables"and"all_group_lengths_prob_dicts"(it does not includerelation_order). This means the script relies on the assumption that therelation_orderloaded fromload_tables()at line 36 matches the same order used during clustering in training. If the base data orload_tables()behavior changes between training and synthesis runs,relation_ordercould become inconsistent with the clustering structure, potentially causing failures. Verify this assumption or add a safeguard to detect such mismatches.
| dataset_meta = json.load(f) | ||
|
|
||
| relation_order = dataset_meta["relation_order"] | ||
| relation_order = [tuple(relation) for relation in dataset_meta["relation_order"]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Converting those to tuples as they should be, since .yaml and .json files have no concept of tuple.
| The sampled key. | ||
| """ | ||
| assert sum(probabilities.values()) == 1.0, "The sum of all probabilities must be 1.0." | ||
| assert np.isclose(sum(probabilities.values()), 1), "The sum of all probabilities must be 1." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a little precision error here in one of the tables in the multi-table example.
| general_config: GeneralConfig, | ||
| sampling_config: SamplingConfig, | ||
| matching_config: MatchingConfig, | ||
| all_group_lengths_prob_dicts: GroupLengthsProbDicts | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not really required for single-table training, so making it optional.
emersodb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome examples, just a few minor comments and one that is most likely me misunderstanding the requirements for single-table generaton.
| if all(model_file.exists() for model_file in model_file_paths.values()) and clustering_results_file.exists(): | ||
| log(INFO, f"Found pre-trained models in {config.results_dir}. Skipping training.") | ||
| else: | ||
| log(INFO, "No pre-trained models found, training a new model from scratch...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps a more accurate message would be that "Not all required checkpoints were found..." since we need a number to exist to synthesize from?
It also might be helpful to log the ones that are missing?
| base_data_dir: examples/synthesizing/single_table/data | ||
| results_dir: examples/synthesizing/single_table/results | ||
|
|
||
| diffusion_config: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same potential suggestion about a comment here 🙂
|
|
||
| tables, relation_order, _ = load_tables(Path(config.base_data_dir)) | ||
|
|
||
| model_file_paths = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to be a naive question, but if we're doing single table synthesis, what role does relation_order have in synthesis and should we actually be loading multiple pickle files below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we're treating each table as an individual unrelated table in this synthesis, so we'll still have multiple tables to generate? If so we should also be clear in the readme that we'll still generate multiple tables, they'll just be "independent"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need those files and their relations as synthesizing works with the table relations as well. If the tables are not related, you can declare the relation as (None, table_name) as it's being done on single table, but the relationship between the tables is taken into account when synthesizing data for each one of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The relationship between the tables is taken into account even when doing single table synthesis or you're just saying that we need those files to "make it work"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we assert that the first relation is None somewhere do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can put that assert for single table, but multi table I think that's not necessarily true for all cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The relationship between the tables is taken into account even when doing single table synthesis or you're just saying that we need those files to "make it work"?
The relationship is not taken into account for single table. I think I get your point, the config can be made simpler for single table. I'll put a ticket in the backlog so we can track and come back to it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah for multi-table, I don't think the assert is valid.
Yeah, I guess that's what I'm getting at (sorry for the round-about way of communicating it). Basically, it seems like relationship data isn't leveraged in single table, so was thinking it could be left out in some way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It definitely can, we could add some code that would make that relationship from just the table name, for example. This way, it's simpler for the user to set up as they won't need a dataset_meta.json file anymore. I added the ticket:
emersodb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, I think my comments are address. Just a touch of confusion for me on the single vs. multi-table relation requirements. I'll leave it to your judgement how you want to document/address it (if at all).
PR Type
Feature
Short Description
Clickup Ticket(s): https://app.clickup.com/t/868g5jctq
Adding examples for synthesizing single-table and multi-table data.
Also, fixing a couple of bugs that appeared while testing the examples.
Tests Added
NA