Adding synthesizer example #93
Changes from 22 commits
@@ -0,0 +1,60 @@

# Multi-Table Synthesizing Example

This example will go over synthesizing data for a multi-table dataset from the ground
up using the code in this toolkit.

## Downloading data

First, we need the data. Download it from this
[Google Drive link](https://drive.google.com/file/d/1Ao222l4AJjG54-HDEGCWkIfzRbl9_IKa/view?usp=drive_link),
extract the files and place them in a `/data` folder within this folder
(`examples/synthesizing/multi_table`).

> [!NOTE]
> If you wish to change the data folder, you can do so by editing the `base_data_dir` attribute
> of the [`config.yaml`](config.yaml) file.

It will contain data for 8 tables: `account`, `card`, `client`, `disp`, `district`, `loan`, `order`,
and `trans`. For each table, there will be two files:
- `{table_name}.csv`: The table's data.
- `{table_name}_domain.json`: Metadata about the columns in the table's data, such as data types and sizes.

Additionally, you will find one more file:
- `dataset_meta.json`: Metadata about the relationships between the tables. It describes which tables
  are associated with which other tables (see the sketch below for a quick way to inspect it).
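If you want to sanity-check the download before running anything, a minimal sketch like the one below can be used to peek at the files. It assumes the data was extracted to the default `examples/synthesizing/multi_table/data` folder and that the metadata files are plain JSON; the exact keys inside them are not documented here, so they are simply printed as-is.

```python
import json
from pathlib import Path

# Assumed default data folder for this example; adjust if you changed base_data_dir.
data_dir = Path("examples/synthesizing/multi_table/data")

# List the per-table CSV files and check that each has a matching domain file.
for csv_file in sorted(data_dir.glob("*.csv")):
    domain_file = data_dir / f"{csv_file.stem}_domain.json"
    print(csv_file.name, "| domain file present:", domain_file.exists())

# Dump the table-relationship metadata. Its exact structure is not assumed here,
# so we just print whatever the file contains.
with open(data_dir / "dataset_meta.json") as f:
    print(json.dumps(json.load(f), indent=2))
```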

## Kicking off synthesizing

If there is a `/results` folder within this folder (`examples/synthesizing/multi_table`)
from a previous training run, we will use that data to kick off synthesizing.
For example, you can copy the results from another run (e.g. `examples.training.multi_table.run_training`)
and paste them here, and they will be picked up by this example.

The [`config.yaml`](config.yaml) file contains the parameters for synthesizing and also
for training, in case there is a need to run that. Please take a look at them before kicking
off the synthesizing process and edit them as necessary.

To kick off synthesizing, simply run the command below from the project's root folder:

```bash
python -m examples.synthesizing.multi_table.run_synthesizing
```
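Because the entry point uses Hydra, it should also be possible to override individual configuration values directly on the command line (for example, appending `results_dir=/some/other/path` to the command above), as the comment in [`config.yaml`](config.yaml) suggests for `base_data_dir`. Editing the file itself remains the workflow this example describes.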

## Results

It will save the result files inside a `/results` folder within this folder
(`examples/synthesizing/multi_table`).

> [!NOTE]
> If you wish to change the save folder, you can do so by editing the `results_dir` attribute
> of the [`config.yaml`](config.yaml) file.

In the `/results/before_matching/` folder, there will be a file called `synthetic_tables.pkl`,
which is a pickle file containing the synthetic data before the matching process, in case
it's needed.
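If you do need that intermediate output, a minimal sketch along these lines should load it. The exact type of the unpickled object (for example, a mapping keyed by table or relation) is an assumption and may differ in practice.

```python
import pickle
from pathlib import Path

results_dir = Path("examples/synthesizing/multi_table/results")

# Load the pre-matching synthetic data and report what it contains.
with open(results_dir / "before_matching" / "synthetic_tables.pkl", "rb") as f:
    synthetic_tables = pickle.load(f)

print(type(synthetic_tables))
# If it is a mapping, list its keys to see how the synthetic tables are organized.
if hasattr(synthetic_tables, "keys"):
    print(list(synthetic_tables.keys()))
```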

The `/results/single_table_synthesizing` folder will contain the final synthesized
data, organized per table. For example, the synthetic `trans` table will be saved under
`/results/single_table_synthesizing/trans/_final/trans_synthetic.csv`.

@@ -0,0 +1,46 @@

# Training example configuration
# Base data directory (can be overridden from command line)
base_data_dir: examples/synthesizing/multi_table/data
results_dir: examples/synthesizing/multi_table/results

diffusion_config:
  d_layers: [512, 1024, 1024, 1024, 1024, 512]
  dropout: 0.0
  num_timesteps: 2000
  model_type: mlp
  iterations: 20000
  batch_size: 4096
  lr: 0.0006
  gaussian_loss_type: mse
  weight_decay: 1e-05
  scheduler: cosine
  data_split_ratios: [0.99, 0.005, 0.005]

clustering_config:
  parent_scale: 1.0
  num_clusters: 50
  clustering_method: kmeans_and_gmm

classifier_config:
  d_layers: [128, 256, 512, 1024, 512, 256, 128]
  lr: 0.0001
  dim_t: 128
  batch_size: 4096
  iterations: 20000

general_config:
  data_dir: examples/synthesizing/multi_table/data
  test_data_dir: examples/synthesizing/multi_table/data
  exp_name: multi_table_synthesizing
  workspace_dir: examples/synthesizing/multi_table/results
  sample_prefix: ""

sampling_config:
  batch_size: 20000
  classifier_scale: 1.0

matching_config:
  num_matching_clusters: 1
  matching_batch_size: 1000
  unique_matching: True
  no_matching: False
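To see how these sections are consumed, here is a minimal sketch of the pattern the example's run script uses, reproduced by hand. It assumes the toolkit's config dataclasses accept exactly the keys listed above (as `run_synthesizing.py`, shown further below, implies when it unpacks these sections) and that the file is loaded with OmegaConf; it is an illustration, not part of the example itself.

```python
from omegaconf import OmegaConf

from midst_toolkit.common.config import GeneralConfig, MatchingConfig, SamplingConfig

# Load the example configuration directly, mirroring what Hydra does for the run script.
config = OmegaConf.load("examples/synthesizing/multi_table/config.yaml")

# Each top-level section maps onto one of the toolkit's config dataclasses.
general_config = GeneralConfig(**config.general_config)
sampling_config = SamplingConfig(**config.sampling_config)
matching_config = MatchingConfig(**config.matching_config)

print(general_config, sampling_config, matching_config, sep="\n")
```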

@@ -0,0 +1,81 @@

import pickle
from logging import INFO
from pathlib import Path

import hydra
from omegaconf import DictConfig

from examples.training.multi_table import run_training
from midst_toolkit.common.config import GeneralConfig, MatchingConfig, SamplingConfig
from midst_toolkit.common.logger import TOOLKIT_LOGGER, log
from midst_toolkit.models.clavaddpm.data_loaders import load_tables
from midst_toolkit.models.clavaddpm.synthesizer import clava_synthesizing


# Preventing some excessive logging
TOOLKIT_LOGGER.setLevel(INFO)


@hydra.main(config_path=".", config_name="config", version_base=None)
def main(config: DictConfig) -> None:
    """
    Run the synthesizing pipeline for a multi-table diffusion model.

    It will load the config and then data from the `config.base_data_dir` folder,
    train the model, synthesize the data and save the results in the
    `config.results_dir` folder.

    It will first look for a pre-trained model in the `config.results_dir` folder.
    If it doesn't find one, it will train a new model from scratch.

    Args:
        config: Training and synthesizing configuration as an OmegaConf DictConfig object.
    """
    log(INFO, f"Checking for a pre-trained model in {config.results_dir}...")

    _, relation_order, _ = load_tables(Path(config.base_data_dir))

    model_file_paths = {}
    for relation in relation_order:
        model_file_path = Path(config.results_dir) / "models" / f"{relation[0]}_{relation[1]}_ckpt.pkl"
        model_file_paths[relation] = model_file_path

    clustering_results_file = Path(config.results_dir) / "cluster_ckpt.pkl"

    if all(model_file.exists() for model_file in model_file_paths.values()) and clustering_results_file.exists():
        log(INFO, f"Found pre-trained models in {config.results_dir}. Skipping training.")
    else:
        log(INFO, "No pre-trained models found, training a new model from scratch...")
        run_training.main(config)

    log(INFO, "Loading models...")

    models = {}
    for relation in relation_order:
        with open(model_file_paths[relation], "rb") as f:
            models[relation] = pickle.load(f)

    with open(clustering_results_file, "rb") as f:
        clustering_result = pickle.load(f)

    tables = clustering_result["tables"]
    all_group_lengths_prob_dicts = clustering_result["all_group_lengths_prob_dicts"]

    log(INFO, "Synthesizing data...")

    clava_synthesizing(
        tables,
        relation_order,
        Path(config.results_dir),
        models,
        GeneralConfig(**config.general_config),
        SamplingConfig(**config.sampling_config),
        MatchingConfig(**config.matching_config),
        all_group_lengths_prob_dicts,
    )

    log(INFO, "Data synthesized successfully.")


if __name__ == "__main__":
    main()
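As a quick way to see which checkpoint files the script above expects before running it, something like the following sketch could be used. It only relies on `load_tables` returning a relation order whose elements can be indexed as `relation[0]` and `relation[1]`, exactly as the script itself assumes; the printed paths are the ones the script would look for.

```python
from pathlib import Path

from midst_toolkit.models.clavaddpm.data_loaders import load_tables

data_dir = Path("examples/synthesizing/multi_table/data")
results_dir = Path("examples/synthesizing/multi_table/results")

# load_tables returns the tables, the relation order, and a third value this sketch does not use.
_, relation_order, _ = load_tables(data_dir)

# For each relation, the run script looks for a pickled checkpoint named after the pair.
for relation in relation_order:
    checkpoint = results_dir / "models" / f"{relation[0]}_{relation[1]}_ckpt.pkl"
    print(checkpoint, "exists:", checkpoint.exists())

# The clustering checkpoint is also required before training can be skipped.
clustering_checkpoint = results_dir / "cluster_ckpt.pkl"
print(clustering_checkpoint, "exists:", clustering_checkpoint.exists())
```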

@@ -0,0 +1,58 @@

# Single-Table Synthesizing Example

This example will go over synthesizing data for a single-table dataset from the ground
up using the code in this toolkit.

## Downloading data

First, we need the data. Download it from this
[Google Drive link](https://drive.google.com/file/d/1J5qDuMHHg4dm9c3ISmb41tcTHSu1SVUC/view?usp=drive_link),
extract the files and place them in a `/data` folder within this folder
(`examples/synthesizing/single_table`).

> [!NOTE]
> If you wish to change the data folder, you can do so by editing the `base_data_dir` attribute
> of the [`config.yaml`](config.yaml) file.

Here is a description of the files that have been extracted (see the sketch below for a quick way to inspect them):
- `trans.csv`: The training data. It consists of information about bank transactions and
  contains 20,000 data points.
- `trans_domain.json`: Metadata about the columns in `trans.csv`, such as data types and sizes.
- `dataset_meta.json`: Metadata about the relationships between the tables. Since this is a
  single-table example, it will only contain information about the `trans` table.
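A minimal sketch to confirm the files landed where this example expects them, assuming the default `examples/synthesizing/single_table/data` folder and that pandas is available in your environment; the keys inside `trans_domain.json` are printed as-is rather than assumed.

```python
import json
from pathlib import Path

import pandas as pd  # assumed to be installed alongside the toolkit

data_dir = Path("examples/synthesizing/single_table/data")

# The training data: should contain 20,000 bank transaction records.
trans = pd.read_csv(data_dir / "trans.csv")
print(trans.shape)
print(trans.head())

# Column metadata for trans.csv; its structure is dumped as-is.
with open(data_dir / "trans_domain.json") as f:
    print(json.dumps(json.load(f), indent=2))
```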

## Kicking off synthesizing

If there is a `/results` folder within this folder (`examples/synthesizing/single_table`)
from a previous training run, we will use that data to kick off synthesizing.
For example, you can copy the results from another run (e.g. `examples.training.single_table.run_training`)
and paste them here, and they will be picked up by this example.

The [`config.yaml`](config.yaml) file contains the parameters for synthesizing and also
for training, in case there is a need to run that. Please take a look at them before kicking
off the synthesizing process and edit them as necessary.

To kick off synthesizing, simply run the command below from the project's root folder:

```bash
python -m examples.synthesizing.single_table.run_synthesizing
```

## Results

It will save the result files inside a `/results` folder within this folder
(`examples/synthesizing/single_table`).

> [!NOTE]
> If you wish to change the save folder, you can do so by editing the `results_dir` attribute
> of the [`config.yaml`](config.yaml) file.

In the `/results/before_matching/` folder, there will be a file called `synthetic_tables.pkl`,
which is a pickle file containing the synthetic data before the matching process, in case
it's needed.

The `/results/single_table_synthesizing` folder will contain the final synthesized
data, organized per table. In this single-table example, there is only going to be one
synthesized table under `/results/single_table_synthesizing/trans/_final/trans_synthetic.csv`.
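Once the run finishes, a quick comparison like the one below can confirm that the synthetic table mirrors the shape of the real one. It assumes pandas is available and that the default paths above were not changed; it is a sanity check, not part of the example.

```python
from pathlib import Path

import pandas as pd  # assumed to be installed alongside the toolkit

example_dir = Path("examples/synthesizing/single_table")

# Real training data versus the final synthetic output described above.
real = pd.read_csv(example_dir / "data" / "trans.csv")
synthetic = pd.read_csv(
    example_dir / "results" / "single_table_synthesizing" / "trans" / "_final" / "trans_synthetic.csv"
)

print("real shape:", real.shape)
print("synthetic shape:", synthetic.shape)
print("columns match:", set(real.columns) == set(synthetic.columns))
```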

@@ -0,0 +1,34 @@

# Training example configuration
# Base data directory (can be overridden from command line)
base_data_dir: examples/synthesizing/single_table/data
results_dir: examples/synthesizing/single_table/results

diffusion_config:
  d_layers: [512, 1024, 1024, 1024, 1024, 512]
  dropout: 0.0
  num_timesteps: 2000
  model_type: mlp
  iterations: 20000
  batch_size: 4096
  lr: 0.0006
  gaussian_loss_type: mse
  weight_decay: 1e-05
  scheduler: cosine
  data_split_ratios: [0.99, 0.005, 0.005]

general_config:
  data_dir: examples/synthesizing/single_table/data
  test_data_dir: examples/synthesizing/single_table/data
  exp_name: single_table_synthesizing
  workspace_dir: examples/synthesizing/single_table/results
  sample_prefix: ""

sampling_config:
  batch_size: 20000
  classifier_scale: 1.0

matching_config:
  num_matching_clusters: 1
  matching_batch_size: 1000
  unique_matching: True
  no_matching: False
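One notable difference from the multi-table configuration is that this file has no `clustering_config` or `classifier_config` section, which are presumably only needed when parent and child tables have to be linked. A small sketch like the following, assuming both example configs are present in the repository, shows that structural difference directly.

```python
from omegaconf import OmegaConf

multi = OmegaConf.load("examples/synthesizing/multi_table/config.yaml")
single = OmegaConf.load("examples/synthesizing/single_table/config.yaml")

# Top-level sections present only in the multi-table example configuration.
print(set(multi.keys()) - set(single.keys()))
# Expected (based on the files shown above): {'clustering_config', 'classifier_config'}
```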

@@ -0,0 +1,72 @@

import pickle
from logging import INFO
from pathlib import Path

import hydra
from omegaconf import DictConfig

from examples.training.single_table import run_training
from midst_toolkit.common.config import GeneralConfig, MatchingConfig, SamplingConfig
from midst_toolkit.common.logger import TOOLKIT_LOGGER, log
from midst_toolkit.models.clavaddpm.data_loaders import load_tables
from midst_toolkit.models.clavaddpm.synthesizer import clava_synthesizing


# Preventing some excessive logging
TOOLKIT_LOGGER.setLevel(INFO)


@hydra.main(config_path=".", config_name="config", version_base=None)
def main(config: DictConfig) -> None:
    """
    Run the synthesizing pipeline for a single-table diffusion model.

    It will load the config and then data from the `config.base_data_dir` folder,
    train the model, synthesize the data and save the results in the
    `config.results_dir` folder.

    It will first look for a pre-trained model in the `config.results_dir` folder.
    If it doesn't find one, it will train a new model from scratch.

    Args:
        config: Training and synthesizing configuration as an OmegaConf DictConfig object.
    """
    log(INFO, f"Checking for a pre-trained model in {config.results_dir}...")

    tables, relation_order, _ = load_tables(Path(config.base_data_dir))

    model_file_paths = {}
    for relation in relation_order:
        model_file_path = Path(config.results_dir) / "models" / f"{relation[0]}_{relation[1]}_ckpt.pkl"
        model_file_paths[relation] = model_file_path

    if all(model_file.exists() for model_file in model_file_paths.values()):
        log(INFO, f"Found pre-trained models in {config.results_dir}. Skipping training.")
    else:
        log(INFO, "No pre-trained models found, training a new model from scratch...")
        run_training.main(config)

    log(INFO, "Loading models...")

    models = {}
    for relation in relation_order:
        with open(model_file_paths[relation], "rb") as f:
            models[relation] = pickle.load(f)

    log(INFO, "Synthesizing data...")

    clava_synthesizing(
        tables,
        relation_order,
        Path(config.results_dir),
        models,
        GeneralConfig(**config.general_config),
        SamplingConfig(**config.sampling_config),
        MatchingConfig(**config.matching_config),
    )

    log(INFO, "Data synthesized successfully.")


if __name__ == "__main__":
    main()