
Commit fd645c8 (1 parent: d390e8e)

mention ASE-db and additional params related to cond model

8 files changed: +25 −16 lines


configs/data/mol_dataset.yaml (3 additions, 2 deletions)

```diff
@@ -1,6 +1,6 @@
 _target_: MolecularDiffusion.runmodes.train.DataModule
 root: /home/pregabalin/RF/blue_edm/data/qm9
-filename: /home/pregabalin/RF/blue_edm/data/qm9/dsgdb9nsd_ready.csv # 4k or ready
+filename: /home/pregabalin/RF/blue_edm/data/qm9/dsgdb9nsd_4k.csv # 4k or ready
 atom_vocab: [H,B,C,N,O,F,Al,Si,P,S,Cl,As,Se,Br,I,Hg,Bi]
 dataset_name: qm9
 with_hydrogen: True
@@ -16,4 +16,5 @@ load_pkl: null
 save_pkl: data/test.pkl
 data_type: pointcloud # pyg or pointcloud
 batch_size: 48
-num_workers: 0
+num_workers: 0
+consider_global_attributes: False
```

configs/tasks/diffusion.yaml (5 additions, 4 deletions)

```diff
@@ -5,10 +5,10 @@ condition_names: []
 hidden_size: 192
 act_fn:
   _target_: torch.nn.SiLU
-num_layers: 1
+num_layers: 9
 attention: True
 tanh: True
-num_sublayers: 12
+num_sublayers: 1
 sin_embedding: False
 aggregation_method: "sum"
 dropout: 0.0
@@ -19,7 +19,7 @@ normalization_factor: 1.0
 chkpt_path: null
 
 # specific to diffusion
-diffusion_steps : 400
+diffusion_steps : 900
 diffusion_noise_schedule : polynomial_2 # learned, cosine_x, polynomial_x, issnr_x, smld_x
 diffusion_noise_precision: 1e-5
 diffusion_loss_type: vlb
@@ -40,6 +40,7 @@ sp_regularizer_polynomial_p: 1.1
 sp_regularizer_warm_up_steps: 100
 reference_indices: null # indices of core atoms for the outpainting objective
 # evaluator parameters
-metrics: "Validity Relax and connected" # Validity Relax and connected, Validity Strict and connected, Validity Strict, Validity Relax
+use_posebuster: True
+metrics: valid_posebuster # use_posebuster must be true
 n_samples: 24
 generative_analysis: True
```

configs/tasks/diffusion_pretrained.yaml (6 additions, 5 deletions)

```diff
@@ -5,10 +5,10 @@ condition_names: []
 hidden_size: 192
 act_fn:
   _target_: torch.nn.SiLU
-num_layers: 1
+num_layers: 9
 attention: True
 tanh: True
-num_sublayers: 9
+num_sublayers: 1
 sin_embedding: False
 aggregation_method: "sum"
 dropout: 0.0
@@ -27,8 +27,8 @@ normalize_factors: [1,4,10]
 extra_norm_values: []
 augment_noise: False
 data_augmentation: False
-context_mask_rate: 0.0
-mask_value: null
+context_mask_rate: 0.2
+mask_value: 5
 normalize_condition: value_10 # [None, "maxmin", "mad"]
 sp_regularizer_deploy: False
 sp_regularizer_regularizer: hard
@@ -40,6 +40,7 @@ sp_regularizer_polynomial_p: 1.1
 sp_regularizer_warm_up_steps: 100
 reference_indices: null # indices of core atoms for the outpainting objective
 # evaluator parameters
-metrics: "Validity Relax and connected" # Validity Relax and connected, Validity Strict and connected, Validity Strict, Validity Relax
+use_posebuster: True
+metrics: valid_posebuster # use_posebuster must be true
 n_samples: 24
 generative_analysis: True
```
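The `context_mask_rate: 0.2` / `mask_value: 5` change above wires the conditional model for classifier-free guidance: during training, the conditioning values are occasionally replaced by a fixed mask value so the model also learns an unconditional distribution. A minimal sketch of that masking step (the function name, signature, and list-based conditions are illustrative assumptions, not the repository's code):

```python
import random

def mask_context(conditions, mask_rate=0.2, mask_value=5, rng=random):
    """Randomly replace the condition vector with a fixed mask value.

    Illustrative sketch: with probability mask_rate the model sees
    mask_value instead of the true conditions, which is what makes
    classifier-free guidance possible at sampling time.
    """
    if rng.random() < mask_rate:
        return [mask_value] * len(conditions)
    return conditions
```

At sampling time, the same `mask_value` then stands in for "no condition" when combining conditional and unconditional predictions.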

src/MolecularDiffusion/runmodes/train/tasks_egcl.py (1 addition, 1 deletion)

```diff
@@ -321,7 +321,7 @@ def build(self):
             self.task.std = chk_point["std"]
         except FileNotFoundError:
             logger.warning(f"Checkpoint not found at {self.chkpt_path}. Initializing model without loading.")
-
+            raise FileNotFoundError(f"Checkpoint not found at {self.chkpt_path}.")
         self.task.atom_vocab = self.atom_vocab
 
         return self.task
```

src/MolecularDiffusion/runmodes/train/tasks_egt.py (1 addition, 1 deletion)

```diff
@@ -285,7 +285,7 @@ def build(self):
             self.task.std = chk_point["std"]
         except FileNotFoundError:
             logger.warning(f"Checkpoint not found at {self.chkpt_path}. Initializing model without loading.")
-
+            raise FileNotFoundError(f"Checkpoint not found at {self.chkpt_path}.")
         self.task.atom_vocab = self.atom_vocab
 
         return self.task
```
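Both `build()` changes replace a silent fallback with fail-fast behaviour: the warning is still logged, but the `FileNotFoundError` is then raised so training cannot quietly start from randomly initialized weights. A standalone sketch of the pattern (the `load_checkpoint` helper below is hypothetical; the repository's code does this inside `build()` while loading a checkpoint):

```python
import logging

logger = logging.getLogger(__name__)

def load_checkpoint(chkpt_path):
    # Hypothetical sketch: log a warning for visibility, then re-raise
    # so a missing checkpoint aborts the run instead of being ignored.
    try:
        with open(chkpt_path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        logger.warning(
            f"Checkpoint not found at {chkpt_path}. Initializing model without loading."
        )
        raise FileNotFoundError(f"Checkpoint not found at {chkpt_path}.")
```

Logging before re-raising keeps the original diagnostic message in the run log while still surfacing the failure to the caller.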

tutorials/01_training_diffusion/README.md (1 addition, 0 deletions)

```diff
@@ -44,6 +44,7 @@ This is the most important step. You will override the default parameters to con
 | `trainer.output_path` | `trainer: {output_path: "results/my_run"}` | **CRITICAL:** Where all logs and checkpoints are saved. |
 | `data.filename` | `data: {filename: "molecules.csv"}` | The CSV file with molecule information. |
 | `data.xyz_dir` | `data: {xyz_dir: "xyz_files/"}` | The directory containing `.xyz` geometry files. |
+| `data.ase_db_path` | `data: {ase_db_path: "data/qm9.db"}` | Path to an ASE database file (`.db`) or a directory containing `.db` files. This is an alternative to `data.filename` and `data.xyz_dir`. |
 
 #### Data Processing and Caching
```
tutorials/01_training_diffusion/my_first_run.yaml (3 additions, 2 deletions)

```diff
@@ -22,8 +22,9 @@ logger:
 data:
   batch_size: 64
   root: data # where the data is stored
-  filename: molecule.csv
-  xyz_dir: xyz_files
+  # filename: molecule.csv # Use this for CSV/XYZ data
+  # xyz_dir: xyz_files # Directory for XYZ files if using CSV
+  ase_db_path: data/qm9.db # Path to an ASE database file or directory containing .db files
   load_pkl: true # to reload the processed data if it exists
 
```
tutorials/04_finetuning/README.md (5 additions, 1 deletion)

````diff
@@ -14,6 +14,9 @@ Fine-tuning is a powerful technique where you take a pre-trained model and conti
 >
 > If the architectures do not match, PyTorch will be unable to load the weights from the checkpoint, and the fine-tuning process will fail. Always ensure your model configuration YAML file matches the settings of the pre-trained model.
 
+If you are adapting our pre-trained diffusion model, use `configs/tasks/diffusion_pretrained.yaml` as the task config.
+
+
 ## The Core Concepts of Fine-Tuning
 
 **Important Note:** The configuration files for this tutorial must be placed in the `configs/` directory at the root of the project for the scripts to read the settings.
@@ -91,7 +94,7 @@ tasks:
 | `tasks.condition_names`| `["S1_exc", "T1_exc"]` | A list of property names from your dataset that the model should learn to associate with the molecules. |
 | `tasks.context_mask_rate`| `0.1` | The probability of hiding the condition during training. A value greater than 0 is required to enable Classifier-Free Guidance (CFG) during generation. A common value is 0.1 (10% of the time). |
 | `tasks.mask_value`| `[0, 0]` | The value to use when a condition is masked. This should be a list with the same length as `condition_names`. Typically, this is `0` or the mean value of the property in the dataset. |
-
+| `tasks.normalization_method` | `"maxmin"` | The method to normalize conditional properties. Options are: `"maxmin"` (scales to [-1, 1]), `"mad"` (mean absolute deviation), `"value_N"` (divides by a specific value N), or `null` for no normalization. |
 **Example `finetune_add_condition.yaml`:**
 
 ```yaml
@@ -117,6 +120,7 @@ tasks:
   # KEY CHANGE: Add the conditions to learn
   condition_names: ["S1_exc", "T1_exc"]
   context_mask_rate: 0.1 # Make it CFG-ready
+  normalization_method: value_10
 ```
 
 ---
````
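The `normalization_method` options documented in the new table row can be summarised in a short sketch (the function below is an illustrative assumption over scalar property lists; the repository's implementation may differ):

```python
def normalize_condition(values, method):
    """Illustrative sketch of the documented normalization options."""
    if method is None:
        return list(values)
    if method == "maxmin":
        # Scale linearly to [-1, 1] using the dataset min and max.
        lo, hi = min(values), max(values)
        return [2 * (v - lo) / (hi - lo) - 1 for v in values]
    if method == "mad":
        # Center on the mean, divide by the mean absolute deviation.
        mean = sum(values) / len(values)
        mad = sum(abs(v - mean) for v in values) / len(values)
        return [(v - mean) / mad for v in values]
    if method.startswith("value_"):
        # "value_N": divide every property by the constant N.
        n = float(method.split("_", 1)[1])
        return [v / n for v in values]
    raise ValueError(f"Unknown normalization method: {method}")
```

This also explains the `normalization_method: value_10` line in the example config: each conditional property is simply divided by 10.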
