Commit 5aa452a
Merge pull request #1 from BioE-KimLab/add_train_database_files
Add model training files and update README
2 parents 1a6d9ca + 6fadc82

File tree

9 files changed: +62,278 −1 lines changed

README.md

Lines changed: 95 additions & 1 deletion
# MulticompSol Data Repository

This is a data repository associated with the manuscript titled "Enhancing Predictive Models for Solubility in Multi-Solvent Systems using Semi-Supervised Graph Neural Networks" by Hojin Jung‡, Christopher D. Stubbs‡, Sabari Kumar, Raúl Pérez-Soto, Su-min Song, Yeonjoon Kim, and Seonah Kim (‡Equal contribution).

This repository consists of:

- A novel small-molecule solubility database (for solutes in 1-3 solvents)
- Code to train multicomponent solubility models (for solutes in 1-3 solvents)
## Using this Repository

To use the database and models in this repository, you will need a working installation of Python (v3.8-3.10) alongside the required packages (see "Packages Required"). All code was tested on Windows 10 64-bit and CentOS Stream 8, so it should work on most modern operating systems; a quick environment check is sketched below. Please report any issues with using this code on GitHub.
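
A minimal sanity check of your environment (a sketch; the package set printed here is assumed from "Packages Required" below):

```python
import sys
from importlib.metadata import version  # available in Python 3.8+

# Interpreter range recommended by this README.
assert (3, 8) <= sys.version_info[:2] <= (3, 10), "Python 3.8-3.10 recommended"

# Print installed versions of the core dependencies.
for pkg in ("tensorflow", "nfp", "pandas", "rdkit", "scikit-learn"):
    print(pkg, version(pkg))
```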
### Training Models

- All model training requires a working Python environment; GPU access with a CUDA setup is ideal but not necessary (see "Packages Required" and "Using this Repository"). Getting CUDA and TensorFlow to work together on a GPU can be challenging, so the GNN model code falls back to the CPU if a GPU cannot be found.
- For all GNN models, descriptor generation is included as part of model training. The descriptors used can be changed in gnn_multisol.py (the atom_features, bond_features, and global_features functions); a sketch of this style of feature function follows this list. *Note that changing the number of features will generally require changing the shapes specified in any preprocessor used.*
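
For orientation, nfp-style preprocessors take callables that map RDKit atoms and bonds to hashable tokens. The following is a hypothetical sketch of that pattern, not this repository's actual feature set; consult gnn_multisol.py for the real definitions:

```python
# Hypothetical sketch of nfp-style feature callables; the actual definitions
# live in gnn_multisol.py and will differ.
def atom_features(atom):
    # Build a token from a few RDKit atom properties.
    return str((atom.GetSymbol(), atom.GetDegree(),
                atom.GetTotalNumHs(), atom.GetIsAromatic()))

def bond_features(bond, flipped=False):
    # Build a token from a few RDKit bond properties.
    return str((bond.GetBondType(), bond.GetIsConjugated(), bond.IsInRing()))
```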
#### Training GNN Models

- To train GNN models, first check whether your machine has CUDA and TensorFlow GPU support set up (see the snippet after this list). This is often a machine-specific process and depends on your graphics card, its supported CUDA versions, the CUDA versions installed, and the TensorFlow version installed (among other factors).
- GPU use is *not* required for GNN model training, but significant slowdowns may occur if a GPU is not used.
- To train GNN models, use the following code snippets as examples (other options are available via the --help flag or the source code):
    - Subgraph Binary: `nohup python train_subgraph_binary.py -n "Example_BinarySubgraph" > Log_Example_BinarySubgraph.txt &`
    - Subgraph Ternary: `nohup python train_subgraph_ternary.py -n "Example_TernarySubgraph" > Log_Example_TernarySubgraph.txt &`
    - Concat Binary: `nohup python train_concat_binary.py -n "Example_BinaryConcat" > Log_Example_BinaryConcat.txt &`
    - Concat Ternary: `nohup python train_concat_ternary.py -n "Example_TernaryConcat" > Log_Example_TernaryConcat.txt &`
- Trained GNN models will be saved in models/.../model_files. Each folder contains the preprocessor used, the best model (best_model.h5), and the prediction results (kfold_#.csv).
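
A quick way to check whether TensorFlow can see a GPU:

```python
import tensorflow as tf

# An empty list means no usable GPU was found; training will fall back to the CPU.
print(tf.config.list_physical_devices('GPU'))

# False means this TensorFlow build was compiled without CUDA support.
print(tf.test.is_built_with_cuda())
```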
### Loading Models

- GNN Models
    - Trained GNN models can be loaded from the .h5 file found in /model_files/.../best_model.h5. To load one, import the nfp package and pass nfp.custom_objects as custom_objects to the model load call. Rough example code can be found below.
    - Model results can be found in the same directory as the .h5 file, in the CSV file named `kfold_?.csv`, where ? is the fold number for that run (0-4, e.g. kfold_0.csv).
```python
from pathlib import Path

import nfp
import tensorflow as tf

# CustomPreprocessor_NFPx2, atom_features, bond_features, and
# create_tf_dataset_NFPx2 come from this repository's training code
# (see the respective train_*.py script and gnn_multisol.py).

def predict_df(df, model_name, csv_file_dir):
    model_dir = Path.cwd() / f'model_files/{model_name}'
    csv_name = Path(csv_file_dir).stem  # unused here; kept for parity with the training scripts

    # nfp.custom_objects supplies the custom layers needed to deserialize the model.
    model = tf.keras.models.load_model(model_dir / 'best_model.h5',
                                       custom_objects=nfp.custom_objects)

    #! The preprocessor depends on the model - consult the respective training
    #  script (e.g. train_subgraph_binary.py for binary subgraph models).
    preprocessor = CustomPreprocessor_NFPx2(
        explicit_hs=False,
        atom_features=atom_features,
        bond_features=bond_features)
    preprocessor.from_json(model_dir / 'preprocessor.json')

    output_signature = (preprocessor.output_signature,
                        tf.TensorSpec(shape=(), dtype=tf.float32),
                        tf.TensorSpec(shape=(), dtype=tf.float32))

    #! The dataset-generation function also depends on the model - consult the
    #  respective training script (e.g. train_subgraph_binary.py).
    df_data = tf.data.Dataset.from_generator(
        lambda: create_tf_dataset_NFPx2(df, preprocessor, 1.0, False),
        output_signature=output_signature)\
        .cache()\
        .padded_batch(batch_size=len(df))\
        .prefetch(tf.data.experimental.AUTOTUNE)

    pred_results = model.predict(df_data).squeeze()
    df['predicted'] = pred_results
    return df
```
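
For example (hypothetical file and model names, for illustration only; the DataFrame must match the format the training scripts consume):

```python
import pandas as pd

# Hypothetical input CSV and model name.
df = pd.read_csv('example_solubility.csv')
df = predict_df(df, 'Example_BinarySubgraph', 'example_solubility.csv')
print(df['predicted'].head())

# Cross-validation predictions written during training can be inspected the same
# way (path assumed from the directory layout described above):
kfold_results = pd.read_csv('model_files/Example_BinarySubgraph/kfold_0.csv')
```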
## Packages Required

All of the following packages were retrieved from PyPI, but they should also be available on conda-forge. Most model development was done in Python 3.8.13, but the code should work on Python 3.8-3.10 (3.7 may also work, but has not been tested). Note that a few packages require specific versions (nfp, TensorFlow, pandas, RDKit). Other packages have their versions specified for reproducibility, and it is recommended to use the specified versions when possible.
### Utility

- matplotlib (v3.5.3)
- seaborn (v0.12.0)
- JupyterLab (v3.4.5)

### Descriptor Generation

- mordred (v1.2.0)
- RDKit (v2022.3.5)

### ML/Vector Math

- numpy (v1.23.2)
- scipy (v1.9.0)
- pandas (v1.4.3)
- scikit-learn (v1.1.2; must be <1.3)
- tensorflow (v2.9.1)
- tensorflow-addons (v0.18.0)
- Keras (v2.9.0)
- nfp (v0.3.0 exactly)
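
For reference, a requirements-style pin list matching the versions above might look like the following (a sketch; the lowercase PyPI package names are assumptions, and available wheels vary by platform and Python version):

```
matplotlib==3.5.3
seaborn==0.12.0
jupyterlab==3.4.5
mordred==1.2.0
rdkit==2022.3.5
numpy==1.23.2
scipy==1.9.0
pandas==1.4.3
scikit-learn==1.1.2
tensorflow==2.9.1
tensorflow-addons==0.18.0
keras==2.9.0
nfp==0.3.0
```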
## Filing Issues

Please report any issues or errors with the code on GitHub wherever possible.
