|
1 | | -# MulticompSol |
| 1 | +# MulticompSol Data Repository |
| 2 | +This is a data repository associated with the manuscript titled "Enhancing Predictive Models for Solubility in Multi-Solvent Systems using Semi-Supervised Graph Neural Networks" by Hojin Jung‡, Christopher D. Stubbs‡, Sabari Kumar, Raúl Pérez-Soto, Su-min Song, Yeonjoon Kim, and Seonah Kim (‡Equal contribution). |
| 3 | + |
| 4 | +This repository consists of: |
| 5 | +- A novel small molecule solubility database (for solutes in 1-3 solvents) |
| 6 | +- Code to train multicomponent solubility models (for solutes in 1-3 solvents) |
| 7 | + |
| 8 | +## Using this Repository |
| 9 | + |
| 10 | +To use the database and models in this repository, you will need a working installation of Python (v3.8-3.10) on your computer alongside the required packages (see "Packages Required"). All code was tested in Windows 10 64-bit and CentOS Stream 8, and so it should work on most modern operating systems. Please report any issues with using this code on GitHub. |
| 11 | + |
| 12 | + |
| 13 | +### Training Models |
| 14 | + |
| 15 | +- All model training requires a working Python environment, with GPU access and a CUDA setup ideal but not necessary (see "Packages Required" and "Using this Repository"). Getting CUDA and TensorFlow to work together on a GPU can be challenging, so the GNN model code falls back to a CPU if a GPU cannot be found. |
| 16 | +- For all GNN models, descriptor generation is included as part of model training. Descriptors used can be changed in gnn_multisol.py (atom_features, bond_features, global_features functions). *Note that changing the number of features will generally require changing the shapes specified in any preprocessor used.* |
| 17 | + |
| 18 | + |
| 19 | +#### Training GNN Models |
| 20 | + |
| 21 | +- To train GNN models, first check whether your machine has CUDA and TensorFlow GPU support setup. This is often a machine-specific process, and depends on your graphics card, its supported CUDA versions, the CUDA versions installed, and the TensorFlow version installed (among other factors) |
| 22 | +- GPU use is *not* required for GNN model training, but significant slowdowns may occur if a GPU is not used |
| 23 | +- To train GNN models, use the following code snippets as an example (other options available by using the --help flag or checking source code). |
| 24 | + - Subgraph Binary: `nohup python train_subgraph_binary.py -n "Example_BinarySubgraph" > Log_ExampleBinaryConcat.txt &` |
| 25 | + - Subgraph Ternary: `nohup python train_subgraph_ternary.py -n "Example_TernarySubgraph" > Log_Example_TernarySubgraph.txt &` |
| 26 | + - Concat Binary: `nohup python train_concat_binary.py -n "Example_BinaryConcat" > Log_ExampleBinaryConcat.txt &` |
| 27 | + - Concat Ternary: `nohup python train_concat_ternary.py -n "Example_TernaryConcat" > Log_Example_TernaryConcat.txt &` |
| 28 | +- Trained GNN models will be saved in models/.../model_files. Each folder has the preprocessor used, the best model (best_model.h5), and the prediction results (kfold_#.csv) |
| 29 | + |
| 30 | +### Loading Models |
| 31 | + |
| 32 | +- GNN Models |
| 33 | + - Trained GNN models can be loaded from the .h5 file found in /model_files/.../best_model.h5. To load, you will need to import the nfp package and pass nfp.custom_objects from nfp as custom_objects to the model load call. Rough example code can be found below. |
| 34 | + - Model results can be found in the same directory as the h5 file, in the csv file named `kfold_?.csv`, where ? is the fold number for that run (0-4, e.g. kfold_0.csv). |
| 35 | + |
| 36 | +```python |
| 37 | +def predict_df(df, model_name, csv_file_dir): |
| 38 | + model_dir = Path.cwd()/(f'model_files/{model_name}') |
| 39 | + csv_name = Path(csv_file_dir).stem |
| 40 | + |
| 41 | + model = tf.keras.models.load_model(model_dir/'best_model.h5', custom_objects = nfp.custom_objects) |
| 42 | + #! Will need to change the preprocessor depending on model - consult the respective training script. |
| 43 | + # (e.g. train_subgraph_binary.py for binary subgraph models) |
| 44 | + preprocessor = CustomPreprocessor_NFPx2( |
| 45 | + explicit_hs=False, |
| 46 | + atom_features=atom_features, |
| 47 | + bond_features=bond_features) |
| 48 | + preprocessor.from_json(model_dir/'preprocessor.json') |
| 49 | + |
| 50 | + output_signature = (preprocessor.output_signature, |
| 51 | + tf.TensorSpec(shape=(), dtype=tf.float32), |
| 52 | + tf.TensorSpec(shape=(), dtype=tf.float32)) |
| 53 | + |
| 54 | + df_data = tf.data.Dataset.from_generator( |
| 55 | + #! Will need to change dataset generation function depending on model - consult the respective training script. |
| 56 | + # (e.g. train_subgraph_binary.py for binary subgraph models) |
| 57 | + lambda: create_tf_dataset_NFPx2(df, preprocessor, 1.0, False), output_signature=output_signature)\ |
| 58 | + .cache()\ |
| 59 | + .padded_batch(batch_size=len(df))\ |
| 60 | + .prefetch(tf.data.experimental.AUTOTUNE) |
| 61 | + |
| 62 | + pred_results = model.predict(df_data).squeeze() |
| 63 | + df['predicted'] = pred_results |
| 64 | + return df |
| 65 | + |
| 66 | +``` |
| 67 | + |
| 68 | +## Packages Required |
| 69 | + |
| 70 | +All of the following were retrieved from PyPI, but should also be available on conda-forge. Most model development was done in Python 3.8.13, but should work fine for Python 3.8 - 3.10 (3.7 may also work, but hasn't been tested). Note that a few packages require specific version numbers (nfp, TensorFlow, pandas, RDKit). Other packages have their version specified for reproducibility, and it is recommended to use the versions specified when possible. |
| 71 | + |
| 72 | +#### Utility |
| 73 | + |
| 74 | +- matplotlib (v3.5.3) |
| 75 | +- seaborn (v0.12.0) |
| 76 | +- JupyterLab (v3.4.5) |
| 77 | + |
| 78 | +#### Descriptor Generation |
| 79 | + |
| 80 | +- mordred (v1.2.0) |
| 81 | +- RDKit (v2022.3.5) |
| 82 | + |
| 83 | +#### ML/Vector Math |
| 84 | + |
| 85 | +- numpy (v1.23.2) |
| 86 | +- scipy (v1.9.0) |
| 87 | +- pandas (v1.4.3) |
| 88 | +- scikit-learn (v1.1.2) (<1.3) |
| 89 | +- tensorflow (v2.9.1) |
| 90 | +- tensorflow-addons (v0.18.0) |
| 91 | +- Keras (v2.9.0) |
| 92 | +- nfp (v0.3.0 exactly) |
| 93 | + |
| 94 | +## Filing Issues |
| 95 | +Please report all issues or errors with code on GitHub wherever possible. |
0 commit comments