This is a data repository associated with the manuscript titled "Predicting homopolymer and copolymer solubility through machine learning" by Christopher D. Stubbs, Yeonjoon Kim, Ethan C. Quinn, Raúl Pérez-Soto, Eugene Y.-X. Chen, and Seonah Kim.
This repository consists of a few parts:
- 2 polymer solubility databases (homopolymer + copolymer)
- Code to train and analyze ~38 polymer solubility models (classical + GNN for homopolymers/copolymers)
- Code to perform Shapley Additive Explanations (SHAP) analysis on the homopolymer model
- Code to identify selective solvents for homopolymer additive removal
This repository includes:
- Two new polymer solubility databases (curated by hand from J. Brandrup's Polymer Handbook):
  - Homopolymer solubility (1818 datapoints)
    - File: data/pkls/df_atactic_NOPE_nr_norad.pkl
    - File: data/csvs/df_atactic_NOPE_nr_norad.csv
  - Copolymer solubility (270 datapoints)
    - File: data/pkls/df_dicopoly_norad.pkl
    - File: data/csvs/df_dicopoly_norad.csv
- Multiple new ML models of polymer solubility (using molecular and/or fingerprint descriptors):
  - Homopolymer classical
    - File: scripts/train_homopoly_classical.py
  - Copolymer classical
    - File: scripts/train_copoly_classical.py
  - Homopolymer GNN
    - File: model_files/main_homopoly.py
  - Copolymer GNN
    - File: model_files/main_copoly.py
- Code to generate ML descriptors for classical models:
  - Homopolymer
    - File: scripts/descriptor_gen_homopoly_classical.ipynb
  - Copolymer
    - File: scripts/descriptor_gen_copoly_classical.ipynb
- Code for analyzing model performance via Shapley Additive Explanations (SHAP)
  - File: scripts/analysis_and_SHAP.ipynb
- Code to predict selective solvents for plastic additive removal from polyethylene and polystyrene
  - File: scripts/additive_removal.ipynb
To use the database and models in this repository, you will need a working installation of Python (v3.8-3.10) alongside the required packages (see "Packages Required"). All code was tested on Windows 10 64-bit and CentOS Stream 8, so it should work on most modern operating systems. Please report any issues with this code on GitHub.
- To use the homopolymer and copolymer solubility databases (collectively called "PolySolDB"), load the database pickle files (located in data/pkls/) with the pandas read_pickle method:
import pandas as pd
df_homopoly = pd.read_pickle("data/pkls/df_atactic_NOPE_nr_norad.pkl")
df_copoly = pd.read_pickle("data/pkls/df_dicopoly_norad.pkl")
print("# Datapoints PolySolDB Homopolymer:", df_homopoly.shape[0])
print("# Datapoints PolySolDB Copolymer:", df_copoly.shape[0])- All model training requires a working Python environment, with GPU access and a CUDA setup ideal but not necessary (see "Packages Required" and "Using this Repository"). Getting CUDA and TensorFlow to work together on a GPU can be challenging, so the GNN model code falls back to a CPU if a GPU cannot be found.
- For the classical ML models, descriptor generation is required before training - see scripts/descriptor_gen_..._classical.ipynb
- Generating these descriptors can be quite resource intensive (30-60 minutes), and 16+ GB of RAM with at least 10GB of storage space is recommended to run the code.
- For the GNN models, descriptor generation is included as part of model training.
- As previously stated, generate descriptors first by running the descriptor_gen Jupyter notebooks found in the 'scripts' folder (one way to run them non-interactively is shown below).
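If you prefer to execute the notebooks from the command line rather than in JupyterLab, nbconvert can run them in place. A minimal example for the homopolymer notebook (the copolymer notebook follows the same pattern):
jupyter nbconvert --to notebook --execute --inplace scripts/descriptor_gen_homopoly_classical.ipynb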
- To train the classical models, call the relevant training script as follows (replace the model name with one of your choosing):
- Additional parameters can be seen in the source code or by using the --help flag (e.g. python scripts/train_homopoly_classical.py --help)
- Homopolymer:
python train_homopoly_classical.py -d='atombd,mordred,atommordred,mfp,atommfp,rdfp,atomrdfp' -m='2,3,4,5' --seed=0 --nprocs=10
- Copolymer:
python train_copoly_classical.py -d='atombd,mordred,atommordred,mfp,atommfp,rdfp,atomrdfp' --seed=0 --nprocs=10
- To change the specific descriptors used for training, change the -d flag. Warning: training with fewer than the 7 descriptor sets may break the modelGroup parsing code.
- Homopolymer models only: To change the specific architectures used for training, change the -m flag. Warning: training with more or fewer than the 4 architectures specified may break the modelGroup parsing code.
- Trained classical models can be found in pkls/2D_atactic_NOPE_nr_fm (homopolymer) or pkls/2D_copoly.
- Use the modelGroup class defined in scripts/metrics_gen.py to load classical model results (a minimal import sketch follows).
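A minimal sketch of the import, assuming Python is run from the repository root; the actual constructor arguments are defined in scripts/metrics_gen.py, and scripts/analysis_and_SHAP.ipynb contains working examples:
import sys
sys.path.append("scripts")           # make metrics_gen.py importable from the repo root
from metrics_gen import modelGroup   # class used to load and parse trained classical models
# See scripts/analysis_and_SHAP.ipynb for the actual constructor arguments and for
# examples that point at pkls/2D_atactic_NOPE_nr_fm (homopolymer) or pkls/2D_copoly.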
- To train GNN models, first check whether your machine has CUDA and TensorFlow GPU support set up. This is often a machine-specific process and depends on your graphics card, its supported CUDA versions, the CUDA versions installed, and the TensorFlow version installed (among other factors).
- GPU use is not required for GNN model training, but significant slowdowns may occur if a GPU is not used; a quick way to check GPU visibility is shown below.
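To confirm whether TensorFlow can see a GPU before starting a long training run, a short check like the following is enough:
import tensorflow as tf
# An empty list here means TensorFlow cannot see a GPU and training will run on the CPU.
print(tf.config.list_physical_devices("GPU"))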
- To train GNN models, use the following commands as examples (other options are available via the --help flag or the source code)
- Homopolymer:
python models/homopoly_solub_gnn_v0/main_homopoly.py -n "ExampleHomopolymerModel"
- Copolymer:
python models/copoly_solub_gnn_v0/main_copoly.py -n "ExampleCopolymerModel"
- Trained GNN models will be saved in models/.../model_files. Each folder contains the preprocessor used, the best model (best_model.h5), and the prediction results (kfold_#.csv); a quick way to inspect these results is shown below.
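The per-fold prediction CSVs can be loaded directly with pandas. The path below is a placeholder; substitute the model_files directory created for your own run:
import pandas as pd
# Placeholder path - substitute the model_files directory for your run.
fold0 = pd.read_csv("model_files/ExampleHomopolymerModel/kfold_0.csv")
print(fold0.head())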
- Classical models
- Classical models can be loaded using the modelGroup class found in scripts/metrics_gen.py
- Example usage of the modelGroup class can be found in scripts/analysis_and_SHAP.ipynb
- GNN Models
- Pre-trained GNN models can be loaded from the .h5 file found in models/.../model_files/.../best_model.h5. To load one, import the nfp package and pass nfp.custom_objects as custom_objects to the model load call. Rough example code is below.
- Model results can be found in the same directory as the h5 file, in the csv file named
kfold_?.csv, where ? is the fold number for that run (0-4, e.g. kfold_0.csv).
import nfp
import tensorflow as tf
from pathlib import Path

# CustomPreprocessor, create_tf_dataset, atom_features, and bond_features are
# defined in this repository's GNN model code (see the main_*.py scripts).
def predict_df(df, model_name, csv_file_dir):
    model_dir = Path.cwd()/(f'model_files/{model_name}')
    csv_name = Path(csv_file_dir).stem
    # nfp.custom_objects is needed to deserialize the custom GNN layers
    model = tf.keras.models.load_model(model_dir/'best_model.h5', custom_objects=nfp.custom_objects)
    preprocessor = CustomPreprocessor(
        explicit_hs=False,
        atom_features=atom_features,
        bond_features=bond_features)
    preprocessor.from_json(model_dir/'preprocessor.json')
    output_signature = (preprocessor.output_signature,
                        tf.TensorSpec(shape=(2,), dtype=tf.float32),
                        tf.TensorSpec(shape=(), dtype=tf.float32))
    df_data = tf.data.Dataset.from_generator(
        lambda: create_tf_dataset(df, preprocessor, 1.0, False), output_signature=output_signature)\
        .cache()\
        .padded_batch(batch_size=len(df))\
        .prefetch(tf.data.experimental.AUTOTUNE)
    pred_results = model.predict(df_data).squeeze()
    df['predicted'] = pred_results
    return df
- All classical model analysis can be found in scripts/analysis_and_SHAP.ipynb, including accuracy, precision, recall, and other classification metrics. Additionally, SHAP analysis for the best homopolymer model can be found directly below the model analysis; we use train data for SHAP predictions, but SHAP predictions for test data are also included for comparison.
- All code for plastic additive removal can be found in scripts/additive_removal.ipynb. This code uses the polymer solubility model alongside a small-molecule solubility model to predict selective solvents for plastic additive removal via dissolution-precipitation.
All of the following packages were retrieved from PyPI, but should also be available on conda-forge. Most model development was done in Python 3.8.13, but the code should work for Python 3.8-3.10 (3.7 may also work, but has not been tested). Note that a few packages require specific version numbers (nfp, TensorFlow, pandas, RDKit). Other packages have their versions pinned for reproducibility, and it is recommended to use the versions specified when possible. A single pip command covering the pinned versions is given after the list.
- matplotlib (v3.5.3)
- seaborn (v0.12.0)
- JupyterLab (v3.4.5)
- mordred (v1.2.0)
- RDKit (v2022.3.5)
- numpy (v1.23.2)
- scipy (v1.9.0)
- pandas (v1.4.3)
- scikit-learn (v1.1.2)
- tensorflow (v2.9.1)
- tensorflow-addons (v0.18.0)
- Keras (v2.9.0)
- nfp (v0.3.0 exactly)
- SHAP (v0.41.0 - requires some modifications - see analysis_and_SHAP.ipynb)
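As one way to set these up, the pinned versions can be installed into a fresh virtual environment with a single pip command. The PyPI package names below are our best mapping of the list above (note that official RDKit wheels for this version are published on PyPI as rdkit):
pip install matplotlib==3.5.3 seaborn==0.12.0 jupyterlab==3.4.5 mordred==1.2.0 rdkit==2022.3.5 numpy==1.23.2 scipy==1.9.0 pandas==1.4.3 scikit-learn==1.1.2 tensorflow==2.9.1 tensorflow-addons==0.18.0 keras==2.9.0 nfp==0.3.0 shap==0.41.0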
Please report all issues or errors with the code on GitHub wherever possible.