# Simple transformers for classification and regression using SMILE string input

## Introduction
The ST1 benchmark represents two versions of a simple transformer, one that performs regression and the other classification. We chose the transformer architecture to see if we could train directly on SMILE strings. This benchmark brings novel capability to the suite of Pilot1 benchmarks in two ways. First, the featurization of a small molecule is simply its SMILE string. Second, the model is based on the Transformer architecture, albeit a much smaller version of the large Transformer models that train billions or more parameters.
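The benchmark's actual preprocessing lives in smiles_transformer.py; as a rough illustration of what "training directly on SMILE strings" involves, a character-level tokenizer might look like the sketch below. The vocabulary, padding scheme, and sequence length here are assumptions for illustration, not the benchmark's actual choices.

```python
# Hypothetical character-level tokenizer for SMILE strings.
# The real vocabulary and sequence length are defined in
# smiles_transformer.py; these are illustrative only.

def build_vocab(smiles_list):
    """Map every character seen in the corpus to an integer id (0 = padding)."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(smile, vocab, max_len=10):
    """Encode one SMILE string as a fixed-length list of integer ids."""
    ids = [vocab.get(ch, 0) for ch in smile[:max_len]]
    return ids + [0] * (max_len - len(ids))  # right-pad with id 0

smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
vocab = build_vocab(smiles)
print(encode("CCO", vocab, max_len=6))
```

Each fixed-length integer sequence can then be fed to an embedding layer at the front of the transformer.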

Both the original code and the CANDLE versions are available. The original examples are retained and can be run as noted below. The CANDLE versions make use of the common network design in `smiles_transformer.py`, and implement the models in `sct_baseline_keras2.py` and `srt_baseline_keras2.py`, for classification and regression, respectively.

The example classification problem takes SMILE strings as input and trains a model to predict whether or not a compound is 'drug-like' based on the Lipinski criteria. The example regression problem takes SMILE strings as input and trains a model to predict the molecular weight. The data are freely available and are downloaded automatically by the CANDLE versions.

For the CANDLE versions, all the relevant arguments are contained in the respective default model files, and all variables can be overwritten from the command line. The datasets will be automatically downloaded and stored in the `../../Data/Pilot1` directory. The respective default model files and commands to invoke the classifier and regressor are:
```
class_default_model.txt
python sct_baseline_keras2.py
```
and
```
regress_default_model.txt
python srt_baseline_keras2.py
```

## Running the original versions
The original code demonstrating a simple transformer regressor and a simple transformer classifier is available as
```
smiles_regress_transformer.py
```
and
```
smiles_class_transformer.py
```

The example data sets are the same as for the CANDLE versions, and allow one to predict whether a small molecule is "drug-like" based on the Lipinski criteria (classification), or to predict the molecular weight (regression), from a SMILE string as input. The example data sets are downloadable using the information in the `regress_default_model.txt` or `class_default_model.txt` files. These data files must be downloaded manually and specified on the command line for execution.

```
# for regression
train_data = https://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Examples/xform-smiles-data/chm.weight.trn.csv
val_data = https://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Examples/xform-smiles-data/chm.weight.val.csv

# for classification
train_data = https://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Examples/xform-smiles-data/chm.lipinski.trn.csv
val_data = https://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Examples/xform-smiles-data/chm.lipinski.val.csv
```
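The column layout of these CSV files is not documented here; as a hedged sketch, assuming each regression file pairs a SMILE string with its target under hypothetical `smiles` and `weight` headers, loading one might look like:

```python
# Sketch of reading a SMILES/target CSV such as chm.weight.trn.csv.
# The header names 'smiles' and 'weight' are assumptions for illustration;
# check the downloaded file for the actual layout.
import csv
import io

sample = """smiles,weight
CCO,46.07
c1ccccc1,78.11
"""

rows = list(csv.DictReader(io.StringIO(sample)))
smiles = [r["smiles"] for r in rows]
targets = [float(r["weight"]) for r in rows]
print(smiles, targets)
```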
To run the models:

```
CUDA_VISIBLE_DEVICES=1 python smiles_class_transformer.py --in_train chm.lipinski.trn.csv --in_vali chm.lipinski.val.csv --ep 25
```

or

```
CUDA_VISIBLE_DEVICES=0 python smiles_regress_transformer.py --in_train chm.weight.trn.csv --in_vali chm.weight.val.csv --ep 25
```

The model with the best validation loss is saved in the `.h5` dumps. Log files contain the trace. Regression output should look something like this:
```
Epoch 1/25
2022-03-21 12:53:11.402337: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
...
Epoch 25/25
...
Epoch 00025: val_loss did not improve from 800.85254
```

## Background on the example classification problem

The classification data come from CHEMBL: 1.5M training examples labeled 1/0 by the Lipinski criteria for drug likeness, with a non-overlapping validation set of 100K samples.

Classification validation accuracy is about 91% after 10-20 epochs.
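The benchmark's 1/0 labels come precomputed in the CSV files; as a reminder of what they encode, a minimal check of the Lipinski rule-of-five on already-computed molecular descriptors could look like this (descriptor values below are approximate and for illustration only):

```python
# Hypothetical Lipinski rule-of-five label on precomputed descriptors:
# molecular weight <= 500, logP <= 5, H-bond donors <= 5, H-bond acceptors <= 10.
# The benchmark ships labels in the CSV; this only illustrates the criteria.

def lipinski_label(mol_weight, log_p, h_donors, h_acceptors):
    """Return 1 if the molecule satisfies all four Lipinski criteria, else 0."""
    ok = (mol_weight <= 500 and log_p <= 5
          and h_donors <= 5 and h_acceptors <= 10)
    return int(ok)

print(lipinski_label(180.2, 1.2, 1, 4))     # small, drug-like molecule -> 1
print(lipinski_label(1200.0, 7.5, 12, 20))  # large, greasy molecule -> 0
```

In practice the descriptors themselves would be computed from the SMILE string with a cheminformatics toolkit.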

## Background on the example regression problem

The regression data also come from CHEMBL: 1.5M training examples (shuffled and resampled, so not the same 1.5M as for classification), predicting molecular weight, again with a non-overlapping validation set of 100K samples.

The regression problem achieves an R^2 of about 0.95 after ~20 epochs.
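For reference, the R^2 (coefficient of determination) quoted above can be computed from predictions in a few lines; the values below are made up purely to exercise the formula.

```python
# Minimal R^2 computation: 1 - SS_res / SS_tot.
# y_true / y_pred values are fabricated for illustration.

def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [180.2, 342.3, 46.1, 194.2]   # molecular weights (made up)
y_pred = [175.0, 350.0, 50.0, 190.0]   # model predictions (made up)
print(round(r_squared(y_true, y_pred), 3))
```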