Commit 644c22c

Update README.md
1 parent bb52504 commit 644c22c

File tree: 1 file changed, +56 -27 lines


Pilot1/ST1/README.md

# Simple transformers for classification and regression using SMILES string input

## Introduction

The ST1 benchmark represents two versions of a simple transformer: one performs regression and the other classification. We chose the transformer architecture to see whether we could train directly on SMILES strings. This benchmark brings novel capability to the suite of Pilot1 benchmarks in two ways. First, the featurization of a small molecule is simply its SMILES string. Second, the model is based on the Transformer architecture, albeit a much simpler version of the large Transformer models that train on billions of parameters or more.
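
The core of any Transformer layer, including the small ones used here, is scaled dot-product attention over the token embeddings of the input string. As a minimal NumPy illustration of that operation (not the benchmark's actual implementation, which is in `smiles_transformer.py`):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- the core Transformer operation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise token similarities
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Tiny example: 3 "tokens" with 4-dimensional embeddings, attending to themselves.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
out, w = scaled_dot_product_attention(X, X, X)
```

Each row of `w` is a probability distribution over the input tokens, so every output embedding is a weighted mixture of the value vectors.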

Both the original code and the CANDLE versions are available. The original examples are retained and can be run as noted below. The CANDLE versions make use of the common network design in `smiles_transformer.py`, and implement the models in `sct_baseline_keras2.py` and `srt_baseline_keras2.py`, for classification and regression, respectively.

The example classification problem takes SMILES strings as input and trains a model to predict whether or not a compound is 'drug-like' based on the Lipinski criteria. The example regression problem takes SMILES strings as input and trains a model to predict the molecular weight. The data are freely available and are downloaded automatically by the CANDLE versions.

For the CANDLE versions, all the relevant arguments are contained in the respective default model files, and all variables can be overwritten from the command line. The datasets will be automatically downloaded and stored in the `../../Data/Pilot1` directory. The respective default model files and commands to invoke the classifier and regressor are:
```
class_default_model.txt
python sct_baseline_keras2.py
```
and
```
regress_default_model.txt
python srt_baseline_keras2.py
```
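
The default model files are plain ini-style text. As a minimal sketch of how such a file can be read (the section name and keys below are illustrative stand-ins, not the full contents of `class_default_model.txt`):

```python
import configparser

# Toy stand-in for a default model file; real files carry many more keys.
example = """
[Global_Params]
epochs = 10
batch_size = 32
"""

config = configparser.ConfigParser()
config.read_string(example)
params = dict(config["Global_Params"])  # e.g. {'epochs': '10', 'batch_size': '32'}
```

Any such key can then be overridden on the command line when invoking the baseline script.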

## Running the original versions

The original code demonstrating a simple transformer regressor and a simple transformer classifier is available as
```
smiles_regress_transformer.py
```
and
```
smiles_class_transformer.py
```

The example data sets are the same as for the CANDLE versions, and allow one to predict whether a small molecule is "drug-like" based on the Lipinski criteria (classification) or to predict the molecular weight (regression) from a SMILES string as input. The data sets can be downloaded using the information in the `regress_default_model.txt` or `class_default_model.txt` files; for the original versions, these data files must be downloaded manually and specified on the command line.
2832

29-
See the log files for trace.
33+
```
34+
# for regression
35+
train_data = https://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Examples/xform-smiles-data/chm.weight.trn.csv
36+
val_data = https://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Examples/xform-smiles-data/chm.weight.val.csv
3037
31-
We save the best validation loss in the *.h5 dumps.
3238
39+
# for classification
40+
train_data = https://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Examples/xform-smiles-data/chm.lipinski.trn.csv
41+
val_data = https://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Examples/xform-smiles-data/chm.lipinski.val.csv
42+
```
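
A minimal sketch of fetching these files by hand with the Python standard library (the helper names are illustrative, not part of the benchmark):

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve

URLS = [
    "https://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Examples/xform-smiles-data/chm.weight.trn.csv",
    "https://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Examples/xform-smiles-data/chm.weight.val.csv",
]

def local_name(url):
    """Derive the local file name from a download URL."""
    return os.path.basename(urlparse(url).path)

def fetch(url, dest_dir="."):
    """Download url into dest_dir (skipping files already present);
    return the local path to pass on the command line."""
    local = os.path.join(dest_dir, local_name(url))
    if not os.path.exists(local):
        urlretrieve(url, local)
    return local
```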

To run the models:
```
CUDA_VISIBLE_DEVICES=1 python smiles_class_transformer.py --in_train chm.lipinski.trn.csv --in_vali chm.lipinski.val.csv --ep 25
```
or
```
CUDA_VISIBLE_DEVICES=0 python smiles_regress_transformer.py --in_train chm.weight.trn.csv --in_vali chm.weight.val.csv --ep 25
```

The model with the best validation loss is saved in the `.h5` dumps. Log files contain the trace. Regression output should look something like this:
```
Epoch 1/25
2022-03-21 12:53:11.402337: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
...
Epoch 25/25
Epoch 00025: val_loss did not improve from 800.85254
```

## Background on the example classification problem

CHEMBL -- 1.5M training examples with Lipinski (1/0) labels (the Lipinski criteria for drug-likeness); validation uses 100K non-overlapping samples.

Classification validation accuracy is about 91% after 10-20 epochs.
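
The Lipinski "rule of five" behind those 1/0 labels can be stated compactly. A minimal sketch, assuming the strict form in which all four conditions must hold (the benchmark's exact labeling rule is not specified here, and descriptor values would come from a chemistry toolkit such as RDKit, not from this code):

```python
def is_drug_like(mol_weight, log_p, h_donors, h_acceptors):
    """Lipinski criteria: molecular weight <= 500 Da, logP <= 5,
    <= 5 H-bond donors, and <= 10 H-bond acceptors."""
    return (mol_weight <= 500 and log_p <= 5
            and h_donors <= 5 and h_acceptors <= 10)

# Aspirin: MW ~180 Da, logP ~1.2, 1 donor, 4 acceptors -> drug-like
print(is_drug_like(180.2, 1.2, 1, 4))  # True
```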

## Background on the example regression problem

CHEMBL -- 1.5M training examples (shuffled and resampled, so not the same 1.5M as for classification), predicting molecular weight; validation is also 100K non-overlapping samples.

The regression problem achieves R^2 of about 0.95 after ~20 epochs.
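
R^2 here is the usual coefficient of determination. A minimal NumPy sketch of how it can be computed from model predictions (illustrative, not the benchmark's own evaluation code):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect fit -> 1.0
```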
