Commit 5a4dc28

Initial commit of the open-source repo
0 parents  commit 5a4dc28

48 files changed: 18437 additions and 0 deletions

LICENSE

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
MIT License

Copyright (c) 2018-present The OpenGNN Authors.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

OPENNMT.LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2017-present The OpenNMT Authors.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
# OpenGNN

OpenGNN is a machine learning library for learning over graph-structured data. It was built with generality in mind and supports tasks such as:

* graph regression
* graph-to-sequence mapping

It supports various graph encoders including GGNNs, GCNs, SequenceGNNs and other variations of [neural graph message passing](https://arxiv.org/pdf/1704.01212.pdf).

This library's design and usage patterns are inspired by [OpenNMT](https://github.com/OpenNMT/OpenNMT-tf), and it uses the recent [Dataset](https://www.tensorflow.org/programmers_guide/datasets) and [Estimator](https://www.tensorflow.org/programmers_guide/estimators) APIs.

## Installation

OpenGNN requires

* Python (>= 3.5)
* TensorFlow (>= 1.10, < 2.0)

To install the library as well as the command-line entry points, run

``` pip install -e .```

## Getting Started

To experiment with the library, you can use one of the datasets provided in the [data](/data) folder.
For example, to experiment with the chemical dataset, first install the `rdkit` library, which
can be obtained by running `conda install -c rdkit rdkit`.
Then, in the [data/chem](/data/chem) folder, run `python get_data.py` to download the dataset.

After getting the data, generate node and edge vocabularies for it using
```bash
ognn-build-vocab --field_name node_labels --save_vocab node.vocab \
    molecules_graphs_train.jsonl
ognn-build-vocab --no_pad_token --field_name edges --string_index 0 --save_vocab edge.vocab \
    molecules_graphs_train.jsonl
```
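
The internals of `ognn-build-vocab` are not shown here, so the following is only a minimal sketch of what building a node vocabulary from such a file might involve. It assumes each line of the `.jsonl` file is a JSON object with a `node_labels` list, as written by `get_data.py`. The helper name `build_vocab` and the most-frequent-first ordering are assumptions for illustration, not the tool's actual behavior.

```python
import json
from collections import Counter


def build_vocab(lines, field_name="node_labels"):
    """Collect the distinct tokens found under `field_name` (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL input
        record = json.loads(line)
        counts.update(record[field_name])
    # Most frequent tokens first; ties are broken alphabetically for determinism.
    return [tok for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


# Two toy records in the same shape that get_data.py writes.
sample = [
    json.dumps({"node_labels": ["C", "O", "H", "H"], "edges": [["SINGLE", 0, 1]]}),
    json.dumps({"node_labels": ["C", "H", "H", "H", "H"], "edges": []}),
]
vocab = build_vocab(sample)
print(vocab)
```

With a real file, the same function could be called as `build_vocab(open("molecules_graphs_train.jsonl"))`.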

### Command Line

The main entry point to the library is the `ognn-main` command

```bash
ognn-main <run_type> --model_type <model> --config <config_file.yml>
```

Currently there are two run types: `train_and_eval` and `infer`.

For example, to train a model on the previously extracted chemical data
(again inside [data/chem](/data/chem)) using a predefined model in the
catalog

```bash
ognn-main train_and_eval --model_type chemModel --config config.yml
```

You can also define your own model in a custom Python script with a `model` function.
For example, we can train a custom model defined in `model.py` using

```bash
ognn-main train_and_eval --model model.py --config config.yml
```

While the training script doesn't log the training to the standard output,
we can monitor training by using TensorBoard on the model directory defined in
[data/chem/config.yml](data/chem/config.yml).

After training, we can perform inference on the validation file by running

```bash
ognn-main infer --model_type chemModel --config config.yml \
    --features_file molecules_graphs_valid.jsonl \
    --prediction_file molecules_predicted_valid.jsonl
```


Examples of other config files can be found in the [data](/data) folder.

### Library

The library can also be easily integrated in your own code.
The following example shows how to create a GGNN Encoder to encode a batch of random graphs.

```python
import tensorflow as tf
import opengnn as ognn

tf.enable_eager_execution()

# build a batch of graphs with random initial features
# each index is [graph, edge_type, node, node]
edges = tf.SparseTensor(
    indices=[
        [0, 0, 0, 1], [0, 0, 1, 2],
        [1, 0, 0, 0],
        [2, 0, 1, 0], [2, 0, 2, 1], [2, 0, 3, 2], [2, 0, 4, 3]],
    values=[1, 1, 1, 1, 1, 1, 1],
    dense_shape=[3, 1, 5, 5])
node_features = tf.random_uniform((3, 5, 256))
graph_sizes = [3, 1, 5]

encoder = ognn.encoders.GGNNEncoder(1, 256)
outputs, state = encoder(
    edges,
    node_features,
    graph_sizes)

print(outputs)
```

Graphs are represented by a sparse adjacency matrix with dimensionality
`num_edge_types x num_nodes x num_nodes` and an initial distributed representation for each node.
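
To make the `num_edge_types x num_nodes x num_nodes` layout concrete, here is a small sketch that expands an edge list of `(edge_type, node, node)` triples, the shape `get_data.py` writes, into a dense nested list. The helper `to_dense_adjacency` is hypothetical and not part of OpenGNN. Whether the library also adds reverse edges is not stated here, so the sketch stores each edge in one direction only.

```python
def to_dense_adjacency(edges, edge_types, num_nodes):
    """Return adj where adj[t][i][j] == 1 iff an edge of type t links node i to node j."""
    type_index = {name: t for t, name in enumerate(edge_types)}
    adj = [[[0] * num_nodes for _ in range(num_nodes)] for _ in edge_types]
    for edge_type, src, dst in edges:
        adj[type_index[edge_type]][src][dst] = 1
    return adj


# A three-node toy graph with one single bond and one double bond.
edges = [("SINGLE", 0, 1), ("DOUBLE", 1, 2)]
adj = to_dense_adjacency(edges, ["SINGLE", "DOUBLE"], num_nodes=3)
for t, matrix in enumerate(adj):
    print("edge type", t, matrix)
```

A sparse representation stores only the positions of the non-zero entries, which is why the library example passes a list of indices rather than a dense array like this one.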

Similarly to sequences, when batching we need to pad the graphs to the maximum number of nodes of any graph in the batch.
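
As an illustration of that padding step, the sketch below pads per-graph node-feature lists to the size of the largest graph and keeps the true sizes alongside. The helper `pad_batch` and the zero pad value are assumptions for illustration; the library's own batching code may differ.

```python
def pad_batch(batch, pad_value=0.0):
    """Pad each graph's node-feature list to the largest graph in the batch."""
    graph_sizes = [len(nodes) for nodes in batch]
    max_nodes = max(graph_sizes)
    feature_size = len(batch[0][0])
    padded = [
        nodes + [[pad_value] * feature_size for _ in range(max_nodes - len(nodes))]
        for nodes in batch
    ]
    return padded, graph_sizes


# Three graphs with 3, 1 and 5 nodes and 2 features per node.
batch = [[[1.0, 1.0] for _ in range(n)] for n in (3, 1, 5)]
padded, graph_sizes = pad_batch(batch)
print(graph_sizes)
print([len(nodes) for nodes in padded])
```

Keeping `graph_sizes` next to the padded features lets later computation ignore the padding rows.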


## Acknowledgments
The design of the library and implementations are based on
* [OpenNMT-tf](https://github.com/OpenNMT/OpenNMT-tf)
* [Gated Graph Neural Networks](https://github.com/Microsoft/gated-graph-neural-network-samples)

Since most of the code adapted from OpenNMT-tf is spread across multiple files, the license for that
library is located in the [base folder](/OPENNMT.LICENSE) rather than in the headers of the files.

## Reference

If you use this library in your own research, please cite

```
@inproceedings{
    pfernandes2018structsumm,
    title="Structured Neural Summarization",
    author={Patrick Fernandes and Miltiadis Allamanis and Marc Brockschmidt},
    booktitle={Proceedings of the 7th International Conference on Learning Representations (ICLR)},
    year={2019},
    url={https://arxiv.org/abs/1811.01824},
}
```

data/chem/config.yml

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
# Example parameters (this does not cover every parameter)

model_dir: model_dir

data:
  train_graphs_file: molecules_graphs_train.jsonl
  train_labels_file: molecules_labels_train.jsonl

  eval_graphs_file: molecules_graphs_valid.jsonl
  eval_labels_file: molecules_labels_valid.jsonl

  node_vocabulary: node.vocab
  edge_vocabulary: edge.vocab


params:
  learning_rate: 0.001
  param_init: 0.1
  clip_gradients: 1.
  maximum_iterations: 250

train:
  batch_size: 64
  bucket_width: 1
  train_steps: 1000000
  maximum_features_size: 200
  maximum_labels_size: 50
  sample_buffer_size: 10000

data/chem/get_data.py

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
import os
from rdkit import Chem
import glob
import json
import numpy as np

if not os.path.exists('data'):
    os.mkdir('data')
    print('made directory ./data/')

# download the raw QM9 archive once
download_path = os.path.join('data', 'dsgdb9nsd.xyz.tar.bz2')
if not os.path.exists(download_path):
    print('downloading data to %s ...' % download_path)
    source = 'https://ndownloader.figshare.com/files/3195389'
    os.system('wget -O %s %s' % (download_path, source))
    print('finished downloading')

# extract the archive once
unzip_path = os.path.join('data', 'qm9_raw')
if not os.path.exists(unzip_path):
    print('extracting data to %s ...' % unzip_path)
    os.mkdir(unzip_path)
    os.system('tar xvjf %s -C %s' % (download_path, unzip_path))
    print('finished extracting')


def preprocess():
    index_of_mu = 4

    def read_xyz(file_path):
        # the SMILES string is on the second-to-last line, the properties on the second line
        with open(file_path, 'r') as f:
            lines = f.readlines()
            smiles = lines[-2].split('\t')[0]
            properties = lines[1].split('\t')
            mu = float(properties[index_of_mu])
        return {'smiles': smiles, 'mu': mu}

    print('loading train/validation split')
    with open('valid_idx.json', 'r') as f:
        valid_idx = json.load(f)['valid_idxs']
    valid_files = [os.path.join(unzip_path, 'dsgdb9nsd_%s.xyz' % i)
                   for i in valid_idx]

    print('reading data...')
    raw_data = {'train': [], 'valid': []}
    all_files = glob.glob(os.path.join(unzip_path, '*.xyz'))
    for file_idx, file_path in enumerate(all_files):
        if file_idx % 100 == 0:
            print('%.1f %% \r' %
                  (file_idx / float(len(all_files)) * 100), end="")
        if file_path not in valid_files:
            raw_data['train'].append(read_xyz(file_path))
        else:
            raw_data['valid'].append(read_xyz(file_path))

    # normalization statistics are computed on the training split only
    all_mu = [mol['mu'] for mol in raw_data['train']]
    mean_mu = np.mean(all_mu)
    std_mu = np.std(all_mu)

    def normalize_mu(mu):
        return (mu - mean_mu) / std_mu

    def onehot(idx, len):
        z = [0 for _ in range(len)]
        z[idx] = 1
        return z

    bond_dict = {'SINGLE': 0, 'DOUBLE': 1, 'TRIPLE': 2, "AROMATIC": 3}

    def to_graph(smiles):
        # nodes are atom symbols; edges are (bond type, begin atom index, end atom index)
        mol = Chem.MolFromSmiles(smiles)
        mol = Chem.AddHs(mol)
        edges = []
        nodes = []
        for bond in mol.GetBonds():
            edges.append((str(bond.GetBondType()),
                          bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()))
        for atom in mol.GetAtoms():
            nodes.append(atom.GetSymbol())
        return nodes, edges

    print('parsing smiles as graphs...')
    processed_graphs = {'train': [], 'valid': []}
    processed_labels = {'train': [], 'valid': []}
    for section in ['train', 'valid']:
        for i, (smiles, mu) in enumerate([(mol['smiles'], mol['mu']) for mol in raw_data[section]]):
            if i % 100 == 0:
                print('%s: %.1f %% \r' %
                      (section, 100*i/float(len(raw_data[section]))), end="")
            nodes, edges = to_graph(smiles)
            processed_graphs[section].append({
                'edges': edges,
                'node_labels': nodes
            })
            processed_labels[section].append([normalize_mu(mu)])

        print('%s: 100 %% ' % (section))
        # one JSON object per line for graphs, one JSON list per line for labels
        with open('molecules_graphs_%s.jsonl' % section, 'w') as f:
            for graph in processed_graphs[section]:
                f.write(json.dumps(graph) + "\n")
        with open('molecules_labels_%s.jsonl' % section, 'w') as f:
            for label in processed_labels[section]:
                f.write(json.dumps(label) + "\n")


preprocess()
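
The label normalization in `get_data.py` above is a z-score computed from training-split statistics only. The following standalone sketch illustrates that step on toy values, using the standard library's `statistics.pstdev`, which, like `np.std` with its default arguments, computes the population standard deviation. The `denormalize_mu` helper and the toy numbers are assumptions added for illustration and do not appear in the script.

```python
import statistics

# Toy dipole-moment values standing in for the real training and validation labels.
train_mu = [1.0, 2.0, 3.0, 4.0]
valid_mu = [2.5, 5.0]

# Statistics come from the training split only, as in get_data.py.
mean_mu = statistics.mean(train_mu)
std_mu = statistics.pstdev(train_mu)


def normalize_mu(mu):
    return (mu - mean_mu) / std_mu


def denormalize_mu(z):
    """Map a normalized value back to the original units (hypothetical helper)."""
    return z * std_mu + mean_mu


train_z = [normalize_mu(mu) for mu in train_mu]
valid_z = [normalize_mu(mu) for mu in valid_mu]
print(train_z)
print(valid_z)
```

Because the written labels are normalized, values predicted by a model trained on them are in normalized units as well; an inverse such as `denormalize_mu` would be needed to compare them with the raw property.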

data/chem/model.py

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
import opengnn as ognn


def model():
    return ognn.models.GraphRegressor(
        source_inputter=ognn.inputters.GraphEmbedder(
            edge_vocabulary_file_key="edge_vocabulary",
            node_embedder=ognn.inputters.TokenEmbedder(
                vocabulary_file_key="node_vocabulary",
                embedding_size=64)),
        target_inputter=ognn.inputters.FeaturesInputter(),
        encoder=ognn.encoders.GGNNEncoder(
            num_timesteps=[2, 2],
            node_feature_size=64),
        name="chemModelCustom")
