# Pipelines

[![Build Status](https://travis-ci.com/InformaticsMatters/pipelines.svg?branch=master)](https://travis-ci.com/InformaticsMatters/pipelines)
![GitHub release (latest SemVer including pre-releases)](https://img.shields.io/github/v/release/informaticsmatters/pipelines?include_prereleases)

The project experiments with ways to generate data processing pipelines.
The aim is to generate some re-usable building blocks that can be piped
together into more functional pipelines. Their primary initial use is as executors
for the Squonk Computational Notebook (http://squonk.it), though it is expected
that they will have uses in other environments.

As well as being executable directly they can also be executed in Docker
containers (separately or as a single pipeline). Additionally, they can be
executed using Nextflow (http://nextflow.io) to allow running large jobs
on HPC-like environments.

Currently it has some Python scripts using RDKit (http://rdkit.org) to provide
basic cheminformatics and comp chem functionality, though other tools will
be coming soon, including some from the Java ecosystem.

* See [here](src/python/pipelines/rdkit/README.md) for more info on the RDKit components.

In Jan 2018 some of the core functionality from this repository was broken out into
the separate pipelines-utils repository.

### Modularity

Each component should be small but useful. Try to split complex tasks into
reusable steps. Think how the same steps could be used in other workflows.
Allow parts of one component to be used in another component where appropriate,
but avoid overuse. For example, see the use of functions in rdkit/conformers.py
to generate conformers in o3dAlign.py, as sketched below.
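
As a sketch of that pattern (function and module layout here are illustrative, not
the actual API), a shared conformer helper might be reused like this:

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def generate_conformers(mol, num_confs=10):
    # Shared helper in the spirit of conformers.py: embed multiple
    # 3D conformers on a molecule and return it.
    mol = Chem.AddHs(mol)
    AllChem.EmbedMultipleConfs(mol, numConfs=num_confs, randomSeed=42)
    return mol


# A second component (in the spirit of o3dAlign.py) reuses the helper
# instead of re-implementing the embedding logic.
mol = generate_conformers(Chem.MolFromSmiles('OCC(O)CO'))
print(mol.GetNumConformers())
```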

### Consistency

Generally use consistent coding styles, e.g. PEP8 for Python.

## Input and output formats

We aim to provide consistent input and output formats to allow results to be
passed between different implementations. Currently all implementations handle
chemical structures, so SD file would typically be used as the lowest common
denominator interchange format, but implementations should also try to support
Squonk's JSON-based Dataset formats, which potentially allow richer representations
and can be used to describe data other than chemical structures.
The utils.py module provides helper methods to handle IO.
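
For orientation, a single record in Squonk's JSON Dataset format has roughly the
following shape (a sketch only; the uuid and values shown are made up, and the
exact schema is defined by Squonk, not here):

```
{
  "uuid": "0e8dbd49-ba42-4bdb-a3ec-2d0cbcbd8ea4",
  "format": "smiles",
  "source": "OCC(O)CO",
  "values": {
    "logp": -1.8
  }
}
```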

### Thin output

In addition, implementations are encouraged to support "thin" output formats
where this is appropriate. A "thin" representation is a minimal representation
containing only what is new or changed, and can significantly reduce the bandwidth
used and avoid the need for the consumer to interpret values it does not
need to understand. It is not always appropriate to support thin format output
(e.g. when the structure is changed by the process).

In the case of SDF, thin format involves using an empty molecule for the molecule
block plus all properties that were present in the input or were generated by the
process (the empty molecule is used so that the SDF syntax remains valid).
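
For illustration, a thin SDF record could look like this (the property names and
values are made up; the three header lines are blank and the counts line declares
zero atoms and bonds so the record stays syntactically valid):

```



  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
>  <UUID>
0e8dbd49-ba42-4bdb-a3ec-2d0cbcbd8ea4

>  <LogP>
-1.8

$$$$
```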

In the case of Squonk JSON output, the thin output would be of type BasicObject
(i.e. containing no structure information) and include all properties that
were present in the input or were generated by the process.
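
Continuing the sketch above, the corresponding thin record would keep the uuid
and values but drop the structure:

```
{
  "uuid": "0e8dbd49-ba42-4bdb-a3ec-2d0cbcbd8ea4",
  "values": {
    "logp": -1.8
  }
}
```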

Implicit in this is that some identifier (usually an SD file property, or
the JSON UUID property) that is present in the input is included in the output so
that the full results can be "reassembled" by the consumer of the output.
The input would typically only contain additional information that is required
for execution of the process, e.g. the structure.

For consistency, implementations should try to honour these command line
switches relating to input and output:

-i and --input: For specifying the location of the single input. If not specified
then STDIN should be used. File names ending with .gz should be interpreted as
gzipped files. Input on STDIN should not be gzipped.

-if and --informat: For specifying the input format where it cannot be inferred
from the file name (e.g. when using STDIN). Values would be sdf or json.

-o and --output: For specifying the base name of the outputs (there could be multiple
output files, each using the same base name but with a different file extension).
If not specified then STDOUT should be used. Output file names ending with
.gz should be compressed using gzip. Output on STDOUT would not be gzipped.

-of and --outformat: For specifying the output format where it cannot be inferred
from the file name (e.g. when using STDOUT). Values would be sdf or json.

--meta: Write additional metadata and metrics (mostly relevant to Squonk's
JSON format - see below). Default is not to write.

--thin: Write output in thin format (only present where this makes sense).
Default is not to use thin format.
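
A minimal sketch of these switches using Python's argparse (the gzip/STDIN
handling follows the conventions above; everything else is illustrative):

```python
import argparse
import gzip
import sys


def create_parser():
    parser = argparse.ArgumentParser(description='Example pipeline component')
    parser.add_argument('-i', '--input', help='Input file; STDIN if not specified')
    parser.add_argument('-if', '--informat', choices=['sdf', 'json'],
                        help='Input format where it cannot be inferred')
    parser.add_argument('-o', '--output', help='Output base name; STDOUT if not specified')
    parser.add_argument('-of', '--outformat', choices=['sdf', 'json'],
                        help='Output format where it cannot be inferred')
    parser.add_argument('--meta', action='store_true',
                        help='Write additional metadata and metrics')
    parser.add_argument('--thin', action='store_true',
                        help='Write output in thin format')
    return parser


def open_input(name):
    if name is None:
        return sys.stdin.buffer       # input on STDIN is never gzipped
    if name.endswith('.gz'):
        return gzip.open(name, 'rb')  # .gz names are read as gzipped files
    return open(name, 'rb')


if __name__ == '__main__':
    args = create_parser().parse_args()
    with open_input(args.input) as source:
        for line in source:
            pass  # process records here
```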

### UUIDs

The JSON format for input and output makes heavy use of UUIDs that uniquely
identify each structure. Generally speaking, if the structure is not changed
(e.g. properties are just being added to input structures) then the existing
UUID should be retained so that UUIDs in the output match those from the input.
However, if new structures are being generated (e.g. in reaction enumeration
or conformer generation) then new UUIDs MUST be generated as there is no longer
a straight relationship between the input and output structures. Instead you
probably want to store the UUID of the source structure(s) as a field (or fields) in
the output to allow correlation of the outputs to the inputs (e.g. for conformer
generation, output the source molecule UUID as a field so that each conformer
identifies which source molecule it was derived from).
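
A sketch of that convention (field names are illustrative): each derived structure
gets a fresh UUID and records its parent's UUID so outputs can be correlated with
inputs:

```python
import uuid


def conformer_records(parent, conformers):
    for conf in conformers:
        yield {
            'uuid': str(uuid.uuid4()),  # new structure, so a new UUID
            'source': conf,
            'values': {
                # illustrative field recording the source molecule
                'SourceMolUUID': parent['uuid'],
            },
        }


parent = {'uuid': str(uuid.uuid4()), 'source': 'OCC(O)CO'}
for record in conformer_records(parent, ['conf1', 'conf2']):
    assert record['values']['SourceMolUUID'] == parent['uuid']
```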

When not using JSON format the need to handle UUIDs does not necessarily apply
(though if there is a field named 'uuid' in the input it will be respected accordingly).
To accommodate this you are recommended to ALSO specify the input molecule number
(1-based index) as an output field, independent of whether UUIDs are being handled,
as a "poor man's" approach to correlating the outputs to the inputs.

### Filtering

When a service filters molecules, special attention is needed to ensure
that the molecules are output in the same order as the input (obviously skipping
structures that are filtered out). Also, the service descriptor (.dsd.json) file needs
special care. For instance, take a look at the "thinDescriptors" section of
src/pipelines/rdkit/screen.dsd.json.

When using multi-threaded execution this is especially important as results
will usually not come back in exactly the same order as the input.
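
One simple way to keep output aligned with input order under multi-threaded
execution is to let the pool preserve ordering for you; for instance, Python's
executor.map yields results in submission order (the predicate here is a
stand-in for a real screen):

```python
from concurrent.futures import ThreadPoolExecutor


def passes_filter(record):
    # Stand-in predicate; a real service would score a molecule here.
    return record % 2 == 0


records = list(range(10))

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns results in input order, regardless of which worker
    # finishes first, so the filtered output stays aligned with the input.
    flags = pool.map(passes_filter, records)
    kept = [r for r, keep in zip(records, flags) if keep]

print(kept)  # [0, 2, 4, 6, 8] - input order preserved
```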

### Metrics

To provide information about what happened you are strongly recommended to generate
a metrics output file (e.g. output_metrics.txt). This file allows you to provide
feedback about what happened. The contents of this file are fairly simple,
each line having a

`key=value`

syntax. Keys beginning and ending with __ (2 underscores) have magical meaning.
All other keys are treated as metrics that are recorded against that execution.
The current magical values that are recognised are:

```
__InputCount__=1000
__OutputCount__=300
PLI=360
```

It defines the input and output counts and specifies that 360 PLI 'units'
should be recorded as being consumed during execution.

The purpose of the metrics is primarily to be able to charge for utilisation, but
even if not charging (which is often the case) it is still good practice
to record the utilisation.
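
A sketch of writing such a file (the two magic keys shown follow the example
above; check what your deployment actually recognises):

```python
def write_metrics(path, metrics):
    # One key=value pair per line, as described above.
    with open(path, 'w') as metrics_file:
        for key, value in metrics.items():
            metrics_file.write('{}={}\n'.format(key, value))


write_metrics('output_metrics.txt', {
    '__InputCount__': 1000,   # magic key for the input count
    '__OutputCount__': 300,   # magic key for the output count
    'PLI': 360,               # ordinary metric, recorded as consumed units
})
```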

### Metadata

Squonk's JSON format requires additional metadata to allow proper handling
of the JSON. Writing detailed metadata is optional, but recommended. If
not present then Squonk will use a minimal representation of metadata, but
it's recommended to provide this directly so that additional information can
be added.

At the very minimum Squonk needs to know the type of dataset (e.g. MoleculeObject
or BasicObject), but this should be handled for you automatically if you use
the utils.default_open_output* methods. Better still, also specify metadata for
the field types when you do this. See e.g. conformers.py for an example of
how to do this.
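
As a rough sketch of the kind of information involved (field names here are
illustrative rather than the exact Squonk schema), the metadata identifies the
dataset type and the types of the fields:

```
{
  "type": "MoleculeObject",
  "size": 300,
  "valueClassMappings": {
    "logp": "java.lang.Float"
  }
}
```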

## Deployment to Squonk

The service descriptors need to be POSTed to the Squonk coreservices REST API.

### Docker

A shell script can be used to deploy the pipelines to a running
containerised Squonk deployment:

    $ ./post-service-descriptors.sh
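
The script wraps something like the following (the endpoint URL is a placeholder;
the script knows the real one for your deployment):

```
curl -X POST \
    -H "Content-Type: application/json" \
    --data-binary @src/pipelines/rdkit/screen.dsd.json \
    http://localhost:8080/coreservices/rest/v1/services
```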

### OpenShift/OKD

The pipelines and service-descriptor container images are built using gradle
in this project. They are deployed from the Squonk project using Ansible
playbooks.

> A discussion about the deployment of pipelines can be found in the
`Posting Squonk pipelines` section of Squonk's OpenShift Ansible

Set your `PYTHONPATH` environment variable to include the `pipelines-utils` and
`pipelines-utils-rdkit` source directories
(adjusting `/path/to/` to whatever is needed):
```
export PYTHONPATH=/path/to/pipelines-utils/src/python:/path/to/pipelines-utils-rdkit/src/python
```

Run tests:
```
```

## Contact

Any questions contact:

Tim Dudgeon