Commit f2eb2cf

Alan Christie (alanbchristie) and a co-author

- First cut travis file (#36)
- Adds build badges

Co-authored-by: Alan Christie <[email protected]>
1 parent 56d7e0e commit f2eb2cf

File tree

2 files changed: +132 additions, -68 deletions

.travis.yml

Lines changed: 69 additions & 8 deletions
```diff
@@ -1,11 +1,72 @@
-os:
-- linux
+---
 
-language: python
+# -----------------
+# Control variables (Travis Settings)
+# -----------------
+#
+# PUBLISH_IMAGES   Should be 'yes' to enable publishing to Docker Hub.
+#
+# If you set PUBLISH_IMAGES you must also set the following: -
+#
+# DOCKER_USERNAME  If PUBLISH_IMAGES is 'yes'
+# DOCKER_PASSWORD  If PUBLISH_IMAGES is 'yes'
 
-python:
-- 2.7
-- 3.6
+os: linux
+services:
+- docker
 
-script:
-- pip install -e src/python
+stages:
+- name: publish latest
+  if: |
+    branch = master \
+    AND env(PUBLISH_IMAGES) = yes
+- name: publish tag
+  if: |
+    tag IS present \
+    AND env(PUBLISH_IMAGES) = yes
+- name: publish stable
+  if: |
+    tag IS present \
+    AND tag =~ ^([0-9]+\.){1,2}[0-9]+$ \
+    AND env(PUBLISH_IMAGES) = yes
+
+before_script:
+- docker login -u="$DOCKER_USERNAME" -p="$DOCKER_PASSWORD"
+
+jobs:
+  include:
+
+  # Publish-stage jobs...
+  # Every successful master build results in a latest image (above)
+  # and every tag results in a tagged image in Docker Hub.
+  # Tags that match a RegEx are considered 'official' tags
+  # and also result in a 'stable' image tag.
+
+  - stage: publish latest
+    name: Test and Latest Image
+    script:
+    # Build and push the pipelines-rdkit image and its sd-poster
+    - docker build -t informaticsmatters/rdkit_pipelines:latest -f Dockerfile-rdkit .
+    - docker push informaticsmatters/rdkit_pipelines:latest
+    - docker build -t squonk/rdkit-pipelines-sdposter:latest -f Dockerfile-sdposter .
+    - docker push squonk/rdkit-pipelines-sdposter:latest
+
+  - stage: publish tag
+    name: Tagged Image
+    script:
+    # Build and push the pipelines-rdkit image and its sd-poster
+    - docker build -t informaticsmatters/rdkit_pipelines:${TRAVIS_TAG} -f Dockerfile-rdkit .
+    - docker push informaticsmatters/rdkit_pipelines:${TRAVIS_TAG}
+    - docker build -t squonk/rdkit-pipelines-sdposter:${TRAVIS_TAG} -f Dockerfile-sdposter .
+    - docker push squonk/rdkit-pipelines-sdposter:${TRAVIS_TAG}
+
+  - stage: publish stable
+    name: Stable Image
+    script:
+    # Pull the corresponding pipelines-rdkit image tag and push it as 'stable'
+    - docker pull informaticsmatters/rdkit_pipelines:${TRAVIS_TAG}
+    - docker tag informaticsmatters/rdkit_pipelines:${TRAVIS_TAG} informaticsmatters/rdkit_pipelines:stable
+    - docker push informaticsmatters/rdkit_pipelines:stable
+    - docker pull squonk/rdkit-pipelines-sdposter:${TRAVIS_TAG}
+    - docker tag squonk/rdkit-pipelines-sdposter:${TRAVIS_TAG} squonk/rdkit-pipelines-sdposter:stable
+    - docker push squonk/rdkit-pipelines-sdposter:stable
```

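The `publish stable` stage above is gated on the tag matching `^([0-9]+\.){1,2}[0-9]+$`, i.e. `X.Y` or `X.Y.Z` style version tags. A minimal sketch of which tags that pattern accepts, using Python's `re` purely for illustration (assuming Travis's `=~` condition follows ordinary regular-expression semantics):

```python
import re

# Tag pattern from the 'publish stable' stage condition: one or two
# dot-separated numeric groups followed by a final number (X.Y or X.Y.Z).
STABLE_TAG = re.compile(r'^([0-9]+\.){1,2}[0-9]+$')

def is_stable_tag(tag):
    """Return True if the tag would also be published as 'stable'."""
    return STABLE_TAG.match(tag) is not None

if __name__ == '__main__':
    for tag in ['1.0', '2.3.4', '10.20.30', 'v1.0', '1', '1.0.0-rc1']:
        print(tag, is_stable_tag(tag))
```

Note that pre-release or prefixed tags (`v1.0`, `1.0.0-rc1`) still get a tagged image from the `publish tag` stage, but are not promoted to `stable`.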
README.md

Lines changed: 63 additions & 60 deletions
````diff
@@ -1,18 +1,21 @@
-# Piplelines.
+# Pipelines
 
-The project experiments with ways to generate data processing piplelines.
-The aim is to generate some re-usable building blocks that can be piped
+[![Build Status](https://travis-ci.com/InformaticsMatters/pipelines.svg?branch=master)](https://travis-ci.com/InformaticsMatters/pipelines)
+![GitHub release (latest SemVer including pre-releases)](https://img.shields.io/github/v/release/informaticsmatters/pipelines?include_prereleases)
+
+The project experiments with ways to generate data processing piplelines.
+The aim is to generate some re-usable building blocks that can be piped
 together into more functional pipelines. Their prime initial use is as executors
 for the Squonk Computational Notebook (http://squonk.it) though it is expected
 that they will have uses in other environments.
 
 As well as being executable directly they can also be executed in Docker
-containers (separately or as a single pipeline). Additionally they can be
-executed using Nextflow (http://nextflow.io) to allow running large jobs
+containers (separately or as a single pipeline). Additionally they can be
+executed using Nextflow (http://nextflow.io) to allow running large jobs
 on HPC-like environments.
 
-Currently it has some python scripts using RDKit (http://rdkit.org) to provide
-basic cheminformatics and comp chem functionality, though other tools will
+Currently it has some python scripts using RDKit (http://rdkit.org) to provide
+basic cheminformatics and comp chem functionality, though other tools will
 be coming soon, including some from the Java ecosystem.
 
 * See [here](src/python/pipelines/rdkit/README.md) for more info on the RDKit components.
@@ -31,11 +34,11 @@ In Jan 2018 some of the core functionality from this repository was broken out i
 
 ### Modularity
 
-Each component should be small but useful. Try to split complex tasks into
+Each component should be small but useful. Try to split complex tasks into
 reusable steps. Think how the same steps could be used in other workflows.
 Allow parts of one component to be used in another component where appropriate
-but avoid over use. For example see the use of functions in rdkit/conformers.py
-to generate conformers in o3dAlign.py
+but avoid over use. For example see the use of functions in rdkit/conformers.py
+to generate conformers in o3dAlign.py
 
 ### Consistency
 
@@ -50,101 +53,101 @@ Generally use consistent coding styles e.g. PEP8 for Python.
 
 ## Input and output formats
 
-We aim to provide consistent input and output formats to allow results to be
-passed between different implementations. Currently all implementations handle
+We aim to provide consistent input and output formats to allow results to be
+passed between different implementations. Currently all implementations handle
 chemical structures so SD file would typically be used as the lowest common
-denominator interchange format, but implementations should also try to support
+denominator interchange format, but implementations should also try to support
 Squonk's JSON based Dataset formats, which potentially allow richer representations
-and can be used to describe data other than chemical structures.
-The utils.py module provides helper methods to handle IO.
+and can be used to describe data other than chemical structures.
+The utils.py module provides helper methods to handle IO.
 
 ### Thin output
-
+
 In addition implementations are encouraged to support "thin" output formats
-where this is appropriate. A "thin" representation is a minimal representation
+where this is appropriate. A "thin" representation is a minimal representation
 containing only what is new or changed, and can significantly reduce the bandwith
-used and avoid the need for the consumer to interpret values it does not
-need to understand. It is not always appropriate to support thin format output
+used and avoid the need for the consumer to interpret values it does not
+need to understand. It is not always appropriate to support thin format output
 (e.g. when the structure is changed by the process).
 
-In the case of SDF thin format involves using an empty molecule for the molecule
-block and all properties that were present in the input or were generated by the
-process (the empty molecule is used so that the SDF syntax remains valid).
+In the case of SDF thin format involves using an empty molecule for the molecule
+block and all properties that were present in the input or were generated by the
+process (the empty molecule is used so that the SDF syntax remains valid).
 
-In the case of Squonk JSON output the thin output would be of type BasicObject
-(e.g. containing no structure information) and include all properties that
-were present in the input or were generated by the process.
+In the case of Squonk JSON output the thin output would be of type BasicObject
+(e.g. containing no structure information) and include all properties that
+were present in the input or were generated by the process.
 
-Implicit in this is that some identifier (usually a SD file property, or
-the JSON UUID property) that is present in the input is included in the output so
-that the full results can be "reassembled" by the consumer of the output.
-The input would typically only contain additional information that is required
+Implicit in this is that some identifier (usually a SD file property, or
+the JSON UUID property) that is present in the input is included in the output so
+that the full results can be "reassembled" by the consumer of the output.
+The input would typically only contain additional information that is required
 for execution of the process e.g. the structure.
 
-For consistency implementations should try to honor these command line
+For consistency implementations should try to honor these command line
 switches relating to input and output:
 
--i and --input: For specifying the location of the single input. If not specified
-then STDIN should be used. File names ending with .gz should be interpreted as
-gzipped files. Input on STDIN should not be gzipped.
+-i and --input: For specifying the location of the single input. If not specified
+then STDIN should be used. File names ending with .gz should be interpreted as
+gzipped files. Input on STDIN should not be gzipped.
 
--if and --informat: For specifying the input format where it cannot be inferred
+-if and --informat: For specifying the input format where it cannot be inferred
 from the file name (e.g. when using STDIN). Values would be sdf or json.
 
 -o and --output: For specifying the base name of the ouputs (there could be multiple
 output files each using the same base name but with a different file extension.
-If not specified then STDOUT should be used. Output file names ending with
-.gz should be compressed using gzip. Output on STDOUT would not be gzipped.
+If not specified then STDOUT should be used. Output file names ending with
+.gz should be compressed using gzip. Output on STDOUT would not be gzipped.
 
--of and --outformat: For specifying the output format where it cannot be inferred
+-of and --outformat: For specifying the output format where it cannot be inferred
 from the file name (e.g. when using STDOUT). Values would be sdf or json.
-
---meta: Write additional metadata and metrics (mostly relevant to Squonk's
+
+--meta: Write additional metadata and metrics (mostly relevant to Squonk's
 JSON format - see below). Default is not to write.
 
 --thin: Write output in thin format (only present where this makes sense).
 Default is not to use thin format.
 
 ### UUIDs
 
-The JSON format for input and oputput makes heavy use of UUIDs that uniquely
-identify each structure. Generally speaking, if the structure is not changed
-(e.g. properties are just being added to input structures) then the existing
+The JSON format for input and oputput makes heavy use of UUIDs that uniquely
+identify each structure. Generally speaking, if the structure is not changed
+(e.g. properties are just being added to input structures) then the existing
 UUID should be retained so that UUIDs in the output match those from the input.
 However if new structures are being generated (e.g. in reaction enumeration
 or conformer generation) then new UUIDs MUST be generated as there is no longer
 a straight relationship between the input and output structures. Instead you
-probably want to store the UUID of the source structure(s) as a field(s) in
+probably want to store the UUID of the source structure(s) as a field(s) in
 the output. To allow correlation of the outputs to the inputs (e.g. for conformer
-generation output the source molecule UUID as a field so that each conformer
+generation output the source molecule UUID as a field so that each conformer
 identifies which source molecule it was derived from.
 
 When not using JSON format the need to handle UUIDs does not necessarily apply
-(though if there is a field named 'uuid' in the input it will be respected accordingly).
+(though if there is a field named 'uuid' in the input it will be respected accordingly).
 To accommodate this you are recommended to ALSO specify the input molecule number
 (1 based index) as an output field independent of whether UUIDs are being handled
 as a "poor man's" approach to correlating the outputs to the inputs.
 
 ### Filtering
 
-When a service that filters molecules special attention is needed to ensure
+When a service that filters molecules special attention is needed to ensure
 that the molecules are output in the same order as the input (obviously skipping
 structures that are filtered out). Also the service descriptor (.dsd.json) file needs special care. For
 instance take a look at the "thinDescriptors" section of src/pipelines/rdkit/screen.dsd.json
 
-When using multi-threaded execution this is especially important as results
+When using multi-threaded execution this is especially important as results
 will usually not come back in exactly the same order as the input.
 
 ### Metrics
 
 To provide information about what happened you are strongly recommended to generate
-a metrics output file (e.g. output_metrics.txt). This file allows to provide
+a metrics output file (e.g. output_metrics.txt). This file allows to provide
 feedback about what happened. The contents of this file are fairly simple,
 each line having a
 
 `key=value`
 
-syntax. Keys beginning and ending with __ (2 underscores) have magical meaning.
+syntax. Keys beginning and ending with __ (2 underscores) have magical meaning.
 All other keys are treated as metrics that are recorded against that execution.
 The current magical values that are recognised are:
 
@@ -161,43 +164,43 @@ PLI=360
 
 ```
 
-It defines the input and output counts and specifies that 360 PLI 'units'
+It defines the input and output counts and specifies that 360 PLI 'units'
 should be recorded as being consumed during execution.
 
-The purpose of the metrics is primarily to be able to chage for utilisation, but
+The purpose of the metrics is primarily to be able to chage for utilisation, but
 even if not charging (which is often the case) then it is still good practice
 to record the utilisation.
 
 ### Metadata
 
 Squonk's JSON format requires additional metadata to allow proper handling
-of the JSON. Writing detailed metadata is optional, but recommended. If
-not present then Squonk will use a minimal representation of metadata, but
+of the JSON. Writing detailed metadata is optional, but recommended. If
+not present then Squonk will use a minimal representation of metadata, but
 it's recommended to provide this directly so that additional information can
-be added.
+be added.
 
 At the very minimum Squonk needs to know the type of dataset (e.g. MoleculeObject
 or BasicObject), but this should be handled for you automatically if you use
 the utils.default_open_output* methods. Better though to also specify metadata for
-the field types when you do this. See e.g. conformers.py for an example of
+the field types when you do this. See e.g. conformers.py for an example of
 how to do this.
 
 ## Deployment to Squonk
 
-The service descriptors need to to POSTed to the Squonk coreservices REST API.
+The service descriptors need to to POSTed to the Squonk coreservices REST API.
 
 ### Docker
 
 A shell script can be used to deploy the pipelines to a running
 containerised Squonk deployment: -
 
     $ ./post-service-descriptors.sh
-
+
 ### OpenShift/OKD
 
 The pipelines and service-descriptor container images are built using gradle
 in this project. The are deployed from the Squonk project using Ansible
-playbooks.
+playbooks.
 
 > A discussion about the deployment of pipelines can be found in the
 `Posting Squonk pipelines` section of Squonk's OpenShift Ansible
@@ -224,7 +227,7 @@ Set your `PYTHONPATH` environment variable to include the `pipelines-utils` and
 (adjusting `/path/to/` to whatever is needed):
 ```
 export PYTHONPATH=/path/to/pipelines-utils/src/python:/path/to/pipelines-utils-rdkit/src/python
-```
+```
 
 Run tests:
 ```
@@ -233,7 +236,7 @@ Run tests:
 
 ## Contact
 
-Any questions contact:
+Any questions contact:
 
 Tim Dudgeon
 
````
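The README's Metrics section describes a simple line-oriented `key=value` file in which keys wrapped in two underscores are "magical" and every other key is an ordinary metric. A minimal sketch of a parser for that format (the helper name `parse_metrics` and the `__InputCount__`/`__OutputCount__` key names are illustrative assumptions, not the project's actual API; the `PLI=360` metric is taken from the README's own example):

```python
def parse_metrics(text):
    """Split metrics-file text into (magic, metrics) dicts.

    Keys beginning AND ending with '__' (two underscores) are the
    'magical' keys; everything else is an ordinary metric recorded
    against the execution.
    """
    magic, metrics = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or '=' not in line:
            continue  # skip blank and malformed lines
        key, value = line.split('=', 1)
        if key.startswith('__') and key.endswith('__'):
            magic[key] = value
        else:
            metrics[key] = value
    return magic, metrics

if __name__ == '__main__':
    sample = "__InputCount__=100\n__OutputCount__=56\nPLI=360\n"
    magic, metrics = parse_metrics(sample)
    print(magic)    # the two double-underscore keys
    print(metrics)  # {'PLI': '360'}
```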