Commit 39ae587

Merge pull request #397 from INCATools/1.2.26-fixes
1.2.26 fixes
2 parents 4c88f99 + 3c759e1 commit 39ae587

File tree

9 files changed

+311
-23
lines changed

Changes.md

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,37 @@
1+
# v1.2.26 (10 February 2021): HOTFIXES
2+
- Hotfixes:
3+
- The new mireot module technique was buggy and is therefore removed again. Sorry; we will try again next time. You can still use the `custom` option to implement mireot yourself!
4+
- A change in the way imports were processed introduced a very high memory footprint for large ontologies and slowed things down. If you do not have a lot of memory (and time!) available, you should use the following new flags: `is_large` and `use_gzipped`. `is_large: TRUE` introduces special handling for the ontology that is faster and consumes less memory when creating an import. Using `use_gzipped` will try to download the ontology from its gzipped location. Make sure it's actually there (we know it's the case for chebi and pr at least)!
5+
```
6+
import_group:
7+
products:
8+
- id: pr
9+
use_gzipped: TRUE
10+
is_large: TRUE
11+
- id: chebi
12+
use_gzipped: TRUE
13+
is_large: TRUE
14+
```
15+
- An irrelevant file (keeprelations.txt) was still generated even if not needed when seeding a new repo.
16+
- Module type `STAR` was accidentally hard-coded as the default for slme. Now changed back to `BOT`, as it was before.
17+
- CI configs were not correctly copied by the update routine. Now they are. Note that for the changes to be picked up, you need to run `sh run.sh make update_repo` twice (once for updating the update script itself)!
18+
- Geeky (but necessary): all phony make goals are now correctly declared as `.PHONY`.
19+
- Some last minute features:
20+
- In new repos, the README.md is now generated with the correct, appropriate banners.
21+
- We now have a new feature, `custom_makefile_header`, that allows injecting a custom header into the Makefile. Most mortals won't need this, but this is how it goes:
22+
```
23+
custom_makefile_header: |
24+
### Workflow
25+
#
26+
# Tasks to edit and release OMRSE.
27+
#
28+
# #### Edit
29+
#
30+
# 1. [Prepare release](prepare_release)
31+
# 2. [Refresh imports](all_imports)
32+
# 3. [Update repo to latest ODK](update_repo)
33+
```
34+
- all features and fixes here: https://github.com/INCATools/ontology-development-kit/pull/397
135

236
# v1.2.26 (2 February 2021)
337
- New versions:

docs/DealWithLargeOntologies.md

Lines changed: 161 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,161 @@
1+
# Dealing with huge ontologies in your import chain
2+
3+
Dealing with very large ontologies, such as Protein Ontology (PR), NCBI Taxonomy (NCBITaxon), Gene Ontology (GO) and CHEBI, is a big challenge when developing ontologies, especially if we want to import and re-use terms from them. There are two major problems:
4+
1. It currently takes about 12-16 GB of memory to process PR and NCBITaxon - memory that many of us do not have available.
5+
2. The files are so large, pulling them over the internet can lead to failures, timeouts and other problems.
6+
7+
There are a few strategies we can employ to deal with the problem of memory consumption:
8+
1. We try to reduce the memory footprint of the import as much as possible. In other words: we try to not do the fancy stuff ODK does by default when extracting a module, and keep it simple.
9+
2. We manage the import manually ourselves (no import)
10+
11+
To deal with file size, we:
12+
1. Instead of importing the whole thing, we import curated subsets.
13+
2. If available, we use gzipped (compressed) versions.
14+
15+
All four strategies will be discussed in the following. We will then look a bit
16+
17+
## Overwrite ODK default: less fancy, custom modules
18+
19+
The default recipe for creating a module looks something like this:
20+
21+
```
22+
imports/%_import.owl: mirror/%.owl imports/%_terms_combined.txt
23+
if [ $(IMP) = true ]; then $(ROBOT) query -i $< --update ../sparql/preprocess-module.ru \
24+
extract -T imports/$*_terms_combined.txt --force true --copy-ontology-annotations true --individuals exclude --method BOT \
25+
query --update ../sparql/inject-subset-declaration.ru --update ../sparql/postprocess-module.ru \
26+
annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi
27+
28+
.PRECIOUS: imports/%_import.owl
29+
```
30+
(Note: This snippet was copied here on 10 February 2021 and may be out of date by the time you read this.)
31+
32+
As you can see, a lot is going on here: first we run some preprocessing (which is really costly in ROBOT, as we need to load the ontology into Jena and then back into the OWL API, so the ontology is effectively loaded three times in total), then extract a module, then run more SPARQL queries, and so on. This is costly. For small ontologies, this is fine. All of these steps are important to mitigate some of the shortcomings of module extraction techniques, but even if they were optimised in ROBOT, it might still not be enough for very large ontologies.
33+
34+
So what we can do now is this. In your `ont.Makefile` (for example, `go.Makefile`, NOT `Makefile`), located in `src/ontology`, you can add a snippet like this:
35+
36+
```
37+
imports/pr_import.owl: mirror/pr.owl imports/pr_terms_combined.txt
38+
if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/pr_terms_combined.txt --force true --method BOT \
39+
annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi
40+
41+
.PRECIOUS: imports/pr_import.owl
42+
```
43+
44+
Note that all the `%` variables and uses of `$*` are replaced by the ontology id in question. Adding this to your `ont.Makefile` will overwrite the default ODK behaviour in favour of this new recipe.
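If you need the same reduced rule for several large ontologies, the substitution described above can be scripted. The following is only a minimal sketch: `large_import_rule` is a hypothetical helper (it is not part of the ODK), and the rule text is simply the reduced recipe from this page with the ontology id filled in.

```python
from string import Template

# Reduced recipe from this page. "$$" escapes a literal "$" for string.Template,
# so Make variables such as $(ROBOT), $< and $@ survive the substitution.
# Recipe lines start with a tab, as Make requires.
RULE = Template(
    "imports/${ont}_import.owl: mirror/${ont}.owl imports/${ont}_terms_combined.txt\n"
    "\tif [ $$(IMP) = true ]; then $$(ROBOT) extract -i $$< "
    "-T imports/${ont}_terms_combined.txt --force true --method BOT \\\n"
    "\t\tannotate --ontology-iri $$(ONTBASE)/$$@ $$(ANNOTATE_ONTOLOGY_VERSION) "
    "--output $$@.tmp.owl && mv $$@.tmp.owl $$@; fi\n"
    "\n"
    ".PRECIOUS: imports/${ont}_import.owl\n"
)


def large_import_rule(ont_id: str) -> str:
    """Return the reduced import rule with the ontology id filled in (hypothetical helper)."""
    return RULE.substitute(ont=ont_id)


if __name__ == "__main__":
    # Print rules that could be pasted into ont.Makefile.
    for ont_id in ("pr", "ncbitaxon", "chebi"):
        print(large_import_rule(ont_id))
```

The output is meant to be pasted into `ont.Makefile` by hand; review it before committing, since the default recipe may have changed since this page was written.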
45+
46+
_The ODK supports this reduced module out of the box. To activate it, do this:_
47+
48+
```
49+
import_group:
50+
products:
51+
- id: pr
52+
use_gzipped: TRUE
53+
is_large: TRUE
54+
```
55+
56+
This will (a) ensure that PR is pulled from a gzipped location (you _have_ to check whether it exists, though; it must correspond to the PURL followed by the extension `.gz`, for example `http://purl.obolibrary.org/obo/pr.owl.gz`) and (b) mark it as large, so the default handling of large imports is activated for `pr` and you don't need to paste anything into `ont.Makefile`.
57+
58+
If you prefer to do it yourself, in the following you can find a few snippets you can use that work for three large ontologies. Just copy them and drop them into `ont.Makefile`; and adjust them however you wish.
59+
60+
### Protein Ontology (PR)
61+
62+
```
63+
imports/pr_import.owl: mirror/pr.owl imports/pr_terms_combined.txt
64+
if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/pr_terms_combined.txt --force true --method BOT \
65+
annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi
66+
67+
.PRECIOUS: imports/pr_import.owl
68+
```
69+
70+
### NCBI Taxonomy (NCBITaxon)
71+
72+
```
73+
imports/ncbitaxon_import.owl: mirror/ncbitaxon.owl imports/ncbitaxon_terms_combined.txt
74+
if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/ncbitaxon_terms_combined.txt --force true --method BOT \
75+
annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi
76+
77+
.PRECIOUS: imports/ncbitaxon_import.owl
78+
```
79+
80+
### CHEBI
81+
82+
```
83+
imports/chebi_import.owl: mirror/chebi.owl imports/chebi_terms_combined.txt
84+
if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/chebi_terms_combined.txt --force true --method BOT \
85+
annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi
86+
87+
.PRECIOUS: imports/chebi_import.owl
88+
```
89+
90+
Feel free to use an even cheaper approach, even one that does not use ROBOT, as long as it produces the target of the goal (e.g. `imports/chebi_import.owl`).
91+
92+
## Use slims when they are available
93+
94+
For some ontologies, you can find slims that are _much_ smaller than the full ontology. For example, NCBITaxon maintains a slim for OBO here: http://purl.obolibrary.org/obo/ncbitaxon/subsets/taxslim.obo, which is just 3 MB (!!) compared to the 1 or 2 GB of the full version. Many ontologies maintain such slims, and if not, probably should (I would really like to see an OBO slim for Protein Ontology!).
95+
96+
You can also add your favourite taxa to that slim by simply making a pull request here: https://github.com/obophenotype/ncbitaxon/blob/master/subsets/taxon-subset-ids.txt
97+
98+
You can use those slims simply like this:
99+
100+
```
101+
import_group:
102+
products:
103+
- id: ncbitaxon
104+
mirror_from: http://purl.obolibrary.org/obo/ncbitaxon/subsets/taxslim.obo
105+
```
106+
107+
## Manage imports manually
108+
109+
This is a real hack, and we want to discourage it in the strongest terms. But sometimes, importing an ontology just to import a single term is total overkill. What we do in these cases is to maintain a simple template to "import" minimal information. I can't stress enough that we want to avoid this, as such information will necessarily get out of date, but here is a pattern you can use to handle it in an OK way:
110+
111+
Add this to your `src/ontology/ont-odk.yaml`
112+
113+
```
114+
import_group:
115+
products:
116+
- id: my_ncbitaxon
117+
```
118+
119+
Then add this to `src/ontology/ont.Makefile`:
120+
121+
```
122+
mirror/my_ncbitaxon.owl:
123+
echo "No mirror for $@"
124+
125+
imports/my_ncbitaxon_import.owl: imports/my_ncbitaxon_import.tsv
126+
if [ $(IMP) = true ]; then $(ROBOT) template --template $< \
127+
--ontology-iri "$(ONTBASE)/$@" --output $@.tmp.owl && mv $@.tmp.owl $@; fi
128+
129+
.PRECIOUS: imports/my_ncbitaxon_import.owl
130+
```
131+
132+
Now you can manage your import manually in the template, and the ODK will know not to include your manually curated import in your base release. But again, avoid this pattern for anything but the most trivial case (e.g. you need 1 term from a huge ontology).
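The rule above expects a ROBOT template at `imports/my_ncbitaxon_import.tsv`. As a rough sketch of what such a file could contain, the script below writes a two-column template with the standard ROBOT template strings `ID` and `LABEL`. The helper name `write_template` and the single example term are illustrative assumptions, and the sketch assumes that the `NCBITaxon` prefix resolves in your ROBOT setup.

```python
import csv
from pathlib import Path

# Example terms to "import" by hand, as (CURIE, label) pairs. Illustrative only.
TERMS = [
    ("NCBITaxon:9606", "Homo sapiens"),
]


def write_template(path, terms):
    """Write a minimal ROBOT template: column names, template strings, then one row per term."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["ID", "Label"])  # first row: human-readable column names
        writer.writerow(["ID", "LABEL"])  # second row: ROBOT template strings
        writer.writerows(terms)           # remaining rows: the terms themselves
    return path


if __name__ == "__main__":
    # Run from src/ontology so the path matches the Makefile rule above.
    print(write_template("imports/my_ncbitaxon_import.tsv", TERMS))
```

Because the labels are copied by hand, they will drift from the source ontology over time, which is exactly why this pattern should stay limited to a handful of terms.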
133+
134+
135+
## File is too large: Network timeouts and long runtimes
136+
137+
Remember that ontologies are text files. While this makes them easy to read in your browser, it also makes them huge, from 500 MB (CHEBI) to 2 GB (NCBITaxon), which is an enormous amount.
138+
139+
140+
Thankfully, ROBOT can automatically read gzipped ontologies without the need of unpacking. To avoid long runtimes and network timeouts, we can do the following two things (with the new ODK 1.2.26):
141+
142+
```
143+
import_group:
144+
products:
145+
- id: pr
146+
use_gzipped: TRUE
147+
```
148+
This will try to append `.gz` to the default download location (http://purl.obolibrary.org/obo/pr.owl ---> http://purl.obolibrary.org/obo/pr.owl.gz). Note that you must make sure that this file actually exists. It does, for Chebi and Protein Ontology, but not for many others.
149+
150+
151+
If the file exists, but is located elsewhere, you can do this:
152+
153+
```
154+
import_group:
155+
products:
156+
- id: pr
157+
mirror_from: http://purl.obolibrary.org/obo/pr.owl.gz
158+
```
159+
160+
You can put any URL in `mirror_from` (including non-obo ones!).
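Since a missing `.gz` file only shows up when the mirror download fails, it can be worth checking the URL beforehand. The sketch below is not ODK code: `mirror_url` and `url_exists` are hypothetical helpers, and the precedence shown (an explicit `mirror_from` first, then the gzipped PURL, then the plain PURL) is an assumption based on the description on this page.

```python
import urllib.request

OBO_PURL = "http://purl.obolibrary.org/obo/"


def mirror_url(product: dict) -> str:
    """Guess the download URL for an import product (assumed precedence, see above)."""
    if product.get("mirror_from"):
        return product["mirror_from"]
    url = f"{OBO_PURL}{product['id']}.owl"
    return url + ".gz" if product.get("use_gzipped") else url


def url_exists(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a 2xx status; urllib follows redirects itself."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError and timeouts; ValueError covers malformed URLs.
        return False


if __name__ == "__main__":
    for product in ({"id": "pr", "use_gzipped": True}, {"id": "chebi", "use_gzipped": True}):
        url = mirror_url(product)
        print(url, "reachable" if url_exists(url) else "NOT reachable")
```

Running the script needs network access; a "NOT reachable" result suggests dropping `use_gzipped` for that product or pointing `mirror_from` at a location that does exist.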
161+

odk/odk.py

Lines changed: 16 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -104,6 +104,15 @@ class ImportProduct(Product):
104104

105105
mirror_from: Optional[Url] = None
106106
"""if specified this URL is used rather than the default OBO PURL for the main OWL product"""
107+
108+
is_large: bool = False
109+
"""if large, ODK may take measures to reduce the memory footprint of the import."""
110+
111+
use_base: bool = False
112+
"""if use_base is true, try use the base IRI instead of normal one to mirror from."""
113+
114+
use_gzipped: bool = False
115+
"""if use_gzipped is true, try to mirror from the gzipped location (the default URL followed by .gz) instead of the normal one."""
107116

108117
@dataclass_json
109118
@dataclass
@@ -228,7 +237,7 @@ class ImportGroup(ProductGroup):
228237
"""all import products"""
229238

230239
module_type : str = "slme"
231-
"""Module type. Supported: slme, mireot, minimal, custom"""
240+
"""Module type. Supported: slme, minimal, custom"""
232241

233242
module_type_slme : str = "BOT"
234243
"""SLME module type. Supported: BOT, TOP, STAR"""
@@ -419,6 +428,12 @@ class OntologyProject(JsonSchemaMixin):
419428

420429
use_dosdps : bool = False
421430
"""if true use dead simple owl design patterns"""
431+
432+
custom_makefile_header : str = """
433+
# ----------------------------------------
434+
# More information: https://github.com/INCATools/ontology-development-kit/
435+
"""
436+
"""A multiline string that is added to the Makefile"""
422437

423438
public_release : str = "none"
424439
"""if true add functions to run automated releases (experimental). Current options are: github_curl, github_python."""

schema/project-schema.json

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -326,6 +326,10 @@
326326
},
327327
"type": "array"
328328
},
329+
"custom_makefile_header": {
330+
"default": "\n# ----------------------------------------\n# More information: https://github.com/INCATools/ontology-development-kit/\n",
331+
"type": "string"
332+
},
329333
"description": {
330334
"default": "None",
331335
"type": "string"

template/README.md.jinja2

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,8 @@
1+
{%- if project.ci is defined -%}{% if 'travis' in project.ci %}
12
[![Build Status](https://travis-ci.org/{{ project.github_org }}/{{ project.repo }}.svg?branch=master)](https://travis-ci.org/{{ project.github_org }}/{{ project.repo }})
2-
[![DOI](https://zenodo.org/badge/13996/{{ project.github_org }}/{{ project.repo }}.svg)](https://zenodo.org/badge/latestdoi/13996/{{ project.github_org }}/{{ project.repo }})
3+
{%- endif -%}{% if 'github_actions' in project.ci %}
4+
![Build Status](https://github.com/{{ project.github_org }}/{{ project.repo }}/workflows/CI/badge.svg)
5+
{% endif %}{% endif -%}
36

47
# {{ project.title }}
58
