Description
All release artefacts are currently published uncompressed. It is possible to ask for the “main” release artefact to be compressed (option gzip_main), but this merely adds a compressed version of the release artefact on top of the uncompressed one.
This is a considerable waste of storage space (and of time, considering how long the artefacts take to download), given that most release formats are basically text-based formats (XML, JSON, OBO) and are therefore highly amenable to efficient compression. Even the binary SQLite format can be compressed quite well.
For example, with the latest standard release of CL:
| Release artefact | Uncompressed size | Gzip-compressed | XZ-compressed |
|---|---|---|---|
| cl.owl | 61 MB | 3.7 MB (~16⨉) | 2.5 MB (~24⨉) |
| cl.obo | 16 MB | 2.4 MB (~6⨉) | 1.5 MB (~10⨉) |
| cl.json | 35 MB | 2.6 MB (~13⨉) | 1.8 MB (~19⨉) |
| cl.db | 491 MB | 94 MB (~5⨉) | 48 MB (~10⨉) |
Maybe nowadays 61 MB is not so much of a big deal, but when counting all release artefacts in all formats, a full-blown CL release amounts to more than 700 MB of perfectly compressible files (and that’s before we enable the production of DB files). This could probably be cut down to less than 100 MB if release artefacts were compressed with Gzip, and maybe even less than 50 MB if they were compressed with XZ.¹
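For anyone who wants to check figures like these against their own ontology, here is a minimal sketch using only the Python standard library. The function name `compression_report` is made up for illustration, and the default compression levels of `gzip` and `lzma` are assumed, so the exact numbers may differ somewhat from what the command-line `gzip` and `xz` tools produce.

```python
import gzip
import lzma


def compression_report(data: bytes) -> dict:
    """Return raw, Gzip and XZ sizes (in bytes) and the compression ratios.

    Hypothetical helper for illustration only; it is not part of any
    existing tool.
    """
    raw = len(data)
    gz = len(gzip.compress(data))  # default compression level
    xz = len(lzma.compress(data))  # default container format is .xz
    return {
        "raw": raw,
        "gzip": gz,
        "xz": xz,
        "gzip_ratio": raw / gz,
        "xz_ratio": raw / xz,
    }
```

It could be used as `compression_report(open("cl.owl", "rb").read())`, assuming a local copy of the release artefact.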
Yes, I realise that switching to compressed files is going to break a lot of existing pipelines that expect to be able to find the released version of the XYZ ontology at http://purl.obolibrary.org/obo/xyz.owl and not at http://purl.obolibrary.org/obo/xyz.owl.gz. However, this is still something that should be considered. Releasing uncompressed files is nuts.
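On the consumer side, pipelines could be made tolerant of either form. As a rough sketch (the helper name `open_maybe_gzip` is hypothetical, and it assumes the artefact has already been downloaded to a local path and is UTF-8 text), a reader can sniff the two-byte Gzip magic number instead of trusting the file extension:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every Gzip stream


def open_maybe_gzip(path):
    """Open a local file for text reading, decompressing it if it is Gzip.

    Hypothetical helper for illustration; assumes UTF-8 text content.
    """
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

With something like this, the same downstream code would read `xyz.owl` and `xyz.owl.gz` alike, which might soften the transition for pipelines that cannot be updated all at once.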
¹Though it must be noted that ROBOT does not support reading from XZ-compressed files, so we should probably either stick to Gzip or add XZ support to ROBOT.