Conversation

@gouttegd
Collaborator

The current rules to download a file (e.g. a mirror, or a remotely sourced component) are quite complicated because we want to avoid needlessly triggering all the subsequent rules that depend on a downloaded file, if it happens that the remote file has not in fact changed since the last time it was downloaded.

For example, we do not want to re-generate the import module RO_import.owl for a remote ontology RO if there have been no changes in said remote ontology. That’s because (1) re-generating an import module may be (depending on the module) a costly operation, and (2) the fact that the RO_import.owl module has been re-generated will in turn trigger a rebuild of everything that depends on the import module — all for nothing if the remote ontology has not in fact changed since the last time RO_import.owl was built.

The way this is currently done is by playing the following game:

.PHONY: mirror-envo
.PRECIOUS: $(MIRRORDIR)/envo.owl
mirror-envo: | $(TMPDIR)
        curl -L $(OBOBASE)/envo/envo-base.owl --create-dirs \
             -o $(TMPDIR)/envo-download.owl --retry 4 --max-time 200 && \
        $(ROBOT) convert -i $(TMPDIR)/envo-download.owl -o $(TMPDIR)/[email protected]

$(MIRRORDIR)/%.owl: mirror-% | $(MIRRORDIR)
        if [ -f $(TMPDIR)/mirror-$*.owl ]; then \
                if cmp -s $(TMPDIR)/mirror-$*.owl $@ ; then \
                        echo "Mirror identical, ignoring." ; \
                else \
                        echo "Mirrors different, updating." && \
                        cp $(TMPDIR)/mirror-$*.owl $@ ; \
                fi ; \
        fi

Whenever $(MIRRORDIR)/envo.owl is required, we execute the mirror-envo phony rule, which (1) downloads the original file into $(TMPDIR)/envo-download.owl and (2) runs it through robot convert and writes the result into $(TMPDIR)/mirror-envo.owl. (If the make_base option has been used for the module, step (2) would include the removal of all non-base entities, instead of being a mere convert operation.)

Then, back in the $(MIRRORDIR)/envo.owl rule, we check (1) whether the remote file was indeed downloaded and is available in $(TMPDIR)/mirror-envo.owl, and (2) whether that file is any different from the existing mirror in $(MIRRORDIR)/envo.owl; if it is, then we proceed to overwrite the existing mirror with the newly downloaded one, otherwise we do nothing – we do not touch the existing mirror at all, so that its last modified time is kept unchanged, which prevents the triggering of any further rules that may depend on that file.

We use similar, albeit even more obscure, logic for remotely sourced components:

.PHONY: component-download-hra_subset.owl
.PRECIOUS: $(COMPONENTSDIR)/hra_subset.owl
component-download-hra_subset.owl: | $(TMPDIR)
        $(ROBOT) merge -I https://raw.githubusercontent.com/hubmapconsortium/ccf-validation-tools/master/owl/UB_ASCTB_subset.owl \
                 annotate --annotation owl:versionInfo $(VERSION) \
                 --output $(TMPDIR)/[email protected]

$(COMPONENTSDIR)/hra_subset.owl: component-download-hra_subset.owl $(TMPDIR)/stamp-component-hra_subset.owl
        @if cmp -s $(TMPDIR)/component-download-hra_subset.owl.owl $(TMPDIR)/component-download-hra_subset.owl.tmp.owl ; then \
                echo "Component identical." ; \
        else \
                echo "Component different, updating." && \
                cp $(TMPDIR)/component-download-hra_subset.owl.owl $(TMPDIR)/component-download-hra_subset.owl.tmp.owl && \
                $(ROBOT) annotate --input $(TMPDIR)/component-download-hra_subset.owl.owl \
                                  --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) \
                                  --output $@ ; \
        fi
.PRECIOUS: $(COMPONENTSDIR)/hra_subset.owl

Whenever $(COMPONENTSDIR)/hra_subset.owl is requested, we execute component-download-hra_subset.owl, which (1) downloads the original file, (2) annotates it, and (3) writes it to $(TMPDIR)/component-download-hra_subset.owl.owl.

Back in the $(COMPONENTSDIR)/hra_subset.owl rule, we compare the newly downloaded file $(TMPDIR)/component-download-hra_subset.owl.owl to a $(TMPDIR)/component-download-hra_subset.owl.tmp.owl file (left in place by a previous call to the rule, see later). If the two files match, then the newly downloaded component has not in fact changed since last time, and we do nothing (we do not overwrite the existing component). If the two files do not match (which includes the case where $(TMPDIR)/component-download-hra_subset.owl.tmp.owl does not exist, as happens when the rule has never been called since the last make clean or since the repository was cloned), then we (1) annotate the component (again! so the annotation in the previous rule was in fact for nothing), (2) write it to the final $(COMPONENTSDIR)/hra_subset.owl location, and (3) keep a copy in $(TMPDIR)/component-download-hra_subset.owl.tmp.owl, so that we have something to compare the newly downloaded file against the next time we invoke the rule.

Still there? Good. Now I’ll be blunt: I think all of the above is insane, and borderline impossible to maintain properly (especially since all the rules are written in a Jinja2 template, which is already borderline impossible to read without risking brain clots).

Or rather, the aim (avoiding needlessly triggering rebuilds) is very good, but the method is ugly as hell. Most of that complexity should not reside in the Makefile.

What I propose in this PR is to hide most of the complexity in a new helper command (odk-helper download) that would be solely responsible for (1) downloading a file, (2) downloading it only if it has been modified (as far as we can know that, i.e. if the server can tell us that information), and (3) overwriting an existing file with a newly downloaded file only if the newly downloaded file has changed since the last time it was downloaded.

With that command, the rules to download the envo mirror become:

.PHONY: download-mirror-envo
.PRECIOUS: $(MIRRORDIR)/envo.owl
download-mirror-envo: | $(TMPDIR)
        odk-helper download --output $(TMPDIR)/[email protected] \
                            --reference $(MIRRORDIR)/envo.owl \
                            $(OBOBASE)/envo/envo-base.owl

$(MIRRORDIR)/envo.owl: download-mirror-envo
        @if [ -f $(TMPDIR)/download-mirror-envo.owl ]; then \
                $(ROBOT) convert -i $(TMPDIR)/download-mirror-envo.owl -o $@ ; \
        fi

Whenever $(MIRRORDIR)/envo.owl is requested, we execute download-mirror-envo, which takes care of downloading (if needed!) the remote ontology and writing it (if needed!) into $(TMPDIR)/download-mirror-envo.owl. Back in the $(MIRRORDIR)/envo.owl rule, all we have to do is to check whether the $(TMPDIR)/download-mirror-envo.owl file exists; if it does, then we know it is new compared to the last time we tried to download it, and we can proceed with the conversion (and “base-ification” if needed, which is not the case here) and write the final $(MIRRORDIR)/envo.owl file.

Likewise for remotely sourced components:

.PHONY: download-component-hra_subset.owl
.PRECIOUS: $(COMPONENTSDIR)/hra_subset.owl
download-component-hra_subset.owl: | $(TMPDIR)
        @odk-helper download --output $(TMPDIR)/$@ \
                             --reference $(COMPONENTSDIR)/hra_subset.owl \
                             https://raw.githubusercontent.com/hubmapconsortium/ccf-validation-tools/master/owl/UB_ASCTB_subset.owl

$(COMPONENTSDIR)/hra_subset.owl: download-component-hra_subset.owl $(TMPDIR)/stamp-component-hra_subset.owl
        @if [ -f $(TMPDIR)/download-component-hra_subset.owl ]; then \
                $(ROBOT) merge -i $(TMPDIR)/download-component-hra_subset.owl \
                         annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) \
                                  --output $@ ; \
        fi

Overall, this makes the standard Makefile (and the template from which it is generated) much more readable and therefore maintainable, while offering many new possibilities. For example, as implemented, the odk-helper download command can take care of automatically trying to download http://purl.obolibrary.org/obo/envo/envo-base.owl.gz whenever http://purl.obolibrary.org/obo/envo/envo-base.owl is requested, silently falling back to the original (uncompressed) URL if the compressed file cannot be found on the server. Implementing that kind of logic directly in the Makefile would be possible but way too cumbersome.

This commit adds a new command to the `odk-helper` tool: `download`. The
new command is intended to be used from within ODK workflows to download
files (e.g. when mirroring an ontology, or when refreshing a remotely
sourced component).

The aim is to simplify the standard Makefile by removing from it all the
stuff needed to ensure that the rules that depend on a downloaded file
are not needlessly triggered when we download a file that has, in fact,
not changed since the last time it was downloaded.

The `odk-helper download` command takes care of that by keeping track,
for every downloaded file, of up to three pieces of information:

* the time at which the file was downloaded;
* the ETag returned by the server for that file (if any);
* the SHA-256 checksum of the downloaded file.

When we ask to download the same file again, the command uses the time
of last download and the ETag (if available) to specifically ask the
server for a newer version of the file, thereby avoiding the download
entirely if the server is able to honor the If-Modified-Since and/or
If-None-Match headers. If we do download the file, then we first write
it into a temporary location, compute its SHA-256 checksum, and compare
it with the recorded checksum from the last download; if the checksums
match, then the newly downloaded file is in fact not new and the
temporary file is deleted without ever touching the existing file.
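
The conditional-request and checksum logic described above can be sketched as follows (a minimal illustration under assumed names; this is not the actual odk-helper code):

```python
import hashlib
from pathlib import Path

# Sketch of the decision logic: build conditional request headers from
# the recorded metadata, and decide whether a freshly downloaded file is
# actually new by comparing its SHA-256 checksum to the recorded one.

def conditional_headers(metadata: dict) -> dict:
    """HTTP headers asking the server for a newer version only."""
    headers = {}
    if "last_download" in metadata:
        headers["If-Modified-Since"] = metadata["last_download"]
    if "etag" in metadata:
        headers["If-None-Match"] = metadata["etag"]
    return headers

def sha256_of(path: Path) -> str:
    """SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def is_new_download(tmp_file: Path, metadata: dict) -> bool:
    """True if the freshly downloaded file differs from the last one."""
    return sha256_of(tmp_file) != metadata.get("sha256")
```

If `is_new_download` returns False, the temporary file is simply deleted and the existing file (and its last-modified time) is left alone.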

In addition, the command can automatically:

* uncompress a .gz or .bz2 file;
* try to download a compressed version of a remote resource first (by
  appending `.gz` to the requested URL), then fallback to the original
  URL.
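
The compressed-first behavior can be sketched like this (function names and exact behavior are illustrative assumptions, not the actual odk-helper code):

```python
import bz2
import gzip
from pathlib import Path

# Sketch: try a compressed variant of the URL first, then fall back to
# the original; uncompress the payload based on the suffix it came with.

def candidate_urls(url: str, try_gzipped: bool = True) -> list[str]:
    """URLs to try in order: compressed variant first, then the original."""
    if try_gzipped and not url.endswith((".gz", ".bz2")):
        return [url + ".gz", url]
    return [url]

def maybe_decompress(path: Path, data: bytes) -> bytes:
    """Uncompress the payload if its source name has a known suffix."""
    name = str(path)
    if name.endswith(".gz"):
        return gzip.decompress(data)
    if name.endswith(".bz2"):
        return bz2.decompress(data)
    return data
```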

Now that we have an `odk-helper download` command, we can use it to
overhaul the way we download mirrors and remotely sourced components in
the standard Makefile.

Basically, to download and prepare a mirror M, we have two rules:

(1) A "phony" rule `download-mirror-M`. That rule is responsible for
actually downloading the remote file. It does so with the `odk-helper
download` command. The newly downloaded file is expected to be written
into `$(TMPDIR)/download-mirror-M.owl`, _unless_ the file has not been
changed since the last time it was downloaded (in which case nothing
will be written at that location). It is also in that rule that we deal
with the `use_base` and `use_gzipped` options.

(2) A "normal" rule `$(MIRRORDIR)/M.owl`, which depends on the first
rule. That rule should convert the downloaded file into the final mirror
file, by running `robot remove` (if `make_base` is enabled for the
mirror) and `robot convert`. Importantly, that rule must check
beforehand whether the `$(TMPDIR)/download-mirror-M.owl` file exists,
and do nothing if it does not (which will be the case when the remote
file has not changed since the last time it was downloaded).

The same principle applies for downloading remotely sourced components.
@gouttegd gouttegd self-assigned this Dec 31, 2025
@gouttegd gouttegd requested a review from matentzn December 31, 2025 14:49
Update the Makefile template so that the rule for downloading a mirror
takes into account the `mirror_retry_download` option (at the
ImportGroup level) and the `use_gzipped` option (at the individual
import level).

Mark the `mirror_max_time_download` option as deprecated, as (1)
Python's Requests library does not offer a way to set a timeout for
the entire request and (2) I happen to agree with the Requests
developers when they say that total timeouts are in fact a misfeature
(psf/requests#3099 (comment)).
Read timeouts (when we timeout after not receiving a data packet for a
while) should be enough to deal with flaky servers.
We fairly regularly run into transient network errors when running the
test suite. Better to systematically retry downloads by default.
If we have to retry a download because of a transient network issue,
wait for one second before attempting again.
@matentzn matentzn left a comment

Looks great to me!

A bit optional but since we started a new year (with a lot of energy for great resolutions), it may make sense to add a handful of unit tests for this? Or do you want to rely on the integration test suite only?

Collaborator Author

Happy to consider that now that I know you approve of the general idea of an odk-helper download command. :)

Collaborator Author

Done. Let me know if you want to re-review (the actual code has not changed beyond the addition of the tests); otherwise, I’ll merge sometime early next week.

The default rules selected by `ruff check` do not include the `isort`
rules (I), which I like to always use. Better to hardcode that in the
pyproject file rather than having to remember to select them on the
command line.
Add a test case for the `download_file` function that is at the core of
the `odk-helper download` command.

Update the `code-quality.yml` workflow to (1) add the `tests` directory
to the linting, formatting, and static typing checks, and (2) run the
unit tests.