Refactor downloading rules using a "download helper" command #2
Conversation
This commit adds a new command to the `odk-helper` tool: `download`. The new command is intended to be used from within ODK workflows to download files (e.g. when mirroring an ontology, or when refreshing a remotely sourced component). The aim is to simplify the standard Makefile by removing from it all the stuff needed to ensure that the rules that depend on a downloaded file are not needlessly triggered when we download a file that has, in fact, not changed since the last time it was downloaded.

The `odk-helper download` command takes care of that by keeping track, for every downloaded file, of up to three pieces of information:

* the time at which the file was downloaded;
* the ETag returned by the server for that file (if any);
* the SHA-256 checksum of the downloaded file.

When we ask to download the same file again, the command uses the time of the last download and the ETag (if available) to specifically ask the server for a newer version of the file, thereby avoiding the download completely if the server is able to honor the If-Modified-Since and/or If-None-Match headers. If we do download the file, then we first write it into a temporary location, compute its SHA-256 checksum, and compare it with the recorded checksum from the last download; if the checksums match, then the newly downloaded file is in fact not new, and the temporary file is deleted without ever touching the existing file.

In addition, the command can automatically:

* uncompress a .gz or .bz2 file;
* try to download a compressed version of a remote resource first (by appending `.gz` to the requested URL), then fall back to the original URL.
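To make the mechanism concrete, here is a minimal Python sketch of that conditional-download logic, assuming a per-target JSON state file; the function name, the state-file layout, and the timeout value are illustrative and not necessarily those of the actual implementation:

```python
import hashlib
import json
import shutil
import tempfile
import time
from email.utils import formatdate
from pathlib import Path

import requests


def conditional_download(url: str, dest: Path, state_file: Path) -> bool:
    """Fetch `url` into `dest` only if the remote content has changed.

    Returns True if `dest` was (re)written. Per-target state (time of last
    download, ETag, SHA-256) is kept as JSON in `state_file`.
    """
    state = json.loads(state_file.read_text()) if state_file.exists() else {}

    # Ask the server to skip the transfer if it can tell the file is unchanged.
    headers = {}
    if "timestamp" in state:
        headers["If-Modified-Since"] = formatdate(state["timestamp"], usegmt=True)
    if state.get("etag"):
        headers["If-None-Match"] = state["etag"]

    response = requests.get(url, headers=headers, stream=True, timeout=30)
    if response.status_code == 304:
        return False  # server says: not modified since the last download
    response.raise_for_status()

    # Stream into a temporary file, computing the checksum on the fly.
    sha256 = hashlib.sha256()
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        for chunk in response.iter_content(chunk_size=65536):
            tmp.write(chunk)
            sha256.update(chunk)
    checksum = sha256.hexdigest()

    if checksum == state.get("sha256"):
        Path(tmp.name).unlink()  # identical content: leave `dest` untouched
        return False

    shutil.move(tmp.name, str(dest))
    state_file.write_text(json.dumps({
        "timestamp": time.time(),
        "etag": response.headers.get("ETag"),
        "sha256": checksum,
    }))
    return True
```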
Now that we have an `odk-helper download` command, we can use it to overhaul the way we download mirrors and remotely sourced components in the standard Makefile. Basically, to download and prepare a mirror M, we have two rules:

(1) A "phony" rule `download-mirror-M`. That rule is responsible for actually downloading the remote file, which it does with the `odk-helper download` command. The newly downloaded file is expected to be written into `$(TMPDIR)/download-mirror-M.owl`, _unless_ the file has not changed since the last time it was downloaded (in which case nothing will be written at that location). It is also in that rule that we deal with the `use_base` and `use_gzipped` options.

(2) A "normal" rule `$(MIRRORDIR)/M.owl`, which depends on the first rule. That rule should convert the downloaded file into the final mirror file, by running `robot remove` (if `make_base` is enabled for the mirror) and `robot convert`. Importantly, that rule must check beforehand whether the `$(TMPDIR)/download-mirror-M.owl` file exists, and do nothing if it does not (which will be the case when the remote file has not changed since the last time it was downloaded).

The same principle applies for downloading remotely sourced components.
Update the Makefile template so that the rule for downloading a mirror takes into account the `mirror_retry_download` option (at the ImportGroup level) and the `use_gzipped` option (at the individual import level). Mark the `mirror_max_time_download` option as deprecated, since (1) Python's Requests library does not offer a way to set a timeout for the entire request, and (2) I happen to agree with the Requests developers when they say that total timeouts are in fact a misfeature (psf/requests#3099 (comment)). Read timeouts (where we time out after not receiving a data packet for a while) should be enough to deal with flaky servers.
We fairly regularly run into transient network errors when running the test suite. Better to systematically retry downloads by default.
If we have to retry a download because of a transient network issue, wait for one second before attempting again.
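A rough Python sketch of that retry behaviour (the retry count, delay, and timeout values here are placeholders, not necessarily the actual defaults; note that the Requests timeout tuple is connect/read, with no way to cap the total duration):

```python
import time

import requests


def get_with_retries(url: str, retries: int = 3, delay: float = 1.0) -> requests.Response:
    """Fetch `url`, retrying transient network errors after a short pause."""
    for attempt in range(retries + 1):
        try:
            # (connect timeout, read timeout); Requests has no total-timeout option.
            response = requests.get(url, timeout=(10, 30))
            response.raise_for_status()
            return response
        except (requests.ConnectionError, requests.Timeout):
            if attempt == retries:
                raise  # give up after the last attempt
            time.sleep(delay)  # wait a moment before trying again
    raise AssertionError("unreachable")
```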
matentzn left a comment:
Looks great to me!
A bit optional but since we started a new year (with a lot of energy for great resolutions), it may make sense to add a handful of unit tests for this? Or do you want to rely on the integration test suite only?
Happy to consider that now that I know you approve of the general idea of an `odk-helper download` command. :)
Done. Let me know if you want to re-review (the actual code has not changed beyond the addition of the tests); otherwise, I’ll merge sometime early next week.
The default rules selected by `ruff check` do not include the `isort` rules (I), which I like to always use. Better to hardcode that in the pyproject file rather than having to remember to select them on the command line.
Add a test case for the `download_file` function that is at the core of the `odk-helper download` command. Update the `code-quality.yml` workflow to (1) add the `tests` directory to the linting, formatting, and static typing checks, and (2) run the unit tests.
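A unit test along these lines could look roughly like the sketch below, serving a file from a throwaway local HTTP server; the import path and the `download_file(url, destination)` signature are assumptions, not the actual API:

```python
import http.server
import threading
from functools import partial
from pathlib import Path

# Hypothetical import path and signature; adjust to the real odk-helper module.
from odk_helper.download import download_file


def test_download_file_fetches_remote_content(tmp_path: Path) -> None:
    # Serve a small payload from a local, ephemeral HTTP server.
    payload = b"<rdf:RDF></rdf:RDF>"
    served_dir = tmp_path / "served"
    served_dir.mkdir()
    (served_dir / "envo.owl").write_bytes(payload)

    handler = partial(http.server.SimpleHTTPRequestHandler, directory=str(served_dir))
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/envo.owl"
        destination = tmp_path / "envo.owl"
        download_file(url, destination)  # assumed signature: (url, destination path)
        assert destination.read_bytes() == payload
    finally:
        server.shutdown()
```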
The current rules to download a file (e.g. a mirror, or a remotely sourced component) are quite complicated because we want to avoid needlessly triggering all the subsequent rules that depend on a downloaded file, if it happens that the remote file has not in fact changed since the last time it was downloaded.
For example, we do not want to re-generate the import module RO_import.owl for a remote ontology RO if there have been no changes in said remote ontology. That’s because (1) re-generating an import module may be (depending on the module) a costly operation, and (2) the fact that the RO_import.owl module has been re-generated will in turn trigger a rebuild of everything that depends on the import module – all for nothing if the remote ontology has not in fact changed since the last time RO_import.owl was built.
The way this is currently done is by playing the following game:
* Whenever `$(MIRRORDIR)/envo.owl` is required, we execute the `mirror-envo` phony rule, which (1) downloads the original file into `$(TMPDIR)/envo-download.owl` and (2) runs it through `robot convert` and writes the result into `$(TMPDIR)/mirror-envo.owl`. (If the `make_base` option has been used for the module, step (2) would include the removal of all non-base entities, instead of being a mere `convert` operation.)
* Then, back in the `$(MIRRORDIR)/envo.owl` rule, we check (1) whether the remote file was indeed downloaded and is available in `$(TMPDIR)/mirror-envo.owl`, and (2) whether that file is any different from the existing mirror in `$(MIRRORDIR)/envo.owl`; if it is, then we proceed to overwrite the existing mirror with the newly downloaded one, otherwise we do nothing – we do not touch the existing mirror at all, so that its last-modified time is kept unchanged, which prevents the triggering of any further rules that may depend on that file.

We use a similar logic, albeit even more obscure, for remotely sourced components:
* Whenever `$(COMPONENTSDIR)/hra_subset.owl` is requested, we execute `component-download-hra_subset.owl`, which (1) downloads the original file, (2) annotates it, and (3) writes it to `$(TMPDIR)/component-download-hra_subset.owl.owl`.
* Back in the `$(COMPONENTSDIR)/hra_subset.owl` rule, we compare the newly downloaded file `$(TMPDIR)/component-download-hra_subset.owl.owl` to a `$(TMPDIR)/component-download-hra_subset.owl.tmp.owl` file (left in place by a previous call to the rule, see later). If the two files match, then the newly downloaded component has not in fact changed since last time, and we do nothing (we do not overwrite the existing component). If the two files do not match (which includes the case where the `$(TMPDIR)/component-download-hra_subset.owl.tmp.owl` file does not exist, which would be the case if the rule has never been called since the last `make clean` or the last time the repository was cloned), then we (1) annotate the component (again! so the annotation in the previous rule was in fact for nothing), (2) write it to the final `$(COMPONENTSDIR)/hra_subset.owl` location, and (3) keep a copy in `$(TMPDIR)/component-download-hra_subset.owl.tmp.owl`, so that we have something to compare the newly downloaded file against the next time we invoke the rule.

Still there? Good. Now I’ll be blunt: I think all of the above is insane, and borderline impossible to maintain properly (especially since all the rules are written in a Jinja2 template, which is already borderline impossible to read without risking brain clots).
Or rather, the aim (avoiding needlessly triggering rebuilds) is very good, but the method is ugly as hell. Most of that complexity should not reside in the Makefile.
What I propose in this PR is to hide most of the complexity in a new helper command (`odk-helper download`) that would be solely responsible for (1) downloading a file, (2) downloading it only if it has been modified (as much as we can know that, i.e. if the server can tell us that information), and (3) overwriting an existing file with a newly downloaded file only if the newly downloaded file has changed since the last time it was downloaded.

With that command, the rules to download the `envo` mirror become:

* Whenever `$(MIRRORDIR)/envo.owl` is requested, we execute `download-mirror-envo`, which takes care of downloading (if needed!) the remote ontology and writing it (if needed!) into `$(TMPDIR)/download-mirror-envo.owl`.
* Back in the `$(MIRRORDIR)/envo.owl` rule, all we have to do is to check whether the `$(TMPDIR)/download-mirror-envo.owl` file exists; if it does, then we know it is new compared to the last time we tried to download it, and we can proceed with the conversion (and “base-ification” if needed, which is not the case here) and write the final `$(MIRRORDIR)/envo.owl` file.

Likewise for remotely sourced components.
Overall, this makes the standard Makefile (and the template from which it is generated) much more readable and therefore maintainable, while offering many new possibilities. For example, as implemented, the `odk-helper download` command can take care of automatically trying to download `http://purl.obolibrary.org/obo/envo/envo-base.owl.gz` whenever `http://purl.obolibrary.org/obo/envo/envo-base.owl` is requested, silently falling back to the original (uncompressed) URL if the compressed file cannot be found on the server. Implementing that kind of logic directly in the Makefile would be possible, but way too cumbersome.
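For instance, the gzip-first logic could look roughly like this (a simplified sketch; the real command also folds in the conditional-download and checksum bookkeeping described earlier):

```python
import gzip
from pathlib import Path

import requests


def fetch_with_gzip_fallback(url: str, dest: Path) -> None:
    """Try the gzipped variant of `url` first, then fall back to `url` itself."""
    response = requests.get(url + ".gz", timeout=30)
    if response.ok:
        # The compressed variant exists on the server: decompress it locally.
        dest.write_bytes(gzip.decompress(response.content))
        return
    # No compressed variant available: fetch the original resource instead.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    dest.write_bytes(response.content)
```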