Skip to content

Remove UR:file:// and UR:ftp:// from ref search path, plus REF_PATH to EBI#1881

Merged
daviesrob merged 2 commits intosamtools:developfrom
jkbonfield:no-ebi-ref
May 15, 2025
Merged

Remove UR:file:// and UR:ftp:// from ref search path, plus REF_PATH to EBI#1881
daviesrob merged 2 commits intosamtools:developfrom
jkbonfield:no-ebi-ref

Conversation

@jkbonfield
Copy link
Contributor

While use of the EBI refget server was originally encouraged by the CRAM inventors, it became a self-imposed DDOS and it is now unreliable due to explicit rate-limiting by the EBI. This removes EBI as a fallback when REF_PATH has not been set.

In doing this we discovered that we could still retrieve references (ironically also from EBI due to the test being a 1000genomes CRAM) via the SQ UR: tag supporting remote URIs. This behaviour is explicitly listed as not being supported in the samtools manpage and we believe it was an accidental ability added when switching from fopen to bgzf_open for reading the UR reference file.

Note this check must be in cram_populate_ref and not load_ref_portion or bgzf_open_ref as the user still has the ability to explicitly request an external reference, eg via "samtools view -T URI".

open_path_mfile() now takes an extra 'int *local' argument which is filled out with non-zero if the file found in REF_PATH is local. Non-local files will be cached to REF_CACHE if set, but it no longer has a default value as we did when ebi refget was the default REF_PATH. This means it should operate much as before, except for the lack of EBI defaults.

@daviesrob
Copy link
Member

As this may make failing to get a reference more likely, a useful addition would be to make the

[W::cram_get_ref] Failed to populate reference for id 0

message more useful. It should at least give the name for the offending reference, and maybe also a link to the documentation we plan to add on how to get references.

…o EBI.

While use of the EBI refget server was originally encouraged by the
CRAM inventors, it became a self-imposed DDOS and it is now unreliable
due to rate limiting by the EBI.  This removes EBI as a fallback when
REF_PATH has not been set.

In doing this we discovered that we could still retrieve references
(ironically also from EBI due to the test being a 1000genomes CRAM)
via the SQ UR: tag supporting remote URIs.  This behaviour is
explicity listed as not being supported in the samtools manpage and we
believe it was an accidental ability added when switching from fopen
to bgzf_open for reading the UR reference file.

Note this check must be in cram_populate_ref and not load_ref_portion
or bgzf_open_ref as the user still has the ability to explicitly
request an external reference, eg via "samtools view -T URI".

open_path_mfile() now takes an extra 'int *local' argument which is
filled out with non-zero if the file found in REF_PATH is local.
Non-local files will be cached to REF_CACHE if set, but it no longer
has a default value as we did when ebi refget was the default REF_PATH.
This means it should operate much as before, except for the lack of
EBI defaults.
@jkbonfield jkbonfield force-pushed the no-ebi-ref branch 2 times, most recently from edc5ac2 to 178d1ba Compare May 14, 2025 08:55
Edits from review
@daviesrob daviesrob merged commit e5675d7 into samtools:develop May 15, 2025
9 checks passed
daviesrob added a commit to samtools/samtools that referenced this pull request May 19, 2025
Updates to reflect the changes in the HTSlib UR: and REF_PATH pull
request samtools/htslib#1881.

Also moved the section listing the reference discovery ordering to be
adjacent to the environment variable section so all relevant text is
together.

---------

Co-authored-by: Rob Davies <rmd@sanger.ac.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants