Call to load_registry_from_doi using a Zenodo DOI fails with cryptic errors #502

@NickleDave

Hi,

First of all, thank you for developing pooch -- it's a great library and I'm guessing my issue is most likely due to my lack of knowledge.

One of the packages I develop uses pooch to download larger example datasets (as intended).

The tests suddenly started failing in CI, see logs here:
https://github.com/vocalpy/vocalpy/actions/runs/20418516011/job/58970029807

My best understanding of what is going on is that I'm getting an HTTP response 429 when I call POOCH.load_registry_from_doi().
Unfortunately this gets obscured by a JSONDecodeError from requests. Here's an example from my tests failing locally (hidden in <details></details> to try and keep this issue readable):

====================================================================================== FAILURES ======================================================================================
___________________________________________________________________________ test_example[False-example11] ____________________________________________________________________________

self = <Response [429]>, kwargs = {}

    def json(self, **kwargs):
        r"""Decodes the JSON response body (if any) as a Python object.
    
        This may return a dictionary, list, etc. depending on what is in the response.
    
        :param \*\*kwargs: Optional arguments that ``json.loads`` takes.
        :raises requests.exceptions.JSONDecodeError: If the response body does not
            contain valid json.
        """
    
        if not self.encoding and self.content and len(self.content) > 3:
            # No encoding set. JSON RFC 4627 section 3 states we should expect
            # UTF-8, -16 or -32. Detect which one to use; If the detection or
            # decoding fails, fall back to `self.text` (using charset_normalizer to make
            # a best guess).
            encoding = guess_json_utf(self.content)
            if encoding is not None:
                try:
                    return complexjson.loads(self.content.decode(encoding), **kwargs)
                except UnicodeDecodeError:
                    # Wrong UTF codec detected; usually because it's not UTF-8
                    # but some other 8-bit codec.  This is an RFC violation,
                    # and the server didn't bother to tell us what codec *was*
                    # used.
                    pass
                except JSONDecodeError as e:
                    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
    
        try:
>           return complexjson.loads(self.text, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.venv/lib/python3.13/site-packages/requests/models.py:976: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../../.pyenv/versions/3.13.11/lib/python3.13/json/__init__.py:352: in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
../../../../.pyenv/versions/3.13.11/lib/python3.13/json/decoder.py:345: in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <json.decoder.JSONDecoder object at 0x76ee8b2d38c0>
s = '<html>\r\n<head><title>429 Too Many Requests</title></head>\r\n<body>\r\n<center><h1>429 Too Many Requests</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'
idx = 0

    def raw_decode(self, s, idx=0):
        """Decode a JSON document from ``s`` (a ``str`` beginning with
        a JSON document) and return a 2-tuple of the Python
        representation and the index in ``s`` where the document ended.
    
        This can be used to decode a JSON document from a string that may
        have extraneous data at the end.
    
        """
        try:
            obj, end = self.scan_once(s, idx)
        except StopIteration as err:
>           raise JSONDecodeError("Expecting value", s, err.value) from None
E           json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

../../../../.pyenv/versions/3.13.11/lib/python3.13/json/decoder.py:363: JSONDecodeError

During handling of the above exception, another exception occurred:

example = Example(name='zblib', description='Zebra finch calls, subset of data from:\nElie, Julie; Theunissen, Frédéric E. (2020...alization-library-zebra-finch-subset.zip'), makefunc=<function zblib_makefunc at 0x76ee3aa26520>, makefunc_kwargs=None)
return_path = False

    @pytest.mark.parametrize(
        'example',
        vocalpy.examples._examples.EXAMPLES
    )
    def test_example(example, return_path):
>       out = vocalpy.examples._examples.example(example.name, return_path=return_path)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/test_examples/test__examples.py:32: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
src/vocalpy/examples/_examples.py:398: in example
    return example_.load(return_path=return_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
src/vocalpy/examples/_examples.py:257: in load
    POOCH.load_registry_from_doi()
.venv/lib/python3.13/site-packages/pooch/core.py:704: in load_registry_from_doi
    return repository.populate_registry(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.13/site-packages/pooch/downloaders.py:908: in populate_registry
    for filedata in self.api_response["files"]:
                    ^^^^^^^^^^^^^^^^^
.venv/lib/python3.13/site-packages/pooch/downloaders.py:811: in api_response
    ).json()
      ^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Response [429]>, kwargs = {}

    def json(self, **kwargs):
        r"""Decodes the JSON response body (if any) as a Python object.
    
        This may return a dictionary, list, etc. depending on what is in the response.
    
        :param \*\*kwargs: Optional arguments that ``json.loads`` takes.
        :raises requests.exceptions.JSONDecodeError: If the response body does not
            contain valid json.
        """
    
        if not self.encoding and self.content and len(self.content) > 3:
            # No encoding set. JSON RFC 4627 section 3 states we should expect
            # UTF-8, -16 or -32. Detect which one to use; If the detection or
            # decoding fails, fall back to `self.text` (using charset_normalizer to make
            # a best guess).
            encoding = guess_json_utf(self.content)
            if encoding is not None:
                try:
                    return complexjson.loads(self.content.decode(encoding), **kwargs)
                except UnicodeDecodeError:
                    # Wrong UTF codec detected; usually because it's not UTF-8
                    # but some other 8-bit codec.  This is an RFC violation,
                    # and the server didn't bother to tell us what codec *was*
                    # used.
                    pass
                except JSONDecodeError as e:
                    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
    
        try:
            return complexjson.loads(self.text, **kwargs)
        except JSONDecodeError as e:
            # Catch JSON-related errors and raise as requests.JSONDecodeError
            # This aliases json.JSONDecodeError and simplejson.JSONDecodeError
>           raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
E           requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

.venv/lib/python3.13/site-packages/requests/models.py:980: JSONDecodeError
============================================================================== short test summary info ===============================================================================
FAILED tests/test_examples/test__examples.py::test_example[False-example11] - requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
=========================================================================== 1 failed, 106 passed in 8.42s ============================================================================
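To illustrate what is going on in the traceback above: the body of the 429 response is an HTML page, so calling `.json()` on it raises `JSONDecodeError` before anyone looks at the status code. Below is a minimal sketch of checking the status first. `json_or_raise` is a hypothetical helper made up for this issue, not something pooch or requests provides, and it only assumes a response object with `status_code`, `headers`, and `json()`, as `requests.Response` has.

```python
def json_or_raise(response):
    """Return the decoded JSON body, or raise an error that names the HTTP status.

    `response` is any object with `status_code`, `headers`, and `json()`,
    such as a requests.Response. Hypothetical helper, not part of pooch.
    """
    status = response.status_code
    if status == 429:
        # Servers may say how long to wait via the Retry-After header.
        retry_after = response.headers.get("Retry-After", "not given")
        raise RuntimeError(
            f"HTTP 429 Too Many Requests (Retry-After: {retry_after})"
        )
    if status >= 400:
        raise RuntimeError(f"HTTP {status} returned instead of JSON")
    # Only decode the body once the status says the request succeeded.
    return response.json()
```

With requests itself, calling `response.raise_for_status()` before `response.json()` has a similar effect, since it raises `requests.exceptions.HTTPError` for 4xx and 5xx responses.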

But the error seems to be non-deterministic, and sometimes I instead get an error suggesting the DOI doesn't exist, like so:

====================================================================================== FAILURES ======================================================================================
____________________________________________________________________________ test_example[False-example9] ____________________________________________________________________________

example = Example(name='bfsongrepo', description='Sample of song from Bengalese Finch Song Repository.\nNicholson, David; Queen,...po.tar.gz'), makefunc=<function bfsongrepo_makefunc at 0x7b790061d760>, makefunc_kwargs={'annot_format': 'simple-seq'})
return_path = False

    @pytest.mark.parametrize(
        'example',
        vocalpy.examples._examples.EXAMPLES
    )
    def test_example(example, return_path):
>       out = vocalpy.examples._examples.example(example.name, return_path=return_path)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/test_examples/test__examples.py:32: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
src/vocalpy/examples/_examples.py:398: in example
    return example_.load(return_path=return_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
src/vocalpy/examples/_examples.py:257: in load
    POOCH.load_registry_from_doi()
.venv/lib/python3.13/site-packages/pooch/core.py:701: in load_registry_from_doi
    repository = doi_to_repository(doi)
                 ^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.13/site-packages/pooch/downloaders.py:689: in doi_to_repository
    archive_url = doi_to_url(doi)
                  ^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

doi = '10.5281/zenodo.10685639'

    def doi_to_url(doi):
        """
        Follow a DOI link to resolve the URL of the archive.
    
        Parameters
        ----------
        doi : str
            The DOI of the archive.
    
        Returns
        -------
        url : str
            The URL of the archive in the data repository.
    
        """
        # Lazy import requests to speed up import time
        import requests  # pylint: disable=C0415
    
        # Use doi.org to resolve the DOI to the repository website.
        response = requests.get(f"https://doi.org/{doi}", timeout=DEFAULT_TIMEOUT)
        url = response.url
        if 400 <= response.status_code < 600:
>           raise ValueError(
                f"Archive with doi:{doi} not found (see {url}). Is the DOI correct?"
            )
E           ValueError: Archive with doi:10.5281/zenodo.10685639 not found (see https://zenodo.org/doi/10.5281/zenodo.10685639). Is the DOI correct?

.venv/lib/python3.13/site-packages/pooch/downloaders.py:652: ValueError
============================================================================== short test summary info ===============================================================================
FAILED tests/test_examples/test__examples.py::test_example[False-example9] - ValueError: Archive with doi:10.5281/zenodo.10685639 not found (see https://zenodo.org/doi/10.5281/zenodo.10685639). Is the DOI correct?
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
=========================================================================== 1 failed, 104 passed in 1.77s ============================================================================

But it for sure exists, here: https://zenodo.org/records/13929516
And if I manually enter the URL the way that pooch constructs it (f"https://doi.org/{doi}", where doi is 10.5281/zenodo.10685639), then it resolves correctly to that site: https://doi.org/10.5281/zenodo.10685639
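Note that the `doi_to_url` code in the traceback raises the same "not found" `ValueError` for any status code from 400 to 599, so a rate limit or a server error would look identical to a wrong DOI. The traceback does not show which status code came back, so the following is only a sketch of how the cases could be told apart. `describe_doi_failure` is a hypothetical name, not a pooch function.

```python
def describe_doi_failure(doi, status_code, url):
    """Build an error message that distinguishes the likely cause.

    Hypothetical helper for illustration; not part of pooch.
    """
    if status_code == 404:
        # The only case where "Is the DOI correct?" is the right question.
        return f"Archive with doi:{doi} not found (see {url}). Is the DOI correct?"
    if status_code == 429:
        return (
            f"Rate limited (HTTP 429) while resolving doi:{doi} (see {url}). "
            "The DOI may be fine; try again later."
        )
    if 500 <= status_code < 600:
        return f"Server error (HTTP {status_code}) while resolving doi:{doi} (see {url})."
    return f"HTTP {status_code} while resolving doi:{doi} (see {url})."
```

For example, `describe_doi_failure("10.5281/zenodo.10685639", 429, "https://zenodo.org/doi/10.5281/zenodo.10685639")` would point at rate limiting rather than at the DOI.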

Since part of what's happening here is that the error is not super informative, I guess this is related to #456.

But even if I got a more specific "HTTPError: 429" or something like that, I would still have no idea how to resolve it.

A look at SO and reddit posts suggests that Zenodo thinks I am a bot?
Maybe they assume any IP address from my neck of the woods is a scraper from Anthropic.
I'm really more mis-Anthropic [1].

Even worse, I still get this error when I run the tests locally, or when I just call the function directly to download examples.

Could this be due to some change in how Zenodo deals with bots and scrapers? (i.e., it has nothing to do with pooch or requests)

Is there some sort of workaround like maybe a read-only token for accessing Zenodo? (I have no idea if such a thing exists.)
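One generic workaround for HTTP 429, independent of any token, is to retry with a growing delay and to honor the `Retry-After` header when the server sends one. The sketch below shows that idea under stated assumptions: `fetch_with_backoff` and its parameters are hypothetical names, `fetch` is any zero-argument callable returning an object with `status_code` and `headers`, and whether Zenodo sends `Retry-After` at all is not something confirmed here.

```python
import time


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` until the response is not a 429 or attempts run out.

    Hypothetical helper for illustration; not part of pooch or requests.
    `sleep` is injectable so the function can be exercised without waiting.
    """
    for attempt in range(max_attempts):
        response = fetch()
        if response.status_code != 429:
            return response
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        # Prefer the server's Retry-After value when it is a number of seconds;
        # otherwise fall back to exponential backoff: 1s, 2s, 4s, ...
        retry_after = response.headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * 2 ** attempt
        sleep(delay)
    raise RuntimeError(
        f"Still rate limited (HTTP 429) after {max_attempts} attempts"
    )
```

With requests, `fetch` could be something like `lambda: requests.get(url, timeout=30)`. This sketch does not assume that pooch offers a hook for plugging in such a retry.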

Thanks in advance for any help. I'm happy to provide any additional info as well.

Footnotes

  1. can't resist a bad pun, sorry
