Skip to content

Conversation

ianhi
Copy link
Contributor

@ianhi ianhi commented Sep 30, 2025

Addresses part of #3495 by attempting to autodetect if a URL is S3 compatible or not.
With this change the following "just works"

import zarr
z = zarr.open("https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.5/idr0062A/6001240_labels.zarr", storage_options={"anon":True})
print(list(z.keys()))

This is equivalent to:

  zarr.open(
      "s3://idr/zarr/v0.5/idr0062A/6001240_labels.zarr",
      storage_options={
          "anon": True,
          "client_kwargs": {"endpoint_url": "https://uk1s3.embassy.ebi.ac.uk"}
      }
  )

but now zarr handles that splitting for the user

  • Add unit tests and/or doctests in docstrings
  • [NA?] Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Sep 30, 2025
Comment on lines +495 to +502
is_likely_s3 = (
any(has_s3_token(part) for part in hostname_parts)
or "object-store" in hostname # Less likely to have false positives
or "objectstore" in hostname
or "minio" in hostname_parts # minio.example.com
or "ceph" in hostname_parts # ceph.example.com
or "rgw" in hostname_parts # rgw.example.com (Ceph RADOS Gateway)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how reliable is this?

@d-v-b
Copy link
Contributor

d-v-b commented Sep 30, 2025

@martindurant and @kylebarron relevant to your interests.

@martindurant
Copy link
Member

I don't understand: why would someone provide an HTTP URL for something that you need the S3 protocol for? The storage options only make sense to s3, not http, so the caller already knows.

@d-v-b
Copy link
Contributor

d-v-b commented Sep 30, 2025

I'm inclined against the changes in this PR -- "https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.5/idr0062A/6001240_labels.zarr" is unambiguously an https URL, and so it makes sense to use an HTTP-based storage backend to access it. For example, "https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.5/idr0062A/6001240_labels.zarr/zarr.json" works just fine. Directory listing does not work, but that's pretty common for zarr data over http.

@ianhi
Copy link
Contributor Author

ianhi commented Sep 30, 2025

I ran into this today when trying to load data from: https://idr.github.io/ome-ngff-samples/

the "copy s3 url" button gives a one line URL: https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.5/idr0062A/6001240_labels.zarr

which I naively expected to work in zarr. So this PR is basically protection against that situation for other naive users possibly sourcing URLs that look like that.

Ideally zarr would just work and protect me from this, or at least it should have thrown an error, rather silently failing by not returning any data but otherwise working

@d-v-b
Copy link
Contributor

d-v-b commented Sep 30, 2025

the "copy s3 url" button gives a one line URL: https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.5/idr0062A/6001240_labels.zarr

IMO a button that says "copy s3 url" should give you an s3 url, not an http url

@ianhi
Copy link
Contributor Author

ianhi commented Sep 30, 2025

IMO a button that says "copy s3 url" should give you an s3 url, not an http url

I agree! but I suspect this is not the only place this is happening and we can't fix it everywhere

Also is it possible to specify the endpoint url and bucket and path in one s3: url?

@martindurant
Copy link
Member

Exactly what @d-v-b says. That URL is just wrong, it doesn't point to anything! If you poke it, the returned headers (of the 404) does contain an "x-amz-*" key, but that might be true from a real website that is hosted somewhere within AWS.

@ianhi
Copy link
Contributor Author

ianhi commented Sep 30, 2025

hmm. ok. So should zarr have raised an error for what I did here: #3495 when it loaded as an http store that didn't fully behave like one?

and am I right that there's no one liner for that website to return becuase you can't specify an endpoint-url within the s3: syntax?

@ianhi
Copy link
Contributor Author

ianhi commented Sep 30, 2025

I am updating the URLS over there: IDR/ome-ngff-samples#28

going to close this now as I hear the arguments against this change. I would sitll like to find a way to make this more transparent to users but this is clearly not the right approach.

@ianhi ianhi closed this Sep 30, 2025
@joshmoore
Copy link
Member

and am I right that there's no one liner for that website to return becuase you can't specify an endpoint-url within the s3: syntax?

That's my understanding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs release notes Automatically applied to PRs which haven't added release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants