Add ZEP 8 (URL syntax) draft #48
@normanrz Please take a look.
@MSanKeys963 Looks like there is an issue with the docs build that is unrelated to this PR.
@martindurant Would appreciate your perspective on this --- I imagine you might say that we should just use fsspec syntax instead, though.
Well indeed, I could say "why invent another"; although translating between
While standardizing a URL scheme has benefits on its own, I think the main benefit/motivation for this ZEP is the formalization of Zip stores. Essentially, to comply with this ZEP, implementations need to implement zip stores. Maybe that should be written out more explicitly?
While this ZEP was prompted by our discussion about zip stores, my intention was that we standardize on the syntax for various protocols, but that implementations would choose which ones to support. I think we could also push implementations to support zip format, but I'm not sure I want to tie that to this URL syntax proposal.
@bogovicj this might also be relevant for your OME transformations proposal.
@jbms: I have added #51 to fix the RTD build. Can you please update your PR?
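Thanks @jbms for putting this together! There are a few situations I came up with for which I'm not sure what the relative URL should be.

What does it look like to use `..:` to "go up" multiple levels? Is this correct / valid?
Base URL: gs://bucket/0.zip|zip:a|zarr3:i
Relative URL: ..:..:1.zip|zip:b|zarr3:ii
Resolved URL: gs://bucket/1.zip|zip:b|zarr3:ii

Is it correct / valid to use `..` in the "path part" of a relative URL, after a `..:`?
Base URL: gs://bucket/0/a/i.zarr|zarr3:foo
Relative URL: ..:../b/i.zarr|zarr3:foo
Resolved URL: gs://bucket/0/b/i.zarr|zarr3:foo

If one needs to add an adapter in a relative way, how does one go about it? For example:
Base URL: gs://bucket/0/a/i.zarr
Desired Resolved URL: gs://bucket/0/a/i.zarr|zarr3:foo

Which, if any, of these do you think should be used? Are any of these invalid?
- .|zarr3:foo (clearest to me)
- |zarr3:foo
- zarr3:foo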
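One more thing: We've found it useful to be able to reference a particular part of the attributes stored in JSON with a URL. For example, for this zarr3 zarr.json:
{
  "zarr_format": 3,
  "node_type": "array",
  "shape": [10000, 1000],
  "dimension_names": ["rows", "columns"],
  "data_type": "float64",
  "chunk_grid": { "name": "regular", "configuration": { "chunk_shape": [1000, 100] } },
  "chunk_key_encoding": { "name": "default", "configuration": { "separator": "/" } },
  "codecs": [{ "name": "gzip", "configuration": { "level": 1 } }],
  "fill_value": "NaN",
  "attributes": { "foo": 42, "bar": "apples", "baz": [1, 2, 3, 4] }
}
- /attributes/baz[0] points to 1
- /shape points to [10000, 1000]
- /chunk_grid/configuration points to { "chunk_shape": [1000, 100] }

Could you envision adding an `attributes:` or `zarr.json:`, or similar adapter, that enables this? For example: gs://bucket/0.zip|zip:a|zarr3:i|zarr.json:attributes/foo

A specific use case: I often re-use and reference transformations. Since these are described by metadata (not arrays), referencing the specific metadata is helpful. For example, if this were adopted, something like this would not be uncommon in my workflows:
{
  "type" : "sequence",
  "transformations" : [
    { "url" : "..:/localTransformations|zarr.json:/transform[1]" },
    { "url" : "gs://bucket/path/to/templateTransformation.zarr|zarr3:sharedTransforms|zarr.json:/transform[0]" }
  ]
}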
On Tue, Nov 14, 2023, 05:53 John Bogovic ***@***.***> wrote:

> Thanks @jbms <https://github.com/jbms> for putting this together! There
> are a few situations I came up with for which I'm not sure what the
> relative URL should be.
>
> What does it look like to use `..:` to "go up" multiple levels?
> Is this correct / valid?
>
> Base URL: gs://bucket/0.zip|zip:a|zarr3:i
> Relative URL: ..:..:1.zip|zip:b|zarr3:ii
> Resolved URL: gs://bucket/1.zip|zip:b|zarr3:ii

I was imagining that the relative URL would be:
`|..|..:1.zip|zip:b|zarr3:ii`
The part after the `|` is always the scheme, and a scheme of `..` is needed to
get to the parent store.

> Is it correct / valid to use `..` in the "path part" of a relative URL, after
> a `..:`?
>
> Base URL: gs://bucket/0/a/i.zarr|zarr3:foo
> Relative URL: ..:../b/i.zarr|zarr3:foo
> Resolved URL: gs://bucket/0/b/i.zarr|zarr3:foo
>
> If one needs to add an adapter in a relative way, how does one go about it?
> For example:
>
> Base URL: gs://bucket/0/a/i.zarr
> Desired Resolved URL: gs://bucket/0/a/i.zarr|zarr3:foo
>
> Which, if any, of these do you think should be used? Are any of these
> invalid?
>
> - .|zarr3:foo (clearest to me)
> - |zarr3:foo
> - zarr3:foo

I was imagining `|zarr3:foo`.
The existing standard interpretation of a relative URL of `.` means to
strip everything after the last slash, and we should be consistent with
that. Therefore if the base URL were specified as
`gs://bucket/0/a/i.zarr/` then `.|zarr3:foo` would also be valid, but
probably should not be preferred.
On Tue, Nov 14, 2023, 07:21 John Bogovic ***@***.***> wrote:

> One more thing:
> We've found it useful to be able to reference a particular part of the
> attributes stored in JSON with a URL. For example, for
> this zarr3 zarr.json:
>
> {
>   "zarr_format": 3,
>   "node_type": "array",
>   "shape": [10000, 1000],
>   "dimension_names": ["rows", "columns"],
>   "data_type": "float64",
>   "chunk_grid": {
>     "name": "regular",
>     "configuration": {
>       "chunk_shape": [1000, 100]
>     }
>   },
>   "chunk_key_encoding": {
>     "name": "default",
>     "configuration": {
>       "separator": "/"
>     }
>   },
>   "codecs": [{
>     "name": "gzip",
>     "configuration": {
>       "level": 1
>     }
>   }],
>   "fill_value": "NaN",
>   "attributes": {
>     "foo": 42,
>     "bar": "apples",
>     "baz": [1, 2, 3, 4]
>   }
> }
>
> - /attributes/baz[0] points to 1
> - /shape points to [10000, 1000]
> - /chunk_grid/configuration points to { "chunk_shape": [1000, 100] }
>
> Could you envision adding an attributes: or zarr.json:, or similar
> adapter, that enables this?

Yes, having a scheme for accessing an attribute sounds like a good idea.
One option would be a specific scheme for zarr attributes, like zarr3a, e.g:
"gs://bucket/0.zip|zip:a|zarr3:i|zarr3a:/foo"
or
"gs://bucket/0.zip|zip:a/i|zarr3a:/foo"
Another option would be a json scheme for accessing any json file, e.g.:
"gs://bucket/0.zip|zip:a|zarr3:i/zarr.json|json:/attributes/foo"
Then there is the question of what syntax to use for specifying the path
within the json document. A natural choice would be the existing json
pointer syntax (https://datatracker.ietf.org/doc/html/rfc6901), e.g.
"/transform/1". The json pointer syntax does use an unusual escaping
syntax for handling member names containing "/": for example, if you have
an object like:
{"foo/bar": 10, "foo~bar": 11}
then to access the 10 value you use a json pointer of "/foo~1bar", and to
access the 11 value you use a json pointer of "/foo~0bar".
In my opinion this escaping mechanism is rather unfortunate since it is
easy to forget the meaning of "~0" and "~1", but it isn't an issue if you
can avoid using "/" or "~" in member names.
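
To make the escaping rule concrete, here is a minimal Python sketch of JSON Pointer (RFC 6901) evaluation. It is a simplified illustration, not a complete implementation: the function name `resolve_json_pointer` is hypothetical, the `-` array token is not supported, and leading zeros in array indices are not rejected. Note that in JSON Pointer syntax an array element is addressed as `/attributes/baz/0` rather than `/attributes/baz[0]`.

```python
def resolve_json_pointer(document, pointer: str):
    """Return the value inside `document` addressed by an RFC 6901 JSON Pointer."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    current = document
    for raw in pointer[1:].split("/"):
        # Unescape "~1" before "~0", as the RFC requires.
        token = raw.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            if not (token.isascii() and token.isdigit()):
                raise ValueError(f"invalid array index: {token!r}")
            current = current[int(token)]
        else:
            current = current[token]
    return current


doc = {"foo/bar": 10, "foo~bar": 11}
print(resolve_json_pointer(doc, "/foo~1bar"))  # 10
print(resolve_json_pointer(doc, "/foo~0bar"))  # 11

metadata = {"shape": [10000, 1000], "attributes": {"baz": [1, 2, 3, 4]}}
print(resolve_json_pointer(metadata, "/attributes/baz/0"))  # 1
```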

> … For example: gs://bucket/0.zip|zip:a|zarr3:i|zarr.json:attributes/foo
>
> A specific use case: I often re-use and reference transformations. Since
> these are described by metadata (not arrays), referencing the specific
> metadata is helpful.
> For example, if this were adopted, something like this would not be uncommon
> in my workflows:
>
> {
>   "type" : "sequence",
>   "transformations" : [
>     { "url" : "..:/localTransformations|zarr.json:/transform[1]" },
>     { "url" : "gs://bucket/path/to/templateTransformation.zarr|zarr3:sharedTransforms|zarr.json:/transform[0]" }
>   ]
> }
- New datasource URL syntax based on ZEP 8 proposal (zarr-developers/zeps#48) - Support for ZIP archives
From today's Zarr community meeting, @jbms has implemented this ZEP in Neuroglancer. Check here: google/neuroglancer#696
This is in line with zarr-developers/zeps#48 and the syntax supported by Neuroglancer. Currently, zip is supported. OCDBT support will be added in a subsequent commit. PiperOrigin-RevId: 755691199 Change-Id: Ia6cb84c12a986a7dd0ba65e41454fbe6d415aed0
@jbms: I tried pushing a merge of origin to try fixing the build, but was rejected. Could you give it a try?
I just did a thorough read of this to understand it and I have left some comments with a few typo fixes.
I also left comments on parts that took me a decent bit of work to understand, or that I don't fully understand in the hope that it's a helpful perspective. I'd rate myself as a competent but not expert reader of a document like this
Co-authored-by: Davis Bennett <[email protected]>
Co-authored-by: Sanket Verma <[email protected]> Co-authored-by: Ian Hunt-Isaak <[email protected]> Co-authored-by: Joe Hamman <[email protected]>
Thanks very much for your review. Based on your comments I made some significant revisions and would appreciate feedback. Based on my revisions it occurs to me that this may be better as an independent standard, and the zarr spec could just recommend that implementations support it.
Thank you for the updates, I found it significantly easier to understand on this close reading. I've left a few more comments on the few remaining areas where I found myself confused.
- `dataset`: An array, group, or other dataset with a defined format
  e.g. a zarr array or group.
I still struggled a bit here with the general definition of "dataset". Is there a convention here that I'm missing? I understand zarr array or group, and when I hear dataset my brain jumps to the xarray dataset.
This definition is referred to multiple times below so any further clarification here would be helpful.
A zarr attribute may be defined that specifies the location of some other
related array using the relative URL pipeline syntax.

The referencing array may be located at
`s3://bucket/path/to/dataset.zip|zip:path/within/zip/|zarr3:`. Using only a
relative path, it could specify the path of another array within
`s3://bucket/path/to/dataset.zip:zip:`, e.g. the relative path
`../another/array/` would refer to
`s3://bucket/path/to/dataset.zip|zip:path/another/array/`. To refer to
`s3://bucket/path/of/another.zip|zip:other/array/`, the relative URL
pipeline `..:../of/another.zip|zip:other/array/|zarr3:` can be used.
The indentation here is conflicting with the backtick formatting. (at least on github)
The referencing array may be located at
`s3://bucket/path/to/dataset.zip|zip:path/within/zip/|zarr3:`. Using only a
relative path, it could specify the path of another array within
`s3://bucket/path/to/dataset.zip:zip:`, e.g. the relative path
- `s3://bucket/path/to/dataset.zip:zip:`, e.g. the relative path
+ `s3://bucket/path/to/dataset.zip|zip:`, e.g. the relative path
I think?

- `gs://bucket/path/to/data|byte-range:1000-2000`

- `tiff:`, `jpeg:`, `png:`, `bmp:`, `avif:`, `webp:`
For both these and byte-range, I don't know how something like zarr-python is meant to handle this. Surely zarr can't be responsible for reading different image formats?
This feels like it gets to your point:
> Based on my revisions it occurs to me that this may be better as an independent standard,

The following syntaxes are supported:

- `icechunk:path/to/node/`
I think this is implicitly specifying relative to HEAD on a default branch. Is that the intent? I think we can be more explicit that this points to the latest commit on `main`, which is created by default.

- `icechunk:` for [Icechunk](https://icechunk.io/en/latest/)

The base URL msut refer to a `directory` resource (which is expected
- The base URL msut refer to a `directory` resource (which is expected
+ The base URL must refer to a `directory` resource (which is expected
For the purpose of relative URLs, the path component does not include the
`@version/` prefix if present.
- For the purpose of relative URLs, the path component does not include the
- `@version/` prefix if present.
+ For the purpose of relative URLs, the path component does not include the
+ `@version/` prefix if present in the base URL.
?
Note: An `absolute_path` overrides any existing *path* component of the
inner-most sub-URL of the base , but is still relative to the scheme and other
components of the inner-most sub-URL of the base URL pipeline that precede its
path component, if any. The specific scheme of the sub-URL defines what
portion, if any, constitutes the path component.
An example of an `absolute_path` would be very helpful. I don't fully understand how the scheme and other components collapse without the path.
small typo fix:
- Note: An `absolute_path` overrides any existing *path* component of the
- inner-most sub-URL of the base , but is still relative to the scheme and other
- components of the inner-most sub-URL of the base URL pipeline that precede its
- path component, if any. The specific scheme of the sub-URL defines what
- portion, if any, constitutes the path component.
+ Note: An `absolute_path` overrides any existing *path* component of the
+ inner-most sub-URL of the base, but is still relative to the scheme and other
+ components of the inner-most sub-URL of the base URL pipeline that precede its
+ path component, if any. The specific scheme of the sub-URL defines what
+ portion, if any, constitutes the path component.
The use of outer-to-inner order for the sub-URLs enables completion of both
paths and sub-URL schemes as the user types.
I found this very compelling as a reason

If passed a URL that resolves to a `dataset` resource, returns an error.

- `open_file`: opens a file from a URL
Is this the use case for `byte-range` and `tiff`? Would I use `zarr.open_file("s3:/some/path/image.tiff")` to get a file handle on that tiff that I could then pass to tifffile?