diff --git a/draft/ZEP0008.md b/draft/ZEP0008.md new file mode 100644 index 0000000..e2e4d32 --- /dev/null +++ b/draft/ZEP0008.md @@ -0,0 +1,684 @@ +--- +layout: default +title: ZEP0008 +description: URL pipeline syntax +parent: draft ZEPs +nav_order: 8 +--- + +# ZEP 8 — URL pipeline syntax + +--- + +``` +Author: Jeremy Maitin-Shepard, Google Research + +Status: Draft + +Type: Specification + +Created: 2023-09-07 +``` + +## Abstract + +This proposal defines a URL pipeline syntax for specifying how to +locate a zarr node, a plain key-value store, or other resources. + +## Motivation and Scope + +A URL syntax for zarr nodes that is common across multiple zarr implementations +enables users to more easily share dataset locations between different tools. + +While in simple cases it is sufficient to specify the zarr node location using +an existing, well-established URL scheme like `file` or `http`, or existing but +less-standard URL schemes like `s3` and `gs`, for nested storage mechanisms like +the ZIP format, and for nodes within a zarr hierarchy, there is no existing +established URL syntax. + +Additionally, for implementations that support data formats other than zarr v3, +it may be necessary to also indicate the data format as part of the URL syntax. + +This proposal defines a new absolute and relative URL syntax that addresses this +need. + +## Usage and Impact + +Zarr implementations that support this URL syntax are expected to provide an API +for opening a Zarr node at a given URL, and for obtaining the URL corresponding +to an open Zarr node. + +Using these APIs, data can easily be shared between different Zarr +implementations that support the proposed URL syntax. + +Implementations may also optionally support this syntax for uses +beyond Zarr, such as specifying a bare key-value store or specifying +data in non-zarr formats. 
+ +## Detailed description + +This ZEP defines a *URL pipeline syntax* that may be optionally supported by +Zarr implementations in order to allow the location of a zarr array or group to +be specified in a convenient, implementation-independent way. + +More precisely, it defines a URL syntax that may specify: + +- A root key-value store, such as a path within an S3 bucket or a path + on the local filesystem, e.g. `s3://bucket/path`; + +- A nested key-value store, such as a sub-directory within a ZIP file + within an S3 bucket, + e.g. `s3://bucket/path/to/archive.zip|zip:path/within/zip/`; + +- A zarr v2 or v3 node within a zarr v3 hierarchy within a particular + key-value store, + e.g. `s3://bucket/path/to/zarr/|zarr3:path/within/hierarchy/`; + +- A non-zarr dataset within a zarr v3 hierarchy within a particular + key-value store, e.g. `s3://bucket/path/to/image.tiff|tiff:` or + `s3://bucket/path/to/dataset.n5/|n5:path/within/hierarchy/`. + +### Resource kinds + +The proposed URL pipeline syntax may refer to several different kinds of +resources: + +- `file`: Single file within a key-value store, no specific data + format. + +- `directory`: Single directory within a key-value store, no specific + data format. + +- `dataset`: An array, group, or other dataset with a defined format + e.g. a zarr array or group. + +Depending on the URL schemes involved, in some cases the resource kind +can be determined syntactically from the URL alone, while in other +cases it can only be resolved by actually accessing the resource. + +### Absolute URL syntax + +The proposed "zarr URL pipeline syntax" has the following ABNF grammar: + +``` +absolute_url_pipeline = root_url *( "|" adapter ) +``` + +where a `root_url` specifies an absolute resource location, such as +`file:/path/to/local/file`, and `adapter` specifies a nested resource using a +specified protocol, such as `zip:path/within/zip`. The `root_url` and each of +the `adapter` portions are considered "sub-URLs". 
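The grammar above can be illustrated with a minimal Python sketch. The function names `split_url_pipeline` and `split_adapter` are hypothetical and not part of this proposal; the sketch assumes only that `|` never occurs inside a sub-URL, and that an adapter written without a trailing `:` is treated as having an empty path (a shorthand this ZEP permits).

```python
def split_url_pipeline(url: str) -> list[str]:
    """Split an absolute URL pipeline into its sub-URLs.

    The first element is the root URL; the remaining elements are adapters.
    """
    parts = url.split("|")
    if not parts[0]:
        raise ValueError("a URL pipeline must start with a root URL")
    if any(not adapter for adapter in parts[1:]):
        raise ValueError("empty adapter sub-URL")
    return parts


def split_adapter(adapter: str) -> tuple[str, str]:
    """Split an adapter sub-URL into (scheme, remainder).

    An adapter without ':' (e.g. 'zip') yields an empty remainder.
    """
    scheme, _, remainder = adapter.partition(":")
    return scheme, remainder
```

For example, `split_url_pipeline("s3://bucket/path/to/archive.zip|zip:path/within/zip/")` yields the root URL followed by the single `zip:` adapter.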
+ +For a given `adapter`, the sequence of `root_url` and prior `adapter` +sub-URLs is called the "base URL". + +The following `root_url` schemes are defined: + +- `file:` as defined by [RFC8089](https://datatracker.ietf.org/doc/html/rfc8089). + + `file:/absolute/path` and `file:///absolute/path` are supported. + + Implementations that have a *current working directory* MAY support + the non-standard extension `file:relative/path` where + `relative/path` is resolved relative to the current working + directory. + + Implementations SHOULD NOT support `file://relative/path` since that + is ambiguous with the `file://hostname/path` syntax defined by + [RFC8089](https://datatracker.ietf.org/doc/html/rfc8089). + + If the path is empty or ends with `/`, the resultant resource kind + is known syntactically to be a `directory`. + + Otherwise, the resultant resource kind may be either `file` or + `directory`. + +- `http:` and `https:` for generic HTTP servers + + In general only single-key read operations (corresponding to GET + requests) are supported. + + Implementations may choose to heuristically support list operations + by detecting server support for HTML directory listings and/or + S3-compatible listing. + + The resultant resource kind may be either `file` or `directory`. + +- `s3://bucket/path/within/bucket` for AWS S3 + + The endpoint, appropriate credentials, and bucket region (for + non-anonymous access) must be determined automatically. + + If the path is empty or ends with `/`, the resultant resource kind + is known syntactically to be a `directory`. + + Otherwise, the resultant resource kind may be either `file` or + `directory`. + +- `s3+http:` and `s3+https:` for S3-compatible servers + + This URL syntax is AMBIGUOUSLY either: + + - `s3+https://endpoint/path/within/bucket`, assuming `endpoint` + corresponds to a single bucket, + e.g. 
`s3+https://mybucket.s3.amazonaws.com/bucket/path`, or + + - `s3+https://endpoint/bucket/path/within/bucket`, assuming + `endpoint` corresponds to multiple buckets, + e.g. `s3+https://s3.amazonaws.com/bucket/path`. + + The appropriate credentials (and bucket region for non-anonymous + access) must be determined automatically. + + For most operations, e.g. to GET a single object, this ambiguity + does not matter, but for LIST operations the implementation must + determine the path to the root of the bucket, e.g. by initially + attempting the LIST operation with both possible paths and checking + which one succeeds. For subsequent operations on the same endpoint + the result can be cached to avoid overhead. + + If the path is empty or ends with `/`, the resource kind is known + syntactically to be a `directory`. + + Otherwise, the resultant resource kind may be either `file` or + `directory`. + + For the purpose of relative URLs, the path component includes the `bucket/` + prefix even if the endpoint is in fact a multi-bucket endpoint. + +- `gs://bucket/path/within/bucket` for Google Cloud Storage (GCS) + + If the path is empty or ends with `/`, the resultant resource kind + is known syntactically to be a `directory`. + + Otherwise, the resultant resource kind may be either `file` or + `directory`. + +The following `adapter` URL schemes are defined: + +- `zip:path/within/zip` for ZIP archive format + + The base URL must refer to a `file` resource (which is expected to + be in ZIP format). + + If the path is empty or ends with `/`, the resultant resource kind + is known syntactically to be a `directory`. + + Otherwise, the resultant resource kind may be either `file` or + `directory`. + +- `ocdbt:` for [OCDBT](https://google.github.io/tensorstore/kvstore/ocdbt/index.html) + + The base URL must refer to a `directory` resource (which is expected + to be in OCDBT format). 
+ + The URL syntax is `ocdbt:path` or `ocdbt:@version/path`, where + `version` is either `v123` or `2025-01-01T01:23:45.678Z`. + + For example: + + - `file:///tmp/dataset.ocdbt/|ocdbt:@v1/path/within/database` + - `file:///tmp/dataset.ocdbt/|ocdbt:path/within/database` + - `file:///tmp/dataset.ocdbt/|ocdbt:@2025-01-01T01:23:45.678Z/path/within/database` + + While `@` is normally allowed within the path component of a URL, + with the `ocdbt:` URL scheme, if the path starts with `@` it must be + percent-encoded as `%40` to avoid ambiguity with the `@version` + component. For example, a path of `@abc` can be specified as + `ocdbt:%40abc`. + + The resultant resource kind may be either `file` or `directory`. + + For the purpose of relative URLs, the path component does not include the + `@version/` prefix if present. + +- `icechunk:` for [Icechunk](https://icechunk.io/en/latest/) + + The base URL must refer to a `directory` resource (which is expected + to contain an Icechunk database). + + The following syntaxes are supported: + + - `icechunk:path/to/node/` + - `icechunk:@branch.BRANCH/path/to/node/` + - `icechunk:@tag.TAG/path/to/node/` + - `icechunk:@SNAPSHOT/path/to/node/` + + For example: + + - `file:///path/to/repo.zarr.icechunk/|icechunk:|zarr3:path/to/array/` + - `file:///path/to/repo.zarr.icechunk/|icechunk:@branch.other/|zarr3:path/to/array/` + - `file:///path/to/repo.zarr.icechunk/|icechunk:@tag.v5/|zarr3:path/to/array/` + - `file:///path/to/repo.zarr.icechunk/|icechunk:@4N0217AZA4VNPYD0HR0G/|zarr3:path/to/array/` + + While `@` is normally allowed within the path component of a URL, + with the `icechunk:` URL scheme, if the path starts with `@` it must + be percent-encoded as `%40` to avoid ambiguity with the `@version` + component. For example, a path of `@abc/` can be specified as + `icechunk:%40abc/`. + + If the path is empty or ends with `/`, the resultant resource kind + is known syntactically to be a `directory`. 
+ + Otherwise, the resultant resource kind may be either `file` or + `directory`. + + For the purpose of relative URLs, the path component does not include the + `@version/` prefix if present. + +- `zarr3:path/within/hierarchy/` to specify a zarr v3 node + + The base URL must refer to a `directory` resource, which is expected + to contain a zarr v3 hierarchy. + + The resultant resource kind is always `dataset`. + +- `zarr2:path/within/hierarchy/` to specify a zarr v2 node + + The base URL must refer to a `directory` resource, which is expected + to contain a zarr v2 hierarchy. + + The resultant resource kind is always `dataset`. + +- `zarr:path/within/hierarchy/` to specify a zarr v2 or v3 node + + The base URL must refer to a `directory` resource, which is expected + to contain a zarr v2/v3 hierarchy. + + The implementation must determine the zarr format version automatically. + + The resultant resource kind is always `dataset`. + +- `n5:path/within/hierarchy` to specify an [N5](https://github.com/saalfeldlab/n5) group or array + + The base URL must refer to a `directory` resource, which is expected + to contain an N5 hierarchy. + + Because N5 is defined to inherit attributes from ancestor groups in the + hierarchy, it is recommended that the base URL refers to the root of the n5 + hierarchy, and any path within the hierarchy be specified through the `n5:` + scheme. + + The resultant resource kind is always `dataset`. + +- `gzip:`, `zstd:` for transparent access to compressed files + + The base URL must refer to a `file` resource, which is expected to be in the + format indicated by the URL scheme. + + Currently no path is supported. + + The resultant resource kind is always `file`. + + For example: + + - `gs://bucket/path/to/data.gz|gzip:` + - `gs://bucket/path/to/data.zstd|zstd:` + +- `byte-range:start-end` for specifying a byte range within a file + + The base URL must refer to a `file` resource, which is expected to support + byte range access. 
+ + The `start` and `end` components of the URL specify byte offsets in base 10. + The `start` bound is inclusive while the `end` bound is exclusive; the total + length is `end - start`. + + The resultant resource kind is always `file`. + + For example: + + - `gs://bucket/path/to/data|byte-range:1000-2000` + +- `tiff:`, `jpeg:`, `png:`, `bmp:`, `avif:`, `webp:` + + The base URL must refer to a `file` resource, which is expected to + contain an image in the format indicated by the URL scheme. + + Currently no path is supported. + + The resource kind is always `dataset`. + +- `neuroglancer-precomputed:` for [Neuroglancer + precomputed](https://neuroglancer-docs.web.app/datasource/precomputed/index.html) + + The base URL must refer to a `directory` resource, which is expected to + contain a neuroglancer precomputed dataset. + + Currently no path component is allowed by the `neuroglancer-precomputed` URL + scheme. + + The resource kind is always `dataset`. + +- `json:path` + + The base URL must refer to a `file` resource, which is expected to + contain an encoded JSON document. + + The `path` is in [JSON pointer + syntax](https://datatracker.ietf.org/doc/html/rfc6901) and indicates + a sub-value within the JSON document. An empty path corresponds to + the entire JSON document. + + The resource kind is always `dataset`, specifically a rank-0 array + with data type `json`. + +- `..:path/within/outer/sub-URL` and `..:/path/within/outer/sub-URL` may be used + to traverse out from the prior adapter. + + This scheme is primarily useful within the relative URL pipeline syntax + defined below. + + Any `..` adapters are resolved in order. The presence of a `..` adapter + causes the prior adapter to be discarded. The path component is interpreted + relative to the path of the sub-URL immediately prior to the discarded adapter + sub-URL. + + It is an error if there are no remaining adapter sub-URLs when resolving the + `..` adapter. 
+ + For security, implementations SHOULD place limits on where this scheme is + permitted. + +If the adapter URL would otherwise consist of just the scheme followed by ":", +it is permitted to omit the final ":". For example: + +- `https://example.com/path/to/archive.zip|zip|zarr3` is equivalent to + `https://example.com/path/to/archive.zip|zip:|zarr3:`. + +It is expected that additional URL schemes may be standardized in the future. + +#### Examples + +Examples: + +- `https://server.example.com:1234/path/to/array` + + Specifies a normal HTTPS URL. + +- `s3://bucket/path/to/file` + + Specifies: + - within the AWS S3 bucket named `bucket`, + - the path `path/to/file`. + +- `gs://bucket/path/to/outer.zip|zip:path/to/inner.zip|zip:path/to/zarr/hierarchy|zarr3:path/to/array` + + Specifies: + - within the GCS bucket named `bucket`, + - within the ZIP file at the path `path/to/outer.zip`, + - within the ZIP file at the path `path/to/inner.zip`, + - within the Zarr v3 hierarchy at the path `path/to/zarr/hierarchy/`, + - the Zarr v3 node at the path `path/to/array/`. + +- `gs://bucket/path/to/outer.zip|zip:path/to/inner.zip|..:other/zarr/hierarchy|zarr3:path/to/array` + + Normalizes to: + + `gs://bucket/path/to/other/zarr/hierarchy/|zarr3:path/to/array` + +### Format auto-detection + +Implementations MAY support format auto-detection for certain `adapter` URL +schemes. + +For a given base URL specifying a `file` or `directory` resource, the +implementation determines a set of matching `adapter` URLs: + +- For a base `file` resource, this is typically done by reading a prefix and/or + suffix of the file in order to match expected signatures; + +- For a base `directory` resource, this is typically done by checking for the + presence of certain files. + +Given a base URL specifying a `file` or `directory` resource, to obtain a +`dataset` resource using format auto-detection, the implementation: + +1. Determines the set of matching `adapter` URLs for the current base URL. 
If + there is exactly one match, add the matching adapter to the current base URL + to obtain a new base URL. Otherwise, return an error. + +2. If the new base URL is a `dataset` resource, return the new base URL as the + successful format auto-detection result. Otherwise, continue back at step 1 + with the new base URL as the current base URL. + +### Context-dependent URL pipeline interpretation + +Implementations MAY interpret URL pipelines in a context-dependent way. For +example, consider the following hypothetical APIs (which may not all be part of +the same software): + +- `open_array`: opens an arbitrary array from a URL + + If passed a URL that resolves to a `file` or `directory` resource, performs + format auto-detection to obtain a `dataset` resource. + + If format auto-detection fails or the resultant `dataset` resource is not an + array, fails with an error. + + Otherwise, opens the resolved URL as an array. + +- `open_zarr_array`: opens a zarr array from a URL with format auto-detection + + Same as `open_array`, except that if the resolved `dataset` resource is not a + zarr array, fails. + +- `open_zarr_array_without_auto_detection`: opens a zarr array without format + auto-detection + + If passed a URL that resolves to a `file` resource, fails with an error. + + If passed a URL that resolves to a `directory` resource, append the `zarr:` + adapter and open it. + + If passed a URL that resolves to a `dataset` resource, open it and fail if it + is not in zarr format. + +- `open_kvstore`: opens a key-value store file or directory from a URL + + If passed a URL that resolves to a `file` or `directory` resource, opens it. + + If passed a URL that resolves to a `dataset` resource, returns an error. + +- `open_file`: opens a file from a URL + + If passed a URL that resolves to a `file` resource, opens it. + + Otherwise, returns an error. 
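The two-step format auto-detection procedure that functions such as `open_array` rely on can be sketched in Python as follows. This is only an illustration under stated assumptions: `match_adapters` and `resource_kind` are hypothetical callbacks that an implementation would supply (for example by probing file signatures or listing a directory), and the `max_layers` bound is an extra safety limit that this ZEP does not require.

```python
from typing import Callable


def auto_detect(
    base_url: str,
    match_adapters: Callable[[str], list[str]],  # hypothetical: returns candidate adapters
    resource_kind: Callable[[str], str],  # hypothetical: "file", "directory" or "dataset"
    max_layers: int = 16,  # assumption: guard against unbounded nesting
) -> str:
    """Append auto-detected adapters until a `dataset` resource is reached."""
    current = base_url
    for _ in range(max_layers):
        # Step 1: exactly one adapter must match the current base URL.
        matches = match_adapters(current)
        if len(matches) != 1:
            raise ValueError(
                f"expected exactly one matching adapter for {current!r}, "
                f"found {len(matches)}"
            )
        current = f"{current}|{matches[0]}"
        # Step 2: stop once the new base URL refers to a dataset.
        if resource_kind(current) == "dataset":
            return current
    raise ValueError(f"no dataset found within {max_layers} layers")
```

Passing the probing logic in as callbacks keeps the loop independent of any particular storage backend, which also makes it easy to exercise with in-memory fakes.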
+ +### Relative URL pipeline syntax + +Relative URL pipelines permit the locations of resources to be specified +relative to some base URL pipeline that is specified separately, potentially +traversing through one or more layers of adapter. + +For example: + + A zarr attribute may be defined that specifies the location of some other + related array using the relative URL pipeline syntax. + + The referencing array may be located at + `s3://bucket/path/to/dataset.zip|zip:path/within/zip/|zarr3:`. Using only a + relative path, it could specify the path of another array within + `s3://bucket/path/to/dataset.zip|zip:`, e.g. the relative path + `../another/array/` would refer to + `s3://bucket/path/to/dataset.zip|zip:path/another/array/`. To refer to + `s3://bucket/path/of/another.zip|zip:other/array/`, the relative URL + pipeline `..:../of/another.zip|zip:other/array/|zarr3:` can be used. + +The relative URL pipeline syntax has the following ABNF grammar: + +``` +relative_url_pipeline = ( absolute_path / relative_path ) *( "|" adapter ) + / absolute_url_pipeline +``` + +A relative zarr URL is always resolved relative to a specified base URL +pipeline. The initial `absolute_path` or `relative_path` applies to the path +component of the inner-most (last) sub-URL. If the `relative_path` is the empty +string, the path component of the inner-most sub-URL remains unchanged. After +applying the `absolute_path` or `relative_path` to the existing absolute URL, +any specified adapter sub-URLs are appended. + +Note: An `absolute_path` overrides any existing *path* component of the +inner-most sub-URL of the base URL pipeline, but is still relative to the scheme and other +components of the inner-most sub-URL of the base URL pipeline that precede its +path component, if any. The specific scheme of the sub-URL defines what +portion, if any, constitutes the path component. 
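A rough Python sketch of the resolution rules in this section follows. It is not a reference implementation: the function names are hypothetical, and scheme-specific path rules (such as the `@version/` prefix of `ocdbt:` and `icechunk:`) are ignored. It also assumes that paths in sub-URLs of the form `scheme://authority/path` and in adapters are written without a leading `/`, and that a relative pipeline may begin directly with a `..:` adapter, as in the examples of this section.

```python
def _split_sub_url(sub_url: str) -> tuple[str, str]:
    """Return (prefix, path); the prefix is everything before the path."""
    scheme, _, rest = sub_url.partition(":")
    if rest.startswith("//"):
        authority, _, path = rest[2:].partition("/")
        return f"{scheme}://{authority}/", path
    return f"{scheme}:", rest


def _remove_dot_segments(path: str) -> str:
    """Collapse '.' and '..' segments, preserving a trailing slash."""
    absolute = path.startswith("/")
    segments = path.lstrip("/").split("/")
    out: list[str] = []
    for i, seg in enumerate(segments):
        if seg in (".", ".."):
            if seg == ".." and out:
                out.pop()
            if i == len(segments) - 1:
                out.append("")  # keep the path directory-like
        else:
            out.append(seg)
    return ("/" if absolute else "") + "/".join(out)


def _apply_path(sub_url: str, rel_path: str) -> str:
    """Apply an absolute or relative path to the path of one sub-URL."""
    if rel_path == "":
        return sub_url  # an empty path leaves the sub-URL unchanged
    prefix, path = _split_sub_url(sub_url)
    if rel_path.startswith("/"):
        merged = rel_path
    else:
        merged = path[: path.rfind("/") + 1] + rel_path
    new_path = _remove_dot_segments(merged)
    if not path.startswith("/"):
        new_path = new_path.lstrip("/")
    return prefix + new_path


def resolve_relative(base: str, relative: str) -> str:
    """Resolve a relative URL pipeline against a base URL pipeline."""
    first, _, rest = relative.partition("|")
    adapters = rest.split("|") if rest else []
    sub_urls = base.split("|")
    if first.startswith("..:"):
        adapters.insert(0, first)  # leading `..` adapter, no initial path
    elif ":" in first.split("/", 1)[0]:
        return relative  # colon in first component: absolute URL pipeline
    else:
        sub_urls[-1] = _apply_path(sub_urls[-1], first)
    for adapter in adapters:
        scheme, _, path = adapter.partition(":")
        if scheme == "..":
            if len(sub_urls) < 2:
                raise ValueError("no adapter sub-URL left to discard")
            sub_urls.pop()  # discard the prior adapter
            sub_urls[-1] = _apply_path(sub_urls[-1], path)
        else:
            sub_urls.append(adapter)
    return "|".join(sub_urls)
```

For instance, resolving `file.zip|zip:path/within/zip` against the base `gs://bucket/path/to/` yields `gs://bucket/path/to/file.zip|zip:path/within/zip` under these assumptions.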
+ +As with regular URL syntax, it is not permitted for the first component of +`relative_path` to contain a colon (`:`), e.g. `a:b`, since that would be +ambiguous with specifying the base URL scheme for an absolute URL. Instead, +such a relative URL must be prefixed with `./`, e.g. `./a:b`. + +#### Examples + +- - Base URL: `gs://bucket/path/to/` + - Relative URL: `file.zip|zip:path/within/zip` + - Resolved URL: `gs://bucket/path/to/file.zip|zip:path/within/zip` + +- - Base URL: `gs://bucket/path/to/file.zip|zip:path/within/zip` + - Relative URL: `..:/path/to/other.zip|zip:path/in/other/zip` + - Resolved URL: `gs://bucket/path/to/other.zip|zip:path/in/other/zip` + +## Rationale + +This proposal takes into account several key considerations: + +- The URL syntax must support specifying: + - The underlying key-value store; + - The path within the key-value store of the root Zarr node; + - Optionally, a path within the Zarr hierarchy starting from the root Zarr + node. Note: Currently, as no storage transformers have been defined, the + path to any Zarr node may be specified directly as a path within the + underlying key-value store, making this additional path unnecessary. +- Must support nested key-value stores, like one or more layers of a ZIP + archive within some other key-value store. +- The URL syntax must be compatible with interactive completion as the user + types. +- The URL syntax must also be extensible for use with non-zarr formats. + +The use of outer-to-inner order for the sub-URLs enables completion of both +paths and sub-URL schemes as the user types. + +The sub-URL delimiter of `|` was chosen because it is not a valid URL character, +and therefore does not have any existing valid interpretation within URLs, and +also is evocative of POSIX shell pipe syntax. + +## Implementations + +- TensorStore (https://google.github.io/tensorstore/spec.html#json-TensorStoreUrl) + + Format auto-detection is also implemented. 
+ + The `relative_url_pipeline` syntax is not supported. + +- Neuroglancer (https://neuroglancer-docs.web.app/datasource/index.html#url-syntax) + + The `http:` and `https:` schemes automatically detect and support + HTML and S3-compatible directory listing. + + Format auto-detection is also implemented. + + The `relative_url_pipeline` syntax is not supported. + +- zarr-python (https://github.com/zarr-developers/zarr-python/pull/3369) + + The `relative_url_pipeline` syntax is not supported. + +## Related Work + +### fsspec + +The [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) library is +widely used with the zarr-python library to access a variety of storage systems, +and includes support for ZIP files and other nested stores. + +Like this proposal, the fsspec URL syntax consists of a sequence of sub-URLs +separated by a delimiter, but differs as follows: + +- fsspec uses a delimiter of `::` rather than `|` as used in this proposal. +- fsspec orders sub-URLs from innermost to outermost, which is the opposite + order from what is proposed here. + +The use of `::` as a delimiter of the sub-URLs means that fsspec URLs may +conform to the syntax of a normal URL, because `::` is permitted within the +path, query, and fragment components of a URL. This has both advantages and +disadvantages: + +- An fsspec URL may be accepted by existing URL parsers/matchers not + specifically designed for fsspec. +- Because the interpretation of the `::` delimiter within an fsspec URL differs from + the normal interpretation within a URL, operations such as relative path + resolution designed to operate on URLs generically may execute without errors + on an fsspec URL but produce an incorrect result. In contrast, the use of `|` + within this proposal ensures that the resultant syntax will not be confused + with a valid regular URL, because `|` is not a permitted character within + URLs. 
+ +The inner-to-outer order of sub-URLs in the fsspec URL syntax is not compatible +with the usual operation of text completion as the user types. It is also +opposite to the outer-to-inner order used for specifying paths within URLs. + +### Apache Commons VFS + +The [Apache Commons VFS](https://commons.apache.org/proper/commons-vfs/) is a +Java library that provides capabilities similar to those of the fsspec Python +library. + +The Apache Commons VFS URL syntax specifies the base scheme and all of the +sub-schemes, in inner to outer order, delimited by `:`, followed by the paths +for each scheme, in outer-to-inner order, delimited by `!`. + +For example: + +- `gz:tar:file:///extra/data/tryVfs/archive.tar!/tardir/content.txt.gz!content.txt`, + which under this proposal would be + `file:///extra/data/tryVfs/archive.tar|tar:tardir/content.txt.gz|gz` (assuming + the existence of `tar` and `gz` adapter schemes). + +As with the fsspec syntax, this URL syntax conforms to the standard URL syntax +but has a different interpretation, which has both advantages and disadvantages. + +Separating the adapter scheme from the adapter path makes the association of +adapter and path less obvious, particularly if there is more than one adapter. + +While the outer-to-inner order of the nested paths makes text completion of the +paths feasible, the URL syntax is not readily compatible with completion of the +nested schemes. + +### GDAL Virtual File Systems + +https://gdal.org/user/virtual_file_systems.html + +This uses a path syntax rather than a URL syntax. It supports chaining but +makes assumptions about paths (e.g. that a zip file always ends with .zip). + +For example: + +- `/vsizip//vsicurl/ftp://user:password@example.com/foldername/file.zip/example.shp`, + which under this proposal would be + `ftp://user:password@example.com/foldername/file.zip|zip:example.shp`. 
+ +## Backward Compatibility + +If Zarr implementations wish to add support for this proposed URL syntax to an +existing generic "open" interface that already supports other syntax, such as a +plain non-absolute file path or the fsspec URL syntax, there are potential +ambiguities: + +- A relative file path such as `file:/abc` can also be interpreted as a URL. + Presumably implementations would disambiguate this as a URL, which may (in + rare cases) change the behavior of existing code. +- A nested fsspec URL is unlikely to be a valid URL under this proposal, but a + non-nested fsspec URL may well be a valid URL under this proposal. In many + cases the interpretation will also be the same, but in some cases it may be + subtly different. + +## Discussion + +None yet. + +## Copyright + +This document has been placed in the public domain.