Docs and API follow-ups to #601 #619
@@ -20,6 +20,24 @@ Reading | |||||
|
|
||||||
| open_virtual_dataset | ||||||
|
|
||||||
| Parsers | ||||||
| ------- | ||||||
|
|
||||||
| Each parser understands how to read a specific file format, and a parser must be passed to :py:func:`~virtualizarr.open_virtual_dataset`. |
| @@ -0,0 +1,221 @@ | ||||||||||||||||||||||||
| (custom-parsers)= | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| # Custom parsers | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| This page explains how to write a custom parser for VirtualiZarr, to extract chunk references from an archival data format not already supported by the main package. | ||||||||||||||||||||||||
| This is advanced material intended for 3rd-party developers, and assumes you have read the page on [Data Structures](data_structures.md). | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| ```{note} | ||||||||||||||||||||||||
| "Parsers" were previously known variously as "readers" or "backends" in older versions of VirtualiZarr. We renamed them to avoid confusion with obstore readers and xarray backends. | ||||||||||||||||||||||||
| ``` | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| ## What is a VirtualiZarr parser? | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| All VirtualiZarr parsers are simply callables that accept the path to a file of a specific format, plus an instantiated [obstore](https://developmentseed.org/obstore/latest/) store with which to read it, and return an instance of the :py:class:`~virtualizarr.manifests.ManifestStore` class containing information about the contents of that file. | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| ```python | ||||||||||||||||||||||||
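| import obstore | ||||||||||||||||||||||||
| import virtualizarr as vz | ||||||||||||||||||||||||
| from obstore.store import ObjectStore | ||||||||||||||||||||||||
| from virtualizarr.manifests import ManifestGroup, ManifestStore, ObjectStoreRegistry | ||||||||||||||||||||||||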
|
|
||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| def custom_parser(file_url: str, object_store: ObjectStore) -> ManifestStore: | ||||||||||||||||||||||||
| # access the file's contents, e.g. using the ObjectStore instance | ||||||||||||||||||||||||
| readable_file = obstore.open_reader(object_store, file_url) | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| # parse the file contents to extract its metadata | ||||||||||||||||||||||||
| # this is generally where the format-specific logic lives | ||||||||||||||||||||||||
| manifestgroup: ManifestGroup = extract_metadata(readable_file) | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| # optionally create an object store registry, used to actually load chunk data from file later | ||||||||||||||||||||||||
| registry = ObjectStoreRegistry({store_prefix: object_store}) | ||||||||||||||||||||||||
|
Unfortunately I don't think this exists yet.

Tracked in #629
||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| # construct the ManifestStore from the parsed metadata and the object store registry | ||||||||||||||||||||||||
| return ManifestStore(group=manifestgroup, store_registry=registry) | ||||||||||||||||||||||||
|
Comment on lines +18 to +30

Writing this out made me realize it's a bit weird that exactly one
||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| vds = vz.open_virtual_dataset( | ||||||||||||||||||||||||
| file_url, | ||||||||||||||||||||||||
| object_store=object_store, | ||||||||||||||||||||||||
| parser=custom_parser, | ||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||
|
Comment on lines +33 to +37

I suggest we use context managers in examples to show recommended usage to ensure resources are properly managed to avoid leaks:

Tangentially, can we rename the
However, would that then cause potential confusion with

On context managers: Do we really need to? It makes all the examples more complex to read...
On renaming: I agree this is potentially confusing. I think I would prefer everything be

Adding context managers would certainly add a minor amount of complexity to the examples, but my fear is that most readers of any code examples (regardless of library) tend to repeat the same patterns, even if those patterns are likely not ideal for production code. How many context managers have I already had to add to the codebase itself to resolve problems (both in main code and test code)? At the very least, I recommend a very obvious, bold warning in at least one place in the docs (ideally somewhere most readers are likely to see) that very clearly indicates that use of context managers is recommended for production code, but for brevity, code examples will not use them. And the callout should show an explicit example of the recommended practice, so that the syntax is visually imprinted in the reader's mind.

My preference is to make repeated use of context managers throughout the examples, so that the repetition is imprinted in the reader's mind, and will be the syntax they repeat, rather than repeatedly not using context managers.
Even with a big, bold warning somewhere in the docs, I suspect the reader will repeat what they see, not what the warning says, because that's what they would repeatedly see in the examples. I recommend repeating the recommended practice, not repeating the "poor" practice simply for saving a modicum of keystrokes/simplification. Of course, if I'm outvoted, I won't block things.

That's totally reasonable. My only remaining concern is that it's trickier to do that in narrative documentation than in real code, because I need text between opening the virtual dataset and using the virtual dataset. But this isn't going to work if users copy it verbatim:

    with open_virtual_dataset() as vds:
        ...

    some explanatory text

    vds.virtualize.to_kerchunk()

In the docs I can't really wrap all later uses of
FWIW all your arguments could apply to the xarray documentation too, but they don't use context managers there either https://docs.xarray.dev/en/stable/user-guide/io.html#reading-and-writing-files

Fair point about interleaving prose with code. Perhaps we can at least find a good place to put a callout explaining that use of context managers is strongly recommended to prevent memory/resource leaks in critical code (along with a code example), but that for convenience throughout the docs, context managers might be dropped.

Added a note in 41f38ab
||||||||||||||||||||||||
| ``` | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| All parsers _must_ follow this exact call signature, enforced at runtime by checking against the :py:class:`virtualizarr.parsers.typing.Parser` typing protocol. | ||||||||||||||||||||||||
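To illustrate what a runtime check against a typing protocol involves, here is a minimal stdlib-only sketch. `ParserLike` is a hypothetical stand-in, not the real `virtualizarr.parsers.typing.Parser`, and the `object` annotations stand in for `ObjectStore` and `ManifestStore`. Note that `isinstance` against a `runtime_checkable` protocol only verifies that the required method exists, not that its signature matches.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ParserLike(Protocol):
    """Hypothetical stand-in for a parser protocol (not the VirtualiZarr one)."""

    def __call__(self, file_url: str, object_store: object) -> object: ...


def function_parser(file_url: str, object_store: object) -> object:
    # a plain function satisfies the protocol because functions define __call__
    return {"url": file_url}


class ClassParser:
    def __call__(self, file_url: str, object_store: object) -> object:
        return {"url": file_url}


class NotAParser:
    """Has no __call__, so it fails the runtime check."""


checks = {
    "function": isinstance(function_parser, ParserLike),
    "class instance": isinstance(ClassParser(), ParserLike),
    "non-callable": isinstance(NotAParser(), ParserLike),
}
```

Both a plain function and an instance of a class defining `__call__` pass the check, which is why the function-style and class-style parsers shown on this page can be used interchangeably.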
|
||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| ```{note} | ||||||||||||||||||||||||
| The object store registry can technically be empty, but to read actual chunks of data back from the `ManifestStore` later, the registry needs to contain at least one `ObjectStore` instance. | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
| The only time you might want to use an empty object store registry is if you are parsing a custom metadata-only references format without touching the original files it refers to, i.e. a format like Kerchunk or DMR++ that doesn't contain actual binary data values. | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
| ``` | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| ## What is the responsibility of a parser? | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| The VirtualiZarr package really does four separate things. | ||||||||||||||||||||||||
| In order, it: | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| 1. Maps the contents of common archival file formats to the Zarr data model, including references to the locations of the chunks. | ||||||||||||||||||||||||
| 2. Allows reading chosen variables into memory (e.g. via the `loadable_variables` kwarg, or reading from the `ManifestStore` using zarr-python directly). | ||||||||||||||||||||||||
| 3. Provides a way to combine arrays of chunk references using a convenient API (the Xarray API). | ||||||||||||||||||||||||
| 4. Allows persisting these references to storage for later use, in either the Kerchunk or Icechunk format. | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| **VirtualiZarr parsers are responsible for the entirety of step (1).** | ||||||||||||||||||||||||
| In other words, all of the assumptions required to map the data model of an archival file format to the Zarr data model, and the logic for doing so for a specific file, together constitute a parser. | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| **The ObjectStore instances are responsible for fetching the bytes in step (2).** | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| This design provides a neat separation of concerns, which is helpful in two ways: | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
| 1. The Xarray data model is subtly different from the Zarr data model (see below). Since the final objective is to create a virtual store which programmatically maps Zarr API calls to the archival file format at read-time, it is useful to separate that logic up front, before converting to the xarray virtual dataset representation and potentially confusing matters. | ||||||||||||||||||||||||
| 2. It also allows us to support reading data from the file via the `ManifestStore` interface, using zarr-python and obstore, but without using Xarray. | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| ## Reading data from the `ManifestStore` | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| As well as being a well-defined representation of the archival data in the Zarr model, the `ManifestStore` object also lets you read chunk data directly. | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| This works because the `ManifestStore` class is an implementation of the Zarr-Python `zarr.abc.Store` interface, and uses the [obstore](https://developmentseed.org/obstore/latest/) package internally to actually fetch chunk data when requested. | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| Reading data from the `ManifestStore` can therefore be done using the zarr-python API directly: | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
| ```python | ||||||||||||||||||||||||
| import zarr | ||||||||||||||||||||||||
| manifest_store = parser(url, object_store) | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| zarr_group = zarr.open_group(manifest_store) | ||||||||||||||||||||||||
| zarr_group.tree() | ||||||||||||||||||||||||
| ``` | ||||||||||||||||||||||||
| or using xarray: | ||||||||||||||||||||||||
| ```python | ||||||||||||||||||||||||
| import xarray as xr | ||||||||||||||||||||||||
| manifest_store = parser(url, object_store) | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| ds = xr.open_zarr(manifest_store) | ||||||||||||||||||||||||
| ``` | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| Note that using xarray like this produces an entirely non-virtual dataset, so it is equivalent to passing: | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| ```python | ||||||||||||||||||||||||
| ds = vz.open_virtual_dataset( | ||||||||||||||||||||||||
| file_url, | ||||||||||||||||||||||||
| object_store=object_store, | ||||||||||||||||||||||||
| parser=parser, | ||||||||||||||||||||||||
| loadable_variables=<all_the_variable_names>, | ||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||
| ``` | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| ## How is the parser called internally? | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| The parser is passed to `open_virtual_dataset`, and immediately called on the file URL to produce a `ManifestStore` instance. | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| The `ManifestStore` is then converted to the xarray data model using `ManifestStore.to_virtual_dataset()`, which loads `loadable_variables` by reading from the `ManifestStore` using `xr.open_zarr`. | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| This virtual dataset object is then returned to the user, so `open_virtual_dataset` is really a very thin wrapper around the parser and object store you pass in. | ||||||||||||||||||||||||
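The flow just described can be sketched with stand-in classes. Nothing below is VirtualiZarr code: `FakeManifestStore`, `fake_parser`, and `open_virtual_dataset_sketch` are hypothetical names that only mirror the order of operations (call the parser, then convert the result).

```python
from dataclasses import dataclass, field


@dataclass
class FakeManifestStore:
    """Hypothetical stand-in for ManifestStore: holds variable names only."""

    variables: list[str] = field(default_factory=list)

    def to_virtual_dataset(self, loadable_variables=()):
        # mark each variable as loaded or virtual, mimicking the conversion step
        return {
            name: ("loaded" if name in loadable_variables else "virtual")
            for name in self.variables
        }


def fake_parser(file_url: str, object_store: object) -> FakeManifestStore:
    # a real parser would read the file through the object store here
    return FakeManifestStore(variables=["temperature", "time"])


def open_virtual_dataset_sketch(file_url, object_store, parser, loadable_variables=()):
    # step 1: the parser is called immediately on the file URL
    manifest_store = parser(file_url, object_store)
    # step 2: the store is converted to the dataset representation
    return manifest_store.to_virtual_dataset(loadable_variables=loadable_variables)


vds = open_virtual_dataset_sketch(
    "s3://bucket/file.nc", object(), fake_parser, loadable_variables=["time"]
)
```

The wrapper itself contains no format-specific logic, which is the point of the design: everything that knows about the file format lives in the parser.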
|
|
||||||||||||||||||||||||
| ## Parser-specific keyword arguments | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| The `Parser` callable does not accept arbitrary optional keyword arguments. | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| However, extra information is often needed to fully map the archival format to the Zarr data model, for example if the format does not include array names or dimension names. | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| To pass arbitrary extra information to your parser callable, it is recommended that you bind that information to instance attributes of a parser class (or use `functools.partial`), e.g. | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| ```python | ||||||||||||||||||||||||
| class CustomParser: | ||||||||||||||||||||||||
| def __init__(self, **kwargs) -> None: | ||||||||||||||||||||||||
| self.kwargs = kwargs | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| def __call__(self, file_url: str, object_store: ObjectStore) -> ManifestStore: | ||||||||||||||||||||||||
| # access the file's contents, e.g. using the ObjectStore instance | ||||||||||||||||||||||||
| readable_file = obstore.open_reader(object_store, file_url) | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| # parse the file contents to extract its metadata | ||||||||||||||||||||||||
| # this is generally where the format-specific logic lives | ||||||||||||||||||||||||
| manifestgroup: ManifestGroup = extract_metadata(readable_file, **self.kwargs) | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| # construct the ManifestStore from the parsed metadata | ||||||||||||||||||||||||
| return ManifestStore(...) | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| vds = vz.open_virtual_dataset( | ||||||||||||||||||||||||
| file_url, | ||||||||||||||||||||||||
| object_store=object_store, | ||||||||||||||||||||||||
| parser=CustomParser(**kwargs), | ||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
| ``` | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| This helps to keep format-specific parser configuration separate from kwargs to `open_virtual_dataset`. | ||||||||||||||||||||||||
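The `functools.partial` alternative mentioned above could look roughly like the following sketch. `parse_raw_binary` and its `dim_names` and `array_name` options are hypothetical, and the function returns a plain dict rather than a real `ManifestStore` so that the example runs with the standard library alone. Whether a `partial` object passes VirtualiZarr's runtime protocol check is not verified here.

```python
from functools import partial


def parse_raw_binary(file_url, object_store, *, dim_names, array_name="data"):
    """Hypothetical parser body; returns a plain dict instead of a ManifestStore."""
    return {"url": file_url, "array": array_name, "dims": tuple(dim_names)}


# bind the format-specific options up front, as keyword arguments
parser = partial(parse_raw_binary, dim_names=("time", "lat", "lon"))

# the bound callable now takes only the two positional arguments
result = parser("s3://bucket/file.bin", object())
```

Making the extra options keyword-only keeps them from being confused with the two positional arguments that every parser receives.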
|
|
||||||||||||||||||||||||
| ## How to write your own custom parser | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| As long as your custom parser callable follows the interface above, you can implement it in any way you like. | ||||||||||||||||||||||||
| However, there are a few common approaches. | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| ### Typical VirtualiZarr parsers | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| The recommended way to implement a custom parser is simply to parse the given file yourself, and construct the `ManifestStore` object explicitly component by component, extracting the metadata that you need. | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| Generally you want to follow steps like this: | ||||||||||||||||||||||||
| 1. Extract file header or magic bytes to confirm the file passed is the format your parser expects. | ||||||||||||||||||||||||
| 2. Read metadata to determine how many arrays there are in the file, their shapes, chunk shapes, dimensions, codecs, and other metadata. | ||||||||||||||||||||||||
| 3. For each array in the file: | ||||||||||||||||||||||||
|     1. Create a `zarr.core.metadata.ArrayV3Metadata` object to hold that metadata, including dimension names. At this point you may have to define new Zarr codecs to support deserializing your data (though hopefully the standard Zarr codecs are sufficient). | ||||||||||||||||||||||||
|     2. Extract the byte ranges of each chunk and store them alongside the fully-qualified filepath in a `ChunkManifest` object. | ||||||||||||||||||||||||
|     3. Create one `ManifestArray` object, using the corresponding `ArrayV3Metadata` and `ChunkManifest` objects. | ||||||||||||||||||||||||
| 4. Group `ManifestArrays` into one or more `ManifestGroup` objects. Ideally you would only have one group, but your format's data model may preclude that. If there is group-level metadata, attach this to the `ManifestGroup` object as a `zarr.metadata.GroupMetadata` object. Remember that `ManifestGroups` can contain other groups as well as arrays. | ||||||||||||||||||||||||
| 5. Instantiate the final `ManifestStore` using the top-most `ManifestGroup` and return it. | ||||||||||||||||||||||||
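As a rough illustration of the first step and of the byte-range extraction, here is a stdlib-only sketch for an imaginary format. The `RAW1` signature, the 16-byte header, and the assumption that chunks are uncompressed, full-size, and stored back to back are all hypothetical. The resulting dict uses a path/offset/length layout per chunk key; check the `ChunkManifest` documentation before passing anything like it to the real class.

```python
import itertools
import math

MAGIC = b"RAW1"  # hypothetical 4-byte signature of an imaginary format
HEADER_SIZE = 16  # hypothetical fixed header length in bytes


def check_magic(header: bytes) -> None:
    # confirm the file is the format this parser expects
    if header[: len(MAGIC)] != MAGIC:
        raise ValueError("not a RAW1 file")


def chunk_entries(path, shape, chunk_shape, itemsize):
    """Byte ranges for uncompressed, full-size chunks stored back to back."""
    grid = [math.ceil(s / c) for s, c in zip(shape, chunk_shape)]
    chunk_nbytes = math.prod(chunk_shape) * itemsize
    entries = {}
    # iterate over chunk grid indices in row-major order
    for n, index in enumerate(itertools.product(*(range(g) for g in grid))):
        key = ".".join(str(i) for i in index)
        entries[key] = {
            "path": path,
            "offset": HEADER_SIZE + n * chunk_nbytes,
            "length": chunk_nbytes,
        }
    return entries


check_magic(b"RAW1" + bytes(12))
entries = chunk_entries(
    "s3://bucket/file.raw", shape=(4, 4), chunk_shape=(2, 2), itemsize=4
)
```

Real formats usually record chunk offsets in their own metadata rather than allowing them to be computed, so in practice this function would be replaced by format-specific parsing.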
|
||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| ```{note} | ||||||||||||||||||||||||
| The [regular chunk grid](https://github.com/zarr-developers/zarr-specs/blob/main/docs/v3/chunk-grids/regular-grid/index.rst) for Zarr V3 data expects that chunks at the border of an array always have the full chunk size, even when the array only covers parts of it. For example, having an array with ``"shape": [30, 30]`` and ``"chunk_shape": [16, 16]``, the chunk ``0,1`` would also contain unused values for the indices ``0-16, 30-31``. If the file format that you are virtualizing does not fill in partial chunks, it is recommended that you raise a `ValueError` until Zarr supports [variable chunk sizes](https://github.com/orgs/zarr-developers/discussions/52). | ||||||||||||||||||||||||
| ``` | ||||||||||||||||||||||||
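A parser could guard against the partial-chunk case with a check along these lines. This is a minimal sketch: the function names and the `pads_edge_chunks` flag are hypothetical, and how you learn whether a format pads its edge chunks depends entirely on the format.

```python
def has_partial_edge_chunks(shape, chunk_shape):
    """True if any dimension's last chunk is only partly covered by the array."""
    return any(s % c != 0 for s, c in zip(shape, chunk_shape))


def validate_chunk_grid(shape, chunk_shape, pads_edge_chunks):
    # pads_edge_chunks: hypothetical flag saying whether the source format
    # stores edge chunks at full chunk size
    if has_partial_edge_chunks(shape, chunk_shape) and not pads_edge_chunks:
        raise ValueError(
            f"shape {shape} is not a multiple of chunk shape {chunk_shape} "
            "and the format stores truncated edge chunks"
        )


# an evenly divisible array is accepted
validate_chunk_grid((32, 32), (16, 16), pads_edge_chunks=False)

# the example from the note above is rejected when edge chunks are truncated
try:
    validate_chunk_grid((30, 30), (16, 16), pads_edge_chunks=False)
    raised = False
except ValueError:
    raised = True
```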
|
|
||||||||||||||||||||||||
| ### Parsing a pre-existing index file | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| A custom parser can parse multiple files, perhaps by passing a glob string and looking for expected file naming conventions, or by passing additional parser-specific keyword arguments. | ||||||||||||||||||||||||
| This can be useful for reading file formats which include some kind of additional "index" sidecar file, but don't have all the information necessary to construct the entire `ManifestStore` object from the sidecar file alone. | ||||||||||||||||||||||||
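One way such a multi-file parser might be organised is sketched below. Everything here is hypothetical: the `.idx` naming convention, the class name, and the use of a plain dict as a stand-in for an object store. A real parser would merge both sources into a `ManifestStore` rather than returning a dict.

```python
class SidecarIndexParser:
    """Hypothetical parser that pairs each data file with an index sidecar."""

    def __init__(self, index_suffix: str = ".idx") -> None:
        self.index_suffix = index_suffix

    def index_url(self, file_url: str) -> str:
        # naming convention assumed here: "<data file url><suffix>"
        return file_url + self.index_suffix

    def __call__(self, file_url: str, object_store) -> dict:
        # object_store is any mapping-like stand-in with a .get(url) method
        data_header = object_store.get(file_url)
        index = object_store.get(self.index_url(file_url))
        if data_header is None or index is None:
            raise FileNotFoundError(f"missing data or index for {file_url}")
        # a real parser would merge both sources into a ManifestStore
        return {"header": data_header, "index": index}


fake_store = {"mem://a.dat": b"HDR", "mem://a.dat.idx": b"0,16;16,16"}
result = SidecarIndexParser()("mem://a.dat", fake_store)
```

The sidecar naming convention is bound to the parser instance, so the call signature stays the same as for any other parser.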
|
|
||||||||||||||||||||||||
| ```{note} | ||||||||||||||||||||||||
| If you do have some type of custom sidecar metadata file which contains all the information necessary to create the `ManifestStore`, then you should just create a custom parser for that metadata file format instead! | ||||||||||||||||||||||||
| Examples of this approach which come packaged with VirtualiZarr are the `DMRPPparser` and the `KerchunkJSONparser`. | ||||||||||||||||||||||||
| ``` | ||||||||||||||||||||||||
nit: add blank line
| Whilst this might be the quickest way to get a custom parser working, we do not really recommend this approach, as: |
| Nevertheless, this approach is currently used by VirtualiZarr internally, at least for the FITS, netCDF3, and (now-deprecated original implementation of the) HDF5 file format parsers. |