-
Notifications
You must be signed in to change notification settings - Fork 54
Replace VirtualiZarr.ZArray with zarr ArrayMetadata #175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace VirtualiZarr.ZArray with zarr ArrayMetadata #175
Conversation
|
Just wanted to say that the Zarr V3 class behaviors are not set in stone yet. If you have feedback on the V3 API and potential incompatibilities between V2, please report them issues on Zarr-python: https://github.com/zarr-developers/zarr-python/issues |
|
My sense is that in terms of file format both zarr v2 and zarr v3 should be supported but in terms of library dependencies it would be fine to mandate that only zarr-python >= 3 is supported. |
|
Thanks for taking the initiative here @ghidalgo3 !
I was actually hoping we could use some class from
I see two ways to do this:
I'm not sure which of these two is the best approach. I also completely agree with @jsignell - all of this should be supported through only a dependency on |
|
To minimize the code changes needed in VirtualiZarr, I'll attempt the first option. |
|
Adding a dependency on See the reorg that happened here about 3 months ago for ZarrV3: zarr-developers/zarr-python#1809 and some attempts in Kerchunk to support ZarrV3: Maybe the |
|
@ghidalgo3 please please open a Zarr issue to track any API incompatibilities between V2 and V3. 🙏 |
|
@ghidalgo3 it might be a good idea to have a separate first PR that makes virtualizarr pass every test using zarr v3, and then coming back to this PR once that compatibility can be relied upon. Otherwise this might become a rabbit hole. |
That class is public API, but it might be changing soon, and while it is intended for this use case it also has some extra stuff that goes beyond "representing metadata". So it should do what you want, eventually :) |
TomNicholas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work so far @ghidalgo3 !
virtualizarr/manifests/array.py
Outdated
| """ | ||
| Individual chunk size by number of elements. | ||
| """ | ||
| if isinstance(self._zarray.chunk_grid, RegularChunkGrid): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a nice check, but I think we could actually perform this check at ManifestArray construction time, because right now a lot of other things will break if the chunk grid is not regular.
| match zarray: | ||
| case ArrayV2Metadata(compressor=compressor, filters=filters): | ||
| return Codec(compressor=compressor, filters=filters) | ||
| case ArrayV3Metadata(codecs=codecs): | ||
| return codecs | ||
| case _: | ||
| raise ValueError("Unknown ArrayMetadata type") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to prefer this match...case over a standard if isinstance()...else syntax?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mainly a style preference coming from languages with discriminated unions and compilers, but if you'd like the whole code base to be consistent I can rewrite it with isinstance.
| zarray = ArrayV2Metadata( | ||
| chunks=chunks, | ||
| compressor="zlib", | ||
| compressor={"id": "zlib"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the presence of "id" a v2 vs v3 difference? because you don't seem to use it in the other tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In V3, the codecs array is made up of (name, configuration) objects, but in V2 the id field serves as the name. In this particular test, I just chose to convert the ZArray to an ArrayV2Metadata
Actually please ignore any of the changes in tests for now because I will need to edit almost every test to excise ZArray :) The few changes I added were done up until I hit the kerchunk module import issue
| class Codec(BaseModel): | ||
| compressor: str | None = None | ||
| """ | ||
| ZarrayV2 codec definition. | ||
| """ | ||
|
|
||
| compressor: str | dict[str, Any] | None = None | ||
| """ | ||
| If it's a string, it's the compressor ID. | ||
| If it's a dict, it's the full compressor configuration. | ||
| """ | ||
| filters: list[dict] | None = None | ||
|
|
||
| def __repr__(self) -> str: | ||
| return f"Codec(compressor={self.compressor}, filters={self.filters})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this class be moved upstream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, because ZarrV3 now has a notion of codecs which is different from VirtualiZarr's notion of codecs. A VirtualiZarr codec is really the combination of filters and compressor from V2 metadata, which are easily transformable to V3 codecs.
Maybe a zarr.ArrayV2Metadata can have a property that projects the compressor and filters into V3 Codecs, @d-v-b thoughts on that? In that case, Codec could even be a property of the ABC ArrayMetadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually think that codecs: ListWithInternalStructure[Codec] (i.e., what zarr v3 metadata uses) is not the best interface. See a writeup on that point of view here.
But if we ignore that concern for a moment, I'm curious about @TomNicholas's question: what could we change in zarr-python so that you could just import a functionally equivalent class from there (or import a dataclass and wrap with pydantic)? I have a selfish interest in this question since I'm likely going to follow in your footsteps when I implement better codec support over in pydantic-zarr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for forcing me to read the Zarr code more closely! I found VirtualiZarr could probably just use create_codec_pipeline to turn V2 or V3 metadata into a BatchedCodecPipeline and then VirtualiZarr can just deal with CodecPipeline as the abstraction over V2 and V3 . Additionally, VirtualiZarr could change its Codec to use V2Filters and V2Compressor from Zarr, or I might get rid of it entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does Virtualizarr do with the codecs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All virtualizarr currently needs to do with the codecs is:
- Check that any two arrays you ask it to concatenate have the same codecs (see Virtual concatenation of arrays with different codecs or dtypes #5)
- Make sure the codecs are recorded in the kerchunk references when they get written out to disk.
|
This was completed in #429 by @abarciauskas-bgse ! |
Before going further with these changes, I'd like to discuss if this direction is correct or not. Based on #17, I tried using Zarr-V3's Array definition instead of VirtualiZarr's
ZArraydefinition. Keeping the names straight is hard,ZArrayis the current code andArrayiszarr's type.The problems I ran into were:
kwargsforArrayare different between v2 and v3 for zarr, butVirtualiZarr.ZArrayI believe is first created with the v2 properties and then transformed to v3 format whenzarr_v3_array_metadatais called. After the replacement, anywhere VirtualiZarr creates aArray, one must choose v2 or v3kwargs, or again introduce something that abstracts over the versions. Unless VirtualiZarr chooses to only support Zarr V3 only, which I believe is the intention.zarr.array.metadatabecause the internal zarr property names have changed. Probably there will be other code-level incompatibilities from metadata property name differences, I think Is thecompressortype right? #94 ran into this too.This is just from a few hours of trying this out.
If VirtualiZarr needs to support both Zarr V2 and V3 stores, then this change needs to handle all the little differences in Zarr's API between 2 and 3. But if only V3 is supported, this becomes easier and replacing
ZArraywithArrayshould not be too difficult.Thoughts?
Incompatibilities:
To be filled later:
docs/releases.rstapi.rst