Zarr support (backend, in esmvalcore.preprocessor._io.py)
#2785
Codecov Report
✅ All modified and coverable lines are covered by tests.

    @@           Coverage Diff           @@
    ##             main    #2785   +/-   ##
    =======================================
      Coverage   95.42%   95.43%
    =======================================
      Files         260      260
      Lines       15426    15449   +23
    =======================================
    + Hits        14720    14743   +23
      Misses        706      706
Thanks V, great to see this moving forward! 🚀 I got some very specific comments on the code already.
In general, I think we need to make sure that this aligns well with #2765.
Also, have you tried loading Zarr files within a recipe?
Hi Manu @schlunma, many thanks for dropping by 🍺
Indeed, this is the key PR, since Zarr ingestion in esmvaltool will not work without an intake catalog parser. What @bouweandela is doing in that PR is generalizing the Data object that comes in, which is great because we need such an object to tackle not only Zarr but also other file types, and not all of those need to be "downloaded" in the true data-transfer sense 🍻
Permanent test bucket on CEDA
I have created a permanent S3 bucket where we can pop Zarr files to be used for our tests:
@schlunma if you're still around: very many thanks for an excellent review, mate, much appreciated. I believe I addressed everything, bar:
Here are my thoughts about this (after I took a 5-minute break from intense committing): Zarr is not really a file, it's a store (a directory in POSIX parlance, though not really a directory either once you think in terms of object storage); in reality it's more an object than a file type. With this PR we're leaving the POSIX realm and getting into object storage, and S3 files will follow next, so I think we'll need a separate class for object-store "files". For now I think the way it's handled makes the correct distinction, and burying it in …
esmvalcore/preprocessor/_io.py
    if not zarr2 and not zarr3:
        msg = f"File '{file}' can not be open as Zarr file at the moment."
        raise ValueError(msg) from None
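For context on the zarr2/zarr3 flags in the snippet above: a Zarr v2 store carries `.zgroup` or `.zarray` metadata objects at its root (plus `.zmetadata` when consolidated), while a Zarr v3 store carries `zarr.json`. A minimal sketch of such a check for a local directory store, using only the standard library, could look like the following. The helper name `detect_zarr_format` is hypothetical, and this is an illustration rather than the code in this PR, which also has to deal with remote stores.

```python
from pathlib import Path

# Metadata object names defined by the Zarr v2 and v3 specifications.
ZARR2_MARKERS = (".zgroup", ".zarray", ".zmetadata")
ZARR3_MARKERS = ("zarr.json",)


def detect_zarr_format(store: Path) -> int:
    """Return 2 or 3 for a local Zarr store, else raise ValueError.

    Hypothetical helper for illustration only; it inspects the root of
    a directory store and does not open or validate any metadata.
    """
    store = Path(store)
    zarr2 = any((store / name).is_file() for name in ZARR2_MARKERS)
    zarr3 = any((store / name).is_file() for name in ZARR3_MARKERS)
    if not zarr2 and not zarr3:
        msg = f"File '{store}' cannot be opened as a Zarr store."
        raise ValueError(msg)
    # Prefer v3 when both kinds of marker are present.
    return 3 if zarr3 else 2
```

Calling `detect_zarr_format(Path("some_store.zarr"))` on a directory without any of these markers raises `ValueError`, which mirrors the early, short error message the snippet above aims for.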
All right, but what happens if a non-Zarr2/non-Zarr3 file is passed to xr.open_dataset? I would hope that this raises an error. In that case, I think we can remove all the code here that is just there to raise an error?
esmvalcore/preprocessor/_io.py
    if isinstance(file, Path):
        zarr_xr = xr.open_dataset(
            file,
            consolidated=False,
            engine="zarr",
        )
Sorry, I cannot find the test. Would you point me to it? 😅
Ok, fair point. Would it make sense then to incorporate the name …
Co-authored-by: Manuel Schlund <[email protected]>
A couple of comments to make the code clearer (I think). Feel free to ignore 😉
Regarding ruff: It might make sense to log the error in the form of a debug message?
I really don't want to overload the debug log with stuff that is not important. What is important is that the file can't be opened as Zarr; I reformatted the exceptions block here: 71ebe4e
Another rationale and purpose for that test, and the reason I don't want debug messages and the like, is that in the future the input could be (and in certain conditions will be) an S3 file that doesn't have to be Zarr. It could be netCDF4, and in that case we'll jump over the Zarr load and go straight to …
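The control flow described here, where a failed Zarr open is not worth a log entry but is simply the signal to try the next reader, can be sketched as below. All names (`load_first_supported`, `open_as_zarr`, `open_as_netcdf`) are hypothetical stand-ins rather than ESMValCore functions, and the toy readers only inspect the suffix instead of opening anything.

```python
from collections.abc import Callable, Sequence


def open_as_zarr(target: str) -> str:
    """Toy stand-in for a Zarr reader (hypothetical)."""
    if not target.endswith(".zarr"):
        msg = f"File '{target}' cannot be opened as Zarr."
        raise ValueError(msg)
    return f"zarr:{target}"


def open_as_netcdf(target: str) -> str:
    """Toy stand-in for a netCDF4 reader (hypothetical)."""
    if not target.endswith(".nc"):
        msg = f"File '{target}' cannot be opened as netCDF4."
        raise ValueError(msg)
    return f"netcdf:{target}"


def load_first_supported(
    target: str,
    readers: Sequence[Callable[[str], str]],
) -> str:
    """Try each reader in order and return the first successful result.

    A ValueError from one reader is not logged; it only means "not my
    format", so the next reader gets a turn. If every reader fails, a
    single short error summarises why.
    """
    errors = []
    for reader in readers:
        try:
            return reader(target)
        except ValueError as exc:
            errors.append(str(exc))
    msg = f"No reader could open '{target}': " + "; ".join(errors)
    raise ValueError(msg)
```

With `readers=[open_as_zarr, open_as_netcdf]`, a target ending in `.nc` skips the Zarr reader quietly and is handled by the netCDF4 stand-in, while an unsupported target produces one compact error instead of a long stack trace.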
BTW, here's an example of a fairly short stack trace you get when you can't access the file, via the full stack (when you don't perform the poke via fsspec) - ugly! pp-mo/ncdata#139 (comment)
Thanks V! One final comment to make the test shorter! Cheers 🚀
Thanks, Manu! Very cool, popped it in 🍻
Description
This is the back-end for Zarr support, i.e. everything going on in our IO module in load. I think this is ready now, and we should merge before things need to happen in the front-end, i.e. via recipe and config; a lot of that will have to deal with Intake catalogs anyway, though some bits can just be run by pointing to an endpoint URL and S3 bucket.
There are a couple of issues I found that involve performance (via ncdata, see below), and I will open a dedicated issue related to Zarr performance in ESMValCore.
Things TODO
[ ] understand the issue with aiohttp from PyPI (#2785 (comment)): that is a specific issue related to the test file um.PT1H.hp_z2.zarr, which hangs (sometimes, quite a few times) from TWO storage units, i.e. hackathon and uor-aces, while running ONLY the CircleCI tests. It could be a problem with loading the Zarr store, or it could be a problem converting the Zarr store to an Iris cube; I'll investigate that aspect -> ncdata conversion, see "How to improve performance: ncdata conversion from Xarray(zarr) to Cube is not ideal" (pp-mo/ncdata#139)
consolidated=True (pp-mo/ncdata#138): for now we should crack on; it's not a show stopper, but it is an annoyance I had to look into
ncdata converts a Zarr file to an Iris cube
this is done on CEDA: https://uor-aces-o.s3-ext.jc.rl.ac.uk (shortcut bryan). I have created an S3 bucket called esmvaltool-zarr. Full instructions in comment below ⬇️ #2785 (comment)
[ ] usual stuff: documentation etc. I think it's best we write docs in the PR that contains the front-facing API; so far this is all in load and it's a bit more hidden
Related to #2584
Link to documentation: TBA
Before you get started
Checklist
It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.
To help with the number of pull requests: