Skip to content

Conversation

@valeriupredoi
Copy link
Contributor

@valeriupredoi valeriupredoi commented Jul 24, 2025

Description

This is the back-end for Zarr support ie everything going on in our IO module in load; I think this is ready now, and we should merge before things need to happen in the front-end ie via recipe and config but a lot of that will have to deal with Intake catalogs anyway, though some bits can just be run by pointing to an endpoint URL and S3 bucket.

There are a couple issues I found that involve performance (via ncdata, see below), and I will open a dedicated issue related to Zarr performance in ESMValCore.

Things TODO

valeriu@valeriu-PORTEGE-Z30-C:~$ minio-binaries/mc mb bryan/esmvaltool-zarr
Bucket created successfully `bryan/esmvaltool-zarr`.

Full instructions in comment below ⬇️ #2785 (comment)

  • once that's set up (am currently asking CEDA), we should pop a subsample of netCDF4 files that are also Zarrs (identical metadata and data, different file formats of course) so we can test, at least in a case of a faily uncluttered filesystem, how they both compare
  • [ ] usual stuff: documentation etc I think it's best we write docs in the PR that contains the front-facing API - so far this is all in load and it's a bit more hidden

Related to #2584

Link to documentation: TBA


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number pull requests:

@valeriupredoi valeriupredoi added enhancement New feature or request testing labels Jul 24, 2025
@codecov
Copy link

codecov bot commented Jul 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.43%. Comparing base (05f8e4d) to head (66f9811).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2785   +/-   ##
=======================================
  Coverage   95.42%   95.43%           
=======================================
  Files         260      260           
  Lines       15426    15449   +23     
=======================================
+ Hits        14720    14743   +23     
  Misses        706      706           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@valeriupredoi
Copy link
Contributor Author

valeriupredoi commented Jul 24, 2025

there appears to be an issue with aiohttp and its minion dependencies when they come from PyPI - the conda-froge installed packages work like a charm, but the run_tests CircleCI test times out (no matter how many times it gets rerun), those deps are wheels from PyPI. I was blaming CEDA object storage before I figured this out all fine now, same aiohttp from PyPI - probably network issues

Copy link
Contributor

@schlunma schlunma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks V, great to see this moving forward! 🚀 I got some very specific comments on the code already.

In general, I think we need to make sure that this aligns well with #2765.

Also, have you tried loading Zarr files within a recipe?

@valeriupredoi
Copy link
Contributor Author

hi Manu @schlunma many thanks for dropping by 🍺

In general, I think we need to make sure that this aligns well with #2765.

Indeed, this is the key PR since Zarr ingestion in esmvaltool will not work without an intake catalog parser - what @bouweandela is doing in that PR is he is generalizing a Data object that comes in - and it's great because we need such an object to tackle not only Zarr but also other file types - and not all those need to be "downloaded" in the true data transfer sense 🍻

@valeriupredoi
Copy link
Contributor Author

Permanent test bucket on CEDA

I have created a permanent S3 bucket where we can pop Zarr files to be used for our tests:

mc cp -r pier/sim-data/dev/v5/glm.n2560_RAL3p3/um.PT1H.hp_z2.zarr .
mc cp -r um.PT1H.hp_z2.zarr/ bryan/esmvaltool-zarr

@valeriupredoi
Copy link
Contributor Author

@schlunma if you still around: very many thanks for an excellent review, mate, well appreciated, and I believe I addressed everything, bar:

_load_zarr loads a file, right? In that case, I would put all the logic into _load_from_file. In there, we already distinguish between GRIB and netCDF files, so it would be very fitting to also include zarr there.

Here's my thoughts about this (after I took a 5min break from intense committing): Zarr is not really a file - it's a store (a directory in POSIX parlance, but it's not really a directory either when you think Object Storage), in reality it's an object more than it is a file type; with this PR we're escaping POSIX realms and we are getting into object storage, S3 files will follow next so I think we'll need a separate class for object store "files", for now I think the way it's handled now makes the correct distinction, and buring it in _load_from_file will not work in the long run -> that func is all POSIX. What do you reckon?

Comment on lines 186 to 188
if not zarr2 and not zarr3:
msg = f"File '{file}' can not be open as Zarr file at the moment."
raise ValueError(msg) from None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right, but what happens if a non-Zarr2/non-Zarr3 is passed to xr.open_dataset. I would hope that this raises an error. In that case, I think we can remove all the code here that is just there to raise error?

Comment on lines 165 to 170
if isinstance(file, Path):
zarr_xr = xr.open_dataset(
file,
consolidated=False,
engine="zarr",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I cannot find the test. Would you point me to it? 😅

@schlunma
Copy link
Contributor

Here's my thoughts about this (after I took a 5min break from intense committing): Zarr is not really a file - it's a store (a directory in POSIX parlance, but it's not really a directory either when you think Object Storage), in reality it's an object more than it is a file type; with this PR we're escaping POSIX realms and we are getting into object storage, S3 files will follow next so I think we'll need a separate class for object store "files", for now I think the way it's handled now makes the correct distinction, and buring it in _load_from_file will not work in the long run -> that func is all POSIX. What do you reckon?

Ok, fair point. Would it make sense then to incorporate the name posix somehow into the _load_from_file function?

Copy link
Contributor

@schlunma schlunma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of comments to make the code clearer (I think). Feel free to ignore 😉

@schlunma
Copy link
Contributor

schlunma commented Jul 31, 2025

Regarding ruff: It might make sense to log the error in the form of a debug message?

@valeriupredoi
Copy link
Contributor Author

Regarding ruff: It might make sense to log the error in the form of a debug message?

I really don't want to overload debugs with stuff that are non important - what is important is that file can't be open as zarr, here I reformatted the exceptions block 71ebe4e

@valeriupredoi
Copy link
Contributor Author

valeriupredoi commented Jul 31, 2025

another logic and puprose for that test, and why I don't want debugs and other such things is that, in the future, that could (and will be in certain conditions) an S3 file that don't have to be Zarr, it could be netCDF4, and in that case, we'll jump over the Zarr load and go straight to netCDF4.Dataset or whatever means we decide to load network netCDF4 files (Pyfive 🤩 )

@valeriupredoi
Copy link
Contributor Author

BTW here's an example of a fairly short stacktrace you get when you can't access the file, via full stack (when you don't perform the poke via fsspec) - ugly! pp-mo/ncdata#139 (comment)

@schlunma schlunma added this to the v2.13.0 milestone Jul 31, 2025
Copy link
Contributor

@schlunma schlunma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks V! One final comment to make the test shorter! Cheers 🚀

@valeriupredoi
Copy link
Contributor Author

Thanks V! One final comment to make the test shorter! Cheers 🚀

Thanks, Manu! Very cool, popped it in 🍻

@valeriupredoi valeriupredoi merged commit e508bec into main Jul 31, 2025
6 checks passed
@valeriupredoi valeriupredoi deleted the zarr_support branch July 31, 2025 16:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants