Zarr support (backend, in `esmvalcore.preprocessor._io.py`) #2785

valeriupredoi · 2025-07-24T13:12:51Z

Description

This is the back-end for Zarr support, i.e. everything going on in load in our IO module. I think this is ready now, and we should merge it before things need to happen in the front-end (i.e. via recipe and config). A lot of that will have to deal with Intake catalogs anyway, though some bits can just be run by pointing to an endpoint URL and an S3 bucket.

There are a couple of issues I found that involve performance (via ncdata, see below), and I will open a dedicated issue related to Zarr performance in ESMValCore.

Things TODO

Test! Get us a bunch of real-world Zarr data and play with it. So far:
- Zarr2 test file
- Zarr3 test file
- Zarr3 model data but with issues (see below)
- CMIP6 data
~~[ ] understand the issue with aiohttp from PyPI (see #2785 (comment))~~ That is a specific issue related to the test file um.PT1H.hp_z2.zarr, which hangs (sometimes, quite a few times) from TWO storage units, i.e. hackathon and uor-aces, while running ONLY the CircleCI tests. It could be a problem with loading the Zarr store, or it could be a problem with converting the Zarr store to an Iris cube; I'll investigate that aspect ->
there appears to be an inherent performance issue with the ncdata conversion, see "How to improve performance: ncdata conversion from Xarray(zarr) to Cube is not ideal" (pp-mo/ncdata#139)
Issue opened at ncdata about that: "Indefinite hang when consolidated=True" (pp-mo/ncdata#138). For now we should crack on; it's not a show stopper, but it is an annoyance I had to look into
address what @schlunma points out in the first review below
determine a simple but effective method to check if the Zarr file exists at remote (this is probably best addressed in "Add an interface for adding new data sources and add support for intake-esgf as a first example" #2765)
check chunking when ncdata converts a Zarr file to an Iris cube
set up an S3 object store where we put small but useful/relevant Zarr files for our tests:
this is done on CEDA: https://uor-aces-o.s3-ext.jc.rl.ac.uk (shortcut bryan). I have created an S3 bucket called esmvaltool-zarr:

valeriu@valeriu-PORTEGE-Z30-C:~$ minio-binaries/mc mb bryan/esmvaltool-zarr
Bucket created successfully `bryan/esmvaltool-zarr`.

Full instructions in comment below ⬇️ #2785 (comment)

once that's set up (am currently asking CEDA), we should pop in a subsample of netCDF4 files that are also available as Zarr stores (identical metadata and data, different file formats of course) so we can test, at least in the case of a fairly uncluttered filesystem, how the two compare
~~[ ] usual stuff: documentation etc~~ I think it's best we write docs in the PR that contains the front-facing API - so far this is all in load and it's a bit more hidden
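
For the "check if the Zarr file exists at remote" item above, one possible approach is to probe for the store's root metadata object before handing the URL to xarray. The sketch below is an assumption, not the method used in this PR: `candidate_metadata_urls`, `http_head_ok` and `remote_zarr_exists` are hypothetical names. It relies only on the Zarr convention that a v3 store keeps `zarr.json` at its root, while a v2 group keeps `.zgroup` (plus `.zmetadata` when consolidated).

```python
from typing import Callable
from urllib.request import Request, urlopen

# Root metadata keys: zarr.json (Zarr v3); .zmetadata and .zgroup (Zarr v2).
ROOT_METADATA_KEYS = ("zarr.json", ".zmetadata", ".zgroup")


def candidate_metadata_urls(store_url: str) -> list[str]:
    """Return the URLs of the root metadata objects a Zarr store may hold."""
    base = store_url.rstrip("/")
    return [f"{base}/{key}" for key in ROOT_METADATA_KEYS]


def http_head_ok(url: str, timeout: float = 10.0) -> bool:
    """Return True if an HTTP HEAD request to ``url`` gets a 2xx answer."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError):
        # URLError/HTTPError are OSError subclasses; ValueError covers
        # malformed URLs. Either way the object is treated as absent.
        return False


def remote_zarr_exists(
    store_url: str,
    exists: Callable[[str], bool] = http_head_ok,
) -> bool:
    """Return True if any root metadata object of the store is reachable.

    ``exists`` is injectable so the check can be tested without a network.
    """
    return any(exists(url) for url in candidate_metadata_urls(store_url))
```

Failing early on a missing metadata object keeps the error message short, instead of surfacing a long stack trace from deeper in the loading chain.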

Related to #2584

Link to documentation: TBA

Before you get started

☝ Create an issue to discuss what you are going to do

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.

🧪 The new functionality is relevant and scientifically sound
🛠 This pull request has a descriptive title and labels
🛠 Code is written according to the code quality guidelines
🧪 and 🛠 Documentation is available
🛠 Unit tests have been added
🛠 Changes are backward compatible
🛠 Any changed dependencies have been added or removed correctly
🛠 The list of authors is up to date
🛠 All checks below this pull request were successful

To help with the number of pull requests:

🙏 We kindly ask you to review two other open pull requests in this repository

codecov · 2025-07-24T13:27:17Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.43%. Comparing base (05f8e4d) to head (66f9811).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2785   +/-   ##
=======================================
  Coverage   95.42%   95.43%           
=======================================
  Files         260      260           
  Lines       15426    15449   +23     
=======================================
+ Hits        14720    14743   +23     
  Misses        706      706


valeriupredoi · 2025-07-24T19:15:33Z

There appears to be an issue with aiohttp and its minion dependencies when they come from PyPI: the conda-forge installed packages work like a charm, but the run_tests CircleCI test times out (no matter how many times it gets rerun), and those deps are wheels from PyPI. I was blaming CEDA object storage before I figured this out. All fine now, with the same aiohttp from PyPI, so it was probably network issues.

schlunma

Thanks V, great to see this moving forward! 🚀 I got some very specific comments on the code already.

In general, I think we need to make sure that this aligns well with #2765.

Also, have you tried loading Zarr files within a recipe?

esmvalcore/preprocessor/_io.py

valeriupredoi · 2025-07-25T10:22:23Z

hi Manu @schlunma many thanks for dropping by 🍺

In general, I think we need to make sure that this aligns well with #2765.

Indeed, #2765 is the key PR, since Zarr ingestion in esmvaltool will not work without an intake catalog parser. What @bouweandela is doing in that PR is generalizing the Data object that comes in, and it's great because we need such an object to tackle not only Zarr but also other file types, and not all of those need to be "downloaded" in the true data transfer sense 🍻

valeriupredoi · 2025-07-28T13:59:43Z

Permanent test bucket on CEDA

I have created a permanent S3 bucket where we can pop Zarr files to be used for our tests:

the bucket is called esmvaltool-zarr and is located here https://s3-portal.jasmin.ac.uk/object-store/uor-aces-o/buckets/esmvaltool-zarr (you need a valid CEDA user name and to be registered for CEDA Object Storage, I can help with those); instructions on how to use it are here https://help.jasmin.ac.uk/docs/short-term-project-storage/using-the-jasmin-object-store/
operations are done using MinIO and its command line executable mc

mc cp -r pier/sim-data/dev/v5/glm.n2560_RAL3p3/um.PT1H.hp_z2.zarr .
mc cp -r um.PT1H.hp_z2.zarr/ bryan/esmvaltool-zarr
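
Once a store has been copied into the bucket, a test could address it over HTTPS. The sketch below assumes that the bucket is publicly readable and that objects are addressable path-style as `<endpoint>/<bucket>/<store>`; neither is confirmed in this thread. `store_url` and `open_test_store` are hypothetical helpers, and opening the store additionally assumes that xarray, zarr, fsspec and aiohttp are installed.

```python
ENDPOINT = "https://uor-aces-o.s3-ext.jc.rl.ac.uk"
BUCKET = "esmvaltool-zarr"


def store_url(
    store_name: str,
    endpoint: str = ENDPOINT,
    bucket: str = BUCKET,
) -> str:
    """Build a path-style HTTPS URL for a Zarr store in the test bucket."""
    return "/".join([endpoint.rstrip("/"), bucket, store_name.strip("/")])


def open_test_store(store_name: str):
    """Open a store from the test bucket as an xarray Dataset.

    xarray is imported lazily so that building URLs does not require it.
    """
    import xarray as xr

    return xr.open_dataset(
        store_url(store_name),
        consolidated=False,
        engine="zarr",
    )
```

With these defaults, `store_url("um.PT1H.hp_z2.zarr")` points at the store uploaded by the `mc cp` commands above.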

valeriupredoi · 2025-07-31T12:42:05Z

@schlunma if you're still around: very many thanks for an excellent review, mate, well appreciated, and I believe I addressed everything, bar:

_load_zarr loads a file, right? In that case, I would put all the logic into _load_from_file. In there, we already distinguish between GRIB and netCDF files, so it would be very fitting to also include zarr there.

Here are my thoughts on this (after I took a 5min break from intense committing): Zarr is not really a file, it's a store (a directory in POSIX parlance, but not really a directory either when you think of object storage); in reality it's an object more than it is a file type. With this PR we're escaping POSIX realms and getting into object storage; S3 files will follow next, so I think we'll need a separate class for object store "files". For now I think the way it's handled makes the correct distinction, and burying it in _load_from_file will not work in the long run -> that func is all POSIX. What do you reckon?
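
The POSIX-versus-object-store distinction described here can be made explicit with a small dispatch on the URL scheme. This is only an illustration: `is_object_store` is a hypothetical helper, and the set of schemes is an assumption rather than what `_io.py` actually checks.

```python
from pathlib import Path
from urllib.parse import urlparse

# Assumed set of schemes that indicate a remote (non-POSIX) location.
REMOTE_SCHEMES = {"http", "https", "s3"}


def is_object_store(file) -> bool:
    """Return True if ``file`` looks like a remote store, not a POSIX path."""
    if isinstance(file, Path):
        # pathlib objects always describe local filesystem paths.
        return False
    return urlparse(str(file)).scheme in REMOTE_SCHEMES
```

A loader could branch on this once, keeping `_load_from_file` purely POSIX and routing everything else to object-store handling.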

esmvalcore/preprocessor/_io.py

schlunma · 2025-07-31T13:43:18Z

esmvalcore/preprocessor/_io.py

+        if not zarr2 and not zarr3:
+            msg = f"File '{file}' can not be open as Zarr file at the moment."
+            raise ValueError(msg) from None


All right, but what happens if a non-Zarr2/non-Zarr3 file is passed to xr.open_dataset? I would hope that this raises an error. In that case, I think we can remove all the code here that is just there to raise an error?
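
For context on the `zarr2`/`zarr3` flags in the snippet above: how the PR derives them is not shown in this thread. For a local store, one way to tell the formats apart is to look at the root metadata files, as in the sketch below. `detect_zarr_format` and `demo` are hypothetical names; the only facts relied on are that Zarr v3 stores a `zarr.json` at the root, while Zarr v2 uses `.zgroup`, `.zarray` or (when consolidated) `.zmetadata`.

```python
import tempfile
from pathlib import Path
from typing import Optional

ZARR2_ROOT_FILES = (".zgroup", ".zarray", ".zmetadata")


def detect_zarr_format(store) -> Optional[int]:
    """Return 3 or 2 for a local Zarr store, or None if it is not Zarr."""
    store = Path(store)
    if (store / "zarr.json").is_file():
        return 3
    if any((store / name).is_file() for name in ZARR2_ROOT_FILES):
        return 2
    return None


def demo() -> list:
    """Exercise the detection on a throw-away directory."""
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        results.append(detect_zarr_format(root))  # empty directory
        (root / ".zgroup").write_text('{"zarr_format": 2}')
        results.append(detect_zarr_format(root))  # v2 group metadata
        (root / "zarr.json").write_text('{"zarr_format": 3}')
        results.append(detect_zarr_format(root))  # v3 metadata wins
    return results
```

Whether such a pre-check is worth keeping, or whether the error raised by `xr.open_dataset` is clear enough on its own, is exactly the trade-off raised in this comment.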

schlunma · 2025-07-31T13:47:17Z

esmvalcore/preprocessor/_io.py

+    if isinstance(file, Path):
+        zarr_xr = xr.open_dataset(
+            file,
+            consolidated=False,
+            engine="zarr",
+        )


Sorry, I cannot find the test. Would you point me to it? 😅

schlunma · 2025-07-31T13:49:38Z

Here are my thoughts on this (after I took a 5min break from intense committing): Zarr is not really a file, it's a store (a directory in POSIX parlance, but not really a directory either when you think of object storage); in reality it's an object more than it is a file type. With this PR we're escaping POSIX realms and getting into object storage; S3 files will follow next, so I think we'll need a separate class for object store "files". For now I think the way it's handled makes the correct distinction, and burying it in _load_from_file will not work in the long run -> that func is all POSIX. What do you reckon?

Ok, fair point. Would it make sense then to incorporate the name posix somehow into the _load_from_file function?

Co-authored-by: Manuel Schlund <[email protected]>

schlunma

A couple of comments to make the code clearer (I think). Feel free to ignore 😉

esmvalcore/preprocessor/_io.py

Co-authored-by: Manuel Schlund <[email protected]>

schlunma · 2025-07-31T14:23:50Z

Regarding ruff: It might make sense to log the error in the form of a debug message?

valeriupredoi · 2025-07-31T14:39:37Z

Regarding ruff: It might make sense to log the error in the form of a debug message?

I really don't want to overload the debug log with stuff that is not important. What is important is that the file can't be opened as Zarr; I reformatted the exceptions block here: 71ebe4e

valeriupredoi · 2025-07-31T14:43:36Z

Another reason for that test, and why I don't want debug messages and other such things, is that in the future that file could be (and under certain conditions will be) an S3 file that doesn't have to be Zarr. It could be netCDF4, and in that case we'll jump over the Zarr load and go straight to netCDF4.Dataset or whatever means we decide to use to load network netCDF4 files (Pyfive 🤩)
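
The "jump over the Zarr load" idea can be sketched as a chain of loaders tried in order. This is a hypothetical illustration, not code from the PR: `load_with_fallback` and the loader names are made up, and the loaders are injected as plain callables so the pattern can be shown without xarray or netCDF4.

```python
from typing import Any, Callable, Sequence, Tuple

Loader = Callable[[Any], Any]


def load_with_fallback(
    file: Any,
    loaders: Sequence[Tuple[str, Loader]],
) -> Tuple[str, Any]:
    """Try each ``(name, loader)`` in order; return the first success.

    Failures are collected silently and only reported if every loader fails.
    """
    errors = []
    for name, loader in loaders:
        try:
            return name, loader(file)
        except Exception as exc:  # sketch only: catch specific errors in real code
            errors.append(f"{name}: {exc}")
    msg = f"File '{file}' could not be loaded ({'; '.join(errors)})."
    raise ValueError(msg)
```

A call such as `load_with_fallback(url, [("zarr", open_zarr), ("netcdf", open_netcdf)])`, with `open_zarr` and `open_netcdf` standing in for real loaders, would fall through to the second loader without emitting anything for the failed Zarr attempt.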

valeriupredoi · 2025-07-31T14:46:48Z

BTW here's an example of a fairly short stacktrace you get when you can't access the file, via full stack (when you don't perform the poke via fsspec) - ugly! pp-mo/ncdata#139 (comment)

schlunma

Thanks V! One final comment to make the test shorter! Cheers 🚀

tests/integration/preprocessor/_io/test_zarr.py

Co-authored-by: Manuel Schlund <[email protected]>

valeriupredoi · 2025-07-31T16:17:32Z

Thanks V! One final comment to make the test shorter! Cheers 🚀

Thanks, Manu! Very cool, popped it in 🍻

valeriupredoi added 4 commits July 24, 2025 14:03

add basic zarr support

e347f40

add basic test

5c32b55

add sample zarr store

682f46d

add sample zarr store

81c254c

valeriupredoi requested review from bouweandela and schlunma July 24, 2025 13:12

valeriupredoi added enhancement New feature or request testing labels Jul 24, 2025

valeriupredoi added 3 commits July 24, 2025 14:15

turn on gha

6a02757

add zarr as dependency

84412ab

add zarr as dependency

1f8e127

valeriupredoi added 6 commits July 24, 2025 15:46

account for remote zarrs

5b97169

add test case for remote zarr

8bcc15f

functional remote Zarr and cleanup

c0b049c

add utility and test for remote zarr

e5f8c4e

add intake-esm as dependency

9265b0d

add aiohttp as dependency

4be6152

valeriupredoi mentioned this pull request Jul 24, 2025

Add an interface for adding new data sources and add support for intake-esgf as a first example #2765

Draft

10 tasks

valeriupredoi added 3 commits July 24, 2025 18:32

fixture

28f647f

remove unwanted (for now) fixture altogether

6da4183

remove unneeded import

fb7712a

schlunma reviewed Jul 25, 2025


valeriupredoi added 4 commits July 25, 2025 16:45

add storeage options

95a92c9

semi-working version for publick bucket for esmvaltool

872be18

correct bucket with correct permissions and working test

971cf34

add yet another test

0eeeb50

dont match to exception string

8c49e20

valeriupredoi added 2 commits July 31, 2025 13:48

add info on further testing

8b6f221

unrun GHA

63411cb

schlunma reviewed Jul 31, 2025


valeriupredoi and others added 2 commits July 31, 2025 15:10

add str path test

84a33f2

Update esmvalcore/preprocessor/_io.py

8909b7d

Co-authored-by: Manuel Schlund <[email protected]>

schlunma reviewed Jul 31, 2025


valeriupredoi and others added 7 commits July 31, 2025 15:13

Update esmvalcore/preprocessor/_io.py

eff8956

Co-authored-by: Manuel Schlund <[email protected]>

Update esmvalcore/preprocessor/_io.py

a2e31ab

Co-authored-by: Manuel Schlund <[email protected]>

Update esmvalcore/preprocessor/_io.py

a387558

Co-authored-by: Manuel Schlund <[email protected]>

Update esmvalcore/preprocessor/_io.py

37266da

Co-authored-by: Manuel Schlund <[email protected]>

Update esmvalcore/preprocessor/_io.py

e13a19e

Co-authored-by: Manuel Schlund <[email protected]>

Update esmvalcore/preprocessor/_io.py

cef79ce

Co-authored-by: Manuel Schlund <[email protected]>

fix pytest msg regex

63b817f

better handling of exceptions

71ebe4e

schlunma added this to the v2.13.0 milestone Jul 31, 2025

schlunma approved these changes Jul 31, 2025


valeriupredoi and others added 3 commits July 31, 2025 17:14

Update tests/integration/preprocessor/_io/test_zarr.py

171ea74

Co-authored-by: Manuel Schlund <[email protected]>

Update tests/integration/preprocessor/_io/test_zarr.py

464c9f3

Co-authored-by: Manuel Schlund <[email protected]>

Update tests/integration/preprocessor/_io/test_zarr.py

66f9811

Co-authored-by: Manuel Schlund <[email protected]>

valeriupredoi merged commit e508bec into main Jul 31, 2025
6 checks passed

valeriupredoi deleted the zarr_support branch July 31, 2025 16:21

valeriupredoi mentioned this pull request Jul 31, 2025

Optimizations for Zarr loading and processing in ESMValTool #2790

Closed
