Merged
Changes from 55 commits
Commits
81 commits
e347f40
add basic zarr support
valeriupredoi Jul 24, 2025
5c32b55
add basic test
valeriupredoi Jul 24, 2025
682f46d
add sample zarr store
valeriupredoi Jul 24, 2025
81c254c
add sample zarr store
valeriupredoi Jul 24, 2025
6a02757
turn on gha
valeriupredoi Jul 24, 2025
84412ab
add zarr as dependency
valeriupredoi Jul 24, 2025
1f8e127
add zarr as dependency
valeriupredoi Jul 24, 2025
5b97169
account for remote zarrs
valeriupredoi Jul 24, 2025
8bcc15f
add test case for remote zarr
valeriupredoi Jul 24, 2025
c0b049c
functional remote Zarr and cleanup
valeriupredoi Jul 24, 2025
e5f8c4e
add utility and test for remote zarr
valeriupredoi Jul 24, 2025
9265b0d
add intake-esm as dependency
valeriupredoi Jul 24, 2025
4be6152
add aiohttp as dependency
valeriupredoi Jul 24, 2025
28f647f
fixture
valeriupredoi Jul 24, 2025
6da4183
remove unwanted (for now) fixture altogether
valeriupredoi Jul 24, 2025
fb7712a
remove unneeded import
valeriupredoi Jul 24, 2025
95a92c9
add storeage options
valeriupredoi Jul 25, 2025
872be18
semi-working version for publick bucket for esmvaltool
valeriupredoi Jul 25, 2025
971cf34
correct bucket with correct permissions and working test
valeriupredoi Jul 28, 2025
0eeeb50
add yet another test
valeriupredoi Jul 28, 2025
f5d13c8
adjust test member docstring
valeriupredoi Jul 28, 2025
fa8b90a
make io more robust
valeriupredoi Jul 28, 2025
cccdb39
change api
valeriupredoi Jul 28, 2025
1618076
test changed api
valeriupredoi Jul 28, 2025
e2ed41c
add basic test for zarr file
valeriupredoi Jul 28, 2025
fe7326e
add test for file with issues
valeriupredoi Jul 28, 2025
39df34e
reduce pytest runners to 2
valeriupredoi Jul 28, 2025
2b44ac9
run only test load
valeriupredoi Jul 28, 2025
caff216
skip a test
valeriupredoi Jul 28, 2025
d48418c
change skip message
valeriupredoi Jul 29, 2025
0909770
restore circle ci configuration
valeriupredoi Jul 29, 2025
e87b12b
skip the other test that uses the healpix dataset
valeriupredoi Jul 29, 2025
37d8a31
removed problematic skipped tests
valeriupredoi Jul 29, 2025
0d446af
add dedicated Zarr IO test module
valeriupredoi Jul 29, 2025
37fcfff
add xr to ncdata test
valeriupredoi Jul 29, 2025
94d8677
add pytest marker
valeriupredoi Jul 29, 2025
7ac7b45
run zarr test single proc
valeriupredoi Jul 29, 2025
caa3657
mark test
valeriupredoi Jul 29, 2025
b4c6b6f
remove pytest marker
valeriupredoi Jul 29, 2025
48db5f3
restore circleci configuration
valeriupredoi Jul 29, 2025
0afcec7
unmark test but dont use cf_time flag
valeriupredoi Jul 29, 2025
0c4a16f
set consolidated to False
valeriupredoi Jul 29, 2025
72d79c2
found hang cause
valeriupredoi Jul 29, 2025
8e54f1e
add Ncdata issue pointer
valeriupredoi Jul 29, 2025
1572fff
replace deprecated use cftime
valeriupredoi Jul 29, 2025
b1fe4b8
add zar3 test and fixed deprecated call with cftime
valeriupredoi Jul 29, 2025
f5c5979
add test non existing file
valeriupredoi Jul 30, 2025
8cddb55
add CMIP6 Zarr store and metadata test for it
valeriupredoi Jul 30, 2025
0d71de7
add test resources
valeriupredoi Jul 30, 2025
2ab8fc0
add purely diagnostic test
valeriupredoi Jul 30, 2025
b01b578
feed the PEP typing moster an actual type
valeriupredoi Jul 30, 2025
ea9377a
cleanup tests
valeriupredoi Jul 30, 2025
72d87bc
cleanup implement
valeriupredoi Jul 30, 2025
76b32b4
dict typing
valeriupredoi Jul 30, 2025
90c8963
Merge branch 'main' into zarr_support
valeriupredoi Jul 30, 2025
7af2ec4
Update esmvalcore/preprocessor/_io.py
valeriupredoi Jul 31, 2025
c151b57
Update esmvalcore/preprocessor/_io.py
valeriupredoi Jul 31, 2025
ab78052
Update esmvalcore/preprocessor/_io.py
valeriupredoi Jul 31, 2025
b5c3301
add mention about backend dict
valeriupredoi Jul 31, 2025
d514b67
add inline text
valeriupredoi Jul 31, 2025
2852381
removed all Zarr tests and moved to test_zarr.py
valeriupredoi Jul 31, 2025
49fb643
moved all tests from test_load here and removed tests that dont test …
valeriupredoi Jul 31, 2025
6a554d8
add mention about s3 bucket
valeriupredoi Jul 31, 2025
683b6e8
spruce up zarr tests and add an extra test for local files
valeriupredoi Jul 31, 2025
f2923e6
add dummy zar plaintext file
valeriupredoi Jul 31, 2025
8c49e20
dont match to exception string
valeriupredoi Jul 31, 2025
8b6f221
add info on further testing
valeriupredoi Jul 31, 2025
63411cb
unrun GHA
valeriupredoi Jul 31, 2025
84a33f2
add str path test
valeriupredoi Jul 31, 2025
8909b7d
Update esmvalcore/preprocessor/_io.py
valeriupredoi Jul 31, 2025
eff8956
Update esmvalcore/preprocessor/_io.py
valeriupredoi Jul 31, 2025
a2e31ab
Update esmvalcore/preprocessor/_io.py
valeriupredoi Jul 31, 2025
a387558
Update esmvalcore/preprocessor/_io.py
valeriupredoi Jul 31, 2025
37266da
Update esmvalcore/preprocessor/_io.py
valeriupredoi Jul 31, 2025
e13a19e
Update esmvalcore/preprocessor/_io.py
valeriupredoi Jul 31, 2025
cef79ce
Update esmvalcore/preprocessor/_io.py
valeriupredoi Jul 31, 2025
63b817f
fix pytest msg regex
valeriupredoi Jul 31, 2025
71ebe4e
better handling of exceptions
valeriupredoi Jul 31, 2025
171ea74
Update tests/integration/preprocessor/_io/test_zarr.py
valeriupredoi Jul 31, 2025
464c9f3
Update tests/integration/preprocessor/_io/test_zarr.py
valeriupredoi Jul 31, 2025
66f9811
Update tests/integration/preprocessor/_io/test_zarr.py
valeriupredoi Jul 31, 2025
1 change: 1 addition & 0 deletions .github/workflows/run-tests.yml
@@ -21,6 +21,7 @@ on:
push:
branches:
- main
- zarr_support
# run the test only if the PR is to main
# turn it on if required
#pull_request:
3 changes: 3 additions & 0 deletions environment.yml
@@ -5,6 +5,7 @@ channels:
- nodefaults

dependencies:
- aiohttp
- cartopy
- cf-units
- cftime
@@ -18,6 +19,7 @@ dependencies:
- fire
- geopy
- humanfriendly
- intake-esm
- iris >=3.12.2 # https://github.com/SciTools/iris/issues/6417
- iris-esmf-regrid >=0.11.0
- iris-grib >=0.20.0 # github.com/ESMValGroup/ESMValCore/issues/2535
@@ -46,6 +48,7 @@ dependencies:
- shapely >=2.0.0
- xarray
- yamale
- zarr >3
# Python packages needed for building docs
- autodocsumm >=0.2.2
- ipython <9.0 # github.com/ESMValGroup/ESMValCore/issues/2680
69 changes: 68 additions & 1 deletion esmvalcore/preprocessor/_io.py
@@ -9,7 +9,9 @@
from itertools import groupby
from pathlib import Path
from typing import TYPE_CHECKING, Any
from urllib.parse import urlparse

import fsspec
import iris
import ncdata
import xarray as xr
@@ -75,6 +77,7 @@
def load(
    file: str | Path | Cube | CubeList | xr.Dataset | ncdata.NcData,
    ignore_warnings: list[dict[str, Any]] | None = None,
    backend_kwargs: dict[str, Any] | None = None,
) -> CubeList:
    """Load Iris cubes.
@@ -83,10 +86,18 @@
    file:
        File to be loaded. If ``file`` is already a loaded dataset, return it
        as a :class:`~iris.cube.CubeList`.
        File as ``Path`` object could be a Zarr store.
    ignore_warnings:
        Keyword arguments passed to :func:`warnings.filterwarnings` used to
        ignore warnings issued by :func:`iris.load_raw`. Each list element
        corresponds to one call to :func:`warnings.filterwarnings`.
    backend_kwargs:
        Dict to hold info needed by storage backend e.g. to access
        a PRIVATE S3 bucket containing object stores (e.g. netCDF4 files);
        needed by ``fsspec`` and its extensions e.g. ``s3fs``, so
        most of the time it is a ``storage_options`` dict. Note that Zarr
        files are opened via ``http`` extension of ``fsspec``, so no need
        for ``storage_options`` in that case (i.e. anon/anon).

    Returns
    -------
@@ -102,7 +113,19 @@
     """
     if isinstance(file, (str, Path)):
-        cubes = _load_from_file(file, ignore_warnings=ignore_warnings)
+        extension = (
+            file.suffix
+            if isinstance(file, Path)
+            else os.path.splitext(file)[1]
+        )
+        if "zarr" not in extension:
+            cubes = _load_from_file(file, ignore_warnings=ignore_warnings)
+        else:
+            cubes = _load_zarr(
+                file,
+                ignore_warnings=ignore_warnings,
+                backend_kwargs=backend_kwargs,
+            )
     elif isinstance(file, Cube):
         cubes = CubeList([file])
     elif isinstance(file, CubeList):
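
The dispatch in this hunk keys on the file extension. A minimal standalone sketch of that check, using only the standard library, shows which inputs would be routed to the Zarr loader. The names `file_extension` and `is_zarr_candidate` are hypothetical helpers for illustration, not functions from this PR.

```python
from __future__ import annotations

import os
from pathlib import Path


def file_extension(file: str | Path) -> str:
    """Return the trailing extension of a local path or a URL string."""
    if isinstance(file, Path):
        return file.suffix
    return os.path.splitext(file)[1]


def is_zarr_candidate(file: str | Path) -> bool:
    """Mirror the substring test on the extension used in the hunk."""
    return "zarr" in file_extension(file)
```

Because this is a substring test, extensions such as `.zarr2`, `.zarr3`, and even `.zarr17` all qualify, while `.nc` does not.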
@@ -134,6 +157,50 @@
    return cubes


def _load_zarr(
    file: str | Path | Cube | CubeList | xr.Dataset | ncdata.NcData,
    ignore_warnings: list[dict[str, Any]] | None = None,
    backend_kwargs: dict[str, Any] | None = None,
) -> CubeList:
    if isinstance(file, Path):
        zarr_xr = xr.open_dataset(
            file,
            consolidated=False,
            engine="zarr",
        )

@schlunma (Contributor), Jul 31, 2025:

Suggested change
    if isinstance(file, Path):
        zarr_xr = xr.open_dataset(
            file,
            consolidated=False,
            engine="zarr",
        )

This is the part which is supposed to treat files from the local filesystem, right? You also need to consider the case where file is a str and points to an object in the local filesystem. Currently, this will just run into the else part and return an empty list.

I would actually put this in the else condition below (see my comments below) and let xarray raise any potential errors.

Contributor Author:
excellent point! This is how it looks now d514b67

Contributor Author:
and it's also tested fully now 🍺

Contributor:
Sorry, I cannot find the test. Would you point me to it? 😅

Contributor Author:
this one and the one below it

def test_load_zarr_local_not_file():

Contributor:
Ah, that's a nice test, but not exactly what I meant. I think it would be nice to simply replicate test_load_zarr2_local with a str as the file argument, i.e., simply use cubes = load(str(zarr_path)) in there.

Contributor Author:
ah that I missed, plopped in 84a33f2
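
The point raised in this thread is that a `str` naming a local store should be treated like a `Path`, rather than falling through to an empty result. A small classifier illustrates the idea. This is a sketch only; `zarr_location` is a hypothetical helper and not the code that landed in d514b67.

```python
from __future__ import annotations

from pathlib import Path
from urllib.parse import urlparse


def zarr_location(file: str | Path) -> str:
    """Classify a Zarr store location as "remote" or "local"."""
    # Path objects are always local; only strings can carry a URL scheme.
    if isinstance(file, str) and urlparse(file).scheme in ("http", "https"):
        return "remote"
    # Everything else, including plain str paths, is handled as local.
    return "local"
```

With this shape, `load(str(zarr_path))` and `load(zarr_path)` would take the same branch, which is what the requested test exercises.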

    elif urlparse(file).scheme in ["http", "https"]:
        # basic test that opens the Zarr/.zmetadata file for Zarr2
        # or Zarr/zarr.json for Zarr3
        fs = fsspec.filesystem("http")
        zarr2 = zarr3 = True
        try:
            fs.open(file + "/.zmetadata", "rb")  # Zarr2
        except Exception:  # noqa: BLE001
            zarr2 = False
        try:
            fs.open(file + "/zarr.json", "rb")  # Zarr3
        except Exception:  # noqa: BLE001
            zarr3 = False
        # we don't want to catch any specific aiohttp/fsspec exception
        # bottom line is that that file has issues, so raise
        if not zarr2 and not zarr3:
            msg = f"File '{file}' can not be open as Zarr file at the moment."
            raise ValueError(msg) from None

Check notice (Codacy Static Code Analysis) on esmvalcore/preprocessor/_io.py#L178: Catching too general exception Exception (broad-exception-caught)

Contributor:
What happens if you try to read files that are not Zarr2 or Zarr3? Will this raise an error? If yes, I would simply let the underlying code raise that error and not build in this logic here.

@valeriupredoi (Contributor Author), Jul 31, 2025:
There are no Zarr files other than Zarr2 and Zarr3. Zarr used to be just Zarr before the major 3.0 release, which contained a lot of structural changes and a lot of breaking changes. Since then it's sort of lore that people say Zarr2 and Zarr3 to differentiate between the two formats; it's not an official format per se, it's still "Zarr". Whatever is not Zarr is caught by this test, and it also catches issues with the network or temporary unavailability of the object store (that's why I phrased the exception "at the moment" 😉). Here is the Zarr3 mega release (note that they had some serious issues with 3.0, so they had to yank it): https://zarr.dev/blog/zarr-python-3-release/

Contributor:
All right, but what happens if a non-Zarr2/non-Zarr3 file is passed to xr.open_dataset? I would hope that this raises an error. In that case, I think we can remove all the code here that is just there to raise an error?

@valeriupredoi (Contributor Author), Jul 31, 2025:
Bingo! That's why the test is here: Xarray uses Zarr, which uses fsspec and aiohttp to access files over the network, and indeed you will get an aiohttp error if you try to access a Zarr file that is not Zarr. But that error usually comes after a good few minutes of waiting, and with a huge stack trace. We want to avoid exactly that and literally poke the file before trying to open it. Telling you, man, it's only a few lines of code in this PR, but there's been quite a bit of sweat and guts behind them 😁

Contributor Author:
also if the network's crap or there are privacy restrictions on the file - the very quick call to fsspec resolves those a lot faster than trying to open the entire file/store

Contributor:
Good points! If that's a robust way to identify Zarr files then this is fine for me.

Contributor Author:
it is, even cf-python stole this test 😁
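
The "poke before opening" idea discussed in this thread can be sketched without any network access by injecting the function that opens an object. In the PR that role is played by `fs.open` on an `fsspec` HTTP filesystem; here `probe_zarr`, `ZARR_MARKERS`, and `make_fake_opener` are hypothetical names used only for illustration, and the sketch returns on the first marker found rather than checking both.

```python
from __future__ import annotations

from typing import Callable

# Marker objects named in the diff: Zarr2 consolidated metadata and Zarr3 metadata.
ZARR_MARKERS = {"zarr2": "/.zmetadata", "zarr3": "/zarr.json"}


def probe_zarr(url: str, opener: Callable[[str], object]) -> str:
    """Return "zarr2" or "zarr3" for the first marker that opens, else raise."""
    for version, marker in ZARR_MARKERS.items():
        try:
            opener(url + marker)
        except Exception:  # any failure means this marker is unavailable
            continue
        return version
    msg = f"File '{url}' can not be open as Zarr file at the moment."
    raise ValueError(msg)


def make_fake_opener(existing: set[str]) -> Callable[[str], object]:
    """Build an opener that only succeeds for the given object names."""

    def opener(name: str) -> object:
        if name not in existing:
            raise FileNotFoundError(name)
        return name

    return opener
```

The cheap probe fails fast with a short message, which is the behaviour the author argues for over waiting on a slow failure deeper in the stack.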


        time_coder = xr.coders.CFDatetimeCoder(use_cftime=True)
        zarr_xr = xr.open_dataset(
            file,
            consolidated=True,
            decode_times=time_coder,
            engine="zarr",
            backend_kwargs=backend_kwargs,
        )
    else:
        return []

    return dataset_to_iris(zarr_xr, ignore_warnings=ignore_warnings)


def _load_from_file(
    file: str | Path,
    ignore_warnings: list[dict[str, Any]] | None = None,
3 changes: 3 additions & 0 deletions pyproject.toml
@@ -32,6 +32,7 @@ dynamic = [
"version",
]
dependencies = [
"aiohttp",
"cartopy",
"cf-units",
"dask[array,distributed]>=2025", # Core/issues/2503
@@ -44,6 +45,7 @@ dependencies = [
"fire",
"geopy",
"humanfriendly",
"intake-esm",
"iris-grib>=0.20.0", # github.com/ESMValGroup/ESMValCore/issues/2535
"isodate>=0.7.0",
"jinja2",
@@ -68,6 +70,7 @@ dependencies = [
"stratify>=0.3",
"xarray",
"yamale",
"zarr>3",
]
description = "A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts"
license = {text = "Apache License, Version 2.0"}
129 changes: 129 additions & 0 deletions tests/integration/preprocessor/_io/test_load.py
@@ -4,6 +4,7 @@
from importlib.resources import files as importlib_files
from pathlib import Path

import cf_units
import iris
import ncdata
import numpy as np
@@ -120,6 +121,134 @@ def test_load_ncdata():
    assert not cube.coords()


def test_load_zarr_local():
    """Test loading a Zarr store as ncdata.NcData via Xarray."""
    zarr_path = (
        Path(importlib_files("tests"))
        / "sample_data"
        / "zarr-sample-data"
        / "example_field_0.zarr2"
    )

    cubes = load(zarr_path)

    assert len(cubes) == 1
    cube = cubes[0]
    assert cube.var_name == "q"
    assert cube.standard_name == "specific_humidity"
    assert cube.long_name is None
    assert cube.units == cf_units.Unit("1")
    coords = cube.coords()
    coord_names = [coord.standard_name for coord in coords]
    assert "longitude" in coord_names
    assert "latitude" in coord_names


def test_load_zarr_remote():
    """
    Test loading a Zarr store from a https Object Store.

    We have a permanent bucket: esmvaltool-zarr at CEDA's object store
    "url": "https://uor-aces-o.s3-ext.jc.rl.ac.uk",
    where we will host a number of test files.
    """
    zarr_path = (
        "https://uor-aces-o.s3-ext.jc.rl.ac.uk/"
        "esmvaltool-zarr/example_field_0.zarr2"
    )

    # with "dummy" storage options
    cubes = load(
        zarr_path,
        ignore_warnings=None,
        backend_kwargs={"storage_options": {}},
    )

    assert len(cubes) == 1
    cube = cubes[0]
    assert cube.var_name == "q"
    assert cube.standard_name == "specific_humidity"
    assert cube.long_name is None
    assert cube.units == cf_units.Unit("1")
    coords = cube.coords()
    coord_names = [coord.standard_name for coord in coords]
    assert "longitude" in coord_names
    assert "latitude" in coord_names

    # without storage_options
    cubes = load(zarr_path)

    assert len(cubes) == 1
    cube = cubes[0]
    assert cube.var_name == "q"
    assert cube.standard_name == "specific_humidity"
    assert cube.long_name is None
    assert cube.units == cf_units.Unit("1")
    coords = cube.coords()
    coord_names = [coord.standard_name for coord in coords]
    assert "longitude" in coord_names
    assert "latitude" in coord_names


def test_load_zarr_remote_not_zarrfile():
    """
    Test loading a Zarr store from a https Object Store.

    This fails because the file being loaded is not a Zarr file.
    We have a permanent bucket: esmvaltool-zarr at CEDA's object store
    "url": "https://uor-aces-o.s3-ext.jc.rl.ac.uk",
    where we will host a number of test files.
    """
    zarr_path = (
        "https://uor-aces-o.s3-ext.jc.rl.ac.uk/"
        "esmvaltool-zarr/example_field_0.zarr17"
    )

    msg = (
        "File 'https://uor-aces-o.s3-ext.jc.rl.ac.uk/"
        "esmvaltool-zarr/example_field_0.zarr17' can not "
        "be open as Zarr file at the moment."
    )
    with pytest.raises(ValueError, match=msg):
        load(zarr_path)
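
`pytest.raises(..., match=msg)` treats `msg` as a regular expression and applies `re.search` to the exception text, so the dots in a URL act as wildcards unless they are escaped. The sketch below uses only the standard `re` module to show the difference; the `lookalike` string is an invented example, not output from the PR.

```python
import re

message = (
    "File 'https://uor-aces-o.s3-ext.jc.rl.ac.uk/"
    "esmvaltool-zarr/example_field_0.zarr17' can not "
    "be open as Zarr file at the moment."
)

# Unescaped, each "." is a wildcard, so a slightly different message still matches.
lookalike = message.replace("ac.uk", "acXuk")
unescaped_matches_lookalike = re.search(message, lookalike) is not None

# Escaped, only the literal text matches.
escaped = re.escape(message)
escaped_matches_original = re.search(escaped, message) is not None
escaped_matches_lookalike = re.search(escaped, lookalike) is not None
```

Passing `re.escape(msg)` as the `match` argument is one way to make such an assertion strict about the literal message.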


def test_load_zarr_remote_not_file():
    """
    Test loading a Zarr store from a https Object Store.

    This fails because the file does not exist.
    We have a permanent bucket: esmvaltool-zarr at CEDA's object store
    "url": "https://uor-aces-o.s3-ext.jc.rl.ac.uk",
    where we will host a number of test files.
    """
    zarr_path = (
        "https://uor-aces-o.s3-ext.jc.rl.ac.uk/"
        "esmvaltool-zarr/example_field_0.zarr22"
    )

    msg = (
        "File 'https://uor-aces-o.s3-ext.jc.rl.ac.uk/"
        "esmvaltool-zarr/example_field_0.zarr22' can not "
        "be open as Zarr file at the moment."
    )
    with pytest.raises(ValueError, match=msg):
        load(zarr_path)


def test_load_not_zarr():
    """
    Test loading something that has a zarr extension.

    But file doesn't hold any data / doesn't exist.
    """
    zarr_path = "esmvaltool-zarr/example_field_0.zarr22"

    msg = "esmvaltool-zarr/example_field_0.zarr22 does not contain any data"
    with pytest.raises(ValueError, match=msg):
        load(zarr_path)


def test_load_invalid_type_fail():
    """Test loading an invalid type."""
    with pytest.raises(TypeError):