Merged
Changes from 68 commits
81 commits
e347f40
add basic zarr support
valeriupredoi Jul 24, 2025
5c32b55
add basic test
valeriupredoi Jul 24, 2025
682f46d
add sample zarr store
valeriupredoi Jul 24, 2025
81c254c
add sample zarr store
valeriupredoi Jul 24, 2025
6a02757
turn on gha
valeriupredoi Jul 24, 2025
84412ab
add zarr as dependency
valeriupredoi Jul 24, 2025
1f8e127
add zarr as dependency
valeriupredoi Jul 24, 2025
5b97169
account for remote zarrs
valeriupredoi Jul 24, 2025
8bcc15f
add test case for remote zarr
valeriupredoi Jul 24, 2025
c0b049c
functional remote Zarr and cleanup
valeriupredoi Jul 24, 2025
e5f8c4e
add utility and test for remote zarr
valeriupredoi Jul 24, 2025
9265b0d
add intake-esm as dependency
valeriupredoi Jul 24, 2025
4be6152
add aiohttp as dependency
valeriupredoi Jul 24, 2025
28f647f
fixture
valeriupredoi Jul 24, 2025
6da4183
remove unwanted (for now) fixture altogether
valeriupredoi Jul 24, 2025
fb7712a
remove unneeded import
valeriupredoi Jul 24, 2025
95a92c9
add storeage options
valeriupredoi Jul 25, 2025
872be18
semi-working version for publick bucket for esmvaltool
valeriupredoi Jul 25, 2025
971cf34
correct bucket with correct permissions and working test
valeriupredoi Jul 28, 2025
0eeeb50
add yet another test
valeriupredoi Jul 28, 2025
f5d13c8
adjust test member docstring
valeriupredoi Jul 28, 2025
fa8b90a
make io more robust
valeriupredoi Jul 28, 2025
cccdb39
change api
valeriupredoi Jul 28, 2025
1618076
test changed api
valeriupredoi Jul 28, 2025
e2ed41c
add basic test for zarr file
valeriupredoi Jul 28, 2025
fe7326e
add test for file with issues
valeriupredoi Jul 28, 2025
39df34e
reduce pytest runners to 2
valeriupredoi Jul 28, 2025
2b44ac9
run only test load
valeriupredoi Jul 28, 2025
caff216
skip a test
valeriupredoi Jul 28, 2025
d48418c
change skip message
valeriupredoi Jul 29, 2025
0909770
restore circle ci configuration
valeriupredoi Jul 29, 2025
e87b12b
skip the other test that uses the healpix dataset
valeriupredoi Jul 29, 2025
37d8a31
removed problematic skipped tests
valeriupredoi Jul 29, 2025
0d446af
add dedicated Zarr IO test module
valeriupredoi Jul 29, 2025
37fcfff
add xr to ncdata test
valeriupredoi Jul 29, 2025
94d8677
add pytest marker
valeriupredoi Jul 29, 2025
7ac7b45
run zarr test single proc
valeriupredoi Jul 29, 2025
caa3657
mark test
valeriupredoi Jul 29, 2025
b4c6b6f
remove pytest marker
valeriupredoi Jul 29, 2025
48db5f3
restore circleci configuration
valeriupredoi Jul 29, 2025
0afcec7
unmark test but dont use cf_time flag
valeriupredoi Jul 29, 2025
0c4a16f
set consolidated to False
valeriupredoi Jul 29, 2025
72d79c2
found hang cause
valeriupredoi Jul 29, 2025
8e54f1e
add Ncdata issue pointer
valeriupredoi Jul 29, 2025
1572fff
replace deprecated use cftime
valeriupredoi Jul 29, 2025
b1fe4b8
add zar3 test and fixed deprecated call with cftime
valeriupredoi Jul 29, 2025
f5c5979
add test non existing file
valeriupredoi Jul 30, 2025
8cddb55
add CMIP6 Zarr store and metadata test for it
valeriupredoi Jul 30, 2025
0d71de7
add test resources
valeriupredoi Jul 30, 2025
2ab8fc0
add purely diagnostic test
valeriupredoi Jul 30, 2025
b01b578
feed the PEP typing moster an actual type
valeriupredoi Jul 30, 2025
ea9377a
cleanup tests
valeriupredoi Jul 30, 2025
72d87bc
cleanup implement
valeriupredoi Jul 30, 2025
76b32b4
dict typing
valeriupredoi Jul 30, 2025
90c8963
Merge branch 'main' into zarr_support
valeriupredoi Jul 30, 2025
7af2ec4
Update esmvalcore/preprocessor/_io.py
valeriupredoi Jul 31, 2025
c151b57
Update esmvalcore/preprocessor/_io.py
valeriupredoi Jul 31, 2025
ab78052
Update esmvalcore/preprocessor/_io.py
valeriupredoi Jul 31, 2025
b5c3301
add mention about backend dict
valeriupredoi Jul 31, 2025
d514b67
add inline text
valeriupredoi Jul 31, 2025
2852381
removed all Zarr tests and moved to test_zarr.py
valeriupredoi Jul 31, 2025
49fb643
moved all tests from test_load here and removed tests that dont test …
valeriupredoi Jul 31, 2025
6a554d8
add mention about s3 bucket
valeriupredoi Jul 31, 2025
683b6e8
spruce up zarr tests and add an extra test for local files
valeriupredoi Jul 31, 2025
f2923e6
add dummy zar plaintext file
valeriupredoi Jul 31, 2025
8c49e20
dont match to exception string
valeriupredoi Jul 31, 2025
8b6f221
add info on further testing
valeriupredoi Jul 31, 2025
63411cb
unrun GHA
valeriupredoi Jul 31, 2025
84a33f2
add str path test
valeriupredoi Jul 31, 2025
8909b7d
Update esmvalcore/preprocessor/_io.py
valeriupredoi Jul 31, 2025
eff8956
Update esmvalcore/preprocessor/_io.py
valeriupredoi Jul 31, 2025
a2e31ab
Update esmvalcore/preprocessor/_io.py
valeriupredoi Jul 31, 2025
a387558
Update esmvalcore/preprocessor/_io.py
valeriupredoi Jul 31, 2025
37266da
Update esmvalcore/preprocessor/_io.py
valeriupredoi Jul 31, 2025
e13a19e
Update esmvalcore/preprocessor/_io.py
valeriupredoi Jul 31, 2025
cef79ce
Update esmvalcore/preprocessor/_io.py
valeriupredoi Jul 31, 2025
63b817f
fix pytest msg regex
valeriupredoi Jul 31, 2025
71ebe4e
better handling of exceptions
valeriupredoi Jul 31, 2025
171ea74
Update tests/integration/preprocessor/_io/test_zarr.py
valeriupredoi Jul 31, 2025
464c9f3
Update tests/integration/preprocessor/_io/test_zarr.py
valeriupredoi Jul 31, 2025
66f9811
Update tests/integration/preprocessor/_io/test_zarr.py
valeriupredoi Jul 31, 2025
3 changes: 3 additions & 0 deletions environment.yml
Expand Up @@ -5,6 +5,7 @@ channels:
- nodefaults

dependencies:
- aiohttp
- cartopy
- cf-units
- cftime
Expand All @@ -18,6 +19,7 @@ dependencies:
- fire
- geopy
- humanfriendly
- intake-esm
- iris >=3.12.2 # https://github.com/SciTools/iris/issues/6417
- iris-esmf-regrid >=0.11.0
- iris-grib >=0.20.0 # github.com/ESMValGroup/ESMValCore/issues/2535
Expand Down Expand Up @@ -46,6 +48,7 @@ dependencies:
- shapely >=2.0.0
- xarray
- yamale
- zarr >3
# Python packages needed for building docs
- autodocsumm >=0.2.2
- ipython <9.0 # github.com/ESMValGroup/ESMValCore/issues/2680
Expand Down
72 changes: 71 additions & 1 deletion esmvalcore/preprocessor/_io.py
Expand Up @@ -9,7 +9,9 @@
from itertools import groupby
from pathlib import Path
from typing import TYPE_CHECKING, Any
from urllib.parse import urlparse

import fsspec
import iris
import ncdata
import xarray as xr
Expand Down Expand Up @@ -75,6 +77,7 @@
def load(
file: str | Path | Cube | CubeList | xr.Dataset | ncdata.NcData,
ignore_warnings: list[dict[str, Any]] | None = None,
backend_kwargs: dict[str, Any] | None = None,
) -> CubeList:
"""Load Iris cubes.
Expand All @@ -83,10 +86,19 @@
file:
File to be loaded. If ``file`` is already a loaded dataset, return it
as a :class:`~iris.cube.CubeList`.
A ``str`` or ``Path`` whose extension contains ``zarr`` is loaded as a Zarr store.
ignore_warnings:
Keyword arguments passed to :func:`warnings.filterwarnings` used to
ignore warnings issued by :func:`iris.load_raw`. Each list element
corresponds to one call to :func:`warnings.filterwarnings`.
backend_kwargs:
Dict holding information needed by the storage backend, e.g. to access
a PRIVATE S3 bucket containing object stores (e.g. netCDF4 files);
needed by ``fsspec`` and its extensions, e.g. ``s3fs``, so
most of the time this will include ``storage_options``. Note that Zarr
files are opened via the ``http`` extension of ``fsspec``, so there is no
need for ``storage_options`` in that case (i.e. anon/anon). Currently only
used as an empty dict when opening Zarr files.
Returns
-------
Expand All @@ -102,7 +114,19 @@
"""
if isinstance(file, (str, Path)):
cubes = _load_from_file(file, ignore_warnings=ignore_warnings)
extension = (
file.suffix
if isinstance(file, Path)
else os.path.splitext(file)[1]
)
if "zarr" not in extension:
cubes = _load_from_file(file, ignore_warnings=ignore_warnings)
else:
cubes = _load_zarr(
file,
ignore_warnings=ignore_warnings,
backend_kwargs=backend_kwargs,
)
elif isinstance(file, Cube):
cubes = CubeList([file])
elif isinstance(file, CubeList):
Expand Down Expand Up @@ -134,6 +158,52 @@
return cubes


def _load_zarr(
file: str | Path | Cube | CubeList | xr.Dataset | ncdata.NcData,
ignore_warnings: list[dict[str, Any]] | None = None,
backend_kwargs: dict[str, Any] | None = None,
) -> CubeList:
# case 1: Zarr store is on remote object store
# file's URI will always be either http or https
if urlparse(str(file)).scheme in ["http", "https"]:
# basic test that opens the Zarr/.zmetadata file for Zarr2
# or Zarr/zarr.json for Zarr3
fs = fsspec.filesystem("http")
zarr2 = zarr3 = True
try:
fs.open(str(file) + "/.zmetadata", "rb") # Zarr2
except Exception: # noqa: BLE001

zarr2 = False
try:
fs.open(str(file) + "/zarr.json", "rb") # Zarr3
except Exception: # noqa: BLE001
zarr3 = False
# we don't want to catch any specific aiohttp/fsspec exception
# bottom line is that that file has issues, so raise
if not zarr2 and not zarr3:
msg = f"File '{file}' can not be open as Zarr file at the moment."
raise ValueError(msg) from None
Contributor

What happens if you try to read files that are not Zarr2 or Zarr3? Will this raise an error? If yes, I would simply let the underlying code raise that error and not build in this logic here.

Contributor Author

@valeriupredoi valeriupredoi Jul 31, 2025

There are no Zarr files other than Zarr2 and Zarr3. Zarr used to be just Zarr before the major 3.0 release, which contained a lot of structural changes and also a lot of breaking changes. Since then it's sort of lore that people say Zarr2 and Zarr3 to differentiate between the two formats; it's not an official format name per se, it's still "Zarr". Whatever is not Zarr is caught by this test, and it also catches issues with the network or temporary unavailability of the object store (that's why I phrased the exception "at the moment" 😉). Here is the Zarr3 mega release (note that they had some serious issues with 3.0, so they had to yank it): https://zarr.dev/blog/zarr-python-3-release/

Contributor

All right, but what happens if a non-Zarr2/non-Zarr3 file is passed to xr.open_dataset? I would hope that this raises an error. In that case, I think we can remove all the code here that is just there to raise an error?

Contributor Author

@valeriupredoi valeriupredoi Jul 31, 2025

bingo! That's why the test is here: Xarray uses Zarr, which uses fsspec and aiohttp to access files over the network, and indeed you will get an aiohttp error if you try to access a Zarr file that is not Zarr, but that error usually comes after a good few minutes of waiting, and with a huge stack trace. We want to avoid exactly that and literally poke the file before trying to open it. Telling you, man, it's only a few lines of code in this PR, but there's been quite a bit of sweat and guts behind these 😁

Contributor Author

also, if the network's crap or there are privacy restrictions on the file, the very quick call to fsspec resolves those a lot faster than trying to open the entire file/store

Contributor

Good points! If that's a robust way to identify Zarr files then this is fine for me.

Contributor Author

it is, even cf-python stole this test 😁
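
The pre-flight check discussed in this thread can be sketched without any network library. The snippet below is only an illustration of the idea, not the code in this PR: `detect_zarr_format` and the `can_open` callable are hypothetical names, and the example.org URLs are placeholders. In practice `can_open` might wrap a call such as `fsspec.filesystem("http").open`, as the diff does.

```python
from collections.abc import Callable


def detect_zarr_format(store_url: str, can_open: Callable[[str], bool]) -> int:
    """Return 2 or 3 depending on which Zarr marker object is reachable.

    ``can_open`` is a hypothetical callable returning True when the given
    URL can be opened and False otherwise (missing object, network error,
    access denied, ...).
    """
    base = store_url.rstrip("/")
    if can_open(f"{base}/.zmetadata"):  # consolidated metadata of a Zarr2 store
        return 2
    if can_open(f"{base}/zarr.json"):  # root metadata of a Zarr3 store
        return 3
    msg = f"File '{store_url}' can not be open as Zarr file at the moment."
    raise ValueError(msg)


# Usage with an in-memory stand-in for a remote object store.
fake_store = {
    "https://example.org/bucket/a.zarr2/.zmetadata",
    "https://example.org/bucket/b.zarr3/zarr.json",
}
print(detect_zarr_format("https://example.org/bucket/a.zarr2", fake_store.__contains__))
```

Probing two small marker objects fails fast, which is the point made above: the caller gets one short `ValueError` instead of waiting for the full open to time out.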


time_coder = xr.coders.CFDatetimeCoder(use_cftime=True)
zarr_xr = xr.open_dataset(
file,
consolidated=True,
decode_times=time_coder,
engine="zarr",
backend_kwargs=backend_kwargs,
)
# case 2: Zarr store is local to the file system
else:
zarr_xr = xr.open_dataset(
file,
consolidated=False,
engine="zarr",
backend_kwargs=backend_kwargs,
)

return dataset_to_iris(zarr_xr, ignore_warnings=ignore_warnings)
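
The remote/local branching in `_load_zarr` rests on the URL scheme alone. A minimal sketch of that test in isolation follows; `is_remote_store` is a hypothetical helper name and the `s3://` and Windows paths are made-up inputs used only to show the edge cases.

```python
from pathlib import Path
from typing import Union
from urllib.parse import urlparse


def is_remote_store(file: Union[str, Path]) -> bool:
    """Return True only for http(s) URLs; everything else counts as local."""
    return urlparse(str(file)).scheme in ("http", "https")


print(is_remote_store("https://uor-aces-o.s3-ext.jc.rl.ac.uk/esmvaltool-zarr/example_field_0.zarr2"))
print(is_remote_store(Path("tests/sample_data/zarr-sample-data/example_field_0.zarr2")))
print(is_remote_store("s3://some-bucket/example_field_0.zarr3"))
```

Because the check is an allow-list of two schemes, anything else, including an `s3://` URL or a Windows drive path, falls through to the local branch.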


def _load_from_file(
file: str | Path,
ignore_warnings: list[dict[str, Any]] | None = None,
Expand Down
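
The dispatch added to `load` keys on the file extension. The sketch below isolates that check so its edge cases are easy to see; `looks_like_zarr` is a hypothetical name, and the logic mirrors the `extension` expression in the diff above.

```python
import os
from pathlib import Path
from typing import Union


def looks_like_zarr(file: Union[str, Path]) -> bool:
    """Return True if the extension of ``file`` contains 'zarr'."""
    if isinstance(file, Path):
        extension = file.suffix
    else:
        extension = os.path.splitext(file)[1]
    return "zarr" in extension


print(looks_like_zarr("example_field_0.zarr2"))  # True
print(looks_like_zarr(Path("data/tas_Amon.nc")))  # False
print(looks_like_zarr("store.zarr/"))  # False: splitext sees no extension
print(looks_like_zarr(Path("store.zarr/")))  # True: Path drops the trailing slash
```

Two consequences of this design are worth keeping in mind: a substring match accepts suffixes such as `.zarr17`, which the tests rely on, and a string path with a trailing slash is not recognised as Zarr while the same path as a `Path` is.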
3 changes: 3 additions & 0 deletions pyproject.toml
Expand Up @@ -32,6 +32,7 @@ dynamic = [
"version",
]
dependencies = [
"aiohttp",
"cartopy",
"cf-units",
"dask[array,distributed]>=2025", # Core/issues/2503
Expand All @@ -44,6 +45,7 @@ dependencies = [
"fire",
"geopy",
"humanfriendly",
"intake-esm",
"iris-grib>=0.20.0", # github.com/ESMValGroup/ESMValCore/issues/2535
"isodate>=0.7.0",
"jinja2",
Expand All @@ -68,6 +70,7 @@ dependencies = [
"stratify>=0.3",
"xarray",
"yamale",
"zarr>3",
]
description = "A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts"
license = {text = "Apache License, Version 2.0"}
Expand Down
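
The `zarr>3` pin is strict: under PEP 440 version ordering `3.0.0` compares equal to `3`, so it is excluded, while `3.0.1` is allowed. The sketch below illustrates this with a deliberately simplified comparison that looks only at the numeric release segment; `release_tuple` and `satisfies_gt` are hypothetical helpers, not a replacement for a real version parser.

```python
import re


def release_tuple(version: str) -> tuple[int, ...]:
    """Extract the leading numeric release segment, e.g. '3.0.1rc1' -> (3, 0, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"No release segment found in {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def satisfies_gt(version: str, bound: str) -> bool:
    """Simplified check of ``version > bound`` on zero-padded release tuples."""
    release, limit = release_tuple(version), release_tuple(bound)
    width = max(len(release), len(limit))
    release += (0,) * (width - len(release))
    limit += (0,) * (width - len(limit))
    return release > limit


for candidate in ("2.18.7", "3.0.0", "3.0.1", "3.1.0"):
    print(candidate, satisfies_gt(candidate, "3"))
```

To check an installed environment, the version string could be obtained with `importlib.metadata.version("zarr")`, which raises `PackageNotFoundError` when the package is absent.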
223 changes: 223 additions & 0 deletions tests/integration/preprocessor/_io/test_zarr.py
@@ -0,0 +1,223 @@
"""
Integration tests for :func:`esmvalcore.preprocessor._io._load_zarr`.

This is a dedicated test module for Zarr files IO; we have identified
a number of issues with Zarr IO so it deserves its own test module.

We have a permanent bucket, esmvaltool-zarr, at CEDA's object store
("url": "https://uor-aces-o.s3-ext.jc.rl.ac.uk/esmvaltool-zarr"),
where we will host a number of test files. The bucket is anon/anon
(read/GET-only, but PUT can be allowed). Bucket operations are done
via the usual MinIO client (``mc`` command), e.g. ``mc list``, ``mc du`` etc.

Further performance investigations are being run with a number of tests
that look at ncdata at https://github.com/valeriupredoi/esmvaltool_zarr_tests
also see https://github.com/pp-mo/ncdata/issues/139
"""

from importlib.resources import files as importlib_files
from pathlib import Path

import cf_units
import pytest

from esmvalcore.preprocessor._io import load


def test_load_zarr2_local():
"""Test loading a Zarr2 store from local FS."""
zarr_path = (
Path(importlib_files("tests"))
/ "sample_data"
/ "zarr-sample-data"
/ "example_field_0.zarr2"
)

cubes = load(zarr_path)

assert len(cubes) == 1
cube = cubes[0]
assert cube.var_name == "q"
assert cube.standard_name == "specific_humidity"
assert cube.long_name is None
assert cube.units == cf_units.Unit("1")
coords = cube.coords()
coord_names = [coord.standard_name for coord in coords]
assert "longitude" in coord_names
assert "latitude" in coord_names


def test_load_zarr2_remote():
"""Test loading a Zarr2 store from a https Object Store."""
zarr_path = (
"https://uor-aces-o.s3-ext.jc.rl.ac.uk/"
"esmvaltool-zarr/example_field_0.zarr2"
)

# with "dummy" storage options
cubes = load(
zarr_path,
ignore_warnings=None,
backend_kwargs={"storage_options": {}},
)

assert len(cubes) == 1
cube = cubes[0]
assert cube.var_name == "q"
assert cube.standard_name == "specific_humidity"
assert cube.long_name is None
assert cube.units == cf_units.Unit("1")
coords = cube.coords()
coord_names = [coord.standard_name for coord in coords]
assert "longitude" in coord_names
assert "latitude" in coord_names

# without storage_options
cubes = load(zarr_path)

assert len(cubes) == 1
cube = cubes[0]
assert cube.var_name == "q"
assert cube.standard_name == "specific_humidity"
assert cube.long_name is None
assert cube.units == cf_units.Unit("1")
coords = cube.coords()
coord_names = [coord.standard_name for coord in coords]
assert "longitude" in coord_names
assert "latitude" in coord_names


def test_load_zarr3_remote():
"""Test loading a Zarr3 store from a https Object Store."""
zarr_path = (
"https://uor-aces-o.s3-ext.jc.rl.ac.uk/"
"esmvaltool-zarr/example_field_0.zarr3"
)

# with "dummy" storage options
cubes = load(
zarr_path,
ignore_warnings=None,
backend_kwargs={"storage_options": {}},
)

assert len(cubes) == 1
cube = cubes[0]
assert cube.var_name == "q"
assert cube.standard_name == "specific_humidity"
assert cube.long_name is None
assert cube.units == cf_units.Unit("1")
coords = cube.coords()
coord_names = [coord.standard_name for coord in coords]
assert "longitude" in coord_names
assert "latitude" in coord_names


def test_load_zarr3_cmip6_metadata():
"""
Test loading a Zarr3 store from a https Object Store.

This test loads just the metadata, no computations.

This is an actual CMIP6 dataset (Zarr built from netCDF4 via Xarray)
- Zarr store on disk: 243 MiB
- compression: Blosc
- Dimensions: (lat: 128, lon: 256, time: 2352, axis_nbounds: 2)
- chunking: time-slices; netCDF4.Dataset.chunking() = [1, 128, 256]

Test takes 8-9s (median: 8.5s) and needs max Res mem: 1GB
"""
zarr_path = (
"https://uor-aces-o.s3-ext.jc.rl.ac.uk/"
"esmvaltool-zarr/pr_Amon_CNRM-ESM2-1_02Kpd-11_r1i1p2f2_gr_200601-220112.zarr3"
)

# with "dummy" storage options
cubes = load(
zarr_path,
ignore_warnings=None,
backend_kwargs={"storage_options": {}},
)

assert len(cubes) == 1
cube = cubes[0]
assert cube.var_name == "pr"
assert cube.standard_name == "precipitation_flux"
assert cube.long_name == "Precipitation"
assert cube.units == cf_units.Unit("kg m-2 s-1")
assert cube.has_lazy_data()


def test_load_zarr_remote_not_zarr_file():
"""
Test loading a Zarr store from a https Object Store.

This fails because the file being loaded is not a Zarr file.
"""
zarr_path = (
"https://uor-aces-o.s3-ext.jc.rl.ac.uk/"
"esmvaltool-zarr/example_field_0.zarr17"
)

msg = (
"File 'https://uor-aces-o.s3-ext.jc.rl.ac.uk/"
"esmvaltool-zarr/example_field_0.zarr17' can not "
"be open as Zarr file at the moment."
)
with pytest.raises(ValueError, match=msg):
load(zarr_path)


def test_load_zarr_remote_not_file():
"""
Test loading a Zarr store from a https Object Store.

This fails because the file does not exist.
"""
zarr_path = (
"https://uor-aces-o.s3-ext.jc.rl.ac.uk/"
"esmvaltool-zarr/example_field_0.zarr22"
)

msg = (
"File 'https://uor-aces-o.s3-ext.jc.rl.ac.uk/"
"esmvaltool-zarr/example_field_0.zarr22' can not "
"be open as Zarr file at the moment."
)
with pytest.raises(ValueError, match=msg):
load(zarr_path)


def test_load_zarr_local_not_file():
"""
Test loading something that has a zarr extension.

But file doesn't exist (on local FS).
"""
zarr_path = "esmvaltool-zarr/example_field_0.zarr22"

# "Unable to find group" or "No group found"
# Zarr keeps changing the exception string so matching
# is bound to fail the test
with pytest.raises(FileNotFoundError):
load(zarr_path)


def test_load_zarr_local_not_zarr_file():
"""
Test loading something that has a zarr extension.

But file is plaintext (on local FS).
"""
zarr_path = (
Path(importlib_files("tests"))
/ "sample_data"
/ "zarr-sample-data"
/ "example_field_0.zarr17"
)

# "Unable to find group" or "No group found"
# Zarr keeps changing the exception string so matching
# is bound to fail the test
with pytest.raises(FileNotFoundError):
load(zarr_path)
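
A note on the `match=msg` assertions in the remote tests: `pytest.raises(..., match=...)` applies the pattern with `re.search`, so the dots in the URL act as regex wildcards. The sketch below shows the difference `re.escape` makes; the `lookalike` string is a made-up message used only for illustration.

```python
import re

url = (
    "https://uor-aces-o.s3-ext.jc.rl.ac.uk/"
    "esmvaltool-zarr/example_field_0.zarr17"
)
msg = f"File '{url}' can not be open as Zarr file at the moment."

# A different message in which the dot before "zarr17" is replaced by "X".
lookalike = msg.replace(".zarr17", "Xzarr17")

print(bool(re.search(msg, lookalike)))  # True: the unescaped "." matches "X"
print(bool(re.search(re.escape(msg), lookalike)))  # False: literal match only
print(bool(re.search(re.escape(msg), msg)))  # True: the real message still matches
```

Passing `match=re.escape(msg)` would make these tests assert on the literal message rather than on a looser pattern.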
1 change: 1 addition & 0 deletions tests/sample_data/zarr-sample-data/example_field_0.zarr17
@@ -0,0 +1 @@
This is not a Zarr file. Go grab lunch!
@@ -0,0 +1,3 @@
{
"Conventions": "CF-1.12"
}
@@ -0,0 +1,3 @@
{
"zarr_format": 2
}
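
The `zarr_format` key in the metadata above is what distinguishes the two formats on disk. As a rough sketch, assuming the usual marker file names from the Zarr specifications (`.zgroup` or `.zarray` for Zarr2, `zarr.json` for Zarr3; the file names of the hunks above are not shown in this view), a local store could be inspected like this. `local_zarr_format` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def local_zarr_format(store: Path) -> Optional[int]:
    """Return the ``zarr_format`` found in a store's root metadata, if any."""
    # Assumed marker names: zarr.json (Zarr3), .zgroup / .zarray (Zarr2).
    for name in ("zarr.json", ".zgroup", ".zarray"):
        candidate = store / name
        if candidate.is_file():
            metadata = json.loads(candidate.read_text(encoding="utf-8"))
            return int(metadata["zarr_format"])
    return None


# Usage: build a throwaway Zarr2-style group marker and inspect it.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "example_field_0.zarr2"
    store.mkdir()
    (store / ".zgroup").write_text(json.dumps({"zarr_format": 2}), encoding="utf-8")
    print(local_zarr_format(store))  # prints 2
```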