Commit 2ee2207

Merge branch 'main' into netcdf4-memory
2 parents cdc7523 + f6da514 commit 2ee2207

5 files changed, 126 additions and 31 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -90,3 +90,4 @@ doc/videos-gallery.txt
 # think we shouldn't...)
 uv.lock
 mypy_report/
+xarray-docs/

doc/internals/zarr-encoding-spec.rst

Lines changed: 88 additions & 28 deletions
@@ -19,26 +19,57 @@ Xarray ``Dataset`` objects.
 
 Second, from Xarray's point of view, the key difference between
 NetCDF and Zarr is that all NetCDF arrays have *dimension names* while Zarr
-arrays do not. Therefore, in order to store NetCDF data in Zarr, Xarray must
-somehow encode and decode the name of each array's dimensions.
-
-To accomplish this, Xarray developers decided to define a special Zarr array
-attribute: ``_ARRAY_DIMENSIONS``. The value of this attribute is a list of
-dimension names (strings), for example ``["time", "lon", "lat"]``. When writing
-data to Zarr, Xarray sets this attribute on all variables based on the variable
-dimensions. When reading a Zarr group, Xarray looks for this attribute on all
-arrays, raising an error if it can't be found. The attribute is used to define
-the variable dimension names and then removed from the attributes dictionary
-returned to the user.
-
-Because of these choices, Xarray cannot read arbitrary array data, but only
-Zarr data with valid ``_ARRAY_DIMENSIONS`` or
-`NCZarr <https://docs.unidata.ucar.edu/nug/current/nczarr_head.html>`_ attributes
-on each array (NCZarr dimension names are defined in the ``.zarray`` file).
-
-After decoding the ``_ARRAY_DIMENSIONS`` or NCZarr attribute and assigning the variable
-dimensions, Xarray proceeds to [optionally] decode each variable using its
-standard CF decoding machinery used for NetCDF data (see :py:func:`decode_cf`).
+arrays do not. In Zarr v2, Xarray uses an ad-hoc convention to encode and decode
+the name of each array's dimensions. However, starting with Zarr v3, the
+``dimension_names`` attribute provides a formal convention for storing the
+NetCDF data model in Zarr.
+
+Dimension Encoding in Zarr Formats
+-----------------------------------
+
+Xarray encodes array dimensions differently depending on the Zarr format version:
+
+**Zarr V2 Format:**
+Xarray uses a special Zarr array attribute: ``_ARRAY_DIMENSIONS``. The value of this
+attribute is a list of dimension names (strings), for example ``["time", "lon", "lat"]``.
+When writing data to Zarr V2, Xarray sets this attribute on all variables based on the
+variable dimensions. This attribute is visible when accessing arrays directly with
+zarr-python.
+
+**Zarr V3 Format:**
+Xarray uses the native ``dimension_names`` field in the array metadata. This is part
+of the official Zarr V3 specification and is not stored as a regular attribute.
+When accessing arrays with zarr-python, this information is available in the array's
+metadata but not in the attributes dictionary.
+
+When reading a Zarr group, Xarray looks for dimension information in the appropriate
+location based on the format version, raising an error if it can't be found. The
+dimension information is used to define the variable dimension names and then
+(for Zarr V2) removed from the attributes dictionary returned to the user.
+
+CF Conventions
+--------------
+
+Xarray uses its standard CF encoding/decoding functionality for handling metadata
+(see :py:func:`decode_cf`). This includes encoding concepts such as dimensions and
+coordinates. The ``coordinates`` attribute, which lists coordinate variables
+(e.g., ``"yc xc"`` for spatial coordinates), is one part of the broader CF conventions
+used to describe metadata in NetCDF and Zarr.
+
+Compatibility and Reading
+-------------------------
+
+Because of these encoding choices, Xarray cannot read arbitrary Zarr arrays, but only
+Zarr data with valid dimension metadata. Xarray supports:
+
+- Zarr V2 arrays with ``_ARRAY_DIMENSIONS`` attributes
+- Zarr V3 arrays with ``dimension_names`` metadata
+- `NCZarr <https://docs.unidata.ucar.edu/nug/current/nczarr_head.html>`_ format
+  (dimension names are defined in the ``.zarray`` file)
+
+After decoding the dimension information and assigning the variable dimensions,
+Xarray proceeds to [optionally] decode each variable using its standard CF decoding
+machinery used for NetCDF data.
 
 Finally, it's worth noting that Xarray writes (and attempts to read)
 "consolidated metadata" by default (the ``.zmetadata`` file), which is another
@@ -49,34 +80,63 @@ warning about poor performance when reading non-consolidated stores unless they
 explicitly set ``consolidated=False``. See :ref:`io.zarr.consolidated_metadata`
 for more details.
 
-As a concrete example, here we write a tutorial dataset to Zarr and then
-re-open it directly with Zarr:
+Examples: Zarr Format Differences
+----------------------------------
+
+The following examples demonstrate how dimension and coordinate encoding differs
+between Zarr format versions. We'll use the same tutorial dataset but write it
+in different formats to show what users will see when accessing the files directly
+with zarr-python.
+
+**Example 1: Zarr V2 Format**
 
 .. jupyter-execute::
 
     import os
     import xarray as xr
     import zarr
 
+    # Load tutorial dataset and write as Zarr V2
     ds = xr.tutorial.load_dataset("rasm")
-    ds.to_zarr("rasm.zarr", mode="w", consolidated=False)
-    os.listdir("rasm.zarr")
+    ds.to_zarr("rasm_v2.zarr", mode="w", consolidated=False, zarr_format=2)
+
+    # Open with zarr-python and examine attributes
+    zgroup = zarr.open("rasm_v2.zarr")
+    print("Zarr V2 - Tair attributes:")
+    tair_attrs = dict(zgroup["Tair"].attrs)
+    for key, value in tair_attrs.items():
+        print(f" '{key}': {repr(value)}")
 
 .. jupyter-execute::
+    :hide-code:
 
-    zgroup = zarr.open("rasm.zarr")
-    zgroup.tree()
+    import shutil
+    shutil.rmtree("rasm_v2.zarr")
+
+**Example 2: Zarr V3 Format**
 
 .. jupyter-execute::
 
-    dict(zgroup["Tair"].attrs)
+    # Write the same dataset as Zarr V3
+    ds.to_zarr("rasm_v3.zarr", mode="w", consolidated=False, zarr_format=3)
+
+    # Open with zarr-python and examine attributes
+    zgroup = zarr.open("rasm_v3.zarr")
+    print("Zarr V3 - Tair attributes:")
+    tair_attrs = dict(zgroup["Tair"].attrs)
+    for key, value in tair_attrs.items():
+        print(f" '{key}': {repr(value)}")
+
+    # For Zarr V3, dimension information is in metadata
+    tair_array = zgroup["Tair"]
+    print(f"\nZarr V3 - dimension_names in metadata: {tair_array.metadata.dimension_names}")
 
 .. jupyter-execute::
     :hide-code:
 
     import shutil
+    shutil.rmtree("rasm_v3.zarr")
 
-    shutil.rmtree("rasm.zarr")
 
 Chunk Key Encoding
 ------------------
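
Reviewer sketch (not part of the commit): the lookup rule described in the zarr-encoding-spec.rst changes above can be summarized with plain dictionaries. The helper name `dimension_names` and the sample metadata are hypothetical and exist only for illustration. This is not xarray's or zarr-python's implementation, and NCZarr handling is left out.

```python
from typing import Any, Mapping, Sequence


def dimension_names(
    zarr_format: int,
    array_metadata: Mapping[str, Any],
    attributes: Mapping[str, Any],
) -> Sequence[str]:
    """Return dimension names from the location each Zarr format uses.

    Hypothetical helper for illustration; not part of xarray or zarr-python.
    """
    if zarr_format == 2:
        # Zarr V2: the names are stored as a regular array attribute.
        source, key = attributes, "_ARRAY_DIMENSIONS"
    elif zarr_format == 3:
        # Zarr V3: the names are a field of the array metadata itself.
        source, key = array_metadata, "dimension_names"
    else:
        raise ValueError(f"unsupported Zarr format: {zarr_format}")
    if key not in source:
        # Mirrors the documented behavior of raising when names are missing.
        raise KeyError(f"no dimension names found under {key!r}")
    return list(source[key])


# Made-up sample inputs using the dimension list quoted in the documentation.
v2_attrs = {"_ARRAY_DIMENSIONS": ["time", "lon", "lat"], "units": "C"}
v3_metadata = {"zarr_format": 3, "dimension_names": ["time", "lon", "lat"]}

print(dimension_names(2, {}, v2_attrs))
print(dimension_names(3, v3_metadata, {"units": "C"}))
```

Keeping the format check separate from the key lookup makes the V2 and V3 asymmetry explicit: the same question is answered from the attributes in one case and from the metadata in the other.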

doc/whats-new.rst

Lines changed: 10 additions & 0 deletions
@@ -45,10 +45,20 @@ Bug fixes
 
 - Fix the ``align_chunks`` parameter on the :py:meth:`~xarray.Dataset.to_zarr` method, it was not being
   passed to the underlying :py:meth:`~xarray.backends.api` method (:issue:`10501`, :pull:`10516`).
+- Fix error when encoding an empty :py:class:`numpy.datetime64` array
+  (:issue:`10722`, :pull:`10723`). By `Spencer Clark
+  <https://github.com/spencerkclark>`_.
 
 Documentation
 ~~~~~~~~~~~~~
 
+- Fixed Zarr encoding documentation with consistent examples and added comprehensive
+  coverage of dimension and coordinate encoding differences between Zarr V2 and V3 formats.
+  The documentation shows what users will see when accessing Zarr files
+  with raw zarr-python, and explains the relationship between ``_ARRAY_DIMENSIONS``
+  (Zarr V2), ``dimension_names`` metadata (Zarr V3), and CF ``coordinates`` attributes.
+  (:pull:`10720`)
+  By `Emmanuel Mathot <https://github.com/emmanuelmathot>`_.
 
 Internal Changes
 ~~~~~~~~~~~~~~~~

xarray/coding/times.py

Lines changed: 6 additions & 3 deletions
@@ -1065,9 +1065,12 @@ def _eagerly_encode_cf_datetime(
             # parse with cftime instead
             raise OutOfBoundsDatetime
         assert np.issubdtype(dates.dtype, "datetime64")
-        if calendar in ["standard", "gregorian"] and np.nanmin(dates).astype(
-            "=M8[us]"
-        ).astype(datetime) < datetime(1582, 10, 15):
+        if (
+            calendar in ["standard", "gregorian"]
+            and dates.size > 0
+            and np.nanmin(dates).astype("=M8[us]").astype(datetime)
+            < datetime(1582, 10, 15)
+        ):
             raise_gregorian_proleptic_gregorian_mismatch_error = True
 
         time_unit, ref_date = _unpack_time_unit_and_ref_date(units)
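
Reviewer sketch (not part of the commit): the change above places a size check before the minimum reduction, so Python's short-circuiting `and` never evaluates the reduction on empty input. The snippet below illustrates that pattern using only the standard library. The built-in `min` on a list stands in for `np.nanmin`, which likewise raises `ValueError` on a zero-size array. The function name `predates_gregorian_reform` is hypothetical.

```python
from datetime import datetime

GREGORIAN_REFORM = datetime(1582, 10, 15)


def predates_gregorian_reform(dates, calendar):
    """Hypothetical helper mirroring the guarded condition in the diff."""
    return (
        calendar in ["standard", "gregorian"]
        # The emptiness check runs first; `and` short-circuits, so min()
        # is never called on an empty sequence.
        and len(dates) > 0
        and min(dates) < GREGORIAN_REFORM
    )


# Without the guard, the reduction itself fails on empty input.
try:
    min([])
except ValueError:
    print("min() raises ValueError on an empty sequence")

print(predates_gregorian_reform([], "standard"))
print(predates_gregorian_reform([datetime(1500, 1, 1)], "gregorian"))
print(predates_gregorian_reform([datetime(2000, 1, 1)], "standard"))
```

Ordering the cheap checks first keeps the condition a single expression instead of requiring a separate early return for the empty case.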

xarray/tests/test_coding_times.py

Lines changed: 21 additions & 0 deletions
@@ -2198,3 +2198,24 @@ def test_roundtrip_0size_timedelta(time_unit: PDDatetimeUnitOptions) -> None:
     decoded.load()
     assert decoded.dtype == np.dtype("=m8[s]")
     assert decoded.encoding == encoding
+
+
+def test_roundtrip_empty_datetime64_array(time_unit: PDDatetimeUnitOptions) -> None:
+    # Regression test for GitHub issue #10722.
+    encoding = {
+        "units": "days since 1990-1-1",
+        "dtype": np.dtype("float64"),
+        "calendar": "standard",
+    }
+    times = date_range("2000", periods=0, unit=time_unit)
+    variable = Variable(["time"], times, encoding=encoding)
+
+    encoded = conventions.encode_cf_variable(variable, name="foo")
+    assert encoded.dtype == np.dtype("float64")
+
+    decode_times = CFDatetimeCoder(time_unit=time_unit)
+    roundtripped = conventions.decode_cf_variable(
+        "foo", encoded, decode_times=decode_times
+    )
+    assert_identical(variable, roundtripped)
+    assert roundtripped.dtype == variable.dtype
