Commit f6da514

Fix Zarr Encoding Documentation: Correct Examples and Add Comprehensive Coverage (#10720)
* fix: update .gitignore to exclude xarray-docs and enhance Zarr encoding documentation with detailed examples for V2 and V3 formats
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* wording
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* PR number
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Fix Zarr encoding documentation with consistent examples and clarify dimension and coordinate encoding differences between Zarr V2 and V3 formats.
* s
* Clarify the encoding of dimension names in Zarr formats in the documentation.
* Refactor Zarr encoding documentation to clarify CF conventions and coordinate metadata handling

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent ba2e831 commit f6da514

3 files changed: +96 -28 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -90,3 +90,4 @@ doc/videos-gallery.txt
 # think we shouldn't...)
 uv.lock
 mypy_report/
+xarray-docs/

doc/internals/zarr-encoding-spec.rst

Lines changed: 88 additions & 28 deletions
@@ -19,26 +19,57 @@ Xarray ``Dataset`` objects.

 Second, from Xarray's point of view, the key difference between
 NetCDF and Zarr is that all NetCDF arrays have *dimension names* while Zarr
-arrays do not. Therefore, in order to store NetCDF data in Zarr, Xarray must
-somehow encode and decode the name of each array's dimensions.
-
-To accomplish this, Xarray developers decided to define a special Zarr array
-attribute: ``_ARRAY_DIMENSIONS``. The value of this attribute is a list of
-dimension names (strings), for example ``["time", "lon", "lat"]``. When writing
-data to Zarr, Xarray sets this attribute on all variables based on the variable
-dimensions. When reading a Zarr group, Xarray looks for this attribute on all
-arrays, raising an error if it can't be found. The attribute is used to define
-the variable dimension names and then removed from the attributes dictionary
-returned to the user.
-
-Because of these choices, Xarray cannot read arbitrary array data, but only
-Zarr data with valid ``_ARRAY_DIMENSIONS`` or
-`NCZarr <https://docs.unidata.ucar.edu/nug/current/nczarr_head.html>`_ attributes
-on each array (NCZarr dimension names are defined in the ``.zarray`` file).
-
-After decoding the ``_ARRAY_DIMENSIONS`` or NCZarr attribute and assigning the variable
-dimensions, Xarray proceeds to [optionally] decode each variable using its
-standard CF decoding machinery used for NetCDF data (see :py:func:`decode_cf`).
+arrays do not. In Zarr v2, Xarray uses an ad-hoc convention to encode and decode
+the name of each array's dimensions. However, starting with Zarr v3, the
+``dimension_names`` attribute provides a formal convention for storing the
+NetCDF data model in Zarr.
+
+Dimension Encoding in Zarr Formats
+-----------------------------------
+
+Xarray encodes array dimensions differently depending on the Zarr format version:
+
+**Zarr V2 Format:**
+Xarray uses a special Zarr array attribute: ``_ARRAY_DIMENSIONS``. The value of this
+attribute is a list of dimension names (strings), for example ``["time", "lon", "lat"]``.
+When writing data to Zarr V2, Xarray sets this attribute on all variables based on the
+variable dimensions. This attribute is visible when accessing arrays directly with
+zarr-python.
+
+**Zarr V3 Format:**
+Xarray uses the native ``dimension_names`` field in the array metadata. This is part
+of the official Zarr V3 specification and is not stored as a regular attribute.
+When accessing arrays with zarr-python, this information is available in the array's
+metadata but not in the attributes dictionary.
+
+When reading a Zarr group, Xarray looks for dimension information in the appropriate
+location based on the format version, raising an error if it can't be found. The
+dimension information is used to define the variable dimension names and then
+(for Zarr V2) removed from the attributes dictionary returned to the user.
+
+CF Conventions
+--------------
+
+Xarray uses its standard CF encoding/decoding functionality for handling metadata
+(see :py:func:`decode_cf`). This includes encoding concepts such as dimensions and
+coordinates. The ``coordinates`` attribute, which lists coordinate variables
+(e.g., ``"yc xc"`` for spatial coordinates), is one part of the broader CF conventions
+used to describe metadata in NetCDF and Zarr.
+
+Compatibility and Reading
+-------------------------
+
+Because of these encoding choices, Xarray cannot read arbitrary Zarr arrays, but only
+Zarr data with valid dimension metadata. Xarray supports:
+
+- Zarr V2 arrays with ``_ARRAY_DIMENSIONS`` attributes
+- Zarr V3 arrays with ``dimension_names`` metadata
+- `NCZarr <https://docs.unidata.ucar.edu/nug/current/nczarr_head.html>`_ format
+  (dimension names are defined in the ``.zarray`` file)
+
+After decoding the dimension information and assigning the variable dimensions,
+Xarray proceeds to [optionally] decode each variable using its standard CF decoding
+machinery used for NetCDF data.

 Finally, it's worth noting that Xarray writes (and attempts to read)
 "consolidated metadata" by default (the ``.zmetadata`` file), which is another
@@ -49,34 +80,63 @@ warning about poor performance when reading non-consolidated stores unless they
 explicitly set ``consolidated=False``. See :ref:`io.zarr.consolidated_metadata`
 for more details.

-As a concrete example, here we write a tutorial dataset to Zarr and then
-re-open it directly with Zarr:
+Examples: Zarr Format Differences
+----------------------------------
+
+The following examples demonstrate how dimension and coordinate encoding differs
+between Zarr format versions. We'll use the same tutorial dataset but write it
+in different formats to show what users will see when accessing the files directly
+with zarr-python.
+
+**Example 1: Zarr V2 Format**

 .. jupyter-execute::

     import os
     import xarray as xr
     import zarr

+    # Load tutorial dataset and write as Zarr V2
     ds = xr.tutorial.load_dataset("rasm")
-    ds.to_zarr("rasm.zarr", mode="w", consolidated=False)
-    os.listdir("rasm.zarr")
+    ds.to_zarr("rasm_v2.zarr", mode="w", consolidated=False, zarr_format=2)
+
+    # Open with zarr-python and examine attributes
+    zgroup = zarr.open("rasm_v2.zarr")
+    print("Zarr V2 - Tair attributes:")
+    tair_attrs = dict(zgroup["Tair"].attrs)
+    for key, value in tair_attrs.items():
+        print(f" '{key}': {repr(value)}")

 .. jupyter-execute::
+    :hide-code:

-    zgroup = zarr.open("rasm.zarr")
-    zgroup.tree()
+    import shutil
+    shutil.rmtree("rasm_v2.zarr")
+
+**Example 2: Zarr V3 Format**

 .. jupyter-execute::

-    dict(zgroup["Tair"].attrs)
+    # Write the same dataset as Zarr V3
+    ds.to_zarr("rasm_v3.zarr", mode="w", consolidated=False, zarr_format=3)
+
+    # Open with zarr-python and examine attributes
+    zgroup = zarr.open("rasm_v3.zarr")
+    print("Zarr V3 - Tair attributes:")
+    tair_attrs = dict(zgroup["Tair"].attrs)
+    for key, value in tair_attrs.items():
+        print(f" '{key}': {repr(value)}")
+
+    # For Zarr V3, dimension information is in metadata
+    tair_array = zgroup["Tair"]
+    print(f"\nZarr V3 - dimension_names in metadata: {tair_array.metadata.dimension_names}")

 .. jupyter-execute::
     :hide-code:

     import shutil
+    shutil.rmtree("rasm_v3.zarr")

-    shutil.rmtree("rasm.zarr")

 Chunk Key Encoding
 ------------------

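Note on the change above: the V2/V3 lookup it documents can be sketched without a real store. The sketch below is only an illustration under stated assumptions. The helper name ``resolve_dimension_names`` and the sample dictionaries are hypothetical and are not part of Xarray or zarr-python. The only details taken from the diff are that Zarr V2 keeps the names in the ``_ARRAY_DIMENSIONS`` attribute, that Zarr V3 keeps them in the ``dimension_names`` metadata field, and that a missing entry is an error.

```python
from typing import Any, Mapping


def resolve_dimension_names(
    metadata: Mapping[str, Any], zarr_format: int
) -> tuple[list[str], dict[str, Any]]:
    """Return (dimension names, user-facing attributes) for one array.

    Hypothetical helper for illustration; not Xarray's implementation.
    """
    # Copy the attributes so the caller's metadata is left untouched.
    attrs = dict(metadata.get("attributes", {}))
    if zarr_format == 2:
        # V2: the names live in a regular attribute, which is removed
        # from the attributes handed back to the user.
        if "_ARRAY_DIMENSIONS" not in attrs:
            raise KeyError("Zarr V2 array has no _ARRAY_DIMENSIONS attribute")
        dims = list(attrs.pop("_ARRAY_DIMENSIONS"))
    elif zarr_format == 3:
        # V3: the names live in the array metadata, outside the attributes.
        names = metadata.get("dimension_names")
        if names is None:
            raise KeyError("Zarr V3 array has no dimension_names metadata")
        dims = list(names)
    else:
        raise ValueError(f"unsupported zarr_format: {zarr_format}")
    return dims, attrs


# Made-up sample metadata standing in for what each format stores on disk.
v2_meta = {"attributes": {"units": "C", "_ARRAY_DIMENSIONS": ["time", "y", "x"]}}
v3_meta = {"dimension_names": ["time", "y", "x"], "attributes": {"units": "C"}}

v2_dims, v2_attrs = resolve_dimension_names(v2_meta, zarr_format=2)
v3_dims, v3_attrs = resolve_dimension_names(v3_meta, zarr_format=3)
print(v2_dims, v2_attrs)
print(v3_dims, v3_attrs)
```

In this sketch both formats give the same dimension names and the same user-facing attributes. That mirrors the point of the documentation change: the difference is only visible when the store is inspected directly.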
doc/whats-new.rst

Lines changed: 7 additions & 0 deletions
@@ -38,6 +38,13 @@ Bug fixes
 Documentation
 ~~~~~~~~~~~~~

+- Fixed Zarr encoding documentation with consistent examples and added comprehensive
+  coverage of dimension and coordinate encoding differences between Zarr V2 and V3 formats.
+  The documentation shows what users will see when accessing Zarr files
+  with raw zarr-python, and explains the relationship between ``_ARRAY_DIMENSIONS``
+  (Zarr V2), ``dimension_names`` metadata (Zarr V3), and CF ``coordinates`` attributes.
+  (:pull:`10720`)
+  By `Emmanuel Mathot <https://github.com/emmanuelmathot>`_.

 Internal Changes
 ~~~~~~~~~~~~~~~~

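Note on the CF ``coordinates`` attribute mentioned in this entry: it is a blank-separated string of variable names, such as the ``"yc xc"`` value quoted in the documentation change. A minimal sketch of reading it follows. The helper name ``split_cf_coordinates`` and the sample attributes are hypothetical, and this is not how Xarray's CF decoding is implemented.

```python
def split_cf_coordinates(attrs: dict) -> list[str]:
    """Split a CF-style ``coordinates`` attribute into variable names.

    Hypothetical helper for illustration only.
    """
    # A missing attribute is treated as "no auxiliary coordinates".
    return attrs.get("coordinates", "").split()


# Made-up sample attributes; only the "yc xc" value comes from the docs.
sample_attrs = {"units": "C", "coordinates": "yc xc"}
coord_names = split_cf_coordinates(sample_attrs)
print(coord_names)
```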