Commit cbe04e6

Use a default fill_value of NaN for floats in Zarr v3 (#10757)
* Use a default fill_value of NaN for floats in Zarr v3

  Zarr stores written with Xarray now consistently use a default Zarr fill value of ``NaN`` for float variables, for both Zarr v2 and v3. All other dtypes still use the Zarr default ``fill_value`` of zero. To customize, explicitly set encoding in :py:meth:`~Dataset.to_zarr`, e.g., ``encoding=dict.fromkeys(ds.data_vars, {'fill_value': 0})``.

  Fixes #10646

* Update note on fill_value in to_zarr()
* Fix for zarr v2
* improve test_default_zarr_fill_value
* tweak test for zarr v2
* Fix error regex
* revert inadvertent test change
* Fix test failure
* Use generic np.nan

1 parent c703ce4 commit cbe04e6

4 files changed: 41 additions, 8 deletions

doc/whats-new.rst

Lines changed: 7 additions & 0 deletions
@@ -43,6 +43,13 @@ Breaking changes
   sub-groups (:pull:`10785`).
   By `Stephan Hoyer <https://github.com/shoyer>`_.
 
+- Zarr stores written with Xarray now consistently use a default Zarr fill value
+  of ``NaN`` for float variables, for both Zarr v2 and v3 (:issue:`10646`). All
+  other dtypes still use the Zarr default ``fill_value`` of zero. To customize,
+  explicitly set encoding in :py:meth:`~Dataset.to_zarr`, e.g.,
+  ``encoding=dict.fromkeys(ds.data_vars, {'fill_value': 0})``.
+  By `Stephan Hoyer <https://github.com/shoyer>`_.
+
 Deprecations
 ~~~~~~~~~~~~
 
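
The customization example in the release note relies on `dict.fromkeys` (plural), which maps every key to one shared value. The sketch below uses plain Python with hypothetical variable names standing in for `ds.data_vars`; it is illustrative only and does not call Xarray.

```python
# Hypothetical variable names standing in for ds.data_vars.
data_vars = ["temperature", "pressure"]

# dict.fromkeys: every key maps to the *same* dict object.
shared = dict.fromkeys(data_vars, {"fill_value": 0})

# Alternative: a comprehension gives each variable its own dict.
separate = {name: {"fill_value": 0} for name in data_vars}
```

Either mapping could then be passed as the `encoding` argument of `Dataset.to_zarr`. The comprehension form is the safer choice if the per-variable dicts might be modified later, since `dict.fromkeys` reuses one dict object for all keys.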

xarray/backends/zarr.py

Lines changed: 5 additions & 0 deletions
@@ -1178,6 +1178,11 @@ def set_variables(
                 fill_value = attrs.pop("_FillValue", None)
             else:
                 fill_value = v.encoding.pop("fill_value", None)
+                if fill_value is None and v.dtype.kind == "f":
+                    # For floating point data, Xarray defaults to a fill_value
+                    # of NaN (unlike Zarr, which uses zero):
+                    # https://github.com/pydata/xarray/issues/10646
+                    fill_value = np.nan
                 if "_FillValue" in attrs:
                     # replace with encoded fill value
                     fv = attrs.pop("_FillValue")
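
The added branch can be read as a small decision rule. The following is a minimal standalone sketch, not the Xarray implementation: `default_fill_value` is a hypothetical helper, `dtype_kind` stands in for `v.dtype.kind` (NumPy uses `"f"` for floating point), and `math.nan` replaces `np.nan` so the example needs only the standard library.

```python
import math


def default_fill_value(dtype_kind, explicit=None):
    """Hypothetical stand-in for the default chosen in the hunk above."""
    if explicit is None and dtype_kind == "f":
        # No fill value was requested for float data: default to NaN.
        return math.nan
    # Otherwise keep what the caller supplied (None leaves Zarr's default).
    return explicit


float_default = default_fill_value("f")      # NaN
int_default = default_fill_value("i")        # None, so Zarr's own default applies
float_custom = default_fill_value("f", 0.0)  # an explicit value is kept
```

An explicit `fill_value` in the encoding always takes precedence in this sketch; only the unset float case changes.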

xarray/core/dataset.py

Lines changed: 9 additions & 4 deletions
@@ -2350,10 +2350,15 @@ def to_zarr(
             used. Override any existing encodings by providing the ``encoding`` kwarg.
 
         ``fill_value`` handling:
-            There exists a subtlety in interpreting zarr's ``fill_value`` property. For zarr v2 format
-            arrays, ``fill_value`` is *always* interpreted as an invalid value similar to the ``_FillValue`` attribute
-            in CF/netCDF. For Zarr v3 format arrays, only an explicit ``_FillValue`` attribute will be used
-            to mask the data if requested using ``mask_and_scale=True``. See this `Github issue <https://github.com/pydata/xarray/issues/5475>`_
+            There exists a subtlety in interpreting zarr's ``fill_value`` property.
+            For Zarr v2 format arrays, ``fill_value`` is *always* interpreted as an
+            invalid value similar to the ``_FillValue`` attribute in CF/netCDF.
+            For Zarr v3 format arrays, only an explicit ``_FillValue`` attribute
+            will be used to mask the data if requested using ``mask_and_scale=True``.
+            To customize the fill value Zarr uses as a default for unwritten
+            chunks on disk, set ``_FillValue`` in encoding for Zarr v2 or
+            ``fill_value`` for Zarr v3.
+            See this `Github issue <https://github.com/pydata/xarray/issues/5475>`_
             for more.
 
         See Also
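
As one reading of the docstring note, the encoding key to set depends on the target Zarr format. The helper below is hypothetical (it is not part of Xarray) and only builds the dictionary that would be passed through `encoding`; the variable name `temperature` is likewise an assumption.

```python
def fill_value_encoding(zarr_format, value):
    """Hypothetical helper: pick the encoding key named in the docstring."""
    if zarr_format == 2:
        return {"_FillValue": value}
    if zarr_format == 3:
        return {"fill_value": value}
    raise ValueError(f"unsupported zarr_format: {zarr_format!r}")


# Per-variable encoding for a hypothetical variable written as Zarr v3.
encoding = {"temperature": fill_value_encoding(3, 0.0)}
```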

xarray/tests/test_backends.py

Lines changed: 20 additions & 4 deletions
@@ -4380,6 +4380,23 @@ def roundtrip_dir(
         ) as ds:
             yield ds
 
+    @requires_dask
+    def test_default_zarr_fill_value(self):
+        inputs = xr.Dataset({"floats": ("x", [1.0]), "ints": ("x", [1])}).chunk()
+        expected = xr.Dataset({"floats": ("x", [np.nan]), "ints": ("x", [0])})
+        with self.temp_dir() as (d, store):
+            inputs.to_zarr(store, compute=False)
+            with open_dataset(store) as on_disk:
+                assert np.isnan(on_disk.variables["floats"].encoding["_FillValue"])
+                assert (
+                    "_FillValue" not in on_disk.variables["ints"].encoding
+                )  # use default
+                if not has_zarr_v3:
+                    # zarr-python v2 interprets fill_value=None inconsistently
+                    del on_disk["ints"]
+                    del expected["ints"]
+                assert_identical(expected, on_disk)
+
     @pytest.mark.parametrize("consolidated", [True, False, None])
     @pytest.mark.parametrize("write_empty", [True, False, None])
     def test_write_empty(
@@ -4418,14 +4435,13 @@ def assert_expected_files(expected: list[str], store: str) -> None:
                 "0.1.1",
             ]
 
+        # use nan for default fill_value behaviour
+        data = np.array([np.nan, np.nan, 1.0, np.nan]).reshape((1, 2, 2))
+
         if zarr_format_3:
-            data = np.array([0.0, 0, 1.0, 0]).reshape((1, 2, 2))
             # transform to the path style of zarr 3
             # e.g. 0/0/1
             expected = [e.replace(".", "/") for e in expected]
-        else:
-            # use nan for default fill_value behaviour
-            data = np.array([np.nan, np.nan, 1.0, np.nan]).reshape((1, 2, 2))
 
         ds = xr.Dataset(data_vars={"test": (("Z", "Y", "X"), data)})
 
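
One way to see why the test data now uses NaN for both formats: when writing empty chunks is disabled, Zarr skips chunks whose values all equal the fill value. The sketch below is a simplified, standard-library model of that check and not Zarr's code; `chunk_is_empty` and `written_chunks` are hypothetical names, and one element per chunk is an assumption made for illustration.

```python
import math


def chunk_is_empty(values, fill_value):
    """Hypothetical check: True if every value matches the fill value."""
    if isinstance(fill_value, float) and math.isnan(fill_value):
        # NaN != NaN, so a NaN fill value needs an explicit isnan test.
        return all(math.isnan(v) for v in values)
    return all(v == fill_value for v in values)


def written_chunks(data, fill_value):
    # Assumption: each element forms its own chunk.
    return [i for i, v in enumerate(data) if not chunk_is_empty([v], fill_value)]


nan_data = [math.nan, math.nan, 1.0, math.nan]  # data used after this commit
zero_data = [0.0, 0, 1.0, 0]                    # data previously used for Zarr v3

nan_case = written_chunks(nan_data, math.nan)   # [2]
zero_case = written_chunks(zero_data, 0.0)      # [2]
mismatch = written_chunks(zero_data, math.nan)  # [0, 1, 2, 3]
```

Under these assumptions, zero-filled data paired with a NaN fill value would no longer leave any chunk unwritten, which is consistent with the test switching to NaN data for both formats.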
