@rouault rouault commented Oct 13, 2025

This extension is about missing data in Zarr. It documents an existing convention introduced in Xarray: using the _FillValue attribute to represent "missing data". This is distinct from the Array.fill_value metadata, which is used as the return value for uninitialized chunks.

Co-authored-by: Ryan Abernathey <[email protected]>
Co-authored-by: Tom White <[email protected]>
Co-authored-by: Mark Kittisopikul <[email protected]>

rouault commented Oct 13, 2025

CC @rabernat @tomwhite @mkitti

jbms commented Oct 13, 2025

I understand that this is merely intended to document an existing practice, rather than propose a new practice.

However, I do still want to point out a few things:

  1. There is no default fill value in zarr v3: fill_value must always be specified explicitly. Implementations may provide a default when creating an array if one isn't specified, typically 0, but that isn't part of the spec.
  2. In zarr v2, a null fill_value did not mean "fail to read missing chunks" or "fail to perform partial writes". In early versions of zarr-python it meant that those chunks were filled with uninitialized memory (a security vulnerability, since it could leak arbitrary data from memory); in later versions it meant the same thing as a fill value of 0. In tensorstore and neuroglancer, a null fill_value in zarr v2 is treated the same as a fill_value of 0.
  3. Using a different encoding for this _FillValue attribute, rather than the same encoding as the array metadata fill_value, creates additional complications.
  4. I find the discussion of the array metadata fill_value in the proposal, particularly in the summary, confusing.

As far as I understand, this _FillValue specifies a sentinel value that explicitly indicates "missing data", e.g. "sensor failed to read a measurement" or "survey response for that question was missing".

The Array fill_value more generally indicates an element value to be used when reading chunks that aren't stored, and for partial writes to chunks that were not previously stored. In some cases it may be a true sentinel value that will never be present in the data that is written; in other cases it is merely a default value that may also occur in data that is written. It could also be used purely for storage efficiency: if it is known that many chunks will be entirely equal to a particular value, that value could be set as the fill_value to allow those chunks to not be stored.
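
To make the two behaviours described above concrete, here is a minimal sketch in plain Python. `ToyArray` and its methods are hypothetical names invented for illustration, not part of zarr-python or any other library; the sketch only models "unstored chunks read as fill_value" and "partial writes start from fill_value".

```python
# Toy model of the array-metadata fill_value.
# ToyArray is a hypothetical illustration, not a zarr API.
class ToyArray:
    def __init__(self, chunk_len, fill_value):
        self.chunk_len = chunk_len
        self.fill_value = fill_value
        self.store = {}  # chunk index -> list of values; absent key = chunk not stored

    def read_chunk(self, i):
        # A chunk that was never stored reads back as all fill_value.
        if i not in self.store:
            return [self.fill_value] * self.chunk_len
        return list(self.store[i])

    def write_element(self, i, offset, value):
        # A partial write to an unstored chunk starts from a chunk of fill_value.
        chunk = self.read_chunk(i)
        chunk[offset] = value
        self.store[i] = chunk


arr = ToyArray(chunk_len=4, fill_value=0)
unwritten = arr.read_chunk(1)          # chunk 1 is never stored
arr.write_element(0, 2, 7)             # partial write into chunk 0
partially_written = arr.read_chunk(0)
```

In this sketch, reading chunk 1 does not create an entry in `store`, which is the storage-efficiency case: a chunk consisting entirely of the fill value need not be stored at all.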

I imagine in most cases where this _FillValue attribute is useful, it would also be a reasonable value for the Array fill_value, but that is not necessarily the case.

In any case, I would suggest that the summary give a more straightforward description of _FillValue purely as a "missing/null value" sentinel (e.g. failed sensor reading, no response to survey question).

Regarding the fill_value metadata field, it could just say that "in many cases it may be useful to specify the same value for both the metadata fill_value and the _FillValue attribute, but that is not necessarily the case" and then could give an example where that is useful.
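
As one possible illustration of that wording, the sketch below contrasts the two settings. `mask_missing` is a hypothetical helper written for this example (Xarray and other readers apply their own masking logic), and the values 0 and -9999 are arbitrary assumptions.

```python
# Hypothetical helper for illustration; not an Xarray or zarr function.
def mask_missing(values, missing_sentinel):
    """Replace elements equal to the _FillValue-style sentinel with None."""
    # Note: a NaN sentinel would need math.isnan rather than ==.
    return [None if v == missing_sentinel else v for v in values]


MISSING = -9999  # assumed _FillValue attribute: "no measurement" sentinel

# Case 1: metadata fill_value (0) differs from _FillValue.
stored_chunk = [12, MISSING, 0, 15]   # the 0 here is a real measurement
unstored_chunk = [0] * 4              # what a reader gets for a chunk never written
masked_stored = mask_missing(stored_chunk, MISSING)
masked_unstored = mask_missing(unstored_chunk, MISSING)

# Case 2: metadata fill_value is set to the same value as _FillValue.
unstored_chunk_shared = [MISSING] * 4
masked_unstored_shared = mask_missing(unstored_chunk_shared, MISSING)
```

With different values, a never-written chunk is indistinguishable from a chunk of real zeros after masking. With the same value for both, never-written regions come back as missing, which is the case where sharing the value is useful.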

As I see it, the zarr v2 vs zarr v3 distinction isn't really relevant except to indicate that xarray doesn't implement this attribute for zarr v2.

The mention of the NetCDF _FillValue attribute is also confusing, because based on the description you have included, it seems that the NetCDF _FillValue works exactly the same way as the existing array fill_value, and does not allow there to be separate "not yet written" and "missing measurement" sentinels.
