I believe there is a new usage pattern of Zarr that should be documented and more widely known – populating Zarr arrays incrementally and on-demand, chunk-by-chunk.
We have built internal tooling around this that resembles the chunk manifests (#154, #287, virtualizarr#chunk-manifests) and VirtualiZarr (https://github.com/TomNicholas/VirtualiZarr/issues/33, virtualizarr#manifestarray).
This issue explains the usage pattern and opens the question of how our tooling could be open-sourced.
Incrementally populating Zarr arrays
A core Zarr use case for us is to initialize an empty Zarr array with a pre-determined grid and then fill it with data as needed.
For a great overview of this use case, see Earthmover's recent webinar and the relevant code repository. Note that in the webinar, the entire grid is populated; in our case, parts of the grid are populated on demand.
Code example
This code defines an empty Zarr array given an example schema. For a fuller example, see create_dataset_schema in Earthmover's repository.
```python
from odc.geo.geobox import GeoBox
from odc.geo.xr import xr_zeros
import pandas as pd


def define_schema(varname, geobox, time, band):
    schema = (
        # Chunk with -1 to get a Dask-backed array,
        # but without creating a massive computational graph
        xr_zeros(geobox, chunks=-1)
        .expand_dims(
            time=time,
            band=band,
        )
        .to_dataset(name=varname)
    )
    schema['band'] = schema['band'].astype(str)
    encoding = {varname: {"chunks": (1, 1, 256, 256)}}
    schema.to_zarr(
        'schema.zarr',
        encoding=encoding,
        compute=False,  # Do not initialize any data: no chunks written
        zarr_version=3,
        mode='w',
    )


geobox = GeoBox.from_bbox(
    (-2_000_000, -5_000_000, 2_250_000, -1_000_000),
    "epsg:3577",
    resolution=1000,
)
define_schema(
    "data",
    geobox,
    time=pd.date_range("2020-01-01", periods=12, freq='MS'),
    band=["B1", "B2", "B3"],
)
```
This pattern is powerful, as it enables us to compute and ingest only what's needed, in a query-or-compute fashion. Without going too far beyond the scope of this issue, it enables us to cascade the computation and ingestion of data chunk by chunk, as illustrated in the following diagram.
Note: The details of this example are made up for the purposes of this issue.
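For concreteness, here is a minimal sketch of what populating a single chunk on demand can look like, using xarray's region writes. The region indices are made up, and this is not our actual ingestion code:
```python
import xarray as xr

# The encoded chunk size is (1, 1, 256, 256), so this region
# covers exactly one chunk of the schema defined above
region = dict(
    time=slice(0, 1), band=slice(0, 1),
    y=slice(0, 256), x=slice(0, 256),
)

ds = xr.open_zarr('schema.zarr', zarr_version=3)
tile = ds[['data']].isel(region).load()
tile['data'][:] = 1.0  # stand-in for the real computation

# odc-geo attaches a dimensionless `spatial_ref` coordinate; region writes
# require every variable to overlap the region dims, so drop it first
tile = tile.drop_vars('spatial_ref', errors='ignore')

# Write just this chunk back; no other chunks are touched
tile.to_zarr('schema.zarr', region=region)
```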
Chunk-level MetaArray
When working with sparsely populated arrays, the natural question arises: which chunks are populated?
To this end, we employ an object we call the chunk-level meta-array (MetaArray for short). This is a second array where each pixel corresponds to a chunk in the original array. The value of the pixel is:
- 1 if the corresponding chunk is populated
- 0 if the corresponding chunk is empty
Note: Chunk initialization is not the only information one can keep in such a MetaArray. For example, the MetaArray could track the min & max of each chunk, enabling predicate pushdown, or aggregate statistics for providing approximate answers at the chunk level.
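As a quick sketch of the predicate-pushdown idea (everything here is hypothetical, not our implementation): a query such as data > threshold only needs to read the chunks whose stored maximum exceeds the threshold.
```python
import numpy as np

# Hypothetical per-chunk statistics: chunk_max[t, b, iy, ix] holds the
# maximum value within that chunk (shape matches the example schema above)
chunk_max = np.random.rand(12, 3, 16, 17)

threshold = 0.5
# Only these chunks can possibly satisfy `data > threshold`; all others
# can be skipped without reading any chunk data
candidate_chunks = np.argwhere(chunk_max > threshold)
print(f"need to read {len(candidate_chunks)} of {chunk_max.size} chunks")
```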
Toy code example
Note: This is only a very rough sketch of the resulting object to illustrate the point.
```python
import numpy as np
import xarray as xr


def create_metaarray(schema):
    """Create a meta-array from a schema.

    A very rough sketch. Not the actual implementation.
    """
    _, _, cy, cx = schema.data.encoding['chunks']
    # Create spatial pixels with coordinates equal to the centre of each chunk
    xx = schema['x'].values
    xdiff = xx[cx] - xx[0]
    x_meta = xx[::cx] + xdiff / 2
    yy = schema['y'].values
    ydiff = yy[cy] - yy[0]
    y_meta = yy[::cy] + ydiff / 2
    # Initialize an all-zeros meta-array with one pixel per chunk
    meta = xr.DataArray(
        data=np.zeros(
            (schema.sizes['time'], schema.sizes['band'], len(y_meta), len(x_meta)),
            dtype='uint8',
        ),
        dims=('time', 'band', 'y', 'x'),
        coords=dict(time=schema.time, band=schema.band, y=y_meta, x=x_meta),
        name='populated',
    )
    meta.to_zarr('schema-meta.zarr', zarr_version=3, mode='w')


schema = xr.open_zarr('schema.zarr', zarr_version=3)
create_metaarray(schema)
# When writing a chunk to the Zarr array, also populate the MetaArray
```
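To make that last comment concrete, here is a rough sketch (again, not our actual implementation) of flipping the MetaArray pixel after a chunk write, using the zarr library directly and assuming the variable name populated from the sketch above:
```python
import zarr


def mark_chunk_populated(t, b, iy, ix):
    """Flip the MetaArray pixel for chunk index (t, b, iy, ix) to 1."""
    # zarr_version=3 to match the store written above
    root = zarr.open_group('schema-meta.zarr', mode='r+', zarr_version=3)
    root['populated'][t, b, iy, ix] = 1


# e.g. after writing the chunk covering time 0, band 0, spatial tile (0, 0):
mark_chunk_populated(0, 0, 0, 0)
```
In practice, the data write and the MetaArray update need to be kept consistent with each other, which is exactly the kind of bookkeeping a storage transformer could take over (see below).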
Relationship to Chunk manifest & VirtualiZarr
Increasingly, the work in the Zarr world resembles the data structure we have built.
- The chunk manifest keeps a data structure containing information about each chunk (Beyond consolidated metadata for V3: inspiration from Apache Iceberg #154, Manifest storage transformer #287).
- VirtualiZarr, the key piece of the upstreaming of kerchunk, will likely store this information in NumPy arrays (https://github.com/TomNicholas/VirtualiZarr/issues/33) and has even considered xarray (https://github.com/TomNicholas/VirtualiZarr/issues/33#issuecomment-2000414759, https://github.com/TomNicholas/VirtualiZarr/pull/107), while also tackling the question of initialized chunks (https://github.com/TomNicholas/VirtualiZarr/issues/33#issuecomment-2000529283, https://github.com/TomNicholas/VirtualiZarr/issues/32, https://github.com/TomNicholas/VirtualiZarr/issues/16).
- Finally, ZEP 5 (Extension proposal for zarr accumulation #205) proposes a way to store chunk-level statistics, one of the potential use cases of MetaArrays.
The similarities in these proposals and implementations make me think there is a path for MetaArrays to become an open-source part of the Zarr ecosystem, potentially even enabling further use cases such as the chunk-level aggregate statistics outlined in ZEP 5.
The question is what the exact path forward would be.
Potential path forward
Here is my best guess at what the path forward could be.
There are essentially two layers to MetaArrays:
- Information on the chunk level in the index space, stored on disk. This could be taken over by the chunk manifest.
- Convenient in-memory representation & an API to query this information in the label space. This would remain the domain of MetaArrays.
Note: Here I'm using the "index" and "label" space terminology from Xarray.
1. Information on the chunk level
In the future, layer 1 could become part of Zarr itself. The chunk manifest would only need to record whether each chunk is initialized in order to contain the full information needed to construct a MetaArray.
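To illustrate: the mere presence of a chunk key in the manifest is enough to derive the 0/1 MetaArray values. The manifest layout below is a guess based on the formats sketched in #154 and the VirtualiZarr issues (path/offset/length per chunk key):
```python
import numpy as np

# Hypothetical chunk-manifest entries; the exact format is still being
# designed in the linked issues, so the details here are assumptions
manifest = {
    "0.0.0.0": {"path": "s3://bucket/data", "offset": 0, "length": 131072},
    "0.2.1.3": {"path": "s3://bucket/data", "offset": 131072, "length": 131072},
}


def initialized_mask(manifest, shape_in_chunks):
    """Derive the 0/1 MetaArray values from which chunk keys exist."""
    mask = np.zeros(shape_in_chunks, dtype='uint8')
    for key in manifest:
        mask[tuple(int(i) for i in key.split('.'))] = 1
    return mask


# Shape in chunks of the example schema above: 12 time steps, 3 bands,
# and a 16 x 17 grid of 256-pixel spatial tiles
mask = initialized_mask(manifest, shape_in_chunks=(12, 3, 16, 17))
```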
Currently, MetaArrays are stored as a separate Zarr array and updated using the zarr library. In the future:
- They could be read from whichever format the chunk manifest is stored in (of which Zarr itself is one option)
- Their updating & consistency would then be taken care of entirely by the storage transformer for the chunk manifest
2. API to query this information
To be easily queryable, the chunk-level information must be exposed back in the coordinates of the original dataset, together with an API for querying it.
Xarray is our choice for this API. For example, asking the question "do we have data for this geometry?" simply becomes a rioxarray clip operation followed by an xarray reduction, as sketched below.
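A minimal sketch of such a query, assuming the MetaArray variable is named populated as in the sketch above and that it carries the array's CRS (the geometry here is made up):
```python
import rioxarray  # noqa: F401  (registers the .rio accessor)
import xarray as xr
from shapely.geometry import box

# A hypothetical area of interest in the array's CRS (epsg:3577)
geometry = box(-1_500_000, -4_500_000, -1_000_000, -4_000_000)

meta = xr.open_zarr('schema-meta.zarr', zarr_version=3)['populated']
meta = meta.rio.write_crs('epsg:3577')

# Keep only the MetaArray pixels (i.e. chunks) touching the geometry
clipped = meta.rio.clip([geometry], drop=True)

# "Do we have all data for this geometry?" is now a plain reduction
fully_populated = bool((clipped == 1).all())
# Or per (time, band): which slices have any populated chunks at all?
any_populated = (clipped == 1).any(dim=['y', 'x'])
```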
A similar conclusion was reached by VirtualiZarr, where the use of xarray e.g. exposes a familiar API for combining multiple datasets.
MetaArrays could become another package building on top of Zarr, similar to VirtualiZarr. However, if the chunk manifest already contains the information needed to construct MetaArrays, and VirtualiZarr uses this information to construct Xarray(-like) datasets, greater integration could be possible, for example in the form of common functionality to read the chunk manifest into an xarray representation.
Note: We have further APIs that utilize Pydantic and GeoPandas for further processing of chunk information. For simplicity, I am focusing only on the Xarray representation here.
Questions
I would appreciate suggestions from the Zarr community on where to best focus the efforts.
Here is a set of concrete open questions and discussion points:
- Does the chunk manifest storage transformer cover the data needed by MetaArrays? Could the chunk manifest also represent information about chunk initialization? Or should this information live in a separate storage transformer that shares its implementation with the chunk manifest?
- What is the potential relationship to VirtualiZarr? Could they share functionality, or could a single package cover both uses? Perhaps there is common functionality for going from the manifest to an xarray object?
- Have I missed any potential difficulties here?
- Is there a different path to the one that I outlined?
A secondary set of questions is around the usefulness of all this: would the community find MetaArrays useful, be it for storing chunk-initialization information or aggregate statistics? Are people perhaps already using variants of this functionality? Creating this functionality was necessary for us – and thus likely for others, too!
Thank you for any suggestions and answers!



