Incrementally-populated Zarr Arrays #300

@DahnJ

Description

I believe there is a new usage pattern of Zarr that should be documented and more widely known – populating Zarr arrays incrementally and on-demand, chunk-by-chunk.

We have built internal tooling around this that resembles the chunk manifests (#154, #287, virtualizarr#chunk-manifests) and VirtualiZarr (https://github.com/TomNicholas/VirtualiZarr/issues/33, virtualizarr#manifestarray).

This issue explains the usage pattern and opens the question of how our tooling could be open-sourced.

Incrementally populating Zarr arrays

A core Zarr use case for us is to initialize an empty Zarr array with a pre-determined grid and then fill it with data as needed.

For a great overview of this use case, see Earthmover's recent webinar and the relevant code repository. Note that in the webinar, the entire grid is populated. In our case, parts of the grid are populated on-demand.

Code example

This code defines an empty Zarr array given an example schema. For a fuller example, see create_dataset_schema in Earthmover's repository.

from odc.geo.geobox import GeoBox
from odc.geo.xr import xr_zeros
import pandas as pd

def define_schema(varname, geobox, time, band):
    schema = (
        # Chunks -1 to have a Dask-backed array
        # But without creating a massive computational graph
        xr_zeros(geobox, chunks=-1)
        .expand_dims(
            time=time,
            band=band,
        )
        .to_dataset(name=varname)
    )

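    # Ensure band labels use a string dtype that Zarr can serialize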
    schema['band'] = schema['band'].astype(str)
    encoding = {varname: {"chunks": (1, 1, 256, 256)}}

    schema.to_zarr(
        'schema.zarr',
        encoding=encoding,
        compute=False,  # Do not initialize any data: no chunks written
        zarr_version=3,
        mode='w'
    )

geobox = GeoBox.from_bbox(
    (-2_000_000, -5_000_000, 2_250_000, -1_000_000),
    "epsg:3577",
    resolution=1000,
)


define_schema(
    "data", 
    geobox, 
    time=pd.date_range("2020-01-01", periods=12, freq='MS'), 
    band=["B1", "B2", "B3"]
)
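
Once the schema exists, individual chunks can be filled on demand via region writes. The following is a minimal sketch assuming the schema above; the random values stand in for a real computation, and coordinate variables are dropped before writing, since only the data variable belongs in the region write.

import numpy as np
import xarray as xr

# Open the empty array lazily
ds = xr.open_zarr('schema.zarr', zarr_version=3)

# Index-space region covering exactly one (1, 1, 256, 256) chunk
region = dict(time=slice(0, 1), band=slice(0, 1), y=slice(0, 256), x=slice(0, 256))

# Compute (or fetch) data for just this chunk; random values as a placeholder
chunk = ds.isel(region).load()
chunk['data'].values[:] = np.random.rand(1, 1, 256, 256)

# Drop coordinate variables so only the data variable is written to the region
chunk = chunk.drop_vars(list(chunk.coords))

# Writes only the chunks inside `region`; all other chunks stay untouched
chunk.to_zarr('schema.zarr', region=region)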

This pattern is powerful, as it enables us to compute and ingest only what's needed, in a query-or-compute fashion. Without going too far beyond the scope of this issue, it enables us to cascade the computation/ingestion of data chunk-by-chunk, as illustrated in the following diagram.

[Diagram: chunk-by-chunk cascade of computation and ingestion]

Note: The details of this example are made up for the purposes of this issue.

Chunk-level MetaArray

When working with sparsely populated arrays, the natural question arises: which chunks are populated?

To this end, we employ an object we call the chunk-level meta-array (MetaArray for short). This is a second array where each pixel corresponds to a chunk in the original array. The value of the pixel is

  • 1 if the corresponding chunk is populated
  • 0 if the corresponding chunk is empty

[Image: a Zarr array alongside its chunk-level MetaArray, with pixels marking populated chunks]

Note: Chunk initialization is not the only information one can keep in such a MetaArray. For example, the MetaArray could keep track of the min & max of each chunk, enabling predicate pushdown, or aggregate statistics for providing approximate answers at the chunk level.
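
As a sketch of that statistics variant (the store name and the chunk_max variable are hypothetical; the store above only tracks initialization), predicate pushdown then becomes a simple comparison on the chunk grid:

import xarray as xr

# Hypothetical MetaArray store that also tracks per-chunk statistics
stats = xr.open_zarr('schema-meta-stats.zarr')  # assumed store name

# Predicate pushdown for "data > 10": only chunks whose max exceeds 10
# can possibly contain matching pixels; all other chunks can be skipped
candidate_chunks = stats['chunk_max'] > 10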

Toy code example

Note: This is only a very rough sketch of the resulting object to illustrate the point.

import numpy as np
import xarray as xr

def create_metaarray(schema):
    """Create a meta-array from a schema.
    
    A very rough sketch. Not the actual implementation.
    """
    _, _, cy, cx = schema.data.encoding['chunks']

    # Create spatial pixels with coordinates equal to the centre of each chunk
    xx = schema['x'].values
    xdiff = xx[cx] - xx[0]
    x_meta = xx[::cx] + xdiff / 2

    yy = schema['y'].values
    ydiff = yy[cy] - yy[0]
    y_meta = yy[::cy] + ydiff / 2

    # Initialize an all-zero meta-array: one pixel per chunk of the original array
    meta = xr.DataArray(
        data=np.zeros(
            (schema.sizes['time'], schema.sizes['band'], len(y_meta), len(x_meta)),
            dtype='uint8',
        ),
        dims=('time', 'band', 'y', 'x'),
        coords=dict(time=schema.time, band=schema.band, y=y_meta, x=x_meta),
        name='initialized',
    )

    meta.to_zarr('schema-meta.zarr', zarr_version=3, mode='w')


schema = xr.open_zarr('schema.zarr', zarr_version=3)
create_metaarray(schema)

# When writing to the Zarr array, also populate the metaarray.
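
As a hedged illustration of that last step (the helper and its signature are hypothetical, not our actual API; the variable name matches the sketch above), a chunk write can be followed by flipping the corresponding meta-array pixel:

import zarr

def mark_chunk_populated(meta_store, varname, ti, bi, yi, xi):
    """Hypothetical helper: set one meta-array pixel to 1 after its chunk is written."""
    meta = zarr.open(meta_store)[varname]
    meta[ti, bi, yi, xi] = 1

# e.g. after writing the chunk at time index 0, band index 0, chunk row 0, column 0:
mark_chunk_populated('schema-meta.zarr', 'initialized', 0, 0, 0, 0)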

Relationship to Chunk manifest & VirtualiZarr

Increasingly, work in the Zarr ecosystem resembles the data structure we have built.

The similarities in these proposals and implementations make me think there is a path where the MetaArrays could become an open-source part of the Zarr ecosystem, potentially even enabling further use-cases such as chunk-level aggregate statistics outlined in ZEP 5.

The question is what the exact path forward would be.

Potential path forward

Here is my best guess at what the path forward could be.

There are essentially two layers to MetaArrays:

  1. Information on the chunk level in the index space, stored on disk. This could be taken over by the chunk manifest.
  2. Convenient in-memory representation & an API to query this information in the label space. This would remain the domain of MetaArrays.

[Diagram: the two layers of MetaArrays: chunk-level information in index space on disk, and a queryable representation in label space]

Note: Here I'm using the "index" and "label" space terminology from Xarray.

1. Information on the chunk level

In the future, 1. could become part of Zarr itself. To contain the full information needed to construct a MetaArray, the chunk manifest would only additionally need to represent whether each chunk is initialized.
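
To illustrate (purely a sketch, not the actual chunk-manifest format proposed in #154): a manifest maps chunk keys to byte ranges, so the absence of a key can already be read as "not initialized", which is exactly the information the boolean MetaArray exposes.

# Purely illustrative sketch, not the actual chunk-manifest format
manifest = {
    "0.0.0.0": {"path": "s3://bucket/data/chunk-a", "offset": 0, "length": 131072},
    "0.0.1.0": {"path": "s3://bucket/data/chunk-b", "offset": 0, "length": 131072},
    # "0.0.0.1" absent: that chunk has never been written
}

def chunk_initialized(manifest: dict, key: str) -> bool:
    """Boolean MetaArray entry for one chunk, derived from the manifest."""
    return key in manifest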

Currently, MetaArrays are stored as a separate Zarr array and updated using Zarr. In the future:

  • They could be read from whichever format chunk manifest is stored in (of which Zarr itself is one option)
  • Their updating & consistency would then be completely taken care of by the storage transformer for the chunk manifest

2. API to query this information

To be easily queryable, the chunk-level information must be exposed back in the coordinates of the original dataset, together with an API for querying it.

Xarray is our choice for this API. For example, asking the question "do we have data for this geometry" simply becomes a rioxarray clip operation followed by an xarray reduction.
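
A minimal sketch of that query, assuming the meta-array written above carries the grid's CRS (the variable name and example geometry are illustrative):

import rioxarray  # noqa: F401 -- registers the .rio accessor
import xarray as xr
from shapely.geometry import box

# Open the boolean MetaArray and attach the grid's CRS
meta = xr.open_zarr('schema-meta.zarr', zarr_version=3)['initialized']
meta = meta.rio.write_crs('epsg:3577')

# "Do we have data for this geometry?": clip to the geometry, then reduce
aoi = box(-1_500_000, -4_000_000, -1_000_000, -3_500_000)  # example geometry
clipped = meta.rio.clip([aoi], crs='epsg:3577')

any_populated = bool(clipped.any().compute())
fully_populated = bool(clipped.all().compute())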

A similar conclusion was reached by VirtualiZarr, where using xarray exposes a familiar API, e.g. for combining multiple datasets.

MetaArrays could become another package building on top of Zarr, similar to VirtualiZarr. However, if the chunk manifest already contains the information needed to construct MetaArrays, and VirtualiZarr uses this information to construct Xarray(-like) datasets, greater integration could be possible, for example in the form of common functionality to read the chunk manifest into an xarray representation.

Note: We have further APIs that utilize Pydantic and GeoPandas for further processing of chunk information. For simplicity, I am focusing only on the Xarray representation here.

Questions

I would appreciate suggestions from the Zarr community on where to best focus the efforts.

Here is a set of concrete open questions and discussion points:

  • Does the chunk manifest storage transformer cover the data needed by MetaArrays? Could the chunk manifest also represent information about chunk initialization? Or should this information be another storage transformer that shares implementation with the chunk manifest?
  • What is the potential relationship to VirtualiZarr? Could they share functionality, or could a single package cover both uses? Perhaps there's a common functionality for going from the manifest into an xarray object?
  • Have I missed any potential difficulties here?
  • Is there a different path to the one that I outlined?

A secondary set of questions concerns usefulness: would the community find MetaArrays useful, be it for storing chunk-initialization information or aggregate statistics? Are people perhaps already using variants of this functionality? Creating this functionality was necessary for us, and thus likely for others, too!

Thank you for any suggestions and answers!
