I believe there is a new usage pattern of Zarr that should be documented and more widely known – populating Zarr arrays incrementally and on-demand, chunk-by-chunk.
We have built internal tooling around this that resembles the chunk manifests (#154, #287, virtualizarr#chunk-manifests) and VirtualiZarr (https://github.com/TomNicholas/VirtualiZarr/issues/33, virtualizarr#manifestarray).
This issue explains the usage pattern and opens the question of how our tooling could be open-sourced.
Incrementally populating Zarr arrays
A core Zarr use case for us is to initialize an empty Zarr array with a pre-determined grid and then fill it with data as needed.
For a great overview of this use case, see Earthmover's recent webinar and the relevant code repository. Note that in the webinar, the entire grid is populated; in our case, parts of the grid are populated on demand.
Code example
This code defines an empty Zarr array given an example schema. For a fuller example, see create_dataset_schema in Earthmover's repository.
```python
from odc.geo.geobox import GeoBox
from odc.geo.xr import xr_zeros
import pandas as pd


def define_schema(varname, geobox, time, band):
    schema = (
        # Chunk with -1 to get a Dask-backed array,
        # but without creating a massive computational graph
        xr_zeros(geobox, chunks=-1)
        .expand_dims(
            time=time,
            band=band,
        )
        .to_dataset(name=varname)
    )
    schema['band'] = schema['band'].astype(str)
    encoding = {varname: {"chunks": (1, 1, 256, 256)}}
    schema.to_zarr(
        'schema.zarr',
        encoding=encoding,
        compute=False,  # Do not initialize any data: no chunks written
        zarr_version=3,
        mode='w',
    )


geobox = GeoBox.from_bbox(
    (-2_000_000, -5_000_000, 2_250_000, -1_000_000),
    "epsg:3577",
    resolution=1000,
)
define_schema(
    "data",
    geobox,
    time=pd.date_range("2020-01-01", periods=12, freq='MS'),
    band=["B1", "B2", "B3"],
)
```
This pattern is powerful, as it enables us to compute and ingest only what's needed, in a query-or-compute fashion. Without going too far beyond the scope of this issue, it enables us to cascade the computation and ingestion of data chunk by chunk, as illustrated in the following diagram.
Note: The details of this example are made up for the purposes of this issue.
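For concreteness, here is a minimal sketch of what populating a single chunk on demand can look like, using xarray's region writes. The region indices are made up, and this is not our actual ingestion code:
```python
import xarray as xr

# The encoded chunk size is (1, 1, 256, 256), so this region
# covers exactly one chunk of the schema defined above
region = dict(
    time=slice(0, 1), band=slice(0, 1),
    y=slice(0, 256), x=slice(0, 256),
)

ds = xr.open_zarr('schema.zarr', zarr_version=3)
tile = ds[['data']].isel(region).load()
tile['data'][:] = 1.0  # stand-in for the real computation

# odc-geo attaches a dimensionless `spatial_ref` coordinate; region writes
# require every variable to overlap the region dims, so drop it first
tile = tile.drop_vars('spatial_ref', errors='ignore')

# Write just this chunk back; no other chunks are touched
tile.to_zarr('schema.zarr', region=region)
```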
Chunk-level MetaArray
When working with sparsely populated arrays, the natural question arises: which chunks are populated?
To this end, we employ an object we call the chunk-level meta-array (MetaArray for short). This is a second array where each pixel corresponds to a chunk in the original array. The value of the pixel is:
- 1 if the corresponding chunk is populated
- 0 if the corresponding chunk is empty
Note: Chunk initialization is not the only information one can keep in such a MetaArray. For example, the MetaArray could track the min & max of each chunk, enabling predicate pushdown, or aggregate statistics for providing approximate answers at the chunk level.
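As a quick sketch of the predicate-pushdown idea (everything here is hypothetical, not our implementation): a query such as data > threshold only needs to read the chunks whose stored maximum exceeds the threshold.
```python
import numpy as np

# Hypothetical per-chunk statistics: chunk_max[t, b, iy, ix] holds the
# maximum value within that chunk (shape matches the example schema above)
chunk_max = np.random.rand(12, 3, 16, 17)

threshold = 0.5
# Only these chunks can possibly satisfy `data > threshold`; all others
# can be skipped without reading any chunk data
candidate_chunks = np.argwhere(chunk_max > threshold)
print(f"need to read {len(candidate_chunks)} of {chunk_max.size} chunks")
```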
Toy code example
Note: This is only a very rough sketch of the resulting object to illustrate the point.
```python
import numpy as np
import xarray as xr


def create_metaarray(schema):
    """Create a meta-array from a schema.

    A very rough sketch. Not the actual implementation.
    """
    _, _, cy, cx = schema.data.encoding['chunks']
    # Create spatial pixels with coordinates equal to the centre of each chunk
    xx = schema['x'].values
    xdiff = xx[cx] - xx[0]
    x_meta = xx[::cx] + xdiff / 2
    yy = schema['y'].values
    ydiff = yy[cy] - yy[0]
    y_meta = yy[::cy] + ydiff / 2
    # Initialize an all-zeros meta-array with one pixel per chunk
    meta = xr.DataArray(
        data=np.zeros(
            (schema.sizes['time'], schema.sizes['band'], len(y_meta), len(x_meta)),
            dtype='uint8',
        ),
        dims=('time', 'band', 'y', 'x'),
        coords=dict(time=schema.time, band=schema.band, y=y_meta, x=x_meta),
        name='populated',
    )
    meta.to_zarr('schema-meta.zarr', zarr_version=3, mode='w')


schema = xr.open_zarr('schema.zarr', zarr_version=3)
create_metaarray(schema)
# When writing a chunk to the Zarr array, also populate the MetaArray
```
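To make that last comment concrete, here is a rough sketch (again, not our actual implementation) of flipping the MetaArray pixel after a chunk write, using the zarr library directly and assuming the variable name populated from the sketch above:
```python
import zarr


def mark_chunk_populated(t, b, iy, ix):
    """Flip the MetaArray pixel for chunk index (t, b, iy, ix) to 1."""
    # zarr_version=3 to match the store written above
    root = zarr.open_group('schema-meta.zarr', mode='r+', zarr_version=3)
    root['populated'][t, b, iy, ix] = 1


# e.g. after writing the chunk covering time 0, band 0, spatial tile (0, 0):
mark_chunk_populated(0, 0, 0, 0)
```
In practice, the data write and the MetaArray update need to be kept consistent with each other, which is exactly the kind of bookkeeping a storage transformer could take over (see below).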
Relationship to Chunk manifest & VirtualiZarr
Increasingly, the work in the Zarr world resembles the data structure we have built.
- The chunk manifest keeps a data structure containing information about each chunk (Beyond consolidated metadata for V3: inspiration from Apache Iceberg #154, Manifest storage transformer #287).
- VirtualiZarr, the key piece of the upstreaming of kerchunk, will likely store this information in NumPy arrays (https://github.com/TomNicholas/VirtualiZarr/issues/33) and has even considered xarray (https://github.com/TomNicholas/VirtualiZarr/issues/33#issuecomment-2000414759, https://github.com/TomNicholas/VirtualiZarr/pull/107), while also tackling the question of initialized chunks (https://github.com/TomNicholas/VirtualiZarr/issues/33#issuecomment-2000529283, https://github.com/TomNicholas/VirtualiZarr/issues/32, https://github.com/TomNicholas/VirtualiZarr/issues/16).
- Finally, ZEP 5 (Extension proposal for zarr accumulation #205) proposes a way to store chunk-level statistics, one of the potential use cases of MetaArrays.
The similarities in these proposals and implementations make me think there is a path for MetaArrays to become an open-source part of the Zarr ecosystem, potentially even enabling further use cases such as the chunk-level aggregate statistics outlined in ZEP 5.
The question is what the exact path forward would be.
Potential path forward
Here is my best guess at what the path forward could be.
There are essentially two layers to MetaArrays:
- Information on the chunk level in the index space, stored on disk. This could be taken over by the chunk manifest.
- Convenient in-memory representation & an API to query this information in the label space. This would remain the domain of MetaArrays.
Note: Here I'm using the "index" and "label" space terminology from Xarray.
1. Information on the chunk level
In the future, layer 1 could become part of Zarr itself. The chunk manifest would only need to record whether each chunk is initialized in order to contain the full information needed to construct a MetaArray.
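To illustrate: the mere presence of a chunk key in the manifest is enough to derive the 0/1 MetaArray values. The manifest layout below is a guess based on the formats sketched in #154 and the VirtualiZarr issues (path/offset/length per chunk key):
```python
import numpy as np

# Hypothetical chunk-manifest entries; the exact format is still being
# designed in the linked issues, so the details here are assumptions
manifest = {
    "0.0.0.0": {"path": "s3://bucket/data", "offset": 0, "length": 131072},
    "0.2.1.3": {"path": "s3://bucket/data", "offset": 131072, "length": 131072},
}


def initialized_mask(manifest, shape_in_chunks):
    """Derive the 0/1 MetaArray values from which chunk keys exist."""
    mask = np.zeros(shape_in_chunks, dtype='uint8')
    for key in manifest:
        mask[tuple(int(i) for i in key.split('.'))] = 1
    return mask


# Shape in chunks of the example schema above: 12 time steps, 3 bands,
# and a 16 x 17 grid of 256-pixel spatial tiles
mask = initialized_mask(manifest, shape_in_chunks=(12, 3, 16, 17))
```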
Currently, MetaArrays are stored as a separate Zarr array and updated using the zarr library. In the future:
- They could be read from whichever format the chunk manifest is stored in (of which Zarr itself is one option)
- Their updating & consistency would then be taken care of entirely by the storage transformer for the chunk manifest
2. API to query this information
To be easily queryable, the chunk-level information must be exposed back in the coordinates of the original dataset, together with an API for querying it.
Xarray is our choice for this API. For example, asking the question "do we have data for this geometry?" simply becomes a rioxarray clip operation followed by an xarray reduction, as sketched below.
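A minimal sketch of such a query, assuming the MetaArray variable is named populated as in the sketch above and that it carries the array's CRS (the geometry here is made up):
```python
import rioxarray  # noqa: F401  (registers the .rio accessor)
import xarray as xr
from shapely.geometry import box

# A hypothetical area of interest in the array's CRS (epsg:3577)
geometry = box(-1_500_000, -4_500_000, -1_000_000, -4_000_000)

meta = xr.open_zarr('schema-meta.zarr', zarr_version=3)['populated']
meta = meta.rio.write_crs('epsg:3577')

# Keep only the MetaArray pixels (i.e. chunks) touching the geometry
clipped = meta.rio.clip([geometry], drop=True)

# "Do we have all data for this geometry?" is now a plain reduction
fully_populated = bool((clipped == 1).all())
# Or per (time, band): which slices have any populated chunks at all?
any_populated = (clipped == 1).any(dim=['y', 'x'])
```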
A similar conclusion was reached by VirtualiZarr, where the use of xarray e.g. exposes a familiar API for combining multiple datasets.
MetaArrays could become another package building on top of Zarr, similar to VirtualiZarr. However, if the chunk manifest already contains the information needed to construct MetaArrays, and VirtualiZarr uses this information to construct Xarray(-like) datasets, greater integration could be possible, for example in the form of common functionality to read the chunk manifest into an xarray representation.
Note: We have further APIs that utilize Pydantic and GeoPandas for further processing of chunk information. For simplicity, I am focusing only on the Xarray representation here.
Questions
I would appreciate suggestions from the Zarr community on where to best focus the efforts.
Here is a set of concrete open questions and discussion points:
- Does the chunk manifest storage transformer cover the data needed by MetaArrays? Could the chunk manifest also represent information about chunk initialization? Or should this information live in a separate storage transformer that shares its implementation with the chunk manifest?
- What is the potential relationship to VirtualiZarr? Could they share functionality, or could a single package cover both uses? Perhaps there is common functionality for going from the manifest to an xarray object?
- Have I missed any potential difficulties here?
- Is there a different path to the one that I outlined?
A secondary set of questions is around the usefulness of all this: would the community find MetaArrays useful, be it for storing chunk-initialization information or aggregate statistics? Are people perhaps already using variants of this functionality? Creating this functionality was necessary for us – and thus likely for others, too!
Thank you for any suggestions and answers!



