Add in-memory arrays to ManifestStore #799

@TomNicholas

Description

tl;dr: I think we should add in-memory arrays to ManifestStore, probably by generalizing the wrapped type to ManifestArray | np.ndarray.

Context

This library has evolved from managing only "virtual" xarray datasets into a design where ManifestStore really should be the primary entity, with xarray as an optional dependency. Icechunk and Kerchunk are already optional dependencies, and ManifestStore should be able to stand alone as independently useful.

However, ManifestStore is currently lacking a way to represent "inlined" chunks. Comparing different representations of virtual/non-virtual chunks, this absence becomes conspicuous:

|  | ManifestStore | "virtual" `xarray.Dataset` | Icechunk format | Kerchunk format |
| --- | --- | --- | --- | --- |
| external references | `ManifestArray` | `xr.Variable` wrapping a `ManifestArray` (a so-called "virtual variable") | virtual chunk ref | kerchunk reference stored in JSON |
| "inlined" references | **(missing)** | `xr.Variable` wrapping a (lazy) `np.ndarray` (a so-called "loadable variable") | native chunk ref (also potentially inlined into manifests, but that's an internal optimization) | encoded bytes stored in JSON |

This gap causes multiple problems.

Proposal: Add numpy.ndarray as a wrapped type in ManifestStore

Currently ManifestStore holds ManifestGroups, which have ._members: Mapping[str, "ManifestArray | ManifestGroup"]. If we generalize this to ._members: Mapping[str, "np.ndarray | ManifestArray | ManifestGroup"] we can store inlined chunks in memory in the ManifestStore as numpy arrays.
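A minimal sketch of that generalization, using stand-in classes rather than the real virtualizarr types:

```python
from typing import Mapping, Union

import numpy as np


class ManifestArray:  # stand-in for virtualizarr.manifests.ManifestArray
    pass


class ManifestGroup:
    # Before: _members: Mapping[str, "ManifestArray | ManifestGroup"]
    # After: plain numpy arrays are also allowed, representing inlined chunks
    def __init__(
        self,
        members: Mapping[str, Union[np.ndarray, ManifestArray, "ManifestGroup"]],
    ):
        self._members = members


group = ManifestGroup({
    "temperature": ManifestArray(),  # virtual: chunks live in object storage
    "time": np.arange(4),            # inlined: chunk data held in memory
})
```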

We would need to ensure that xr.open_dataset(ManifestStore) could still read these arrays, which might require something clever during ManifestStore.get.
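One hypothetical shape that cleverness could take (a sketch only: the real `ManifestStore.get` works with zarr `Buffer` objects and full chunk keys, and this helper name is invented for illustration):

```python
import numpy as np


def get_chunk_bytes(member, chunk_key: str):
    """Hypothetical helper for ManifestStore.get: if the requested member is
    an in-memory numpy array, serve its raw bytes directly rather than
    resolving a virtual chunk reference against object storage."""
    if isinstance(member, np.ndarray):
        # Treat the whole array as a single C-contiguous chunk.
        return np.ascontiguousarray(member).tobytes()
    return None  # fall through to the existing virtual-chunk lookup


raw = get_chunk_bytes(np.array([1, 2, 3], dtype="int32"), "time/c/0")
```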

vz.open_virtual_dataset (really just a backwards-compatibility wrapper around ManifestStore.to_virtual_dataset()) also needs to know that any member stored as a numpy array should become a loadable variable.
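The rule is simple to state; a hedged sketch of the classification (the helper name is hypothetical, and `ManifestArray` is a stand-in for the real virtualizarr type):

```python
import numpy as np


class ManifestArray:  # minimal stand-in for virtualizarr's ManifestArray
    pass


def variable_kind(member) -> str:
    """Hypothetical classification used when building the xarray.Dataset:
    numpy-backed members become loadable variables, while ManifestArray-backed
    members stay virtual."""
    return "loadable" if isinstance(member, np.ndarray) else "virtual"


kinds = {
    "time": variable_kind(np.arange(4)),
    "temperature": variable_kind(ManifestArray()),
}
```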

This also means parsers would effectively gain the power to create "loadable variables". That actually makes some sense: open_virtual_dataset's loadable_variables kwarg currently mixes concerns. Sometimes you want a variable loaded so it can be used for xarray indexing; sometimes you want it loaded as a storage optimization for when you finally serialize the references out to kerchunk/icechunk. It's weird that we have one knob controlling two things.

Note that storing as a numpy array directly means no zarr metadata for that variable. But I think that's fine - we don't store zarr metadata for loadable variables inside "virtual" xarray Datasets anyway.

Alternative considered (but I vote against this): Use obstore.MemoryStore

In #636 and #794 it was suggested that we store inlined chunks in the ManifestStore by adding obstore.MemoryStore to the ObjectStoreRegistry and referencing the bytes via virtual chunk references of the form memory://{location}. This is neat in that you can read the bytes back directly using obstore and zarr-python.

The reason I don't think this is a good idea is that it breaks the store abstraction. Our current design has the very neat property that the ObjectStoreRegistry can hold any obstore.Store, and we treat them all identically regardless of their contents. We would instead be special-casing obstore.MemoryStore relative to every other obstore Store type. We also want to be able to treat all icechunk stores identically (note that Icechunk has in_memory_storage and supports virtual references beginning with memory:// too). Special-casing memory://, or some namespace within it, breaks this abstraction.

Some other problems special-casing MemoryStore would cause:

  • When calling ManifestStore.to_icechunk on an Icechunk store that supports virtual references pointing at in-memory locations, how would we confidently distinguish a memory:// reference representing an inlined chunk from one representing a virtual chunk?
  • How would we be sure to avoid name collisions inside the memory:// namespace? Support in-lined Kerchunk references #794 does this by having the kerchunk parser write inlined data into "memory://kerchunk/", but this seems like a leaky abstraction. That idea also means that the ManifestStore contents depend on the parser they were created by, which is also leaky.

cc @maxrjones @sharkinsspatial @ilan-gold @paraseba
