tl;dr: I think we should add in-memory arrays to `ManifestStore`, probably by generalizing the wrapped type to `ManifestArray | np.ndarray`.
## Context

This library has evolved from managing only "virtual" xarray datasets into a design where `ManifestStore` really should be the primary entity, with xarray as an optional dependency. Icechunk and Kerchunk are already optional dependencies, and `ManifestStore` should be able to stand alone as independently useful.

However, `ManifestStore` is currently lacking a way to represent "inlined" chunks. Comparing different representations of virtual/non-virtual chunks, this absence becomes conspicuous:
|  | `ManifestStore` | "virtual" `xarray.Dataset` | Icechunk format | Kerchunk format |
|---|---|---|---|---|
| external references | `ManifestArray` | `xr.Variable` wrapping `ManifestArray` (a so-called "virtual variable") | virtual chunk ref | kerchunk reference stored in JSON |
| "inlined" references | ❌ | `xr.Variable` wrapping (lazy) `np.ndarray` (a so-called "loadable variable") | native chunk ref (also potentially inlined into manifests, but this is an optimization that's an internal implementation detail) | encoded bytes stored in JSON |
This causes multiple problems:

- We can't roundtrip `kerchunk -> ManifestStore -> virtual xarray.Dataset -> kerchunk` if the original kerchunk file contains inlined references. (Error reading inlined reference data when trying to roundtrip virtual dataset #489)
- Similarly, it limits the utility of the proposed `.to_icechunk`/`.to_kerchunk` methods on `ManifestStore`, as they can't handle all the cases that `ds.vz.to_icechunk/kerchunk` can (Add `.to_icechunk()` method to `ManifestGroup` #591). Without this, xarray can't really be an optional dependency.
- It tempts us to hack around this using a memory store (see Support inlined Kerchunk data using obstore MemoryStore? #636), but I think this is the wrong design...
## Proposal: Add `numpy.ndarray` as a wrapped type in `ManifestStore`

Currently `ManifestStore` holds `ManifestGroup`s, which have `._members: Mapping[str, "ManifestArray | ManifestGroup"]`. If we generalize this to `._members: Mapping[str, "np.ndarray | ManifestArray | ManifestGroup"]`, we can store inlined chunks in memory in the `ManifestStore` as numpy arrays.
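For concreteness, a minimal sketch of the generalization (the attribute name and typing details here are illustrative, not a final API):

```python
from typing import Mapping, Union

import numpy as np


class ManifestGroup:
    # before: Mapping[str, "ManifestArray | ManifestGroup"]
    # after:  np.ndarray members hold "inlined" chunk data directly in memory
    _members: Mapping[str, Union[np.ndarray, "ManifestArray", "ManifestGroup"]]

    def __init__(self, members=None):
        self._members = dict(members or {})
```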
We would need to ensure that `xr.open_dataset(ManifestStore)` could still read these arrays, which might require something clever during `ManifestStore.get`.
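For example, `get` could dispatch on the member's type - roughly like the sketch below, which simplifies key parsing, ignores metadata documents, returns raw bytes rather than zarr buffer objects, and assumes a single uncompressed C-ordered chunk:

```python
from typing import Mapping, Optional

import numpy as np


async def get(members: Mapping[str, object], key: str) -> Optional[bytes]:
    # Simplified key parsing: real zarr keys also include metadata documents
    # (zarr.json) and chunk-key prefixes/separators.
    array_name, _, chunk_key = key.partition("/")
    member = members.get(array_name)
    if member is None:
        return None

    if isinstance(member, np.ndarray):
        # Inlined data: serve the bytes straight from memory
        # (real code would run zarr's codec pipeline).
        return member.tobytes()

    # Virtual data: existing code path - look up the chunk ref in the manifest
    # and fetch that byte range from object storage via the ObjectStoreRegistry.
    ...
```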
`vz.open_virtual_dataset` (really just a backwards-compatibility wrapper around `ManifestStore.to_virtual_dataset()`) also needs to know that any array stored as a numpy array is to become a loadable variable.
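A hedged sketch of what that dispatch might look like inside `to_virtual_dataset()` (the dimension-name handling here is placeholder logic, not the library's actual behaviour):

```python
import numpy as np
import xarray as xr


def to_virtual_dataset(members) -> xr.Dataset:
    variables = {}
    for name, member in members.items():
        if isinstance(member, np.ndarray):
            # In-memory member -> "loadable variable": an xr.Variable wrapping the array.
            dims = tuple(f"{name}_dim{i}" for i in range(member.ndim))  # placeholder dims
            variables[name] = xr.Variable(dims=dims, data=member)
        else:
            # ManifestArray member -> "virtual variable": an xr.Variable wrapping the
            # ManifestArray itself (dims would really come from its zarr metadata).
            variables[name] = xr.Variable(dims=member.metadata.dimension_names, data=member)
    return xr.Dataset(variables)
```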
This also means parsers would now effectively have the power to decide to create "loadable variables". This actually makes some sense - `open_virtual_dataset` has a problem in that its `loadable_variables` kwarg mixes concerns: sometimes you want the variable to be loaded so that it can be used for xarray indexing, and sometimes you want it to be loaded as a storage optimization when you finally serialize the references out to kerchunk/icechunk. It's weird that we have one knob to control two things.
Note that storing as a numpy array directly means no zarr metadata for that variable. But I think that's fine - we don't store zarr metadata for loadable variables inside "virtual" xarray Datasets anyway.
## Alternative considered (but I vote against this): Use `obstore.MemoryStore`

In #636 and #794 it was suggested that we store inlined chunks in the `ManifestStore` by adding an `obstore.MemoryStore` to the `ObjectStoreRegistry` and referencing the bytes via virtual chunk references of the form `memory://{location}`. This is neat in that you can read the bytes back directly using obstore and zarr-python.
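For comparison, a rough sketch of that alternative (the `ObjectStoreRegistry` wiring and the `obstore.put` call are assumptions for illustration, not confirmed usage):

```python
from obstore.store import MemoryStore

memory_store = MemoryStore()

# The parser would write the inlined bytes into the in-memory store, e.g.:
# obstore.put(memory_store, "kerchunk/temperature/0.0", inlined_bytes)

# ...register that store under a memory:// prefix, alongside the "real" object stores:
# registry = ObjectStoreRegistry({"memory://": memory_store, "s3://my-bucket": s3_store})

# ...and then point an ordinary virtual chunk reference at it:
chunk_ref = {"path": "memory://kerchunk/temperature/0.0", "offset": 0, "length": 123}
```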
The reason I don't think this is a good idea is that it breaks the abstraction of the store. Our current design has the very neat property that the stores in the `ObjectStoreRegistry` can be any `obstore.Store`, and we treat all of them identically, no matter their contents. But we would instead be special-casing `obstore.MemoryStore` compared to other obstore store types. We also want to be able to treat all icechunk stores identically (note Icechunk has `in_memory_storage` and supports virtual references beginning with `memory://` too). Special-casing `memory://`, or some namespace within `memory://`, breaks this abstraction.
Some other problems special-casing `MemoryStore` would cause:

- When calling `ManifestStore.to_icechunk` on an icechunk store that supports virtual references pointing at in-memory locations, how would we confidently distinguish a `memory://` reference that represents an inlined chunk from a `memory://` reference that represents a virtual chunk?
- How would we be sure to avoid name collisions inside the `memory://` namespace? Support in-lined Kerchunk references #794 does this by having the kerchunk parser write inlined data into `"memory://kerchunk/"`, but this seems like a leaky abstraction. That idea also means that the `ManifestStore` contents depend on the parser they were created by, which is also leaky.