Comparison to VirtualiZarr #25

@TomNicholas

Hi, I develop VirtualiZarr, a package that seems to have very similar goals to dataplug. I found out about this library via your SciPy talk announcement - I also have a SciPy talk about VirtualiZarr!

VirtualiZarr was developed to replace the Kerchunk package, to serve as a way to make non-cloud-optimized file formats accessible in a cloud-optimized manner through the Zarr API. It's based on the idea that most binary chunked array file formats can be mapped to the Zarr data model. Both Kerchunk and Zarr are mentioned in your talk abstract.

Our packages seem very similar. As far as I can tell, they both:

  1. Target non-cloud-optimized data sitting in object storage
  2. Pre-process that data to extract metadata and chunk / partition references
  3. Can perform that pre-processing at scale in parallel (VirtualiZarr can use dask or lithops or a general parallel executor, though this functionality hasn't been released yet)
  4. Persist those chunk / partition references to storage in new objects
  5. Allow cloud-optimized parallel access to the original data by having the data user hit the serialized chunk / partition references first.
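To make the shared workflow above concrete, here is a minimal sketch of the chunk-reference idea using only the Python standard library. It is not the API of VirtualiZarr, dataplug, or Kerchunk: `ChunkRef`, `read_chunk`, and the toy file layout are hypothetical names and assumptions chosen for illustration, and a local file stands in for an object in cloud storage.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical reference to one chunk: where its bytes live in the original file."""
    path: str
    offset: int
    length: int


def read_chunk(ref: ChunkRef) -> bytes:
    """Fetch only the referenced byte range (a local stand-in for a ranged GET)."""
    with open(ref.path, "rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)


# Toy "legacy" file: a 4-byte header followed by two 8-byte chunks.
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".bin")
tmp.write(b"HEAD" + b"A" * 8 + b"B" * 8)
tmp.close()

# Pre-processing step: scan the file once and record where each chunk lives.
manifest = {
    "0": ChunkRef(tmp.name, offset=4, length=8),
    "1": ChunkRef(tmp.name, offset=12, length=8),
}

# Access step: consult the manifest first, then read only the bytes needed.
second_chunk = read_chunk(manifest["1"])
os.unlink(tmp.name)
```

In a real deployment the manifest would be serialized to its own object (step 4) and the reads would be byte-range requests against object storage rather than `seek` calls on a local file.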

Some possible differences:

  • VirtualiZarr is targeting array data, and maps all data to the Zarr data model (a hierarchy of multidimensional chunked arrays). It seems dataplug is more general but produces less structured output (like Kerchunk could do in theory)?
  • VirtualiZarr can assemble references from a large number of files into one big cloud-optimized "virtual datacube". It's unclear to me if dataplug tries to do that.
  • Dataplug puts a lot of emphasis on re-partitioning. If I understand correctly, this is analogous to the idea in VirtualiZarr of slicing into uncompressed chunks (which I haven't implemented yet).
  • VirtualiZarr can write to multiple formats, specifically Kerchunk and Icechunk.
  • VirtualiZarr works with Icechunk, which can be thought of as a serverless transactional database of chunk references.
  • Each library currently understands how to parse some file formats that the other doesn't. See "Listing every format that could be represented as virtual zarr" (zarr-developers/VirtualiZarr#218).
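The "virtual datacube" point can be illustrated with a rough sketch of combining per-file references into one larger index. This is a simplification under stated assumptions, not VirtualiZarr's implementation: `concat_manifests` is a hypothetical helper, each manifest is assumed to be one-dimensional with chunk indices running contiguously from 0, and the `s3://` paths and byte ranges are made up.

```python
def concat_manifests(manifests):
    """Concatenate per-file 1-D chunk manifests along their only axis.

    Each manifest maps a chunk index (int, contiguous from 0) to a
    (path, offset, length) tuple. Indices from later files are shifted
    past those of earlier files; no data bytes are read or copied.
    """
    combined = {}
    shift = 0
    for manifest in manifests:
        for index in sorted(manifest):
            combined[shift + index] = manifest[index]
        shift += len(manifest)
    return combined


# Hypothetical references extracted from two separate files.
file_a = {0: ("s3://bucket/a.nc", 100, 50), 1: ("s3://bucket/a.nc", 150, 50)}
file_b = {0: ("s3://bucket/b.nc", 100, 50)}

# One combined index spanning both files.
datacube = concat_manifests([file_a, file_b])
```

The combined mapping only rearranges references, which is why many files can be presented as a single array without rewriting the original data.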

I'm curious if my assessment is correct, and if so whether there is any opportunity to join forces 😀
