Comparison to VirtualiZarr

Hi, I develop [VirtualiZarr](https://github.com/zarr-developers/VirtualiZarr/tree/develop), a package that seems to have very similar goals to dataplug. I found out about this library via your [SciPy talk announcement](https://cfp.scipy.org/scipy2025/talk/K98LXU/) - I also have a [SciPy talk about VirtualiZarr!](https://cfp.scipy.org/scipy2025/talk/JBNR9A/)

VirtualiZarr was developed to replace the Kerchunk package as a way to make non-cloud-optimized file formats accessible in a cloud-optimized manner through the Zarr API. It's based on the idea that most binary chunked array file formats can be mapped to the Zarr data model. Both Kerchunk and Zarr are mentioned in your talk abstract.

Our packages seem very similar. As far as I can tell, they both:
1. Target non-cloud-optimized data sat in object storage
2. Pre-process that data to extract metadata and chunk / partition references
3. Can perform that pre-processing at scale in parallel (VirtualiZarr can use dask or lithops or a general parallel executor, though this functionality hasn't been released yet)
4. Persist those chunk / partition references to storage in new objects
5. Allow cloud-optimized parallel access to the original data by having the data user hit the serialized chunk / partition references first.
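
To make the shared workflow above concrete, here is a minimal sketch of the "chunk reference" idea. It is not the API of either VirtualiZarr or dataplug: the function names `build_references` and `read_chunk`, the fixed-size chunking, and the JSON layout are all assumptions made for illustration, and a local file stands in for an object in object storage (where the read would be a ranged GET rather than `seek` + `read`).

```python
import json
import tempfile
from pathlib import Path


def build_references(path, chunk_size):
    """Pre-processing step: record where each chunk lives in the original file.

    Hypothetical helper. Real formats need a format-specific parser to find
    chunk boundaries; fixed-size chunks are assumed here for simplicity.
    """
    size = Path(path).stat().st_size
    refs = {}
    for index, offset in enumerate(range(0, size, chunk_size)):
        refs[str(index)] = {
            "path": str(path),
            "offset": offset,
            "length": min(chunk_size, size - offset),
        }
    return refs


def read_chunk(refs, key):
    """Access step: look up the reference first, then read only that byte range."""
    ref = refs[key]
    with open(ref["path"], "rb") as f:
        f.seek(ref["offset"])
        return f.read(ref["length"])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_path = Path(tmp) / "data.bin"
        data_path.write_bytes(bytes(range(10)))

        # Persist the references as a new object next to the original data.
        manifest_path = Path(tmp) / "refs.json"
        manifest_path.write_text(json.dumps(build_references(data_path, 4)))

        # A data user loads the references, then fetches a single chunk.
        refs = json.loads(manifest_path.read_text())
        print(len(refs), read_chunk(refs, "2"))
```

The original file is never rewritten; only the small reference manifest is new, which is the property both packages appear to rely on.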

Some possible differences:
- VirtualiZarr is targeting array data, and maps all data to the Zarr data model (of a hierarchy of multidimensional chunked arrays). It seems dataplug is more general but produces less structured output (like Kerchunk could do in theory)?
- VirtualiZarr can assemble references from a large number of files into one big cloud-optimized "virtual datacube". It's unclear to me if dataplug tries to do that.
- Dataplug puts a lot of emphasis on re-partitioning. If I understand correctly this is analogous to [this idea in VirtualiZarr of slicing into uncompressed chunks](https://github.com/zarr-developers/VirtualiZarr/issues/86) (which I haven't implemented yet).
- VirtualiZarr can write to multiple formats, specifically Kerchunk and Icechunk.
- VirtualiZarr works with [Icechunk](https://icechunk.io/en/latest/), which can be thought of as a serverless transactional database of chunk references.
- Each library currently understands how to parse some file formats that the other doesn't. See https://github.com/zarr-developers/VirtualiZarr/issues/218.
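
As a rough illustration of the re-partitioning point above (slicing into uncompressed chunks), the sketch below splits one byte-range reference into smaller ones without touching the data. It is only an assumption about how this could work, not code from either library: `split_uncompressed` and the `path` / `offset` / `length` reference layout are hypothetical, and the approach only makes sense when the chunk is stored uncompressed and is split along its slowest-varying axis, so that each part is a contiguous byte range.

```python
def split_uncompressed(ref, itemsize, elems_per_part):
    """Split one uncompressed chunk reference into smaller contiguous references.

    ref            -- dict with "path", "offset", "length" (bytes); hypothetical layout
    itemsize       -- size of one array element in bytes
    elems_per_part -- number of elements wanted in each new sub-chunk
    """
    if itemsize <= 0 or elems_per_part <= 0:
        raise ValueError("itemsize and elems_per_part must be positive")
    if ref["length"] % itemsize != 0:
        raise ValueError("chunk length is not a whole number of elements")

    step = itemsize * elems_per_part
    parts = []
    for start in range(0, ref["length"], step):
        parts.append(
            {
                "path": ref["path"],
                "offset": ref["offset"] + start,
                # The last part may be shorter than the others.
                "length": min(step, ref["length"] - start),
            }
        )
    return parts


if __name__ == "__main__":
    chunk = {"path": "a.bin", "offset": 100, "length": 40}
    for part in split_uncompressed(chunk, itemsize=4, elems_per_part=3):
        print(part)
```

Because only offsets and lengths change, the new partitioning can be stored as another set of references over the same original object.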

I'm curious if my assessment is correct, and if so whether there is any opportunity to join forces 😀 
