Create virtual Zarr stores for cloud-friendly access to archival data, using familiar xarray syntax.
The best way to distribute large scientific datasets is via the Cloud, in Cloud-Optimized formats[^1]. But often this data is stuck in archival pre-Cloud file formats such as netCDF.
VirtualiZarr[^2] makes it easy to create "virtual" Zarr stores, allowing performant access to archival data as if it were in the Cloud-Optimized Zarr format, without duplicating any data.
"Virtualized data" solves an incredibly important problem: accessing big archival datasets via a cloud-optimized pattern, but without copying or modifying the original data in any way. This is a win-win-win for users, data engineers, and data providers. Users see fast-opening zarr-compliant stores that work performantly with libraries like xarray and dask, data engineers can provide this speed by adding a lightweight virtualization layer on top of existing data (without having to ask anyone's permission), and data providers don't have to change anything about their archival files for them to be used in a cloud-optimized way.
VirtualiZarr aims to make the creation of cloud-optimized virtualized zarr data from existing scientific data as easy as possible.
- Create virtual references pointing to byte ranges inside an archival file with `open_virtual_dataset`,
- Supports a range of archival file formats, including netCDF4 and HDF5,
- Combine data from multiple files into one larger store using xarray's combining functions, such as `xarray.concat`,
- Commit the virtual references to storage using either the Kerchunk references specification or the Icechunk transactional storage engine,
- Users access the virtual dataset using `xarray.open_dataset`.
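The workflow above rests on one idea: a virtual store is essentially a lookup table mapping each chunk key to a byte range inside the original file. A toy, pure-Python sketch of that idea (hypothetical structure and numbers, not VirtualiZarr's actual classes):

```python
# Toy illustration of a chunk manifest: each chunk key maps to the byte
# range holding that chunk inside the original archival file. All names
# and numbers are hypothetical.
manifest = {
    "tasmax/0.0.0": {"url": "s3://bucket/file.nc", "offset": 8000, "length": 48000},
    "tasmax/0.0.1": {"url": "s3://bucket/file.nc", "offset": 56000, "length": 48000},
}

def locate_chunk(key: str) -> tuple[str, int, int]:
    """Resolve a chunk key to (url, offset, length) for a ranged GET."""
    entry = manifest[key]
    return entry["url"], entry["offset"], entry["length"]

print(locate_chunk("tasmax/0.0.1"))  # -> ('s3://bucket/file.nc', 56000, 48000)
```

Because only this table is stored, "virtualizing" a dataset costs a tiny fraction of the original data's size, and reads resolve to ranged requests against the unmodified archival files.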
VirtualiZarr grew out of discussions on the Kerchunk repository, and is an attempt to provide the game-changing power of Kerchunk in a Zarr-native way, with a familiar array-like API.
You now have a choice between using VirtualiZarr and Kerchunk: VirtualiZarr provides almost all the same features as Kerchunk.
Creating the virtual dataset looks quite similar to how we normally open data with [xarray][], but there are a few notable differences that are shown through this example.
First, import the necessary functions and classes:
```python
import icechunk
import obstore

from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.parsers import HDFParser
from virtualizarr.registry import ObjectStoreRegistry
```

Zarr can emit a lot of warnings about Numcodecs not being included in the Zarr version 3 specification yet -- let's suppress those.
```python
import warnings

warnings.filterwarnings(
    "ignore",
    message="Numcodecs codecs are not in the Zarr version 3 specification*",
    category=UserWarning,
)
```
We can use Obstore's [obstore.store.from_url][obstore.store.from_url] convenience method to create an [ObjectStore][obstore.store.ObjectStore] that can fetch data from the specified URLs.
```python
bucket = "s3://nex-gddp-cmip6"
path = "NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_2015_v2.0.nc"

store = obstore.store.from_url(bucket, region="us-west-2", skip_signature=True)
```

We also need to create an [ObjectStoreRegistry][virtualizarr.registry.ObjectStoreRegistry] that maps the URL structure to the ObjectStore.
```python
registry = ObjectStoreRegistry({bucket: store})
```

Now, let's create a parser instance and create a virtual dataset by passing the URL, parser, and registry to [virtualizarr.open_virtual_dataset][].
```python
parser = HDFParser()
vds = open_virtual_dataset(
    url=f"{bucket}/{path}",
    parser=parser,
    registry=registry,
    loadable_variables=[],
)
print(vds)
```

Since we specified `loadable_variables=[]`, no data has been loaded or copied in this process. We have merely created an in-memory lookup table that points to the location of chunks in the original netCDF file, to be used when data is needed later on. The default behavior (`loadable_variables=None`) will load data associated with coordinates but not data variables. The reported size is that of the original dataset; you can see the size of the virtual dataset itself using the `vz` accessor:
```python
print(f"Original dataset size: {vds.nbytes} bytes")
print(f"Virtual dataset size: {vds.vz.nbytes} bytes")
```

VirtualiZarr's other top-level function is [virtualizarr.open_virtual_mfdataset][], which can open and virtualize multiple data sources into a single virtual dataset, similar to how [xarray.open_mfdataset][] opens multiple data files as a single dataset.
```python
urls = [
    f"s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_{year}_v2.0.nc"
    for year in range(2015, 2017)
]
vds = open_virtual_mfdataset(urls, parser=parser, registry=registry)
print(vds)
```

The magic of VirtualiZarr is that you can persist the virtual dataset to disk in a chunk-references format such as Icechunk, meaning that the work of constructing the single coherent dataset only needs to happen once. For subsequent data access, you can use [xarray.open_zarr][] to open that Icechunk store, which on object storage is far faster than using [xarray.open_mfdataset][] to open the original non-cloud-optimized files.
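To make the persistence step concrete: committing references in the Kerchunk format amounts to writing the chunk lookup table out as JSON. A hedged sketch of what such a references file looks like (illustrative keys, offsets, and lengths, not generated from the real files):

```python
import json

# Kerchunk-style version 1 references: each chunk key maps to
# [url, offset, length] describing a byte range in the original file.
# All values below are illustrative, not taken from the real dataset.
refs = {
    "version": 1,
    "refs": {
        "tasmax/0.0.0": ["s3://nex-gddp-cmip6/example.nc", 8000, 48000],
        "tasmax/0.0.1": ["s3://nex-gddp-cmip6/example.nc", 56000, 48000],
    },
}

# Persisting is just serializing this table; readers later resolve chunk
# keys back into ranged GETs against the original file.
serialized = json.dumps(refs, indent=2)
roundtripped = json.loads(serialized)
print(roundtripped["refs"]["tasmax/0.0.1"])  # -> ['s3://nex-gddp-cmip6/example.nc', 56000, 48000]
```

Icechunk stores the same kind of reference information, but inside a transactional, versioned store rather than a flat JSON file.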
Let's persist the virtual dataset using Icechunk. Here we store the dataset in a memory store, but in most cases you'll store the virtual dataset in the cloud.
```python
icechunk_store = icechunk.in_memory_storage()
repo = icechunk.Repository.create(icechunk_store)
session = repo.writable_session("main")
vds.vz.to_icechunk(session.store)
session.commit("Create virtual store")
```

See the Usage docs page for more details.
- 2025/04/30 - Cloud-Native Geospatial Forum - Tom Nicholas - Slides / Recording
- 2024/11/21 - MET Office Architecture Guild - Tom Nicholas - Slides
- 2024/11/13 - Cloud-Native Geospatial conference - Raphael Hagen - Slides
- 2024/07/24 - ESIP Meeting - Sean Harkins - Event / Recording
- 2024/05/15 - Pangeo showcase - Tom Nicholas - Event / Recording / Slides
This package was originally developed by Tom Nicholas whilst working at [C]Worthy, who deserve credit for allowing him to prioritise a generalizable open-source solution to the dataset virtualization problem. VirtualiZarr is now a community-owned multi-stakeholder project.
Apache 2.0
[^1]: Cloud-Native Repositories for Big Scientific Data, Abernathey et al., Computing in Science & Engineering.

[^2]: (Pronounced like "virtualizer" but more piratey 🦜)