Merged
63 commits
99a99aa
draft refactor
TomNicholas Mar 6, 2025
feadc32
sketch of simplified handling of loadable_variables
TomNicholas Mar 6, 2025
b6e0242
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 6, 2025
f284c5a
get at least some tests working
TomNicholas Mar 7, 2025
7d50f8e
separate VirtualBackend api definition from common utilities
TomNicholas Mar 7, 2025
618da43
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 7, 2025
7f0ee4d
remove indexes={} everywhere in tests
TomNicholas Mar 7, 2025
6e020d3
Merge branch 'refactor_loadable_variables' of https://github.com/TomN…
TomNicholas Mar 7, 2025
f85c9d9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 7, 2025
03e50fc
stop passing through loadable_variables to where it isn't used
TomNicholas Mar 10, 2025
cc34f77
implement logic to load 1D dimension coords by default
TomNicholas Mar 10, 2025
eefbcc8
remove more instances of indexes={}
TomNicholas Mar 10, 2025
e05a640
remove more indexes={}
TomNicholas Mar 12, 2025
1106a40
refactor logic for choosing loadable_variables
TomNicholas Mar 20, 2025
dd0c947
fix more tets
TomNicholas Mar 20, 2025
b3f445a
xfail Aimee's test that I don't understand
TomNicholas Mar 20, 2025
be3429f
xfail test that explicitly specifies no indexes
TomNicholas Mar 20, 2025
ce4ea4e
made a bunch more stuff pass
TomNicholas Mar 20, 2025
bb23c7b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 20, 2025
1ea236f
fix netcdf3 reader
TomNicholas Mar 20, 2025
91b4dd2
Merge branch 'refactor_loadable_variables' of https://github.com/TomN…
TomNicholas Mar 20, 2025
30d020e
fix bad import in FITS reader
TomNicholas Mar 20, 2025
8a76cfe
fix import in tiff reader
TomNicholas Mar 20, 2025
8b0e7ae
fix import in icechunk test
TomNicholas Mar 20, 2025
ae5f480
Merge branch 'main' into refactor_loadable_variables
TomNicholas Mar 20, 2025
9b01010
release note
TomNicholas Mar 20, 2025
ef0ab78
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 20, 2025
80faea6
update docstring
TomNicholas Mar 20, 2025
c85538a
Merge branch 'refactor_loadable_variables' of https://github.com/TomN…
TomNicholas Mar 20, 2025
8a5fc65
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 20, 2025
0162bac
fix fits reader
TomNicholas Mar 20, 2025
2cea749
xfail on empty dict for indexes
TomNicholas Mar 20, 2025
216edbd
linting
TomNicholas Mar 20, 2025
fbcd127
Merge branch 'refactor_loadable_variables' of https://github.com/TomN…
TomNicholas Mar 20, 2025
e14f173
actually test new expected behaviour
TomNicholas Mar 20, 2025
97e1f74
fix logic for setting loadable_variables
TomNicholas Mar 21, 2025
77a7227
update docs page to reflect new behaviour
TomNicholas Mar 21, 2025
c2ecd88
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 21, 2025
d31a9fd
fix expected behaviour in another tests
TomNicholas Mar 21, 2025
6fa03f8
additional assert
TomNicholas Mar 21, 2025
eb773b7
Merge branch 'develop' into refactor_loadable_variables
maxrjones Mar 21, 2025
e2c3a79
Merge branch 'develop' into refactor_loadable_variables
maxrjones Mar 21, 2025
2859052
Merge branch 'refactor_loadable_variables' of https://github.com/TomN…
TomNicholas Mar 21, 2025
6bfcf1c
Merge branch 'develop' into refactor_loadable_variables
TomNicholas Mar 21, 2025
5820895
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 21, 2025
6c049f0
use encode_dataset_coordinates in kerchunk writer
TomNicholas Mar 21, 2025
ef4fa81
Merge branch 'develop' into refactor_loadable_variables
TomNicholas Mar 21, 2025
7729a33
Encode zarr vars
maxrjones Mar 21, 2025
51fa125
Merge pull request #1 from maxrjones/fixup
TomNicholas Mar 21, 2025
a00c097
fix some mypy errors
TomNicholas Mar 21, 2025
0d6fb40
move drop_variables implmentation to the end of every reader
TomNicholas Mar 21, 2025
d15ed90
override loadable_variables and raise warning
TomNicholas Mar 21, 2025
4ccbb5b
fix failing test by not creating loadable variables that would get in…
TomNicholas Mar 21, 2025
7992c08
improve error message
TomNicholas Mar 21, 2025
f20af13
remove some more occurrences of indexes={}
TomNicholas Mar 21, 2025
33d45c2
skip slow test
TomNicholas Mar 21, 2025
917a973
slay mypy errors
TomNicholas Mar 21, 2025
94f3d4f
docs typos
TomNicholas Mar 21, 2025
72beb94
should fix dmrpp test
TomNicholas Mar 21, 2025
83cb2a5
Merge branch 'refactor_loadable_variables' of https://github.com/TomN…
TomNicholas Mar 21, 2025
8a436f2
Delete commented-out code
TomNicholas Mar 24, 2025
9470b97
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 24, 2025
6bf2615
remove unecessary test skip
TomNicholas Mar 24, 2025
2 changes: 1 addition & 1 deletion conftest.py
@@ -195,7 +195,7 @@ def netcdf4_virtual_dataset(netcdf4_file):
"""Create a virtual dataset from a NetCDF4 file."""
from virtualizarr import open_virtual_dataset

with open_virtual_dataset(netcdf4_file, indexes={}) as ds:
with open_virtual_dataset(netcdf4_file, loadable_variables=[]) as ds:
TomNicholas (Member, Author) commented:
Required otherwise we get inlined variables in the kerchunk file which we don't know how to read (#489)

yield ds


10 changes: 10 additions & 0 deletions docs/releases.rst
@@ -12,6 +12,16 @@ New Features
Breaking changes
~~~~~~~~~~~~~~~~

- The set of variables which are loadable by default has changed. By default, the loadable variables are now the
same variables for which `xarray.open_dataset` would create indexes: i.e. one-dimensional coordinate variables whose
name matches the name of their only dimension (also known as "dimension coordinates").
Pandas indexes will also now be created by default for these loadable variables.
This is intended to provide a more friendly default, as often you will want these small variables to be loaded
(or "inlined", for efficiency of storage in icechunk/kerchunk), and you will also want to have in-memory indexes for these variables
(to allow `xarray.combine_by_coords` to sort using them).
The old behaviour is equivalent to passing ``loadable_variables=[]`` and ``indexes={}``.
(:issue:`335`, :pull:`477`) by `Tom Nicholas <https://github.com/TomNicholas>`_.
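
As a sketch of the behavioural change described above (using the ``air.nc`` example file from the docs):

```python
from virtualizarr import open_virtual_dataset

# New default: 1D dimension coordinates are loaded into memory and given
# pandas indexes, matching what xarray.open_dataset would do.
vds = open_virtual_dataset("air.nc")

# Recovering the old behaviour: load no variables and create no indexes.
vds_old = open_virtual_dataset("air.nc", loadable_variables=[], indexes={})
```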

Deprecations
~~~~~~~~~~~~

79 changes: 41 additions & 38 deletions docs/usage.md
@@ -28,27 +28,27 @@ vds = open_virtual_dataset('air.nc')

(Notice we did not have to explicitly indicate the file format, as {py:func}`open_virtual_dataset <virtualizarr.open_virtual_dataset>` will attempt to automatically infer it.)

Printing this "virtual dataset" shows that although it is an instance of `xarray.Dataset`, unlike a typical xarray dataset, it does not contain numpy or dask arrays, but instead it wraps {py:class}`ManifestArray <virtualizarr.manifests.ManifestArray>` objects.
Printing this "virtual dataset" shows that although it is an instance of `xarray.Dataset`, unlike a typical xarray dataset it wraps {py:class}`ManifestArray <virtualizarr.manifests.ManifestArray>` objects in addition to a few in-memory numpy arrays.

```python
vds
```

```
<xarray.Dataset> Size: 8MB
Dimensions: (time: 2920, lat: 25, lon: 53)
<xarray.Dataset> Size: 31MB
Dimensions: (lat: 25, lon: 53, time: 2920)
Coordinates:
lat (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
lon (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
time (time) float32 12kB ManifestArray<shape=(2920,), dtype=float32, ...
* lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) int16 8MB ManifestArray<shape=(2920, 25, 53), d...
air (time, lat, lon) float64 31MB ManifestArray<shape=(2920, 25, 53)...
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
title: 4x daily NMC reanalysis (1948)
```

Generally a "virtual dataset" is any `xarray.Dataset` which wraps one or more {py:class}`ManifestArray <virtualizarr.manifests.ManifestArray>` objects.
@@ -70,7 +70,7 @@ vds.virtualize.nbytes
```

```
128
23704
```

```{important} Virtual datasets are not normal xarray datasets!
@@ -230,7 +230,9 @@ But before we combine our data, we might want to consider loading some variables

## Loading variables

Whilst the values of virtual variables (i.e. those backed by `ManifestArray` objects) cannot be loaded into memory, you do have the option of opening specific variables from the file as loadable lazy numpy/dask arrays, just like `xr.open_dataset` normally returns. These variables are specified using the `loadable_variables` argument:
Whilst the values of virtual variables (i.e. those backed by `ManifestArray` objects) cannot be loaded into memory, you do have the option of opening specific variables from the file as loadable lazy numpy arrays, just like `xr.open_dataset` normally returns.

Which variables to open this way can be specified using the `loadable_variables` argument:

```python
vds = open_virtual_dataset('air.nc', loadable_variables=['air', 'time'])
@@ -240,17 +242,17 @@
<xarray.Dataset> Size: 31MB
Dimensions: (time: 2920, lat: 25, lon: 53)
Coordinates:
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
lat (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
lon (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float64 31MB ...
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
title: 4x daily NMC reanalysis (1948)
```

You can see that the dataset contains a mixture of virtual variables backed by `ManifestArray` objects (`lat` and `lon`), and loadable variables backed by (lazy) numpy arrays (`air` and `time`).
@@ -261,18 +263,21 @@ Loading variables can be useful in a few scenarios:
2. You want in-memory indexes to use with `xr.combine_by_coords`,
3. Storing a variable on-disk as a set of references would be inefficient, e.g. because it's a very small array (saving the values like this is similar to kerchunk's concept of "inlining" data),
4. The variable has encoding, and the simplest way to decode it correctly is to let xarray's standard decoding machinery load it into memory and apply the decoding,
5. Some of your variables have inconsistent-length chunks, and you want to be able to concatenate them together. For example you might have multiple virtual datasets with coordinates of inconsistent length (e.g., leap years within multi-year daily data).
5. Some of your variables have inconsistent-length chunks, and you want to be able to concatenate them together. For example you might have multiple virtual datasets with coordinates of inconsistent length (e.g., leap years within multi-year daily data). Loading them allows you to rechunk them however you like.

The default value of `loadable_variables` is `None`, which effectively specifies all the "dimension coordinates" in the file, i.e. all one-dimensional coordinate variables whose name is the same as the name of their only dimension. Xarray indexes will also be automatically created for these variables. Together these defaults mean that your virtual dataset will be opened with the same indexes as it would have been if it had been opened with just `xarray.open_dataset()`.
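
As a minimal sketch (coordinate names taken from the `air.nc` example above), the default is therefore roughly equivalent to listing the dimension coordinates explicitly:

```python
# air.nc has the dimension coordinates 'time', 'lat' and 'lon', so these
# two calls should produce equivalent virtual datasets.
vds_default = open_virtual_dataset('air.nc')
vds_explicit = open_virtual_dataset('air.nc', loadable_variables=['time', 'lat', 'lon'])
```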

### Loading low-dimensional coordinates
```{note}
In general, it is recommended to load all of your low-dimensional variables.

In general, it is recommended to load all of your low-dimensional coordinates.
This will slow down your initial opening of the individual virtual datasets, but by loading your coordinates into memory, they can be inlined in the reference file for fast reads of the virtualized store.
However, doing this for coordinates that are N-dimensional might use a lot of storage duplicating them.
Also, anything duplicated could become out of sync with the referenced original files, especially if not using a transactional storage engine like `Icechunk`.
Whilst this does mean the original data will be duplicated in your new virtual zarr store, by loading your coordinates into memory they can be inlined in the reference file for fast reads from the virtual store.

However, you should not do this for higher-dimensional variables, as then you might use a lot of storage duplicating them, defeating the point of the virtual zarr approach. Also, anything duplicated could become out of sync with the referenced original files, especially if not using a transactional storage engine like `Icechunk`.
```

### CF-encoded time variables

To correctly decode time variables according to the CF conventions, you need to pass `time` to `loadable_variables` and ensure the `decode_times` argument of `open_virtual_dataset` is set to True (`decode_times` defaults to None).
To decode time variables according to the CF conventions, you must ensure `time` is one of the `loadable_variables` and the `decode_times` argument of `open_virtual_dataset` is set to `True` (`decode_times` defaults to None).

```python
vds = open_virtual_dataset(
@@ -286,17 +291,17 @@ vds = open_virtual_dataset(
<xarray.Dataset> Size: 31MB
Dimensions: (time: 2920, lat: 25, lon: 53)
Coordinates:
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
lat (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
lon (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
time (time) datetime64[ns] 23kB 2013-01-01T00:02:06.757437440 ... 201...
Data variables:
air (time, lat, lon) float64 31MB ...
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
title: 4x daily NMC reanalysis (1948)
```

## Combining virtual datasets
@@ -328,26 +333,26 @@ vds2 = open_virtual_dataset('air2.nc')

As we know the correct order a priori, we can just combine along one dimension using `xarray.concat`.

```
combined_vds = xr.concat([vds1, vds2], dim='time', coords='minimal', compat='override')
```python
combined_vds = xr.concat([vds1, vds2], dim='time')
combined_vds
```

```
<xarray.Dataset> Size: 8MB
<xarray.Dataset> Size: 31MB
Dimensions: (time: 2920, lat: 25, lon: 53)
Coordinates:
lat (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
lon (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
time (time) float32 12kB ManifestArray<shape=(2920,), dtype=float32, ...
* lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) int16 8MB ManifestArray<shape=(2920, 25, 53), d...
air (time, lat, lon) float64 31MB ManifestArray<shape=(2920, 25, 53)...
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
title: 4x daily NMC reanalysis (1948)
```

We can see that the resulting combined manifest has two chunks, as expected.
@@ -362,32 +367,30 @@ combined_vds['air'].data.manifest.dict()
```

```{note}
The keyword arguments `coords='minimal', compat='override'` are currently necessary because the default behaviour of xarray will attempt to load coordinates in order to check their compatibility with one another. In future this [default will be changed](https://github.com/pydata/xarray/issues/8778), such that passing these two arguments explicitly will become unnecessary.
If you have any virtual coordinate variables, you will likely need to specify the keyword arguments `coords='minimal'` and `compat='override'` to `xarray.concat()`, because the default behaviour of xarray will attempt to load coordinates in order to check their compatibility with one another. In future this [default will be changed](https://github.com/pydata/xarray/issues/8778), such that passing these two arguments explicitly will become unnecessary.
```

The general multi-dimensional version of this concatenation-by-order-supplied can be achieved using `xarray.combine_nested`.
The general multi-dimensional version of this concatenation-by-order-supplied can be achieved using `xarray.combine_nested()`.

```python
combined_vds = xr.combine_nested([vds1, vds2], concat_dim=['time'], coords='minimal', compat='override')
combined_vds = xr.combine_nested([vds1, vds2], concat_dim=['time'])
```

In N-dimensions the datasets would need to be passed as an N-deep nested list-of-lists, see the [xarray docs](https://docs.xarray.dev/en/stable/user-guide/combining.html#combining-along-multiple-dimensions).
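
For instance, a hypothetical 2-D tiling (the four dataset names here are assumed for illustration) would be passed as a 2-deep nested list:

```python
# Four assumed virtual datasets tiling a (time, space) grid, concatenated
# along both dimensions at once.
combined_vds = xr.combine_nested(
    [[vds_t0_s0, vds_t0_s1],
     [vds_t1_s0, vds_t1_s1]],
    concat_dim=['time', 'space'],
)
```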

```{note}
For manual concatenation we can actually avoid creating any xarray indexes, as we won't need them. Without indexes we can avoid loading any data whatsoever from the files. However, you should first be confident that the archival files actually do have compatible data, as the coordinate values then cannot be efficiently compared for consistency (i.e. aligned).

By default indexes are created for 1-dimensional ``loadable_variables`` whose name matches their only dimension (i.e. "dimension coordinates"), but if you wish you can load variables without creating any indexes by passing ``indexes={}`` to ``open_virtual_dataset``.
```
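
A minimal sketch of that index-free path, reusing `air1.nc` and `air2.nc` from above and assuming we already trust the files to be compatible:

```python
# Dimension coordinates are still loaded, but no pandas indexes are built.
vds1 = open_virtual_dataset('air1.nc', indexes={})
vds2 = open_virtual_dataset('air2.nc', indexes={})

# Without indexes the coordinates cannot be automatically aligned, so skip
# the compatibility checks when concatenating manually.
combined_vds = xr.concat([vds1, vds2], dim='time', coords='minimal', compat='override')
```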

### Ordering by coordinate values

If you're happy to load 1D dimension coordinates into memory, you can use their values to do the ordering for you!

```python
vds1 = open_virtual_dataset('air1.nc', loadable_variables=['time', 'lat', 'lon'])
vds2 = open_virtual_dataset('air2.nc', loadable_variables=['time', 'lat', 'lon'])
vds1 = open_virtual_dataset('air1.nc')
vds2 = open_virtual_dataset('air2.nc')

combined_vds = xr.combine_by_coords([vds2, vds1], coords='minimal', compat='override')
combined_vds = xr.combine_by_coords([vds2, vds1])
```

Notice we don't have to specify the concatenation dimension explicitly - xarray works out the correct ordering for us. Even though we actually passed in the virtual datasets in the wrong order just now, the manifest still has the chunks listed in the correct order such that the 1-dimensional `time` coordinate has ascending values:
Expand Down
19 changes: 8 additions & 11 deletions virtualizarr/backend.py
@@ -17,8 +17,8 @@
NetCDF3VirtualBackend,
TIFFVirtualBackend,
)
from virtualizarr.readers.common import VirtualBackend
from virtualizarr.utils import _FsspecFSFromFilepath, check_for_collisions
from virtualizarr.readers.api import VirtualBackend
from virtualizarr.utils import _FsspecFSFromFilepath

# TODO add entrypoint to allow external libraries to add to this mapping
VIRTUAL_BACKENDS = {
@@ -112,11 +112,13 @@ def open_virtual_dataset(
backend: type[VirtualBackend] | None = None,
) -> Dataset:
"""
Open a file or store as an xarray Dataset wrapping virtualized zarr arrays.
Open a file or store as an xarray.Dataset wrapping virtualized zarr arrays.

No data variables will be loaded unless specified in the ``loadable_variables`` kwarg (in which case they will be xarray lazily indexed arrays).

Xarray indexes can optionally be created (the default behaviour). To avoid creating any xarray indexes pass ``indexes={}``.
Some variables can be opened as loadable lazy numpy arrays. This can be controlled explicitly using the ``loadable_variables`` keyword argument.
By default this will be the same variables which `xarray.open_dataset` would create indexes for: i.e. one-dimensional coordinate variables whose
name matches the name of their only dimension (also known as "dimension coordinates").
Pandas indexes will also be created by default for these loadable variables, but this can be controlled by passing a value for the ``indexes`` keyword argument.
To avoid creating any xarray indexes pass ``indexes={}``.

Parameters
----------
@@ -159,11 +161,6 @@
stacklevel=2,
)

drop_variables, loadable_variables = check_for_collisions(
drop_variables,
loadable_variables,
)

if reader_options is None:
reader_options = {}

33 changes: 33 additions & 0 deletions virtualizarr/readers/api.py
@@ -0,0 +1,33 @@
from abc import ABC
from collections.abc import Iterable, Mapping
from typing import Optional

import xarray as xr


class VirtualBackend(ABC):
@staticmethod
def open_virtual_dataset(
filepath: str,
group: str | None = None,
drop_variables: Iterable[str] | None = None,
loadable_variables: Iterable[str] | None = None,
decode_times: bool | None = None,
indexes: Mapping[str, xr.Index] | None = None,
virtual_backend_kwargs: Optional[dict] = None,
reader_options: Optional[dict] = None,
) -> xr.Dataset:
raise NotImplementedError()

@staticmethod
def open_virtual_datatree(
path: str,
group: str | None = None,
drop_variables: Iterable[str] | None = None,
loadable_variables: Iterable[str] | None = None,
decode_times: bool | None = None,
indexes: Mapping[str, xr.Index] | None = None,
virtual_backend_kwargs: Optional[dict] = None,
reader_options: Optional[dict] = None,
) -> xr.DataTree:
raise NotImplementedError()
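
A hypothetical third-party reader might then implement this interface as follows (the format name and parsing step are assumptions for illustration, not part of this PR):

```python
import xarray as xr

from virtualizarr.readers.api import VirtualBackend


class MyFormatVirtualBackend(VirtualBackend):
    """Sketch of a reader for a hypothetical 'myformat' file type."""

    @staticmethod
    def open_virtual_dataset(
        filepath,
        group=None,
        drop_variables=None,
        loadable_variables=None,
        decode_times=None,
        indexes=None,
        virtual_backend_kwargs=None,
        reader_options=None,
    ) -> xr.Dataset:
        # A real implementation would parse the file's metadata, construct
        # ManifestArray-backed variables pointing at each chunk's byte range,
        # and return them wrapped in an xarray.Dataset.
        raise NotImplementedError("illustrative sketch only")
```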