
Commit 3188ca0

Remove warning when passing indexes=None (#357)
* remove warning
* fix test by loading dimension coordinates
* fix other test by passing loadable_variables
* move loadable variables docs section before combining
* remove recommendation to not create indexes
* de-emphasise avoiding creating indexes
* document using xr.combine_by_coords
* clarify todo
* signpost segue
* remove extra line
* release notes
* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci
* correct some PR numbers in release notes
* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci
* refer to #18 in release notes

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 1474f18 commit 3188ca0


6 files changed (+119 −98 lines changed)


docs/api.rst

Lines changed: 0 additions & 1 deletion
@@ -20,7 +20,6 @@ Reading
 
    open_virtual_dataset
 
-
 Serialization
 -------------
 

docs/releases.rst

Lines changed: 6 additions & 2 deletions
@@ -14,7 +14,11 @@ Breaking changes
 
 - Passing ``group=None`` (the default) to ``open_virtual_dataset`` for a file with multiple groups no longer raises an error, instead it gives you the root group.
   This new behaviour is more consistent with ``xarray.open_dataset``.
-  (:issue:`336`, :pull:`337`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
+  (:issue:`336`, :pull:`338`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
+- Indexes are now created by default for any loadable one-dimensional coordinate variables.
+  Also a warning is no longer thrown when ``indexes=None`` is passed to ``open_virtual_dataset``, and the recommendations in the docs updated to match.
+  This also means that ``xarray.combine_by_coords`` will now work when the necessary dimension coordinates are specified in ``loadable_variables``.
+  (:issue:`18`, :pull:`357`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
 
 Deprecations
 ~~~~~~~~~~~~
@@ -23,7 +27,7 @@ Bug fixes
 ~~~~~~~~~
 
 - Fix bug preventing generating references for the root group of a file when a subgroup exists.
-  (:issue:`336`, :pull:`337`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
+  (:issue:`336`, :pull:`338`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
 
 Documentation
 ~~~~~~~~~~~~~

docs/usage.md

Lines changed: 85 additions & 74 deletions
@@ -184,6 +184,68 @@ The full Zarr model (for a single group) includes multiple arrays, array names,
 
 The problem of combining many archival format files (e.g. netCDF files) into one virtual Zarr store therefore becomes just a matter of opening each file using `open_virtual_dataset` and using [xarray's various combining functions](https://docs.xarray.dev/en/stable/user-guide/combining.html) to combine them into one aggregate virtual dataset.
 
+But before we combine our data, we might want to consider loading some variables into memory.
+
+## Loading variables
+
+Whilst the values of virtual variables (i.e. those backed by `ManifestArray` objects) cannot be loaded into memory, you do have the option of opening specific variables from the file as loadable lazy numpy/dask arrays, just like `xr.open_dataset` normally returns. These variables are specified using the `loadable_variables` argument:
+
+```python
+vds = open_virtual_dataset('air.nc', loadable_variables=['air', 'time'], indexes={})
+```
+```python
+<xarray.Dataset> Size: 31MB
+Dimensions:  (time: 2920, lat: 25, lon: 53)
+Coordinates:
+    lat      (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
+    lon      (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
+  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
+Data variables:
+    air      (time, lat, lon) float64 31MB ...
+Attributes:
+    Conventions:  COARDS
+    description:  Data is from NMC initialized reanalysis\n(4x/day). These a...
+    platform:     Model
+    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
+    title:        4x daily NMC reanalysis (1948)
+```
+You can see that the dataset contains a mixture of virtual variables backed by `ManifestArray` objects (`lat` and `lon`), and loadable variables backed by (lazy) numpy arrays (`air` and `time`).
+
+Loading variables can be useful in a few scenarios:
+1. You need to look at the actual values of a multi-dimensional variable in order to decide what to do next,
+2. You want in-memory indexes to use with ``xr.combine_by_coords``,
+3. Storing a variable on-disk as a set of references would be inefficient, e.g. because it's a very small array (saving the values like this is similar to kerchunk's concept of "inlining" data),
+4. The variable has encoding, and the simplest way to decode it correctly is to let xarray's standard decoding machinery load it into memory and apply the decoding.
+
+### CF-encoded time variables
+
+To correctly decode time variables according to the CF conventions, you need to pass `time` to `loadable_variables` and ensure the `decode_times` argument of `open_virtual_dataset` is set to True (`decode_times` defaults to None).
+
+```python
+vds = open_virtual_dataset(
+    'air.nc',
+    loadable_variables=['air', 'time'],
+    decode_times=True,
+    indexes={},
+)
+```
+```python
+<xarray.Dataset> Size: 31MB
+Dimensions:  (time: 2920, lat: 25, lon: 53)
+Coordinates:
+    lat      (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
+    lon      (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
+    time     (time) datetime64[ns] 23kB 2013-01-01T00:02:06.757437440 ... 201...
+Data variables:
+    air      (time, lat, lon) float64 31MB ...
+Attributes:
+    Conventions:  COARDS
+    description:  Data is from NMC initialized reanalysis\n(4x/day). These a...
+    platform:     Model
+    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
+    title:        4x daily NMC reanalysis (1948)
+```
+
 ## Combining virtual datasets
 
 In general we should be able to combine all the datasets from our archival files into one using some combination of calls to `xarray.concat` and `xarray.merge`. For combining along multiple dimensions in one call we also have `xarray.combine_nested` and `xarray.combine_by_coords`. If you're not familiar with any of these functions we recommend you skim through [xarray's docs on combining](https://docs.xarray.dev/en/stable/user-guide/combining.html).
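As a quick sketch of what the newly added "Loading variables" section describes, here is one way to check which variables stayed virtual after opening; the file name and variable names are assumed from the example above:

```python
# Distinguish virtual variables (ManifestArray-backed) from loaded ones.
from virtualizarr import open_virtual_dataset
from virtualizarr.manifests import ManifestArray

vds = open_virtual_dataset("air.nc", loadable_variables=["air", "time"])

for name, var in vds.variables.items():
    kind = "virtual" if isinstance(var.data, ManifestArray) else "loaded"
    print(f"{name}: {kind}")
```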
@@ -206,23 +268,9 @@ TODO: Note about variable-length chunking?
 
 The simplest case of concatenation is when you have a set of files and you know which order they should be concatenated in, _without looking inside the files_. In this case it is sufficient to open the files one-by-one, then pass the virtual datasets as a list to the concatenation function.
 
-We can actually avoid creating any xarray indexes, as we won't need them. Without indexes we can avoid loading any data whatsoever from the files, making our opening and combining much faster than it normally would be. **Therefore if you can do your combining manually you should.** However, you should first be confident that the archival files actually do have compatible data, as only the array shapes and dimension names will be checked for consistency.
-
-You can specify that you don't want any indexes to be created by passing `indexes={}` to `open_virtual_dataset`.
-
 ```python
-vds1 = open_virtual_dataset('air1.nc', indexes={})
-vds2 = open_virtual_dataset('air2.nc', indexes={})
-```
-
-We can see that the datasets have no indexes.
-
-```python
-vds1.indexes
-```
-```
-Indexes:
-    *empty*
+vds1 = open_virtual_dataset('air1.nc')
+vds2 = open_virtual_dataset('air2.nc')
 ```
 
 As we know the correct order a priori, we can just combine along one dimension using `xarray.concat`.
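A minimal sketch of that `xarray.concat` call; the dimension name and keyword arguments are assumptions based on the combine examples elsewhere in this diff:

```python
import xarray as xr

# vds1 and vds2 come from the open_virtual_dataset calls above.
combined_vds = xr.concat(
    [vds1, vds2],
    dim="time",          # assumed concatenation dimension
    coords="minimal",
    compat="override",
)
```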
@@ -285,73 +333,37 @@ In future we would like for it to be possible to just use `xr.open_mfdataset` to
 but this requires some [upstream changes](https://github.com/TomNicholas/VirtualiZarr/issues/35) in xarray.
 ```
 
-### Automatic ordering using coordinate data
-
-TODO: Reinstate this part of the docs once [GH issue #18](https://github.com/TomNicholas/VirtualiZarr/issues/18#issuecomment-2023955860) is properly closed.
-
-### Automatic ordering using metadata
+```{note}
+For manual concatenation we can actually avoid creating any xarray indexes, as we won't need them. Without indexes we can avoid loading any data whatsoever from the files. However, you should first be confident that the archival files actually do have compatible data, as the coordinate values then cannot be efficiently compared for consistency (i.e. aligned).
 
-TODO: Use preprocess to create a new index from the metadata
+By default indexes are created for 1-dimensional ``loadable_variables`` whose name matches their only dimension (i.e. "dimension coordinates"), but if you wish you can load variables without creating any indexes by passing ``indexes={}`` to ``open_virtual_dataset``.
+```
 
-## Loading variables
+### Ordering by coordinate values
 
-Whilst the values of virtual variables (i.e. those backed by `ManifestArray` objects) cannot be loaded into memory, you do have the option of opening specific variables from the file as loadable lazy numpy/dask arrays, just like `xr.open_dataset` normally returns. These variables are specified using the `loadable_variables` argument:
+If you're happy to load 1D dimension coordinates into memory, you can use their values to do the ordering for you!
 
 ```python
-vds = open_virtual_dataset('air.nc', loadable_variables=['air', 'time'], indexes={})
-```
-```python
-<xarray.Dataset> Size: 31MB
-Dimensions:  (time: 2920, lat: 25, lon: 53)
-Coordinates:
-    lat      (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
-    lon      (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
-  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
-Data variables:
-    air      (time, lat, lon) float64 31MB ...
-Attributes:
-    Conventions:  COARDS
-    description:  Data is from NMC initialized reanalysis\n(4x/day). These a...
-    platform:     Model
-    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
-    title:        4x daily NMC reanalysis (1948)
-```
-You can see that the dataset contains a mixture of virtual variables backed by `ManifestArray` objects (`lat` and `lon`), and loadable variables backed by (lazy) numpy arrays (`air` and `time`).
-
-Loading variables can be useful in a few scenarios:
-1. You need to look at the actual values of a multi-dimensional variable in order to decide what to do next,
-2. Storing a variable on-disk as a set of references would be inefficient, e.g. because it's a very small array (saving the values like this is similar to kerchunk's concept of "inlining" data),
-3. The variable has encoding, and the simplest way to decode it correctly is to let xarray's standard decoding machinery load it into memory and apply the decoding.
+vds1 = open_virtual_dataset('air1.nc', loadable_variables=['time', 'lat', 'lon'])
+vds2 = open_virtual_dataset('air2.nc', loadable_variables=['time', 'lat', 'lon'])
 
-### CF-encoded time variables
+combined_vds = xr.combine_by_coords([vds2, vds1], coords='minimal', compat='override')
+```
 
-To correctly decode time variables according to the CF conventions, you need to pass `time` to `loadable_variables` and ensure the `decode_times` argument of `open_virtual_dataset` is set to True (`decode_times` defaults to None).
+Notice we don't have to specify the concatenation dimension explicitly - xarray works out the correct ordering for us. Even though we actually passed in the virtual datasets in the wrong order just now, the manifest still has the chunks listed in the correct order such that the 1-dimensional ``time`` coordinate has ascending values:
 
 ```python
-vds = open_virtual_dataset(
-    'air.nc',
-    loadable_variables=['air', 'time'],
-    decode_times=True,
-    indexes={},
-)
+combined_vds['air'].data.manifest.dict()
 ```
-```python
-<xarray.Dataset> Size: 31MB
-Dimensions:  (time: 2920, lat: 25, lon: 53)
-Coordinates:
-    lat      (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
-    lon      (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
-    time     (time) datetime64[ns] 23kB 2013-01-01T00:02:06.757437440 ... 201...
-Data variables:
-    air      (time, lat, lon) float64 31MB ...
-Attributes:
-    Conventions:  COARDS
-    description:  Data is from NMC initialized reanalysis\n(4x/day). These a...
-    platform:     Model
-    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
-    title:        4x daily NMC reanalysis (1948)
+```
+{'0.0.0': {'path': 'file:///work/data/air1.nc', 'offset': 15419, 'length': 3869000},
+ '1.0.0': {'path': 'file:///work/data/air2.nc', 'offset': 15419, 'length': 3869000}}
 ```
 
+### Ordering using metadata
+
+TODO: Use preprocess to create a new index from the metadata. Requires ``open_virtual_mfdataset`` to be implemented in [PR #349](https://github.com/zarr-developers/VirtualiZarr/pull/349).
+
 ## Writing virtual stores to disk
 
 Once we've combined references to all the chunks of all our archival files into one virtual xarray dataset, we still need to write these references out to disk so that they can be read by our analysis code later.
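A minimal sketch of that write step, assuming VirtualiZarr's `.virtualize.to_kerchunk` accessor is used for serialization:

```python
# Write the combined references out as a Kerchunk JSON file.
combined_vds.virtualize.to_kerchunk("combined.json", format="json")
```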
@@ -439,9 +451,9 @@ This store can however be read by {py:func}`~virtualizarr.open_virtual_dataset`,
 You can open existing Kerchunk `json` or `parquet` references as Virtualizarr virtual datasets. This may be useful for converting existing Kerchunk formatted references to storage formats like [Icechunk](https://icechunk.io/).
 
 ```python
-vds = open_virtual_dataset('combined.json', filetype='kerchunk', indexes={})
+vds = open_virtual_dataset('combined.json', filetype='kerchunk')
 # or
-vds = open_virtual_dataset('combined.parquet', filetype='kerchunk', indexes={})
+vds = open_virtual_dataset('combined.parquet', filetype='kerchunk')
 ```
 
 One difference between the kerchunk references format and virtualizarr's internal manifest representation (as well as icechunk's format) is that paths in kerchunk references can be relative paths. Opening kerchunk references that contain relative local filepaths therefore requires supplying another piece of information: the directory of the ``fsspec`` filesystem which the filepath was defined relative to.
@@ -454,7 +466,6 @@ You can dis-ambuiguate kerchunk references containing relative paths by passing
 vds = open_virtual_dataset(
     'relative_refs.json',
     filetype='kerchunk',
-    indexes={},
     virtual_backend_kwargs={'fs_root': 'file:///some_directory/'}
 )

virtualizarr/readers/common.py

Lines changed: 6 additions & 7 deletions
@@ -1,4 +1,3 @@
-import warnings
 from abc import ABC
 from collections.abc import Iterable, Mapping, MutableMapping
 from typing import (
@@ -55,13 +54,13 @@ def open_loadable_vars_and_indexes(
     )
 
     if indexes is None:
-        warnings.warn(
-            "Specifying `indexes=None` will create in-memory pandas indexes for each 1D coordinate, but concatenation of ManifestArrays backed by pandas indexes is not yet supported (see issue #18)."
-            "You almost certainly want to pass `indexes={}` to `open_virtual_dataset` instead."
-        )
-
         # add default indexes by reading data from file
-        indexes = {name: index for name, index in ds.xindexes.items()}
+        # but avoid creating an in-memory index for virtual variables by default
+        indexes = {
+            name: index
+            for name, index in ds.xindexes.items()
+            if name in loadable_variables
+        }
     elif indexes != {}:
         # TODO allow manual specification of index objects
         raise NotImplementedError()
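A minimal sketch of the default behaviour this change implements; the file name is hypothetical and the expected output is an assumption based on the release notes:

```python
from virtualizarr import open_virtual_dataset

# Indexes are now built only for loadable 1D dimension coordinates;
# virtual (ManifestArray-backed) variables get no in-memory index.
vds = open_virtual_dataset("air.nc", loadable_variables=["time"])
print(list(vds.xindexes))  # expected: ['time']
```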

virtualizarr/tests/test_backend.py

Lines changed: 12 additions & 4 deletions
@@ -89,13 +89,21 @@ def test_no_indexes(self, netcdf4_file, hdf_backend):
         vds = open_virtual_dataset(netcdf4_file, indexes={}, backend=hdf_backend)
         assert vds.indexes == {}
 
-    def test_create_default_indexes(self, netcdf4_file, hdf_backend):
-        with pytest.warns(UserWarning, match="will create in-memory pandas indexes"):
-            vds = open_virtual_dataset(netcdf4_file, indexes=None, backend=hdf_backend)
+    def test_create_default_indexes_for_loadable_variables(
+        self, netcdf4_file, hdf_backend
+    ):
+        loadable_variables = ["time", "lat"]
+
+        vds = open_virtual_dataset(
+            netcdf4_file,
+            indexes=None,
+            backend=hdf_backend,
+            loadable_variables=loadable_variables,
+        )
         ds = open_dataset(netcdf4_file, decode_times=True)
 
         # TODO use xr.testing.assert_identical(vds.indexes, ds.indexes) instead once class supported by assertion comparison, see https://github.com/pydata/xarray/issues/5812
-        assert index_mappings_equal(vds.xindexes, ds.xindexes)
+        assert index_mappings_equal(vds.xindexes, ds[loadable_variables].xindexes)
 
 
 def index_mappings_equal(indexes1: Mapping[str, Index], indexes2: Mapping[str, Index]):

virtualizarr/tests/test_xarray.py

Lines changed: 10 additions & 10 deletions
@@ -6,7 +6,7 @@
 
 from virtualizarr import open_virtual_dataset
 from virtualizarr.manifests import ChunkManifest, ManifestArray
-from virtualizarr.readers.hdf import HDFVirtualBackend
+from virtualizarr.readers import HDF5VirtualBackend, HDFVirtualBackend
 from virtualizarr.tests import requires_kerchunk
 from virtualizarr.zarr import ZArray
 
@@ -227,15 +227,17 @@ def test_concat_dim_coords_along_existing_dim(self):
 
 
 @requires_kerchunk
-@pytest.mark.parametrize("hdf_backend", [None, HDFVirtualBackend])
+@pytest.mark.parametrize("hdf_backend", [HDF5VirtualBackend, HDFVirtualBackend])
 class TestCombineUsingIndexes:
     def test_combine_by_coords(self, netcdf4_files_factory: Callable, hdf_backend):
         filepath1, filepath2 = netcdf4_files_factory()
 
-        with pytest.warns(UserWarning, match="will create in-memory pandas indexes"):
-            vds1 = open_virtual_dataset(filepath1, backend=hdf_backend)
-        with pytest.warns(UserWarning, match="will create in-memory pandas indexes"):
-            vds2 = open_virtual_dataset(filepath2, backend=hdf_backend)
+        vds1 = open_virtual_dataset(
+            filepath1, backend=hdf_backend, loadable_variables=["time", "lat", "lon"]
+        )
+        vds2 = open_virtual_dataset(
+            filepath2, backend=hdf_backend, loadable_variables=["time", "lat", "lon"]
+        )
 
         combined_vds = xr.combine_by_coords(
             [vds2, vds1],
@@ -247,10 +249,8 @@ def test_combine_by_coords(self, netcdf4_files_factory: Callable, hdf_backend):
     def test_combine_by_coords_keeping_manifestarrays(self, netcdf4_files, hdf_backend):
         filepath1, filepath2 = netcdf4_files
 
-        with pytest.warns(UserWarning, match="will create in-memory pandas indexes"):
-            vds1 = open_virtual_dataset(filepath1, backend=hdf_backend)
-        with pytest.warns(UserWarning, match="will create in-memory pandas indexes"):
-            vds2 = open_virtual_dataset(filepath2, backend=hdf_backend)
+        vds1 = open_virtual_dataset(filepath1, backend=hdf_backend)
+        vds2 = open_virtual_dataset(filepath2, backend=hdf_backend)
 
         combined_vds = xr.combine_by_coords(
             [vds2, vds1],
