* remove warning
* fix test by loading dimension coordinates
* fix other test by passing loadable_variables
* move loadable variables docs section before combining
* remove recommendation to not create indexes
* de-emphasise avoiding creating indexes
* document using xr.combine_by_coords
* clarify todo
* signpost segue
* remove extra line
* release notes
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* correct some PR numbers in release notes
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* refer to #18 in release notes
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
docs/releases.rst (6 additions, 2 deletions)
@@ -14,7 +14,11 @@ Breaking changes
- Passing ``group=None`` (the default) to ``open_virtual_dataset`` for a file with multiple groups no longer raises an error, instead it gives you the root group.
This new behaviour is more consistent with ``xarray.open_dataset``.
-  (:issue:`336`, :pull:`337`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
+  (:issue:`336`, :pull:`338`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
+- Indexes are now created by default for any loadable one-dimensional coordinate variables.
+  Also a warning is no longer thrown when ``indexes=None`` is passed to ``open_virtual_dataset``, and the recommendations in the docs have been updated to match.
+  This also means that ``xarray.combine_by_coords`` will now work when the necessary dimension coordinates are specified in ``loadable_variables``.
+  (:issue:`18`, :pull:`357`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
Deprecations
~~~~~~~~~~~~
@@ -23,7 +27,7 @@ Bug fixes
~~~~~~~~~
- Fix bug preventing generating references for the root group of a file when a subgroup exists.
-  (:issue:`336`, :pull:`337`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
+  (:issue:`336`, :pull:`338`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
docs/usage.md (85 additions, 74 deletions)
@@ -184,6 +184,68 @@ The full Zarr model (for a single group) includes multiple arrays, array names,
The problem of combining many archival format files (e.g. netCDF files) into one virtual Zarr store therefore becomes just a matter of opening each file using `open_virtual_dataset` and using [xarray's various combining functions](https://docs.xarray.dev/en/stable/user-guide/combining.html) to combine them into one aggregate virtual dataset.
+But before we combine our data, we might want to consider loading some variables into memory.
+
+## Loading variables
+
+Whilst the values of virtual variables (i.e. those backed by `ManifestArray` objects) cannot be loaded into memory, you do have the option of opening specific variables from the file as loadable lazy numpy/dask arrays, just like `xr.open_dataset` normally returns. These variables are specified using the `loadable_variables` argument:
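For example (a minimal sketch, assuming the `air.nc` tutorial file used throughout this page):

```python
from virtualizarr import open_virtual_dataset

# 'air' and 'time' are opened as loadable (lazy numpy) variables;
# everything else stays virtual, backed by ManifestArray objects
vds = open_virtual_dataset(
    'air.nc',
    loadable_variables=['air', 'time'],
)
```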
You can see that the dataset contains a mixture of virtual variables backed by `ManifestArray` objects (`lat` and `lon`), and loadable variables backed by (lazy) numpy arrays (`air` and `time`).
+
+Loading variables can be useful in a few scenarios:
+1. You need to look at the actual values of a multi-dimensional variable in order to decide what to do next,
+2. You want in-memory indexes to use with ``xr.combine_by_coords``,
+3. Storing a variable on-disk as a set of references would be inefficient, e.g. because it's a very small array (saving the values like this is similar to kerchunk's concept of "inlining" data),
+4. The variable has encoding, and the simplest way to decode it correctly is to let xarray's standard decoding machinery load it into memory and apply the decoding.
+
### CF-encoded time variables
+
+To correctly decode time variables according to the CF conventions, you need to pass `time` to `loadable_variables` and ensure the `decode_times` argument of `open_virtual_dataset` is set to `True` (`decode_times` defaults to `None`).
+
+```python
+vds = open_virtual_dataset(
+    'air.nc',
+    loadable_variables=['air', 'time'],
+    decode_times=True,
+    indexes={},
+)
+```
+
+```python
+<xarray.Dataset> Size: 31MB
+Dimensions:  (time: 2920, lat: 25, lon: 53)
+Coordinates:
+    lat      (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
In general we should be able to combine all the datasets from our archival files into one using some combination of calls to `xarray.concat` and `xarray.merge`. For combining along multiple dimensions in one call we also have `xarray.combine_nested` and `xarray.combine_by_coords`. If you're not familiar with any of these functions we recommend you skim through [xarray's docs on combining](https://docs.xarray.dev/en/stable/user-guide/combining.html).
@@ -206,23 +268,9 @@ TODO: Note about variable-length chunking?
The simplest case of concatenation is when you have a set of files and you know which order they should be concatenated in, _without looking inside the files_. In this case it is sufficient to open the files one-by-one, then pass the virtual datasets as a list to the concatenation function.
-We can actually avoid creating any xarray indexes, as we won't need them. Without indexes we can avoid loading any data whatsoever from the files, making our opening and combining much faster than it normally would be. **Therefore if you can do your combining manually you should.** However, you should first be confident that the archival files actually do have compatible data, as only the array shapes and dimension names will be checked for consistency.
-
-You can specify that you don't want any indexes to be created by passing `indexes={}` to `open_virtual_dataset`.
As we know the correct order a priori, we can just combine along one dimension using `xarray.concat`.
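For instance (a sketch, with `air1.nc` and `air2.nc` as hypothetical consecutive files along `time`):

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

vds1 = open_virtual_dataset('air1.nc')
vds2 = open_virtual_dataset('air2.nc')

# we already know the correct order, so concatenate explicitly along 'time';
# 'minimal'/'override' avoid comparing values across the virtual datasets
combined_vds = xr.concat(
    [vds1, vds2],
    dim='time',
    coords='minimal',
    compat='override',
)
```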
@@ -285,73 +333,37 @@ In future we would like for it to be possible to just use `xr.open_mfdataset` to
but this requires some [upstream changes](https://github.com/TomNicholas/VirtualiZarr/issues/35) in xarray.
```
-
-### Automatic ordering using coordinate data
-
-TODO: Reinstate this part of the docs once [GH issue #18](https://github.com/TomNicholas/VirtualiZarr/issues/18#issuecomment-2023955860) is properly closed.
-
-### Automatic ordering using metadata
+```{note}
+For manual concatenation we can actually avoid creating any xarray indexes, as we won't need them. Without indexes we can avoid loading any data whatsoever from the files. However, you should first be confident that the archival files actually do have compatible data, as the coordinate values then cannot be efficiently compared for consistency (i.e. aligned).
 
-TODO: Use preprocess to create a new index from the metadata
+By default indexes are created for 1-dimensional ``loadable_variables`` whose name matches their only dimension (i.e. "dimension coordinates"), but if you wish you can load variables without creating any indexes by passing ``indexes={}`` to ``open_virtual_dataset``.
+```
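A sketch of opting out of index creation (same assumed `air.nc` file):

```python
# load 'time' into memory but skip building any pandas indexes
vds = open_virtual_dataset(
    'air.nc',
    loadable_variables=['time'],
    indexes={},
)
assert len(vds.indexes) == 0  # nothing was indexed
```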
-##Loading variables
+### Ordering by coordinate values
-Whilst the values of virtual variables (i.e. those backed by `ManifestArray` objects) cannot be loaded into memory, you do have the option of opening specific variables from the file as loadable lazy numpy/dask arrays, just like `xr.open_dataset` normally returns. These variables are specified using the `loadable_variables` argument:
+If you're happy to load 1D dimension coordinates into memory, you can use their values to do the ordering for you!
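A sketch of such coordinate-based ordering, reusing the hypothetical pair of files from above:

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# load the 1D dimension coordinate so its values are available for ordering
vds1 = open_virtual_dataset('air2.nc', loadable_variables=['time'])
vds2 = open_virtual_dataset('air1.nc', loadable_variables=['time'])

# note the deliberately shuffled order: combine_by_coords sorts the
# datasets by inspecting the in-memory 'time' coordinate values
combined_vds = xr.combine_by_coords(
    [vds1, vds2],
    coords='minimal',
    compat='override',
)
```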
-You can see that the dataset contains a mixture of virtual variables backed by `ManifestArray` objects (`lat` and `lon`), and loadable variables backed by (lazy) numpy arrays (`air` and `time`).
-
-Loading variables can be useful in a few scenarios:
-1. You need to look at the actual values of a multi-dimensional variable in order to decide what to do next,
-2. Storing a variable on-disk as a set of references would be inefficient, e.g. because it's a very small array (saving the values like this is similar to kerchunk's concept of "inlining" data),
-3. The variable has encoding, and the simplest way to decode it correctly is to let xarray's standard decoding machinery load it into memory and apply the decoding.
-To correctly decode time variables according to the CF conventions, you need to pass `time` to `loadable_variables` and ensure the `decode_times` argument of `open_virtual_dataset` is set to True (`decode_times` defaults to None).
+Notice we don't have to specify the concatenation dimension explicitly - xarray works out the correct ordering for us. Even though we actually passed in the virtual datasets in the wrong order just now, the manifest still has the chunks listed in the correct order such that the 1-dimensional ``time`` coordinate has ascending values:
 ```python
-vds = open_virtual_dataset(
-    'air.nc',
-    loadable_variables=['air', 'time'],
-    decode_times=True,
-    indexes={},
-)
+combined_vds['air'].data.manifest.dict()
 ```
-```python
-<xarray.Dataset> Size: 31MB
-Dimensions:  (time: 2920, lat: 25, lon: 53)
-Coordinates:
-    lat      (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
+TODO: Use preprocess to create a new index from the metadata. Requires ``open_virtual_mfdataset`` to be implemented in [PR #349](https://github.com/zarr-developers/VirtualiZarr/pull/349).
## Writing virtual stores to disk
Once we've combined references to all the chunks of all our archival files into one virtual xarray dataset, we still need to write these references out to disk so that they can be read by our analysis code later.
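For example (a sketch, assuming a kerchunk JSON writer exposed on the `.virtualize` dataset accessor; the method name is an assumption based on the VirtualiZarr API):

```python
# write all the collected references out as a single kerchunk JSON file
combined_vds.virtualize.to_kerchunk('combined.json', format='json')
```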
@@ -439,9 +451,9 @@ This store can however be read by {py:func}`~virtualizarr.open_virtual_dataset`,
You can open existing Kerchunk `json` or `parquet` references as VirtualiZarr virtual datasets. This may be useful for converting existing Kerchunk-formatted references to storage formats like [Icechunk](https://icechunk.io/).
One difference between the Kerchunk references format and VirtualiZarr's internal manifest representation (as well as Icechunk's format) is that paths in Kerchunk references can be relative paths. Opening Kerchunk references that contain relative local filepaths therefore requires supplying another piece of information: the directory of the ``fsspec`` filesystem relative to which the filepath was defined.
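A sketch of supplying that directory when opening such references; the `fs_root` keyword and the `virtual_backend_kwargs` routing are assumptions based on the VirtualiZarr kerchunk reader:

```python
# refs.json stores chunk paths relative to /data_dir/
vds = open_virtual_dataset(
    'refs.json',
    filetype='kerchunk',
    virtual_backend_kwargs={'fs_root': 'file:///data_dir/'},
)
```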
@@ -454,7 +466,6 @@ You can dis-ambuiguate kerchunk references containing relative paths by passing
"Specifying `indexes=None` will create in-memory pandas indexes for each 1D coordinate, but concatenation of ManifestArrays backed by pandas indexes is not yet supported (see issue #18)."
60
-
"You almost certainly want to pass `indexes={}` to `open_virtual_dataset` instead."
# TODO use xr.testing.assert_identical(vds.indexes, ds.indexes) instead once class supported by assertion comparison, see https://github.com/pydata/xarray/issues/5812