
Commit eb4d6e0

Add nbytes property (#227)
* add nbytes property
* dataset accessor method
* test
* release notes
* add to API docs
* fix implementation so it still displays non-virtual total in xarray repr
* mention in documentation
1 parent 3188ca0 commit eb4d6e0

7 files changed (+83 -1 lines changed)

docs/api.rst

Lines changed: 10 additions & 0 deletions
@@ -32,6 +32,16 @@ Serialization
    VirtualiZarrDatasetAccessor.to_zarr
    VirtualiZarrDatasetAccessor.to_icechunk
 
+Information
+-----------
+
+.. currentmodule:: virtualizarr.accessor
+.. autosummary::
+   :nosignatures:
+   :toctree: generated/
+
+   VirtualiZarrDatasetAccessor.nbytes
+
 Rewriting
 ---------
 
docs/releases.rst

Lines changed: 3 additions & 0 deletions
@@ -9,6 +9,9 @@ v1.2.1 (unreleased)
 New Features
 ~~~~~~~~~~~~
 
+- Added a ``.nbytes`` accessor method which displays the bytes needed to hold the virtual references in memory.
+  (:issue:`167`, :pull:`227`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
+
 Breaking changes
 ~~~~~~~~~~~~~~~~
 
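The release note summarizes the feature; a minimal usage sketch (the file path is purely illustrative), comparing the new accessor property against xarray's usual total:

```python
from virtualizarr import open_virtual_dataset

vds = open_virtual_dataset("air.nc")  # hypothetical local netCDF file

# bytes needed to hold just the chunk references in memory
print(vds.virtualize.nbytes)

# xarray's normal total, which counts the full size of the referenced data
print(vds.nbytes)
```
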
docs/usage.md

Lines changed: 15 additions & 1 deletion
@@ -60,11 +60,25 @@ Attributes:
     title: 4x daily NMC reanalysis (1948)
 ```
 
-
 Generally a "virtual dataset" is any `xarray.Dataset` which wraps one or more {py:class}`ManifestArray <virtualizarr.manifests.ManifestArray>` objects.
 
 These particular {py:class}`ManifestArray <virtualizarr.manifests.ManifestArray>` objects are each a virtual reference to some data in the `air.nc` netCDF file, with the references stored in the form of "Chunk Manifests".
 
+As the manifest contains only addresses at which to find large binary chunks, the virtual dataset takes up far less space in memory than the original dataset does:
+
+```python
+ds.nbytes
+```
+```
+30975672
+```
+```python
+vds.virtualize.nbytes
+```
+```
+128
+```
+
 ```{important} Virtual datasets are not normal xarray datasets!
 
 Although the top-level type is still `xarray.Dataset`, they are intended only as an abstract representation of a set of data files, not as something you can do analysis with. If you try to load, view, or plot any data you will get a `NotImplementedError`. Virtual datasets only support a very limited subset of normal xarray operations, particularly functions and methods for concatenating, merging and extracting variables, as well as operations for renaming dimensions and variables.

virtualizarr/accessor.py

Lines changed: 18 additions & 0 deletions
@@ -183,3 +183,21 @@ def rename_paths(
                 new_ds[var_name].data = data.rename_paths(new=new)
 
         return new_ds
+
+    @property
+    def nbytes(self) -> int:
+        """
+        Size required to hold these references in memory in bytes.
+
+        Note this is not the size of the referenced chunks if they were actually loaded into memory,
+        this is only the size of the pointers to the chunk locations.
+        If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.
+
+        In-memory (loadable) variables are included in the total using xarray's normal ``.nbytes`` method.
+        """
+        return sum(
+            var.data.nbytes_virtual
+            if isinstance(var.data, ManifestArray)
+            else var.nbytes
+            for var in self.ds.variables.values()
+        )
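The property mixes two kinds of sizes in one sum. A standalone sketch of that branching, using a hypothetical `FakeManifestArray` and plain namespaces instead of real `ManifestArray` objects and xarray variables:

```python
from types import SimpleNamespace

import numpy as np

class FakeManifestArray:
    """Hypothetical stand-in for ManifestArray; only its chunk references occupy memory."""
    nbytes_virtual = 32

def total_nbytes(variables) -> int:
    # Mirrors the accessor's generator expression: variables backed by the (fake)
    # ManifestArray contribute only the size of their references, while loaded
    # variables contribute their ordinary in-memory size.
    return sum(
        var.data.nbytes_virtual if isinstance(var.data, FakeManifestArray) else var.nbytes
        for var in variables
    )

loaded_data = np.arange(4, dtype="int32")  # 16 bytes actually in memory
virtual_var = SimpleNamespace(data=FakeManifestArray(), nbytes=15_476_000)  # what xarray would report for the full array
loaded_var = SimpleNamespace(data=loaded_data, nbytes=loaded_data.nbytes)

print(total_nbytes([virtual_var, loaded_var]))  # 32 + 16 == 48
```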

virtualizarr/manifests/array.py

Lines changed: 12 additions & 0 deletions
@@ -93,6 +93,18 @@ def size(self) -> int:
     def __repr__(self) -> str:
         return f"ManifestArray<shape={self.shape}, dtype={self.dtype}, chunks={self.chunks}>"
 
+    @property
+    def nbytes_virtual(self) -> int:
+        """
+        Size required to hold these references in memory in bytes.
+
+        Note this is not the size of the referenced array if it were actually loaded into memory,
+        this is only the size of the pointers to the chunk locations.
+        If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.
+        """
+        # note: we don't name this method `.nbytes` as we don't want xarray's repr to use it
+        return self.manifest.nbytes
+
     def __array_function__(self, func, types, args, kwargs) -> Any:
         """
         Hook to teach this class what to do if np.concat etc. is called on it.
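The inline comment records the design choice referenced in the commit message: because xarray falls back to computing `size * dtype.itemsize` when a wrapped array does not define `.nbytes`, naming this property `nbytes_virtual` keeps the xarray repr showing the full non-virtual total. A rough sketch of that fallback, with a hypothetical `PointerOnlyArray` standing in for `ManifestArray`:

```python
import numpy as np

class PointerOnlyArray:
    """Hypothetical stand-in for ManifestArray: logical shape/dtype, but no loaded data."""
    shape = (2920, 25, 53)
    dtype = np.dtype("float32")

    @property
    def nbytes_virtual(self) -> int:
        return 128  # size of the chunk references only

def reported_nbytes(arr) -> int:
    # Approximation of xarray's fallback: use `.nbytes` when the wrapped array
    # defines it, otherwise compute the logical size from shape and dtype.
    if hasattr(arr, "nbytes"):
        return arr.nbytes
    return int(np.prod(arr.shape)) * arr.dtype.itemsize

arr = PointerOnlyArray()
print(reported_nbytes(arr))   # 15476000 -- the non-virtual total shown in the repr
print(arr.nbytes_virtual)     # 128      -- what .virtualize.nbytes counts instead
```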

virtualizarr/manifests/manifest.py

Lines changed: 11 additions & 0 deletions
@@ -357,6 +357,17 @@ def shape_chunk_grid(self) -> tuple[int, ...]:
     def __repr__(self) -> str:
         return f"ChunkManifest<shape={self.shape_chunk_grid}>"
 
+    @property
+    def nbytes(self) -> int:
+        """
+        Size required to hold these references in memory in bytes.
+
+        Note this is not the size of the referenced chunks if they were actually loaded into memory,
+        this is only the size of the pointers to the chunk locations.
+        If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.
+        """
+        return self._paths.nbytes + self._offsets.nbytes + self._lengths.nbytes
+
     def __getitem__(self, key: ChunkKey) -> ChunkEntry:
         indices = split(key)
         path = self._paths[indices]
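The property simply sums the buffers of the three internal numpy arrays holding paths, offsets and lengths. A minimal illustration with hypothetical arrays (the dtypes `ChunkManifest` actually uses internally may differ):

```python
import numpy as np

# Hypothetical reference arrays for a 2x2 chunk grid; this only shows the arithmetic.
paths = np.array([["air.nc", "air.nc"], ["air.nc", "air.nc"]], dtype="<U32")
offsets = np.array([[0, 100], [200, 300]], dtype=np.uint64)
lengths = np.array([[100, 100], [100, 100]], dtype=np.uint64)

print(paths.nbytes, offsets.nbytes, lengths.nbytes)    # 512 32 32
print(paths.nbytes + offsets.nbytes + lengths.nbytes)  # 576 bytes, however large the referenced chunks are
```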

virtualizarr/tests/test_xarray.py

Lines changed: 14 additions & 0 deletions
@@ -3,6 +3,7 @@
 import numpy as np
 import pytest
 import xarray as xr
+from xarray import open_dataset
 
 from virtualizarr import open_virtual_dataset
 from virtualizarr.manifests import ChunkManifest, ManifestArray
@@ -310,3 +311,16 @@ def test_mixture_of_manifestarrays_and_numpy_arrays(
         == "s3://bucket/air.nc"
     )
     assert isinstance(renamed_vds["lat"].data, np.ndarray)
+
+
+@requires_kerchunk
+def test_nbytes(simple_netcdf4):
+    vds = open_virtual_dataset(simple_netcdf4)
+    assert vds.virtualize.nbytes == 32
+    assert vds.nbytes == 48
+
+    vds = open_virtual_dataset(simple_netcdf4, loadable_variables=["foo"])
+    assert vds.virtualize.nbytes == 48
+
+    ds = open_dataset(simple_netcdf4)
+    assert ds.virtualize.nbytes == ds.nbytes
