Commit 8534ec7

Merge branch 'main' into close-filehandle

2 parents: 7dd22f9 + bf63593

5 files changed: +20 -171 lines

README.md

Lines changed: 8 additions & 8 deletions

```diff
@@ -8,28 +8,28 @@
 [![pre-commit Enabled](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://pre-commit.com/)
 [![Apache 2.0 License](https://img.shields.io/badge/license-Apache%202-cb2533.svg)](https://www.apache.org/licenses/LICENSE-2.0)
 [![Python Versions](https://img.shields.io/python/required-version-toml?tomlFilePath=https://raw.githubusercontent.com/zarr-developers/VirtualiZarr/main/pyproject.toml&logo=Python&logoColor=gold&label=Python)](https://docs.python.org)
+[![slack](https://img.shields.io/badge/slack-virtualizarr-purple.svg?logo=slack)](https://earthmover-community.slack.com/archives/C08EXCE8ZQX)
 [![Latest Release](https://img.shields.io/github/v/release/zarr-developers/VirtualiZarr)](https://github.com/zarr-developers/VirtualiZarr/releases)
 [![PyPI - Downloads](https://img.shields.io/pypi/dm/virtualizarr?label=pypi%7Cdownloads)](https://pypistats.org/packages/virtualizarr)
 [![Conda - Downloads](https://img.shields.io/conda/d/conda-forge/virtualizarr
 )](https://anaconda.org/conda-forge/virtualizarr)
-[![slack](https://img.shields.io/badge/slack-virtualizarr-purple.svg?logo=slack)](https://earthmover-community.slack.com/archives/C08EXCE8ZQX)
 
-## Cloud-Optimize your Scientific Data as Virtual Zarr stores, using xarray syntax.
+## Cloud-Optimize your Scientific Data as a Virtual Zarr Datacube, using Xarray syntax.
 
 The best way to distribute large scientific datasets is via the Cloud, in [Cloud-Optimized formats](https://guide.cloudnativegeo.org/) [^1]. But often this data is stuck in archival pre-Cloud file formats such as netCDF.
 
-**VirtualiZarr[^2] makes it easy to create "Virtual" Zarr stores, allowing performant access to archival data as if it were in the Cloud-Optimized [Zarr format](https://zarr.dev/), _without duplicating any data_.**
+**VirtualiZarr[^2] makes it easy to create "Virtual" Zarr datacubes, allowing performant access to archival data as if it were in the Cloud-Optimized [Zarr format](https://zarr.dev/), _without duplicating any data_.**
 
 Please see the [documentation](https://virtualizarr.readthedocs.io/en/stable/index.html).
 
 ### Features
 
-* Create virtual references pointing to bytes inside a archival file with [`open_virtual_dataset`](https://virtualizarr.readthedocs.io/en/latest/usage.html#opening-files-as-virtual-datasets),
-* Supports a [range of archival file formats](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare), including netCDF4 and HDF5,
-* [Combine data from multiple files](https://virtualizarr.readthedocs.io/en/latest/usage.html#combining-virtual-datasets) into one larger store using [xarray's combining functions](https://docs.xarray.dev/en/stable/user-guide/combining.html), such as [`xarray.concat`](https://docs.xarray.dev/en/stable/generated/xarray.concat.html),
+* Create virtual references pointing to bytes inside an archival file with [`open_virtual_dataset`](https://virtualizarr.readthedocs.io/en/latest/usage.html#opening-files-as-virtual-datasets).
+* Supports a [range of archival file formats](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare), including netCDF4 and HDF5, and has a pluggable system for supporting new formats.
+* [Combine data from multiple files](https://virtualizarr.readthedocs.io/en/latest/usage.html#combining-virtual-datasets) into one larger datacube using [xarray's combining functions](https://docs.xarray.dev/en/stable/user-guide/combining.html), such as [`xarray.concat`](https://docs.xarray.dev/en/stable/generated/xarray.concat.html).
 * Commit the virtual references to storage either using the [Kerchunk references](https://fsspec.github.io/kerchunk/spec.html) specification or the [Icechunk](https://icechunk.io/) transactional storage engine.
-* Users access the virtual dataset using [`xarray.open_dataset`](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html#xarray.open_dataset).
+* Users access the virtual datacube simply as a single zarr-compatible store using [`xarray.open_zarr`](https://docs.xarray.dev/en/stable/generated/xarray.open_zarr.html).
 
 ### Inspired by Kerchunk
```

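The README's first feature bullet talks about "virtual references pointing to bytes inside an archival file". A minimal standalone sketch of what such a reference is, in the spirit of the Kerchunk references specification the README links to — note this is not the VirtualiZarr API, and the file paths and byte ranges below are invented for illustration:

```python
def make_reference(path: str, offset: int, length: int) -> list:
    """Build one Kerchunk-style chunk reference: [url, offset, length]."""
    return [path, offset, length]


# A tiny manifest for one variable whose two chunks live in two archival
# netCDF files. Nothing is copied: each Zarr chunk key just records which
# byte range of which original file holds that chunk's data.
refs = {
    "air/0.0": make_reference("s3://bucket/archive/file1.nc", 20_000, 9_500),
    "air/1.0": make_reference("s3://bucket/archive/file2.nc", 20_000, 9_500),
}

# A Zarr-aware reader resolves a chunk key to the byte range it must fetch.
path, offset, length = refs["air/0.0"]
print(path, offset, length)
```

This is why the README can claim access "without duplicating any data": only the small reference manifest is stored, while the bytes stay in the original files.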
virtualizarr/readers/hdf/hdf.py

Lines changed: 11 additions & 49 deletions

```diff
@@ -8,13 +8,11 @@
     List,
     Mapping,
     Optional,
-    Tuple,
     Union,
 )
 
 import numpy as np
 import xarray as xr
-from xarray.backends.zarr import FillValueCoder
 
 from virtualizarr.manifests import (
     ChunkEntry,
@@ -42,20 +40,6 @@
 H5Dataset: Any = None
 H5Group: Any = None
 
-FillValueType = Union[
-    int,
-    float,
-    bool,
-    complex,
-    str,
-    np.integer,
-    np.floating,
-    np.bool_,
-    np.complexfloating,
-    bytes,  # For fixed-length string storage
-    Tuple[bytes, int],  # Structured type
-]
-
 
 class HDFVirtualBackend(VirtualBackend):
     @staticmethod
@@ -232,29 +216,6 @@ def _dataset_dims(dataset: H5Dataset, group: str = "") -> List[str]:
 
         return [dim.removeprefix(group) for dim in dims]
 
-    @staticmethod
-    def _extract_cf_fill_value(
-        h5obj: Union[H5Dataset, H5Group],
-    ) -> Optional[FillValueType]:
-        """
-        Convert the _FillValue attribute from an HDF5 group or dataset into
-        encoding.
-
-        Parameters
-        ----------
-        h5obj : h5py.Group or h5py.Dataset
-            An h5py group or dataset.
-        """
-        fillvalue = None
-        for n, v in h5obj.attrs.items():
-            if n == "_FillValue":
-                if isinstance(v, np.ndarray) and v.size == 1:
-                    fillvalue = v.item()
-                else:
-                    fillvalue = v
-                fillvalue = FillValueCoder.encode(fillvalue, h5obj.dtype)  # type: ignore[arg-type]
-        return fillvalue
-
     @staticmethod
     def _extract_attrs(h5obj: Union[H5Dataset, H5Group]):
         """
@@ -279,14 +240,14 @@ def _extract_attrs(h5obj: Union[H5Dataset, H5Group]):
         for n, v in h5obj.attrs.items():
             if n in _HIDDEN_ATTRS:
                 continue
-            if n == "_FillValue":
-                continue
             # Fix some attribute values to avoid JSON encoding exceptions...
             if isinstance(v, bytes):
                 v = v.decode("utf-8") or " "
             elif isinstance(v, (np.ndarray, np.number, np.bool_)):
                 if v.dtype.kind == "S":
                     v = v.astype(str)
+                if n == "_FillValue":
+                    continue
                 elif v.size == 1:
                     v = v.flatten()[0]
             if isinstance(v, (np.ndarray, np.number, np.bool_)):
@@ -297,6 +258,7 @@ def _extract_attrs(h5obj: Union[H5Dataset, H5Group]):
                 v = ""
             if v == "DIMENSION_SCALE":
                 continue
+
             attrs[n] = v
         return attrs
 
@@ -328,19 +290,21 @@ def _dataset_to_variable(
         codecs = codecs_from_dataset(dataset)
         cfcodec = cfcodec_from_dataset(dataset)
         attrs = HDFVirtualBackend._extract_attrs(dataset)
-        cf_fill_value = HDFVirtualBackend._extract_cf_fill_value(dataset)
-        attrs.pop("_FillValue", None)
-
         if cfcodec:
             codecs.insert(0, cfcodec["codec"])
             dtype = cfcodec["target_dtype"]
             attrs.pop("scale_factor", None)
             attrs.pop("add_offset", None)
+            fill_value = cfcodec["codec"].decode(dataset.fillvalue)
         else:
             dtype = dataset.dtype
-
-        fill_value = dataset.fillvalue.item()
-
+            fill_value = dataset.fillvalue
+        if isinstance(fill_value, np.ndarray):
+            fill_value = fill_value[0]
+        if np.isnan(fill_value):
+            fill_value = float("nan")
+        if isinstance(fill_value, np.generic):
+            fill_value = fill_value.item()
         filters = [codec.get_config() for codec in codecs]
         zarray = ZArray(
             chunks=chunks,  # type: ignore
@@ -359,8 +323,6 @@ def _dataset_to_variable(
             variable = xr.Variable(data=marray, dims=dims, attrs=attrs)
         else:
             variable = xr.Variable(data=np.empty(dataset.shape), dims=dims, attrs=attrs)
-        if cf_fill_value is not None:
-            variable.encoding["_FillValue"] = cf_fill_value
         return variable
 
     @staticmethod
```

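The replacement fill-value handling in `_dataset_to_variable` can be exercised in isolation. Below is a hypothetical standalone helper, not part of VirtualiZarr, that mirrors the three normalization steps the commit adds; unlike the committed code it guards the `np.isnan` call so that non-numeric fill values (strings, bytes) pass through instead of raising:

```python
import numpy as np


def normalize_fill_value(fill_value):
    # Step 1: a _FillValue stored as a 1-element ndarray, e.g. array([nan]),
    # is collapsed to its single element.
    if isinstance(fill_value, np.ndarray):
        fill_value = fill_value[0]
    # Step 2: NaN of any float width becomes the canonical Python float nan.
    if isinstance(fill_value, (float, np.floating)) and np.isnan(fill_value):
        return float("nan")
    # Step 3: any remaining NumPy scalar (np.int64, np.float32, ...) becomes
    # a plain Python scalar via .item().
    if isinstance(fill_value, np.generic):
        return fill_value.item()
    return fill_value


print(normalize_fill_value(np.int64(42)))        # 42
print(normalize_fill_value(np.array([np.nan])))  # nan
```

The point of the normalization is that the resulting `fill_value` is a plain Python scalar, which serializes cleanly into the Zarr array metadata instead of carrying a NumPy type or array wrapper.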
virtualizarr/tests/test_readers/conftest.py

Lines changed: 0 additions & 61 deletions

```diff
@@ -342,64 +342,3 @@ def non_coord_dim(tmpdir):
     ds = ds.drop_dims("dim3")
     ds.to_netcdf(filepath, engine="netcdf4")
     return filepath
-
-
-@pytest.fixture
-def scalar_fill_value_hdf5_file(tmpdir):
-    filepath = f"{tmpdir}/scalar_fill_value.nc"
-    f = h5py.File(filepath, "w")
-    data = np.random.randint(0, 10, size=(5))
-    fill_value = 42
-    f.create_dataset(name="data", data=data, chunks=True, fillvalue=fill_value)
-    return filepath
-
-
-compound_dtype = np.dtype(
-    [
-        ("id", "i4"),  # 4-byte integer
-        ("temperature", "f4"),  # 4-byte float
-    ]
-)
-
-compound_data = np.array(
-    [
-        (1, 98.6),
-        (2, 101.3),
-    ],
-    dtype=compound_dtype,
-)
-
-compound_fill = (-9999, -9999.0)
-
-fill_values = [
-    {"fill_value": -9999, "data": np.random.randint(0, 10, size=(5))},
-    {"fill_value": -9999.0, "data": np.random.random(5)},
-    {"fill_value": np.nan, "data": np.random.random(5)},
-    {"fill_value": False, "data": np.array([True, False, False, True, True])},
-    {"fill_value": "NaN", "data": np.array(["three"], dtype="S10")},
-    {"fill_value": compound_fill, "data": compound_data},
-]
-
-
-@pytest.fixture(params=fill_values)
-def cf_fill_value_hdf5_file(tmpdir, request):
-    filepath = f"{tmpdir}/cf_fill_value.nc"
-    f = h5py.File(filepath, "w")
-    dset = f.create_dataset(name="data", data=request.param["data"], chunks=True)
-    dim_scale = f.create_dataset(
-        name="dim_scale", data=request.param["data"], chunks=True
-    )
-    dim_scale.make_scale()
-    dset.dims[0].attach_scale(dim_scale)
-    dset.attrs["_FillValue"] = request.param["fill_value"]
-    return filepath
-
-
-@pytest.fixture
-def cf_array_fill_value_hdf5_file(tmpdir):
-    filepath = f"{tmpdir}/cf_array_fill_value.nc"
-    f = h5py.File(filepath, "w")
-    data = np.random.random(5)
-    dset = f.create_dataset(name="data", data=data, chunks=True)
-    dset.attrs["_FillValue"] = np.array([np.nan])
-    return filepath
```

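Among the fill values the deleted fixtures parametrized over, the compound (structured) dtype is the case the removed tests marked as xfail. A small standalone NumPy sketch of why that case is awkward — one plausible illustration (my assumption, not stated in the commit) is that a structured fill value is naturally a tuple of fields, which a JSON round-trip through Zarr metadata would flatten to a list:

```python
import numpy as np

# Same structured dtype the removed fixture used: one int field, one float field.
compound_dtype = np.dtype([("id", "i4"), ("temperature", "f4")])

# Build a 0-d structured array holding the fill value for both fields.
fill = np.zeros((), dtype=compound_dtype)
fill["id"] = -9999
fill["temperature"] = -9999.0

# .item() yields the natural Python representation: a tuple of field values.
# JSON has no tuple type, so a metadata round-trip would come back as a list.
as_json_roundtrip = list(fill.item())
print(as_json_roundtrip)  # [-9999, -9999.0]
```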
virtualizarr/tests/test_readers/test_hdf/test_hdf.py

Lines changed: 0 additions & 31 deletions

```diff
@@ -1,7 +1,6 @@
 from unittest.mock import patch
 
 import h5py  # type: ignore
-import numpy as np
 import pytest
 
 from virtualizarr import open_virtual_dataset
@@ -112,36 +111,6 @@ def test_dataset_attributes(self, string_attributes_hdf5_file):
         )
         assert var.attrs["attribute_name"] == "attribute_name"
 
-    def test_scalar_fill_value(self, scalar_fill_value_hdf5_file):
-        f = h5py.File(scalar_fill_value_hdf5_file)
-        ds = f["data"]
-        var = HDFVirtualBackend._dataset_to_variable(
-            scalar_fill_value_hdf5_file, ds, group=""
-        )
-        assert var.data.zarray.fill_value == 42
-
-    def test_cf_fill_value(self, cf_fill_value_hdf5_file):
-        f = h5py.File(cf_fill_value_hdf5_file)
-        ds = f["data"]
-        if ds.dtype.kind in "S":
-            pytest.xfail("Investigate fixed-length binary encoding in Zarr v3")
-        if ds.dtype.names:
-            pytest.xfail(
-                "To fix, structured dtype fill value encoding for Zarr backend"
-            )
-        var = HDFVirtualBackend._dataset_to_variable(
-            cf_fill_value_hdf5_file, ds, group=""
-        )
-        assert "_FillValue" in var.encoding
-
-    def test_cf_array_fill_value(self, cf_array_fill_value_hdf5_file):
-        f = h5py.File(cf_array_fill_value_hdf5_file)
-        ds = f["data"]
-        var = HDFVirtualBackend._dataset_to_variable(
-            cf_array_fill_value_hdf5_file, ds, group=""
-        )
-        assert not isinstance(var.encoding["_FillValue"], np.ndarray)
-
 
 @requires_hdf5plugin
 @requires_imagecodecs
```

virtualizarr/tests/test_readers/test_hdf/test_hdf_integration.py

Lines changed: 1 addition & 22 deletions

```diff
@@ -6,11 +6,9 @@
 from virtualizarr.readers.hdf import HDFVirtualBackend
 from virtualizarr.tests import (
     requires_hdf5plugin,
-    requires_icechunk,
     requires_imagecodecs,
     requires_kerchunk,
 )
-from virtualizarr.tests.test_integration import roundtrip_as_in_memory_icechunk
 
 
 @requires_kerchunk
@@ -55,12 +53,8 @@ def test_filter_and_cf_roundtrip(self, tmpdir, filter_and_cf_roundtrip_hdf5_file
         vds.virtualize.to_kerchunk(kerchunk_file, format="json")
         roundtrip = xr.open_dataset(kerchunk_file, engine="kerchunk")
         xrt.assert_allclose(ds, roundtrip)
-        assert (
-            ds["temperature"].encoding["_FillValue"]
-            == roundtrip["temperature"].encoding["_FillValue"]
-        )
 
-    def test_non_coord_dim_roundtrip(self, tmpdir, non_coord_dim):
+    def test_non_coord_dim(self, tmpdir, non_coord_dim):
         ds = xr.open_dataset(non_coord_dim)
         vds = virtualizarr.open_virtual_dataset(
             non_coord_dim, backend=HDFVirtualBackend
@@ -69,18 +63,3 @@ def test_non_coord_dim_roundtrip(self, tmpdir, non_coord_dim):
         vds.virtualize.to_kerchunk(kerchunk_file, format="json")
         roundtrip = xr.open_dataset(kerchunk_file, engine="kerchunk")
         xrt.assert_equal(ds, roundtrip)
-
-    @requires_icechunk
-    def test_cf_fill_value_roundtrip(self, tmpdir, cf_fill_value_hdf5_file):
-        ds = xr.open_dataset(cf_fill_value_hdf5_file, engine="h5netcdf")
-        if ds["data"].dtype in [float, object]:
-            pytest.xfail(
-                "To fix handle fixed-length and structured type fill value \
-                encoding in xarray zarr backend."
-            )
-        vds = virtualizarr.open_virtual_dataset(
-            cf_fill_value_hdf5_file,
-            backend=HDFVirtualBackend,
-        )
-        roundtrip = roundtrip_as_in_memory_icechunk(vds, tmpdir, decode_times=False)
-        xrt.assert_equal(ds, roundtrip)
```

0 commit comments