Non-kerchunk backend for HDF5/netcdf4 files. #87
Merged
Changes from all commits (106 commits)
6b7abe2 Generate chunk manifest backed variable from HDF5 dataset. (sharkinsspatial)
bca0aab Transfer dataset attrs to variable. (sharkinsspatial)
384ff6b Get virtual variables dict from HDF5 file. (sharkinsspatial)
4c5f9bd Update virtual_vars_from_hdf to use fsspec and drop_variables arg. (sharkinsspatial)
1dd3370 mypy fix to use ChunkKey and empty dimensions list. (sharkinsspatial)
d92c75c Extract attributes from hdf5 root group. (sharkinsspatial)
0ed8362 Use hdf reader for netcdf4 files. (sharkinsspatial)
f4485fa [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
3cc1254 Merge branch 'main' into hdf5_reader (sharkinsspatial)
0123df7 Fix ruff complaints. (sharkinsspatial)
332bcaa First steps for handling HDF5 filters. (sharkinsspatial)
c51e615 Initial step for hdf5plugin supported codecs. (sharkinsspatial)
0083f77 Small commit to check compression support in CI environment. (sharkinsspatial)
3c00071 Merge branch 'main' into hdf5_reader (sharkinsspatial)
207c4b5 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
c573800 Fix mypy complaints for hdf_filters. (sharkinsspatial)
ef0d7a8 Merge branch 'hdf5_reader' of https://github.com/TomNicholas/Virtuali… (sharkinsspatial)
588e06b Local pre-commit fix for hdf_filters. (sharkinsspatial)
725333e Use fsspec reader_options introduced in #37. (sharkinsspatial)
72df108 Fix incorrect zarr_v3 if block position from merge commit ef0d7a8. (sharkinsspatial)
d1e85cb Fix early return from hdf _extract_attrs. (sharkinsspatial)
1e2b343 Test that _extract_attrs correctly handles multiple attributes. (sharkinsspatial)
7f1c189 Initial attempt at scale and offset via numcodecs. (sharkinsspatial)
908e332 Tests for cfcodec_from_dataset. (sharkinsspatial)
0df332d Temporarily relax integration tests to assert_allclose. (sharkinsspatial)
ca6b236 Add blosc_lz4 fixture parameterization to confirm libnetcdf environment. (sharkinsspatial)
b7426c5 Check for compatibility with netcdf4 engine. (sharkinsspatial)
dac21dd Use separate fixtures for h5netcdf and netcdf4 compression styles. (sharkinsspatial)
e968772 Print libhdf5 and libnetcdf4 versions to confirm compiled environment. (sharkinsspatial)
9a98e57 Skip netcdf4 style compression tests when libhdf5 < 1.14. (sharkinsspatial)
7590b87 Include imagecodecs.numcodecs to support HDF5 lzf filters. (sharkinsspatial)
e9fbc8a Merge branch 'main' into hdf5_reader (sharkinsspatial)
14bd709 Remove test that verifies call to read_kerchunk_references_from_file. (sharkinsspatial)
acdf0d7 Add additional codec support structures for imagecodecs and numcodecs. (sharkinsspatial)
4ba323a Add codec config test for Zstd. (sharkinsspatial)
e14e53b Include initial cf decoding tests. (sharkinsspatial)
b808ded Merge branch 'main' into hdf5_reader (sharkinsspatial)
b052f8c Revert typo for scale_factor retrieval. (sharkinsspatial)
01a3980 Update reader to use new numpy manifest representation. (sharkinsspatial)
c37d9e5 Temporarily skip test until blosc netcdf4 issue is solved. (sharkinsspatial)
17b30d4 Fix Pydantic 2 migration warnings. (sharkinsspatial)
f6b596a Include hdf5plugin and imagecodecs-numcodecs in mamba test environment. (sharkinsspatial)
eb6e24d Mamba attempt with imagecodecs rather than imagecodecs-numcodecs. (sharkinsspatial)
c85bd16 Mamba attempt with latest imagecodecs release. (sharkinsspatial)
ca435da Use correct iter_chunks callback function signature. (sharkinsspatial)
3017951 Include pip based imagecodecs-numcodecs until conda-forge availability. (sharkinsspatial)
ccf0b73 Merge branch 'main' into hdf5_reader (sharkinsspatial)
32ba135 Handle non-coordinate dims which are serialized to hdf as empty dataset. (sharkinsspatial)
64f446c Use reader_options for filetype check and update failing kerchunk call. (sharkinsspatial)
1c590bb Merge branch 'main' into hdf5_reader (sharkinsspatial)
9797346 Fix chunkmanifest shaping for chunked datasets. (sharkinsspatial)
c833e19 Handle scale_factor attribute serialization for compressed files. (sharkinsspatial)
701bcfa Include chunked roundtrip fixture. (sharkinsspatial)
08c988e Standardize xarray integration tests for hdf filters. (sharkinsspatial)
e6076bd Merge branch 'hdf5_reader' of https://github.com/TomNicholas/Virtuali… (sharkinsspatial)
d684a84 Merge branch 'main' into hdf5_reader (sharkinsspatial)
4cb4bac Update reader selection logic for new filetype determination. (sharkinsspatial)
d352104 Use decode_times for integration test. (sharkinsspatial)
3d89ea4 Standardize fixture names for hdf5 vs netcdf4 file types. (sharkinsspatial)
c9dd0d9 Handle array add_offset property for compressed data. (sharkinsspatial)
db5b421 Include h5py shuffle filter. (sharkinsspatial)
9a1da32 Make ScaleAndOffset codec last in filters list. (sharkinsspatial)
9b2b0f8 Apply ScaleAndOffset codec to _FillValue since its value is now down… (sharkinsspatial)
9ef1362 Coerce scale and add_offset values to native float for JSON serializa… (sharkinsspatial)
30005bd Merge branch 'main' into hdf5_reader (sharkinsspatial)
14f7a99 Merge branch 'main' into hdf5_reader (sharkinsspatial)
f4f9c8f Temporarily xfail integration tests for main (sharkinsspatial)
d257cb9 Merge branch 'main' into hdf5_reader (sharkinsspatial)
e795c2c Merge branch 'main' into hdf5_reader (sharkinsspatial)
a9e59f2 Remove pydantic dependency as per pull/210. (sharkinsspatial)
2b33bc2 Update test for new kerchunk reader module location. (sharkinsspatial)
a57ae9e Fix branch typing errors. (sharkinsspatial)
e21fc69 Re-include automatic file type determination. (sharkinsspatial)
df69a12 Handle various hdf flavors of _FillValue storage. (sharkinsspatial)
169337c Include loadable variables in drop variables list. (sharkinsspatial)
bdcbfbf Mock readers.hdf.virtual_vars_from_hdf to verify option passing. (sharkinsspatial)
77f1689 Convert numpy _FillValue to native Python for serialization support. (sharkinsspatial)
42c653a Support groups with HDF5 reader. (sharkinsspatial)
9c86e0d Handle empty variables with a shape. (sharkinsspatial)
001a4a7 Merge branch 'main' into hdf5_reader (sharkinsspatial)
79f9921 Merge branch 'main' into hdf5_reader (sharkinsspatial)
1589776 Import top-level version of xarray classes. (sharkinsspatial)
772c580 Add option to explicitly specify use of an experimental hdf backend. (sharkinsspatial)
3ab90c6 Include imagecodecs and hdf5plugin in all CI environments. (sharkinsspatial)
150d06d Add test_hdf_integration tests to be skipped for non-kerchunk env. (sharkinsspatial)
8ccba34 Include imagecodecs in dependencies. (sharkinsspatial)
81874e0 Diagnose imagecodecs-numcodecs installation failures in CI. (sharkinsspatial)
f87abe2 Ignore mypy complaints for VirtualBackend. (sharkinsspatial)
70e7e29 Remove checksum assert which varies across different zstd versions. (sharkinsspatial)
43bc0e4 Temporarily xfail integration tests with coordinate inconsistency. (sharkinsspatial)
82a6321 Remove backend arg for non-hdf network file tests. (sharkinsspatial)
b34f260 Fix mypy comment moved by ruff formatting. (sharkinsspatial)
f9ead06 Make HDF reader dependencies optional. (sharkinsspatial)
5608292 Handle optional imagecodecs and hdf5plugin dependency imports for tests. (sharkinsspatial)
2fa548c Prevent conflicts with explicit filetype and backend args. (sharkinsspatial)
bc0d925 Correctly convert root coordinate attributes to a list. (sharkinsspatial)
783df94 Clarify that method extracts attrs from any specified group. (sharkinsspatial)
16f288b Restructure hdf reader and codec filters into a module namespace. (sharkinsspatial)
3e216dc Improve docstrings for hdf and filter modules. (sharkinsspatial)
5b085a6 Explicitly specify HDF5VirtualBackend for test parameter. (sharkinsspatial)
83ff577 Include issue references for xfailed tests. (sharkinsspatial)
ee6fa0b Use soft import strategy for optional dependencies see xarray/issues/… (sharkinsspatial)
44bce08 Merge branch 'main' into hdf5_reader (sharkinsspatial)
5de9d2c Handle mypy for soft imports. (sharkinsspatial)
a8cc82f Attempt at nested optional dependency usage. (sharkinsspatial)
65a6b14 Handle use of soft import sub modules for typing. (sharkinsspatial)
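The commit history above centers on one idea: instead of copying data the way kerchunk does, the reader builds a chunk manifest that points at byte ranges inside the original HDF5 file. The sketch below is a hypothetical illustration of that manifest concept, not the PR's actual code; `chunk_keys`, the example shapes, and the `file.nc` entries are all invented for the demonstration.

```python
from itertools import product

def chunk_keys(array_shape, chunk_shape):
    """Enumerate zarr-style chunk keys like '0.0', '0.1', ... for an array."""
    # Ceil division gives the number of chunks along each dimension.
    counts = [-(-size // chunk) for size, chunk in zip(array_shape, chunk_shape)]
    return [".".join(map(str, idx)) for idx in product(*(range(n) for n in counts))]

# A 20x30 array with 10x10 chunks has a 2x3 grid of chunks.
keys = chunk_keys((20, 30), (10, 10))

# Each key maps to a reference into the original file rather than to bytes:
manifest = {key: {"path": "file.nc", "offset": 0, "length": 800} for key in keys}
```

Because the manifest stores only references, "virtualizing" a large archive touches metadata and chunk offsets, never the array payload itself.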
@@ -3,7 +3,6 @@ channels:
- conda-forge
- nodefaults
dependencies:
- h5netcdf
- h5py
- hdf5
- netcdf4
@@ -0,0 +1,11 @@
from .hdf import (
    HDFVirtualBackend,
    construct_virtual_dataset,
    open_loadable_vars_and_indexes,
)

__all__ = [
    "HDFVirtualBackend",
    "construct_virtual_dataset",
    "open_loadable_vars_and_indexes",
]
@@ -0,0 +1,195 @@
import dataclasses
from typing import TYPE_CHECKING, List, Tuple, TypedDict, Union

import numcodecs.registry as registry
import numpy as np
from numcodecs.abc import Codec
from numcodecs.fixedscaleoffset import FixedScaleOffset
from xarray.coding.variables import _choose_float_dtype

from virtualizarr.utils import soft_import

if TYPE_CHECKING:
    import h5py  # type: ignore
    from h5py import Dataset, Group  # type: ignore

h5py = soft_import("h5py", "For reading hdf files", strict=False)
if h5py:
    Dataset = h5py.Dataset
    Group = h5py.Group
else:
    Dataset = dict()
    Group = dict()

hdf5plugin = soft_import(
    "hdf5plugin", "For reading hdf files with filters", strict=False
)
imagecodecs = soft_import(
    "imagecodecs", "For reading hdf files with filters", strict=False
)


_non_standard_filters = {
    "gzip": "zlib",
    "lzf": "imagecodecs_lzf",
}

_hdf5plugin_imagecodecs = {"lz4": "imagecodecs_lz4h5", "bzip2": "imagecodecs_bz2"}
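The module above relies on `soft_import` so that h5py, hdf5plugin, and imagecodecs remain optional dependencies. A minimal sketch of that pattern, assuming only the standard library (the real helper lives in `virtualizarr.utils` and its exact behavior may differ):

```python
import importlib

def soft_import_sketch(name, reason, strict=False):
    """Return a module if importable, else None (or raise when strict)."""
    try:
        return importlib.import_module(name)
    except ImportError:
        if strict:
            raise ImportError(f"{name} is required: {reason}")
        return None

# A stdlib module imports normally; a missing one falls back to None.
math_mod = soft_import_sketch("math", "for demonstration")
missing = soft_import_sketch("surely_not_a_real_module_xyz", "for demonstration")
```

Downstream code can then branch on truthiness, exactly as the `if h5py:` block above does when choosing real `Dataset`/`Group` classes or placeholders.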

@dataclasses.dataclass
class BloscProperties:
    blocksize: int
    clevel: int
    shuffle: int
    cname: str

    def __post_init__(self):
        blosc_compressor_codes = {
            value: key
            for key, value in hdf5plugin._filters.Blosc._Blosc__COMPRESSIONS.items()
        }
        self.cname = blosc_compressor_codes[self.cname]


@dataclasses.dataclass
class ZstdProperties:
    level: int


@dataclasses.dataclass
class ShuffleProperties:
    elementsize: int


@dataclasses.dataclass
class ZlibProperties:
    level: int


class CFCodec(TypedDict):
    target_dtype: np.dtype
    codec: Codec
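These small property dataclasses exist so that filter parameters can be turned into numcodecs config dicts mechanically. A minimal sketch of the mechanism, using a hypothetical `ZlibPropsSketch` stand-in rather than the module's own class:

```python
import dataclasses

@dataclasses.dataclass
class ZlibPropsSketch:
    """Hypothetical stand-in for ZlibProperties above."""
    level: int

# dataclasses.asdict() yields the keyword fields; adding "id" completes
# the dict shape that numcodecs' registry expects.
conf = dataclasses.asdict(ZlibPropsSketch(level=4))
conf["id"] = "zlib"
```

Keeping one dataclass per filter family means the field names and order double as documentation of each filter's HDF5 property layout.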

def _filter_to_codec(
    filter_id: str, filter_properties: Union[int, None, Tuple] = None
) -> Codec:
    """
    Convert an h5py filter to an equivalent numcodecs codec.

    Parameters
    ----------
    filter_id: str
        An h5py filter id code.
    filter_properties : int or None or Tuple
        A single or Tuple of h5py filter configuration codes.

    Returns
    -------
    A numcodecs codec
    """
    id_int = None
    id_str = None
    try:
        id_int = int(filter_id)
    except ValueError:
        id_str = filter_id
    conf = {}
    if id_str:
        if id_str in _non_standard_filters.keys():
            id = _non_standard_filters[id_str]
        else:
            id = id_str
        if id == "zlib":
            zlib_props = ZlibProperties(level=filter_properties)  # type: ignore
            conf = dataclasses.asdict(zlib_props)
        if id == "shuffle" and isinstance(filter_properties, tuple):
            shuffle_props = ShuffleProperties(elementsize=filter_properties[0])
            conf = dataclasses.asdict(shuffle_props)
        conf["id"] = id  # type: ignore[assignment]
    if id_int:
        filter = hdf5plugin.get_filters(id_int)[0]
        id = filter.filter_name
        if id in _hdf5plugin_imagecodecs.keys():
            id = _hdf5plugin_imagecodecs[id]
        if id == "blosc" and isinstance(filter_properties, tuple):
            blosc_fields = [field.name for field in dataclasses.fields(BloscProperties)]
            blosc_props = BloscProperties(
                **{k: v for k, v in zip(blosc_fields, filter_properties[-4:])}
            )
            conf = dataclasses.asdict(blosc_props)
        if id == "zstd" and isinstance(filter_properties, tuple):
            zstd_props = ZstdProperties(level=filter_properties[0])
            conf = dataclasses.asdict(zstd_props)
        conf["id"] = id
    codec = registry.get_codec(conf)
    return codec
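The first branch of `_filter_to_codec` is essentially a name translation: HDF5 filter names that numcodecs spells differently get remapped, everything else passes through. A simplified, self-contained mirror of that step (the helper name `resolve_filter_name` is invented for this sketch):

```python
# Same mapping as the module-level _non_standard_filters dict above.
_non_standard = {"gzip": "zlib", "lzf": "imagecodecs_lzf"}

def resolve_filter_name(filter_id):
    """Translate an HDF5 filter name to its numcodecs codec id."""
    return _non_standard.get(filter_id, filter_id)

gzip_name = resolve_filter_name("gzip")      # remapped
shuffle_name = resolve_filter_name("shuffle")  # passed through unchanged
```

Integer filter ids take the other branch, where hdf5plugin resolves the registered filter name before the same kind of remapping is applied.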

def cfcodec_from_dataset(dataset: Dataset) -> CFCodec | None:
    """
    Converts select h5py dataset CF convention attrs to a CFCodec.

    Parameters
    ----------
    dataset: h5py.Dataset
        An h5py dataset.

    Returns
    -------
    CFCodec
        A CFCodec.
    """
    attributes = {attr: dataset.attrs[attr] for attr in dataset.attrs}
    mapping = {}
    if "scale_factor" in attributes:
        try:
            scale_factor = attributes["scale_factor"][0]
        except IndexError:
            scale_factor = attributes["scale_factor"]
        mapping["scale_factor"] = float(1 / scale_factor)
    else:
        mapping["scale_factor"] = 1
    if "add_offset" in attributes:
        try:
            offset = attributes["add_offset"][0]
        except IndexError:
            offset = attributes["add_offset"]
        mapping["add_offset"] = float(offset)
    else:
        mapping["add_offset"] = 0
    if mapping["scale_factor"] != 1 or mapping["add_offset"] != 0:
        float_dtype = _choose_float_dtype(dtype=dataset.dtype, mapping=mapping)
        target_dtype = np.dtype(float_dtype)
        codec = FixedScaleOffset(
            offset=mapping["add_offset"],
            scale=mapping["scale_factor"],
            dtype=target_dtype,
            astype=dataset.dtype,
        )
        cfcodec = CFCodec(target_dtype=target_dtype, codec=codec)
        return cfcodec
    else:
        return None
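The CF packing convention that `cfcodec_from_dataset` undoes is plain arithmetic: a physical value is stored as a small integer via `physical = packed * scale_factor + add_offset`. A worked example with illustrative values (the numbers below are invented, not taken from any test in this PR):

```python
# CF attributes as they might appear on a temperature variable.
scale_factor = 0.01
add_offset = 273.15

# Decoding: packed integer -> physical value.
packed = 1234
physical = packed * scale_factor + add_offset  # 285.49

# FixedScaleOffset is configured with scale = 1 / scale_factor and
# offset = add_offset, so encoding rounds (physical - offset) * scale.
scale = 1 / scale_factor
reencoded = round((physical - add_offset) * scale)
```

This is why the function stores `float(1 / scale_factor)` in the mapping: numcodecs' `FixedScaleOffset` multiplies by its `scale` on encode, the inverse of the CF decode step.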

def codecs_from_dataset(dataset: Dataset) -> List[Codec]:
    """
    Extracts a list of numcodecs from an h5py dataset

    Parameters
    ----------
    dataset: h5py.Dataset
        An h5py dataset.

    Returns
    -------
    list
        A list of numcodecs codecs.
    """
    codecs = []
    for filter_id, filter_properties in dataset._filters.items():
        codec = _filter_to_codec(filter_id, filter_properties)
        codecs.append(codec)
    return codecs
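`codecs_from_dataset` walks the dataset's filter pipeline in order and converts each entry. A self-contained sketch of that loop shape, using a hypothetical dict in place of h5py's `Dataset._filters` and hand-written configs in place of `_filter_to_codec` (so it runs without h5py installed):

```python
# Hypothetical stand-in for the (filter_id -> properties) mapping that
# h5py exposes; real values come from the file's filter pipeline.
fake_filters = {"gzip": 4, "shuffle": (8,)}

def codec_confs(filters):
    """Build numcodecs-style config dicts, one per filter, in pipeline order."""
    confs = []
    for filter_id, props in filters.items():
        if filter_id == "gzip":
            confs.append({"id": "zlib", "level": props})
        elif filter_id == "shuffle":
            confs.append({"id": "shuffle", "elementsize": props[0]})
    return confs

confs = codec_confs(fake_filters)
```

Preserving pipeline order matters: decoding must apply the codecs in the reverse of the order HDF5 applied them when writing.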