Zarr reader #271
Merged
170 commits
26a94df
wip toward zarr v2 reader
norlandrhagen cfb7b8d
removed _ARRAY_DIMENSIONS and trimmed down attrs
norlandrhagen 2f26f03
WIP for zarr reader
norlandrhagen eab87a6
adding in the key piece, the reader
norlandrhagen 13db375
virtual dataset is returned! Now to deal with fill_value
norlandrhagen cc30ad7
Merge branch 'main' into zarr_reader
norlandrhagen a047ff9
Update virtualizarr/readers/zarr.py
norlandrhagen 072bead
Merge branch 'zarr_reader' of https://github.com/zarr-developers/Virt…
norlandrhagen f7c9a3f
replace fsspec ls with zarr.getsize
norlandrhagen 2024606
lint
norlandrhagen 443435b
wip test_zarr
norlandrhagen 50fd8b5
removed pdb
norlandrhagen d93c932
zarr import in type checking
norlandrhagen 39be1c5
moved get_chunk_paths & get_chunk_size async funcs outside of constru…
norlandrhagen e718240
added a few notes from PR review.
norlandrhagen bbcd473
removed array encoding
norlandrhagen ed9f2b4
v2 passing, v3 skipped for now
norlandrhagen db89da7
added missed staged files
norlandrhagen e3d4318
fixed merge conflicts with main
norlandrhagen 410b2a3
missing return
norlandrhagen 8a69963
add network
norlandrhagen 3fca8e6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 34053b0
conftest fix
norlandrhagen 5c26b1f
naming
norlandrhagen fb784dc
comment out integration test for now
norlandrhagen 0444fd4
refactored test_dataset_from_zarr ZArray tests
norlandrhagen 66fd456
adds zarr v3 req opt
norlandrhagen 13fce09
zarr_v3 decorator
norlandrhagen c36962d
add more tests
norlandrhagen 4be4906
wip
norlandrhagen ca5ff32
adds missing await
norlandrhagen 88cbeca
more tests
norlandrhagen 1fbdc9c
wip
norlandrhagen 370621f
wip on v3
norlandrhagen 9bb0653
add note + xfail v3
norlandrhagen 7e03ea5
tmp run network
norlandrhagen 5c1e331
revert
norlandrhagen 9404625
update construct_virtual_array ordering
norlandrhagen 1a5a960
merge
norlandrhagen cc7d68c
updated ABC after merge
norlandrhagen ac105ea
wip
norlandrhagen 7b57bd0
Merge branch 'main' into zarr_reader
norlandrhagen ff01c92
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 4f2470a
working for v2 and v3, but only local
norlandrhagen 0c1ff82
merge
norlandrhagen 05d4050
cleanup test_zarr reader test
norlandrhagen f40ba28
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] b5fb802
cleanup after zarr-python issue report
norlandrhagen be5280f
Merge branch 'zarr_reader' of https://github.com/zarr-developers/Virt…
norlandrhagen 690ffee
temp disabled validate_and_normalize_path_to_uri due to issue in zarr…
norlandrhagen 98600e7
Merge branch 'main' into zarr_reader
norlandrhagen 31a1b94
marked zarr integration test skipped b/c of zarr-v3 and kerchunk inco…
norlandrhagen 795c428
fixes some async behavior, reading from s3 seems to work
norlandrhagen c0004c6
lint + uri_fmt
norlandrhagen 60b8912
adds to releases.rst
norlandrhagen 8240997
nit
norlandrhagen 816e696
cleanup, comments and nits
norlandrhagen 31aacf9
progress on mypy
norlandrhagen 5d14b20
make mypy happy
norlandrhagen fb844b6
adds option for AsyncArray to _is_zarr_array
norlandrhagen 421f53f
big async rewrite
norlandrhagen cedad11
merge w/ main
norlandrhagen 1c5e42d
fixes merge conflict
norlandrhagen 89d8555
bit of restructure
norlandrhagen c1a5218
nit
norlandrhagen 6af84b4
WIP on ChunkManifest.from_arrays
norlandrhagen 349386f
v2/v3 c chunk fix + build ChunkManifest from numpy arrays
norlandrhagen c776ab9
removed method of creating ChunkManifests from dicts
norlandrhagen fb6fff7
cleanup
norlandrhagen 87c74d4
adds xfails to TestOpenVirtualDatasetZarr due to local filesystem zar…
norlandrhagen 9e44a8a
Merge branch 'main' into zarr_reader
norlandrhagen 87dbdae
some nits after merging w/ main
norlandrhagen 855fb5a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 29434f1
updates zarr v3 req
norlandrhagen 0dcfc91
Merge branch 'zarr_reader' of https://github.com/zarr-developers/Virt…
norlandrhagen 5f7040c
lint
norlandrhagen d3b0a92
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 4716114
remove build_chunk_manifest_from_dict_mapping function since manifest…
norlandrhagen 32f7060
Merge branch 'zarr_reader' of https://github.com/zarr-developers/Virt…
norlandrhagen dc6e6f8
tmp ignore lint
norlandrhagen 9db4339
remove zarr fill_value skip
norlandrhagen 4e0fb99
fixes network req import in test_integration
norlandrhagen 72ae8b0
bump xarray to 2025.1.1 and icechunk to 0.1.0a10 in upstream
norlandrhagen 177f2cf
merge w/ dep bump
norlandrhagen d61e593
move zarr import into type checking
norlandrhagen 9edf706
move zarr import in test_zarr
norlandrhagen 3e68537
adding back in missing nbytes property
norlandrhagen 594d4a8
typing
norlandrhagen 3c6dc54
tmp testing & removing old xfail
norlandrhagen dd20c8a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 9be9455
merge w/ main
norlandrhagen dac6c77
adds back in validate_and_normalize_path_to_uri after upstream zarr f…
norlandrhagen 3d230dc
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 62ffe02
removing kerchunk from zarr integration test
norlandrhagen 843c286
Merge branch 'zarr_reader' of https://github.com/zarr-developers/Virt…
norlandrhagen 231b703
removed zarr manifest + lint
norlandrhagen 0d4d653
wip on testing
norlandrhagen 7724969
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] a8c498a
revert min-deps change
norlandrhagen 9684dd3
merge
norlandrhagen 9fb201d
Merge branch 'zarr_reader' of https://github.com/zarr-developers/Virt…
norlandrhagen e310f04
revert environment.yaml
norlandrhagen 800a165
removed zarr manifest writing
norlandrhagen 6609ae3
Merge branch 'main' into zarr_reader
norlandrhagen ac538c7
cleanup and consolidation in zarr reader
norlandrhagen 3be76ff
typing
norlandrhagen a74047c
Merge branch 'main' into zarr_reader
norlandrhagen 1e60835
test_unsupported_zarr_python to zarr v3
norlandrhagen 5d91679
rel path issue?
norlandrhagen 5084adf
revert accidental icechunk commit
norlandrhagen 136cc2f
merge w/ main
norlandrhagen e7a36d7
wip on fixing codecs
norlandrhagen df5a19e
cleaup of tests + codecs
norlandrhagen 5f88589
Merge branch 'main' into zarr_reader
norlandrhagen efd0064
renived test_zarr writer
norlandrhagen 442e519
bumping icechunk for now
norlandrhagen 54308fb
typing lint
norlandrhagen 973f6b0
remove zarr writer test
norlandrhagen f1c6c7d
merge w/ main
norlandrhagen a17fb23
merge w/ develop branch
norlandrhagen 024b020
adds Zarr V2 reader not supported exception
norlandrhagen db4e617
updates usage and releases and lints upstream.yaml
norlandrhagen 1dc93c2
lint + clarified some todo/comments
norlandrhagen e45d953
quick nit, removed duplicated entry in ci
norlandrhagen 6bb11fe
removed some comments and reverted pyproject
norlandrhagen e7b0544
Merge branch 'develop' into zarr_reader
norlandrhagen 255ed37
pyproj de-dup
norlandrhagen c8d51c9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 42d79ef
util fpaht
norlandrhagen 05354ad
Merge branch 'zarr_reader' of https://github.com/zarr-developers/Virt…
norlandrhagen 001e09b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 9f8c6f9
adding test to check zarr key format in manifest
norlandrhagen 70b3796
Merge branch 'zarr_reader' of https://github.com/zarr-developers/Virt…
norlandrhagen 5df848f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 03d019c
switched Manifest creation back to dict
norlandrhagen f0d1a9c
merge
norlandrhagen db167eb
update zarr reader with merge
norlandrhagen 5f3ccc7
cleaned up zarr reader ArrayV3Metadata reading
norlandrhagen c55b905
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 9071e15
vendor cleanup
norlandrhagen 1541f14
Merge branch 'zarr_reader' of https://github.com/zarr-developers/Virt…
norlandrhagen 2ce1dff
merge
norlandrhagen b34ee09
merge w/ develop and update construct_virtual_dataset
norlandrhagen b07ec91
added _zstd_codec check in get_codec_config to fix numcodecs complaint
norlandrhagen 78e7f9d
mypy lint
norlandrhagen fdef913
mypy lint 2
norlandrhagen 6765a1d
lint
norlandrhagen 6df8f73
typing
norlandrhagen 864576e
adds check for filepath
norlandrhagen 7df8ecf
Merge branch 'develop' into zarr_reader
norlandrhagen 4349efd
spelling nit + revert hdf int
norlandrhagen 32c97dd
removed virtualizarr.zarr + cleanup nits
norlandrhagen e8c6244
cleanup + note
norlandrhagen 242e38b
updates docs/faq.md data table
norlandrhagen b951dcb
revert leading slash
norlandrhagen 5b6afd6
Merge branch 'develop' into zarr_reader
maxrjones a262c7d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 77074bd
Fix bad merge commit
maxrjones f28db33
Use ManifestStore in Zarr reader (#554)
maxrjones bdf4d20
filepath slash nit
norlandrhagen ad01521
Update docs/faq.md
norlandrhagen 4d0151e
Update virtualizarr/readers/zarr.py
norlandrhagen 243cd32
Update virtualizarr/readers/zarr.py
norlandrhagen dc2a266
Update virtualizarr/readers/zarr.py
norlandrhagen 1591412
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 7805033
adds back in todo
norlandrhagen 7d8f75a
adds wip test for scalar chunk testing
norlandrhagen ccc9279
adds test for scalar zarr + modifies get_chunk_mapping_prefix to acco…
norlandrhagen a238177
update localstore to memorystore
norlandrhagen 9f851d1
Merge branch 'develop' into zarr_reader
norlandrhagen
@@ -0,0 +1,196 @@

```python
from __future__ import annotations

import asyncio
from pathlib import Path  # noqa
from typing import (
    Any,
    Hashable,
    Iterable,
    Mapping,
    Optional,
)

import numpy as np
from xarray import Dataset, Index
from zarr.api.asynchronous import open_group as open_group_async
from zarr.core.metadata import ArrayV3Metadata

from virtualizarr.manifests import (
    ChunkManifest,
    ManifestArray,
    ManifestGroup,
    ManifestStore,
)
from virtualizarr.manifests.manifest import validate_and_normalize_path_to_uri  # noqa
from virtualizarr.readers.api import VirtualBackend
from virtualizarr.vendor.zarr.core.common import _concurrent_map

FillValueT = bool | str | float | int | list | None

ZARR_DEFAULT_FILL_VALUE: dict[str, FillValueT] = {
    # numpy dtypes's hierarchy lets us avoid checking for all the widths
    # https://numpy.org/doc/stable/reference/arrays.scalars.html
    np.dtype("bool").kind: False,
    np.dtype("int").kind: 0,
    np.dtype("float").kind: 0.0,
    np.dtype("complex").kind: [0.0, 0.0],
    np.dtype("datetime64").kind: 0,
}


import zarr


async def get_chunk_mapping_prefix(zarr_array: zarr.AsyncArray, filepath: str) -> dict:
    """Create a dictionary to pass into ChunkManifest __init__"""

    # TODO: For when we want to support reading V2 we should parse the /c/ and "/" between chunks
    if zarr_array.shape == ():
        # If we have a scalar array `c`
        # https://zarr-specs.readthedocs.io/en/latest/v3/chunk-key-encodings/default/index.html#description

        prefix = zarr_array.name.lstrip("/") + "/c"
        prefix_keys = [(prefix,)]
        _lengths = [await zarr_array.store.getsize("c")]
        _dict_keys = ["c"]
        _paths = [filepath + "/" + _dict_keys[0]]

    else:
        prefix = zarr_array.name.lstrip("/") + "/c/"
        prefix_keys = [(x,) async for x in zarr_array.store.list_prefix(prefix)]
        _lengths = await _concurrent_map(prefix_keys, zarr_array.store.getsize)
        chunk_keys = [x[0].split(prefix)[1] for x in prefix_keys]
        _dict_keys = [key.replace("/", ".") for key in chunk_keys]
        _paths = [filepath + "/" + prefix + key for key in chunk_keys]

    _offsets = [0] * len(_lengths)
    return {
        key: {"path": path, "offset": offset, "length": length}
        for key, path, offset, length in zip(
            _dict_keys,
            _paths,
            _offsets,
            _lengths,
        )
    }


async def build_chunk_manifest(
    zarr_array: zarr.AsyncArray, filepath: str
) -> ChunkManifest:
    """Build a ChunkManifest from a dictionary"""
    chunk_map = await get_chunk_mapping_prefix(zarr_array=zarr_array, filepath=filepath)
    return ChunkManifest(chunk_map)


def get_metadata(zarr_array: zarr.AsyncArray[Any]) -> ArrayV3Metadata:
    fill_value = zarr_array.metadata.fill_value
    if fill_value is not None:
        fill_value = ZARR_DEFAULT_FILL_VALUE[zarr_array.metadata.fill_value.dtype.kind]

    zarr_format = zarr_array.metadata.zarr_format

    if zarr_format == 2:
        # TODO: Once we want to support V2, we will have to deconstruct the
        # zarr_array codecs etc. and reconstruct them with create_v3_array_metadata
        raise NotImplementedError("Reading Zarr V2 currently not supported.")

    elif zarr_format == 3:
        return zarr_array.metadata

    else:
        raise NotImplementedError("Zarr format is not recognized as v2 or v3.")


async def _construct_manifest_array(zarr_array: zarr.AsyncArray[Any], filepath: str):
    array_metadata = get_metadata(zarr_array=zarr_array)

    chunk_manifest = await build_chunk_manifest(zarr_array, filepath=filepath)
    return ManifestArray(metadata=array_metadata, chunkmanifest=chunk_manifest)


async def _construct_manifest_group(
    filepath: str,
    *,
    reader_options: Optional[dict] = None,
    drop_variables: str | Iterable[str] | None = None,
    group: str | None = None,
):
    reader_options = reader_options or {}
    zarr_group = await open_group_async(
        filepath,
        storage_options=reader_options.get("storage_options"),
        path=group,
        mode="r",
    )

    zarr_array_keys = [key async for key in zarr_group.array_keys()]

    _drop_vars: list[Hashable] = [] if drop_variables is None else list(drop_variables)

    zarr_arrays = await asyncio.gather(
        *[zarr_group.getitem(var) for var in zarr_array_keys if var not in _drop_vars]
    )

    manifest_arrays = await asyncio.gather(
        *[
            _construct_manifest_array(zarr_array=array, filepath=filepath)  # type: ignore[arg-type]
            for array in zarr_arrays
        ]
    )

    manifest_dict = {
        array.basename: result for array, result in zip(zarr_arrays, manifest_arrays)
    }
    return ManifestGroup(manifest_dict, attributes=zarr_group.attrs)


def _construct_manifest_store(
    filepath: str,
    *,
    reader_options: Optional[dict] = None,
    drop_variables: str | Iterable[str] | None = None,
    group: str | None = None,
) -> ManifestStore:
    import asyncio

    manifest_group = asyncio.run(
        _construct_manifest_group(
            filepath=filepath,
            group=group,
            drop_variables=drop_variables,
            reader_options=reader_options,
        )
    )
    return ManifestStore(manifest_group)


class ZarrVirtualBackend(VirtualBackend):
    @staticmethod
    def open_virtual_dataset(
        filepath: str,
        group: str | None = None,
        drop_variables: str | Iterable[str] | None = None,
        loadable_variables: Iterable[str] | None = None,
        decode_times: bool | None = None,
        indexes: Mapping[str, Index] | None = None,
        virtual_backend_kwargs: Optional[dict] = None,
        reader_options: Optional[dict] = None,
    ) -> Dataset:
        filepath = validate_and_normalize_path_to_uri(
            filepath, fs_root=Path.cwd().as_uri()
        )

        manifest_store = _construct_manifest_store(
            filepath=filepath,
            group=group,
            drop_variables=drop_variables,
            reader_options=reader_options,
        )

        ds = manifest_store.to_virtual_dataset(
            loadable_variables=loadable_variables,
            decode_times=decode_times,
            indexes=indexes,
        )
        return ds
```
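For readers following the diff, the non-scalar branch of `get_chunk_mapping_prefix` can be illustrated without a real store. The sketch below is not code from this PR: `chunk_entries` is a hypothetical helper, and the plain dict of store keys and byte sizes stands in for what the reader obtains from `store.list_prefix` and `store.getsize`. It shows how Zarr V3 store keys such as `air/c/0/1` appear to be turned into dotted manifest keys, each pointing at a whole chunk object with offset 0.

```python
from __future__ import annotations


def chunk_entries(
    array_name: str, store_sizes: dict[str, int], filepath: str
) -> dict[str, dict]:
    """Map V3 store keys under ``<array>/c/`` to manifest-style entries.

    ``store_sizes`` is a hypothetical stand-in for the store listing:
    it maps each store key to the size in bytes of the object behind it.
    """
    prefix = array_name.lstrip("/") + "/c/"
    entries: dict[str, dict] = {}
    for store_key, length in store_sizes.items():
        if not store_key.startswith(prefix):
            continue  # metadata documents or keys of other arrays
        chunk_key = store_key[len(prefix):]  # e.g. "0/1"
        entries[chunk_key.replace("/", ".")] = {
            "path": f"{filepath}/{store_key}",
            "offset": 0,  # each chunk is a whole object, so it starts at byte 0
            "length": length,
        }
    return entries


sizes = {"air/c/0/0": 100, "air/c/0/1": 80, "air/zarr.json": 500}
manifest = chunk_entries("air", sizes, "file:///data/store.zarr")
```

Here `manifest` has the keys `"0.0"` and `"0.1"`; the `zarr.json` metadata document is skipped because it does not sit under the `c/` prefix.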
Do we have an issue to track learning to read zarr v2?
#565
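As context for the V2 question, here is a rough sketch of the key handling a V2 reader would probably need. It is not part of this PR and is not taken from #565. It assumes the standard Zarr V2 layout: chunk keys sit directly under the array path with no `c/` prefix, chunk indices are joined by `.` by default (or by `/` when `dimension_separator` says so), and metadata lives in `.zarray`, `.zattrs` and `.zgroup`. The function name and the constant are hypothetical.

```python
from __future__ import annotations

# Zarr V2 metadata documents stored alongside chunks; these are not chunk keys.
V2_METADATA_KEYS = {".zarray", ".zattrs", ".zgroup", ".zmetadata"}


def v2_store_key_to_manifest_key(
    store_key: str, array_name: str, dimension_separator: str = "."
) -> str | None:
    """Return the dotted manifest key for a V2 chunk, or None for non-chunk keys."""
    prefix = array_name.strip("/") + "/"
    if not store_key.startswith(prefix):
        return None  # key belongs to another array or group
    relative = store_key[len(prefix):]
    if relative in V2_METADATA_KEYS:
        return None  # metadata document, not a chunk
    # Normalise "/"-separated indices to the dotted form used by the manifest.
    return relative.replace(dimension_separator, ".")
```

Handling codecs and metadata conversion, which the TODO in `get_metadata` mentions, is a separate problem that this sketch does not touch.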