Commit 14b80a9

mpiannucci, TomNicholas, and pre-commit-ci[bot] authored
Sync with icechunk alpha 8 (#368)
* Update to icechunk alpha 8
* Update some docs
* Update release notes
* Add datetime checksum functionality to icechunk writer
* Add test for checksum that fails
* Add working test for checksums
* Fix typing
* Breakout checksum functionality to its own test
* Docstrings
* Update docs a bit
* Update virtualizarr/accessor.py

  Co-authored-by: Tom Nicholas <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci
* Add test case for invalid `last_updated_at`
* Add note about precision of `last_updated_at`
* Fix typing
* Typing
* Ignore zarr import errors at project level
* Update docs

---------

Co-authored-by: Tom Nicholas <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent ef8f039 commit 14b80a9

7 files changed: +232 -108 lines changed

ci/upstream.yml

Lines changed: 1 addition & 1 deletion

@@ -28,6 +28,6 @@ dependencies:
   - fsspec
   - pip
   - pip:
-    - icechunk>=0.1.0a7 # Installs zarr v3 as dependency
+    - icechunk>=0.1.0a8 # Installs zarr v3 as dependency
     # - git+https://github.com/fsspec/kerchunk@main # kerchunk is currently incompatible with zarr-python v3 (https://github.com/fsspec/kerchunk/pull/516)
     - imagecodecs-numcodecs==2024.6.1

docs/releases.rst

Lines changed: 4 additions & 0 deletions

@@ -11,6 +11,10 @@ New Features

 - Added a ``.nbytes`` accessor method which displays the bytes needed to hold the virtual references in memory.
   (:issue:`167`, :pull:`227`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
+- Sync with Icechunk v0.1.0a8 (:pull:`368`) By `Matthew Iannucci <https://github.com/mpiannucci>`_. This also adds support
+  for the `to_icechunk` method to add timestamps as checksums when writing virtual references to an icechunk store. This
+  is useful for ensuring that virtual references are not stale when reading from an icechunk store, which can happen if the
+  underlying data has changed since the virtual references were written.

 Breaking changes
 ~~~~~~~~~~~~~~~~
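
The staleness problem described in this release note can be illustrated with a small standard-library sketch. This is not how Icechunk implements its check; `modified_since` is a hypothetical helper, and a local file's modification time stands in for whatever metadata a real store would consult. It only shows the comparison the note describes: data that changed after the recorded timestamp means the references can no longer be trusted.

```python
import os
import tempfile
from datetime import datetime, timezone


def modified_since(path: str, last_updated_at: datetime) -> bool:
    """Return True if the file at `path` changed after `last_updated_at`.

    Hypothetical helper for illustration; `last_updated_at` must be timezone-aware.
    """
    modified = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    return modified > last_updated_at


# A temporary file stands in for a data file that virtual references point at.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"chunk bytes")
    data_path = handle.name

# Pretend the virtual references were written at this moment.
written_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
stamp = written_at.timestamp()

# File last modified one minute before the references were written: still valid.
os.utime(data_path, (stamp - 60, stamp - 60))
unchanged_is_flagged = modified_since(data_path, written_at)

# File rewritten one minute after: the references are now stale.
os.utime(data_path, (stamp + 60, stamp + 60))
rewritten_is_flagged = modified_since(data_path, written_at)

os.remove(data_path)
```

The comparison is strict, so a file modified exactly at the recorded time is treated as unchanged in this sketch; whether a real store makes the same choice is not stated in this commit.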

docs/usage.md

Lines changed: 10 additions & 8 deletions

@@ -421,14 +421,16 @@ By default references are placed in separate parquet file when the total number
 We can also write these references out as an [IcechunkStore](https://icechunk.io/). `Icechunk` is an open-source, cloud-native transactional tensor storage engine that is compatible with zarr version 3. To export our virtual dataset to an `Icechunk` Store, we simply use the {py:meth}`vds.virtualize.to_icechunk <virtualizarr.VirtualiZarrDatasetAccessor.to_icechunk>` accessor method.

 ```python
-# create an icechunk store
-from icechunk import IcechunkStore, StorageConfig, StoreConfig, VirtualRefConfig
-storage = StorageConfig.filesystem(str('combined'))
-store = IcechunkStore.create(storage=storage, mode="w", config=StoreConfig(
-    virtual_ref_config=VirtualRefConfig.s3_anonymous(region='us-east-1'),
-))
-
-combined_vds.virtualize.to_icechunk(store)
+# create an icechunk repository, session and write the virtual dataset to the session
+from icechunk import Repository, Storage, VirtualChunkContainer, local_filesystem_storage
+storage = local_filesystem_storage(str('combined'))
+
+# By default, local virtual references and public remote virtual references can be read without extra configuration.
+repo = Repository.create(storage=storage)
+session = repo.writeable_session("main")
+
+# write the virtual dataset to the session with the IcechunkStore
+combined_vds.virtualize.to_icechunk(session.store)
 ```

 See the [Icechunk documentation](https://icechunk.io/icechunk-python/virtual/#creating-a-virtual-dataset-with-virtualizarr) for more details.

pyproject.toml

Lines changed: 5 additions & 1 deletion

@@ -39,7 +39,7 @@ hdf_reader = [
     "numcodecs"
 ]
 icechunk = [
-    "icechunk>=0.1.0a7",
+    "icechunk>=0.1.0a8",
 ]
 test = [
     "codecov",
@@ -103,6 +103,10 @@ ignore_missing_imports = true
 module = "ujson.*"
 ignore_missing_imports = true

+[[tool.mypy.overrides]]
+module = "zarr.*"
+ignore_missing_imports = true
+
 [tool.ruff]
 # Same as Black.
 line-length = 88

virtualizarr/accessor.py

Lines changed: 25 additions & 1 deletion

@@ -1,3 +1,4 @@
+from datetime import datetime
 from pathlib import Path
 from typing import TYPE_CHECKING, Callable, Literal, Optional, overload

@@ -39,7 +40,10 @@ def to_zarr(self, storepath: str) -> None:
         dataset_to_zarr(self.ds, storepath)

     def to_icechunk(
-        self, store: "IcechunkStore", append_dim: Optional[str] = None
+        self,
+        store: "IcechunkStore",
+        append_dim: Optional[str] = None,
+        last_updated_at: Optional[datetime] = None,
     ) -> None:
         """
         Write an xarray dataset to an Icechunk store.
@@ -48,10 +52,30 @@ def to_icechunk(

         If `append_dim` is provided, the virtual dataset will be appended to the existing IcechunkStore along the `append_dim` dimension.

+        If `last_updated_at` is provided, it will be used as a checksum for any virtual chunks written to the store with this operation.
+        At read time, if any of the virtual chunks have been updated since this provided datetime, an error will be raised.
+        This protects against reading outdated virtual chunks that have been updated since the last read. When not provided, no check is performed.
+        This value is stored in Icechunk with seconds precision, so be sure to take that into account when providing this value.
+
         Parameters
         ----------
         store: IcechunkStore
         append_dim: str, optional
+            When provided, specifies the dimension along which to append the virtual dataset.
+        last_updated_at: datetime, optional
+            When provided, uses provided datetime as a checksum for any virtual chunks written to the store with this operation.
+            When not provided (default), no check is performed.
+
+        Examples
+        --------
+        To ensure an error is raised if the files containing referenced virtual chunks are modified at any time from now on, pass the current time to ``last_updated_at``.
+
+        >>> from datetime import datetime
+        >>>
+        >>> vds.virtualize.to_icechunk(
+        ...     icechunkstore,
+        ...     last_updated_at=datetime.now(),
+        ... )
         """
         from virtualizarr.writers.icechunk import dataset_to_icechunk
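
The docstring's note about seconds precision suggests normalising a timestamp before passing it as `last_updated_at`. The sketch below uses only the standard library and shows two ways to drop sub-second digits. Both helper names are hypothetical, and the commit does not say whether Icechunk truncates or rounds, so the choice between them is an assumption the caller has to make.

```python
from datetime import datetime, timedelta, timezone


def truncate_to_second(moment: datetime) -> datetime:
    """Drop the sub-second part, moving the timestamp earlier (hypothetical helper)."""
    return moment.replace(microsecond=0)


def round_up_to_second(moment: datetime) -> datetime:
    """Move to the next whole second unless already on one (hypothetical helper)."""
    truncated = moment.replace(microsecond=0)
    if truncated == moment:
        return moment
    return truncated + timedelta(seconds=1)


# A timestamp with 0.7 s of sub-second detail.
sample = datetime(2024, 1, 1, 12, 0, 0, 700_000, tzinfo=timezone.utc)

truncated = truncate_to_second(sample)   # lands on 12:00:00
rounded_up = round_up_to_second(sample)  # lands on 12:00:01

# A value already on a whole second is returned unchanged.
whole = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
whole_rounded = round_up_to_second(whole)
```

Rounding up is the more cautious option if a file may have been written within the same second as the timestamp, because truncating could leave the recorded time slightly earlier than the file's real modification time.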
