Open
Labels
bug: Potential issues with the zarr-python library
Description
Zarr version
main
Numcodecs version
0.16.3
Python Version
3.13
Operating System
mac
Installation
uv run
Description
This comes from pydata/xarray#10831. When calling dask.to_zarr on an array created with sharding and with chunks passed explicitly to create_array, there is potential for data loss due to misalignment between the shards, the shards' inner chunks, and the dask chunks. I am raising this here first rather than in dask because I also found that if you comment out the explicit chunks argument when creating the array, zarr throws an error that protects you:
ValueError: The array's `chunk_shape` (got (510, 255, 255)) needs to be divisible by the shard's inner `chunk_shape` (got (8, 3, 5)).
But when chunks are passed explicitly, no error is raised and you end up with this data loss:
Steps to reproduce
# /// script
# requires-python = ">=3.13"
# dependencies = [
#   "zarr @ git+https://github.com/zarr-developers/zarr-python.git",
#   "numpy",
#   "dask[array] @ git+https://github.com/dask/dask.git",
# ]
# ///
import dask.array as da
import numpy as np
import zarr
rng = da.random.default_rng(seed=42)
dask_array = rng.integers(
    0, 2, size=(1000, 300, 300), chunks=(255, 255, 255), dtype=np.int64
)
original_sum = dask_array.sum().compute()
store = zarr.storage.LocalStore("bug.zarr")
group = zarr.open_group(store=store, mode="w")
zarr_array = group.create_array(
    name="data",
    shape=dask_array.shape,
    chunks=(255, 255, 255),
    shards=(510, 255, 255),
    dtype=dask_array.dtype,
    overwrite=True,
)
da.to_zarr(dask_array, zarr_array)
store_read = zarr.storage.LocalStore("bug.zarr")
group_read = zarr.open_group(store=store_read, mode="r")
array_read = group_read["data"]
read_sum = array_read[:].sum()
assert read_sum == original_sum, (
    f"Data corruption: expected {original_sum}, got {read_sum}"
)
Additional output
assert read_sum == original_sum, (
^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Data corruption: expected 45004136, got 25746614
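To make the misalignment in the reproduction concrete, here is a minimal sketch of a pre-write check. It assumes (this is not confirmed in the report) that the loss occurs when two dask write tasks touch the same shard, which happens whenever an interior dask chunk boundary does not fall on a shard boundary. The helper name unaligned_axes is hypothetical and is not part of zarr or dask; its first argument has the same tuple-of-tuples layout as dask_array.chunks.

```python
def unaligned_axes(write_chunks, shard_shape):
    """Return the axes on which an interior write-chunk boundary
    does not coincide with a shard boundary.

    write_chunks: per-axis tuples of chunk sizes (same layout as
                  dask_array.chunks).
    shard_shape:  the shard shape passed to create_array(shards=...).
    Hypothetical helper for illustration only.
    """
    bad = []
    for axis, (sizes, shard) in enumerate(zip(write_chunks, shard_shape)):
        offset = 0
        # The final boundary is the end of the array, so only interior
        # boundaries (all but the last chunk) need to be checked.
        for size in sizes[:-1]:
            offset += size
            if offset % shard != 0:
                bad.append(axis)
                break
    return bad


# Chunk layout from the reproduction: shape (1000, 300, 300) split into
# dask chunks of 255, written into shards of (510, 255, 255).
repro_chunks = ((255, 255, 255, 235), (255, 45), (255, 45))
print(unaligned_axes(repro_chunks, (510, 255, 255)))

# The same array with dask chunks equal to the shard shape.
aligned_chunks = ((510, 490), (255, 45), (255, 45))
print(unaligned_axes(aligned_chunks, (510, 255, 255)))
```

On the layout from the script, axis 0 is flagged because the dask boundaries at 255 and 765 fall inside shards that span 0 to 510 and 510 to 1000. If the assumption above holds, rechunking before the write, for example dask_array.rechunk((510, 255, 255)), would give each task exclusive ownership of the shards it touches. Whether that avoids the loss in this case has not been verified here.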
Holmgren825