
Data loss when writing sharded arrays with dask #3514

@ianhi

Zarr version

main

Numcodecs version

0.16.3

Python Version

3.13

Operating System

mac

Installation

uv run

Description

This is coming from: pydata/xarray#10831

When calling dask's da.to_zarr on a sharded array that was created by explicitly passing chunks to create_array, there is potential for data loss due to misalignment between the zarr chunks, the sharding inner chunks, and the dask chunks. I'm raising this here first instead of in dask because I also found that if you comment out the explicit chunks argument when creating the array, zarr raises an error that protects you:

ValueError: The array's `chunk_shape` (got (510, 255, 255)) needs to be divisible by the shard's inner `chunk_shape` (got (8, 3, 5)).

but when explicitly passing chunks you end up with this data loss:

Steps to reproduce

# /// script
# requires-python = ">=3.13"
# dependencies = [
#     "zarr @ git+https://github.com/zarr-developers/zarr-python.git",
#     "numpy",
#     "dask[array] @ git+https://github.com/dask/dask.git",
# ]
# ///
import dask.array as da
import numpy as np
import zarr

rng = da.random.default_rng(seed=42)
dask_array = rng.integers(
    0, 2, size=(1000, 300, 300), chunks=(255, 255, 255), dtype=np.int64
)
original_sum = dask_array.sum().compute()

store = zarr.storage.LocalStore("bug.zarr")
group = zarr.open_group(store=store, mode="w")
zarr_array = group.create_array(
    name="data",
    shape=dask_array.shape,
    chunks=(255, 255, 255),
    shards=(510, 255, 255),
    dtype=dask_array.dtype,
    overwrite=True,
)

da.to_zarr(dask_array, zarr_array)

store_read = zarr.storage.LocalStore("bug.zarr")
group_read = zarr.open_group(store=store_read, mode="r")
array_read = group_read["data"]
read_sum = array_read[:].sum()

assert read_sum == original_sum, (
    f"Data corruption: expected {original_sum}, got {read_sum}"
)

Additional output

    assert read_sum == original_sum, (
           ^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Data corruption: expected 45004136, got 25746614
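
A rough way to see the misalignment in the reproducer above is to count how many dask blocks overlap each shard along every axis. This is only a sketch in plain Python, not zarr or dask code: `blocks_per_shard` is a hypothetical helper written for this illustration. It assumes each dask block is written as its own task and that a shard is written as a single unit, so any shard touched by more than one block is a candidate for a lost update. That mechanism is my assumption and has not been confirmed here.

```python
from collections import Counter


def blocks_per_shard(length, block, shard):
    """Count how many regular write blocks overlap each shard along one axis.

    Hypothetical helper for illustration only; not part of zarr or dask.
    """
    counts = Counter()
    for start in range(0, length, block):
        stop = min(start + block, length)  # last block may be shorter
        first = start // shard             # first shard this block touches
        last = (stop - 1) // shard         # last shard this block touches
        for shard_index in range(first, last + 1):
            counts[shard_index] += 1
    return dict(counts)


# Values taken from the reproducer above.
shape = (1000, 300, 300)
dask_chunks = (255, 255, 255)
shards = (510, 255, 255)

for axis, (n, b, s) in enumerate(zip(shape, dask_chunks, shards)):
    print(axis, blocks_per_shard(n, b, s))
# 0 {0: 2, 1: 2}
# 1 {0: 1, 1: 1}
# 2 {0: 1, 1: 1}
```

Along axis 0 every shard is touched by two dask blocks, while along axes 1 and 2 each shard is touched by exactly one. If the assumption above holds, only axis 0 is misaligned in this reproducer.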
