observing the chunk layout for an array #4082

d-v-b · 2026-06-18T15:38:39Z

d-v-b
Jun 18, 2026
Maintainer

Some recent changes (adding rectilinear chunks) and upcoming ones (lazy slicing?) are straining the APIs that tell users how an array is partitioned. I think this is really important information to get right, and we would benefit from thinking through the design, hence this discussion. For background, we had a related discussion prior to the 3.x release.

Here is a quick summary of our current situation:

Arrays have a chunks attribute and a shards attribute. Neither chunks nor shards is an array metadata field in the v3 spec. We use these fields so that users can do create_array(chunks=(10,), shards=(20,)) to create a sensibly sharded array without threading the inner chunk shape through the codecs.

We kept chunks for backwards compatibility with zarr-python 2.x.
We chose chunks to denote "smallest readable unit" in this context to ensure that readers consuming zarr arrays (like dask) would pick the right granularity for reading by checking the chunks attribute.

Rectilinear chunking breaks the chunks attribute. With rectilinear chunking, chunks is no longer a plain tuple of ints but potentially something large, as each individual chunk can have a unique shape. Rather than widen the type of this attribute, which might be a breaking change for consumers that expect tuple[int, ...], array.chunks raises a NotImplementedError when the rectilinear chunk grid is used:

```
import zarr
zarr.config.set({'array.rectilinear_chunks': True})

z = zarr.create_array(
    "memory:///foo",
    shape=(18,),
    shards=((10, 8),),
    chunks=(2,),
    dtype='uint8',
)

print(z.read_chunk_sizes)
# ((2, 2, 2, 2, 2, 2, 2, 2, 2),)
print(z.write_chunk_sizes)
# ((10, 8),)
print(z.chunks)
"""
NotImplementedError: The `chunks` attribute is only defined for arrays using regular chunk grids. This array has a rectilinear chunk grid. Use `read_chunk_sizes` for general access.
"""
```

With rectilinear chunking we got two new array attributes: read_chunk_sizes and write_chunk_sizes, which you can see in the code snippet above. But by focusing on abstract "read size" and "write size", these two attributes obscure important information about the array, like the actual layout of each chunk. The "read size" and "write size" are instructions to the reader / writer about the granularity of that operation, but an array user might also care about the stored layout of a chunk. For example, these two arrays report the same "read" and "write" sizes, but have different physical chunks:

Array A
```
import zarr
zarr.config.set({'array.rectilinear_chunks': True})
z = zarr.create_array(
    "memory:///foo",
    shape=(18,),
    chunks=([7,7,4],),
    dtype='uint8',
)

print(z.read_chunk_sizes)
# ((7, 7, 4),)
print(z.write_chunk_sizes)
# ((7, 7, 4),)
print(z.chunks)
# (7,)
```
Array B
```
import zarr
zarr.config.set({'array.rectilinear_chunks': True})

z = zarr.create_array(
    "memory:///foo",
    shape=(18,),
    chunks=(7,),
    dtype='uint8',
)

print(z.read_chunk_sizes)
# ((7, 7, 4),)
print(z.write_chunk_sizes)
# ((7, 7, 4),)
print(z.chunks)
# (7,)
```
In the above examples, Array A uses the rectilinear chunk grid, so its stored chunks are sub-arrays with sizes (7, 7, 4). Array B uses the regular chunk grid, and its stored chunks are sub-arrays with sizes (7, 7, 7); only the first 4 elements of the last stored chunk fall within the array bounds.

From an array indexing POV these two chunk grids behave identically, but they have different chunks, and I think we want to ensure that users can easily distinguish these two cases with methods or attributes on the Array class.
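To make the distinction concrete, here is a minimal pure-Python sketch of the arithmetic for a 1D regular grid. The helper names (`logical_chunk_sizes`, `stored_chunk_sizes_regular`) are made up for illustration and are not part of zarr-python; the sketch assumes that a regular grid stores every chunk at the full chunk length, including the edge chunk.

```python
# Illustrative sketch only: these helpers are hypothetical, not zarr-python API.

def logical_chunk_sizes(array_len: int, chunk_len: int) -> tuple[int, ...]:
    """Sizes of the array regions covered by each chunk (edge chunk clipped to the array)."""
    n_chunks = -(-array_len // chunk_len)  # ceiling division
    return tuple(min(chunk_len, array_len - i * chunk_len) for i in range(n_chunks))


def stored_chunk_sizes_regular(array_len: int, chunk_len: int) -> tuple[int, ...]:
    """Sizes of the chunks as stored: a regular grid keeps every chunk at full length."""
    n_chunks = -(-array_len // chunk_len)
    return (chunk_len,) * n_chunks


# Array B: regular grid, shape=(18,), chunks=(7,)
print(logical_chunk_sizes(18, 7))         # (7, 7, 4)
print(stored_chunk_sizes_regular(18, 7))  # (7, 7, 7)

# Array A: rectilinear grid, where the declared sizes are also the stored sizes
rectilinear_sizes = (7, 7, 4)
print(rectilinear_sizes == logical_chunk_sizes(18, 7))  # True
```

The logical sizes match for both arrays, which is why an attribute that reports only logical sizes cannot tell the two grids apart.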

And here are some complications:

Lazy slicing will affect how we map a zarr.Array to stored chunks. Lazy slicing will create views of subsets of chunks. So for a lazily sliced array, we need to enumerate the projection of that array's selection onto the underlying chunks, which isn't the same as the size of the underlying chunks. That means for a lazy indexing operation like subset_2 = array.lazy[::2], subset_2.read_chunk_sizes isn't well defined as a collection of chunk sizes that sum to subset_2.shape, since subset_2 isn't defined from whole chunks.
The effective read / write granularity depends on more than just the presence / configuration of the sharding codec. We need to evaluate the read / write granularity over the entire codec pipeline, given the capabilities of a storage backend. Users can theoretically combine sharding with a bytes-bytes codec that negates the ability to read or write subchunks. It is not a good idea, but it's expressible in metadata. And on local storage, or memory storage, with no compression, individual scalars can be written if the store supports byte-range writes. Hopefully we get this feature in zarr-python soon! It's very important!
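As a rough illustration of the lazy slicing point, the sketch below projects a 1D slice onto a list of chunk sizes and counts how many selected elements land in each chunk. The function `project_slice_onto_chunks` is hypothetical and says nothing about how lazy slicing will actually be implemented; it only shows that the per-chunk projection sizes differ from the chunk sizes.

```python
# Illustrative sketch only: a hypothetical helper, not zarr-python API.
from itertools import accumulate


def project_slice_onto_chunks(chunk_sizes: tuple[int, ...], sel: slice) -> dict[int, int]:
    """Map chunk index -> number of selected elements falling inside that chunk (1D)."""
    array_len = sum(chunk_sizes)
    indices = range(*sel.indices(array_len))  # the array indices picked by the slice
    ends = list(accumulate(chunk_sizes))      # exclusive end offset of each chunk
    counts: dict[int, int] = {}
    for i, (size, end) in enumerate(zip(chunk_sizes, ends)):
        begin = end - size
        n = sum(1 for idx in indices if begin <= idx < end)
        if n:
            counts[i] = n
    return counts


# The equivalent of array.lazy[::2] on an 18-element array with chunk sizes (7, 7, 4)
print(project_slice_onto_chunks((7, 7, 4), slice(None, None, 2)))
# {0: 4, 1: 3, 2: 2}
```

The projected sizes (4, 3, 2) sum to the length of the sliced view (9), but reading them still requires fetching chunks of size 7, 7 and 4.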

With all that said, @maxrjones has a PR that outlines a new chunk_layout data structure. I think this will help convey some of the information array consumers need. I'd also like to use this discussion as a venue to enumerate exactly what kind of information we think array consumers need, given the complexity in the array API.

Given some potentially lazy-sliced array A, I think users need easy access to the following info:

Which chunk files will I read / write if I collect the values of A? How big is each of these chunks? How is each chunk selected to produce A?
Which chunk files will I read / write if I write new values to A? (recall that partial chunk writes do a read first) How big is each of these chunks? How is each chunk selected when I write new values?
Which regions of A should I iterate over if I want to read 1 chunk per region? How big is the chunk I have to read, per region?
Which regions of A should I iterate over if I want to write 1 chunk per region? How big is that chunk (this is not necessarily the same size as the region!)?
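For the non-lazy 1D case, the questions above can be sketched in a few lines of plain Python. Both helpers (`chunk_regions`, `chunks_for_write`) are hypothetical names used only for illustration, and the sketch assumes `chunk_sizes` holds the logical (clipped) chunk sizes along one axis.

```python
# Illustrative sketch only: hypothetical helpers, not zarr-python API.
from itertools import accumulate


def chunk_regions(chunk_sizes: tuple[int, ...]) -> list[slice]:
    """Array regions to iterate over so that each region touches exactly one chunk."""
    ends = list(accumulate(chunk_sizes))
    return [slice(end - size, end) for size, end in zip(chunk_sizes, ends)]


def chunks_for_write(chunk_sizes: tuple[int, ...], start: int, stop: int) -> list[tuple[int, slice, bool]]:
    """For a write to array[start:stop], list (chunk index, chunk-local slice, needs_read_first)."""
    touched = []
    for i, region in enumerate(chunk_regions(chunk_sizes)):
        lo, hi = max(start, region.start), min(stop, region.stop)
        if lo >= hi:
            continue  # this chunk is not touched by the write
        # A write that covers only part of a chunk requires reading the chunk first
        partial = (lo, hi) != (region.start, region.stop)
        touched.append((i, slice(lo - region.start, hi - region.start), partial))
    return touched


print(chunk_regions((7, 7, 4)))
# [slice(0, 7, None), slice(7, 14, None), slice(14, 18, None)]
print(chunks_for_write((7, 7, 4), 5, 14))
# [(0, slice(5, 7, None), True), (1, slice(0, 7, None), False)]
```

A real answer would also need to report the stored chunk size per region, which, as the Array B example shows, can be larger than the region itself.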

Some questions for participants:

Did I miss anything in this enumeration? Are there any other chunked operations we need to support, given lazy slicing and the subtleties of a logical chunk size vs physical chunk size?
Should we revisit our use of the terms "shards" and "chunks"?
Does the proposal in poc: ChunkLayout for chunk and shard inspection #4040 get us there?
If not, what are we missing?

I'm especially keen to hear from zarr array consumers, @psobolewskiPhD.
