Some recent changes (adding rectilinear chunks) and upcoming ones (lazy slicing?) are straining the APIs that tell users how an array is partitioned. I think this is really important information to get right, and we would benefit from thinking through the design together, hence this discussion. For background, we had a related discussion prior to the 3.x release.
Here is a quick summary of our current situation:
Arrays have a chunks attribute and a shards attribute. Neither chunks nor shards is an array metadata field in the v3 spec. We use these names so that users can call create_array(chunks=(10,), shards=(20,)) to create a sensibly sharded array without threading the inner chunk shape through the codecs.
We kept chunks for backwards compatibility with zarr-python 2.x.
We chose chunks to denote "smallest readable unit" in this context to ensure that readers consuming zarr arrays (like dask) would pick the right granularity for reading by checking the chunks attribute.
Rectilinear chunking breaks the chunks attribute. With rectilinear chunking, chunks can no longer always be expressed as a plain tuple of ints; it is potentially something large, as each individual chunk can have a unique shape. Rather than widen the type of this attribute, which might be a breaking change for consumers that expect tuple[int, ...], array.chunks raises a NotImplementedError when the rectilinear chunk grid is used:
import zarr

zarr.config.set({'array.rectilinear_chunks': True})
z = zarr.create_array(
    "memory:///foo",
    shape=(18,),
    shards=((10, 8),),
    chunks=(2,),
    dtype='uint8',
)
print(z.read_chunk_sizes)
# ((2, 2, 2, 2, 2, 2, 2, 2, 2),)
print(z.write_chunk_sizes)
# ((10, 8),)
print(z.chunks)
"""NotImplementedError: The `chunks` attribute is only defined for arrays using regular chunk grids. This array has a rectilinear chunk grid. Use `read_chunk_sizes` for general access."""
With rectilinear chunking we got two new array attributes: read_chunk_sizes and write_chunk_sizes, which you can see in the code snippet above. But by focusing on abstract "read size" and "write size", these two attributes obscure important information about the array, like the actual layout of each chunk. The "read size" and "write size" are instructions to the reader / writer about the granularity of that operation, but an array user might also care about the stored layout of a chunk. For example, these two arrays have similar "read" and "write" sizes, but different physical chunks:
In the above examples, array A uses the rectilinear chunk grid and so the stored chunks are sub-arrays with sizes (7,7,4). Array B uses the regular chunk grid, and its stored chunks are subarrays with sizes (7,7,7).
From an array indexing POV these two chunk grids behave identically, but they have different chunks, and I think we want to ensure that users can easily distinguish these two cases with methods or attributes on the Array class.
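To make the difference concrete, here is a minimal pure-Python sketch (no zarr calls). It assumes a 1-D axis of length 18, which is my guess from 7 + 7 + 4; the helper names regular_stored_sizes and logical_sizes are hypothetical and not part of zarr-python. It relies on the fact that a regular grid stores every chunk at full size, including the edge chunk that extends past the array bounds.

```python
from math import ceil


def regular_stored_sizes(length: int, chunk: int) -> tuple[int, ...]:
    """Stored chunk sizes along one axis of a regular grid (every chunk is full-size)."""
    return (chunk,) * ceil(length / chunk)


def logical_sizes(stored: tuple[int, ...], length: int) -> tuple[int, ...]:
    """Clip each stored chunk to the array extent, giving its in-bounds size."""
    out = []
    start = 0
    for size in stored:
        out.append(min(size, length - start))
        start += size
    return tuple(out)


LENGTH = 18  # assumed axis length, not stated in the post

stored_a = (7, 7, 4)                        # array A: rectilinear grid
stored_b = regular_stored_sizes(LENGTH, 7)  # array B: regular grid

print(logical_sizes(stored_a, LENGTH))  # in-bounds sizes for A
print(logical_sizes(stored_b, LENGTH))  # in-bounds sizes for B
print(stored_a == stored_b)             # do the stored layouts match?
```

Under these assumptions the in-bounds sizes agree for A and B while the stored sizes do not, which is exactly the distinction a sizes-only attribute cannot express.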
And here are some complications:
Lazy slicing will affect how we map a zarr.Array to stored chunks. Lazy slicing will create views of subsets of chunks. So for a lazily sliced array, we need to enumerate the projection of that array's selection onto the underlying chunks, which isn't the same as the size of the underlying chunks. That means for a lazy indexing operation like subset_2 = array.lazy[::2], subset_2.read_chunk_sizes isn't well defined as a collection of chunk sizes that sum to subset_2.shape, since subset_2 isn't defined from whole chunks.
The effective write / read granularity depends on more than just the presence / configuration of the sharding codec. We need to evaluate the read / write granularity over the entire codec pipeline, given the capabilities of a storage backend. Users can theoretically combine sharding with a bytes-bytes codec that negates the ability to read or write subchunks. It is not a good idea, but it's expressible in metadata. And on local storage or memory storage, with no compression, individual scalars can be written if the store supports byte-range writes. Hopefully we get this feature in zarr-python soon! It's very important!
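As a rough illustration of that projection, here is a pure-Python sketch for a single axis. It does not use any zarr API: project_slice is a hypothetical helper, the chunk sizes (7, 7, 4) are borrowed from the example above, and only positive steps are handled.

```python
def project_slice(chunk_sizes, sel):
    """For each chunk touched by `sel`, return (chunk_index, chunk_local_slice, n_selected)."""
    length = sum(chunk_sizes)
    start, stop, step = sel.indices(length)
    if step <= 0:
        raise ValueError("this sketch only handles positive steps")
    out = []
    offset = 0
    for i, size in enumerate(chunk_sizes):
        lo, hi = offset, offset + size
        # First selected index that is >= lo (ceil division rounds up to the stride).
        first = start if start >= lo else start + -(-(lo - start) // step) * step
        last_excl = min(hi, stop)
        if first < last_excl:
            n = len(range(first, last_excl, step))
            out.append((i, slice(first - lo, last_excl - lo, step), n))
        offset = hi
    return out


# Projection of a [::2] selection onto chunks of sizes (7, 7, 4).
projection = project_slice((7, 7, 4), slice(None, None, 2))
for chunk_index, local_slice, n_selected in projection:
    print(chunk_index, local_slice, n_selected)
```

The per-chunk element counts sum to the length of the sliced view, but they are not chunk sizes in any stored sense, and the chunk-local slices differ from chunk to chunk. That is the information a flat tuple of sizes cannot carry.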
With all that said, @maxrjones has a PR that outlines a new chunk_layout data structure. I think this will help convey some of the information array consumers need. I'd also like to use this discussion as a venue to enumerate exactly what kind of information we think array consumers need, given the complexity in the array API.
Given some potentially lazy-sliced array A, I think users need easy access to the following info:
Which chunk files will I read / write if I collect the values of A? How big is each of these chunks? How is each chunk selected to produce A?
Which chunk files will I read / write if I write new values to A? (recall that partial chunk writes do a read first) How big is each of these chunks? How is each chunk selected when I write new values?
Which regions of A should I iterate over if I want to read 1 chunk per region? How big is the chunk I have to read, per region?
Which regions of A should I iterate over if I want to write 1 chunk per region? How big is that chunk (this is not necessarily the same size as the region!)?
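To show what an answer to the last two questions could look like, here is a small sketch of a per-chunk record for one axis. ChunkRegion and one_chunk_regions are hypothetical names for illustration only; they are not the chunk_layout structure from the PR and not part of zarr-python. The example assumes a regular grid with chunk size 7 over an axis of length 18.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass(frozen=True)
class ChunkRegion:
    index: int        # position of the chunk along the axis
    region: slice     # in-bounds span of the array covered by this chunk
    stored_size: int  # size of the chunk as stored


def one_chunk_regions(stored_sizes, length):
    """Yield one region per chunk along a single axis, clipped to the array extent."""
    starts = [0, *accumulate(stored_sizes)]
    for i, size in enumerate(stored_sizes):
        lo = starts[i]
        if lo >= length:
            break
        yield ChunkRegion(i, slice(lo, min(lo + size, length)), size)


# Regular grid: three stored chunks of size 7 covering an axis of length 18 (assumed).
regions = list(one_chunk_regions((7, 7, 7), 18))
for r in regions:
    print(r.index, r.region, r.stored_size)
```

Iterating over these regions touches exactly one chunk per step, and the last record shows a region that is smaller than the chunk it maps to.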
Some questions for participants:
Did I miss anything in this enumeration? Are there any other chunked operations we need to support, given lazy slicing and the subtleties of a logical chunk size vs physical chunk size?
Should we revisit our use of the terms "shards" and "chunks"?
I'm especially keen to hear from zarr array consumers, @psobolewskiPhD.