-
-
Notifications
You must be signed in to change notification settings - Fork 356
fix chunk/shard iteration #3299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…nd add a failing test
Co-authored-by: Bojidar Marinov <[email protected]>
src/zarr/core/array.py
Outdated
return self.chunk_grid_shape | ||
|
||
@property | ||
def chunk_grid_shape(self) -> ChunkCoords: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a new property called chunk_grid_shape
because cdata_shape
is ambiguous. cdata_shape
is still around, but it now uses chunk_grid_shape
src/zarr/core/array.py
Outdated
return tuple(starmap(ceildiv, zip(self.shape, self.chunks, strict=True))) | ||
|
||
@property | ||
def shard_grid_shape(self) -> ChunkCoords: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this complements chunk_grid_shape
.
…nto fix/_iter_chunk_keys
Ooh, that looks like a much more complete implementation of what I did in that other PR! Kudos; I wouldn't have been able to push things this far ✨✨ |
It turned out to be more than I expected 😅 . |
src/zarr/core/indexing.py
Outdated
|
||
else: | ||
msg = f"Indexing order {order} is not supported at this time." # type: ignore[unreachable] | ||
raise NotImplementedError(msg) | ||
|
||
|
||
def iter_regions( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a new function for iterating over contiguous regions. When we support irregular chunking, we can overload the type of region_shape
accordingly.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3299 +/- ##
==========================================
+ Coverage 94.55% 94.63% +0.07%
==========================================
Files 79 79
Lines 9448 9500 +52
==========================================
+ Hits 8934 8990 +56
+ Misses 514 510 -4
🚀 New features to boost your workflow:
|
if self.shards is None: | ||
shard_shape = self.chunks | ||
else: | ||
shard_shape = self.shards |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
love it. matches how I think about things.
Just to make sure we're on the same page here, this is where we're heading in my mind, as something that can go in the user guide:
|
some interesting new build errors that seem unrelated to my changes: https://github.com/zarr-developers/zarr-python/actions/runs/16720510145/job/47323677345#step:5:33
|
I've started reviewing code, but first wanted to provide comments at a higher level: To understand what this PR results in, I ran a little test script that compares properties for sharded and non-sharded arrays (see below). I like the concept that even when shards are not set, the shard properties still return the chunk values, but since this is a new concept it should be accomapanied with appropriate explanation in the Sharding user guide section, and in the release notes (which could just be a link to the sharding user guide section). It is a bit confusing that It also feels like the scope of this PR continues to expand beyond just a bugfix into (very welcome!) additions and new concepts, so I would advocate for it landing in 3.2.0 instead of a micro release. I have some pending inline comments, but I thought I'd leave them until we've hased out the higher level stuff. Thanks for this PR - I like where it's going a lot 😄 import zarr
arr = zarr.create_array(store={}, shape=(12,), chunks=(2,), dtype='uint8')
arr[0] = 1
print("For a non-sharded array")
print("-----------------------")
print(f"{arr.chunks=}")
print(f"{arr.chunk_grid_shape=}")
print(f"{arr.nchunks=}")
print(f"{arr.nchunks_initialized=}")
print()
print(f"{arr.shards=}")
print(f"{arr.shard_grid_shape=}")
print(f"{arr.nshards=}")
print(f"{arr.nshards_initialized=}")
print()
arr = zarr.create_array(store={}, shape=(12,), chunks=(2,), shards=(6,), dtype='uint8')
arr[0] = 1
print("For a sharded array")
print("-------------------")
print(f"{arr.chunks=}")
print(f"{arr.chunk_grid_shape=}")
print(f"{arr.nchunks=}")
print(f"{arr.nchunks_initialized=}")
print()
print(f"{arr.shards=}")
print(f"{arr.shard_grid_shape=}")
print(f"{arr.nshards=}")
print(f"{arr.nshards_initialized=}")
print()
|
IMO getting the bugfix out is important enough that this should be targeted for release very soon, ideally the next release. We have a few options here but the simplest might be to make all the new API private, and then make it public for 3.2.0. Making the new shards API private would also defer a decision about |
I agree, but there are lots of changes here that seem orthogonal to just fixing that bug. |
They are conceptually extremely simple. we already have tools for iterating over the shape defined by a chunk. I am adding tools for iterating over the shape defined by a shard, where a shard == a chunk in many cases. if you want time to study these changes in greater detail, lets just make them private. |
Consider the new shard-centric API a fix not only for one specific bug, but a potentially large number of future bugs caused by iterating over chunk-sized regions when writing data. Now that we have added sharding, but not added a user-friendly API for iterating over shard-sized regions, we are setting up a massive footgun for users to break their IO routines in a concurrent context. That's why this PR adds new functions / methods. |
They are not conceptually simple to me, and we've had a long discussion here before landing on a design. We shouldn't just make everything private and merge it without proper review, since private code still needs to be understood by developers. If code takes a long time to review, then it takes a long time to review - here there is the option of slimming this PR right down to only that needed to fix the bug, which would make for a quicker review. My worry is the scope here has crept well beyond a targeted a minimal fix for a bug, to a set of wider changes that introduces new concepts and lots of new API that is not needed to fix that bug. The extra lines of code make PR review harder, and increase the surface area through which bugs could creep in, which is my reasoning for something this big to go in a minor (not micro) release. So to move forward, I guess I can continue reviewing this, but it won't happen quickly because it's a large PR with lots of new API (public and private), or a smaller targeted just-a-bugfix PR would probably be quicker to review if there is an shorter term need to fix the bug. |
can you indicate which functions or routines are hard to understand here? |
It's less a case of hard to understand, more a case of new concepts (what is a chunk? what is a shard? what do the eight chunk/shard properties mean? how does passing different values to functions affect these properties? ...). Again, I think this is really good and important work, but big enough and introduces/formalises enough concepts that it will take me a while to wrap my head around it all and review, and should in my opinion land in a minor release with comprehensive user guide documentation and examples to match the new API and concepts. |
A chunk is the smallest region of an array that you can read in parallel. A shard is the smallest unit of an array that you can write in parallel. Using this terminology, there is a 1:1 correspondence between shards and files. There is not necessarily a 1:1 correspondence between chunks and files. This confusion caused the bug that this PR fixes (along with some other bugs). For arrays that don't use the sharding codec, the chunk shape is the same as the effective shard shape. For arrays that do use sharding, the shard shape is an integer multiple of the chunk shape. For reference, these are not new concepts. The chunk / shard distinction already exists in |
as an indication of the low-complexity of these changes, read the tests I added. . There's really nothing elaborate or deep going on here. |
I think my earlier comments still needs addressing here:
and
Happy to try and review in full once we work these points out (especially the second one). |
I agree that it's confusing, and I also agree that the And sorry for all the pings recently but @zarr-developers/python-core-devs it would be great to get some other POVs here, since we are talking about some potential changes to the user-facing For this PR, i'm totally fine making all the new |
This is better.
I'm 50/50 on this, given that historically Zarr has only had chunks, not shards; and that these are also kwargs to the constructor. It's going to be confusing regardless. Should we just call it
Let's do this ASAP, and punt on the changing the meaning of the |
Perhaps the recommendation to replace Certainly |
the new API is all private, let me know if we need to do anything else |
In
main
there are some routines for iterating over the chunks of an array, but these routines do not distinguish between chunks and shards (i.e., stored objects) for arrays with sharding.This PR adds a separate set of shard-specific iteration routines to complement our chunk-specific iteration routines. Various bugs related to iterating over chunks, when shards were the intended iteration target, have been fixed by these changes, notably bugs causing memory races when creating arrays via
create_array
(xref ##3169)I think this supersedes #3217, @bojidar-bg I credited you as a co-author on one of these commits because your idea to change the iteration from chunks to shards was correct.