Skip to content

fix chunk/shard iteration #3299

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 39 commits into
base: main
Choose a base branch
from

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented Jul 25, 2025

In main there are some routines for iterating over the chunks of an array, but these routines do not distinguish between chunks and shards (i.e., stored objects) for arrays with sharding.

This PR adds a separate set of shard-specific iteration routines to complement our chunk-specific iteration routines. Various bugs related to iterating over chunks, when shards were the intended iteration target, have been fixed by these changes, notably bugs causing memory races when creating arrays via create_array (xref ##3169)

I think this supersedes #3217, @bojidar-bg I credited you as a co-author on one of these commits because your idea to change the iteration from chunks to shards was correct.

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Jul 25, 2025
return self.chunk_grid_shape

@property
def chunk_grid_shape(self) -> ChunkCoords:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new property called chunk_grid_shape because cdata_shape is ambiguous. cdata_shape is still around, but it now uses chunk_grid_shape

return tuple(starmap(ceildiv, zip(self.shape, self.chunks, strict=True)))

@property
def shard_grid_shape(self) -> ChunkCoords:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this complements chunk_grid_shape.

@bojidar-bg
Copy link
Contributor

Ooh, that looks like a much more complete implementation of what I did in that other PR! Kudos; I wouldn't have been able to push things this far ✨✨

@d-v-b
Copy link
Contributor Author

d-v-b commented Jul 25, 2025

Ooh, that looks like a much more complete implementation of what I did in that other PR! Kudos; I wouldn't have been able to push things this far ✨✨

It turned out to be more than I expected 😅 .


else:
msg = f"Indexing order {order} is not supported at this time." # type: ignore[unreachable]
raise NotImplementedError(msg)


def iter_regions(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a new function for iterating over contiguous regions. When we support irregular chunking, we can overload the type of region_shape accordingly.

Copy link

codecov bot commented Jul 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.63%. Comparing base (c21d1f9) to head (aa9d0b7).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3299      +/-   ##
==========================================
+ Coverage   94.55%   94.63%   +0.07%     
==========================================
  Files          79       79              
  Lines        9448     9500      +52     
==========================================
+ Hits         8934     8990      +56     
+ Misses        514      510       -4     
Files with missing lines Coverage Δ
src/zarr/core/array.py 97.43% <100.00%> (+0.33%) ⬆️
src/zarr/core/indexing.py 96.40% <100.00%> (+0.31%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@d-v-b d-v-b requested a review from a team July 26, 2025 21:02
Comment on lines +1221 to +1224
if self.shards is None:
shard_shape = self.chunks
else:
shard_shape = self.shards
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love it. matches how I think about things.

@dstansby
Copy link
Contributor

Just to make sure we're on the same page here, this is where we're heading in my mind, as something that can go in the user guide:

In zarr-python, a chunk of data represents a single file in storage. There is one exception: when a single sharding codec is part of the codecs. In this case a shard represents a single file in storage, and a chunk represents a unit of data the same size or smaller that can be read independently of other chunks in the shard. This distinction is made to allow zarr-python to optimize the creation and processing of sharded arrays.

@d-v-b
Copy link
Contributor Author

d-v-b commented Aug 4, 2025

some interesting new build errors that seem unrelated to my changes:

https://github.com/zarr-developers/zarr-python/actions/runs/16720510145/job/47323677345#step:5:33

3s
Run hatch build
──────────────────────────────────── sdist ─────────────────────────────────────
Setting up build environment
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/runner/.local/share/hatch/env/virtual/zarr/nOVTQ5-s/zarr-build/lib/python3.11/site-packages/hatchling/__main__.py", line 6, in <module>
    sys.exit(hatchling())
             ^^^^^^^^^^^
  File "/home/runner/.local/share/hatch/env/virtual/zarr/nOVTQ5-s/zarr-build/lib/python3.11/site-packages/hatchling/cli/__init__.py", line 26, in hatchling
    command(**kwargs)
  File "/home/runner/.local/share/hatch/env/virtual/zarr/nOVTQ5-s/zarr-build/lib/python3.11/site-packages/hatchling/cli/build/__init__.py", line 82, in build_impl
    for artifact in builder.build(
  File "/home/runner/.local/share/hatch/env/virtual/zarr/nOVTQ5-s/zarr-build/lib/python3.11/site-packages/hatchling/builders/plugin/interface.py", line 147, in build
    build_hook.initialize(version, build_data)
  File "/home/runner/.local/share/hatch/env/virtual/zarr/nOVTQ5-s/zarr-build/lib/python3.11/site-packages/hatch_vcs/build_hook.py", line 35, in initialize
    dump_version(self.root, self.metadata.version, self.config_version_file, **kwargs)
  File "/home/runner/.local/share/hatch/env/virtual/zarr/nOVTQ5-s/zarr-build/lib/python3.11/site-packages/setuptools_scm/_integration/dump_version.py", line 77, in dump_version
    write_version_to_path(
  File "/home/runner/.local/share/hatch/env/virtual/zarr/nOVTQ5-s/zarr-build/lib/python3.11/site-packages/setuptools_scm/_integration/dump_version.py", line 111, in write_version_to_path
    content = final_template.format(version=version, version_tuple=version_tuple)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'scm_version'

@dstansby
Copy link
Contributor

dstansby commented Aug 4, 2025

I've started reviewing code, but first wanted to provide comments at a higher level:

To understand what this PR results in, I ran a little test script that compares properties for sharded and non-sharded arrays (see below). I like the concept that even when shards are not set, the shard properties still return the chunk values, but since this is a new concept it should be accomapanied with appropriate explanation in the Sharding user guide section, and in the release notes (which could just be a link to the sharding user guide section).

It is a bit confusing that .shards now returns None with just chunks, where all the other shard releated properties return the equivalent chunk values. I would suggest returning self.chunks from .shards if there's no sharding to match the other new properties, and adding a new .is_sharded: bool property to determine whether shards are being used (previously one would have done self.shards is None).

It also feels like the scope of this PR continues to expand beyond just a bugfix into (very welcome!) additions and new concepts, so I would advocate for it landing in 3.2.0 instead of a micro release.

I have some pending inline comments, but I thought I'd leave them until we've hased out the higher level stuff. Thanks for this PR - I like where it's going a lot 😄

import zarr

arr = zarr.create_array(store={}, shape=(12,), chunks=(2,), dtype='uint8')
arr[0] = 1
print("For a non-sharded array")
print("-----------------------")
print(f"{arr.chunks=}")
print(f"{arr.chunk_grid_shape=}")
print(f"{arr.nchunks=}")
print(f"{arr.nchunks_initialized=}")
print()
print(f"{arr.shards=}")
print(f"{arr.shard_grid_shape=}")
print(f"{arr.nshards=}")
print(f"{arr.nshards_initialized=}")
print()

arr = zarr.create_array(store={}, shape=(12,), chunks=(2,), shards=(6,), dtype='uint8')
arr[0] = 1
print("For a sharded array")
print("-------------------")
print(f"{arr.chunks=}")
print(f"{arr.chunk_grid_shape=}")
print(f"{arr.nchunks=}")
print(f"{arr.nchunks_initialized=}")
print()
print(f"{arr.shards=}")
print(f"{arr.shard_grid_shape=}")
print(f"{arr.nshards=}")
print(f"{arr.nshards_initialized=}")
print()
For a non-sharded array
-----------------------
arr.chunks=(2,)
arr.chunk_grid_shape=(6,)
arr.nchunks=6
arr.nchunks_initialized=1

arr.shards=None
arr.shard_grid_shape=(6,)
arr.nshards=6
arr.nshards_initialized=1

For a sharded array
-------------------
arr.chunks=(2,)
arr.chunk_grid_shape=(6,)
arr.nchunks=6
arr.nchunks_initialized=3

arr.shards=(6,)
arr.shard_grid_shape=(2,)
arr.nshards=2
arr.nshards_initialized=1

@d-v-b
Copy link
Contributor Author

d-v-b commented Aug 4, 2025

It also feels like the scope of this PR continues to expand beyond just a bugfix into (very welcome!) additions and new concepts, so I would advocate for it landing in 3.2.0 instead of a micro release.

IMO getting the bugfix out is important enough that this should be targeted for release very soon, ideally the next release. We have a few options here but the simplest might be to make all the new API private, and then make it public for 3.2.0.

Making the new shards API private would also defer a decision about shards=None vs shards=<tuple>

@dstansby
Copy link
Contributor

dstansby commented Aug 4, 2025

IMO getting the bugfix out is important enough that this should be targeted for release very soon, ideally the next release.

I agree, but there are lots of changes here that seem orthogonal to just fixing that bug.

@d-v-b
Copy link
Contributor Author

d-v-b commented Aug 4, 2025

IMO getting the bugfix out is important enough that this should be targeted for release very soon, ideally the next release.

I agree, but there are lots of changes here that seem orthogonal to just fixing that bug.

They are conceptually extremely simple. we already have tools for iterating over the shape defined by a chunk. I am adding tools for iterating over the shape defined by a shard, where a shard == a chunk in many cases. if you want time to study these changes in greater detail, lets just make them private.

@d-v-b
Copy link
Contributor Author

d-v-b commented Aug 4, 2025

Consider the new shard-centric API a fix not only for one specific bug, but a potentially large number of future bugs caused by iterating over chunk-sized regions when writing data. Now that we have added sharding, but not added a user-friendly API for iterating over shard-sized regions, we are setting up a massive footgun for users to break their IO routines in a concurrent context. That's why this PR adds new functions / methods.

@dstansby
Copy link
Contributor

dstansby commented Aug 4, 2025

They are conceptually extremely simple.

They are not conceptually simple to me, and we've had a long discussion here before landing on a design. We shouldn't just make everything private and merge it without proper review, since private code still needs to be understood by developers. If code takes a long time to review, then it takes a long time to review - here there is the option of slimming this PR right down to only that needed to fix the bug, which would make for a quicker review.

My worry is the scope here has crept well beyond a targeted a minimal fix for a bug, to a set of wider changes that introduces new concepts and lots of new API that is not needed to fix that bug. The extra lines of code make PR review harder, and increase the surface area through which bugs could creep in, which is my reasoning for something this big to go in a minor (not micro) release.

So to move forward, I guess I can continue reviewing this, but it won't happen quickly because it's a large PR with lots of new API (public and private), or a smaller targeted just-a-bugfix PR would probably be quicker to review if there is an shorter term need to fix the bug.

@d-v-b
Copy link
Contributor Author

d-v-b commented Aug 4, 2025

since private code still needs to be understood by developers.

can you indicate which functions or routines are hard to understand here?

@dstansby
Copy link
Contributor

dstansby commented Aug 4, 2025

It's less a case of hard to understand, more a case of new concepts (what is a chunk? what is a shard? what do the eight chunk/shard properties mean? how does passing different values to functions affect these properties? ...). Again, I think this is really good and important work, but big enough and introduces/formalises enough concepts that it will take me a while to wrap my head around it all and review, and should in my opinion land in a minor release with comprehensive user guide documentation and examples to match the new API and concepts.

@d-v-b
Copy link
Contributor Author

d-v-b commented Aug 4, 2025

what is a chunk? what is a shard?

A chunk is the smallest region of an array that you can read in parallel. A shard is the smallest unit of an array that you can write in parallel. Using this terminology, there is a 1:1 correspondence between shards and files. There is not necessarily a 1:1 correspondence between chunks and files. This confusion caused the bug that this PR fixes (along with some other bugs).

For arrays that don't use the sharding codec, the chunk shape is the same as the effective shard shape. For arrays that do use sharding, the shard shape is an integer multiple of the chunk shape.

For reference, these are not new concepts. The chunk / shard distinction already exists in create_array and in the chunks and shards attributes of an array. In fact I don't think this PR introduces any new concepts at all. I'm simply taking an existing chunk-centric API, and making a shard-centric version of that same API, and in the process resolving multiple bugs.

@d-v-b
Copy link
Contributor Author

d-v-b commented Aug 4, 2025

as an indication of the low-complexity of these changes, read the tests I added. . There's really nothing elaborate or deep going on here.

@d-v-b d-v-b requested a review from dcherian August 5, 2025 15:17
@dstansby
Copy link
Contributor

I think my earlier comments still needs addressing here:

It is a bit confusing that .shards now returns None with just chunks, where all the other shard releated properties return the equivalent chunk values. I would suggest returning self.chunks from .shards if there's no sharding to match the other new properties, and adding a new .is_sharded: bool property to determine whether shards are being used (previously one would have done self.shards is None).

and

I like the concept that even when shards are not set, the shard properties still return the chunk values, but since this is a new concept it should be accomapanied with appropriate explanation in the Sharding user guide section, and in the release notes (which could just be a link to the sharding user guide section).

Happy to try and review in full once we work these points out (especially the second one).

@d-v-b
Copy link
Contributor Author

d-v-b commented Aug 11, 2025

It is a bit confusing that .shards now returns None with just chunks, where all the other shard releated properties return the equivalent chunk values. I would suggest returning self.chunks from .shards if there's no sharding to match the other new properties, and adding a new .is_sharded: bool property to determine whether shards are being used (previously one would have done self.shards is None).

I agree that it's confusing, and I also agree that the Array.shards attribute should make sense in light of the other various shard-related properties. But an is_sharded property might also be confusing, especially for an array where is_sharded == False but shards = <some tuple>. What about naming this property something like uses_sharding_codec? IMO we should be normalizing "all arrays are sharded" as much as possible across the entire stack, and to me this argues for treating the way an array is sharded (via the sharding codec, or not) as an implementation detail. But I don't feel strongly here except that whatever we do should not be confusing.

And sorry for all the pings recently but @zarr-developers/python-core-devs it would be great to get some other POVs here, since we are talking about some potential changes to the user-facing Array object.

For this PR, i'm totally fine making all the new shard methods private until we can decide on a good story for the array API. We can still use these methods to fix the various bugs related to mixing up chunks and shards. This would also address your second point, because we wouldn't need to add any new user-facing docs at this time.

@dcherian
Copy link
Contributor

What about naming this property something like uses_sharding_codec?

This is better.

IMO we should be normalizing "all arrays are sharded" as much as possible across the entire stack, and to me this argues for treating the way an array is sharded (via the sharding codec, or not) as an implementation detail.

I'm 50/50 on this, given that historically Zarr has only had chunks, not shards; and that these are also kwargs to the constructor. It's going to be confusing regardless. Should we just call it shards_or_chunks :D

i'm totally fine making all the new shard methods private until we can decide on a good story for the array API.

Let's do this ASAP, and punt on the changing the meaning of the .shards property.

@dstansby
Copy link
Contributor

Perhaps the recommendation to replace array.shards == None could be array.shards == array.chunks instead?

Certainly .shards shouldn't be changed in code that goes in 3.1.x now, so again I'd advocate for any new API being private, and only the minimum API added that's needed to fix the original issue.

@d-v-b
Copy link
Contributor Author

d-v-b commented Aug 13, 2025

the new API is all private, let me know if we need to do anything else

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants