Skip to content

Conversation

RFLeijenaar
Copy link
Contributor

Following #3433 (@d-v-b), I have implemented a registry for chunk key encodings. This allows users to subclass ChunkKeyEncoding and create their own implementation.

The scope of ChunkKeyEncoding.from_dict() function is reduced to what you expect it do: build from a dict. I placed the parsing function in chunk_key_encodings.py as it is used in both array and metadata v3 construction. I wasn't sure where else to put it.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Sep 5, 2025
Copy link

codecov bot commented Sep 5, 2025

Codecov Report

❌ Patch coverage is 56.60377% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.23%. Comparing base (3c883a3) to head (46d1780).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/zarr/core/chunk_key_encodings.py 51.42% 17 Missing ⚠️
src/zarr/registry.py 58.33% 5 Missing ⚠️
src/zarr/core/metadata/v3.py 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3436      +/-   ##
==========================================
- Coverage   61.24%   61.23%   -0.02%     
==========================================
  Files          83       83              
  Lines        9885     9897      +12     
==========================================
+ Hits         6054     6060       +6     
- Misses       3831     3837       +6     
Files with missing lines Coverage Δ
src/zarr/core/array.py 69.10% <100.00%> (-0.04%) ⬇️
src/zarr/core/metadata/v3.py 58.03% <50.00%> (ø)
src/zarr/registry.py 63.39% <58.33%> (-0.24%) ⬇️
src/zarr/core/chunk_key_encodings.py 33.82% <51.42%> (-0.03%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@RFLeijenaar
Copy link
Contributor Author

RFLeijenaar commented Sep 5, 2025

Note that the current implementation does not use the global config. This means that it only supports a single implementation (registration) for each chunk key encoding 'type' as indicated by the name field. Like codecs (parse_codecs), it uses the name field to retrieve the correct class from the registry. This does mean that the chunk key encoding should always be registered with a qualname that matches the name field.

With the entrypoint plugin mechanism this currently doesn't work correctly. It uses the full class path rather than the entrypoint name e.name for the key.

I will update the registry to bring it in line with how codecs are implemented, with a dict of registries, that map a chunk key encoding type (name) to a registry that contains possibly multiple implementations for that type.

@d-v-b
Copy link
Contributor

d-v-b commented Sep 5, 2025

I will update the registry to bring it in line with how codecs are implemented, with a dict of registries, that map a chunk key encoding type (name) to a registry that contains possibly multiple implementations for that type.

You don't need to copy the codec registry. A simple {name: class} mapping is fine for chunk key encodings.

@d-v-b
Copy link
Contributor

d-v-b commented Sep 5, 2025

for context, the codec registry associates multiple codec classes with a single codec identifier because of need specific to codecs (running the same codec algorithm on a CPU vs GPU). Chunk key encodings will never be run on specialized hardware, so we can use a simpler mapping.

@RFLeijenaar
Copy link
Contributor Author

for context, the codec registry associates multiple codec classes with a single codec identifier because of need specific to codecs (running the same codec algorithm on a CPU vs GPU). Chunk key encodings will never be run on specialized hardware, so we can use a simpler mapping.

Originally, I implemented it such that it would only allow one implementation, but there were some issues as noted in my other comment. That said, these can be resolved by always using the name field as the key, and for any entrypoint e setting the key to e.name.

Regarding different implementations, it doesn't need to be on special hardware, right? One could also make their own implementation, possibly with bindings in a faster language. Not that I think that this would be common for chunk key encodings.

Let me know if you want me to revert to the simpeler, one implementation only design.

@d-v-b
Copy link
Contributor

d-v-b commented Sep 12, 2025

Let me know if you want me to revert to the simpeler, one implementation only design.

Yes, if it's not too much work I think this is a better design. I don't see a situation where a chunk key encoding would require specialized hardware, so I don't think we need to copy the codec registry design.

@RFLeijenaar
Copy link
Contributor Author

I already have another branch that only allows one implementation, I can move that one into this PR!

In the meantime, I have been using a custom ChunkKeyEncoding in one of my repos. It works excellent to create a nested structure to reduce the number of files in a single directory in filesystems that might otherwise become slow:

@dataclass(frozen=True)
class FanoutChunkKeyEncoding(ChunkKeyEncoding):
    name: str = "fanout"
    separator: SeparatorLiteral = "/"
    max_files_per_dir: int = 1000

    def _fanout_coord(self, coord: int) -> list[str]:
        # Break a coordinate into fanout parts
        parts = []
        while coord >= self.max_files_per_dir:
            coord, rem = divmod(coord, self.max_files_per_dir)
            parts.append(str(rem))
        parts.append(str(coord))
        parts.reverse()
        return parts

    def encode_chunk_key(self, chunk_coords: tuple[int, ...]) -> str:
        parts = []
        for dim, coord in enumerate(chunk_coords):
            coords = self._fanout_coord(coord)
            coords = [f"d{dim}"] + coords
            parts.extend(coords)
        parts.append("c")
        return self.separator.join(parts)

I believe that right now decode_chunk_key(self, chunk_key: str) is unused. Are there any plans to use this function in the future? Otherwise, it might be better to get rid of it now before making the extendibility public.

Are there any other things you would like me to do for this PR?

@d-v-b
Copy link
Contributor

d-v-b commented Sep 13, 2025

I already have another branch that only allows one implementation, I can move that one into this PR!

Great!

In the meantime, I have been using a custom ChunkKeyEncoding in one of my repos. It works excellent to create a nested structure to reduce the number of files in a single directory in filesystems that might otherwise become slow:

That's very cool, if you're ever interested in sharing this chunk key encoding more widely, you might consider writing it up as a spec in https://github.com/zarr-developers/zarr-extensions/

I believe that right now decode_chunk_key(self, chunk_key: str) is unused. Are there any plans to use this function in the future? Otherwise, it might be better to get rid of it now before making the extendibility public.

I haven't thought much about this, but I think we would only need decode_chunk_key in a situation when we have a random chunk key and we need to know which chunk it contains. I'm pretty sure this scenario never happens in the normal operation of this library, so you're probably right that this should be eventually removed from the public API. Defining chunk decoding is useful for testing but I don't think we need it as a public method.

Are there any other things you would like me to do for this PR?

other than moving to the simpler registry design, this looks good!

- Change register_chunk_key_encoding function to take key as first arg
similar to codec.
This enables users to add additional fields to a custom ChunkKeyEncoding
without having to override __init__ and taking care of immutability of
the attrs.
- Enforce encode_chunk_key to be implemented (abstractmethod in ABC)
- Make decode_chunk_key optional (raise NotImplementedError by default)
Note, the latter is never raised by the current zarr implementation.
@RFLeijenaar
Copy link
Contributor Author

I have overwritten this PR with the other branch (it largely had the same commit history).

I haven't thought much about this, but I think we would only need decode_chunk_key in a situation when we have a random chunk key and we need to know which chunk it contains. I'm pretty sure this scenario never happens in the normal operation of this library, so you're probably right that this should be eventually removed from the public API. Defining chunk decoding is useful for testing but I don't think we need it as a public method.

Yes, which could be useful to iterate/find the current chunks in the array store. But I believe it is almost always safer to then just iterate the chunk coords, encode, and see which chunk keys are present.

Anyway, I left the decode_chunk_key in there, but it is no longer an @abstractmethod, and raises NotImplementedError by default. Note, this will never be raised by the current library. On the other hand, a subclass that is instantiated enforces the implementation of encode_chunk_key. I added docstrings to make subclassing clear to the user. It can be considered later if decode_chunk_key should be removed altogether.

That's very cool, if you're ever interested in sharing this chunk key encoding more widely, you might consider writing it up as a spec in https://github.com/zarr-developers/zarr-extensions/

Thanks, I will consider writing a spec for it. I was also thinking about a hashed key encoding, which could improve load balancing on distributed storage such as s3 or gcs (see zarr-specs #115). But I will have to test which array/chunk/read shape configurations would actually cause a bottleneck to see how useful it is.


@dataclass(frozen=True)
class V2ChunkKeyEncoding(ChunkKeyEncoding):
name: Literal["v2"] = "v2"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would annotating this as a ClassVar work here? Then we wouldn't need to redundantly pass a name parameter to __init__

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name should not be an instance configurable parameter but a class constant. Agreed, ClassVar works perfectly here.

class ChunkKeyEncoding(Metadata):
class ChunkKeyEncoding(ABC, Metadata):
name: str
separator: SeparatorLiteral = "."
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to remove separator as a standard configuration parameter. For example, a hash-based encoding does not need a separator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the base ABC probably only needs to require a name attribute, and everything else can be methods

@d-v-b
Copy link
Contributor

d-v-b commented Sep 17, 2025

this looks good to me @RFLeijenaar. could you write up some release notes? then we can get this in the next release

@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Sep 17, 2025
@d-v-b d-v-b merged commit 6805332 into zarr-developers:main Sep 17, 2025
31 checks passed
@d-v-b
Copy link
Contributor

d-v-b commented Sep 17, 2025

thanks @RFLeijenaar!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants