Add registry for chunk key encodings for extensibility #3436

RFLeijenaar · 2025-09-05T10:50:39Z

Following #3433 (@d-v-b), I have implemented a registry for chunk key encodings. This allows users to subclass ChunkKeyEncoding and create their own implementation.

The scope of ChunkKeyEncoding.from_dict() function is reduced to what you expect it do: build from a dict. I placed the parsing function in chunk_key_encodings.py as it is used in both array and metadata v3 construction. I wasn't sure where else to put it.

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/user-guide/*.rst
Changes documented as a new file in changes/
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

codecov · 2025-09-05T11:11:17Z

Codecov Report

❌ Patch coverage is 56.60377% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.23%. Comparing base (3c883a3) to head (46d1780).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
src/zarr/core/chunk_key_encodings.py	51.42%	17 Missing ⚠️
src/zarr/registry.py	58.33%	5 Missing ⚠️
src/zarr/core/metadata/v3.py	50.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3436      +/-   ##
==========================================
- Coverage   61.24%   61.23%   -0.02%     
==========================================
  Files          83       83              
  Lines        9885     9897      +12     
==========================================
+ Hits         6054     6060       +6     
- Misses       3831     3837       +6

Files with missing lines	Coverage Δ
src/zarr/core/array.py	`69.10% <100.00%> (-0.04%)`	⬇️
src/zarr/core/metadata/v3.py	`58.03% <50.00%> (ø)`
src/zarr/registry.py	`63.39% <58.33%> (-0.24%)`	⬇️
src/zarr/core/chunk_key_encodings.py	`33.82% <51.42%> (-0.03%)`	⬇️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

RFLeijenaar · 2025-09-05T11:50:46Z

Note that the current implementation does not use the global config. This means that it only supports a single implementation (registration) for each chunk key encoding 'type' as indicated by the name field. Like codecs (parse_codecs), it uses the name field to retrieve the correct class from the registry. This does mean that the chunk key encoding should always be registered with a qualname that matches the name field.

With the entrypoint plugin mechanism this currently doesn't work correctly. It uses the full class path rather than the entrypoint name e.name for the key.

I will update the registry to bring it in line with how codecs are implemented, with a dict of registries, that map a chunk key encoding type (name) to a registry that contains possibly multiple implementations for that type.

d-v-b · 2025-09-05T13:20:51Z

I will update the registry to bring it in line with how codecs are implemented, with a dict of registries, that map a chunk key encoding type (name) to a registry that contains possibly multiple implementations for that type.

You don't need to copy the codec registry. A simple {name: class} mapping is fine for chunk key encodings.

d-v-b · 2025-09-05T13:22:30Z

for context, the codec registry associates multiple codec classes with a single codec identifier because of need specific to codecs (running the same codec algorithm on a CPU vs GPU). Chunk key encodings will never be run on specialized hardware, so we can use a simpler mapping.

RFLeijenaar · 2025-09-05T13:37:17Z

for context, the codec registry associates multiple codec classes with a single codec identifier because of need specific to codecs (running the same codec algorithm on a CPU vs GPU). Chunk key encodings will never be run on specialized hardware, so we can use a simpler mapping.

Originally, I implemented it such that it would only allow one implementation, but there were some issues as noted in my other comment. That said, these can be resolved by always using the name field as the key, and for any entrypoint e setting the key to e.name.

Regarding different implementations, it doesn't need to be on special hardware, right? One could also make their own implementation, possibly with bindings in a faster language. Not that I think that this would be common for chunk key encodings.

Let me know if you want me to revert to the simpeler, one implementation only design.

d-v-b · 2025-09-12T15:46:43Z

Let me know if you want me to revert to the simpeler, one implementation only design.

Yes, if it's not too much work I think this is a better design. I don't see a situation where a chunk key encoding would require specialized hardware, so I don't think we need to copy the codec registry design.

RFLeijenaar · 2025-09-12T16:04:27Z

I already have another branch that only allows one implementation, I can move that one into this PR!

In the meantime, I have been using a custom ChunkKeyEncoding in one of my repos. It works excellent to create a nested structure to reduce the number of files in a single directory in filesystems that might otherwise become slow:

@dataclass(frozen=True)
class FanoutChunkKeyEncoding(ChunkKeyEncoding):
    name: str = "fanout"
    separator: SeparatorLiteral = "/"
    max_files_per_dir: int = 1000

    def _fanout_coord(self, coord: int) -> list[str]:
        # Break a coordinate into fanout parts
        parts = []
        while coord >= self.max_files_per_dir:
            coord, rem = divmod(coord, self.max_files_per_dir)
            parts.append(str(rem))
        parts.append(str(coord))
        parts.reverse()
        return parts

    def encode_chunk_key(self, chunk_coords: tuple[int, ...]) -> str:
        parts = []
        for dim, coord in enumerate(chunk_coords):
            coords = self._fanout_coord(coord)
            coords = [f"d{dim}"] + coords
            parts.extend(coords)
        parts.append("c")
        return self.separator.join(parts)

I believe that right now decode_chunk_key(self, chunk_key: str) is unused. Are there any plans to use this function in the future? Otherwise, it might be better to get rid of it now before making the extendibility public.

Are there any other things you would like me to do for this PR?

d-v-b · 2025-09-13T10:09:52Z

I already have another branch that only allows one implementation, I can move that one into this PR!

Great!

In the meantime, I have been using a custom ChunkKeyEncoding in one of my repos. It works excellent to create a nested structure to reduce the number of files in a single directory in filesystems that might otherwise become slow:

That's very cool, if you're ever interested in sharing this chunk key encoding more widely, you might consider writing it up as a spec in https://github.com/zarr-developers/zarr-extensions/

I believe that right now decode_chunk_key(self, chunk_key: str) is unused. Are there any plans to use this function in the future? Otherwise, it might be better to get rid of it now before making the extendibility public.

I haven't thought much about this, but I think we would only need decode_chunk_key in a situation when we have a random chunk key and we need to know which chunk it contains. I'm pretty sure this scenario never happens in the normal operation of this library, so you're probably right that this should be eventually removed from the public API. Defining chunk decoding is useful for testing but I don't think we need it as a public method.

Are there any other things you would like me to do for this PR?

other than moving to the simpler registry design, this looks good!

- Change register_chunk_key_encoding function to take key as first arg similar to codec.

This enables users to add additional fields to a custom ChunkKeyEncoding without having to override __init__ and taking care of immutability of the attrs.

- Enforce encode_chunk_key to be implemented (abstractmethod in ABC) - Make decode_chunk_key optional (raise NotImplementedError by default) Note, the latter is never raised by the current zarr implementation.

RFLeijenaar · 2025-09-13T13:12:18Z

I have overwritten this PR with the other branch (it largely had the same commit history).

I haven't thought much about this, but I think we would only need decode_chunk_key in a situation when we have a random chunk key and we need to know which chunk it contains. I'm pretty sure this scenario never happens in the normal operation of this library, so you're probably right that this should be eventually removed from the public API. Defining chunk decoding is useful for testing but I don't think we need it as a public method.

Yes, which could be useful to iterate/find the current chunks in the array store. But I believe it is almost always safer to then just iterate the chunk coords, encode, and see which chunk keys are present.

Anyway, I left the decode_chunk_key in there, but it is no longer an @abstractmethod, and raises NotImplementedError by default. Note, this will never be raised by the current library. On the other hand, a subclass that is instantiated enforces the implementation of encode_chunk_key. I added docstrings to make subclassing clear to the user. It can be considered later if decode_chunk_key should be removed altogether.

That's very cool, if you're ever interested in sharing this chunk key encoding more widely, you might consider writing it up as a spec in https://github.com/zarr-developers/zarr-extensions/

Thanks, I will consider writing a spec for it. I was also thinking about a hashed key encoding, which could improve load balancing on distributed storage such as s3 or gcs (see zarr-specs #115). But I will have to test which array/chunk/read shape configurations would actually cause a bottleneck to see how useful it is.

d-v-b · 2025-09-13T19:52:36Z

src/zarr/core/chunk_key_encodings.py


 @dataclass(frozen=True)
 class V2ChunkKeyEncoding(ChunkKeyEncoding):
    name: Literal["v2"] = "v2"


Would annotating this as a ClassVar work here? Then we wouldn't need to redundantly pass a name parameter to __init__

name should not be an instance configurable parameter but a class constant. Agreed, ClassVar works perfectly here.

RFLeijenaar · 2025-09-14T09:54:56Z

src/zarr/core/chunk_key_encodings.py

-class ChunkKeyEncoding(Metadata):
+class ChunkKeyEncoding(ABC, Metadata):
    name: str
    separator: SeparatorLiteral = "."


It might be better to remove separator as a standard configuration parameter. For example, a hash-based encoding does not need a separator.

yes, the base ABC probably only needs to require a name attribute, and everything else can be methods

This automatically removes it as an init argument.

d-v-b · 2025-09-17T10:42:40Z

this looks good to me @RFLeijenaar. could you write up some release notes? then we can get this in the next release

d-v-b · 2025-09-17T21:03:04Z

thanks @RFLeijenaar!

github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Sep 5, 2025

RFLeijenaar added 3 commits September 13, 2025 12:09

Add registry for chunk key encodings.

0289779

Fix error message for unknown chunk key encoding in create_array test

9d78166

Removed unneccsary type ignore

c6ee5c6

RFLeijenaar force-pushed the extension-keyenc branch from a581332 to 289308d Compare September 13, 2025 10:19

RFLeijenaar added 2 commits September 13, 2025 12:28

Use entrypoint.name as the key for registering chunk key encodings.

ec607ae

- Change register_chunk_key_encoding function to take key as first arg similar to codec.

Move parsing of init args in CKE to __post_init__.

c4ce5df

This enables users to add additional fields to a custom ChunkKeyEncoding without having to override __init__ and taking care of immutability of the attrs.

RFLeijenaar force-pushed the extension-keyenc branch from 289308d to c4ce5df Compare September 13, 2025 10:29

Clarify ChunkKeyEncoding base class

0566131

- Enforce encode_chunk_key to be implemented (abstractmethod in ABC) - Make decode_chunk_key optional (raise NotImplementedError by default) Note, the latter is never raised by the current zarr implementation.

Merge branch 'main' into extension-keyenc

70d99e4

d-v-b reviewed Sep 13, 2025

View reviewed changes

RFLeijenaar commented Sep 14, 2025

View reviewed changes

RFLeijenaar and others added 6 commits September 14, 2025 12:15

Make name a ClassVar in ChunkKeyEncoding.

1f09f76

This automatically removes it as an init argument.

Remove separator from ChunkKeyEncoding base.

e89af81

Fix typing errors.

9a1de95

Update docs output to match code changes.

472899f

Merge branch 'main' into extension-keyenc

0c67e37

Merge branch 'main' into extension-keyenc

7b31ed4

Add release notes

46d1780

github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Sep 17, 2025

d-v-b approved these changes Sep 17, 2025

View reviewed changes

d-v-b merged commit 6805332 into zarr-developers:main Sep 17, 2025
31 checks passed

Uh oh!

Add registry for chunk key encodings for extensibility #3436

Add registry for chunk key encodings for extensibility #3436

Uh oh!

Conversation

RFLeijenaar commented Sep 5, 2025

Uh oh!

codecov bot commented Sep 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

RFLeijenaar commented Sep 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

d-v-b commented Sep 5, 2025

Uh oh!

d-v-b commented Sep 5, 2025

Uh oh!

RFLeijenaar commented Sep 5, 2025

Uh oh!

d-v-b commented Sep 12, 2025

Uh oh!

RFLeijenaar commented Sep 12, 2025

Uh oh!

d-v-b commented Sep 13, 2025

Uh oh!

RFLeijenaar commented Sep 13, 2025

Uh oh!

d-v-b Sep 13, 2025

Choose a reason for hiding this comment

Uh oh!

RFLeijenaar Sep 14, 2025

Choose a reason for hiding this comment

Uh oh!

RFLeijenaar Sep 14, 2025

Choose a reason for hiding this comment

Uh oh!

d-v-b Sep 14, 2025

Choose a reason for hiding this comment

Uh oh!

d-v-b commented Sep 17, 2025

Uh oh!

Uh oh!

d-v-b commented Sep 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov bot commented Sep 5, 2025 •

edited

Loading

RFLeijenaar commented Sep 5, 2025 •

edited

Loading