-
-
Notifications
You must be signed in to change notification settings - Fork 364
Add registry for chunk key encodings for extensibility #3436
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3436 +/- ##
==========================================
- Coverage 61.24% 61.23% -0.02%
==========================================
Files 83 83
Lines 9885 9897 +12
==========================================
+ Hits 6054 6060 +6
- Misses 3831 3837 +6
🚀 New features to boost your workflow:
|
Note that the current implementation does not use the global config. This means that it only supports a single implementation (registration) for each chunk key encoding 'type' as indicated by the With the entrypoint plugin mechanism this currently doesn't work correctly. It uses the full class path rather than the entrypoint name I will update the registry to bring it in line with how codecs are implemented, with a dict of registries, that map a chunk key encoding type (name) to a registry that contains possibly multiple implementations for that type. |
You don't need to copy the codec registry. A simple |
for context, the codec registry associates multiple codec classes with a single codec identifier because of need specific to codecs (running the same codec algorithm on a CPU vs GPU). Chunk key encodings will never be run on specialized hardware, so we can use a simpler mapping. |
Originally, I implemented it such that it would only allow one implementation, but there were some issues as noted in my other comment. That said, these can be resolved by always using the Regarding different implementations, it doesn't need to be on special hardware, right? One could also make their own implementation, possibly with bindings in a faster language. Not that I think that this would be common for chunk key encodings. Let me know if you want me to revert to the simpeler, one implementation only design. |
Yes, if it's not too much work I think this is a better design. I don't see a situation where a chunk key encoding would require specialized hardware, so I don't think we need to copy the codec registry design. |
I already have another branch that only allows one implementation, I can move that one into this PR! In the meantime, I have been using a custom @dataclass(frozen=True)
class FanoutChunkKeyEncoding(ChunkKeyEncoding):
name: str = "fanout"
separator: SeparatorLiteral = "/"
max_files_per_dir: int = 1000
def _fanout_coord(self, coord: int) -> list[str]:
# Break a coordinate into fanout parts
parts = []
while coord >= self.max_files_per_dir:
coord, rem = divmod(coord, self.max_files_per_dir)
parts.append(str(rem))
parts.append(str(coord))
parts.reverse()
return parts
def encode_chunk_key(self, chunk_coords: tuple[int, ...]) -> str:
parts = []
for dim, coord in enumerate(chunk_coords):
coords = self._fanout_coord(coord)
coords = [f"d{dim}"] + coords
parts.extend(coords)
parts.append("c")
return self.separator.join(parts) I believe that right now Are there any other things you would like me to do for this PR? |
Great!
That's very cool, if you're ever interested in sharing this chunk key encoding more widely, you might consider writing it up as a spec in https://github.com/zarr-developers/zarr-extensions/
I haven't thought much about this, but I think we would only need
other than moving to the simpler registry design, this looks good! |
a581332
to
289308d
Compare
- Change register_chunk_key_encoding function to take key as first arg similar to codec.
This enables users to add additional fields to a custom ChunkKeyEncoding without having to override __init__ and taking care of immutability of the attrs.
289308d
to
c4ce5df
Compare
- Enforce encode_chunk_key to be implemented (abstractmethod in ABC) - Make decode_chunk_key optional (raise NotImplementedError by default) Note, the latter is never raised by the current zarr implementation.
I have overwritten this PR with the other branch (it largely had the same commit history).
Yes, which could be useful to iterate/find the current chunks in the array store. But I believe it is almost always safer to then just iterate the chunk coords, encode, and see which chunk keys are present. Anyway, I left the
Thanks, I will consider writing a spec for it. I was also thinking about a hashed key encoding, which could improve load balancing on distributed storage such as s3 or gcs (see zarr-specs #115). But I will have to test which array/chunk/read shape configurations would actually cause a bottleneck to see how useful it is. |
src/zarr/core/chunk_key_encodings.py
Outdated
|
||
@dataclass(frozen=True) | ||
class V2ChunkKeyEncoding(ChunkKeyEncoding): | ||
name: Literal["v2"] = "v2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would annotating this as a ClassVar
work here? Then we wouldn't need to redundantly pass a name
parameter to __init__
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name
should not be an instance configurable parameter but a class constant. Agreed, ClassVar
works perfectly here.
src/zarr/core/chunk_key_encodings.py
Outdated
class ChunkKeyEncoding(Metadata): | ||
class ChunkKeyEncoding(ABC, Metadata): | ||
name: str | ||
separator: SeparatorLiteral = "." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better to remove separator
as a standard configuration parameter. For example, a hash-based encoding does not need a separator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the base ABC probably only needs to require a name
attribute, and everything else can be methods
This automatically removes it as an init argument.
this looks good to me @RFLeijenaar. could you write up some release notes? then we can get this in the next release |
thanks @RFLeijenaar! |
Following #3433 (@d-v-b), I have implemented a registry for chunk key encodings. This allows users to subclass
ChunkKeyEncoding
and create their own implementation.The scope of ChunkKeyEncoding.from_dict() function is reduced to what you expect it do: build from a dict. I placed the parsing function in
chunk_key_encodings.py
as it is used in both array and metadata v3 construction. I wasn't sure where else to put it.TODO:
docs/user-guide/*.rst
changes/