From 2bf20693c02e91a3f8db6a965961d0d28af54328 Mon Sep 17 00:00:00 2001
From: Joseph Hamman
Date: Sun, 29 Dec 2024 16:14:39 -0700
Subject: [PATCH 1/8] docs: add docs on extending zarr 3

---
 docs/user-guide/extending.rst | 87 +++++++++++++++++++++++++++++++++++
 docs/user-guide/index.rst     |  2 +-
 2 files changed, 88 insertions(+), 1 deletion(-)
 create mode 100644 docs/user-guide/extending.rst

diff --git a/docs/user-guide/extending.rst b/docs/user-guide/extending.rst
new file mode 100644
index 0000000000..7830484c8b
--- /dev/null
+++ b/docs/user-guide/extending.rst
@@ -0,0 +1,87 @@
+
+Extending Zarr
+==============
+
+Zarr-Python 3 was designed to be extensible. This means that you can extend
+the library by writing custom classes and plugins. Currently, Zarr can be extended
+in the following ways:
+
+Custom codecs
+-------------
+
+There are three types of codecs in Zarr: array-to-array, array-to-bytes, and bytes-to-bytes.
+Array-to-array codecs are used to transform the n-dimensional array data before serializing
+to bytes. Examples include delta encoding or scaling codecs. Array-to-bytes codecs are used
+for serializing the array data to bytes. In Zarr, the main codec to use for numeric arrays
+is the :class:`zarr.codecs.BytesCodec`. Bytes-to-bytes transform the serialized bytestreams
+of the array data. Examples include compression codecs, such as
+:class:`zarr.codecs.GzipCodec`, :class:`zarr.codecs.BloscCodec` or
+:class:`zarr.codecs.ZstdCodec`, and codecs that add a checksum to the bytestream, such as
+:class:`zarr.codecs.Crc32cCodec`.
+
+Custom codecs for Zarr are implemented by subclassing the relevant base class, see
+:class:`zarr.abc.codec.ArrayArrayCodec`, :class:`zarr.abc.codec.ArrayBytesCodec` and
+:class:`zarr.abc.codec.BytesBytesCodec`. Most custom codecs should implement the
+``_encode_single`` and ``_decode_single`` methods. These methods operate on single chunks
+of the array data. Alternatively, custom codecs can implement the ``encode`` and ``decode``
+methods, which operate on batches of chunks, in case the codec is intended to implement
+its own batch processing.
+
+Custom codecs should also implement the following methods:
+
+- ``compute_encoded_size``, which returns the byte size of the encoded data given the byte
+  size of the original data. It should raise ``NotImplementedError`` for codecs with
+  variable-sized outputs, such as compression codecs.
+- ``validate``, which can be used to check that the codec metadata is compatible with the
+  array metadata. It should raise errors if not.
+- ``resolve_metadata`` (optional), which is important for codecs that change the shape,
+  dtype or fill value of a chunk.
+- ``evolve_from_array_spec`` (optional), which can be useful for automatically filling in
+  codec configuration metadata from the array metadata.
+
+To use custom codecs in Zarr, they need to be registered using the
+`entrypoint mechanism `_.
+Commonly, entrypoints are declared in the ``pyproject.toml`` of your package under the
+``[project.entry-points."zarr.codecs"]`` section. Zarr will automatically discover and
+load all codecs registered with the entrypoint mechanism from imported modules.
+
+.. code-block:: toml
+
+    [project.entry-points."zarr.codecs"]
+    "custompackage.fancy_codec" = "custompackage:FancyCodec"
+
+New codecs need to have their own unique identifier. To avoid naming collisions, it is
+strongly recommended to prefix the codec identifier with a unique name. For example,
+the codecs from ``numcodecs`` are prefixed with ``numcodecs.``, e.g. ``numcodecs.delta``.
+
+.. note::
+    Note that the extension mechanism for Zarr version 3 is still under development.
+    Requirements for custom codecs including the choice of codec identifiers might
+    change in the future.
+
+It is also possible to register codecs as replacements for existing codecs. This might be
+useful for providing specialized implementations, such as GPU-based codecs. In case of
+multiple codecs, the :mod:`zarr.core.config` mechanism can be used to select the preferred
+implementation.
+
+.. note::
+    This sections explains how custom codecs can be created for Zarr version 3. For Zarr
+    version 2, codecs should subclass the
+    `numcodecs.abc.Codec `_
+    base class and register through
+    `numcodecs.registry.register_codec `_.
+
+Custom stores
+-------------
+
+Coming soon.
+
+Custom array buffers
+--------------------
+
+Coming soon.
+
+Other extensions
+----------------
+
+In the future, Zarr will support writing custom data types and chunk grids.
diff --git a/docs/user-guide/index.rst b/docs/user-guide/index.rst
index d9d79a7f98..193d11775c 100644
--- a/docs/user-guide/index.rst
+++ b/docs/user-guide/index.rst
@@ -24,10 +24,10 @@ Advanced Topics

    performance
    consolidated_metadata
+   extending
    whatsnew_v3
    v3_todos

 .. Coming soon

    async
-   extending

From 0f2405f06645668d4dc9799ef6023055145594ab Mon Sep 17 00:00:00 2001
From: Norman Rzepka
Date: Mon, 30 Dec 2024 13:07:58 +0100
Subject: [PATCH 2/8] Apply suggestions from code review

Co-authored-by: David Stansby
---
 docs/user-guide/extending.rst | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/docs/user-guide/extending.rst b/docs/user-guide/extending.rst
index 7830484c8b..595dd30c7f 100644
--- a/docs/user-guide/extending.rst
+++ b/docs/user-guide/extending.rst
@@ -9,8 +9,11 @@ in the following ways:
 Custom codecs
 -------------

-There are three types of codecs in Zarr: array-to-array, array-to-bytes, and bytes-to-bytes.
-Array-to-array codecs are used to transform the n-dimensional array data before serializing
+There are three types of codecs in Zarr:
+- array-to-array
+- array-to-bytes
+- bytes-to-bytes.
+Array-to-array codecs are used to transform the array data before serializing
 to bytes. Examples include delta encoding or scaling codecs. Array-to-bytes codecs are used
 for serializing the array data to bytes. In Zarr, the main codec to use for numeric arrays
 is the :class:`zarr.codecs.BytesCodec`. Bytes-to-bytes transform the serialized bytestreams
@@ -32,7 +35,7 @@ Custom codecs should also implement the following methods:
 - ``compute_encoded_size``, which returns the byte size of the encoded data given the byte
   size of the original data. It should raise ``NotImplementedError`` for codecs with
   variable-sized outputs, such as compression codecs.
-- ``validate``, which can be used to check that the codec metadata is compatible with the
+- ``validate`` (optional), which can be used to check that the codec metadata is compatible with the
   array metadata. It should raise errors if not.
 - ``resolve_metadata`` (optional), which is important for codecs that change the shape,
   dtype or fill value of a chunk.
@@ -65,7 +68,7 @@ multiple codecs, the :mod:`zarr.core.config` mechanism can be used to select the
 implementation.

 .. note::
-    This sections explains how custom codecs can be created for Zarr version 3. For Zarr
+    This section explains how custom codecs can be created for Zarr version 3 data. For Zarr
     version 2, codecs should subclass the
     `numcodecs.abc.Codec `_
     base class and register through

From 134cd415efc0f3cc9ca102e1c0122792abc86514 Mon Sep 17 00:00:00 2001
From: Norman Rzepka
Date: Mon, 30 Dec 2024 13:12:16 +0100
Subject: [PATCH 3/8] move note up

---
 docs/user-guide/extending.rst | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/docs/user-guide/extending.rst b/docs/user-guide/extending.rst
index 595dd30c7f..7d2784e2cc 100644
--- a/docs/user-guide/extending.rst
+++ b/docs/user-guide/extending.rst
@@ -9,10 +9,18 @@ in the following ways:
 Custom codecs
 -------------

+.. note::
+    This section explains how custom codecs can be created for Zarr version 3 data. For Zarr
+    version 2, codecs should subclass the
+    `numcodecs.abc.Codec `_
+    base class and register through
+    `numcodecs.registry.register_codec `_.
+
 There are three types of codecs in Zarr:
 - array-to-array
 - array-to-bytes
-- bytes-to-bytes.
+- bytes-to-bytes
+
 Array-to-array codecs are used to transform the array data before serializing
 to bytes. Examples include delta encoding or scaling codecs. Array-to-bytes codecs are used
 for serializing the array data to bytes. In Zarr, the main codec to use for numeric arrays
@@ -67,13 +75,6 @@ useful for providing specialized implementations, such as GPU-based codecs. In c
 multiple codecs, the :mod:`zarr.core.config` mechanism can be used to select the preferred
 implementation.

-.. note::
-    This section explains how custom codecs can be created for Zarr version 3 data. For Zarr
-    version 2, codecs should subclass the
-    `numcodecs.abc.Codec `_
-    base class and register through
-    `numcodecs.registry.register_codec `_.
-
 Custom stores
 -------------


From 2998561337de1c0581250183dc7a8307303a74bd Mon Sep 17 00:00:00 2001
From: Davis Bennett
Date: Wed, 1 Jan 2025 22:36:19 +0100
Subject: [PATCH 4/8] remove test.py (#2612)

---
 test.py | 7 -------
 1 file changed, 7 deletions(-)
 delete mode 100644 test.py

diff --git a/test.py b/test.py
deleted file mode 100644
index 29dac92c8b..0000000000
--- a/test.py
+++ /dev/null
@@ -1,7 +0,0 @@
-import zarr
-
-store = zarr.DirectoryStore("data")
-r = zarr.open_group(store=store)
-z = r.full("myArray", 42, shape=(), dtype="i4", compressor=None)
-
-print(z.oindex[...])

From b9699f5c5a9b1f76a7509c333277334dbc2d415d Mon Sep 17 00:00:00 2001
From: David Stansby
Date: Thu, 2 Jan 2025 15:17:35 +0000
Subject: [PATCH 5/8] Note that whole directories can be deleted in LocalStore (#2606)

---
 src/zarr/storage/local.py | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/src/zarr/storage/local.py b/src/zarr/storage/local.py
index f9b1747c31..f4226792cb 100644
--- a/src/zarr/storage/local.py
+++ b/src/zarr/storage/local.py
@@ -189,6 +189,18 @@ async def set_partial_values(
         await concurrent_map(args, asyncio.to_thread, limit=None)  # TODO: fix limit

     async def delete(self, key: str) -> None:
+        """
+        Remove a key from the store.
+
+        Parameters
+        ----------
+        key : str
+
+        Notes
+        -----
+        If ``key`` is a directory within this store, the entire directory
+        at ``store.root / key`` is deleted.
+        """
         # docstring inherited
         self._check_writable()
         path = self.root / key

From 25355036835a91b82fff1b816f647785b5ee6521 Mon Sep 17 00:00:00 2001
From: Joe Hamman
Date: Thu, 2 Jan 2025 09:20:09 -0800
Subject: [PATCH 6/8] fix: run-coverage command now tracks src directory (#2615)

---
 pyproject.toml | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/pyproject.toml b/pyproject.toml
index 75bbbf15d3..a92c30ab9f 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -136,8 +136,8 @@ numpy = ["1.25", "2.1"]
 features = ["gpu"]

 [tool.hatch.envs.test.scripts]
-run-coverage = "pytest --cov-config=pyproject.toml --cov=pkg --cov=tests"
-run-coverage-gpu = "pip install cupy-cuda12x && pytest -m gpu --cov-config=pyproject.toml --cov=pkg --cov=tests"
+run-coverage = "pytest --cov-config=pyproject.toml --cov=pkg --cov=src"
+run-coverage-gpu = "pip install cupy-cuda12x && pytest -m gpu --cov-config=pyproject.toml --cov=pkg --cov=src"
 run = "run-coverage --no-cov"
 run-verbose = "run-coverage --verbose"
 run-mypy = "mypy src"
@@ -157,7 +157,7 @@ numpy = ["1.25", "2.1"]
 version = ["minimal"]

 [tool.hatch.envs.gputest.scripts]
-run-coverage = "pytest -m gpu --cov-config=pyproject.toml --cov=pkg --cov=tests"
+run-coverage = "pytest -m gpu --cov-config=pyproject.toml --cov=pkg --cov=src"
 run = "run-coverage --no-cov"
 run-verbose = "run-coverage --verbose"
 run-mypy = "mypy src"

From 029c23e6527f8e306c08edcf6aaa84f89d6f3b2d Mon Sep 17 00:00:00 2001
From: Norman Rzepka
Date: Thu, 2 Jan 2025 18:29:37 +0100
Subject: [PATCH 7/8] fix doc build

---
 docs/user-guide/arrays.rst | 4 ++--
 docs/user-guide/index.rst  | 2 --
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/docs/user-guide/arrays.rst b/docs/user-guide/arrays.rst
index 76fb8e6910..4d1ad12abd 100644
--- a/docs/user-guide/arrays.rst
+++ b/docs/user-guide/arrays.rst
@@ -196,7 +196,7 @@ algorithm (compression level 3) internally within Blosc, and with the
 bit-shuffle filter applied.

 When using a compressor, it can be useful to get some diagnostics on the
-compression ratio. Zarr arrays provide the :property:`zarr.Array.info` property
+compression ratio. Zarr arrays provide the :attr:`zarr.Array.info` property
 which can be used to print useful diagnostics, e.g.:

 .. ipython:: python
@@ -212,7 +212,7 @@ prints additional diagnostics, e.g.:

 .. note::
    :func:`zarr.Array.info_complete` will inspect the underlying store and may
-   be slow for large arrays. Use :property:`zarr.Array.info` if detailed storage
+   be slow for large arrays. Use :attr:`zarr.Array.info` if detailed storage
    statistics are not needed.

 If you don't specify a compressor, by default Zarr uses the Blosc
diff --git a/docs/user-guide/index.rst b/docs/user-guide/index.rst
index d7936e30b7..8647eeb3e6 100644
--- a/docs/user-guide/index.rst
+++ b/docs/user-guide/index.rst
@@ -25,8 +25,6 @@ Advanced Topics
    performance
    consolidated_metadata
    extending
-   whatsnew_v3
-   v3_todos

 .. Coming soon


From 8b7da42919ed8a04b20c7f71ca375d1381171ee9 Mon Sep 17 00:00:00 2001
From: Davis Bennett
Date: Thu, 2 Jan 2025 18:46:08 +0100
Subject: [PATCH 8/8] Update docs/user-guide/extending.rst

---
 docs/user-guide/extending.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/user-guide/extending.rst b/docs/user-guide/extending.rst
index 7d2784e2cc..405dcb92c0 100644
--- a/docs/user-guide/extending.rst
+++ b/docs/user-guide/extending.rst
@@ -24,7 +24,7 @@ There are three types of codecs in Zarr:
 Array-to-array codecs are used to transform the array data before serializing
 to bytes. Examples include delta encoding or scaling codecs. Array-to-bytes codecs are used
 for serializing the array data to bytes. In Zarr, the main codec to use for numeric arrays
-is the :class:`zarr.codecs.BytesCodec`. Bytes-to-bytes transform the serialized bytestreams
+is the :class:`zarr.codecs.BytesCodec`. Bytes-to-bytes codecs transform the serialized bytestreams
 of the array data. Examples include compression codecs, such as
 :class:`zarr.codecs.GzipCodec`, :class:`zarr.codecs.BloscCodec` or
 :class:`zarr.codecs.ZstdCodec`, and codecs that add a checksum to the bytestream, such as