
@K-Meech (Contributor) commented Jul 16, 2025

For #1798

Adds a CLI using typer to convert v2 metadata (.zarray / .zattrs...) to v3 metadata zarr.json.

To test, you will need to install the new optional cli dependency e.g.
pip install -e ".[remote,cli]"

This should make the zarr-converter command available e.g. try:

zarr-converter --help
zarr-converter convert --help
zarr-converter clear --help

convert adds zarr.json files to every group / array, leaving the v2 metadata as-is. A zarr with both sets of metadata can still be opened with zarr.open, but will give a UserWarning: Both zarr.json (Zarr format 3) and .zarray (Zarr format 2) metadata objects exist... Zarr v3 will be used. This warning can be avoided by passing zarr_format=3 to zarr.open, or by using the clear command to remove the v2 metadata.
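For example, a minimal check that a converted hierarchy opens cleanly as v3 could look like the following (the "data.zarr" path is just a placeholder):

import zarr

# Force the v3 metadata to be used, which avoids the mixed-metadata UserWarning
# while the .zarray / .zattrs files are still present.
group = zarr.open("data.zarr", zarr_format=3)
print(group.metadata.zarr_format)  # 3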

clear can also remove v3 metadata. This is useful if the conversion fails part way through e.g. if one of the arrays uses a codec with no v3 equivalent.

All code for the CLI is in src/zarr/core/metadata/converter/cli.py, with the actual conversion functions in src/zarr/core/metadata/converter/converter_v2_v3.py. These functions can be called directly, for those who don't want to use the CLI (although they are currently part of /core, which is considered private API, so it may be best to move them elsewhere in the package).
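For illustration only (the exact public entry points may still change), calling the converter directly could look something like this, using the private helper shown in the review diff below and a placeholder path:

import zarr
from zarr.core.metadata.converter.converter_v2_v3 import _convert_array_metadata

# Open an existing v2 array and build the equivalent v3 metadata in memory.
arr = zarr.open_array("data.zarr/my_array", zarr_format=2)
v3_metadata = _convert_array_metadata(arr.metadata)  # ArrayV2Metadata -> ArrayV3Metadata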

Some points to consider:

  • I had to modify set_path from test_dtype_registry.py and test_codec_entrypoints.py, as they were causing the CLI tests to fail when run after them. This seems to be due to the lazy_load_list of the numcodecs codec registries being cleared, meaning the codecs were no longer available to my code, which finds the numcodecs.zarr3 equivalent of a numcodecs codec.
  • I tested this on local zarr images, so it would be great if someone with access to S3 / Google Cloud etc. could try it out on some small example images there.
  • I'm happy to add docs about how to use the CLI, but wanted to get feedback on the general structure first.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) on Jul 16, 2025
@dstansby (Contributor) left a comment

🎉 I think this is good now - I have one question about use of the logger, but it's not a blocker. I'll let this sit for a week or so because it's complicated, and would benefit from a second reviewer. If no-one reviews by then, I'll merge.


app = typer.Typer()

logger = logging.getLogger(__name__)
@dstansby (Contributor) commented:

Is it deliberate that this is a new logger, instead of importing the logger object from zarr? I don't think it matters too much, but re-using zarr._logger might save some code duplication, because you could remove the functions in this file that configure the logger.

@K-Meech (Author) replied:

Yes, this was intentional - most other libraries I've seen have a root logger (i.e. zarr._logger), and each individual file still calls logger = logging.getLogger(__name__) to create a child of that logger.

I could use zarr._logger here, but there are already other files in zarr that use logger = logging.getLogger(__name__), so I thought this would be more consistent. Also, I'm not sure changing this would remove code in this file: _set_logging_level uses verbose to determine whether the logging level should be INFO or WARNING, then calls zarr.set_log_level and zarr.set_format to set this on zarr._logger. So it's already acting on the overall logger.
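To illustrate the standard logging behaviour I'm relying on (not code from this PR) - a module-level logger named under the zarr package propagates to the package logger, so levels and handlers set there still apply:

import logging

package_logger = logging.getLogger("zarr")  # the logger behind zarr._logger
module_logger = logging.getLogger("zarr.core.metadata.converter.cli")  # what getLogger(__name__) creates in cli.py

# Levels/handlers set on the package logger govern the child via propagation.
package_logger.setLevel(logging.INFO)
print(module_logger.getEffectiveLevel() == logging.INFO)  # True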

return False


def _convert_array_metadata(metadata_v2: ArrayV2Metadata) -> ArrayV3Metadata:
A reviewer (Contributor) commented:
later on we can put this in zarr.core.metadata, e.g. in a conversion module

@d-v-b (Contributor) commented Aug 13, 2025

A high-level comment about the conversion strategy: this is not a blocker for this PR, but the current strategy is serial over arrays and groups. Converting an array / group waits until the previous one is converted, which increases the likelihood of the conversion breaking in the middle and leaving a hybrid v2 / v3 hierarchy.

We could also do conversion with a concurrent API based on concurrently reading the entire input hierarchy with dict(group.members(max_depth=None)), preparing new metadata documents for each member, then concurrently writing those documents with create_hierarchy.

This could be considered merely a performance concern, but I suspect reading and writing the hierarchy concurrently might be the difference between success and failure for this CLI when people use it on high-latency storage backends.

@K-Meech if you want to experience how your CLI tool handles IO latency, we have a store wrapper class that adds latency to any other store class. Try running your conversion on a large zarr hierarchy with 100ms of latency added to get and set on a local store. If it takes more than a few seconds, then this strongly argues for moving away from the serial conversion strategy and using something concurrent.
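A rough sketch of that concurrent approach (not working code: it assumes zarr.create_hierarchy takes a store plus a dict of path -> metadata, glosses over the root group and overwrite semantics, and convert_array_metadata is a hypothetical v2 -> v3 helper):

import zarr
from zarr.core.group import GroupMetadata
from zarr.storage import LocalStore

store = LocalStore("data.zarr")  # placeholder path
root = zarr.open_group(store, zarr_format=2)

# One concurrent read of the whole input hierarchy.
members = dict(root.members(max_depth=None))

# Prepare the new v3 metadata documents for each member in memory.
nodes = {}
for path, node in members.items():
    if isinstance(node, zarr.Group):
        nodes[path] = GroupMetadata(attributes=dict(node.attrs))
    else:
        nodes[path] = convert_array_metadata(node.metadata)  # hypothetical v2 -> v3 conversion

# One concurrent write of all the new metadata documents.
created = dict(zarr.create_hierarchy(store=store, nodes=nodes))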

@K-Meech (Author) commented Aug 18, 2025

Thanks @d-v-b . I tried converting a Zarr with 30 nested groups (each with a small array inside) and this took about 10 seconds total (with a LatencyStore set up as LatencyStore(local_store, get_latency=0.1, set_latency=0.1)). I've created a gist of this small test, if you wanted to try it out.

Happy to look at changing this to use concurrent writes with create_hierarchy - perhaps best done in a separate PR after this one is merged? What do you think?

I've been keeping a list of potential follow up changes mentioned on this PR - happy to convert these into issues after merge:

  • Handle conversion of consolidated metadata (.zmetadata) files
  • Handle any combination of numcodecs codecs - see discussion in thread. Currently the code only handles filters that are ArrayArrayCodecs and compressors that are BytesBytesCodecs (see the sketch after this list).
  • Use concurrent writes, rather than serial
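For context on that second point, this is the v3 codec pipeline shape the converter currently assumes (illustration only, not code from this PR):

from zarr.codecs import TransposeCodec, BytesCodec, BloscCodec

codecs = [
    TransposeCodec(order=(1, 0)),  # ArrayArrayCodec - a v2 filter maps here
    BytesCodec(),                  # ArrayBytesCodec (the serializer)
    BloscCodec(cname="zstd"),      # BytesBytesCodec - the v2 compressor maps here
]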

@d-v-b (Contributor) commented Aug 18, 2025

> Thanks @d-v-b . I tried converting a Zarr with 30 nested groups (each with a small array inside) and this took about 10 seconds total (with a LatencyStore set up as LatencyStore(local_store, get_latency=0.1, set_latency=0.1)). I've created a gist of this small test, if you wanted to try it out.

cool, thanks for running this experiment! IMO 10 seconds for converting 30 arrays and 30 groups is fine to start with. We can treat the performance tuning as an implementation detail for a later PR.

@K-Meech (Author) commented Sep 16, 2025

I've fixed the merge conflicts with this branch, and updated docstrings to reference the new zarr.codecs.numcodecs (rather than numcodecs.zarr3). I think I've addressed everyone's comments as well - @dstansby, see my replies to your comments above.
Do let me know if anything else needs changing / updating here - otherwise it might be ready to go in?

@d-v-b (Contributor) commented Sep 16, 2025

@zarr-developers/python-core-devs I would like to merge this soon (24 hours) so now is a great time to look over this PR!

@d-v-b enabled auto-merge (squash) on September 17, 2025, 08:13
@d-v-b (Contributor) commented Sep 17, 2025

it's going in once the tests are green. Thanks so much @K-Meech! If you don't mind, I'd like to make a post about your work over at image.sc, so people know about this functionality. This might be a good way to get some early feedback from users.

@d-v-b merged commit 3c883a3 into zarr-developers:main on Sep 17, 2025 (31 checks passed)
@lumberbot-app (bot) commented Sep 17, 2025

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict, so please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it:
     git checkout 3.1.x
     git pull
  2. Cherry-pick the first parent of this PR's merge commit on top of the older branch:
     git cherry-pick -x -m1 3c883a3c578b6e9fdb4d5fd7a160ce992aaec1b3
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
     git commit -am 'Backport PR #3257: Add CLI for converting v2 metadata to v3'
  4. Push to a named branch:
     git push YOURFORK 3.1.x:auto-backport-of-pr-3257-on-3.1.x
  5. Create a PR against branch 3.1.x. I would have named this PR:
     "Backport PR #3257 on branch 3.1.x (Add CLI for converting v2 metadata to v3)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@K-Meech (Author) commented Sep 17, 2025

Great - thanks @d-v-b ! I'll make some follow up issues for additional features from the reviews on this PR.

@joshmoore (Member) commented:
👏🏽 👏🏽 - The one thought I had while pondering this, @K-Meech, (and sorry if it was already discussed) is to what degree the CLI will follow the semver of the library itself. Not that anything needs doing but might be worth expectation management on the part of consumers.

@d-v-b (Contributor) commented Sep 17, 2025

> 👏🏽 👏🏽 - The one thought I had while pondering this, @K-Meech, (and sorry if it was already discussed) is to what degree the CLI will follow the semver of the library itself. Not that anything needs doing but might be worth expectation management on the part of consumers.

the library doesn't follow semver -- we are using effver

@joshmoore (Member) replied:
Fair enough, effver then. Can a user at this point reasonably expect the same level of stability from the CLI that they would for the API?

@d-v-b (Contributor) commented Sep 17, 2025

I don't see why we should version the CLI any differently, other than the fact that it's pretty new and so it might need some polishing as people use it. To give ourselves some flexibility, we could consider adding a disclaimer to the CLI stating that it's still experimental and might experience breaking changes in future versions of zarr python.

@d-v-b (Contributor) commented Sep 17, 2025

see #3465
