-
-
Notifications
You must be signed in to change notification settings - Fork 364
Add CLI for converting v2 metadata to v3 #3257
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…sting a zarr version greater than 3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉 I think this is good now - I have one question about use of the logger, but it's not a blocker. I'll let this sit for a week or so because it's complicated, and would benefit from a second reviewer. If no-one reviews by then, I'll merge.
|
||
app = typer.Typer() | ||
|
||
logger = logging.getLogger(__name__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it deliberate that this is a new logger, instead of importing the logger object from zarr
? I don't tihnk it matters too much, but re-using zarr._logger
might save some code duplication because you could remove functions from this file for configuring the logger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this was intentional - most of the other libraries I've seen have their root logger (i.e. zarr._logger
), then each individual file still calls logger = logging.getLogger(__name__)
to create a child of that logger.
I could use zarr._logger
here, but there are already other files in zarr
that use logger = logging.getLogger(__name__)
- so I thought this would be more consistent. Also, I'm not sure changing this would remove code in this file? _set_logging_level
uses verbose
to determine if the logging level should be INFO or WARNING, then calls zarr.set_log_level
and zarr.set_format
to set this on zarr._logger
. So it's already using settings on the overall logger?
return False | ||
|
||
|
||
def _convert_array_metadata(metadata_v2: ArrayV2Metadata) -> ArrayV3Metadata: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
later on we can put this in zarr.core.metadata
, e.g. in a conversion
module
a high-level comment about the conversion strategy. this is not a blocker for this PR, but the current conversion strategy is serial over arrays and groups. This means converting an array / group waits until the previous one is converted, and it increases the likelihood of conversion breaking in the middle, leaving a hybrid v2 / v3 hierarchy. We could also do conversion with a concurrent API based on concurrently reading the entire input hierarchy with This could be considered merely a performance concern, but I suspect reading and writing the hierarchy concurrently might be the difference between success and failure for this CLI when people use it on high-latency storage backends. @K-Meech if you want to experience how your CLI tool handles IO latency, we have a store wrapper class that adds latency to any other store class. Try running your conversion on a large zarr hierarchy with 100ms of latency added to |
Thanks @d-v-b . I tried converting a Zarr with 30 nested groups (each with a small array inside) and this took about 10 seconds total (with a Happy to look at changing this to use concurrent writes with I've been keeping a list of potential follow up changes mentioned on this PR - happy to convert these into issues after merge:
|
cool, thanks for running this experiment! IMO 10 seconds for converting 30 arrays and 30 groups is fine to start with. We can treat the performance tuning as an implementation detail for a later PR. |
I've fixed the merge conflicts with this branch, and updated docstrings to reference the new |
@zarr-developers/python-core-devs I would like to merge this soon (24 hours) so now is a great time to look over this PR! |
it's going in once the tests are green. Thanks so much @K-Meech! If you don't mind, I'd like to make a post about your work over at image.sc, so people know about this functionality. This might be a good way to get some early feedback from users. |
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the If these instructions are inaccurate, feel free to suggest an improvement. |
Great - thanks @d-v-b ! I'll make some follow up issues for additional features from the reviews on this PR. |
👏🏽 👏🏽 - The one thought I had while pondering this, @K-Meech, (and sorry if it was already discussed) is to what degree the CLI will follow the semver of the library itself. Not that anything needs doing but might be worth expectation management on the part of consumers. |
the library doesn't follow semver -- we are using effver |
Fair enough, effver then. Can a user at this point reasonably expect the same level of stability from the CLI that they would for the API? |
I don't see why we should version the CLI any differently, other than the fact that it's pretty new and so it might need some polishing as people use it. To give ourselves some flexibility, we could consider adding a disclaimer to the CLI stating that it's still experimental and might experience breaking changes in future versions of zarr python. |
see #3465 |
For #1798
Adds a CLI using
typer
to convert v2 metadata (.zarray
/.zattrs
...) to v3 metadatazarr.json
.To test, you will need to install the new optional cli dependency e.g.
pip install -e ".[remote,cli]"
This should make the
zarr-converter
command available e.g. try:convert
addszarr.json
files to every group / array, leaving the v2 metadata as-is. A zarr with both sets of metadata can still be opened withzarr.open
, but will give a UserWarning:Both zarr.json (Zarr format 3) and .zarray (Zarr format 2) metadata objects exist... Zarr v3 will be used.
. This can be avoided by passingzarr_format=3
tozarr.open
, or by using theclear
command to remove the v2 metadata.clear
can also remove v3 metadata. This is useful if the conversion fails part way through e.g. if one of the arrays uses a codec with no v3 equivalent.All code for the cli is in
src/zarr/core/metadata/converter/cli.py
, with the actual conversion functions insrc/zarr/core/metadata/converter/converter_v2_v3.py
. These functions can be called directly, for those who don't want to use the CLI (although currently they are part of/core
which is considered private API, so it may be best to move them elsewhere in the package).Some points to consider:
set_path
fromtest_dtype_registry.py
andtest_codec_entrypoints.py
, as they were causing the CLI tests to fail if they were run after. This seems to be due to thelazy_load_list
of the numcodecs codecs registries being cleared, meaning they were no longer available in my code which finds thenumcodecs.zarr3
equivalent of a numcodecs codec.TODO:
docs/user-guide/*.rst
changes/