Commit 63712de

normanrz, philippotto, and markbader authored
Add a CLI for copy_dataset (#1259)
* make Zarr3 default DataFormat
* compress=True
* remove deprecations
* test fixes
* down to 16
* changelog
* test fixes
* fix /test_dataset_add_remote_mag_and_layer.py
* stuff
* ci
* ci
* error on fork
* ci
* ci
* less alignment checks
* allow_unaligned
* update ci.yml
* ci testing
* ci
* sequential tests
* ci
* ci
* ci
* ci
* parameterize python for kubernetes dockerfile
* test
* change defaults
* mirrored test images
* mp logging
* mp debugging
* debug
* debug
* debug
* debugging
* pyproject.toml
* py3.12
* debugging
* wip
* all python versions
* revert debug changes in cluster_tools
* fixes
* larger ci runner
* default ci runner
* rm pytest-timeout
* test
* Revert "rm pytest-timeout" (reverts commit 6bc2185)
* Revert "test" (reverts commit 8d57971)
* ci
* ci
* ci
* ci
* ci
* ci
* properly implement SequentialExecutor
* ci
* changelog
* allow_unaligned wip
* ci
* wip
* fix tests
* fix test
* examples
* longer sleep in slurm test
* format
* longer sleep in slurm test
* Apply suggestions from code review (co-authored by Philipp Otto)
* add methods
* more robust patch
* comment
* derive_nd_bounding_box_from_shape
* refactor Dataset.open to not require an additional IO op
* cassettes
* format
* lint
* docs
* deprecate chunks_per_shard
* deprecate dtype_per_layer
* type
* fixes for add_layer_from_images
* Vec3Int.from_vec_or_int
* export defaults
* write_layer args
* MagLike + mag in write_layer
* doc
* docstring
* adds copy-dataset CLI tool
* test
* change default data_format in cli
* docs
* changelog
* changelog
* changelog
* better progress descriptor
* remove leading slash
* fix TensorStoreArray.open
* Update webknossos/webknossos/cli/copy_dataset.py (co-authored by Mark Bader)
* Update webknossos/Changelog.md (co-authored by Mark Bader)
* docstring
* more kwargs
* tests
* add exists-ok flag
* -vv

Co-authored-by: Philipp Otto <[email protected]>
Co-authored-by: Mark Bader <[email protected]>
1 parent 74c0288 commit 63712de

File tree

10 files changed: +235 −31 lines


docs/src/cli/index.md

Lines changed: 7 additions & 1 deletion
````diff
@@ -11,6 +11,7 @@ The WEBKNOSSOS CLI offers many useful commands to work with WEBKNOSSOS datasets:
 - `webknossos convert-knossos`: Converts a KNOSSOS dataset to a WEBKNOSSOS dataset
 - `webknossos convert-raw`: Converts a RAW image file to a WEBKNOSSOS dataset
 - `webknossos convert-zarr`: Converts a Zarr dataset to a WEBKNOSSOS dataset
+- `webknossos copy-dataset`: Makes a copy of a WEBKNOSSOS dataset
 - `webknossos download`: Download a dataset from a WEBKNOSSOS server
 - `webknossos downsample`: Downsample a WEBKNOSSOS dataset
 - `webknossos merge-fallback`: Merge a volume layer of a WEBKNOSSOS dataset with an annotation
@@ -61,7 +62,12 @@ webknossos convert-knossos --layer-name color --voxel-size 11.24,11.24,25 data/s
 # Convert RAW file to wkw file
 webknossos convert-raw --layer-name color --voxel-size 10,10,30 --dtype uint8 --shape 2048,2048,1024 data/source/raw_file.raw data/target
 
-
+# Copy a local dataset to a remote storage
+AWS_ACCESS_KEY_ID=XXX AWS_SECRET_ACCESS_KEY=XXX \
+webknossos copy-dataset \
+  --data-format zarr3 \
+  --jobs 4 \
+  data/source s3://webknossos-bucket/target
 ```
 
 ### Parallelization
````
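For reference, the same copy can be expressed through the Python API that this commit extends. A minimal sketch, reusing the placeholder paths from the docs snippet above:

```python
from webknossos import DataFormat, Dataset

# Credentials are picked up from the AWS_ACCESS_KEY_ID /
# AWS_SECRET_ACCESS_KEY environment variables, as in the CLI example.
ds = Dataset.open("data/source")
ds.copy_dataset(
    "s3://webknossos-bucket/target",
    data_format=DataFormat.Zarr3,
)
```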

webknossos/Changelog.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -154,6 +154,7 @@ For upgrade instructions, please check the respective _Breaking Changes_ section
 
 
 ### Added
+- Added the `webknossos copy-dataset` CLI command. [#1259](https://github.com/scalableminds/webknossos-libs/pull/1259)
 - Added `Dataset.write_layer` method for writing entire layers in one go. [#1242](https://github.com/scalableminds/webknossos-libs/pull/1242)
 
 ### Changed
```

webknossos/test.sh

Lines changed: 1 addition & 1 deletion
```diff
@@ -10,7 +10,7 @@ export MULTIPROCESSING_DEFAULT_START_METHOD=forkserver
 # this will ensure that the current directory is added to sys.path
 # (which is standard python behavior). This is necessary so that the imports
 # refer to the checked out (and potentially modified) code.
-PYTEST="uv run --all-extras --python ${PYTHON_VERSION:-3.13} -m pytest --suppress-no-test-exit-code"
+PYTEST="uv run --all-extras --python ${PYTHON_VERSION:-3.13} -m pytest --suppress-no-test-exit-code -vv"
 
 # Within the tests folder is a binaryData folder of the local running webknossos instance. This folder is cleaned up before running the tests.
 # This find command gets all directories in binaryData/Organization_X except for the l4_sample and e2006_knossos directories and deletes them.
```

webknossos/tests/dataset/test_dataset.py

Lines changed: 10 additions & 0 deletions
```diff
@@ -3060,6 +3060,16 @@ def test_wkw_copy_to_remote_dataset() -> None:
     )
 
 
+def test_copy_dataset_exists_ok() -> None:
+    ds_path = prepare_dataset_path(DataFormat.WKW, REMOTE_TESTOUTPUT_DIR, "copied")
+    wkw_ds = Dataset.open(TESTDATA_DIR / "simple_wkw_dataset")
+
+    wkw_ds.copy_dataset(ds_path, data_format=DataFormat.Zarr3)
+    with pytest.raises(RuntimeError):
+        wkw_ds.copy_dataset(ds_path, data_format=DataFormat.Zarr3)
+    wkw_ds.copy_dataset(ds_path, data_format=DataFormat.Zarr3, exists_ok=True)
+
+
 @pytest.mark.use_proxay
 def test_remote_dataset_access_metadata() -> None:
     ds = Dataset.open_remote("l4_sample", "Organization_X")
```

webknossos/tests/test_cli.py

Lines changed: 38 additions & 1 deletion
```diff
@@ -23,7 +23,7 @@
     TESTDATA_DIR,
     use_minio,
 )
-from webknossos import BoundingBox, DataFormat, Dataset
+from webknossos import BoundingBox, DataFormat, Dataset, Mag
 from webknossos.cli.export_as_tiff import _make_tiff_name
 from webknossos.cli.main import app
 from webknossos.dataset.dataset import PROPERTIES_FILE_NAME
@@ -134,6 +134,43 @@ def test_check_equality() -> None:
     )
 
 
+def test_copy_dataset(tmp_path: Path) -> None:
+    """Tests the functionality of the copy_dataset subcommand."""
+
+    result_without_args = runner.invoke(app, ["copy-dataset"])
+    assert result_without_args.exit_code == 2
+
+    result = runner.invoke(
+        app,
+        [
+            "copy-dataset",
+            str(TESTDATA_DIR / "simple_wkw_dataset"),
+            str(tmp_path / "simple_wkw_dataset"),
+            "--data-format",
+            "zarr3",
+        ],
+    )
+    assert result.exit_code == 0
+    # verify that the data was copied correctly
+    target_ds = Dataset.open(tmp_path / "simple_wkw_dataset")
+    target_layer = target_ds.get_layer("color")
+    assert target_layer.data_format == DataFormat.Zarr3
+    assert Mag(1) in target_layer.mags
+
+    result = runner.invoke(
+        app,
+        [
+            "copy-dataset",
+            str(TESTDATA_DIR / "simple_wkw_dataset"),
+            str(tmp_path / "simple_wkw_dataset"),
+            "--data-format",
+            "zarr3",
+            "--exists-ok",
+        ],
+    )
+    assert result.exit_code == 0
+
+
 def test_check_not_equal() -> None:
     """Tests that the check_equality subcommand detects differing datasets."""
```

webknossos/webknossos/cli/copy_dataset.py

Lines changed: 114 additions & 0 deletions

```diff
@@ -0,0 +1,114 @@
+"""This module copies a WEBKNOSSOS dataset."""
+
+import logging
+from argparse import Namespace
+from multiprocessing import cpu_count
+from typing import Any, Optional
+
+import typer
+from typing_extensions import Annotated
+
+from ..dataset import DataFormat, Dataset
+from ..geometry import Vec3Int
+from ..utils import get_executor_for_args
+from ._utils import DistributionStrategy, parse_path, parse_vec3int
+
+logger = logging.getLogger(__name__)
+
+
+def main(
+    *,
+    source: Annotated[
+        Any,
+        typer.Argument(
+            help="Path to the source WEBKNOSSOS dataset.",
+            show_default=False,
+            parser=parse_path,
+        ),
+    ],
+    target: Annotated[
+        Any,
+        typer.Argument(
+            help="Path to the target WEBKNOSSOS dataset.",
+            show_default=False,
+            parser=parse_path,
+        ),
+    ],
+    data_format: Annotated[
+        Optional[DataFormat],
+        typer.Option(
+            help="Data format to store the target dataset in.",
+        ),
+    ] = None,
+    chunk_shape: Annotated[
+        Optional[Vec3Int],
+        typer.Option(
+            help="Number of voxels to be stored as a chunk in the target dataset "
+            "(e.g. `32` or `32,32,32`).",
+            parser=parse_vec3int,
+            metavar="Vec3Int",
+        ),
+    ] = None,
+    shard_shape: Annotated[
+        Optional[Vec3Int],
+        typer.Option(
+            help="Number of voxels to be stored as a shard in the target dataset "
+            "(e.g. `1024` or `1024,1024,1024`).",
+            parser=parse_vec3int,
+            metavar="Vec3Int",
+        ),
+    ] = None,
+    exists_ok: Annotated[
+        bool, typer.Option(help="Whether it should overwrite an existing dataset.")
+    ] = False,
+    jobs: Annotated[
+        int,
+        typer.Option(
+            help="Number of processes to be spawned.",
+            rich_help_panel="Executor options",
+        ),
+    ] = cpu_count(),
+    distribution_strategy: Annotated[
+        DistributionStrategy,
+        typer.Option(
+            help="Strategy to distribute the task across CPUs or nodes.",
+            rich_help_panel="Executor options",
+        ),
+    ] = DistributionStrategy.MULTIPROCESSING,
+    job_resources: Annotated[
+        Optional[str],
+        typer.Option(
+            help="Necessary when using slurm as distribution strategy. Should be a JSON string "
+            '(e.g., --job-resources=\'{"mem": "10M"}\')',
+            rich_help_panel="Executor options",
+        ),
+    ] = None,
+) -> None:
+    """Make a copy of a WEBKNOSSOS dataset.
+
+    Remote paths (i.e. https and s3) are also allowed.
+    Use the following environment variables to configure remote paths:
+    - HTTP_BASIC_USER
+    - HTTP_BASIC_PASSWORD
+    - S3_ENDPOINT_URL
+    - AWS_ACCESS_KEY_ID
+    - AWS_SECRET_ACCESS_KEY
+    """
+
+    executor_args = Namespace(
+        jobs=jobs,
+        distribution_strategy=distribution_strategy.value,
+        job_resources=job_resources,
+    )
+
+    source_dataset = Dataset.open(source)
+
+    with get_executor_for_args(args=executor_args) as executor:
+        source_dataset.copy_dataset(
+            target,
+            chunk_shape=chunk_shape,
+            shard_shape=shard_shape,
+            data_format=data_format,
+            exists_ok=exists_ok,
+            executor=executor,
+        )
```
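A minimal sketch of exercising the new command through Typer's test runner, mirroring the pattern used in `webknossos/tests/test_cli.py` (the source and target paths here are placeholders):

```python
from typer.testing import CliRunner

from webknossos.cli.main import app

runner = CliRunner()
result = runner.invoke(
    app,
    ["copy-dataset", "data/source", "data/target", "--data-format", "zarr3"],
)
assert result.exit_code == 0
```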

webknossos/webknossos/cli/main.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -9,6 +9,7 @@
     convert_knossos,
     convert_raw,
     convert_zarr,
+    copy_dataset,
     download,
     downsample,
     export_as_tiff,
@@ -25,6 +26,7 @@
 app.command("convert-knossos")(convert_knossos.main)
 app.command("convert-raw")(convert_raw.main)
 app.command("convert-zarr")(convert_zarr.main)
+app.command("copy-dataset")(copy_dataset.main)
 app.command("download")(download.main)
 app.command("downsample")(downsample.main)
 app.command("export-wkw-as-tiff")(export_as_tiff.main)
```

webknossos/webknossos/dataset/_array.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -430,7 +430,7 @@ def _make_kvstore(path: Path) -> Union[str, Dict[str, Union[str, List[str]]]]:
     parsed_url = urlparse(str(path))
     kvstore_spec: dict[str, Any] = {
         "driver": "s3",
-        "path": parsed_url.path,
+        "path": parsed_url.path.lstrip("/"),
         "bucket": parsed_url.netloc,
     }
     if endpoint_url := path.storage_options.get("client_kwargs", {}).get(
```
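This one-line change (the commit message's "remove leading slash" / "fix TensorStoreArray.open") is needed because `urlparse` keeps the leading slash of the object key, while the s3 kvstore apparently expects a bucket-relative key. A small illustration, not part of the commit:

```python
from urllib.parse import urlparse

parsed = urlparse("s3://webknossos-bucket/target")
assert parsed.netloc == "webknossos-bucket"  # becomes the "bucket" entry
assert parsed.path == "/target"              # note the leading slash
assert parsed.path.lstrip("/") == "target"   # bucket-relative key for the kvstore
```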

webknossos/webknossos/dataset/dataset.py

Lines changed: 37 additions & 16 deletions
```diff
@@ -2171,6 +2171,7 @@ def add_copy_layer(
         chunks_per_shard: Optional[Union[Vec3IntLike, int]] = None,
         data_format: Optional[Union[str, DataFormat]] = None,
         compress: Optional[bool] = None,
+        exists_ok: bool = False,
         executor: Optional[Executor] = None,
     ) -> Layer:
         """Copy layer from another dataset to this one.
@@ -2186,6 +2187,7 @@ def add_copy_layer(
             chunks_per_shard: Deprecated, use shard_shape. Optional number of chunks per shard
             data_format: Optional format to store copied data ('wkw', 'zarr', etc.)
             compress: Optional whether to compress copied data
+            exists_ok: Whether to overwrite existing layers
             executor: Optional executor for parallel copying
 
         Returns:
@@ -2217,30 +2219,46 @@ def add_copy_layer(
         if new_layer_name is None:
             new_layer_name = foreign_layer.name
 
-        if new_layer_name in self.layers.keys():
-            raise IndexError(
-                f"Cannot copy {foreign_layer}. This dataset already has a layer called {new_layer_name}."
+        if exists_ok:
+            layer = self.get_or_add_layer(
+                new_layer_name,
+                category=foreign_layer.category,
+                dtype_per_channel=foreign_layer.dtype_per_channel,
+                num_channels=foreign_layer.num_channels,
+                data_format=data_format or foreign_layer.data_format,
+                largest_segment_id=foreign_layer._get_largest_segment_id_maybe(),
+                bounding_box=foreign_layer.bounding_box,
+            )
+        else:
+            if new_layer_name in self.layers.keys():
+                raise IndexError(
+                    f"Cannot copy {foreign_layer}. This dataset already has a layer called {new_layer_name}."
+                )
+            layer = self.add_layer(
+                new_layer_name,
+                category=foreign_layer.category,
+                dtype_per_channel=foreign_layer.dtype_per_channel,
+                num_channels=foreign_layer.num_channels,
+                data_format=data_format or foreign_layer.data_format,
+                largest_segment_id=foreign_layer._get_largest_segment_id_maybe(),
+                bounding_box=foreign_layer.bounding_box,
             )
-
-        layer = self.add_layer(
-            new_layer_name,
-            category=foreign_layer.category,
-            dtype_per_channel=foreign_layer.dtype_per_channel,
-            num_channels=foreign_layer.num_channels,
-            data_format=data_format or foreign_layer.data_format,
-            largest_segment_id=foreign_layer._get_largest_segment_id_maybe(),
-        )
-        layer.bounding_box = foreign_layer.bounding_box
 
         for mag_view in foreign_layer.mags.values():
+            progress_desc = (
+                f"Copying {mag_view.layer.name}/{mag_view.mag.to_layer_name()}"
+            )
+
             layer.add_copy_mag(
                 mag_view,
                 extend_layer_bounding_box=False,
                 chunk_shape=chunk_shape,
                 shard_shape=shard_shape,
                 chunks_per_shard=chunks_per_shard,
                 compress=compress,
+                exists_ok=exists_ok,
                 executor=executor,
+                progress_desc=progress_desc,
             )
 
         return layer
@@ -2458,6 +2476,7 @@ def copy_dataset(
         chunks_per_shard: Optional[Union[Vec3IntLike, int]] = None,
         data_format: Optional[Union[str, DataFormat]] = None,
         compress: Optional[bool] = None,
+        exists_ok: bool = False,
         executor: Optional[Executor] = None,
         voxel_size_with_unit: Optional[VoxelSize] = None,
     ) -> "Dataset":
@@ -2473,6 +2492,7 @@ def copy_dataset(
             chunks_per_shard: Deprecated, use shard_shape. Optional number of chunks per shard
             data_format: Optional format to store data ('wkw', 'zarr', 'zarr3')
             compress: Optional whether to compress data
+            exists_ok: Whether to overwrite existing datasets and layers
             executor: Optional executor for parallel copying
             voxel_size_with_unit: Optional voxel size specification with units
 
@@ -2507,13 +2527,13 @@ def copy_dataset(
         if data_format == DataFormat.WKW:
             assert is_fs_path(
                 new_dataset_path
-            ), "Cannot create WKW-based remote datasets. Use `data_format='zarr'` instead."
+            ), "Cannot create WKW-based remote datasets. Use `data_format='zarr3'` instead."
         if data_format is None and any(
             layer.data_format == DataFormat.WKW for layer in self.layers.values()
         ):
             assert is_fs_path(
                 new_dataset_path
-            ), "Cannot create WKW layers in remote datasets. Use explicit `data_format='zarr'`."
+            ), "Cannot create WKW layers in remote datasets. Use explicit `data_format='zarr3'`."
 
         if voxel_size_with_unit is None:
             if voxel_size is None:
@@ -2523,7 +2543,7 @@ def copy_dataset(
         new_ds = Dataset(
             new_dataset_path,
             voxel_size_with_unit=voxel_size_with_unit,
-            exist_ok=False,
+            exist_ok=exists_ok,
         )
 
         with get_executor_for_args(None, executor) as executor:
@@ -2535,6 +2555,7 @@ def copy_dataset(
                 chunks_per_shard=chunks_per_shard,
                 data_format=data_format,
                 compress=compress,
+                exists_ok=exists_ok,
                 executor=executor,
             )
         new_ds._export_as_json()
```
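Taken together with `test_copy_dataset_exists_ok` above, the new `exists_ok` flag behaves roughly like this at the Python level (a sketch with placeholder paths):

```python
from webknossos import DataFormat, Dataset

ds = Dataset.open("data/source")
ds.copy_dataset("data/target", data_format=DataFormat.Zarr3)

# A second copy to the same target raises (RuntimeError in the test above),
# because the target dataset already exists:
# ds.copy_dataset("data/target", data_format=DataFormat.Zarr3)

# With exists_ok=True the existing dataset and layers are reused:
# exist_ok is forwarded to Dataset(...) and get_or_add_layer is used
# instead of add_layer.
ds.copy_dataset("data/target", data_format=DataFormat.Zarr3, exists_ok=True)
```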
