Commit d157465

Function to rechunk single variables or batch variables from existing MDIO (#368)
* Improve MDIO API with accessor modes and rechunk functionality: add support for an additional file operation mode ('w' for rechunking) and a new rechunking feature. The `copy_mdio` function now accepts strongly typed arguments, and `rechunk` functions efficiently resize chunks for large datasets, with progress tracking and improved error handling.
* Add usage examples to rechunk functions: expand the docstrings in `convenience.py` with examples showing how to use the `rechunk` and `rechunk_batch` functions, for clarity and ease of use.
* Add convenience functions section to docs: a new "Convenience Functions" section in the reference documentation covers the `mdio.api.convenience` module, excluding `create_rechunk_plan` and `write_rechunked_values`.
* Add optional compressor parameter to rechunk functions: the rechunk operations now accept an optional `compressor` parameter so users can specify a custom compression codec. The default, Blosc('zstd'), is used when none is provided, preserving backward compatibility.
* Add rechunk function TODO comment: insert a TODO in `rechunk` for writing tests, referencing the relevant issue.
* Refactor buffer size and improve documentation: remove an extraneous newline and introduce a `MAX_BUFFER` constant for the chunking buffer size. The `create_rechunk_plan` docstring now explains the buffer size and how to adjust it by changing `MAX_BUFFER`.
* Add rechunking optimization demo: a new `rechunking.ipynb` notebook demonstrates how to optimize access patterns using rechunking and lossy compression, with detailed steps and code snippets for creating optimized, compressed copies for different access patterns to improve read performance.
* Refactor notebook: reset execution counts and tidy metadata, giving the notebook a fresh execution state and a cleaner structure for other developers to follow.
* Update rechunking notebook with minor tweaks: correct a reference from 'notebook' to 'page', add a parenthetical clarification to a section heading, and refresh the performance benchmark outputs.
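The core of the rechunk feature is a buffered copy loop: the plan iterates fixed-size windows over the source and copies each window into every destination array. The real iterator is `mdio.core.indexing.ChunkIterator`, which wraps a Zarr array; the sketch below is a simplified stdlib-only stand-in that takes a shape and a window size directly.

```python
from itertools import product


def chunk_slices(shape, chunks):
    """Yield tuple-of-slice windows covering `shape` in `chunks`-sized blocks.

    Simplified stand-in for mdio.core.indexing.ChunkIterator: the rechunk
    plan walks buffer-sized windows (MAX_BUFFER elements per dimension) and
    copies each window from the source into every destination array.
    """
    # Number of blocks along each dimension (ceiling division).
    counts = [-(-size // chunk) for size, chunk in zip(shape, chunks)]
    for index in product(*(range(n) for n in counts)):
        yield tuple(
            slice(i * chunk, min((i + 1) * chunk, size))
            for i, chunk, size in zip(index, chunks, shape)
        )


# A 1000^3 volume with 512-per-dimension windows needs 2 blocks per axis.
windows = list(chunk_slices((1000, 1000, 1000), (512, 512, 512)))
# -> 8 windows; edge windows are clamped to the array bounds
```

Each yielded tuple of slices can index both the source and destination arrays directly, which is why one read pass can feed several rechunked outputs.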
1 parent 64b9ded commit d157465

File tree

5 files changed: +742 −20 lines

docs/index.md

Lines changed: 1 addition & 0 deletions

@@ -18,6 +18,7 @@ maxdepth: 1
 installation
 notebooks/quickstart
 notebooks/compression
+notebooks/rechunking
 usage
 reference
 contributing

docs/notebooks/rechunking.ipynb

Lines changed: 514 additions & 0 deletions
Large diffs are not rendered by default.
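The notebook's point is that chunk shape determines how many chunks a given read touches. A rough stdlib-only intuition (the shapes, chunk sizes, and the helper below are illustrative assumptions, not notebook code):

```python
from math import ceil


def chunks_touched(shape, chunks, fixed_dims):
    """Count how many chunks a slice-selection touches.

    `fixed_dims` is the set of dimension indices pinned to a single
    coordinate; the remaining dimensions are read in full. Hypothetical
    helper for intuition only.
    """
    total = 1
    for dim, (size, chunk) in enumerate(zip(shape, chunks)):
        if dim in fixed_dims:
            total *= 1  # a single coordinate hits exactly one chunk
        else:
            total *= ceil(size / chunk)  # full reads touch every chunk row
    return total


shape = (512, 512, 1024)          # inline, crossline, depth (illustrative)
balanced_chunks = (64, 64, 64)    # balanced "012"-style chunks
inline_chunks = (1, 1024, 1024)   # rechunked for fast inline reads

# Reading one inline slice from the balanced layout touches many chunks;
# from the inline-optimized layout it touches exactly one.
balanced = chunks_touched(shape, balanced_chunks, {0})   # 8 * 16 = 128
optimized = chunks_touched(shape, inline_chunks, {0})    # 1
```

This is why the notebook creates several rechunked copies: each copy makes one access pattern cheap at the cost of extra storage, which the lossy compression then offsets.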

docs/reference.md

Lines changed: 8 additions & 0 deletions

@@ -54,3 +54,11 @@ and
 .. automodule:: mdio.core.serialization
    :members:
 ```
+
+## Convenience Functions
+
+```{eval-rst}
+.. automodule:: mdio.api.convenience
+   :members:
+   :exclude-members: create_rechunk_plan, write_rechunked_values
+```

src/mdio/api/accessor.py

Lines changed: 18 additions & 14 deletions

@@ -1,6 +1,5 @@
 """MDIO accessor APIs."""

-
 from __future__ import annotations

 import dask.array as da
@@ -61,7 +60,9 @@ class MDIOAccessor:
         mdio_path_or_buffer: Store URL for MDIO file. This can be either on
             a local disk, or a cloud object store.
         mode: Read or read/write mode. The file must exist. Options are
-            in {'r', 'r+'}.
+            in {'r', 'r+', 'w'}. 'r' is read only, 'r+' is append mode where
+            only existing arrays can be modified, 'w' is similar to 'r+'
+            but rechunking or other file-wide operations are allowed.
         access_pattern: Chunk access pattern, optional. Default is "012".
             Examples: '012', '01', '01234'.
         storage_options: Options for the storage backend. By default,
@@ -133,9 +134,9 @@ def __init__(
         mdio_path_or_buffer: str,
         mode: str,
         access_pattern: str,
-        storage_options: dict,
+        storage_options: dict | None,
         return_metadata: bool,
-        new_chunks: tuple[int, ...],
+        new_chunks: tuple[int, ...] | None,
         backend: str,
         memory_cache_size: int,
         disk_cache: bool,
@@ -191,10 +192,13 @@ def _validate_store(self, storage_options):
     def _connect(self):
         """Open the zarr root."""
         try:
-            self.root = zarr.open_consolidated(
-                store=self.store,
-                mode=self.mode,
-            )
+            if self.mode in {"r", "r+"}:
+                self.root = zarr.open_consolidated(store=self.store, mode=self.mode)
+            elif self.mode == "w":
+                self.root = zarr.open(store=self.store, mode="r+")
+            else:
+                msg = f"Invalid mode: {self.mode}"
+                raise ValueError(msg)
         except KeyError as e:
             msg = (
                 f"MDIO file not found or corrupt at {self.store.path}. "
@@ -377,7 +381,7 @@ def _data_group(self) -> zarr.Group:
     def __getitem__(self, item: int | tuple) -> npt.ArrayLike | da.Array | tuple:
         """Data getter."""
         if self._return_metadata is True:
-            if isinstance(item, int) or isinstance(item, slice):
+            if isinstance(item, (int, slice)):
                 meta_index = item
             elif len(item) == len(self.shape):
                 meta_index = tuple(dim for dim in item[:-1])
@@ -400,7 +404,7 @@ def coord_to_index(
         self,
         *args,
         dimensions: str | list[str] | None = None,
-    ) -> tuple[NDArray[np.int], ...]:
+    ) -> tuple[NDArray[int], ...]:
         """Convert dimension coordinate to zero-based index.

         The coordinate labels of the array dimensions are converted to
@@ -576,8 +580,8 @@ def __init__(
         return_metadata: bool = False,
         new_chunks: tuple[int, ...] = None,
         backend: str = "zarr",
-        memory_cache_size=0,
-        disk_cache=False,
+        memory_cache_size: int = 0,
+        disk_cache: bool = False,
     ):  # TODO: Disabled all caching by default, sometimes causes performance issues
         """Initialize super class with `r` permission."""
         super().__init__(
@@ -632,8 +636,8 @@ def __init__(
         return_metadata: bool = False,
         new_chunks: tuple[int, ...] = None,
         backend: str = "zarr",
-        memory_cache_size=0,
-        disk_cache=False,
+        memory_cache_size: int = 0,
+        disk_cache: bool = False,
     ):  # TODO: Disabled all caching by default, sometimes causes performance issues
         """Initialize super class with `r+` permission."""
         super().__init__(
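The new `_connect` dispatch can be summarized as a small pure function (a sketch; the name `resolve_open_mode` is hypothetical, and the real method calls `zarr.open_consolidated` / `zarr.open` instead of returning a tuple):

```python
def resolve_open_mode(mode: str) -> tuple[bool, str]:
    """Return (use_consolidated_metadata, underlying_zarr_mode) for a mode.

    Mirrors the mode dispatch added to MDIOAccessor._connect in this commit;
    hypothetical helper for illustration only.
    """
    if mode in {"r", "r+"}:
        # Plain reads/appends open through consolidated metadata for speed.
        return True, mode
    if mode == "w":
        # 'w' (rechunk mode) skips consolidated metadata because new arrays
        # will be added; the underlying store is still opened as 'r+'.
        return False, "r+"
    msg = f"Invalid mode: {mode}"
    raise ValueError(msg)
```

The key design point is that `'w'` cannot use `open_consolidated`: consolidated metadata is a frozen snapshot, and rechunking adds arrays, so the store is opened directly and the metadata is re-consolidated after the copy finishes.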

src/mdio/api/convenience.py

Lines changed: 201 additions & 6 deletions

@@ -1,18 +1,31 @@
 """Convenience APIs for working with MDIO files."""

-
 from __future__ import annotations

+from typing import TYPE_CHECKING
+
 import zarr
+from tqdm.auto import tqdm
+from zarr import Blosc

 from mdio.api.io_utils import process_url
+from mdio.core.indexing import ChunkIterator
+
+
+if TYPE_CHECKING:
+    from numcodecs.abc import Codec
+    from numpy.typing import NDArray
+    from zarr import Array

+    from mdio import MDIOAccessor
+    from mdio import MDIOReader

-def copy_mdio(
-    source,
+
+def copy_mdio(  # noqa: PLR0913
+    source: MDIOReader,
     dest_path_or_buffer: str,
-    excludes="",
-    includes="",
+    excludes: str = "",
+    includes: str = "",
     storage_options: dict | None = None,
     overwrite: bool = False,
 ) -> None:
@@ -61,7 +74,7 @@ def copy_mdio(
     )

     if len(excludes) > 0:
-        data_path = "/".join(["data", excludes])
+        data_path = f"data/{excludes}"
         source_array = source.root[data_path]
         dimension_separator = source_array._dimension_separator

@@ -72,3 +85,185 @@ def copy_mdio(
         overwrite=overwrite,
         dimension_separator=dimension_separator,
     )
+
+
+CREATE_KW = {
+    "dimension_separator": "/",
+    "write_empty_chunks": False,
+}
+MAX_BUFFER = 512
+
+
+def create_rechunk_plan(
+    source: MDIOAccessor,
+    chunks_list: list[tuple[int, ...]],
+    suffix_list: list[str],
+    compressor: Codec | None = None,
+    overwrite: bool = False,
+) -> tuple[list[Array], list[Array], NDArray, ChunkIterator]:
+    """Create rechunk plan based on source and user input.
+
+    It will buffer 512 x n-dimensions in memory. Approximately
+    128MB. However, if you need to adjust the buffer size, change
+    the `MAX_BUFFER` variable in this module.
+
+    Args:
+        source: MDIO accessor instance. Data will be copied from here.
+        chunks_list: List of tuples containing new chunk sizes.
+        suffix_list: List of suffixes to append to new chunk sizes.
+        compressor: Data compressor to use, optional. Default is Blosc('zstd').
+        overwrite: Overwrite destination or not.
+
+    Returns:
+        Tuple containing the rechunk plan variables and iterator.
+
+    Raises:
+        NameError: if trying to write to original data.
+    """
+    data_group = source._data_group
+    metadata_group = source._metadata_group
+
+    data_array = source._traces
+    metadata_array = source._headers
+    live_mask = source.live_mask[:]
+
+    metadata_arrs = []
+    data_arrs = []
+
+    header_compressor = Blosc("zstd")
+    trace_compressor = Blosc("zstd") if compressor is None else compressor
+
+    for chunks, suffix in zip(chunks_list, suffix_list):  # noqa: B905
+        norm_chunks = [
+            min(chunk, size) for chunk, size in zip(chunks, source.shape)  # noqa: B905
+        ]
+
+        if suffix == source.access_pattern:
+            msg = f"Can't write over source data with suffix {suffix}"
+            raise NameError(msg)
+
+        metadata_arrs.append(
+            metadata_group.zeros_like(
+                name=f"chunked_{suffix}_trace_headers",
+                data=metadata_array,
+                chunks=norm_chunks[:-1],
+                compressor=header_compressor,
+                overwrite=overwrite,
+                **CREATE_KW,
+            )
+        )
+
+        data_arrs.append(
+            data_group.zeros_like(
+                name=f"chunked_{suffix}",
+                data=data_array,
+                chunks=norm_chunks,
+                compressor=trace_compressor,
+                overwrite=overwrite,
+                **CREATE_KW,
+            )
+        )
+
+    n_dimension = len(data_array.shape)
+    dummy_array = zarr.empty_like(data_array, chunks=(MAX_BUFFER,) * n_dimension)
+    iterator = ChunkIterator(dummy_array)
+
+    return metadata_arrs, data_arrs, live_mask, iterator
+
+
+def write_rechunked_values(  # noqa: PLR0913
+    source: MDIOAccessor,
+    suffix_list: list[str],
+    metadata_arrs_out: list[Array],
+    data_arrs_out: list[Array],
+    live_mask: NDArray,
+    iterator: ChunkIterator,
+) -> None:
+    """Write rechunked values to the planned destination arrays.
+
+    Args:
+        source: MDIO accessor instance. Data will be copied from here.
+        suffix_list: List of suffixes to append to new chunk sizes.
+        metadata_arrs_out: List of new metadata Zarr arrays.
+        data_arrs_out: List of new data Zarr arrays.
+        live_mask: Live mask to apply during copies.
+        iterator: The chunk iterator to use.
+    """
+    suffix_names = ",".join(suffix_list)
+    for slice_ in tqdm(iterator, desc=f"Rechunking to {suffix_names}", unit="chunk"):
+        meta_slice = slice_[:-1]
+
+        if live_mask[meta_slice].sum() == 0:
+            continue
+
+        for array in metadata_arrs_out:
+            array[meta_slice] = source._headers[meta_slice]
+
+        for array in data_arrs_out:
+            array[slice_] = source._traces[slice_]
+
+    zarr.consolidate_metadata(source.store)
+
+
+def rechunk_batch(
+    source: MDIOAccessor,
+    chunks_list: list[tuple[int, ...]],
+    suffix_list: list[str],
+    compressor: Codec | None = None,
+    overwrite: bool = False,
+) -> None:
+    """Rechunk MDIO file to multiple variables, reading it once.
+
+    Args:
+        source: MDIO accessor instance. Data will be copied from here.
+        chunks_list: List of tuples containing new chunk sizes.
+        suffix_list: List of suffixes to append to new chunk sizes.
+        compressor: Data compressor to use, optional. Default is Blosc('zstd').
+        overwrite: Overwrite destination or not.
+
+    Examples:
+        To rechunk multiple variables we can do things like:
+
+        >>> accessor = MDIOAccessor(...)
+        >>> rechunk_batch(
+        >>>     accessor,
+        >>>     chunks_list=[(1, 1024, 1024), (1024, 1, 1024), (1024, 1024, 1)],
+        >>>     suffix_list=["fast_il", "fast_xl", "fast_z"],
+        >>> )
+    """
+    plan = create_rechunk_plan(
+        source,
+        chunks_list=chunks_list,
+        suffix_list=suffix_list,
+        compressor=compressor,
+        overwrite=overwrite,
+    )
+
+    write_rechunked_values(source, suffix_list, *plan)
+
+
+def rechunk(
+    source: MDIOAccessor,
+    chunks: tuple[int, ...],
+    suffix: str,
+    compressor: Codec | None = None,
+    overwrite: bool = False,
+) -> None:
+    """Rechunk MDIO file adding a new variable.
+
+    Args:
+        source: MDIO accessor instance. Data will be copied from here.
+        chunks: Tuple containing chunk sizes for new rechunked array.
+        suffix: Suffix to append to new rechunked array.
+        compressor: Data compressor to use, optional. Default is Blosc('zstd').
+        overwrite: Overwrite destination or not.
+
+    Examples:
+        To rechunk a single variable we can do this:
+
+        >>> accessor = MDIOAccessor(...)
+        >>> rechunk(accessor, (1, 1024, 1024), suffix="fast_il")
+    """
+    # TODO(Anyone): Write tests for rechunking functions
+    # https://github.com/TGSAI/mdio-python/issues/369
    rechunk_batch(source, [chunks], [suffix], compressor, overwrite)
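Two validation rules inside `create_rechunk_plan` are worth isolating: requested chunks larger than a dimension are clamped to the array shape, and reusing the source's access-pattern suffix is rejected so the original data is never overwritten. A stdlib-only restatement (the function name `normalize_chunks` is hypothetical):

```python
def normalize_chunks(chunks, shape, suffix, access_pattern):
    """Clamp requested chunks to the array shape and guard the suffix.

    Pure-Python restatement of the validation in create_rechunk_plan:
    oversized chunks are clamped per dimension, and a suffix equal to the
    source access pattern raises NameError, matching the original code.
    """
    if suffix == access_pattern:
        msg = f"Can't write over source data with suffix {suffix}"
        raise NameError(msg)
    return [min(chunk, size) for chunk, size in zip(chunks, shape)]


# Requesting (1024, 1, 1024) chunks on a (512, 512, 1024) volume clamps
# the oversized first dimension down to the array size.
norm = normalize_chunks((1024, 1, 1024), (512, 512, 1024), "fast_xl", "012")
# -> [512, 1, 1024]
```

Clamping keeps `zeros_like` from creating chunks larger than the array, and the suffix guard is what makes rechunking purely additive: new `chunked_{suffix}` arrays appear next to the source rather than replacing it.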
