Skip to content

Commit e7c3ce3

Browse files
authored
Merge pull request #80 from bcdev/forman-78-simplify_slice_sources
Simplify using custom slice sources
2 parents 9bf3160 + 643fc07 commit e7c3ce3

25 files changed

+858
-891
lines changed

CHANGES.md

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,12 @@
44
target dataset. An existing target dataset (and its lock) will be
55
permanently deleted before appending of slice datasets begins. [#72]
66

7+
* Simplified writing of custom slice sources for users. The configuration setting
8+
`slice_source` can now be a `SliceSource` class or any function that returns a
9+
_slice item_: an `xarray.Dataset` object, a `SliceSource` object or
10+
local file path or URI of type `str` or `FileObj`.
11+
Dropped concept of _slice factories_ entirely. [#78]
12+
713
* Internal refactoring: Extracted `Config` class out of `Context` and
814
made available via new `Context.config: Config` property.
915
The change concerns any usages of the `ctx: Context` argument passed to

docs/api.md

Lines changed: 2 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -8,18 +8,6 @@ All described objects can be imported from the `zappend.api` module.
88
options:
99
show_root_heading: true
1010

11-
## Function `to_slice_factories()`
12-
13-
::: zappend.api.to_slice_factories
14-
options:
15-
show_root_heading: true
16-
17-
## Function `to_slice_factory()`
18-
19-
::: zappend.api.to_slice_factory
20-
options:
21-
show_root_heading: true
22-
2311
## Class `SliceSource`
2412

2513
::: zappend.api.SliceSource
@@ -38,15 +26,14 @@ All described objects can be imported from the `zappend.api` module.
3826

3927
## Types
4028

41-
::: zappend.api.SliceObj
29+
::: zappend.api.SliceItem
4230
options:
4331
show_root_heading: true
4432

45-
::: zappend.api.SliceFactory
33+
::: zappend.api.SliceCallable
4634
options:
4735
show_root_heading: true
4836

49-
5037
::: zappend.api.ConfigItem
5138
options:
5239
show_root_heading: true

docs/guide.md

Lines changed: 113 additions & 101 deletions
Original file line numberDiff line numberDiff line change
@@ -555,9 +555,38 @@ back. Therefore, a log message with warning level will be emitted if the
555555

556556
### Slice Datasets
557557

558-
If the slice paths passed to the `zappend` tool are given as URIs,
559-
additional storage options may be provided for the filesystem given by the
560-
URI's protocol. They may be specified using the `slice_storage_options` setting.
558+
A _slice dataset_ is the dataset that is appended for every slice item passed
559+
to `zappend`. Slice datasets can be provided in various ways.
560+
561+
* When using the [zappend CLI command](cli.md), slice items are passed as
562+
command arguments where they point to slice datasets by local file paths or URIs.
563+
564+
* When using the [zappend Python function](api.md), slice items are passed
565+
using the `slices` argument, which is a Python iterable. You can pass a list or tuple
566+
of slice items or provide a Python generator that provides the slice items.
567+
568+
Each slice item can be a local file path or URI of type `str` or `FileObj`,
569+
a dataset of type `xarray.Dataset`, or a `SliceSource` object explained in more detail
570+
below.
571+
572+
#### Paths and URIs
573+
574+
A slice item of type `str` is interpreted as local file path or URI, in the case
575+
the path has a protocol prefix, such as `s3://` as described above.
576+
577+
In the majority of `zappend` use cases the slice datasets to be appended to a target
578+
dataset are passed as local file paths or URIs. A slice URI starts with a protocol
579+
prefix, such as `s3://`, or `memory://`. Additional storage options may be required
580+
for the filesystem given by the URI's protocol. They may be specified using the
581+
`slice_storage_options` setting.
582+
583+
```json
584+
{
585+
"slice_storage_options": {
586+
"anon": true
587+
}
588+
}
589+
```
561590

562591
Sometimes, the slice dataset to be processed are not yet available, e.g.,
563592
because another process is currently generating them. For such cases, the
@@ -584,82 +613,29 @@ Or use default polling:
584613
}
585614
```
586615

587-
588-
### Slice Sources
589-
590-
A _slice source_ is an object that provides a slice dataset of type `xarray.Dataset`
591-
for given parameters of any type.
592-
593-
The optional `slice_source` configuration setting is used to specify a custom
594-
slice source. If not specified, `zappend` selects the slice source based on the type
595-
of a given slice object. These types are described in following subsections.
596-
597-
If given, the value of the `slice_source` setting is a class derived from
598-
`zappend.api.SliceSource`, or a function that creates an instance of
599-
`zappend.api.SliceSource`, or the fully qualified name of the aforementioned.
600-
In the case `slice_source` is given, the _slices_ argument passed to the CLI
601-
command and Python function become parameters to the specified class constructor
602-
or factory function.
603-
The individual slice items in the `SLICES` arguments of the `zappend` CLI
604-
command are of type `str`, typically interpreted as file paths or URIs.
605-
The individual slice items passed in the `slices` argument of the
606-
`zappend.api.zappend()` function can be of any type, but the `tuple`, `list`,
607-
and `dict` types have a special meaning:
608-
609-
* `tuple`: a pair of the form `(args, kwargs)`, where `args` is a list
610-
or tuple of positional arguments and `kwargs` is a dictionary of keyword
611-
arguments;
612-
* `list`: positional arguments only;
613-
* `dict`: keyword arguments only;
614-
* Any other type is interpreted as single positional argument.
615-
616-
In addition, your class constructor or factory function specified by `slice_source`
617-
may specify a positional or keyword argument named `ctx`, which will receive the
618-
current processing context of type `zappend.api.Context`.
619-
620-
If the `slice_source` setting is _not_ specified, the slice items passed as `slices`
621-
argument to the [`zappend`](api.md) Python function can be one of the types described
622-
in the following subsections.
623-
624-
#### Types `str` and `FileObj`
625-
626-
A slice object of type `str` is interpreted as local file path or URI, in the case
627-
the path has a protocol prefix, such as `s3://`.
628-
629616
An alternative to providing the slice dataset as path or URI is using the
630617
`zappend.api.FileObj` class, which combines a URI with dedicated filesystem
631618
storage options.
632619

633-
```python
634-
from zappend.api import FileObj
635-
636-
slice_obj = FileObj(slice_uri, storage_options=dict(...))
637-
```
638-
639-
#### Type `Dataset`
640-
641-
In-memory slice objects can be passed as [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html)
642-
objects. Such objects may originate from opening datasets from some storage
643-
644-
```python
645-
import xarray as xr
620+
#### Dataset Objects
646621

647-
slice_obj = xr.open_dataset(slice_store, ...)
648-
```
622+
You can also use dataset objects of type
623+
[`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html)
624+
as slice item. Such objects may originate from opening datasets from some storage, e.g.,
625+
`xarray.open_dataset(slice_store, ...)` or by composing, aggregating, resampling
626+
slice datasets from other datasets and data variables.
649627

650-
or by composing, aggregating, resampling slice datasets from other datasets and
651-
data variables. To allow for out-of-core computation of large datasets [Dask arrays](https://docs.dask.org/en/stable/array.html)
652-
are used by both `xarray` and `zarr`. As a dask array may represent complex and/or
628+
Chunked data arrays of an `xarray.Dataset` are usually instances of
629+
[Dask arrays](https://docs.dask.org/en/stable/array.html), to allow for out-of-core
630+
computation of large datasets. As a dask array may represent complex and/or
653631
expensive processing graphs, high CPU loads and memory consumption are common issues
654632
for computed slice datasets, especially if the specified target dataset chunking is
655633
different from the slice dataset chunking. This may cause Dask graphs to be
656634
computed multiple times if the source chunking overlaps multiple target chunks,
657635
potentially causing large resource overheads while recomputing and/or reloading the same
658-
source chunks multiple times.
659-
660-
In such cases it can help to "terminate" such computations for each slice by
661-
persisting the computed dataset first and then to reopen it. This can be specified
662-
using the `persist_mem_slice` setting:
636+
source chunks multiple times. In such cases it can help to "terminate" such
637+
computations for each slice by persisting the computed dataset first and then to
638+
reopen it. This can be specified using the `persist_mem_slice` setting:
663639

664640
```json
665641
{
@@ -671,34 +647,82 @@ If the flag is set, in-memory slices will be persisted to a temporary Zarr befor
671647
appending them to the target dataset. It may prevent expensive re-computation of chunks
672648
at the cost of additional i/o. It therefore defaults to `false`.
673649

674-
#### Type `SliceSource`
650+
#### Slice Sources
651+
652+
If you need some custom cleanup after a slice has been processed and appended to the
653+
target dataset, you can use an instance of `zappend.api.SliceSource` as slice item.
654+
A `SliceSource` class requires you to implement two methods:
675655

676-
Often you want to perform some custom cleanup after a slice has been processed and
677-
appended to the target dataset. In this case you can write your own
678-
`zappend.api.SliceSource` by implementing its `get_dataset()` and `dispose()`
679-
methods.
656+
* `get_dataset()` to return the slice dataset of type `xarray.Dataset`, and
657+
* `dispose()` to perform any resource cleanup tasks.
680658

681-
Slice source instances are supposed to be created by _slice factories_, see
682-
subsection below.
659+
Here is the template code for your own slice source implementation:
683660

684-
#### Type `SliceFactory`
661+
```python
662+
import xarray as xr
663+
from zappend.api import SliceSource
685664

686-
A slice factory is a 1-argument function that receives a processing context of type
687-
`zappend.api.Context` and yields a slice dataset object of one of the types
688-
described above. Since a slice factory cannot have additional arguments, it is
689-
normally defined as a [closure](https://en.wikipedia.org/wiki/Closure_(computer_programming))
690-
to capture slice-specific information.
665+
class MySliceSource(SliceSource):
666+
# Pass any positional and keyword arguments that you need
667+
# to the constructor. `path` is just an example.
668+
def __init__(self, path: str):
669+
self.path = path
670+
self.ds = None
671+
672+
def get_dataset(self) -> xr.Dataset:
673+
# Write code here that obtains the dataset.
674+
self.ds = xr.open_dataset(self.path)
675+
# You can put any processing here.
676+
return self.ds
677+
678+
def dispose(self):
679+
# Write code here that performs cleanup.
680+
if self.ds is not None:
681+
self.ds.close()
682+
self.ds = None
683+
```
691684

692-
In the following example, the actual slice dataset is computed from averaging another
693-
dataset. A `SliceSource` is used to close the datasets after the slice has been
694-
processed. Slice factories are created from the custom slice source and the slice paths
695-
using the utility function [to_slice_factories()][zappend.slice.factory.to_slice_factories]:
685+
Instead of providing instances of `SliceSource` directly as a slice item, it is often
686+
easier to pass your `SliceSource` class and let `zappend` pass the slice item as
687+
arguments(s) to your `SliceSource`'s constructor. This can be achieved using the
688+
the `slice_source` configuration setting.
689+
690+
```json
691+
{
692+
"slice_source": "mymodule.MySliceSource"
693+
}
694+
```
695+
696+
The `slice_source` setting can actually be **any Python function** that returns a
697+
valid slice item as described above.
698+
699+
If a slice source is configured, each slice item passed to `zappend` is passed as
700+
argument to your slice source.
701+
702+
* Slices passed to the `zappend` CLI command become slice source arguments
703+
of type `str`.
704+
* Slice items passed to the `zappend` function via the `slices` argument can be of
705+
any type, but the `tuple`, `list`, and `dict` types have a special meaning:
706+
707+
- `tuple`: a pair of the form `(args, kwargs)`, where `args` is a list
708+
or tuple of positional arguments and `kwargs` is a dictionary of keyword
709+
arguments;
710+
- `list`: positional arguments only;
711+
- `dict`: keyword arguments only;
712+
- Any other type is interpreted as single positional argument.
713+
714+
In addition, your slice source function or class constructor specified by `slice_source`
715+
may define a 1st positional argument or keyword argument named `ctx`,
716+
which will receive the current processing context of type `zappend.api.Context`.
717+
This can be useful if you need to read configuration settings.
718+
719+
Here is a more advanced example of a slice source that opens datasets from a given
720+
file path and averages the values first along the time dimension:
696721

697722
```python
698723
import numpy as np
699724
import xarray as xr
700725
from zappend.api import SliceSource
701-
from zappend.api import to_slice_factories
702726
from zappend.api import zappend
703727

704728
def get_mean_time(slice_ds: xr.Dataset) -> xr.DataArray:
@@ -711,37 +735,25 @@ def get_mean_time(slice_ds: xr.Dataset) -> xr.DataArray:
711735

712736
def get_mean_slice(slice_ds: xr.Dataset) -> xr.Dataset:
713737
mean_slice_ds = slice_ds.mean("time")
738+
# Re-introduce time dimension of size one
714739
mean_slice_ds = mean_slice_ds.expand_dims("time", axis=0)
715740
mean_slice_ds.coords["time"] = get_mean_time(slice_ds)
716741
return mean_slice_ds
717742

718743
class MySliceSource(SliceSource):
719-
def __init__(self, ctx, slice_path):
720-
super().__init__(ctx)
744+
def __init__(self, slice_path):
721745
self.slice_path = slice_path
722746
self.ds = None
723-
self.mean_ds = None
724747

725748
def get_dataset(self):
726749
self.ds = xr.open_dataset(self.slice_path)
727-
self.mean_ds = get_mean_slice(self.ds)
728-
return self.mean_ds
750+
return get_mean_slice(self.ds)
729751

730752
def dispose(self):
731753
if self.ds is not None:
732754
self.ds.close()
733755
self.ds = None
734-
if self.mean_ds is not None:
735-
self.mean_ds.close()
736-
self.mean_ds = None
737-
738-
zappend(to_slice_factories(MySliceSource, ["slice-1.nc", "slice-2.nc", "slice-3.nc"]),
739-
target_dir="target.zarr")
740-
```
741-
742-
Note, the above example can be simplified by using the `slice_source` setting directly:
743756

744-
```python
745757
zappend(["slice-1.nc", "slice-2.nc", "slice-3.nc"],
746758
target_dir="target.zarr",
747759
slice_source=MySliceSource)

tests/config/test_config.py

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -169,12 +169,15 @@ def new_custom_slice_source(ctx: Context, index: int):
169169

170170
class CustomSliceSource(SliceSource):
171171
def __init__(self, ctx: Context, index: int):
172-
super().__init__(ctx)
172+
self.ctx = ctx
173173
self.index = index
174174

175175
def get_dataset(self) -> xr.Dataset:
176176
return make_test_dataset(index=self.index)
177177

178+
def dispose(self):
179+
pass
180+
178181
@staticmethod
179182
def new1(ctx: Context, index: int):
180183
return CustomSliceSource(ctx, index)

tests/slice/__init__.py

Whitespace-only changes.

0 commit comments

Comments
 (0)