Skip to content

Commit 4a6338a

Browse files
authored
Merge pull request #86 from bcdev/forman-82-config_setting_extra
New configuration setting `extra`
2 parents db3ab06 + 2930bfe commit 4a6338a

File tree

8 files changed

+168
-98
lines changed

8 files changed

+168
-98
lines changed

CHANGES.md

Lines changed: 6 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
## Version 0.7.0 (in development)
22

3-
* Made writing custom slice sources easier: (#82)
3+
* Made writing custom slice sources easier and more flexible: (#82)
44

55
- Slice items can now be a `contextlib.AbstractContextManager`
66
so custom slice functions can now be used with
@@ -11,9 +11,12 @@
1111
is applicable. Deprecated `SliceSource.dispose()`.
1212

1313
- Introduced new optional configuration setting `slice_source_kwargs` that
14-
contains keyword-arguments, which are passed to a configured `slice_source` together with
15-
each slice item.
14+
contains keyword-arguments, which are passed to a configured `slice_source`
15+
together with each slice item.
1616

17+
- Introduced optional configuration setting `extra` that holds additional
18+
configuration not validated by default. Intended use is by a `slice_source` that
19+
expects an argument named `ctx` and therefore can access the configuration.
1720

1821
## Version 0.6.0 (from 2024-03-12)
1922

docs/config.md

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -195,6 +195,11 @@ Options for the filesystem given by the URI of `target_dir`.
195195
Type _string_.
196196
The fully qualified name of a class or function that receives a slice item as argument(s) and provides the slice dataset. If a class is given, it must be derived from `zappend.api.SliceSource`. If the function is a context manager, it must yield an `xarray.Dataset`. If a plain function is given, it must return any valid slice item type. Refer to the user guide for more information.
197197

198+
## `slice_source_kwargs`
199+
200+
Type _object_.
201+
Extra keyword-arguments passed to a configured `slice_source` together with each slice item.
202+
198203
## `slice_engine`
199204

200205
Type _string_.
@@ -408,3 +413,8 @@ Type _boolean_.
408413
If `true`, log only what would have been done, but don't apply any changes.
409414
Defaults to `false`.
410415

416+
## `extra`
417+
418+
Type _object_.
419+
Extra settings. Intended use is by a `slice_source` that expects an argument named `ctx` to access the extra settings and other configuration.
420+

docs/guide.md

Lines changed: 113 additions & 89 deletions
Original file line numberDiff line numberDiff line change
@@ -666,7 +666,13 @@ You can also use dataset objects of type
666666
[`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html)
667667
as slice item. Such objects may originate from opening datasets from some storage, e.g.,
668668
`xarray.open_dataset(slice_store, ...)` or by composing, aggregating, resampling
669-
slice datasets from other datasets and data variables.
669+
slice datasets from other datasets and dataset variables.
670+
671+
!!! note "Datasets are not closed automatically"
672+
If you pass [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html) objects to `zappend` they will not be
673+
automatically closed. This may become be issue, if you have many datasets
674+
and each one binds resources such as open file handles. Consider using a
675+
_slice source_ then, see below.
670676

671677
Chunked data arrays of an `xarray.Dataset` are usually instances of
672678
[Dask arrays](https://docs.dask.org/en/stable/array.html), to allow for out-of-core
@@ -676,9 +682,9 @@ for computed slice datasets, especially if the specified target dataset chunking
676682
different from the slice dataset chunking. This may cause Dask graphs to be
677683
computed multiple times if the source chunking overlaps multiple target chunks,
678684
potentially causing large resource overheads while recomputing and/or reloading the same
679-
source chunks multiple times. In such cases it can help to "terminate"
680-
computations for each slice by persisting the computed dataset first and then to
681-
reopen it. This can be specified using the `persist_mem_slice` setting:
685+
source chunks multiple times. In such cases it can help to "terminate" computations for
686+
each slice by persisting the computed dataset first and then to reopen it. This can be
687+
specified using the `persist_mem_slice` setting:
682688

683689
```json
684690
{
@@ -692,61 +698,96 @@ at the cost of additional i/o. It therefore defaults to `false`.
692698

693699
#### Slice Sources
694700

695-
If you need some custom cleanup after a slice has been processed and appended to the
696-
target dataset, you can use instances of `zappend.api.SliceSource` as slice items.
697-
The `SliceSource` methods with special meaning are:
701+
A slice source gives you full control about how a slice dataset is created, loaded,
702+
or generated and how its bound resources, if any, are released. In its simplest form,
703+
a slice source is a plain Python function that can take any arguments and returns
704+
an `xarray.Dataset`:
705+
706+
```python
707+
import xarray as xr
708+
709+
# Slice source argument `path` is just an example.
710+
def get_dataset(path: str) -> xr.Dataset:
711+
# Provide dataset here. No matter how, e.g.:
712+
return xr.open_dataset(path)
713+
```
714+
715+
If you need cleanup code that is executed after the slice dataset has been appended,
716+
you can turn your slice source function into a
717+
[context manager](https://docs.python.org/3/library/contextlib.html)
718+
(new in zappend v0.7):
719+
720+
```python
721+
from contextlib import contextmanager
722+
import xarray as xr
723+
724+
# Slice source argument `path` is just an example.
725+
@contextmanager
726+
def get_dataset(path: str) -> xr.Dataset:
727+
# Bind any resources and provide dataset here, e.g.:
728+
dataset = xr.open_dataset(path)
729+
try:
730+
# Yield (not return!) the dataset
731+
yield dataset
732+
finally:
733+
# Cleanup code here, release any bound resources, e.g.:
734+
dataset.close()
735+
```
736+
737+
You can also implement your slice source as a class derived from the abstract
738+
`zappend.api.SliceSource` class. Its interface methods are:
698739

699740
* `get_dataset()`: a zero-argument method that returns the slice dataset of type
700741
`xarray.Dataset`. You must implement this abstract method.
701-
* `close()`: perform any resource cleanup tasks
742+
* `close()`: Optional method. Put your cleanup code here.
702743
(in zappend < v0.7, the `close` method was called `dispose`).
703-
* `__init__()`: optional constructor that receives any arguments passed to the
704-
slice source.
705744

706-
Here is the template code for your own slice source implementation:
707745

708746
```python
709747
import xarray as xr
710748
from zappend.api import SliceSource
711749

712750
class MySliceSource(SliceSource):
713-
# Pass any positional and keyword arguments that you need
714-
# to the constructor. `path` is just an example.
751+
# Slice source argument `path` is just an example.
715752
def __init__(self, path: str):
716753
self.path = path
717-
self.ds = None
754+
self.dataset = None
718755

719756
def get_dataset(self) -> xr.Dataset:
720-
# Write code here that obtains the dataset.
721-
self.ds = xr.open_dataset(self.path)
722-
# You can put any processing here.
723-
return self.ds
757+
# Bind any resources and provide dataset here, e.g.:
758+
self.dataset = xr.open_dataset(self.path)
759+
return self.dataset
724760

725761
def close(self):
726-
# Write code here that performs cleanup.
727-
if self.ds is not None:
728-
self.ds.close()
729-
self.ds = None
762+
# Cleanup code here, release any bound resources, e.g.:
763+
if self.dataset is not None:
764+
self.dataset.close()
730765
```
731766

732-
Instead of providing instances of `SliceSource` as slice items, it is often
733-
easier to pass your `SliceSource` class and let `zappend` pass the slice item as
734-
arguments(s) to your `SliceSource`'s constructor. This can be achieved using the
735-
the `slice_source` configuration setting. If you need to access configuration
736-
settings, it is even required to use the `slice_source` setting.
767+
You may prefer implementing a class because your slice source is complex and you want
768+
to split its logic into separate methods. You may also just prefer classes as a matter
769+
of your personal taste. Another advantage of using a class is that you can pass
770+
instances of it as slice items to the `zappend` function without further configuration.
771+
However, the intended use of a slice source is to configure it by specifying the
772+
`slice_source` setting. In a JSON or YAML configuration file it specifies the fully
773+
qualified name of the slice source function or class:
737774

738775
```json
739776
{
740777
"slice_source": "mymodule.MySliceSource"
741778
}
742779
```
743780

744-
The `slice_source` setting can actually be **any Python function** that returns a
745-
valid slice item as described above such as a file path or URI, or
746-
an `xarray.Dataset`.
781+
If you use the `zappend` function, you can pass the function or class directly:
782+
783+
```python
784+
zappend(["slice-1.nc", "slice-2.nc", "slice-3.nc"],
785+
target_dir="target.zarr",
786+
slice_source=MySliceSource)
787+
```
747788

748-
If a slice source is configured, each slice item passed to `zappend` is passed as
749-
argument to your slice source.
789+
If the slice source setting is used, each slice item passed to `zappend` is passed as
790+
argument(s) to your slice source.
750791

751792
* Slices passed to the `zappend` CLI command become slice source arguments
752793
of type `str`.
@@ -762,90 +803,73 @@ argument to your slice source.
762803

763804
You can also pass extra keyword arguments to your slice source using the
764805
`slice_source_kwargs` setting. Keyword arguments passed as slice items take
765-
precedence, that is, they overwrite arguments passed by `slice_source_kawrgs`.
806+
precedence, that is, they overwrite arguments passed by `slice_source_kwargs`.
807+
808+
If your slice source has many parameters that stay the same for all slices you may
809+
prefer providing parameters as configuration settings, rather than function or class
810+
arguments. This can be achieved using the `extra` setting:
766811

767-
In addition, your slice source function or class constructor specified by
768-
`slice_source` may define a 1st positional argument or keyword argument
769-
named `ctx`, which will receive the current processing context of type
770-
`zappend.api.Context`. This can be useful if you need to read configuration
771-
settings.
812+
```json
813+
{
814+
"extra": {
815+
"quantiles": [0.1, 0.5, 0.9],
816+
"use_threshold": true,
817+
"filter": "gauss"
818+
}
819+
}
820+
```
821+
822+
To access the settings in `extra` your slice source function or class constructor
823+
must define a special argument named `ctx`. It must be a 1ˢᵗ positional argument or
824+
a keyword argument. The argument `ctx` is the current processing context of type
825+
`zappend.api.Context` that also contains the configuration. The settings in `extra`
826+
can be accessed using the dictionary returned from `ctx.config.extra`.
772827

773828
Here is a more advanced example of a slice source that opens datasets from a given
774829
file path and averages the values along the time dimension:
775830

776831
```python
777832
import numpy as np
778833
import xarray as xr
834+
from zappend.api import Context
779835
from zappend.api import SliceSource
780836
from zappend.api import zappend
781837

782-
def get_mean_time(slice_ds: xr.Dataset) -> xr.DataArray:
783-
time = slice_ds.time
784-
t0 = time[0]
785-
dt = time[-1] - t0
786-
return xr.DataArray(np.array([t0 + dt / 2],
787-
dtype=slice_ds.time.dtype),
788-
dims="time")
789-
790-
def get_mean_slice(slice_ds: xr.Dataset) -> xr.Dataset:
791-
mean_slice_ds = slice_ds.mean("time")
792-
# Re-introduce time dimension of size one
793-
mean_slice_ds = mean_slice_ds.expand_dims("time", axis=0)
794-
mean_slice_ds.coords["time"] = get_mean_time(slice_ds)
795-
return mean_slice_ds
796-
797838
class MySliceSource(SliceSource):
798-
def __init__(self, slice_path):
839+
def __init__(self, ctx: Context, slice_path: str):
840+
self.quantiles = ctx.config.extra.get("quantiles", [0.5])
799841
self.slice_path = slice_path
800842
self.ds = None
801843

802844
def get_dataset(self):
803845
self.ds = xr.open_dataset(self.slice_path)
804-
return get_mean_slice(self.ds)
846+
return self.get_agg_slice(self.ds)
805847

806848
def close(self):
807849
if self.ds is not None:
808850
self.ds.close()
809-
self.ds = None
810851

852+
def get_agg_slice(self, slice_ds: xr.Dataset) -> xr.Dataset:
853+
agg_slice_ds = slice_ds.quantile(self.quantiles, dim="time")
854+
# Re-introduce time dimension of size one
855+
agg_slice_ds = agg_slice_ds.expand_dims("time", axis=0)
856+
agg_slice_ds.coords["time"] = self.get_mean_time(slice_ds)
857+
return agg_slice_ds
858+
859+
@classmethod
860+
def get_mean_time(cls, slice_ds: xr.Dataset) -> xr.DataArray:
861+
time = slice_ds.time
862+
t0 = time[0]
863+
dt = time[-1] - t0
864+
return xr.DataArray(np.array([t0 + dt / 2],
865+
dtype=slice_ds.time.dtype),
866+
dims="time")
867+
811868
zappend(["slice-1.nc", "slice-2.nc", "slice-3.nc"],
812869
target_dir="target.zarr",
813870
slice_source=MySliceSource)
814871
```
815872

816-
Since zappend 0.7, a slice source can also be written as a Python
817-
[context manager](https://docs.python.org/3/library/contextlib.html),
818-
which allows you implementing the `get_dataset()` and `close()`
819-
methods in one single function, instead of a class. Here is the above example
820-
written as context manager.
821-
822-
```python
823-
from contextlib import contextmanager
824-
import numpy as np
825-
import xarray as xr
826-
from zappend.api import zappend
827-
828-
# Same as above here
829-
830-
@contextmanager
831-
def get_slice_dataset(slice_path):
832-
# allocate resources here
833-
ds = xr.open_dataset(slice_path)
834-
mean_ds = get_mean_slice(ds)
835-
try:
836-
# yield (!) the slice dataset
837-
# so it can be appended
838-
yield mean_ds
839-
finally:
840-
# after slice dataset has been appended
841-
# release resources here
842-
ds.close()
843-
844-
zappend(["slice-1.nc", "slice-2.nc", "slice-3.nc"],
845-
target_dir="target.zarr",
846-
slice_source=get_slice_dataset)
847-
```
848-
849873
## Profiling
850874

851875
Runtime profiling is very important for understanding program runtime behavior

tests/config/test_config.py

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -180,6 +180,22 @@ def test_slice_source_kwargs(self):
180180
{"a": 1, "b": True, "c": "nearest"}, config.slice_source_kwargs
181181
)
182182

183+
def test_extra(self):
184+
config = Config(
185+
{
186+
"target_dir": "memory://target.zarr",
187+
}
188+
)
189+
self.assertEqual({}, config.extra)
190+
191+
config = Config(
192+
{
193+
"target_dir": "memory://target.zarr",
194+
"extra": {"a": 1, "b": True, "c": "nearest"},
195+
}
196+
)
197+
self.assertEqual({"a": 1, "b": True, "c": "nearest"}, config.extra)
198+
183199

184200
def new_custom_slice_source(ctx: Context, index: int):
185201
return CustomSliceSource(ctx, index)

tests/config/test_schema.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,7 @@ def test_get_config_schema(self):
2121
"disable_rollback",
2222
"dry_run",
2323
"excluded_variables",
24+
"extra",
2425
"force_new",
2526
"fixed_dims",
2627
"included_variables",

zappend/config/config.py

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -188,6 +188,14 @@ def dry_run(self) -> bool:
188188
"""Whether to run in dry mode."""
189189
return bool(self._config.get("dry_run"))
190190

191+
@property
192+
def extra(self) -> dict[str, Any]:
193+
"""Extra settings.
194+
Intended use is by a `slice_source` that expects an argument
195+
named `ctx` to access the extra settings and other configuration.
196+
"""
197+
return self._config.get("extra") or {}
198+
191199
@property
192200
def logging(self) -> dict[str, Any] | str | bool | None:
193201
"""Logging configuration."""

zappend/config/schema.py

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -726,6 +726,15 @@
726726
"type": "boolean",
727727
"default": False,
728728
},
729+
extra={
730+
"description": (
731+
"Extra settings."
732+
" Intended use is by a `slice_source` that expects an argument"
733+
" named `ctx` to access the extra settings and other configuration."
734+
),
735+
"type": "object",
736+
"additionalProperties": True,
737+
},
729738
),
730739
"additionalProperties": False,
731740
}

0 commit comments

Comments
 (0)